Old python version causes silent error in cluster setup #4279

@arbrown

Description

Describe the bug

A match statement in a setup script causes setup to fail in the schedmd-slurm-gcp-v6-controller module.

Steps to reproduce

I followed this tutorial and the Slurm daemon failed to start. The logs indicated a Python syntax error in cluster-toolkit/community/modules/scheduler/schedmd-slurm-gcp-v6-controller/modules/slurm_files/scripts/util.py.

I believe the image for this node only has Python 3.9, while a recent update to the util script uses the match statement, a Python 3.10+ feature, so the script fails to parse and setup silently aborts.
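
For context, here is a minimal sketch (hypothetical code, not the actual util.py contents) of why this is fatal on Python 3.9: the match statement is new syntax in Python 3.10, so a 3.9 interpreter raises SyntaxError while compiling the file, before any line of the script runs. Because the failure happens at parse time, none of the script's own error handling or logging gets a chance to run, which is why setup appears to fail silently.

# Hypothetical illustration, not the real util.py logic.
# The match statement below parses only on Python 3.10+; Python 3.9 fails
# with SyntaxError at compile time, so the whole module never loads.
import sys

def classify(status: str) -> str:
    match status:  # SyntaxError on Python 3.9
        case "RUNNING":
            return "up"
        case "POWERED_DOWN":
            return "down"
        case _:
            return "unknown"

# A Python 3.9-compatible rewrite expresses the same dispatch with if/elif:
def classify_py39(status: str) -> str:
    if status == "RUNNING":
        return "up"
    elif status == "POWERED_DOWN":
        return "down"
    return "unknown"

if __name__ == "__main__":
    print(sys.version_info)
    print(classify("RUNNING"), classify_py39("RUNNING"))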

Expected behavior

I expected a running Slurm cluster.

Actual behavior

A *** Slurm instance has not been set up yet... *** message on the login node.
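
To confirm the hypothesis, a quick check (assuming you can SSH to the controller node) is to ask the system interpreter that runs the slurm_files scripts for its version:

# check_py_version.py (hypothetical helper): run with the controller's system
# python3. If the second line prints True, the interpreter predates 3.10 and
# cannot parse a match statement in util.py.
import sys

print(sys.version)
print(sys.version_info < (3, 10))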

Version (gcluster --version)

gcluster version - not built from official release
Built from 'main' branch.
Commit info: v1.54.0-1-g9f697453a
Terraform version: 1.13.0-dev

Blueprint

# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

---

blueprint_name: hpc-slurm-llama

vars:
  project_id:  ## Set GCP Project ID Here ##
  bucket_model: ## Set your bucket name prefix here ##
  deployment_name: hpc-slurm-llama2
  region: us-central1
  zone: us-central1-a
  zone_list: [us-central1-a, us-central1-b, us-central1-c]
  new_image_family: llama2-slurm-v6
  instance_image_custom: true
  disk_size_gb: 200


# Documentation for each of the modules used below can be found at
# https://github.com/GoogleCloudPlatform/hpc-toolkit/blob/main/modules/README.md

deployment_groups:
- group: enable_apis
  modules:

  - id: enable_apis
    source: community/modules/project/service-enablement
    settings:
      gcp_service_list: [
        "cloudresourcemanager.googleapis.com",
        "stackdriver.googleapis.com",
        "iam.googleapis.com",
        "logging.googleapis.com",
        "compute.googleapis.com"
      ]
- group: setup
  modules:

  ## Monitoring
  - id: hpc_dash
    source: modules/monitoring/dashboard
    settings:
      title: HPC
  - id: gpu_dash
    source: modules/monitoring/dashboard
    settings:
      title: GPU
      base_dashboard: Empty
      widgets:
      - |
          {
            "title": "GPU Memory Utilization",
            "xyChart": {
              "dataSets": [
                {
                  "timeSeriesQuery": {
                    "timeSeriesFilter": {
                      "filter": "metric.type=\"agent.googleapis.com/gpu/memory/bytes_used\" resource.type=\"gce_instance\"",
                      "aggregation": {
                        "perSeriesAligner": "ALIGN_MEAN",
                        "crossSeriesReducer": "REDUCE_NONE",
                        "groupByFields": []
                      }
                    }
                  },
                  "plotType": "LINE",
                  "targetAxis": "Y1",
                  "minAlignmentPeriod": "60s"
                }
              ],
              "chartOptions": {
                "mode": "COLOR",
                "displayHorizontal": false
              },
              "thresholds": [],
              "yAxis": {
                "scale": "LINEAR"
              }
            }
          }
      - |
          {
            "title": "GPU Utilization",
            "xyChart": {
              "dataSets": [
                {
                  "timeSeriesQuery": {
                    "prometheusQuery": "avg_over_time(agent_googleapis_com:gpu_utilization{monitored_resource=\"gce_instance\"}[${__interval}])"
                  },
                  "plotType": "LINE",
                  "targetAxis": "Y1"
                }
              ],
              "chartOptions": {
                "mode": "COLOR",
                "displayHorizontal": false
              },
              "thresholds": [],
              "yAxis": {
                "scale": "LINEAR"
              }
            }
          }

  ## network
  - id: network1
    source: modules/network/vpc

  ## Filesystems
  - id: homefs
    source: community/modules/file-system/nfs-server
    use: [network1]
    settings:
      local_mounts: [/home]
      disk_size: 2560
      instance_image:  
        project: "cloud-hpc-image-public"
        family:  "hpc-rocky-linux-8"

  - id: data_bucket
    source: community/modules/file-system/cloud-storage-bucket
    settings:
      name_prefix: $(vars.bucket_model)
      random_suffix: true
      force_destroy: true
      local_mount: /data_bucket
      mount_options: defaults,_netdev,implicit_dirs,allow_other,dir_mode=0777,file_mode=766

  - id: move_files
    source: ./files
    use: [data_bucket]


  ## Install Scripts

  - id: packer_script
    # configure conda environment for llama
    source: modules/scripts/startup-script
    settings:
      runners:
      - type: shell
        destination: install-ml-libraries.sh
        content: |
          #!/bin/bash
          # this script is designed to execute on Slurm images published by SchedMD that:
          # - are based on Debian 11 distribution of Linux
          # - have NVIDIA Drivers v530 pre-installed
          # - have CUDA Toolkit 12.1 pre-installed.

          set -e -o pipefail


          CONDA_BASE=/opt/conda

          if [ -d $CONDA_BASE ]; then
                  exit 0
          fi

          DL_DIR=\$(mktemp -d)
          cd $DL_DIR
          curl -O https://repo.anaconda.com/miniconda/Miniconda3-py310_23.3.1-0-Linux-x86_64.sh
          HOME=$DL_DIR bash Miniconda3-py310_23.3.1-0-Linux-x86_64.sh -b -p $CONDA_BASE
          cd -
          rm -rf $DL_DIR
          unset DL_DIR

          tee /tmp/llama2_env.yml << EOLL
          name: llama2
          channels:
            - conda-forge
            - nvidia
            - nvidia/label/cuda-12.4.0
          dependencies:
            - appdirs
            - loralib
            - black
            - black-jupyter
            - py7zr
            - scipy
            - optimum
            - datasets
            - accelerate
            - peft
            - fairscale
            - fire
            - sentencepiece
            - transformers
            - huggingface_hub
            - git
            - pip
            - pip:
              - bitsandbytes
              - nvidia-cudnn-cu12
              - dataclasses
              - nvidia-nccl-cu12
              - trl
              - torch
              - torchaudio 
              - torchvision
              - nvitop
          EOLL

          source $CONDA_BASE/bin/activate base
          conda env create -n llama2 --file /tmp/llama2_env.yml

  - id: startup_script
    source: modules/scripts/startup-script
    settings:
      install_cloud_ops_agent: false
      runners:
      - type: shell
        destination: startup-script.sh
        content: |
          #!/bin/bash
          CONDA_BASE=/opt/conda
          source $CONDA_BASE/bin/activate base
          conda init --system

          # UnInstall Stackdriver Agent

          sudo systemctl stop stackdriver-agent.service
          sudo systemctl disable stackdriver-agent.service
          curl -sSO https://dl.google.com/cloudagents/add-monitoring-agent-repo.sh
          sudo dpkg --configure -a
          sudo bash add-monitoring-agent-repo.sh --uninstall
          sudo bash add-monitoring-agent-repo.sh --remove-repo

          # Install ops-agent

          curl -sSO https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.sh
          sudo bash add-google-cloud-ops-agent-repo.sh --also-install
          sudo service google-cloud-ops-agent start
- group: packer
  modules:
  - id: custom-image
    source: modules/packer/custom-image
    kind: packer
    use:
    - network1
    - packer_script
    settings:
      source_image_project_id: [schedmd-slurm-public]
      source_image_family: slurm-gcp-6-6-debian-11
      disk_size: $(vars.disk_size_gb)
      image_family: $(vars.new_image_family)
      machine_type: c2-standard-8 # building this image does not require a GPU-enabled VM
      state_timeout: 30m


- group: cluster
  modules:

  - id: n1t4_nodeset
    source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
    use: [network1]
    settings:
      zones: $(vars.zone_list)
      node_count_dynamic_max: 1
      bandwidth_tier: gvnic_enabled
      disk_size_gb: $(vars.disk_size_gb)
      enable_public_ips: true
      enable_placement: false
      advanced_machine_features:
        threads_per_core: 1
      machine_type: n1-standard-96
      guest_accelerator:
      - type: nvidia-tesla-t4
        count: 4

      on_host_maintenance: TERMINATE
      instance_image:
        family: $(vars.new_image_family)
        project: $(vars.project_id)
  - id: n1t4_partition
    source: community/modules/compute/schedmd-slurm-gcp-v6-partition
    use: [n1t4_nodeset]
    settings:
      partition_name: n1t4
      is_default: true
      exclusive: false


  - id: n2_nodeset
    source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
    use: [network1]
    settings:
      zones: $(vars.zone_list)
      node_count_dynamic_max: 1
      bandwidth_tier: gvnic_enabled
      disk_size_gb: $(vars.disk_size_gb)
      enable_public_ips: true
      advanced_machine_features:
        threads_per_core: 1
      machine_type: n2-standard-4
      on_host_maintenance: TERMINATE
      instance_image:
        family: $(vars.new_image_family)
        project: $(vars.project_id)
  - id: n2_partition
    source: community/modules/compute/schedmd-slurm-gcp-v6-partition
    use: [n2_nodeset]
    settings:
      partition_name: n2
      is_default: true

  - id: g2_nodeset
    source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
    use: [network1]
    settings:
      zones: $(vars.zone_list)
      node_count_dynamic_max: 1
      bandwidth_tier: gvnic_enabled
      disk_size_gb: $(vars.disk_size_gb)
      enable_public_ips: true
      advanced_machine_features:
        threads_per_core: 1
      machine_type: g2-standard-96
      on_host_maintenance: TERMINATE
      instance_image:
        family: $(vars.new_image_family)
        project: $(vars.project_id)
  - id: g2_partition
    source: community/modules/compute/schedmd-slurm-gcp-v6-partition
    use: [g2_nodeset]
    settings:
      partition_name: g2gpu8
      is_default: false

  - id: slurm_login
    source: community/modules/scheduler/schedmd-slurm-gcp-v6-login
    use: [network1]
    settings:
      name_prefix: login
      machine_type: n2-standard-4
      enable_login_public_ips: true
      instance_image: 
        family: $(vars.new_image_family)
        project: $(vars.project_id)
        

  - id: slurm_controller
    source: community/modules/scheduler/schedmd-slurm-gcp-v6-controller
    use:
    - network1
    - n1t4_partition
    - n2_partition
    - g2_partition
    - slurm_login
    - homefs
    - data_bucket
    settings:
      enable_controller_public_ips: true
      controller_startup_script: $(startup_script.startup_script)
      controller_startup_scripts_timeout: 21600
      login_startup_script: $(startup_script.startup_script)
      login_startup_scripts_timeout: 21600
      instance_image: 
        family: $(vars.new_image_family)
        project: $(vars.project_id)

Expanded Blueprint

If applicable, please attach or paste the expanded blueprint. The expanded blueprint can be obtained by running gcluster expand your-blueprint.yaml.

Disregard if the bug occurs when running gcluster expand ... as well.

# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

blueprint_name: hpc-slurm-llama
ghpc_version: v1.54.0-1-g9f697453a-dirty
vars:
  bucket_model: llama2
  deployment_name: hpc-slurm-llama2
  disk_size_gb: 200
  instance_image_custom: true
  labels:
    ghpc_blueprint: hpc-slurm-llama
    ghpc_deployment: ((var.deployment_name))
  new_image_family: llama2-slurm-v6
  project_id: drewbr-sandbox-69
  region: us-central1
  zone: us-central1-a
  zone_list:
    - us-central1-a
    - us-central1-b
    - us-central1-c
deployment_groups:
  - group: enable_apis
    terraform_providers:
      google:
        source: hashicorp/google
        version: ~> 6.38.0
        configuration:
          project: ((var.project_id))
          region: ((var.region))
          zone: ((var.zone))
      google-beta:
        source: hashicorp/google-beta
        version: ~> 6.38.0
        configuration:
          project: ((var.project_id))
          region: ((var.region))
          zone: ((var.zone))
    modules:
      - source: community/modules/project/service-enablement
        kind: terraform
        id: enable_apis
        settings:
          gcp_service_list:
            - cloudresourcemanager.googleapis.com
            - stackdriver.googleapis.com
            - iam.googleapis.com
            - logging.googleapis.com
            - compute.googleapis.com
          project_id: ((var.project_id))
  - group: setup
    terraform_providers:
      google:
        source: hashicorp/google
        version: ~> 6.38.0
        configuration:
          project: ((var.project_id))
          region: ((var.region))
          zone: ((var.zone))
      google-beta:
        source: hashicorp/google-beta
        version: ~> 6.38.0
        configuration:
          project: ((var.project_id))
          region: ((var.region))
          zone: ((var.zone))
    modules:
      - source: modules/monitoring/dashboard
        kind: terraform
        id: hpc_dash
        settings:
          deployment_name: ((var.deployment_name))
          labels: ((var.labels))
          project_id: ((var.project_id))
          title: HPC
      - source: modules/monitoring/dashboard
        kind: terraform
        id: gpu_dash
        settings:
          base_dashboard: Empty
          deployment_name: ((var.deployment_name))
          labels: ((var.labels))
          project_id: ((var.project_id))
          title: GPU
          widgets:
            - |
              {
                "title": "GPU Memory Utilization",
                "xyChart": {
                  "dataSets": [
                    {
                      "timeSeriesQuery": {
                        "timeSeriesFilter": {
                          "filter": "metric.type=\"agent.googleapis.com/gpu/memory/bytes_used\" resource.type=\"gce_instance\"",
                          "aggregation": {
                            "perSeriesAligner": "ALIGN_MEAN",
                            "crossSeriesReducer": "REDUCE_NONE",
                            "groupByFields": []
                          }
                        }
                      },
                      "plotType": "LINE",
                      "targetAxis": "Y1",
                      "minAlignmentPeriod": "60s"
                    }
                  ],
                  "chartOptions": {
                    "mode": "COLOR",
                    "displayHorizontal": false
                  },
                  "thresholds": [],
                  "yAxis": {
                    "scale": "LINEAR"
                  }
                }
              }
            - |
              {
                "title": "GPU Utilization",
                "xyChart": {
                  "dataSets": [
                    {
                      "timeSeriesQuery": {
                        "prometheusQuery": "avg_over_time(agent_googleapis_com:gpu_utilization{monitored_resource=\"gce_instance\"}[${__interval}])"
                      },
                      "plotType": "LINE",
                      "targetAxis": "Y1"
                    }
                  ],
                  "chartOptions": {
                    "mode": "COLOR",
                    "displayHorizontal": false
                  },
                  "thresholds": [],
                  "yAxis": {
                    "scale": "LINEAR"
                  }
                }
              }
      - source: modules/network/vpc
        kind: terraform
        id: network1
        outputs:
          - name: subnetwork_name
            description: Automatically-generated output exported for use by later deployment groups
            sensitive: true
          - name: subnetwork_self_link
            description: Automatically-generated output exported for use by later deployment groups
            sensitive: true
        settings:
          deployment_name: ((var.deployment_name))
          labels: ((var.labels))
          project_id: ((var.project_id))
          region: ((var.region))
      - source: community/modules/file-system/nfs-server
        kind: terraform
        id: homefs
        use:
          - network1
        outputs:
          - name: network_storage
            description: Automatically-generated output exported for use by later deployment groups
            sensitive: true
        settings:
          deployment_name: ((var.deployment_name))
          disk_size: 2560
          instance_image:
            family: hpc-rocky-linux-8
            project: cloud-hpc-image-public
          labels: ((var.labels))
          local_mounts:
            - /home
          network_self_link: ((module.network1.network_self_link))
          project_id: ((var.project_id))
          subnetwork_self_link: ((module.network1.subnetwork_self_link))
          zone: ((var.zone))
      - source: community/modules/file-system/cloud-storage-bucket
        kind: terraform
        id: data_bucket
        outputs:
          - name: network_storage
            description: Automatically-generated output exported for use by later deployment groups
            sensitive: true
        settings:
          deployment_name: ((var.deployment_name))
          force_destroy: true
          labels: ((var.labels))
          local_mount: /data_bucket
          mount_options: defaults,_netdev,implicit_dirs,allow_other,dir_mode=0777,file_mode=766
          name_prefix: ((var.bucket_model))
          project_id: ((var.project_id))
          random_suffix: true
          region: ((var.region))
      - source: ./files
        kind: terraform
        id: move_files
        use:
          - data_bucket
        settings:
          gcs_bucket_path: ((module.data_bucket.gcs_bucket_path))
          project_id: ((var.project_id))
      - source: modules/scripts/startup-script
        kind: terraform
        id: packer_script
        outputs:
          - name: startup_script
            description: Automatically-generated output exported for use by later deployment groups
            sensitive: true
        settings:
          deployment_name: ((var.deployment_name))
          labels: ((var.labels))
          project_id: ((var.project_id))
          region: ((var.region))
          runners:
            - content: "#!/bin/bash\n# this script is designed to execute on Slurm images published by SchedMD that:\n# - are based on Debian 11 distribution of Linux\n# - have NVIDIA Drivers v530 pre-installed\n# - have CUDA Toolkit 12.1 pre-installed.\n\nset -e -o pipefail\n\n\nCONDA_BASE=/opt/conda\n\nif [ -d $CONDA_BASE ]; then\n        exit 0\nfi\n\nDL_DIR=\\$(mktemp -d)\ncd $DL_DIR\ncurl -O https://repo.anaconda.com/miniconda/Miniconda3-py310_23.3.1-0-Linux-x86_64.sh\nHOME=$DL_DIR bash Miniconda3-py310_23.3.1-0-Linux-x86_64.sh -b -p $CONDA_BASE\ncd -\nrm -rf $DL_DIR\nunset DL_DIR\n\ntee /tmp/llama2_env.yml << EOLL\nname: llama2\nchannels:\n  - conda-forge\n  - nvidia\n  - nvidia/label/cuda-12.4.0\ndependencies:\n  - appdirs\n  - loralib\n  - black\n  - black-jupyter\n  - py7zr\n  - scipy\n  - optimum\n  - datasets\n  - accelerate\n  - peft\n  - fairscale\n  - fire\n  - sentencepiece\n  - transformers\n  - huggingface_hub\n  - git\n  - pip\n  - pip:\n    - bitsandbytes\n    - nvidia-cudnn-cu12\n    - dataclasses\n    - nvidia-nccl-cu12\n    - trl\n    - torch\n    - torchaudio \n    - torchvision\n    - nvitop\nEOLL\n\nsource $CONDA_BASE/bin/activate base\nconda env create -n llama2 --file /tmp/llama2_env.yml\n"
              destination: install-ml-libraries.sh
              type: shell
      - source: modules/scripts/startup-script
        kind: terraform
        id: startup_script
        outputs:
          - name: startup_script
            description: Automatically-generated output exported for use by later deployment groups
            sensitive: true
        settings:
          deployment_name: ((var.deployment_name))
          install_cloud_ops_agent: false
          labels: ((var.labels))
          project_id: ((var.project_id))
          region: ((var.region))
          runners:
            - content: |
                #!/bin/bash
                CONDA_BASE=/opt/conda
                source $CONDA_BASE/bin/activate base
                conda init --system

                # UnInstall Stackdriver Agent

                sudo systemctl stop stackdriver-agent.service
                sudo systemctl disable stackdriver-agent.service
                curl -sSO https://dl.google.com/cloudagents/add-monitoring-agent-repo.sh
                sudo dpkg --configure -a
                sudo bash add-monitoring-agent-repo.sh --uninstall
                sudo bash add-monitoring-agent-repo.sh --remove-repo

                # Install ops-agent

                curl -sSO https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.sh
                sudo bash add-google-cloud-ops-agent-repo.sh --also-install
                sudo service google-cloud-ops-agent start
              destination: startup-script.sh
              type: shell
  - group: packer
    modules:
      - source: modules/packer/custom-image
        kind: packer
        id: custom-image
        use:
          - network1
          - packer_script
        settings:
          deployment_name: ((var.deployment_name))
          disk_size: ((var.disk_size_gb))
          image_family: ((var.new_image_family))
          labels: ((var.labels))
          machine_type: c2-standard-8
          project_id: ((var.project_id))
          source_image_family: slurm-gcp-6-6-debian-11
          source_image_project_id:
            - schedmd-slurm-public
          startup_script: ((module.packer_script.startup_script))
          state_timeout: 30m
          subnetwork_name: ((module.network1.subnetwork_name))
          zone: ((var.zone))
  - group: cluster
    terraform_providers:
      google:
        source: hashicorp/google
        version: ~> 6.38.0
        configuration:
          project: ((var.project_id))
          region: ((var.region))
          zone: ((var.zone))
      google-beta:
        source: hashicorp/google-beta
        version: ~> 6.38.0
        configuration:
          project: ((var.project_id))
          region: ((var.region))
          zone: ((var.zone))
    modules:
      - source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
        kind: terraform
        id: n1t4_nodeset
        use:
          - network1
        settings:
          advanced_machine_features:
            threads_per_core: 1
          bandwidth_tier: gvnic_enabled
          disk_size_gb: ((var.disk_size_gb))
          enable_placement: false
          enable_public_ips: true
          guest_accelerator:
            - count: 4
              type: nvidia-tesla-t4
          instance_image:
            family: ((var.new_image_family))
            project: ((var.project_id))
          instance_image_custom: ((var.instance_image_custom))
          labels: ((var.labels))
          machine_type: n1-standard-96
          name: n1t4_nodeset
          node_count_dynamic_max: 1
          on_host_maintenance: TERMINATE
          project_id: ((var.project_id))
          region: ((var.region))
          subnetwork_self_link: ((module.network1.subnetwork_self_link))
          zone: ((var.zone))
          zones: ((var.zone_list))
      - source: community/modules/compute/schedmd-slurm-gcp-v6-partition
        kind: terraform
        id: n1t4_partition
        use:
          - n1t4_nodeset
        settings:
          exclusive: false
          is_default: true
          nodeset: ((flatten([module.n1t4_nodeset.nodeset])))
          partition_name: n1t4
      - source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
        kind: terraform
        id: n2_nodeset
        use:
          - network1
        settings:
          advanced_machine_features:
            threads_per_core: 1
          bandwidth_tier: gvnic_enabled
          disk_size_gb: ((var.disk_size_gb))
          enable_public_ips: true
          instance_image:
            family: ((var.new_image_family))
            project: ((var.project_id))
          instance_image_custom: ((var.instance_image_custom))
          labels: ((var.labels))
          machine_type: n2-standard-4
          name: n2_nodeset
          node_count_dynamic_max: 1
          on_host_maintenance: TERMINATE
          project_id: ((var.project_id))
          region: ((var.region))
          subnetwork_self_link: ((module.network1.subnetwork_self_link))
          zone: ((var.zone))
          zones: ((var.zone_list))
      - source: community/modules/compute/schedmd-slurm-gcp-v6-partition
        kind: terraform
        id: n2_partition
        use:
          - n2_nodeset
        settings:
          is_default: true
          nodeset: ((flatten([module.n2_nodeset.nodeset])))
          partition_name: n2
      - source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
        kind: terraform
        id: g2_nodeset
        use:
          - network1
        settings:
          advanced_machine_features:
            threads_per_core: 1
          bandwidth_tier: gvnic_enabled
          disk_size_gb: ((var.disk_size_gb))
          enable_public_ips: true
          instance_image:
            family: ((var.new_image_family))
            project: ((var.project_id))
          instance_image_custom: ((var.instance_image_custom))
          labels: ((var.labels))
          machine_type: g2-standard-96
          name: g2_nodeset
          node_count_dynamic_max: 1
          on_host_maintenance: TERMINATE
          project_id: ((var.project_id))
          region: ((var.region))
          subnetwork_self_link: ((module.network1.subnetwork_self_link))
          zone: ((var.zone))
          zones: ((var.zone_list))
      - source: community/modules/compute/schedmd-slurm-gcp-v6-partition
        kind: terraform
        id: g2_partition
        use:
          - g2_nodeset
        settings:
          is_default: false
          nodeset: ((flatten([module.g2_nodeset.nodeset])))
          partition_name: g2gpu8
      - source: community/modules/scheduler/schedmd-slurm-gcp-v6-login
        kind: terraform
        id: slurm_login
        use:
          - network1
        settings:
          disk_size_gb: ((var.disk_size_gb))
          enable_login_public_ips: true
          instance_image:
            family: ((var.new_image_family))
            project: ((var.project_id))
          instance_image_custom: ((var.instance_image_custom))
          labels: ((var.labels))
          machine_type: n2-standard-4
          name_prefix: login
          project_id: ((var.project_id))
          region: ((var.region))
          subnetwork_self_link: ((module.network1.subnetwork_self_link))
          zone: ((var.zone))
      - source: community/modules/scheduler/schedmd-slurm-gcp-v6-controller
        kind: terraform
        id: slurm_controller
        use:
          - network1
          - n1t4_partition
          - n2_partition
          - g2_partition
          - slurm_login
          - homefs
          - data_bucket
        settings:
          controller_startup_script: ((module.startup_script.startup_script))
          controller_startup_scripts_timeout: 21600
          deployment_name: ((var.deployment_name))
          disk_size_gb: ((var.disk_size_gb))
          enable_controller_public_ips: true
          instance_image:
            family: ((var.new_image_family))
            project: ((var.project_id))
          instance_image_custom: ((var.instance_image_custom))
          labels: ((var.labels))
          login_nodes: ((flatten([module.slurm_login.login_nodes])))
          login_startup_script: ((module.startup_script.startup_script))
          login_startup_scripts_timeout: 21600
          network_storage: ((flatten([module.data_bucket.network_storage, flatten([module.homefs.network_storage])])))
          nodeset: ((flatten([module.g2_partition.nodeset, flatten([module.n2_partition.nodeset, flatten([module.n1t4_partition.nodeset])])])))
          nodeset_dyn: ((flatten([module.g2_partition.nodeset_dyn, flatten([module.n2_partition.nodeset_dyn, flatten([module.n1t4_partition.nodeset_dyn])])])))
          nodeset_tpu: ((flatten([module.g2_partition.nodeset_tpu, flatten([module.n2_partition.nodeset_tpu, flatten([module.n1t4_partition.nodeset_tpu])])])))
          partitions: ((flatten([module.g2_partition.partitions, flatten([module.n2_partition.partitions, flatten([module.n1t4_partition.partitions])])])))
          project_id: ((var.project_id))
          region: ((var.region))
          subnetwork_self_link: ((module.network1.subnetwork_self_link))
          zone: ((var.zone))

Execution environment

  • OS: glinux
  • Shell: bash
  • go version: go1.24.2 linux/amd64
