
feat: Add sample for Vertex distributed training #4163


Merged

Conversation

erwinh85
Member

@erwinh85 erwinh85 commented Jul 18, 2025

REQUIRED: Add a summary of your PR here, typically including why the change is needed and what was changed. Include any design alternatives for discussion purposes.


--- YOUR PR SUMMARY GOES HERE ---


REQUIRED: Fill out the below checklists or remove if irrelevant

  1. If you are opening a PR for Official Notebooks under the notebooks/official folder, follow this mandatory checklist:
  • Use the notebook template as a starting point.
  • Follow the style and grammar rules outlined in the above notebook template.
  • Verify the notebook runs successfully in Colab since the automated tests cannot guarantee this even when it passes.
  • Passes all the required automated checks. You can locally test for formatting and linting with these instructions.
  • You have consulted with a tech writer to see if tech writer review is necessary. If so, the notebook has been reviewed by a tech writer, and they have approved it.
  • This notebook has been added to the CODEOWNERS file under the Official Notebooks section, pointing to the author or the author's team.
  • The Jupyter notebook cleans up any artifacts it has created (datasets, ML models, endpoints, etc) so as not to eat up unnecessary resources.

  2. If you are opening a PR for Community Notebooks under the notebooks/community folder:
  • This notebook has been added to the CODEOWNERS file under the Community Notebooks section, pointing to the author or the author's team.
  • Passes all the required formatting and linting checks. You can locally test with these instructions.

  3. If you are opening a PR for Community Content under the community-content folder:
  • Make sure your main Content Directory Name is descriptive, informative, and includes some of the key products and attributes of your content, so that it is differentiable from other content.
  • The main content directory has been added to the CODEOWNERS file under the Community Content section, pointing to the author or the author's team.
  • Passes all the required formatting and linting checks. You can locally test with these instructions.

@erwinh85 erwinh85 requested a review from a team as a code owner July 18, 2025 19:38

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @erwinh85, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new, end-to-end sample for performing large-scale distributed training of a Llama 3.1 8B model on Vertex AI. It provides all the necessary components, from custom container creation and configuration to job submission and execution, leveraging NVIDIA's NeMo framework and A3 Mega VMs. The goal is to offer a clear, reproducible guide for users looking to run high-performance LLM pre-training on Google Cloud's AI infrastructure.

Highlights

  • New Distributed Training Sample: I've added a comprehensive new sample demonstrating distributed training on Google Cloud Vertex AI. This sample focuses on pre-training a Llama 3.1 8B model using NVIDIA's NeMo Framework on A3 Mega VMs (H100 GPUs).
  • Custom Container Build Process: The PR includes a Dockerfile and cloudbuild.yml to build a custom NeMo-based container image. This image is tailored for Vertex AI, incorporating necessary dependencies and patches to the NeMo framework for optimal performance and compatibility, including handling GCS data access and specific distributed training configurations.
  • Automated Job Launching: A Python script (launch.py) is provided to simplify the submission of the custom training job to Vertex AI. It reads job parameters from a JSON configuration file, allowing users to easily define machine types, GPU counts, and scheduling strategies (spot, DWS, or reservations).
  • Robust Distributed Execution: The sample's entrypoint script (run.py) handles environment setup, data download from GCS, and orchestrates the multi-node torchrun command. It includes a retry mechanism with torch.distributed.barrier() to enhance reliability for distributed jobs, especially when using preemptible instances.
  • NeMo Configuration and Patches: A detailed NeMo configuration (llama3_1_8b_pretrain_a3mega.yaml) is included, defining the Llama 3.1 8B model architecture and training hyperparameters. Several patches are applied to the NeMo framework within the Docker build to address specific integration needs with Vertex AI and improve logging/compatibility.
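The retry mechanism described for run.py can be sketched in plain Python. This is a minimal illustration, not the sample's actual code: the real entrypoint uses the retrying library and a torch.distributed.barrier() to keep nodes in lockstep between attempts, and `train_fn` here is a hypothetical stand-in for the torchrun launch.

```python
import time


def run_with_retries(train_fn, max_attempts=3, wait_seconds=30):
    """Retry a failing training launch, echoing run.py's retry-around-torchrun idea.

    In the actual sample a torch.distributed.barrier() synchronizes all nodes
    before each new attempt; that synchronization is omitted in this sketch.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return train_fn()
        except RuntimeError as exc:
            if attempt == max_attempts:
                raise  # exhausted retries; surface the failure to Vertex AI
            print(f"Attempt {attempt} failed ({exc}); retrying in {wait_seconds}s")
            time.sleep(wait_seconds)
```

This pattern matters most with Spot VMs, where a preemption mid-job would otherwise fail the whole CustomJob instead of restarting the torchrun step.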


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds a new sample for distributed training of a Llama 3.1 8B model on Vertex AI. I've found several critical issues, primarily related to incorrect or non-existent package dependencies that will prevent the code from running, as well as a potential bug in the distributed training logic within a patch file. I've also provided suggestions to improve code robustness and adhere to modern practices.

```
opencv-python-headless==4.11.0.86
docutils==0.16
urllib3==2.0.7
google-cloud-storage==3.0.0
```

critical

The package version google-cloud-storage==3.0.0 does not exist on PyPI. Correct this to a valid version.

```
google-cloud-storage==2.17.0
```

Comment on lines 1 to 49
```
absl-py==2.2.2
annotated-types==0.7.0
anyio==4.9.0
black==25.1.0
cachetools==5.5.2
certifi==2025.4.26
charset-normalizer==3.4.2
click==8.1.8
docstring_parser==0.16
google-api-core==2.24.2
google-auth==2.40.1
google-cloud-aiplatform==1.92.0
google-cloud-bigquery==3.31.0
google-cloud-core==2.4.3
google-cloud-resource-manager==1.14.2
google-cloud-storage==2.19.0
google-crc32c==1.7.1
google-genai==1.14.0
google-resumable-media==2.7.2
googleapis-common-protos==1.70.0
grpc-google-iam-v1==0.14.2
grpcio==1.71.0
grpcio-status==1.71.0
h11==0.16.0
httpcore==1.0.9
httpx==0.28.1
idna==3.10
mypy_extensions==1.1.0
numpy==2.2.5
packaging==25.0
pathspec==0.12.1
platformdirs==4.3.8
proto-plus==1.26.1
protobuf==5.29.4
pyasn1==0.6.1
pyasn1_modules==0.4.2
pydantic==2.11.4
pydantic_core==2.33.2
python-dateutil==2.9.0.post0
pytz==2025.2
requests==2.32.3
rsa==4.9.1
shapely==2.1.0
six==1.17.0
sniffio==1.3.1
typing-inspection==0.4.0
typing_extensions==4.13.2
urllib3==2.4.0
websockets==15.0.1
```

critical

Many package versions specified in this file do not exist on PyPI. Verify and correct all package versions to ones that are available on PyPI.

```python
from absl import app, flags, logging
from util import cluster_spec

from retrying import retry
```

critical

The retrying library is imported but not listed as a dependency in docker/requirements.txt. Add retrying to docker/requirements.txt.

Comment on lines 108 to 113
```python
reservation_name = config["reservation_name"]
reservation_full_name = (
    f"projects/{project_id}/zones/{zone}/reservations/{reservation_name}"
    if "reservation_name" in config
    else ""
)
```

high

Accessing config["reservation_name"] directly will raise a KeyError if it's not present. Use config.get("reservation_name") to safely access the value. The same issue exists for config["zone"] on line 101.

Suggested change

```diff
-reservation_name = config["reservation_name"]
-reservation_full_name = (
-    f"projects/{project_id}/zones/{zone}/reservations/{reservation_name}"
-    if "reservation_name" in config
-    else ""
-)
+reservation_name = config.get("reservation_name")
+reservation_full_name = (
+    f"projects/{project_id}/zones/{zone}/reservations/{reservation_name}"
+    if reservation_name and zone
+    else ""
+)
```
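To make the failure mode concrete, here is a small self-contained illustration (the `config` dict below is hypothetical, not the sample's actual job configuration):

```python
# Hypothetical config missing the optional reservation_name key.
config = {"project_id": "my-project"}

# .get() returns None instead of raising, so optional keys are safe to probe.
assert config.get("reservation_name") is None

# Direct indexing raises KeyError for a missing key.
try:
    config["reservation_name"]
except KeyError:
    print("direct indexing raises KeyError")
```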

Comment on lines 9 to 59
```python
class ClusterSpecTest(googletest.TestCase):

    def setUp(self):
        super().setUp()
        self.curr_env_var = os.environ.copy()

    def tearDown(self):
        super().tearDown()
        os.environ = self.curr_env_var

    def test_get_cluster_spec_from_env_vars(self):
        os.environ["CLUSTER_SPEC"] = ""
        os.environ["MASTER_ADDR"] = "127.0.0.1"
        os.environ["MASTER_PORT"] = "8080"
        os.environ["RANK"] = "0"
        os.environ["NNODES"] = "2"
        cluster_info = cluster_spec.get_cluster_spec()
        self.assertEqual(cluster_info.primary_node_addr, "127.0.0.1")
        self.assertEqual(cluster_info.primary_node_port, "8080")
        self.assertEqual(cluster_info.node_rank, 0)
        self.assertEqual(cluster_info.num_nodes, 2)

    def test_get_cluster_spec_from_cluster_spec(self):
        os.environ["CLUSTER_SPEC"] = """
        {
          "cluster": {
            "workerpool0": [
              "127.0.0.1:8080"
            ],
            "workerpool1": [
              "127.0.0.2:8080",
              "127.0.0.3:8080"
            ]
          },
          "task": {
            "type": "workerpool1",
            "index": 0
          }
        }
        """
        cluster_info = cluster_spec.get_cluster_spec()
        self.assertEqual(cluster_info.primary_node_addr, "127.0.0.1")
        self.assertEqual(cluster_info.primary_node_port, "8080")
        self.assertEqual(cluster_info.node_rank, 1)
        self.assertEqual(cluster_info.num_nodes, 3)


if __name__ == "__main__":
    googletest.main()
```

high

This test file is set up for googletest, which is a C++ testing framework. Use the standard library unittest instead.

```python
import os
import unittest

from . import cluster_spec


class ClusterSpecTest(unittest.TestCase):

    def setUp(self):
        super().setUp()
        self.curr_env_var = os.environ.copy()

    def tearDown(self):
        super().tearDown()
        os.environ = self.curr_env_var

    def test_get_cluster_spec_from_env_vars(self):
        os.environ["CLUSTER_SPEC"] = ""
        os.environ["MASTER_ADDR"] = "127.0.0.1"
        os.environ["MASTER_PORT"] = "8080"
        os.environ["RANK"] = "0"
        os.environ["NNODES"] = "2"
        cluster_info = cluster_spec.get_cluster_spec()
        self.assertEqual(cluster_info.primary_node_addr, "127.0.0.1")
        self.assertEqual(cluster_info.primary_node_port, "8080")
        self.assertEqual(cluster_info.node_rank, 0)
        self.assertEqual(cluster_info.num_nodes, 2)

    def test_get_cluster_spec_from_cluster_spec(self):
        os.environ["CLUSTER_SPEC"] = """
        {
          "cluster": {
            "workerpool0": [
              "127.0.0.1:8080"
            ],
            "workerpool1": [
              "127.0.0.2:8080",
              "127.0.0.3:8080"
            ]
          },
          "task": {
            "type": "workerpool1",
            "index": 0
          }
        }
        """
        cluster_info = cluster_spec.get_cluster_spec()
        self.assertEqual(cluster_info.primary_node_addr, "127.0.0.1")
        self.assertEqual(cluster_info.primary_node_port, "8080")
        self.assertEqual(cluster_info.node_rank, 1)
        self.assertEqual(cluster_info.num_nodes, 3)


if __name__ == "__main__":
    unittest.main()
```
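For context, here is a minimal sketch of what `util/cluster_spec.py`'s `get_cluster_spec()` could look like, reverse-engineered from the expectations in the tests above. The `ClusterInfo` name and the pool-ordering logic are assumptions; the sample's actual implementation may differ.

```python
import json
import os
from dataclasses import dataclass


@dataclass
class ClusterInfo:
    primary_node_addr: str
    primary_node_port: str
    node_rank: int
    num_nodes: int


def get_cluster_spec() -> ClusterInfo:
    """Resolve cluster topology from Vertex AI's CLUSTER_SPEC, else env vars."""
    raw = os.environ.get("CLUSTER_SPEC", "")
    if not raw:
        # Fall back to torchrun-style variables (first test case above).
        return ClusterInfo(
            primary_node_addr=os.environ["MASTER_ADDR"],
            primary_node_port=os.environ["MASTER_PORT"],
            node_rank=int(os.environ["RANK"]),
            num_nodes=int(os.environ["NNODES"]),
        )
    spec = json.loads(raw)
    cluster = spec["cluster"]
    # workerpool0 hosts the primary node; later pools hold the remaining workers.
    addr, port = cluster["workerpool0"][0].split(":")
    pools = sorted(cluster)
    num_nodes = sum(len(cluster[p]) for p in pools)
    task = spec["task"]
    # Global rank = nodes in all pools ordered before this one + index within it.
    rank = sum(len(cluster[p]) for p in pools if p < task["type"]) + task["index"]
    return ClusterInfo(addr, port, rank, num_nodes)
```

With this sketch, a worker in workerpool1 at index 0 behind a single-node workerpool0 resolves to global rank 1 of 3 nodes, matching the second test.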

### 4.1. Job Configuration File
Once the container is built, update the job_config.json to set up the training job.
File: job_config.json
    ```bash

medium

The code block for job_config.json is highlighted as bash, but it contains JSON content. Using json for the language identifier will provide correct syntax highlighting.

Suggested change: replace the opening fence `` ```bash `` with `` ```json ``.

```json
"zone": "<zone if using reservation>",
"bucket": "<bucket>",
"dataset_bucket": "github-repo/data/third-party/enwiki-latest-pages-articles",
"image_uri": "<docker image uri from artifact registry>>",
```

medium

Remove the extra > at the end of the placeholder for image_uri.

Suggested change

```diff
-"image_uri": "<docker image uri from artifact registry>>",
+"image_uri": "<docker image uri from artifact registry>",
```

```yaml
ffn_hidden_size: 14336 # 8b: 14336 | 70b: 28672 | 405b: 53248
num_attention_heads: 32 # 8b: 32 | 70b: 64 | 405b: 128
num_query_groups: 8 # Number of query groups for group query attention. If None, normal attention is used. 8b: 8 | 70b: 8 | 405b: 16
init_method_std: 0.01 # Standard deviation of the zero mean normal distribution used for weight initialization.') 8b: 0.01 | 70b: 0.008944 | 405b: 0.02
```

medium

Remove the stray ') at the end of the comment.

```yaml
init_method_std: 0.01 # Standard deviation of the zero mean normal distribution used for weight initialization. 8b: 0.01 | 70b: 0.008944 | 405b: 0.02
```

Comment on lines 24 to 29
```dockerfile
RUN echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] http://packages.cloud.google.com/apt cloud-sdk main" | \
    tee -a /etc/apt/sources.list.d/google-cloud-sdk.list && \
    curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | \
    apt-key --keyring /usr/share/keyrings/cloud.google.gpg add - && \
    apt-get update -y && apt-get install google-cloud-sdk -y && \
    rm -rf /var/lib/apt/lists*
```

medium

apt-key is deprecated. Download the key directly to the keyring location instead.

```dockerfile
RUN echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" | \
    tee -a /etc/apt/sources.list.d/google-cloud-sdk.list && \
    curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | gpg --dearmor -o /usr/share/keyrings/cloud.google.gpg && \
    apt-get update -y && apt-get install google-cloud-sdk -y && \
    rm -rf /var/lib/apt/lists*
```

Comment on lines 60 to 63
```dockerfile
RUN patch --verbose -l -d /opt/megatron-lm/megatron/core/datasets -p1 -i /workspace/vdt/patches/local_rank.patch; \
    git -C /workspace/gpu-recipes apply /workspace/vdt/patches/throughput_calc.patch; \
    git -C /opt/NeMo apply /workspace/vdt/patches/nemo2hf.patch; \
    git -C /opt/NeMo apply /workspace/vdt/patches/sigabort.patch;
```

medium

Use && instead of ; to chain commands. This will cause the RUN step to fail immediately if any command returns a non-zero exit code.

```dockerfile
RUN patch --verbose -l -d /opt/megatron-lm/megatron/core/datasets -p1 -i /workspace/vdt/patches/local_rank.patch && \
    git -C /workspace/gpu-recipes apply /workspace/vdt/patches/throughput_calc.patch && \
    git -C /opt/NeMo apply /workspace/vdt/patches/nemo2hf.patch && \
    git -C /opt/NeMo apply /workspace/vdt/patches/sigabort.patch
```

Run this command to build the container and push it into the Google Artifact Registry.

```bash
cd "${REPO_ROOT}/a3mega/llama-3-8b-nemo-pretraining"
```


This is incorrect with the new vertexai-samples repo structure.



### 4.1. Job Configuration File
Once the container is built, update the job_config.json to set up the training job.


There is no job_config.json file in the folder.

```json
"gpu_type": "NVIDIA_H100_MEGA_80GB",
"gpus_per_node": "8",
"recipe_name": "llama3_1_8b_pretrain_a3mega",
"job_prefix": "mchrestkha-spot-",
```


Remove the reference to 'mchrestkha' and replace it with a blank prefix.
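Combining the fields quoted across this review, a plausible job_config.json might look like the fragment below. This is an assumption assembled from the snippets above plus the keys launch.py reads (project_id, zone, reservation_name), not the canonical file from the PR:

```json
{
  "project_id": "<project id>",
  "zone": "<zone if using reservation>",
  "reservation_name": "<reservation name, optional>",
  "bucket": "<bucket>",
  "dataset_bucket": "github-repo/data/third-party/enwiki-latest-pages-articles",
  "image_uri": "<docker image uri from artifact registry>",
  "gpu_type": "NVIDIA_H100_MEGA_80GB",
  "gpus_per_node": "8",
  "recipe_name": "llama3_1_8b_pretrain_a3mega",
  "job_prefix": ""
}
```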

@erwinh85 erwinh85 force-pushed the vertex-distributed-training branch from 4a23b17 to bcbbc8d on July 23, 2025 at 10:25
@mchrestkha

LGTM. Ran 3 successful jobs with Spot VMs in us-east4.

@erwinh85 erwinh85 merged commit 66ce2fa into GoogleCloudPlatform:main Jul 23, 2025
5 checks passed
3 participants