
Added notebooks and Dockerfiles for serving open models on Vertex AI using vLLM custom containers #4148


Open

wants to merge 13 commits into base: main

Conversation

ravi-dalal
Contributor

REQUIRED: Add a summary of your PR here, typically including why the change is needed and what was changed. Include any design alternatives for discussion purposes.


This pull request contains four Colab notebooks that demonstrate how an open large language model (e.g., Llama 3.2) can be deployed to Vertex AI Endpoints via custom vLLM containers on TPUs, GPUs, and CPUs. It also includes a Git patch file that adapts the open-source vLLM API server for Vertex AI, as well as Dockerfiles for the three platforms (TPU, GPU, and CPU).


REQUIRED: Fill out the below checklists or remove if irrelevant

  1. If you are opening a PR for Official Notebooks under the notebooks/official folder, follow this mandatory checklist:
  • Use the notebook template as a starting point.
  • Follow the style and grammar rules outlined in the above notebook template.
  • Verify the notebook runs successfully in Colab since the automated tests cannot guarantee this even when it passes.
  • Passes all the required automated checks. You can locally test for formatting and linting with these instructions.
  • You have consulted with a tech writer to see if tech writer review is necessary. If so, the notebook has been reviewed by a tech writer, and they have approved it.
  • This notebook has been added to the CODEOWNERS file under the Official Notebooks section, pointing to the author or the author's team.
  • The Jupyter notebook cleans up any artifacts it has created (datasets, ML models, endpoints, etc) so as not to eat up unnecessary resources.

  2. If you are opening a PR for Community Notebooks under the notebooks/community folder:
  • This notebook has been added to the CODEOWNERS file under the Community Notebooks section, pointing to the author or the author's team.
  • Passes all the required formatting and linting checks. You can locally test with these instructions.

  3. If you are opening a PR for Community Content under the community-content folder:
  • Make sure your main Content Directory Name is descriptive, informative, and includes some of the key products and attributes of your content, so that it is distinguishable from other content.
  • The main content directory has been added to the CODEOWNERS file under the Community Content section, pointing to the author or the author's team.
  • Passes all the required formatting and linting checks. You can locally test with these instructions.

@ravi-dalal requested a review from a team as a code owner on July 11, 2025 at 19:10
Contributor

@gemini-code-assist bot left a comment


Summary of Changes

Hello @ravi-dalal, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request delivers a robust solution for serving open large language models on Vertex AI. It provides a set of detailed Colab notebooks and custom container configurations that streamline the deployment of vLLM-powered models across CPU, GPU, and TPU hardware. The changes ensure compatibility with Vertex AI's prediction API, offering flexibility in model sourcing from Hugging Face or Google Cloud Storage, thereby simplifying the path to production for open-source LLMs.

Highlights

  • Open Model Serving on Vertex AI: This pull request introduces comprehensive support for deploying open large language models (LLMs), specifically Llama 3.2 3B, on Google Cloud Vertex AI Endpoints using vLLM custom containers.
  • Multi-Platform Notebooks: Four new Colab notebooks are added, demonstrating end-to-end deployment workflows for CPU, GPU, and TPU environments. This includes a specific notebook for loading models from Google Cloud Storage (GCS) on TPUs, alongside Hugging Face model loading for all platforms.
  • vLLM Customization for Vertex AI: A custom Git patch (vertexai.patch) is included to modify the vLLM API server. This patch adapts vLLM's request and response formats to be fully compatible with Vertex AI's prediction service requirements, ensuring seamless integration.
  • Containerization and Model Loading: Dedicated Dockerfiles for CPU, GPU, and TPU are provided, along with a custom entrypoint.sh script. This script enhances the container's capability to dynamically download models from either Hugging Face or Google Cloud Storage before starting the vLLM server.
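To make the Highlights concrete, here is a minimal sketch of the flow the notebooks automate with the Vertex AI Python SDK: upload the custom vLLM container as a Model, deploy it to an Endpoint, and send a request in the instances/predictions shape that the vertexai.patch targets. The project, container URI, routes, and request fields below are illustrative assumptions, not the exact values used in this PR.

# Hypothetical sketch only; names, URIs, and routes are assumptions for illustration.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")  # assumed project/region

# Upload the custom vLLM serving container as a Vertex AI Model resource.
model = aiplatform.Model.upload(
    display_name="llama-3-2-3b-vllm",  # assumed display name
    serving_container_image_uri="us-central1-docker.pkg.dev/my-project/my-repo/vllm-gcp-gpu",
    serving_container_ports=[8080],                      # assumed serving port
    serving_container_predict_route="/v1/completions",   # assumed; the patch may remap routes
    serving_container_health_route="/health",            # assumed health route
)

# Deploy to an Endpoint on the target hardware (GPU shown; CPU/TPU differ only in machine specs).
endpoint = model.deploy(
    machine_type="g2-standard-8",
    accelerator_type="NVIDIA_L4",
    accelerator_count=1,
)

# Vertex AI prediction requests wrap payloads in an "instances" list and return "predictions";
# the vertexai.patch adapts vLLM's API server to this request/response format.
response = endpoint.predict(instances=[{"prompt": "What is vLLM?", "max_tokens": 64}])
print(response.predictions)
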
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature | Command | Description
Code Review | /gemini review | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state.
Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments.
Help | /gemini help | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder at the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its suggestions are incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@ravi-dalal
Contributor Author

Tagging tech writer for review as well. @ktonthat please help with the review.

@ravi-dalal
Contributor Author

@RajeshThallam PTAL.

Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request introduces a set of notebooks and Dockerfiles to demonstrate serving open models on Vertex AI using vLLM. The changes are comprehensive, covering CPU, GPU, and TPU deployments. I've found several critical issues related to incorrect paths, use of deprecated parameters, and missing resource cleanup in the notebooks, which will cause errors and could lead to unexpected costs. The Dockerfiles also have issues that will prevent them from building successfully. I've provided specific suggestions to address these problems.

"outputs": [],
"source": [
"DOCKER_URI = f\"{LOCATION}-docker.pkg.dev/{PROJECT_ID}/{DOCKER_REPOSITORY}/vllm-gcp-tpu\"\n",
"! cd docker && docker build -f Dockerfile.tpu -t {DOCKER_URI} ."
Contributor

critical

The docker build command will fail due to an incorrect path. The notebook is in the colabs directory, while the Dockerfile is in the docker directory at the same level. The cd command should navigate up one level and then into docker.

! cd ../docker && docker build -f Dockerfile.tpu -t {DOCKER_URI} .

WORKDIR /workspace

# Download vLLM source code and apply Vertex AI Patch
RUN git clone https://github.com/vllm-project/vllm.git
Contributor

critical

The COPY commands on the following lines will fail because the destination directory /workspace/vllm/vertexai/ does not exist after cloning the vllm repository. You need to create this directory before copying files into it.

RUN git clone https://github.com/vllm-project/vllm.git && mkdir -p /workspace/vllm/vertexai

WORKDIR /workspace

# Download vLLM source code and apply Vertex AI Patch
RUN git clone https://github.com/vllm-project/vllm.git
Contributor

critical

The COPY commands on the following lines will fail because the destination directory /workspace/vllm/vertexai/ does not exist after cloning the vllm repository. You need to create this directory before copying files into it.

RUN git clone https://github.com/vllm-project/vllm.git && mkdir -p /workspace/vllm/vertexai

"outputs": [],
"source": [
"DOCKER_URI = f\"{LOCATION}-docker.pkg.dev/{PROJECT_ID}/{DOCKER_REPOSITORY}/vllm-gcp-cpu\"\n",
"! cd docker && docker build -f Dockerfile.cpu -t {DOCKER_URI} ."
Contributor

critical

The docker build command will fail due to an incorrect path. The notebook is in the colabs directory, while the Dockerfile is in the docker directory at the same level. The cd command should navigate up one level and then into docker.

! cd ../docker && docker build -f Dockerfile.cpu -t {DOCKER_URI} .

"outputs": [],
"source": [
"DOCKER_URI = f\"{LOCATION}-docker.pkg.dev/{PROJECT_ID}/{DOCKER_REPOSITORY}/vllm-gcp-tpu\"\n",
"! cd docker && docker build -f /Dockerfile.tpu -t {DOCKER_URI} ."
Contributor

critical

The docker build command has two issues that will cause it to fail:

  1. The path to the Dockerfile, -f /Dockerfile.tpu, is an absolute path. It should be relative: Dockerfile.tpu.
  2. The cd docker command will fail because the notebook is in the colabs directory. The path should be ../docker.
! cd ../docker && docker build -f Dockerfile.tpu -t {DOCKER_URI} .

Comment on lines 361 to 363
" \"--max-model-len=2048\",\n",
" \"--enable-prefix-caching\",\n",
" f\"--tensor-parallel-size={machine_type[-2]}\",\n",
Contributor

medium

Deriving the tensor-parallel-size from the machine type string using machine_type[-2] is clever but brittle and hard to read. A more robust and explicit method would improve maintainability. Consider parsing the string more clearly and adding a comment.

        # Extract the number of TPU cores from the machine type string (e.g., 'ct5lp-hightpu-1t' -> '1')
        tpu_cores = machine_type.split('-')[-1][0]
        f"--tensor-parallel-size={tpu_cores}",

&& pip install -v -r requirements/cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu

# Install vLLM
RUN VLLM_TARGET_DEVICE="cpu" python setup.py install
Contributor

medium

Using python setup.py install is a legacy practice that is discouraged. The modern, recommended way to install a package from source is by using pip.

RUN VLLM_TARGET_DEVICE="cpu" pip install .

Comment on lines 486 to 488
" \"--max-model-len=2048\",\n",
" \"--enable-prefix-caching\",\n",
" f\"--tensor-parallel-size={machine_type[-2]}\",\n",
Contributor

medium

Deriving the tensor-parallel-size from the machine type string using machine_type[-2] is clever but brittle and hard to read. A more robust and explicit method would improve maintainability. Consider parsing the string more clearly and adding a comment.

        # Extract the number of TPU cores from the machine type string (e.g., 'ct5lp-hightpu-1t' -> '1')
        tpu_cores = machine_type.split('-')[-1][0]
        f"--tensor-parallel-size={tpu_cores}",

"model_id = \"meta-llama/Llama-3.2-3B\" # @param {type:\"string\"}\n",
"machine_type = \"g2-standard-8\" # @param {type:\"string\"}\n",
"accelerator_type = \"NVIDIA_L4\" # @param {type:\"string\"}\n",
"accelerator_count = 1 # @param {type:\"string\"}"
Contributor

medium

The accelerator_count parameter is defined as a string but is used as an integer. For clarity and to prevent potential type-related issues, it's better to define it with the correct type.

accelerator_count = 1  # @param {type:"integer"}

The repository has customization required for serving open models on Vertex AI using [vLLM](https://github.com/vllm-project/vllm.git).

## Using TPU
This [colab notebook](colabs/vertexai_serving_vllm_tpu_llama3_2_3B.ipynb) shows how Llama 3.2 3B model can be deployed (downloaded from Hugging Face) to Vertex AI Endpoint using this repository on TPUs.
Contributor

medium

There are two trailing spaces at the end of this line. Please remove them for better formatting.

Suggested change
This [colab notebook](colabs/vertexai_serving_vllm_tpu_llama3_2_3B.ipynb) shows how Llama 3.2 3B model can be deployed (downloaded from Hugging Face) to Vertex AI Endpoint using this repository on TPUs.
This [colab notebook](colabs/vertexai_serving_vllm_tpu_llama3_2_3B.ipynb) shows how Llama 3.2 3B model can be deployed (downloaded from Hugging Face) to Vertex AI Endpoint using this repository on TPUs.
