
Added notebooks and Dockerfiles for serving open models on Vertex AI using vLLM custom containers #4148


Open

wants to merge 13 commits into base: main

Conversation

ravi-dalal
Contributor

REQUIRED: Add a summary of your PR here, typically including why the change is needed and what was changed. Include any design alternatives for discussion purposes.


This pull request contains four Colab notebooks that demonstrate how an open large language model (e.g., Llama 3.2) can be deployed to Vertex AI Endpoints via custom vLLM containers on TPUs, GPUs, and CPUs. It also includes a Git patch file that adapts the open-source vLLM API server for Vertex AI, as well as Dockerfiles for the three platforms (TPU, GPU, and CPU).


REQUIRED: Fill out the below checklists or remove if irrelevant

  1. If you are opening a PR for Official Notebooks under the notebooks/official folder, follow this mandatory checklist:
  • Use the notebook template as a starting point.
  • Follow the style and grammar rules outlined in the above notebook template.
  • Verify the notebook runs successfully in Colab since the automated tests cannot guarantee this even when it passes.
  • Passes all the required automated checks. You can locally test for formatting and linting with these instructions.
  • You have consulted with a tech writer to see if tech writer review is necessary. If so, the notebook has been reviewed by a tech writer, and they have approved it.
  • This notebook has been added to the CODEOWNERS file under the Official Notebooks section, pointing to the author or the author's team.
  • The Jupyter notebook cleans up any artifacts it has created (datasets, ML models, endpoints, etc) so as not to eat up unnecessary resources.

  2. If you are opening a PR for Community Notebooks under the notebooks/community folder:
  • This notebook has been added to the CODEOWNERS file under the Community Notebooks section, pointing to the author or the author's team.
  • Passes all the required formatting and linting checks. You can locally test with these instructions.

  3. If you are opening a PR for Community Content under the community-content folder:
  • Make sure your main Content Directory Name is descriptive, informative, and includes some of the key products and attributes of your content, so that it is distinguishable from other content.
  • The main content directory has been added to the CODEOWNERS file under the Community Content section, pointing to the author or the author's team.
  • Passes all the required formatting and linting checks. You can locally test with these instructions.

@ravi-dalal requested a review from a team as a code owner on July 11, 2025 at 19:10
Contributor

@gemini-code-assist bot left a comment


Summary of Changes

Hello @ravi-dalal, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request delivers a robust solution for serving open large language models on Vertex AI. It provides a set of detailed Colab notebooks and custom container configurations that streamline the deployment of vLLM-powered models across CPU, GPU, and TPU hardware. The changes ensure compatibility with Vertex AI's prediction API, offering flexibility in model sourcing from Hugging Face or Google Cloud Storage, thereby simplifying the path to production for open-source LLMs.

Highlights

  • Open Model Serving on Vertex AI: This pull request introduces comprehensive support for deploying open large language models (LLMs), specifically Llama 3.2 3B, on Google Cloud Vertex AI Endpoints using vLLM custom containers.
  • Multi-Platform Notebooks: Four new Colab notebooks are added, demonstrating end-to-end deployment workflows for CPU, GPU, and TPU environments. This includes a specific notebook for loading models from Google Cloud Storage (GCS) on TPUs, alongside Hugging Face model loading for all platforms.
  • vLLM Customization for Vertex AI: A custom Git patch (vertexai.patch) is included to modify the vLLM API server. This patch adapts vLLM's request and response formats to be fully compatible with Vertex AI's prediction service requirements, ensuring seamless integration.
  • Containerization and Model Loading: Dedicated Dockerfiles for CPU, GPU, and TPU are provided, along with a custom entrypoint.sh script. This script enhances the container's capability to dynamically download models from either Hugging Face or Google Cloud Storage before starting the vLLM server.
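To make the Highlights concrete, here is a minimal sketch of the flow the notebooks automate with the Vertex AI Python SDK: upload the custom vLLM container as a Model, deploy it to an Endpoint, and send a request in the instances/predictions shape that the vertexai.patch targets. The project, container URI, routes, and request fields below are illustrative assumptions, not the exact values used in this PR.

# Hypothetical sketch only; names, URIs, and routes are assumptions for illustration.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")  # assumed project/region

# Upload the custom vLLM serving container as a Vertex AI Model resource.
model = aiplatform.Model.upload(
    display_name="llama-3-2-3b-vllm",  # assumed display name
    serving_container_image_uri="us-central1-docker.pkg.dev/my-project/my-repo/vllm-gcp-gpu",
    serving_container_ports=[8080],                      # assumed serving port
    serving_container_predict_route="/v1/completions",   # assumed; the patch may remap routes
    serving_container_health_route="/health",            # assumed health route
)

# Deploy to an Endpoint on the target hardware (GPU shown; CPU/TPU differ only in machine specs).
endpoint = model.deploy(
    machine_type="g2-standard-8",
    accelerator_type="NVIDIA_L4",
    accelerator_count=1,
)

# Vertex AI prediction requests wrap payloads in an "instances" list and return "predictions";
# the vertexai.patch adapts vLLM's API server to this request/response format.
response = endpoint.predict(instances=[{"prompt": "What is vLLM?", "max_tokens": 64}])
print(response.predictions)
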
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature | Command | Description
Code Review | /gemini review | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state.
Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments.
Help | /gemini help | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder at the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its suggestions are incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@ravi-dalal
Contributor Author

Tagging tech writer for review as well. @ktonthat please help with the review.

@ravi-dalal
Contributor Author

@RajeshThallam PTAL.

Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request introduces a set of notebooks and Dockerfiles to demonstrate serving open models on Vertex AI using vLLM. The changes are comprehensive, covering CPU, GPU, and TPU deployments. I've found several critical issues related to incorrect paths, use of deprecated parameters, and missing resource cleanup in the notebooks, which will cause errors and could lead to unexpected costs. The Dockerfiles also have issues that will prevent them from building successfully. I've provided specific suggestions to address these problems.

"outputs": [],
"source": [
"DOCKER_URI = f\"{LOCATION}-docker.pkg.dev/{PROJECT_ID}/{DOCKER_REPOSITORY}/vllm-gcp-tpu\"\n",
"! cd docker && docker build -f Dockerfile.tpu -t {DOCKER_URI} ."
Contributor

critical

The docker build command will fail due to an incorrect path. The notebook is in the colabs directory, while the Dockerfile is in the docker directory at the same level. The cd command should navigate up one level and then into docker.

! cd ../docker && docker build -f Dockerfile.tpu -t {DOCKER_URI} .

WORKDIR /workspace

# Download vLLM source code and apply Vertex AI Patch
RUN git clone https://github.com/vllm-project/vllm.git
Contributor

critical

The COPY commands on the following lines will fail because the destination directory /workspace/vllm/vertexai/ does not exist after cloning the vllm repository. You need to create this directory before copying files into it.

RUN git clone https://github.com/vllm-project/vllm.git && mkdir -p /workspace/vllm/vertexai

WORKDIR /workspace

# Download vLLM source code and apply Vertex AI Patch
RUN git clone https://github.com/vllm-project/vllm.git
Contributor

critical

The COPY commands on the following lines will fail because the destination directory /workspace/vllm/vertexai/ does not exist after cloning the vllm repository. You need to create this directory before copying files into it.

RUN git clone https://github.com/vllm-project/vllm.git && mkdir -p /workspace/vllm/vertexai

"outputs": [],
"source": [
"DOCKER_URI = f\"{LOCATION}-docker.pkg.dev/{PROJECT_ID}/{DOCKER_REPOSITORY}/vllm-gcp-cpu\"\n",
"! cd docker && docker build -f Dockerfile.cpu -t {DOCKER_URI} ."
Contributor

critical

The docker build command will fail due to an incorrect path. The notebook is in the colabs directory, while the Dockerfile is in the docker directory at the same level. The cd command should navigate up one level and then into docker.

! cd ../docker && docker build -f Dockerfile.cpu -t {DOCKER_URI} .

"outputs": [],
"source": [
"DOCKER_URI = f\"{LOCATION}-docker.pkg.dev/{PROJECT_ID}/{DOCKER_REPOSITORY}/vllm-gcp-tpu\"\n",
"! cd docker && docker build -f /Dockerfile.tpu -t {DOCKER_URI} ."
Contributor

critical

The docker build command has two issues that will cause it to fail:

  1. The path to the Dockerfile, -f /Dockerfile.tpu, is an absolute path. It should be relative: Dockerfile.tpu.
  2. The cd docker command will fail because the notebook is in the colabs directory. The path should be ../docker.
! cd ../docker && docker build -f Dockerfile.tpu -t {DOCKER_URI} .

Comment on lines 361 to 363
" \"--max-model-len=2048\",\n",
" \"--enable-prefix-caching\",\n",
" f\"--tensor-parallel-size={machine_type[-2]}\",\n",
Contributor

medium

Deriving the tensor-parallel-size from the machine type string using machine_type[-2] is clever but brittle and hard to read. A more robust and explicit method would improve maintainability. Consider parsing the string more clearly and adding a comment.

        # Extract the number of TPU cores from the machine type string (e.g., 'ct5lp-hightpu-1t' -> '1')
        tpu_cores = machine_type.split('-')[-1][0]
        f"--tensor-parallel-size={tpu_cores}",

&& pip install -v -r requirements/cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu

# Install vLLM
RUN VLLM_TARGET_DEVICE="cpu" python setup.py install
Contributor

medium

Using python setup.py install is a legacy practice that is discouraged. The modern, recommended way to install a package from source is by using pip.

RUN VLLM_TARGET_DEVICE="cpu" pip install .

Comment on lines 486 to 488
" \"--max-model-len=2048\",\n",
" \"--enable-prefix-caching\",\n",
" f\"--tensor-parallel-size={machine_type[-2]}\",\n",
Contributor

medium

Deriving the tensor-parallel-size from the machine type string using machine_type[-2] is clever but brittle and hard to read. A more robust and explicit method would improve maintainability. Consider parsing the string more clearly and adding a comment.

        # Extract the number of TPU cores from the machine type string (e.g., 'ct5lp-hightpu-1t' -> '1')
        tpu_cores = machine_type.split('-')[-1][0]
        f"--tensor-parallel-size={tpu_cores}",

"model_id = \"meta-llama/Llama-3.2-3B\" # @param {type:\"string\"}\n",
"machine_type = \"g2-standard-8\" # @param {type:\"string\"}\n",
"accelerator_type = \"NVIDIA_L4\" # @param {type:\"string\"}\n",
"accelerator_count = 1 # @param {type:\"string\"}"
Contributor

medium

The accelerator_count parameter is defined as a string but is used as an integer. For clarity and to prevent potential type-related issues, it's better to define it with the correct type.

accelerator_count = 1  # @param {type:"integer"}

The repository has customization required for serving open models on Vertex AI using [vLLM](https://github.com/vllm-project/vllm.git).

## Using TPU
This [colab notebook](colabs/vertexai_serving_vllm_tpu_llama3_2_3B.ipynb) shows how Llama 3.2 3B model can be deployed (downloaded from Hugging Face) to Vertex AI Endpoint using this repository on TPUs.
Contributor

medium

There are two trailing spaces at the end of this line. Please remove them for better formatting.

Suggested change
This [colab notebook](colabs/vertexai_serving_vllm_tpu_llama3_2_3B.ipynb) shows how Llama 3.2 3B model can be deployed (downloaded from Hugging Face) to Vertex AI Endpoint using this repository on TPUs.
This [colab notebook](colabs/vertexai_serving_vllm_tpu_llama3_2_3B.ipynb) shows how Llama 3.2 3B model can be deployed (downloaded from Hugging Face) to Vertex AI Endpoint using this repository on TPUs.
