added notebooks and dockerfiles for serving open models on vertexai using vllm custom containers #4148
base: main
Conversation
Summary of Changes
Hello @ravi-dalal, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request delivers a robust solution for serving open large language models on Vertex AI. It provides a set of detailed Colab notebooks and custom container configurations that streamline the deployment of vLLM-powered models across CPU, GPU, and TPU hardware. The changes ensure compatibility with Vertex AI's prediction API, offering flexibility in model sourcing from Hugging Face or Google Cloud Storage, thereby simplifying the path to production for open-source LLMs.
Highlights
- Open Model Serving on Vertex AI: This pull request introduces comprehensive support for deploying open large language models (LLMs), specifically Llama 3.2 3B, on Google Cloud Vertex AI Endpoints using vLLM custom containers.
- Multi-Platform Notebooks: Four new Colab notebooks are added, demonstrating end-to-end deployment workflows for CPU, GPU, and TPU environments. This includes a specific notebook for loading models from Google Cloud Storage (GCS) on TPUs, alongside Hugging Face model loading for all platforms.
- vLLM Customization for Vertex AI: A custom Git patch (`vertexai.patch`) is included to modify the vLLM API server. This patch adapts vLLM's request and response formats to be fully compatible with Vertex AI's prediction service requirements, ensuring seamless integration (sketched after this list).
- Containerization and Model Loading: Dedicated Dockerfiles for CPU, GPU, and TPU are provided, along with a custom `entrypoint.sh` script. This script enhances the container's capability to dynamically download models from either Hugging Face or Google Cloud Storage before starting the vLLM server (also illustrated below).
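For context on the patch: Vertex AI custom prediction containers receive requests wrapped in an `instances` array and must return a `predictions` array. The sketch below shows the general shape of that adaptation; the function names and field mapping are illustrative assumptions, not the actual contents of `vertexai.patch`.

```python
# Illustrative sketch only -- not the actual vertexai.patch code.
# Vertex AI sends {"instances": [...]} and expects {"predictions": [...]}.

def vertex_to_vllm(vertex_request: dict) -> list[dict]:
    """Unwrap the Vertex AI 'instances' envelope into per-prompt vLLM requests."""
    return [{"prompt": instance["prompt"]} for instance in vertex_request["instances"]]

def vllm_to_vertex(vllm_outputs: list[dict]) -> dict:
    """Wrap vLLM completions back into the Vertex AI response envelope."""
    return {"predictions": [output["text"] for output in vllm_outputs]}
```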
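The model-download dispatch in `entrypoint.sh` is a shell script; as a rough illustration of the logic it implements, here is a Python rendering (the environment variable name and local paths are assumptions, not the script's actual interface):

```python
# Illustrative Python rendering of the entrypoint's model-download dispatch;
# the real entrypoint.sh is a shell script, and MODEL_SOURCE is an assumed name.
import os
import subprocess

from huggingface_hub import snapshot_download

model_source = os.environ.get("MODEL_SOURCE", "")  # assumed env var name

if model_source.startswith("gs://"):
    # Google Cloud Storage path: copy the model artifacts locally.
    subprocess.run(["gsutil", "-m", "cp", "-r", model_source, "/models"], check=True)
else:
    # Otherwise treat the value as a Hugging Face repo id and download it.
    snapshot_download(repo_id=model_source, local_dir="/models")

# ...after which the script starts the vLLM server against the downloaded weights.
```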
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a `.gemini/` folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.
Tagging tech writer for review as well. @ktonthat please help with the review.
@RajeshThallam PTAL.
Code Review
This pull request introduces a set of notebooks and Dockerfiles to demonstrate serving open models on Vertex AI using vLLM. The changes are comprehensive, covering CPU, GPU, and TPU deployments. I've found several critical issues related to incorrect paths, use of deprecated parameters, and missing resource cleanup in the notebooks, which will cause errors and could lead to unexpected costs. The Dockerfiles also have issues that will prevent them from building successfully. I've provided specific suggestions to address these problems.
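On the resource-cleanup point: Vertex AI endpoints and deployed models continue to bill until they are explicitly torn down. A minimal cleanup cell, assuming `endpoint` and `model` objects from the google-cloud-aiplatform SDK (not code from this PR), could look like:

```python
# Tear down billable Vertex AI resources once experimentation is done.
endpoint.undeploy_all()  # remove all deployed models from the endpoint
endpoint.delete()        # delete the endpoint itself
model.delete()           # delete the uploaded Model resource
```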
"outputs": [], | ||
"source": [ | ||
"DOCKER_URI = f\"{LOCATION}-docker.pkg.dev/{PROJECT_ID}/{DOCKER_REPOSITORY}/vllm-gcp-tpu\"\n", | ||
"! cd docker && docker build -f Dockerfile.tpu -t {DOCKER_URI} ." |
The `docker build` command will fail due to an incorrect path. The notebook is in the `colabs` directory, while the Dockerfile is in the `docker` directory at the same level. The `cd` command should navigate up one level and then into `docker`.

```
! cd ../docker && docker build -f Dockerfile.tpu -t {DOCKER_URI} .
```
```
WORKDIR /workspace

# Download vLLM source code and apply Vertex AI Patch
RUN git clone https://github.com/vllm-project/vllm.git
```
The `COPY` commands on the following lines will fail because the destination directory `/workspace/vllm/vertexai/` does not exist after cloning the `vllm` repository. You need to create this directory before copying files into it.

```
RUN git clone https://github.com/vllm-project/vllm.git && mkdir -p /workspace/vllm/vertexai
```
```
WORKDIR /workspace

# Download vLLM source code and apply Vertex AI Patch
RUN git clone https://github.com/vllm-project/vllm.git
```
The `COPY` commands on the following lines will fail because the destination directory `/workspace/vllm/vertexai/` does not exist after cloning the `vllm` repository. You need to create this directory before copying files into it.

```
RUN git clone https://github.com/vllm-project/vllm.git && mkdir -p /workspace/vllm/vertexai
```
"outputs": [], | ||
"source": [ | ||
"DOCKER_URI = f\"{LOCATION}-docker.pkg.dev/{PROJECT_ID}/{DOCKER_REPOSITORY}/vllm-gcp-cpu\"\n", | ||
"! cd docker && docker build -f Dockerfile.cpu -t {DOCKER_URI} ." |
The `docker build` command will fail due to an incorrect path. The notebook is in the `colabs` directory, while the Dockerfile is in the `docker` directory at the same level. The `cd` command should navigate up one level and then into `docker`.

```
! cd ../docker && docker build -f Dockerfile.cpu -t {DOCKER_URI} .
```
"outputs": [], | ||
"source": [ | ||
"DOCKER_URI = f\"{LOCATION}-docker.pkg.dev/{PROJECT_ID}/{DOCKER_REPOSITORY}/vllm-gcp-tpu\"\n", | ||
"! cd docker && docker build -f /Dockerfile.tpu -t {DOCKER_URI} ." |
The `docker build` command has two issues that will cause it to fail:
- The path to the Dockerfile, `-f /Dockerfile.tpu`, is an absolute path. It should be relative: `Dockerfile.tpu`.
- The `cd docker` command will fail because the notebook is in the `colabs` directory. The path should be `../docker`.

```
! cd ../docker && docker build -f Dockerfile.tpu -t {DOCKER_URI} .
```
" \"--max-model-len=2048\",\n", | ||
" \"--enable-prefix-caching\",\n", | ||
" f\"--tensor-parallel-size={machine_type[-2]}\",\n", |
Deriving the `tensor-parallel-size` from the machine type string using `machine_type[-2]` is clever but brittle and hard to read. A more robust and explicit method would improve maintainability. Consider parsing the string more clearly and adding a comment.

```
# Extract the number of TPU cores from the machine type string (e.g., 'ct5lp-hightpu-1t' -> '1')
tpu_cores = machine_type.split('-')[-1][0]
f"--tensor-parallel-size={tpu_cores}",
```
&& pip install -v -r requirements/cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu | ||
|
||
# Install vLLM | ||
RUN VLLM_TARGET_DEVICE="cpu" python setup.py install |
" \"--max-model-len=2048\",\n", | ||
" \"--enable-prefix-caching\",\n", | ||
" f\"--tensor-parallel-size={machine_type[-2]}\",\n", |
Deriving the `tensor-parallel-size` from the machine type string using `machine_type[-2]` is clever but brittle and hard to read. A more robust and explicit method would improve maintainability. Consider parsing the string more clearly and adding a comment.

```
# Extract the number of TPU cores from the machine type string (e.g., 'ct5lp-hightpu-1t' -> '1')
tpu_cores = machine_type.split('-')[-1][0]
f"--tensor-parallel-size={tpu_cores}",
```
"model_id = \"meta-llama/Llama-3.2-3B\" # @param {type:\"string\"}\n", | ||
"machine_type = \"g2-standard-8\" # @param {type:\"string\"}\n", | ||
"accelerator_type = \"NVIDIA_L4\" # @param {type:\"string\"}\n", | ||
"accelerator_count = 1 # @param {type:\"string\"}" |
```
The repository has customization required for serving open models on Vertex AI using [vLLM](https://github.com/vllm-project/vllm.git).

## Using TPU
This [colab notebook](colabs/vertexai_serving_vllm_tpu_llama3_2_3B.ipynb) shows how Llama 3.2 3B model can be deployed (downloaded from Hugging Face) to Vertex AI Endpoint using this repository on TPUs.
```
There are two trailing spaces at the end of this line. Please remove them for better formatting.

```
This [colab notebook](colabs/vertexai_serving_vllm_tpu_llama3_2_3B.ipynb) shows how Llama 3.2 3B model can be deployed (downloaded from Hugging Face) to Vertex AI Endpoint using this repository on TPUs.
```
REQUIRED: Add a summary of your PR here, typically including why the change is needed and what was changed. Include any design alternatives for discussion purposes.
This pull request contains four Colab notebooks that demonstrate how an open large language model (e.g., Llama 3.2) can be deployed on Vertex AI Endpoints via custom containers (vLLM) on TPUs, GPUs, and CPUs. It also includes the Git patch file that adapts open-source vLLM for Vertex AI, as well as the Dockerfiles for the three platforms (TPU, GPU, and CPU).
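Once deployed, the endpoint can be exercised with the standard Vertex AI prediction call; a minimal sketch, assuming an `endpoint` object from the google-cloud-aiplatform SDK and the instance schema implied by the patch (not code from this PR):

```python
# Send one prompt through the Vertex AI prediction envelope.
response = endpoint.predict(instances=[{"prompt": "What is Vertex AI?"}])
print(response.predictions[0])
```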
REQUIRED: Fill out the below checklists or remove if irrelevant

For Official Notebooks under the notebooks/official folder, follow this mandatory checklist:
- An entry is added to the README under the `Official Notebooks` section, pointing to the author or the author's team.

For Community Notebooks under the notebooks/community folder:
- An entry is added to the README under the `Community Notebooks` section, pointing to the author or the author's team.

For Community Content under the community-content folder:
- The `Content Directory Name` is descriptive, informative, and includes some of the key products and attributes of your content, so that it is differentiable from other content.
- An entry is added to the README under the `Community Content` section, pointing to the author or the author's team.