xDiT is an open-source library that accelerates inference for Diffusion Transformer (DiT) models by using parallelism and optimization techniques. These techniques enable a scalable multi-GPU setup for demanding workloads. This page demonstrates how to deploy DiT models by using xDiT and Cloud GPUs on Vertex AI.
For more information about xDiT, see the xDiT GitHub project.
Benefits
The following list describes the key benefits of using xDiT to serve DiT models on Vertex AI:
- Up to three times faster generation: Generate high-resolution images and videos in a fraction of the time compared to other serving solutions.
- Scalable multi-GPU support: Efficiently distribute workloads across multiple GPUs for optimal performance.
- Hybrid parallelism: xDiT supports various parallel processing approaches, such as unified sequence parallelism, PipeFusion, CFG parallelism, and data parallelism. These methods can be combined in a unique recipe to optimize performance.
- Optimized single-GPU performance: xDiT provides faster inference even on a single GPU.
- GPU acceleration: xDiT incorporates several kernel acceleration methods and uses techniques from DiTFastAttn to speed up inference on a single GPU.
- Easy deployment: Get started quickly with one-click deployment or Colab Enterprise notebooks in Vertex AI Model Garden.
Supported models
xDiT is available for certain DiT model architectures in Vertex AI Model Garden, such as Flux.1 Schnell, CogVideoX-2b, and Wan2.1 text-to-video model variants. To check whether a DiT model supports xDiT, view its model card in Model Garden.
Hybrid parallelism for multi-GPU performance
xDiT uses a combination of parallelism techniques to maximize performance on multi-GPU setups. These techniques work together to distribute the workload and optimize resource utilization:
- Unified sequence parallelism: This technique splits the input data (such as splitting an image into patches) across multiple GPUs, reducing the memory usage and improving scalability.
- PipeFusion: PipeFusion divides the DiT model into stages and assigns each stage to a different GPU, enabling parallel processing of different parts of the model.
- CFG parallelism: This technique optimizes models that use classifier-free guidance, a common method for controlling the style and content of generated images. It parallelizes the computation of the conditional and unconditional branches, leading to faster inference.
- Data parallelism: This method replicates the entire model on each GPU, with each GPU processing a different batch of input data, increasing the overall throughput of the system.
For more information about performance improvements, see xDiT's report on Flux.1 Schnell or CogVideoX-2b. Google was able to reproduce these results on Vertex AI Model Garden.
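For example, on an eight-GPU deployment you could combine these techniques through the serving container's environment variables. The following sketch uses illustrative values and assumes xDiT's convention that the parallel degrees, with CFG parallelism contributing a fixed factor of 2, multiply out to the GPU count; the best combination depends on your model and hardware.
# Hypothetical hybrid-parallelism configuration for an 8-GPU deployment:
# 2 (Ulysses) x 2 (ring) x 2 (CFG) = 8 GPUs in total.
serving_env = {
    "MODEL_ID": "black-forest-labs/FLUX.1-schnell",
    "TASK": "text-to-image",
    "N_GPUS": "8",
    "ULYSSES_DEGREE": "2",       # all-to-all branch of unified sequence parallelism
    "RING_DEGREE": "2",          # ring-attention branch of unified sequence parallelism
    "USE_CFG_PARALLEL": "true",  # run conditional and unconditional branches in parallel
    "PIPEFUSION_PARALLEL_DEGREE": "1",
}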
Single GPU acceleration
The xDiT library provides benefits for single-GPU serving by using torch.compile and onediff to enhance runtime speed. These techniques can also be used in conjunction with hybrid parallelism.
xDiT also offers an efficient attention computation technique, called DiTFastAttn, to address DiT's computational bottleneck. For now, this technique is only available for single-GPU setups or in conjunction with data parallelism.
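As a minimal sketch, a single-GPU deployment might enable torch.compile through the container's environment variables, which are described later on this page; USE_ONEDIFF is the corresponding flag for onediff.
# Minimal sketch of a single-GPU configuration with torch.compile enabled.
serving_env = {
    "MODEL_ID": "black-forest-labs/FLUX.1-schnell",
    "TASK": "text-to-image",
    "N_GPUS": "1",
    "USE_TORCH_COMPILE": "true",
}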
Get started in Model Garden
The xDiT optimized Cloud GPU serving container is provided within Vertex AI Model Garden. For supported models, deployments use this container when you use one-click deployments or the Colab Enterprise notebook examples.
The following examples use the Flux.1-schnell model to demonstrate how to deploy a DiT model on an xDiT container.
Use one-click deployment
You can deploy a custom Vertex AI endpoint with the xDiT container by using a model card.
1. Navigate to the model card page and click Deploy.
2. For the model variation that you want to use, select a machine type for your deployment.
3. Click Deploy to begin the deployment process. You receive two email notifications: one when the model is uploaded and another when the endpoint is ready.
Use the Colab Enterprise notebook
For flexibility and customization, use the Colab Enterprise notebook examples to deploy a Vertex AI endpoint with the xDiT container by using the Vertex AI SDK for Python.
1. Navigate to the model card page and click Open notebook.
2. Select the Vertex Serving notebook. The notebook opens in Colab Enterprise.
3. Run through the notebook to deploy a model by using the xDiT container and to send prediction requests to the endpoint. The code snippet for the deployment is as follows:
XDIT_DOCKER_URI = "us-docker.pkg.dev/deeplearning-platform-release/vertex-model-garden/xdit-serve.cu125.0-2.ubuntu2204.py310"

serving_env = {
    "MODEL_ID": "black-forest-labs/FLUX.1-schnell",
    "TASK": "text-to-image",
    "DEPLOY_SOURCE": "notebook",
    "N_GPUS": "2",
    "ULYSSES_DEGREE": "1",
    "RING_DEGREE": "2",
    "PIPEFUSION_PARALLEL_DEGREE": "1",
    "USE_TORCH_COMPILE": "true",
}

model = aiplatform.Model.upload(
    display_name=model_name,
    serving_container_image_uri=XDIT_DOCKER_URI,
    serving_container_ports=[7080],
    serving_container_predict_route="/predict",
    serving_container_health_route="/health",
    serving_container_environment_variables=serving_env,
)

model.deploy(
    endpoint=endpoint,
    machine_type="a3-highgpu-2g",
    accelerator_type="NVIDIA_H100_80GB",
    accelerator_count=2,
    deploy_request_timeout=1800,
    service_account=SERVICE_ACCOUNT,
)
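After the deployment completes, you can send requests by using the endpoint object from the snippet above. The following is a minimal sketch; the instance schema (a text field carrying the prompt) is an assumption for illustration, so check the model's notebook for the exact request format.
# Hypothetical prediction request; the exact instance schema depends on
# the model and task.
instances = [{"text": "A photo of a steaming cup of coffee on a wooden table"}]
response = endpoint.predict(instances=instances)
print(response.predictions)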
Environment variables
- MODEL_ID: Specifies the ID of the DiT model to deploy, such as black-forest-labs/FLUX.1-schnell.
- TASK: Defines the task the model performs, such as text-to-image-flux-xdit.
- N_GPUS: Sets the number of GPUs to use for inference.
- ULYSSES_DEGREE, RING_DEGREE, PIPEFUSION_PARALLEL_DEGREE: Control the parallelism techniques that xDiT uses. For details about each argument, see xDiT arguments.
- USE_TORCH_COMPILE: Enables single-GPU acceleration by using torch.compile.
xDiT arguments
xDiT offers a range of server arguments that can be configured to optimize performance for specific use cases. These arguments are set as environment variables during deployment. The following lists describe the key arguments that you might need to configure:
Model Configuration
- MODEL_ID (string): Specifies the model identifier to load. This should match the model name in your registry or path.
Runtime Optimization Arguments
- N_GPUS (integer): Specifies the number of GPUs to use for inference. The default value is 1.
- WARMUP_STEPS (integer): Number of warmup steps required before inference begins. This is particularly important when PipeFusion is enabled to ensure stable performance. The default value is 1.
- USE_PARALLEL_VAE (boolean): Enables efficient processing of high-resolution images (greater than 2048 pixels) by parallelizing the VAE component across devices. This prevents out-of-memory issues for large images. The default value is false.
- USE_TORCH_COMPILE (boolean): Enables single-GPU acceleration through torch.compile, providing kernel-level optimizations for improved performance. The default value is false.
- USE_ONEDIFF (boolean): Enables OneDiff compilation acceleration technology to optimize GPU kernel execution speed. The default value is false.
Data Parallel Arguments
- DATA_PARALLEL_DEGREE (integer): Sets the degree of data parallelism. Leave empty to disable, or set to the selected parallel degree.
- USE_CFG_PARALLEL (boolean): Enables parallel computation for classifier-free guidance (CFG), also known as Split Batch. When enabled, the constant parallelism degree is 2. Set to true when using CFG to control output style and content. The default value is false.
Sequence Parallel Arguments (USP - Unified Sequence Parallelism)
- ULYSSES_DEGREE (integer): Sets the Ulysses degree for the unified sequence parallel approach, which combines DeepSpeed-Ulysses and Ring-Attention. This controls the all-to-all communication pattern. Leave empty to use the default.
- RING_DEGREE (integer): Sets the ring degree for peer-to-peer communication in sequence parallelism. Works in conjunction with ULYSSES_DEGREE to form the 2D process mesh. Leave empty to use the default. For how these degrees compose, see the sketch after this list.
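To make the relationship between these degrees concrete, the following hypothetical helper (not part of the container or the xDiT API) checks the constraint that the configured degrees multiply out to N_GPUS, with CFG parallelism contributing its fixed factor of 2:
def check_parallel_degrees(
    n_gpus: int,
    ulysses_degree: int = 1,
    ring_degree: int = 1,
    pipefusion_degree: int = 1,
    use_cfg_parallel: bool = False,
) -> None:
    # The product of the parallel degrees (CFG contributing a fixed
    # factor of 2) should equal the total number of GPUs.
    product = ulysses_degree * ring_degree * pipefusion_degree
    if use_cfg_parallel:
        product *= 2
    if product != n_gpus:
        raise ValueError(
            f"Parallel degrees multiply to {product}, but N_GPUS is {n_gpus}."
        )

# Matches the deployment snippet shown earlier:
# 1 (Ulysses) * 2 (ring) * 1 (PipeFusion) = 2 GPUs.
check_parallel_degrees(n_gpus=2, ulysses_degree=1, ring_degree=2)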
Tensor Parallel Arguments
- TENSOR_PARALLEL_DEGREE (integer): Sets the degree of tensor parallelism, which splits model parameters across devices along feature dimensions to reduce per-device memory costs. Leave empty to disable.
- SPLIT_SCHEME (string): Defines how to split the model tensors across devices (for example, by attention heads or hidden dimensions). Leave empty for the default splitting scheme.
Ray Distributed Arguments
- USE_RAY (boolean): Enables the Ray distributed execution framework for scaling computations across multiple nodes. The default value is false.
- RAY_WORLD_SIZE (integer): Total number of processes in the Ray cluster. The default value is 1.
- VAE_PARALLEL_SIZE (integer): Number of processes dedicated to VAE parallel processing when using Ray. The default value is 0.
- DIT_PARALLEL_SIZE (integer): Number of processes dedicated to DiT backbone parallel processing when using Ray. The default value is 0.
PipeFusion Parallel Arguments
- PIPEFUSION_PARALLEL_DEGREE (integer): Sets the degree of parallelism for PipeFusion, a sequence-level pipeline parallelism that takes advantage of the input temporal redundancy characteristics of diffusion models. Higher values increase parallelism but require more memory. The default value is 1.
- NUM_PIPELINE_PATCH (integer): Number of patches to split the sequence into for pipeline processing. Leave empty for automatic determination.
- ATTN_LAYER_NUM_FOR_PP (string): Specifies which attention layers to use for pipeline parallelism. Can be comma-separated (for example, "10,9") or space-separated (for example, "10 9"). Leave empty to use all layers.
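For example (a sketch only, with illustrative values), a two-GPU PipeFusion deployment might look like the following; WARMUP_STEPS matters here because PipeFusion benefits from warmup, as noted under Runtime Optimization Arguments.
# Hypothetical PipeFusion configuration across 2 GPUs.
serving_env = {
    "MODEL_ID": "black-forest-labs/FLUX.1-schnell",
    "N_GPUS": "2",
    "PIPEFUSION_PARALLEL_DEGREE": "2",  # pipeline the model stages across both GPUs
    "WARMUP_STEPS": "1",                # warmup before stable pipelined inference
}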
Memory Optimization Arguments
- ENABLE_MODEL_CPU_OFFLOAD (boolean): Offloads model weights to CPU memory when not in use, reducing GPU memory usage at the cost of increased latency. The default value is false.
- ENABLE_SEQUENTIAL_CPU_OFFLOAD (boolean): Sequentially offloads model layers to the CPU during the forward pass, enabling inference for models larger than GPU memory. The default value is false.
- ENABLE_TILING (boolean): Reduces GPU memory usage by decoding the VAE component one tile at a time. This argument is useful for larger images or videos and to prevent out-of-memory errors. The default value is false.
- ENABLE_SLICING (boolean): Reduces GPU memory usage by splitting the input tensor into slices for VAE decoding. The default value is false.
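As an illustrative sketch, a memory-constrained video deployment might combine several of these flags. The model ID and task value here are assumptions for the example; which flags you need depends on model size and output resolution.
# Hypothetical memory-saving configuration for a large video model.
serving_env = {
    "MODEL_ID": "THUDM/CogVideoX-2b",    # assumed model ID for illustration
    "TASK": "text-to-video",             # assumed task value for illustration
    "N_GPUS": "1",
    "ENABLE_TILING": "true",             # decode the VAE one tile at a time
    "ENABLE_SLICING": "true",            # split the input tensor for VAE decoding
    "ENABLE_MODEL_CPU_OFFLOAD": "true",  # trade latency for lower GPU memory use
}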
DiTFastAttn Arguments (Attention Optimization)
- USE_FAST_ATTN (boolean): Enables DiTFastAttn acceleration for single-GPU inference, using Input Temporal Reduction to reduce computational complexity. The default value is false.
- N_CALIB (integer): Number of calibration samples for DiTFastAttn optimization. The default value is 8.
- THRESHOLD (float): Similarity threshold for Temporal Similarity Reduction in DiTFastAttn. The default value is 0.5.
- WINDOW_SIZE (integer): Window size for Window Attention with Residual Caching to reduce spatial redundancy. The default value is 64.
- COCO_PATH (string): Path to the COCO dataset for DiTFastAttn calibration. Required when USE_FAST_ATTN is true. Leave empty if not used.
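The following is a sketch of a DiTFastAttn configuration for single-GPU serving; the COCO dataset path is a placeholder that you would replace with your own, and the remaining values restate the documented defaults.
# Hypothetical DiTFastAttn configuration; COCO_PATH must point to a real
# COCO dataset copy because it is required when USE_FAST_ATTN is true.
serving_env = {
    "MODEL_ID": "black-forest-labs/FLUX.1-schnell",
    "N_GPUS": "1",
    "USE_FAST_ATTN": "true",
    "N_CALIB": "8",        # calibration samples (default)
    "THRESHOLD": "0.5",    # temporal-similarity threshold (default)
    "WINDOW_SIZE": "64",   # window attention with residual caching (default)
    "COCO_PATH": "/path/to/coco",  # placeholder path
}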
Cache Optimization Arguments
- USE_CACHE (boolean): Enables general caching mechanisms to reduce redundant computations. The default value is false.
- USE_TEACACHE (boolean): Enables the TeaCache optimization method for caching intermediate results. The default value is false.
- USE_FBCACHE (boolean): Enables the First-Block-Cache optimization method. The default value is false.
Precision Optimization Arguments
- USE_FP8_T5_ENCODER (boolean): Enables FP8 (8-bit floating point) precision for the T5 text encoder, reducing memory usage and potentially improving throughput with minimal quality impact. The default value is false.
For a full list of arguments, see the xFuserArgs class in the xDiT GitHub project.