Integrated with features like continuous batching, paged attention, chunked prefill, prefix caching, token throttling, pipeline parallelism, expert parallelism and tensor parallelism, gLLM provides the basic functionality (offline/online inference and interactive chat) needed to deploy distributed inference for LLMs supported on Hugging Face. gLLM delivers offline/online inference speed on par with or better than mainstream inference engines, with a minimal (~6k LOC) code base. You can also treat gLLM as an LLM inference playground for experiments or academic research.
Latest News 🔥
- [2025/07/12]: FP8 quantization for Qwen3/2.5 is supported 🎉
- [2025/06/27]: gLLM is accepted by SC'25. Congratulations 🥰
- [2025/06/21]: Expert parallelism is integrated 😍
- [2025/06/14]: Tensor parallelism is now integrated, allowing joint deployment with pipeline parallelism 😎
- [2025/05/05]: MoE architecture is supported. Try Qwen2/3 MoE models 🤩
- [2025/04/29]: Qwen3 day 1 support. Come and try Qwen3 🎉
- [2025/04/27]: gLLM is open sourced 🌏
- [2025/04/27]: We support multi-node deployments. You can serve your model across different machines 😊
- [2025/04/21]: We release our paper on arXiv:2504.14775 🥳
- [2025/03/15]: Chunked prefill has been integrated. You can input any length of text you want 🤗
- [2025/03/01]: Pipeline parallelism has been integrated. You can run any size of model you want 😆
- [2025/02/27]: We applied numerous optimizations that significantly lower CPU overhead 👏
pip install torch==2.5.1
pip install -v -e .
python examples/chat.py --model $MODEL_PATH
python examples/batch_inference.py --model $MODEL \
--share-gpt-path $SHARE_GPT_PATH --num-prompt $NUM_PROMPT \
--gpu-memory-util $GPU_MEMORY_UTIL
python benchmarks/benchmark_throughput.py --model $MODEL \
--dataset $SHAREGPT_PATH --num-prompt $NUM_PROMPT --backend gllm \
--gpu-memory-util $GPU_MEMORY_UTIL
# To see the description of args, run 'python -m gllm.entrypoints.api_server -h'
python -m gllm.entrypoints.api_server --port $PORT --model-path $MODEL_PATH \
--enable-prefix-caching --pp $PP --tp $TP
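For instance, a sketch of a single-node launch with 2 pipeline stages and 2 tensor-parallel ranks (which would typically occupy 2 × 2 = 4 GPUs) is shown below; the port and parallel degrees are illustrative values, not defaults, and $MODEL_PATH is your model checkpoint.
# Illustrative single-node launch: PP=2, TP=2
python -m gllm.entrypoints.api_server --port 8000 --model-path $MODEL_PATH \
    --enable-prefix-caching --pp 2 --tp 2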
Experimental feature
gLLM can be launched in three modes: (1) normal, used for a single node with multiple GPUs; (2) master, used for multi-node deployment; (3) slave, used for multi-node deployment.
To launch a master gLLM instance:
python -m gllm.entrypoints.api_server --port $PORT --master-port $MASTER_PORT \
--model-path $MODEL_PATH --pp $PP --launch-mode master --worker-ranks $RANKS
To launch a slave gLLM instance:
python -m gllm.entrypoints.api_server --host $HOST \
--master-addr $MASTER_ADDR --master-port $MASTER_PORT \
--model-path $MODEL_PATH --pp $PP --launch-mode slave --worker-ranks $RANKS
There are a few things you need to take care of:
- Make sure $MASTER_PORT and $MASTER_ADDR in the slave instance match those in the master instance
- Make sure the slave instance can set up a connection to the master instance using $MASTER_ADDR
- Make sure the master instance can set up a connection to the slave instance using $HOST
- Make sure $PP matches $RANKS across the master and slave instances
  - For example, if we launch two gLLM instances with $PP set to 4 and $RANKS in the master set to 0,1, then $RANKS in the slave must be set to 2,3 (see the sketch after this list)
- Make sure the environment variables NCCL_SOCKET_IFNAME and NCCL_IB_DISABLE are set properly
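Putting the rank example above together, here is a sketch of launching the two instances. The 4-stage split across ranks 0,1 and 2,3 is the one from the example; all other values remain placeholders you fill in for your cluster.
# On the master node: ranks 0 and 1
python -m gllm.entrypoints.api_server --port $PORT --master-port $MASTER_PORT \
    --model-path $MODEL_PATH --pp 4 --launch-mode master --worker-ranks 0,1
# On the slave node: same $MASTER_ADDR/$MASTER_PORT, remaining ranks 2 and 3
python -m gllm.entrypoints.api_server --host $HOST \
    --master-addr $MASTER_ADDR --master-port $MASTER_PORT \
    --model-path $MODEL_PATH --pp 4 --launch-mode slave --worker-ranks 2,3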
# Launch server first
python examples/client.py --port $PORT
# Launch server first
python examples/chat_client.py --port $PORT
# Launch server first
python benchmarks/benchmark_serving.py --backend $BACKEND --model $MODEL \
--dataset-name $DATASET_NAME --dataset-path $DATASET_PATH \
--num-prompts $NUM_PROMPTS --port $PORT --trust-remote-code \
--request-rate $REQUEST_RATE
# Launch server first
python benchmarks/benchmark_prefix_serving.py \
--trust-remote-code --backend $BACKEND --dataset $SHAREGPT_PATH \
--model $MODEL --num-max-users $NUM_USERS \
--num-min-rounds $NUM_MIN_ROUNDS \
--num-max-rounds $NUM_MAX_ROUNDS \
--port $PORT
# Launch server first
python benchmarks/evaluate_MMLU_pro.py --model $MODEL
- Qwen Series: Qwen3, Qwen2.5, Qwen2
- Llama Series: Llama3.2, Llama3.1, Llama3, Llama2 and deepseek-coder
- Mixtral Series: Mixtral-8x7B, Mixtral-8x22B
- ChatGLM Series: Glm4 and Chatglm3
- FP8
- Support more models
- Support more quantization methods
@misc{guo2025gllmglobalbalancedpipeline,
title={gLLM: Global Balanced Pipeline Parallelism System for Distributed LLM Serving with Token Throttling},
author={Tianyu Guo and Xianwei Zhang and Jiangsu Du and Zhiguang Chen and Nong Xiao and Yutong Lu},
year={2025},
eprint={2504.14775},
archivePrefix={arXiv},
primaryClass={cs.DC},
url={https://arxiv.org/abs/2504.14775},
}
We studied the architectures of, and reused code from, these existing projects: vLLM, SGLang, and TD-Pipe.