max serve --model openai-community/gpt2 --custom-architectures ../max-gpt-2
- GPU support
- Paged KV Caching
- Flash Attention
GPU: NVIDIA RTX 5090
Input prompt: first paragraph of Lorem ipsum
Prompt processing: 3.7K tok/s
Token generation: 14.9 tok/s
Prompt processing: 30.7K tok/s
Token generation: 250.1 tok/s
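
The two pairs of figures above can be compared with a quick ratio. The sketch below is illustrative only: the notes do not say which configuration produced each pair, so the names `first_run` and `second_run` and the helper `parse_rate` are hypothetical labels introduced here, not part of MAX.

```python
def parse_rate(text: str) -> float:
    """Convert a string such as '3.7K tok/s' into tokens per second."""
    value = text.split()[0]          # drop the 'tok/s' unit
    if value[-1] in "Kk":            # 'K' suffix means thousands
        return float(value[:-1]) * 1_000
    return float(value)


# Figures copied from the results above; the run labels are assumptions.
first_run = {"prompt": "3.7K tok/s", "generation": "14.9 tok/s"}
second_run = {"prompt": "30.7K tok/s", "generation": "250.1 tok/s"}


def speedups(slow: dict, fast: dict) -> dict:
    """Return fast/slow throughput ratios for each phase."""
    return {phase: parse_rate(fast[phase]) / parse_rate(slow[phase]) for phase in slow}


ratios = speedups(first_run, second_run)
for phase, ratio in ratios.items():
    print(f"{phase}: {ratio:.1f}x")
```

With these numbers the second pair comes out roughly 8.3x faster for prompt processing and 16.8x faster for token generation than the first.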