Official PyTorch implementation for the paper Hogwild! Inference: Parallel LLM Generation via Concurrent Attention
Install the packages from requirements.txt:

```bash
pip install -r requirements.txt
```
To try the inference procedure described in the paper, run the Jupyter notebooks from the notebooks/ folder:

- Simple example with a minimal prompt: basic_example.ipynb
- Hogwild! Inference with the full prompt: full_example.ipynb
- Minimal Colab example with Llama-3.2 3B and very limited collaboration: colab_example.ipynb (see the sketch below)
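The notebooks load the base model through the standard Hugging Face transformers API. For reference, here is a minimal sketch of ordinary single-stream generation with the same Llama-3.2 3B model the Colab example uses; the Hogwild! notebooks add the parallel workers and the concurrent attention cache on top of this baseline. The checkpoint id, prompt, and generation settings below are illustrative assumptions, not the exact notebook configuration:

```python
# Minimal sketch: ordinary single-stream generation with Llama-3.2 3B.
# This is NOT the Hogwild! mechanism itself -- the notebooks extend this
# baseline with a shared, concurrently updated attention cache so that
# several workers generate in parallel. Checkpoint id and sampling
# settings here are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-3B-Instruct"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Solve step by step: what is 17 * 24?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```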
To use the fast inference kernels, go to the inference_lib folder and install the module:

```bash
cd inference_lib
pip install -e .  # ensure the nvcc CUDA compiler is in PATH, or export CUDACXX=/TODO/path/to/nvcc
```
You can test it with the notebook hogwild_with_fast_kernels.ipynb. The kernels were optimized for the L40 and similar GPUs.
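Building the kernels requires a working CUDA toolchain. The following is a small, hypothetical sanity check (not part of the repository) that verifies PyTorch can see a GPU and that nvcc is reachable before you run the install:

```python
# Illustrative pre-build sanity check, not part of the repository.
# Confirms that PyTorch sees a CUDA device and that nvcc is on PATH;
# the actual build requirements may be stricter.
import shutil
import torch

assert torch.cuda.is_available(), "No CUDA device visible to PyTorch"
print("GPU:", torch.cuda.get_device_name(0))
print("nvcc:", shutil.which("nvcc") or "not on PATH -- set CUDACXX before building")
```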
If you find this work useful, please consider citing:

```bibtex
@misc{rodionov2025hogwildinferenceparallelllm,
      title={Hogwild! Inference: Parallel LLM Generation via Concurrent Attention},
      author={Gleb Rodionov and Roman Garipov and Alina Shutova and George Yakushev and Vage Egiazarian and Anton Sinitsin and Denis Kuznedelev and Dan Alistarh},
      year={2025},
      eprint={2504.06261},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2504.06261},
}
```