Tsinghua University, University of Science and Technology Beijing
chenyingfa1999@gmail.com, wuyutong_yuna@163.com
This repository contains the code and models used in the EMNLP 2025 paper Cost-Optimal Grouped-Query Attention for Long-Context Modeling.
The main research question of the paper: for a given compute and memory budget at long context lengths, which combination of model size and GQA configuration (i.e., number of query and key-value heads) is cost-optimal?
To avoid sweeping all combinations of model size and GQA configuration, we present a three-step search procedure, empirically validated on models of up to 1.2B parameters. Our results show that the widely used Llama-3 GQA configuration (Grattafiori et al., 2024) is highly suboptimal at a context length of 128K (the maximum supported by Llama-3).
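For readers unfamiliar with grouped-query attention, the sketch below (a minimal NumPy illustration, not the repository's implementation) shows the core idea: each group of query heads shares a single key-value head, so the KV cache shrinks by the group size — the quantity this paper trades off against model size.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gqa_attention(q, k, v):
    """Grouped-query attention over a single sequence.

    q:    (n_q_heads, seq_len, head_dim)  query heads
    k, v: (n_kv_heads, seq_len, head_dim) key/value heads,
          where n_kv_heads divides n_q_heads.

    Each group of n_q_heads // n_kv_heads query heads shares one
    KV head, so the KV cache is smaller by that factor.
    """
    n_q, n_kv = q.shape[0], k.shape[0]
    assert n_q % n_kv == 0, "query heads must be divisible into KV groups"
    group_size = n_q // n_kv

    # Broadcast each KV head to all query heads in its group.
    k = np.repeat(k, group_size, axis=0)
    v = np.repeat(v, group_size, axis=0)

    d = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)  # (n_q, seq, seq)
    return softmax(scores) @ v                      # (n_q, seq, head_dim)
```

With `n_kv_heads == n_q_heads` this reduces to standard multi-head attention, and with `n_kv_heads == 1` to multi-query attention; intermediate values give the GQA configurations studied in the paper.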
Please refer to the README.md file inside the `src` folder.
@inproceedings{chen2025cost-optimal-gqa,
  title={Cost-Optimal Grouped-Query Attention for Long-Context Modeling},
  author={Yingfa Chen and Yutong Wu and Chenyang Song and Zhen Leng Thai and Xingyu Shen and Xu Han and Zhiyuan Liu and Maosong Sun},
  year={2025},
  booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
}