
thunlp/cost-optimal-gqa


Cost-Optimal Grouped-Query Attention for Long-Context Modeling


Yingfa Chen*, Yutong Wu*, Chenyang Song, Zhen Leng Thai, Xingyu Shen, Xu Han, Zhiyuan Liu, Maosong Sun
Tsinghua University, University of Science and Technology Beijing
chenyingfa1999@gmail.com, wuyutong_yuna@163.com

This repository contains the code and models used in the EMNLP 2025 paper Cost-Optimal Grouped-Query Attention for Long-Context Modeling.

Main Results

The main research question of the paper:

Given an expected inference context length and target loss, how can GQA be configured to minimize inference costs while achieving that loss?

To avoid sweeping all combinations of model sizes and GQA configurations, we present a three-step search procedure. Our approach is empirically validated on models with up to 1.2B parameters. The results show that the widely used Llama-3 GQA configuration (Grattafiori et al., 2024) is highly suboptimal at a context length of 128K (the maximum context length supported by Llama-3).
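To give intuition for why the GQA configuration matters at long context lengths, here is a minimal sketch (not code from this repository) of how the number of KV heads drives KV-cache memory, the dominant inference cost at 128K context. The function name and the model shapes below are illustrative; the MHA/GQA head counts loosely follow a Llama-3-8B-like layout (32 layers, head dimension 128, 32 query heads, 8 KV heads).

```python
# Hypothetical sketch: KV-cache size for a given attention configuration.
# Shows why fewer KV heads (GQA) sharply cut long-context inference memory.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len,
                   bytes_per_elem=2):
    """Memory for the K and V caches across all layers, one sequence.

    The leading factor of 2 counts one K tensor and one V tensor per layer;
    bytes_per_elem=2 assumes fp16/bf16 storage.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

# Illustrative Llama-3-8B-like shapes at a 128K context length.
mha = kv_cache_bytes(32, 32, 128, 128_000)  # MHA: one KV head per query head
gqa = kv_cache_bytes(32, 8, 128, 128_000)   # GQA: 8 KV heads shared by 32 query heads

print(f"MHA 128K cache: {mha / 2**30:.1f} GiB")  # → 62.5 GiB
print(f"GQA 128K cache: {gqa / 2**30:.1f} GiB")  # → 15.6 GiB
```

The cache scales linearly in both the KV-head count and the context length, which is why a configuration tuned for short contexts can be far from cost-optimal at 128K, the regime the paper's search procedure targets.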

Figure 1 of the paper

Figure 2 of the paper

How to Run the Code

Please refer to the README.md file inside the src folder.

How to Cite

@inproceedings{chen2025cost-optimal-gqa,
    title={Cost-Optimal Grouped-Query Attention for Long-Context Modeling}, 
    author={Yingfa Chen and Yutong Wu and Chenyang Song and Zhen Leng Thai and Xingyu Shen and Xu Han and Zhiyuan Liu and Maosong Sun},
    year={2025},
    booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
}

About

The code for the paper "Cost-Optimal Grouped-Query Attention for Long-Context Modeling"
