
thunlp/cost-optimal-gqa


Cost-Optimal Grouped-Query Attention for Long-Context Modeling


Yingfa Chen*, Yutong Wu*, Chenyang Song, Zhen Leng Thai, Xingyu Shen, Xu Han, Zhiyuan Liu, Maosong Sun
Tsinghua University, University of Science and Technology Beijing
chenyingfa1999@gmail.com, wuyutong_yuna@163.com

This repository contains the code and models used in the EMNLP 2025 paper Cost-Optimal Grouped-Query Attention for Long-Context Modeling.

Main Results

The main research question of the paper:

Given an expected inference context length and target loss, how can GQA be configured to minimize inference costs while achieving that loss?

To avoid sweeping all combinations of model sizes and GQA configurations, we present a three-step search procedure. Our approach is empirically validated on models with up to 1.2B parameters. The results show that the widely used Llama-3 GQA configuration (Grattafiori et al., 2024) is highly suboptimal at a context length of 128K (the maximum context length supported by Llama-3).
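To give intuition for why the GQA configuration matters at long context lengths, here is a minimal sketch (not code from this repository) of how the number of KV heads drives KV-cache memory, the dominant inference cost at 128K context. The function name and the model shapes below are illustrative; the MHA/GQA head counts loosely follow a Llama-3-8B-like layout (32 layers, head dimension 128, 32 query heads, 8 KV heads).

```python
# Hypothetical sketch: KV-cache size for a given attention configuration.
# Shows why fewer KV heads (GQA) sharply cut long-context inference memory.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len,
                   bytes_per_elem=2):
    """Memory for the K and V caches across all layers, one sequence.

    The leading factor of 2 counts one K tensor and one V tensor per layer;
    bytes_per_elem=2 assumes fp16/bf16 storage.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

# Illustrative Llama-3-8B-like shapes at a 128K context length.
mha = kv_cache_bytes(32, 32, 128, 128_000)  # MHA: one KV head per query head
gqa = kv_cache_bytes(32, 8, 128, 128_000)   # GQA: 8 KV heads shared by 32 query heads

print(f"MHA 128K cache: {mha / 2**30:.1f} GiB")  # → 62.5 GiB
print(f"GQA 128K cache: {gqa / 2**30:.1f} GiB")  # → 15.6 GiB
```

The cache scales linearly in both the KV-head count and the context length, which is why a configuration tuned for short contexts can be far from cost-optimal at 128K, the regime the paper's search procedure targets.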

Figure 1 of the paper

Figure 2 of the paper

How to Run the Code

Please refer to the README.md file inside the src folder.

How to Cite

@inproceedings{chen2025cost-optimal-gqa,
    title={Cost-Optimal Grouped-Query Attention for Long-Context Modeling}, 
    author={Yingfa Chen and Yutong Wu and Chenyang Song and Zhen Leng Thai and Xingyu Shen and Xu Han and Zhiyuan Liu and Maosong Sun},
    year={2025},
    booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
}

About

The code for the paper "Cost-Optimal Grouped-Query Attention for Long-Context Modeling"
