RefreshKV: Updating Small KV Cache During Long-form Generation

Xu, Fangyuan; Goyal, Tanya; Choi, Eunsol

arXiv:2411.05787 (cs)

[Submitted on 8 Nov 2024 (v1), last revised 3 Mar 2025 (this version, v2)]

Title: RefreshKV: Updating Small KV Cache During Long-form Generation

Authors: Fangyuan Xu, Tanya Goyal, Eunsol Choi

Abstract: Generating long sequences of tokens given a long-context input is a very compute-intensive inference scenario for large language models (LLMs). One prominent inference speed-up approach is to construct a smaller key-value (KV) cache, relieving LLMs from computing attention over a long sequence of tokens. While such methods work well to generate short sequences, their performance degrades rapidly for long-form generation. Most KV compression happens once, prematurely removing tokens that can be useful later in the generation. We propose a new inference method, RefreshKV, that flexibly alternates between full context attention and attention over a subset of input tokens during generation. After each full attention step, we update the smaller KV cache based on the attention pattern over the entire input. Applying our method to off-the-shelf LLMs achieves comparable speedup to eviction-based methods while improving performance for various long-form generation tasks. Lastly, we show that continued pretraining with our inference setting brings further gains in performance.
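The alternation described in the abstract can be illustrated with a toy, single-head sketch. This is not the authors' implementation: the fixed `stride` schedule, the top-k selection rule, the initial "most recent tokens" cache, and all function names are assumptions made for illustration, and the sketch does not append keys and values for newly generated tokens. The abstract only states that full and partial attention steps alternate and that the small cache is rebuilt from the attention pattern of each full step.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(q, K, V):
    """Single-head scaled dot-product attention for one query vector."""
    weights = softmax(K @ q / np.sqrt(q.shape[0]))
    return weights @ V, weights

def decode_with_refresh(queries, K, V, small_size, stride):
    """Alternate full-context and small-cache attention steps.

    queries    : (T, d) array, one query per generation step
                 (hypothetical stand-in for a real decoder)
    K, V       : (n, d) arrays, the full KV cache of the input
    small_size : number of input tokens kept in the small cache
    stride     : run a full attention step every `stride` steps
                 (assumed fixed schedule)
    """
    # Assumption: start with the most recent input tokens.
    keep = np.arange(K.shape[0])[-small_size:]
    outputs, schedule = [], []
    for t, q in enumerate(queries):
        if t % stride == 0:
            # Full step: attend over the entire input, then refresh the
            # small cache with the highest-attention tokens (assumed rule).
            out, weights = attend(q, K, V)
            keep = np.sort(np.argsort(weights)[-small_size:])
            schedule.append("full")
        else:
            # Partial step: attend only over the current small cache.
            out, _ = attend(q, K[keep], V[keep])
            schedule.append("partial")
        outputs.append(out)
    return np.stack(outputs), schedule, keep

rng = np.random.default_rng(0)
K = rng.normal(size=(16, 4))
V = rng.normal(size=(16, 4))
queries = rng.normal(size=(6, 4))
outputs, schedule, keep = decode_with_refresh(queries, K, V, small_size=4, stride=3)
print(schedule)
```

In this sketch, a smaller `stride` spends more compute on full attention but refreshes the small cache more often, while a larger `stride` is cheaper but lets the small cache go stale for longer.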

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2411.05787 [cs.CL]
	(or arXiv:2411.05787v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2411.05787

Submission history

From: Fangyuan Xu
[v1] Fri, 8 Nov 2024 18:57:07 UTC (1,732 KB)
[v2] Mon, 3 Mar 2025 18:23:47 UTC (890 KB)
