September 18, 2025
Key-value (KV) cache compression has emerged as a critical technique for reducing the memory and latency overhead of autoregressive language models during inference. Prior approaches predominantly rely on query-key attention scores to rank and evict cached tokens, assuming that attention intensity correlates with semantic importance. However, this heuristic overlooks the contribution of value vectors, which directly influence the attention output. In this paper, we propose CurDKV, a novel, value-centric KV compression method that selects keys and values based on leverage scores computed from CUR matrix decomposition. Our approach approximates the dominant subspace of the attention output \(\mathrm{softmax}(QK^\top)V\), ensuring that the retained tokens best preserve the model’s predictive behavior. Theoretically, we show that attention score approximation does not guarantee output preservation, and demonstrate that CUR-based selection minimizes end-to-end attention reconstruction loss. Empirically, CurDKV achieves up to \(9.6\%\) higher accuracy than state-of-the-art methods like SnapKV and ChunkKV under aggressive compression budgets on LLaMA and Mistral, while maintaining compatibility with FlashAttention and Grouped Query Attention. In addition to improved accuracy, CurDKV reduces generation latency by up to 40% at high compression, offering a practical speed-accuracy tradeoff.
Transformer-based large language models (LLMs) have achieved remarkable performance across a wide range of natural language understanding and generation tasks [1]–[3]. However, their inference-time memory consumption remains a significant challenge, particularly for applications requiring long-context information. A major source of this overhead is the Key-Value (KV) cache, which stores the key and value vectors corresponding to previously generated tokens. These vectors are used to compute attention outputs without recomputing intermediate states, but their memory footprint grows linearly with sequence length. For instance, as highlighted by [4], an 8B-parameter LLM generating a sequence of 2M tokens can consume up to 256GB of GPU memory while storing the KV cache.
Typically, the KV cache is populated in two stages: during the prefill phase, where a long-context input is encoded in parallel and all token-level KV vectors are stored; and during the generation phase, where tokens are produced autoregressively and KV entries are appended one step at a time. In particular, the prefill phase dominates memory usage, especially when handling large input contexts. This has motivated extensive research [5] into KV compression techniques that reduce the number of cached entries while preserving output quality. Existing KV compression methods prioritize tokens based on attention scores. For instance, SnapKV [6] and H2O [7] estimate token importance by accumulating attention weights across heads and layers. While such heuristics are computationally cheap, they only capture query-key alignment and neglect the downstream contribution of the value vectors. Moreover, these methods mostly overlook the fact that tokens with low attention scores can still carry rich semantic information through their value vectors. Recent work formalizes this mismatch through the concept of eviction loss [4], which quantifies the performance degradation from evicting specific KV entries. Critically, eviction loss is not always correlated with attention score (cf. Figure 1), highlighting the need for value-aware selection strategies.
In this work, we introduce CurDKV (CUR Decomposition for KV Compression), a principled and efficient KV compression method that circumvents these limitations by leveraging the structure of the original value and key matrices rather than relying on attention scores. CurDKV assigns importance to each key and value based on its leverage score, computed as the \(\ell_2\) norm of the corresponding row of the matrix of left singular vectors of the matrix in consideration. These scores reflect each token’s contribution to the attention output and naturally align with the goal of minimizing eviction loss. To improve scalability, we also introduce a fast approximation based on random Gaussian projections, where the value and key matrices are projected into a lower-dimensional subspace before computing leverage scores. This maintains relative importance across tokens while significantly reducing computational cost. We further cast the KV selection problem as a CUR decomposition task on the attention matrix product, selecting key and value vectors that best reconstruct either the attention matrix \(QK^\top\) or the final output \(\mathrm{softmax}(QK^\top)V\). Unlike prior work, our formulation directly targets the fidelity of the attention output, leading to a more semantically consistent compression strategy. As shown in Figure 2, CurDKV achieves significantly lower reconstruction loss on both the \(QK^\top\) matrix and the final \(\mathrm{softmax}(QK^\top)V\) product, demonstrating improved preservation of contextual semantics post-compression.
We evaluate CurDKV on two popular long-context benchmarks – LongBench [8] and Ruler [9], spanning 24 tasks with LLaMA-3.1-8B-Instruct [2] and Mistral-7B-Instruct [3]. Our experiments demonstrate that CurDKV consistently outperforms existing attention-based KV compression baselines such as SnapKV [6], ChunkKV [10], and Streaming LLM [11]. In particular, CurDKV achieves significantly lower degradation in generation accuracy at aggressive compression ratios (e.g., \(90\%\) cache reduction). With LLaMA, CurDKV surpasses SnapKV by up to \(9.6\%\) in average task score, and achieves higher fidelity (similarity to the full-cache model) across all types of long-context-dependent tasks. Similar gains are observed with Mistral, where CurDKV maintains over \(95\%\) average fidelity even under constrained memory budgets. These results validate that our value-centric, CUR-based strategy more effectively preserves semantic attention outputs than methods relying solely on query-key attention scores.
Figure 2: Preservation of the \(QK^\top\) and \(\mathrm{softmax}(QK^\top)V\) matrices under different KV compression strategies. CurDKV achieves the lowest eviction loss, indicating more faithful reconstruction of intermediate attention. We compute the reconstruction loss for LLaMA-3.1-8B-Instruct, averaged over all layers and attention heads for a sample text obtained from https://en.wikipedia.org/wiki/Attention_Is_All_You_Need.
A widely used class of KV compression methods ranks cached tokens by heuristics and retains the top-\(k\) scoring entries. H2O [7] uses observed attention scores from specific query-key interactions to estimate per-token importance and prunes less relevant tokens online. However, it requires access to attention weights, which breaks compatibility with efficient kernels like FlashAttention [12]. SnapKV [6] computes cumulative attention scores and selects the highest-ranked tokens, assuming that frequently attended tokens are most informative. Scissorhands [13] further introduces temporal windows to detect consistently attended tokens. PyramidKV [14] proposes a hierarchical scoring scheme across layers and heads, combining multi-level top-\(k\) selection with scaling-aware normalization. ChunkKV [10] compresses the cache by segmenting it into chunks and retaining only top-ranked entries per chunk to ensure temporal coverage. While these approaches are efficient and easy to deploy, they focus on approximating \(QK^\top\) and neglect the downstream contribution of value vectors in \(\mathrm{softmax}(QK^\top)V\), the final attention output propagated to the subsequent layers. Beyond static heuristics, adaptive methods attempt to allocate compression budgets more intelligently across heads or layers. AdaKV [4] computes attention statistics such as entropy and average weights to estimate the relative importance of each head and distributes the token retention budget accordingly. While more flexible, these approaches still rely on attention-derived metrics and do not explicitly account for the downstream impact of compression on the attention output.
A critical practical consideration is compatibility with modern inference backends such as FlashAttention [12], which avoids materializing the full attention matrix by fusing softmax and matrix multiplications. Methods such as H2O [7], which rely on observed attention weights, are incompatible with FlashAttention and similar kernels that trade explicit attention computation for efficiency. Another major limitation of most existing KV compression methods is their inability to work with Grouped Query Attention (GQA). GQA [15] has been widely adopted in recent LLM architectures such as LLaMA [2] and Mistral [3] due to its ability to reduce the memory footprint of the KV cache by sharing keys and values across multiple heads. However, most existing KV cache compression techniques, including SnapKV and PyramidKV, do not support GQA natively. These methods redundantly replicate the compressed KV cache across heads within a group, thus forfeiting GQA’s inherent efficiency. To address this limitation, [4] proposed a GQA-compatible eviction framework that computes average attention weights within each group to identify important tokens. This enhancement enables state-of-the-art KV compression methods to function effectively in GQA-enabled models and achieve significant cache size reductions while preserving generation quality.
The generation process of an LLM is autoregressive in nature, i.e., each generation step relies on the outputs of all previous steps. Let \(X\in\mathbb{R}^{n\times d}\) be the matrix containing the hidden states of a layer in an LLM up to a given time step, and let \(x\in\mathbb{R}^{1\times d}\) be the hidden state of the last generated token, which is used as an input for the current generation step. Suppose there are \(h\) attention heads. For a head \(i\in[1, h]\), the query, key and value projection matrices, \(W_i^Q\), \(W_i^K\) and \(W_i^V\) (all in \(\mathbb{R}^{d\times d_h}\)), map hidden states \(X\) to query, key and value states; specifically, the following are computed: \[\begin{align} Q_i = XW_i^Q, K_i = XW_i^K, V_i = XW_i^V \end{align}\] For the current time step, the states \(K_i\) and \(V_i\) constitute the KV cache elements for head \(i\). KV caching refers to caching \(K_i\) and \(V_i\) in memory to avoid recomputing them at every generation step. After computing the output for the current time step, the KV cache is updated accordingly: \[\begin{align} K_i = Cat[K_i : xW_i^K], V_i = Cat[V_i : xW_i^V] \end{align}\] KV cache compression techniques aim to retain only the most relevant key and value states from \(K_i\) and \(V_i\), without significant loss in generation quality. In particular, we show that the loss on the attention output is upper bounded by the Frobenius distance between the original value matrix and its compressed counterpart:
Lemma 1. Let \(Q, K, V\in\mathbb{R}^{n\times d}\) be the query, key and value matrices in the attention computation. Further, let \(K', V'\in\mathbb{R}^{n\times d}\) be sub-matrices obtained by zeroing-out a subset of rows of \(K\) and \(V\) respectively. Then, \[\begin{align} \lVert\text{softmax}(QK^T)V - \text{softmax}(QK'^T)V'\rVert_F\le \sqrt{n}\lVert V - V'\rVert_F + 2\sqrt{n}\lVert V'\rVert_F \end{align}\]
Note that for higher compression levels, the second term (\(\lVert V'\rVert_F\)) becomes negligible; in that case, the quality of the approximation of \(V\) by \(V'\) becomes the dominant factor. On the other hand, the contribution of \(Q\), \(K\) and \(K'\) is absorbed by the \(\text{softmax}\) operator. This motivates us to design a value-guided compression technique.
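To make the bound concrete, the following minimal sketch (an illustrative check, not part of our evaluation; the sequence length, head dimension, and number of retained tokens are arbitrary choices) evicts tokens by zeroing out rows of \(K\) and \(V\), exactly as in the statement of Lemma 1, and verifies that the eviction loss on the attention output stays below the right-hand side of the bound:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 64, 16, 16  # sequence length, head dimension, retained tokens (illustrative)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))

# Evict all but k tokens by zeroing the corresponding rows of K and V.
keep = rng.choice(n, size=k, replace=False)
Kp, Vp = np.zeros_like(K), np.zeros_like(V)
Kp[keep], Vp[keep] = K[keep], V[keep]

A_full = softmax(Q @ K.T) @ V      # softmax(QK^T)V with the full cache
A_comp = softmax(Q @ Kp.T) @ Vp    # attention output after eviction

lhs = np.linalg.norm(A_full - A_comp)  # Frobenius norm of the output error
rhs = np.sqrt(n) * np.linalg.norm(V - Vp) + 2 * np.sqrt(n) * np.linalg.norm(Vp)
print(f"eviction loss {lhs:.2f} <= bound {rhs:.2f}: {lhs <= rhs}")
```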
Let \(A\in\mathbb{R}^{n\times d}\) be a matrix. The problem of low-rank decomposition of \(A\) seeks to factor \(A\) as a product of low-rank matrices, and is a well-studied problem with wide applications in mathematical modeling and data compression [16]. The optimal solution to this problem is the well-known singular value decomposition (SVD), which represents \(A\) in the form \(A = U\Sigma V^*\), with \(U\) and \(V\) being semi-unitary matrices, and \(\Sigma\) a square diagonal matrix of dimension equal to the rank of \(A\). While the SVD is widely adopted in data analysis problems, when applied to the context of the KV cache compression problem, it presents an inherent challenge; namely, the column/row vectors of \(U\) and \(V\) are not representative of the column/row vectors of \(A\) itself. Another suboptimal low-rank decomposition, called the CUR decomposition, resolves this limitation: it is an approximation of the form \(A \approx CUR\), where \(C\) is a subset of columns of \(A\), \(R\) is a subset of rows of \(A\) and \(U\) is a low-rank matrix. Here, while the matrices \(C\) and \(R\) are representative of the underlying data, the matrix \(U\) optimizes the approximation to \(A\). The CUR decomposition is known to be a good low-rank approximation to the original matrix [16]–[18].
We utilize leverage scores, a popular technique used to compute CUR factorizations based on identifying key elements of the singular vectors of a matrix [18]. As before, let \(A = U\Sigma V^*\) be the SVD of \(A\in\mathbb{R}^{n\times d}\), where \(U\in\mathbb{R}^{n\times r}\), \(\Sigma\in\mathbb{R}^{r\times r}\) and \(V\in\mathbb{R}^{d\times r}\) with \(r = \text{rank}(A)\). The idea of leverage scores is to compute importance scores for each row and column of \(A\), which are then used to sample the rows and columns that constitute the matrices \(R\) and \(C\) of the CUR decomposition, respectively. Particularly, to rank the importance of rows of \(A\), we define the row leverage score \(l_{r, j}\) (for the \(j\)th row of \(A\)) as \[\begin{align} \label{eq:row_lev_score} l_{r, j} := \lVert U(j, :)\rVert_2 \end{align}\tag{1}\] Similarly, to rank the importance of columns of \(A\), the column leverage scores \(l_{c, j}\) (for the \(j\)th column of \(A\)) are defined analogously by \[\begin{align} \label{eq:col_lev_score} l_{c, j} := \lVert V^*(:, j)\rVert_2 = \lVert V(j, :)\rVert_2 \end{align}\tag{2}\] Since the columns of \(U\) and \(V\) (the left and right singular vectors of \(A\)) are orthonormal, it easily follows that \(\sum_{j = 1}^n l^2_{r, j} = r\) and \(\sum_{j = 1}^d l^2_{c, j} = r\). In turn, these scores lead to the distributions \(\{\frac{l^2_{r, j}}{r}\}\) and \(\{\frac{l^2_{c, j}}{r}\}\) from which the matrices \(R\) and \(C\) can be sampled. Alternatively, static methods use the top-\(k\) scores from either distribution to sample the corresponding matrices.
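As an illustration of how these scores drive a CUR factorization, the following self-contained sketch (on a synthetic low-rank matrix; the sizes, rank, and selection budget are arbitrary, and we use the static top-\(k\) variant mentioned above) computes row and column leverage scores from the SVD and reconstructs \(A \approx CUR\):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 64, 12                                            # illustrative sizes
A = rng.standard_normal((n, 8)) @ rng.standard_normal((8, d))    # synthetic rank-8 matrix

# Leverage scores from the thin SVD (cf. Equations 1 and 2).
U, S, Vt = np.linalg.svd(A, full_matrices=False)
r = int(np.sum(S > 1e-8 * S[0]))                  # numerical rank
row_lev = np.linalg.norm(U[:, :r], axis=1)        # l_{r,j} = ||U(j,:)||_2
col_lev = np.linalg.norm(Vt[:r, :], axis=0)       # l_{c,j} = ||V(j,:)||_2

# Static top-k selection of rows and columns.
rows = np.sort(np.argsort(row_lev)[-k:])
cols = np.sort(np.argsort(col_lev)[-k:])
C, R = A[:, cols], A[rows, :]
U_mid = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)  # middle factor that best reconstructs A

rel_err = np.linalg.norm(A - C @ U_mid @ R) / np.linalg.norm(A)
print(f"relative CUR reconstruction error: {rel_err:.2e}")  # near zero since k >= rank(A)
```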
CurDKV: Algorithm Design
We now introduce CurDKV, our proposed KV cache compression method, which uses combined leverage scores of the key and value matrices to compress the KV cache. CurDKV can be applied during the prefill as well as the token generation phase, is easy to implement, and seamlessly integrates with inference acceleration methods such as FlashAttention [12] and GQA [15]. Algorithm 3 provides the pseudocode of CurDKV. In the following discussion, we assume the attention setup of GQA [15], wherein a group of attention heads uses a common key and value matrix. Particularly, we assume that there are \(g\) groups, with the key and value matrices of group \(i\) denoted by \(K_i\in\mathbb{R}^{n\times d}\) and \(V_i\in\mathbb{R}^{n\times d}\), respectively. As usual, \(n\) represents the number of keys (or values) in consideration, and \(d\) is the dimension of the space of vectors. The combined key and value matrices across all groups are denoted by \(K\in\mathbb{R}^{g\times n\times d}\) and \(V\in\mathbb{R}^{g\times n\times d}\).
As shown in Algorithm 3, CurDKV takes as input the key matrix \(K\in\mathbb{R}^{g\times n\times d}\) across all attention groups, the corresponding value matrix \(V\in \mathbb{R}^{g\times n\times d}\), the group budget \(k\in [n]\) representing the level of compression, a projection dimension \(r\), and the number of initial attention sinks \(s\) to be preserved. It outputs compressed key and value matrices \(K'\in\mathbb{R}^{g\times k\times d}\) and \(V'\in\mathbb{R}^{g\times k\times d}\), which are then used to compute the output of the attention module. At a high level, CurDKV computes approximate leverage scores for each row of the key and value matrices \(K_i\in\mathbb{R}^{n\times d}\) and \(V_i\in\mathbb{R}^{n\times d}\) within a single group \(i\) (lines 3-5). The leverage scores for corresponding keys and values are then combined and normalized (lines 6-7). Finally, the computed leverage scores are used to retrieve the top-\(k\) keys and values from \(K_i\) and \(V_i\) (lines 9-11). In addition, CurDKV preserves the initial \(s\) tokens (lines 8, 10) owing to the prevalence of attention sinks [11].
The computation of exact leverage scores as in Equations 1 and 2 requires the SVD of the matrices in consideration, which turns out to be a bottleneck for the KV compression problem. To that end, CurDKV projects \(K_i\) and \(V_i\) to matrices \(K_iG\) and \(V_iG\), respectively, where \(G\in\mathbb{R}^{d\times r}\) is a random matrix whose entries are sampled from a normal \(\mathcal{N}(0, \frac{1}{r})\) distribution; in our implementation, we use \(r = 20\). After this projection step, the norms of the rows of \(K_iG\) and \(V_iG\) are used as proxies for the actual leverage scores (lines 3-5 of Algorithm 3).
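For illustration, the following sketch compresses one group's KV cache in this spirit (this is not the exact implementation; the tensor layout and the sharing of a single projection matrix \(G\) between keys and values are simplifying assumptions). It uses Gaussian-projected row norms as approximate leverage scores, combines the key and value scores elementwise, preserves the first \(s\) sink tokens, and keeps the top-\(k\) tokens overall:

```python
import torch

def compress_group_kv(K_i, V_i, k, r=20, s=4):
    """Sketch of per-group, CurDKV-style KV compression.

    K_i, V_i: (n, d) key/value matrices of one GQA group.
    k: tokens to retain; r: projection dimension; s: attention sinks to keep.
    Returns compressed (k, d) key and value matrices.
    """
    n, d = K_i.shape
    # Random Gaussian projection with entries ~ N(0, 1/r); shared by keys and values here.
    G = torch.randn(d, r, device=K_i.device, dtype=K_i.dtype) / (r ** 0.5)

    # Approximate leverage scores: row norms of the projected matrices.
    key_scores = (K_i @ G).norm(dim=-1)
    val_scores = (V_i @ G).norm(dim=-1)

    # Combine key and value scores elementwise and normalize.
    scores = key_scores * val_scores
    scores = scores / scores.sum()

    # Always keep the first s sink tokens; fill the remaining budget by top score.
    scores[:s] = float("inf")
    keep = scores.topk(k).indices.sort().values  # sort to preserve positional order

    return K_i[keep], V_i[keep]

# Usage on random data (illustrative shapes).
K_i, V_i = torch.randn(1024, 128), torch.randn(1024, 128)
K_c, V_c = compress_group_kv(K_i, V_i, k=256)
print(K_c.shape, V_c.shape)  # torch.Size([256, 128]) torch.Size([256, 128])
```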
AdaCurDKV. Along with CurDKV, we also implement AdaCurDKV, an adaptive variation of CurDKV. AdaCurDKV adaptively achieves head-specific compression by choosing the top-\(k\) key and value vectors across all heads in a layer, where the selection is based on the computed leverage scores. As in the standard adaptive implementation introduced by [4], a safeguard parameter \(\alpha\) is used to ensure that a minimum fraction of key/value vectors is preserved for each head (a default value of \(\alpha = 0.20\) is used in our implementation).
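A simplified sketch of the head-wise allocation is given below (how the leverage-score mass is pooled per head and how rounding leftovers are handled are our assumptions, following the description of budgets proportional to leverage-score mass; the \(\alpha\)-floor safeguard follows [4]):

```python
import torch

def adaptive_head_budgets(scores_per_head, total_budget, alpha=0.20):
    """Allocate a layer's token budget across heads in proportion to leverage-score mass.

    scores_per_head: list of 1-D tensors, one per head, with combined leverage scores.
    total_budget: total number of tokens to retain across all heads in the layer.
    alpha: safeguard -- each head keeps at least alpha * (total_budget / num_heads) tokens.
    """
    num_heads = len(scores_per_head)
    floor = int(alpha * total_budget / num_heads)        # guaranteed minimum per head
    mass = torch.stack([s.sum() for s in scores_per_head])

    # Distribute the remaining budget proportionally to each head's score mass.
    remaining = total_budget - floor * num_heads
    extra = (remaining * mass / mass.sum()).floor().long()
    # Rounding may leave a few tokens unallocated; a full implementation would redistribute them.
    return [floor + int(e) for e in extra]

# Usage: budgets for a layer with 8 heads and a 1024-token layer budget (illustrative).
scores = [torch.rand(4096) for _ in range(8)]
print(adaptive_head_budgets(scores, total_budget=1024))
```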
For the empirical analyses, following contemporary studies, we use two instruction-tuned LLMs – LLaMA-3.1-8B-Instruct [2] and Mistral-7B-Instruct-v0.3 [3]. All the evaluations are done on two widely popular long-context benchmarks – LongBench [8] and Ruler [9]. The LongBench benchmark contains a total of 16 tasks, covering various task domains including single-document QA, multi-document QA, summarization, few-shot learning, synthetic tasks and code completion. From the Ruler benchmark, we consider a total of eight needle-in-a-haystack tasks with a maximum context length of 16K. Details of the LongBench and Ruler tasks and the prompt templates can be found in the appendix. For all these 24 tasks, we evaluate only on the more challenging question-agnostic setting [19], where the questions are omitted during compression and only the context is compressed. As argued by [4], this setup mimics more challenging scenarios, where the compression method is unaware of the questions being passed to the model during inference.
We compare CurDKV with various competitive KV compression baselines, including SnapKV [6], ChunkKV [10], Streaming LLM [11] and AdaSnapKV [4]. We consider an additional baseline, KNorm [19], which uses the key norm as a proxy to determine which keys to evict during compression. H2O [7] is purposefully omitted from the evaluation, as it throws an out-of-memory error when run on a single NVIDIA A100-80GB GPU card (it does not natively support FlashAttention and therefore requires \(4\times\) more cache memory on long sequences). All these baselines were run for different compression (eviction) ratios \(\{30\%, 50\%, 70\%, 90\%\}\). Following [11], we use attention sinks of size \(s=4\) for all the baselines (all the baseline numbers are produced in-house).
Task Type | Task | LLaMA-8B | Mistral-7B
---|---|---|---
Single-Doc QA | NrtvQA | 30.7 | 28.4
 | Qasper | 47.2 | 40.3
 | MF-en | 55.6 | 51.9
Multi-Doc QA | HotpotQA | 59.5 | 48.9
 | 2WikiMQA | 51.8 | 37.4
 | Musique | 32.6 | 28.0
Summarization | Gov Report | 35.0 | 34.7
 | QMSum | 25.1 | 25.5
 | Multi News | 26.8 | 26.8
Few-shot Learning | TREC | 29.5 | 55.8
 | TriviaQA | 85.7 | 85.0
 | SAMSum | 38.7 | 20.8
Synthetic | Pcount | 10.7 | 5.1
 | Pre | 100.0 | 98.0
Code | Lcc | 53.9 | 49.8
 | RB-P | 47.6 | 56.1
Average | | 45.7 | 43.3
We report the results obtained with full cache (0% compression) for LLaMA-8B and Mistral-7B in Table 1. Table 2 summarizes the results obtained with LLaMA-8B and Mistral-7B with the KV cache compressed at 30% and 90% under different baselines.
On LLaMA-8B, CurDKV outperforms SnapKV by +3.6% at 30% compression (48.9% vs. 45.3%) and by +2.0 points at 90% (35.7% vs. 33.7%). On Mistral-7B, CurDKV shows a similar trend, surpassing SnapKV by +2.9% at 30% (45.6% vs. 42.7%) and by +0.5% at 90% (33.2% vs. 32.7%). These gains are robust across task types, including multi-hop QA, summarization, and code completion. Norm-based heuristics such as Knorm and ChunkKV fail to consistently preserve semantic fidelity, often underperforming even under moderate compression. This is most evident in Mistral-7B’s few-shot learning tasks (e.g., TriviaQA, SAMSum), where CurDKV maintains over 89% performance while other baselines trail by 5–10%.
We compare AdaCurDKV, our adaptive compression strategy, with its static variant (CurDKV) and the attention-based adaptive baseline, AdaSnapKV. Across models and compression levels, AdaCurDKV consistently outperforms AdaSnapKV, while performing competitively with CurDKV. With LLaMA-8B under 30% KV compression, AdaCurDKV achieves the highest average score (49.1%), outperforming AdaSnapKV (45.5%) by +3.6% and CurDKV (48.9%) by +0.2%. At 90% compression, AdaCurDKV maintains a strong average score of 35.1%, closely trailing CurDKV (35.7%) and surpassing AdaSnapKV (34.2%). For Mistral-7B, AdaCurDKV achieves 45.2% performance at 30% compression, outperforming AdaSnapKV (42.1%), and remaining close to CurDKV (45.6%). Although AdaCurDKV’s average score drops to 29.1% at 90% compression, it remains competitive and stable across all task types. While CurDKV typically attains the best or second-best overall score, AdaCurDKV provides additional robustness by allocating head-wise budgets proportional to leverage score mass, leading to improved performance on head-sensitive tasks, especially in summarization and multi-hop QA.
LongBench results with the KV cache compressed at 30% and 90%; each cell reports LLaMA-8B / Mistral-7B scores. Non-adaptive methods: ChunkKV, Knorm, Streaming LLM, SnapKV, CurDKV; adaptive methods: AdaSnapKV, AdaCurDKV.

Compression | Task | ChunkKV | Knorm | Streaming LLM | SnapKV | CurDKV | AdaSnapKV | AdaCurDKV
---|---|---|---|---|---|---|---|---
30% | NrtvQA | 30.3 / 25.7 | 30.6 / 23.8 | 26.8 / 24.4 | 30.2 / 26.4 | / | 29.7 / 25.4 | / |
Qasper | 45.7 / 36.3 | 44.4 / 35.3 | 43.4 / 36.1 | 46.5 / 35.5 | 48.2 / 39.4 | 45.8 / 35.7 | 48.6 / 41.4 |
MF-en | 50.2 / 48.7 | 55.2 / 48.7 | 36.0 / 34.4 | 53.8 / 49.0 | 56.3 / 52.9 | 55.3 / 48.4 | 56.9 / 51.0 |
HotpotQA | 58.0 / 49.0 | 57.2 / 48.7 | 52.1 / 43.9 | 58.4 / 48.8 | 57.8 / 49.2 | 59.9 / 47.3 | 59.7 / 46.6 |
2WikiMQA | 49.6 / 37.3 | 49.4 / 36.6 | 42.5 / 31.6 | 49.3 / 37.2 | 51.3 / 41.5 | 50.1 / 37.2 | 52.5 / 39.6 | |
Musique | 30.0 / 25.4 | 33.6 / 24.5 | 28.9 / 20.7 | 31.0 / 25.4 | 33.7 / 29.4 | 32.5 / 28.1 | 32.9 / 28.9 | |
Gov Report | 33.9 / 34.1 | 34.0 / 34.0 | 31.7 / 32.9 | 33.6 / 33.6 | 34.5 / 35.9 | 34.1 / 34.1 | 35.0 / 34.7 | |
QMSum | 24.0 / 24.6 | 24.7 / 24.2 | 23.5 / 23.8 | 23.9 / 24.2 | 25.0 / 25.1 | 24.6 / 24.8 | 25.2 / 25.3 | |
Multi News | 26.8 / 26.2 | 26.5 / 26.3 | 26.3 / 26.3 | 26.6 / 26.0 | 27.2 / 26.9 | 26.7 / 26.2 | 27.1 / 27.3 | |
TREC | 25.0 / 50.9 | 66.0 / 49.3 | 32.5 / 59.5 | 31.0 / 51.5 | 67.0 / 60.0 | 30.5 / 52.0 | 63.5 / 73.0 | |
TriviaQA | 85.5 / 84.8 | 87.8 / 85.6 | 91.7 / 76.2 | 86.4 / 85.4 | 92.2 / 89.0 | 86.4 / 88.1 | 92.6 / 89.3 | |
SAMSum | 39.6 / 20.8 | 36.5 / 33.9 | 38.7 / 19.1 | 40.1 / 21.0 | 41.3 / 41.1 | 40.0 / 22.4 | 41.5 / 46.6 | |
Pcount | 11.1 / 5.6 | 13.1 / 4.1 | 8.3 / 2.2 | 12.2 / 5.2 | 13.7 / 5.7 | 12.2 / 5.1 | 12.1 / 4.5 | |
Pre | 99.5 / 97.5 | 98.0 / 80.8 | 68.5 / 68.0 | 100.0 / 97.0 | 100.0 / 96.5 | 100.0 / 98.0 | 99.5 / 97.0 | |
Lcc | 52.5 / 51.6 | 36.7 / 37.2 | 47.1 / 51.1 | 54.0 / 51.5 | 52.1 / 51.1 | 52.7 / 50.7 | 52.8 / 51.5 | |
RB-P | 48.3 / 56.2 | 48.3 / 54.7 | 47.4 / 54.9 | 48.1 / 56.6 | 50.0 / 57.1 | 47.4 / 57.0 | 50.1 / 58.2 | |
Average | 44.4 / 42.2 | 46.4 / 40.5 | 40.3 / 37.8 | 45.3 / 42.1 | 48.9 / 45.6 | 45.5 / 41.6 | 48.9 / 46.4 | |
90% | NrtvQA | 23.1 / 18.7 | 21.9 / 13.0 | 21.6 / 18.4 | 24.3 / 17.7 | 27.7 / 14.2 | 25.3 / 18.3 | 27.0 / 12.1 |
Qasper | 20.0 / 12.7 | 15.9 / 6.0 | 18.4 / 13.1 | 21.1 / 13.6 | 28.5 / 26.2 | 21.6 / 15.2 | 28.6 / 26.8 | |
MF-en | 23.6 / 26.2 | 29.2 / 25.0 | 23.0 / 23.1 | 23.4 / 30.3 | 32.6 / 35.0 | 26.2 / 32.0 | 30.2 / 35.2 | |
HotpotQA | 43.1 / 37.3 | 40.1 / 30.9 | 37.2 / 31.2 | 44.5 / 33.5 | 47.4 / 30.7 | 46.9 / 38.1 | 46.1 / 34.8 | |
2WikiMQA | 22.4 / 21.9 | 20.1 / 27.8 | 22.4 / 20.1 | 24.6 / 23.1 | 31.8 / 25.4 | 27.2 / 24.0 | 33.6 / 31.7 | |
Musique | 16.9 / 16.9 | 12.8 / 13.2 | 14.0 / 13.9 | 20.0 / 18.3 | 20.7 / 13.4 | 18.1 / 17.3 | 22.3 / 18.9 | |
Gov Report | 25.0 / 25.5 | 26.0 / 26.0 | 24.8 / 25.2 | 25.3 / 25.4 | 26.9 / 23.3 | 25.3 / 25.3 | 26.9 / 26.2 | |
QMSum | 19.5 / 19.8 | 21.0 / 18.9 | 19.1 / 19.8 | 19.7 / 20.2 | 22.6 / 21.8 | 20.9 / 20.4 | 22.2 / 21.3 | |
Multi News | 17.1 / 17.9 | 20.7 / 19.8 | 20.1 / 20.1 | 20.8 / 20.8 | 23.2 / 21.6 | 20.6 / 20.7 | 23.0 / 21.5 | |
TREC | 11.0 / 15.8 | 47.5 / 26.5 | 28.0 / 35.0 | 33.5 / 37.8 | 22.5 / 27.3 | 33.0 / 44.0 | 19.5 / 33.0 | |
TriviaQA | 84.8 / 87.0 | 68.4 / 85.4 | 90.7 / 52.4 | 82.9 / 87.9 | 91.5 / 73.5 | 82.4 / 87.6 | 92.3 / 76.3 | |
SAMSum | 31.6 / 29.7 | 28.7 / 38.9 | 34.4 / 31.8 | 38.5 / 28.8 | 37.5 / 40.9 | 39.3 / 30.4 | 33.7 / 38.7 | |
Pcount | 5.0 / 3.6 | 12.0 / 2.6 | 4.0 / 4.0 | 6.0 / 3.5 | 5.5 / 3.3 | 8.0 / 3.9 | 7.9 / 2.4 | |
Pre | 49.5 / 42.0 | 24.0 / 10.0 | 16.0 / 15.5 | 55.0 / 53.0 | 61.5 / 15.5 | 56.5 / 67.5 | 55.5 / 23.0 | |
Lcc | 50.6 / 51.6 | 23.1 / 22.2 | 52.3 / 50.6 | 51.0 / 52.6 | 39.4 / 39.7 | 49.2 / 53.0 | 39.0 / 41.1 | |
RB-P | 49.3 / 54.9 | 49.7 / 55.1 | 52.9 / 53.3 | 49.2 / 54.8 | 52.2 / 53.6 | 47.3 / 55.0 | 54.2 / 54.6 | |
Average | 30.8 / 30.1 | 28.8 / 26.3 | 29.9 / 26.7 | 33.7 / 32.6 | 35.7 / 29.1 | 34.2 / 33.2 | 35.1 / 31.1 |
We conduct ablations (cf. Figure 4) to assess two core design choices in CurDKV: the source of leverage scores and the use of random projections. At 30% compression, different leverage modules (key, value, key-value product) perform similarly, but at 90%, value-centric and combined variants show better robustness than key-only. Random projections offer marginal gains overall, with slightly improved stability at high compression (a more elaborate discussion is provided in Appendix 1.0.8).
SubTask | LLaMA-8B | Mistral-7B
---|---|---
S-1 | 100.0 | 95.2
S-2 | 100.0 | 98.3
S-3 | 100.0 | 100.0
MK-1 | 100.0 | 97.9
MK-2 | 100.0 | 97.8
MK-3 | 97.9 | 80.9
MQ | 99.5 | 85.6
MV | 100.0 | 88.8
Average | 99.6 | 92.8
We further evaluate CurDKV and AdaCurDKV on the Ruler benchmark, focusing on needle-in-a-haystack (NIAH) subtasks, which measure a model’s ability to recover rare, high-salience facts embedded in long distractor contexts. These tasks are particularly sensitive to the retention of semantically critical key-value tokens. Full cache results are highlighted in Table 3.
Ruler results with the KV cache compressed at 30% and 90%; each cell reports LLaMA-8B / Mistral-7B scores. Non-adaptive methods: ChunkKV, Knorm, Streaming LLM, SnapKV, CurDKV; adaptive methods: AdaSnapKV, AdaCurDKV.

Compression | Task | ChunkKV | Knorm | Streaming LLM | SnapKV | CurDKV | AdaSnapKV | AdaCurDKV
---|---|---|---|---|---|---|---|---
30% | S-1 | 100.0 / 93.6 | 100.0 / 96.8 | 62.9 / 61.3 | 100.0 / 54.8 | 100.0 / 74.2 | 100.0 / 64.5 | 100.0 / 85.5 |
S-2 | 100.0 / 91.2 | 100.0 / 12.3 | 64.9 / 64.9 | 100.0 / 63.2 | 100.0 / 91.2 | 100.0 / 80.7 | 100.0 / 96.5 | |
S-3 | 96.1 / 56.9 | 51.0 / 0.0 | 78.4 / 78.4 | 25.5 / 2.0 | 94.1 / 72.6 | 82.4 / 13.7 | 88.2 / 60.8 | |
MK-1 | 100.0 / 89.6 | 91.7 / 0.0 | 70.8 / 64.6 | 100.0 / 41.7 | 100.0 / 91.7 | 100.0 / 77.1 | 100.0 / 93.8 | |
MK-2 | 57.8 / 57.8 | 20.0 / 0.0 | 71.1 / 68.9 | 71.1 / 44.4 | 100.0 / 93.3 | 97.8 / 84.4 | 100.0 / 84.4 | |
MK-3 | 44.7 / 46.8 | 17.0 / 2.1 | 55.3 / 31.9 | 53.2 / 29.8 | 95.7 / 23.4 | 89.4 / 68.1 | 93.6 / 44.7 | |
MQ | 100.0 / 85.1 | 95.7 / 1.9 | 74.0 / 70.7 | 99.5 / 31.7 | 99.5 / 82.2 | 100.0 / 58.7 | 100.0 / 83.7 | |
MV | 97.4 / 88.8 | 98.7 / 1.3 | 68.4 / 68.4 | 94.1 / 31.6 | 100.0 / 93.4 | 100.0 / 64.5 | 100.0 / 95.4 | |
Average | 87.0 / 76.2 | 71.8 / 14.3 | 68.2 / 63.6 | 80.4 / 37.4 | 98.7 / 77.8 | 96.2 / 64.0 | 97.7 / 80.6 | |
90% | S-1 | 100.0 / 88.7 | 100.0 / 46.8 | 6.5 / 6.5 | 72.6 / 35.5 | 91.9 / 30.7 | 85.5 / 46.8 | 96.8 / 16.1 |
S-2 | 38.6 / 3.5 | 1.8 / 0.0 | 14.0 / 12.3 | 38.6 / 3.5 | 57.9 / 5.3 | 36.8 / 3.5 | 63.2 / 5.3 | |
S-3 | 17.7 / 2.0 | 0.0 / 0.0 | 11.8 / 11.8 | 2.0 / 2.0 | 0.0 / 0.0 | 2.0 / 2.0 | 0.0 / 0.0 | |
MK-1 | 29.2 / 12.5 | 0.0 / 0.0 | 10.4 / 8.3 | 14.6 / 4.2 | 39.6 / 8.3 | 20.8 / 6.3 | 39.6 / 6.3 | |
MK-2 | 8.9 / 6.7 | 0.0 / 0.0 | 8.9 / 8.9 | 8.9 / 0.0 | 4.4 / 0.0 | 11.1 / 2.2 | 4.4 / 0.0 | |
MK-3 | 8.5 / 8.5 | 2.1 / 0.0 | 12.8 / 0.0 | 2.1 / 0.0 | 0.0 / 0.0 | 4.3 / 2.1 | 0.0 / 0.0 | |
MQ | 28.4 / 13.9 | 4.3 / 0.0 | 9.6 / 6.7 | 15.4 / 11.5 | 38.5 / 11.1 | 14.9 / 11.5 | 56.7 / 1.4 | |
MV | 30.9 / 17.8 | 3.3 / 0.0 | 7.2 / 7.2 | 21.7 / 15.8 | 45.4 / 12.5 | 19.7 / 15.8 | 52.0 / 9.9 | |
Average | 32.8 / 19.2 | 13.9 / 5.8 | 10.1 / 7.7 | 22.0 / 9.1 | 34.7 / 7.8 | 24.4 / 11.3 | 39.1 / 3.9 |
Figure 4: Accuracy of LLaMA-3.1-8B-Instruct on LongBench tasks for different ablations of CurDKV. Full results reported in Table 17 in Appendix 1.0.8.
As shown in Table 4, CurDKV and AdaCurDKV achieve the highest average scores (98.7% and 97.7%) under 30% compression, outperforming all baselines. CurDKV substantially outperforms norm-based (ChunkKV: 87.0%, Knorm: 71.8%) and attention-based approaches (SnapKV: 80.4%, StreamingLLM: 68.2%). On particularly hard retrieval subtasks like MK-3 and MQ, AdaCurDKV scores 93.6% and 100.0% respectively, indicating its strong capacity to retain tokens that encode key factual content. In S-3 – a shallow, multi-hop setting – AdaCurDKV scores 88.2% compared to 82.4% (AdaSnapKV) and 94.1% (CurDKV), reflecting robustness in both adaptive and static modes. At 90% compression, performance drops sharply for most baselines, with norm-based methods like Knorm (13.9%) and StreamingLLM (10.1%) failing to preserve fidelity. In contrast, CurDKV (34.7%) and AdaCurDKV (39.1%) maintain significantly higher performance, with AdaCurDKV again achieving the best overall average. Notably, on MQ and MV, AdaCurDKV reaches 56.7% and 52.0% respectively, far ahead of all other methods, highlighting its advantage on tasks requiring deeper contextual aggregation. On S-2, AdaCurDKV outperforms both AdaSnapKV and CurDKV (63.2% vs. 36.8% and 57.9%), showcasing the benefit of adaptive head-wise allocation in early-token regimes. Despite the strong performance across different subtasks, we observe marginal underperformance on a subset of NIAH tasks with Mistral at a \(90\%\) compression ratio. This can be attributed to Mistral’s use of sliding-window attention, which reduces head specialization and makes head-local token selection less effective. In such cases, heuristics like ChunkKV, which retain tokens based on uniform position or locality, can incidentally preserve critical information. Nonetheless, CurDKV continues to show superior performance on more challenging retrieval subtasks, indicating its robustness under complex attention patterns.
Figure 5: Accuracy of LLaMA-3.1-8B-Instruct on LongBench tasks for different ablations of CurDKV. Full results reported in Table 17 in Appendix 1.0.8.
We perform two ablation studies to investigate the key design decisions behind CurDKV: (1) the module on which leverage scores are computed, and (2) the effect of using randomized projections for score approximation.
We first ablate the module on which leverage scores are computed. Figure 4 shows the average performance (task-level scores are reported in Table 17 in the Appendix) when using key-only, value-only, and the default key-value product (elementwise product of key and value leverage scores). All three variants perform similarly at 30% compression. However, at 90% compression, the differences become more pronounced: key-only scores degrade significantly, while value and key-value product maintain higher median accuracy and robustness. This confirms that value information is more predictive of output quality, and that combining key and value structure yields the most reliable token selection strategy.
Next, we evaluate the role of randomized leverage score approximation. This determines whether the key and value matrices are projected to a low-rank space via Gaussian random matrices before computing leverage scores. We observe marginal differences between the two variants. At 30% compression, both versions perform similarly in median accuracy, though random projection slightly improves the worst-case performance. At 90% compression, however, random projections result in modest improvements in median and lower-quartile accuracy. These gains suggest that random projections can help stabilize leverage score estimation under extreme compression, but their effect is secondary to the choice of leverage module.
To evaluate the practical benefits of KV compression beyond accuracy, we analyze CurDKV in terms of memory usage, prefill latency, and generation latency across varying sequence lengths and compression ratios. As expected, CurDKV yields a near-linear reduction in KV cache size with increasing compression ratio (cf. Figure 6a). For example, at 80% compression, memory usage drops from 15.6 GB to under 3 GB for a 128K-token context. This linear trend holds consistently across all tested sequence lengths. Since CurDKV compresses the cache on a per-layer and per-head basis, these reductions are directly proportional to the retained key-value tokens, enabling predictable memory scaling in long-context scenarios. Figure 6 (b) shows a slight upward trend in prefill latency at higher compression ratios, especially for longer sequences. This is because CurDKV computes leverage scores and performs token selection during the prefill phase, introducing modest computational overhead. For instance, with 128K tokens, prefill time increases from approximately 10 seconds (no compression) to around 14–15 seconds at high compression levels. However, the added cost remains constant beyond 40–60% compression, indicating amortization of overhead due to batch-efficient scoring. Interestingly, generation latency (Figure 6 (c)) decreases with compression for all sequence lengths, particularly for longer inputs. This is attributed to the smaller number of cached tokens being read during autoregressive decoding. With 128K-token contexts, generation time reduces from 10 seconds (no compression) to under 6 seconds at 80% compression. The effect is more muted for shorter sequences, but consistently observable. These properties make CurDKV a practical and scalable solution for deployment-time KV cache optimization in LLMs.
In this paper we proposed CurDKV, a value-centric KV cache compression method based on CUR decomposition, and AdaCurDKV, its adaptive variant. Unlike prior approaches that rely on attention scores, our method selects key-value tokens using leverage scores derived from the value and key matrices, with efficient approximations via random projections. Experiments on the LongBench and Ruler benchmarks showed that CurDKV consistently outperformed existing methods across tasks and compression ratios, while remaining compatible with FlashAttention and Grouped Query Attention.
While CurDKV offers a principled and effective approach to KV cache compression, it relies on static token selection during the prefill phase, which may limit its responsiveness to dynamically emerging information needs during generation. Future work could explore query-aware token selection, hybrid token-chunk strategies, or lightweight learned components to improve performance in semantically sparse settings. Extending CurDKV to support dynamic compression during generation is another promising direction to better adapt to evolving query demands.
Proof. Let \(A = \text{softmax}(QK^T)V\), and let \(A' = \text{softmax}(QK'^T)V'\). In this proof, we will upper bound \(\lVert A - A'\rVert_F\). Throughout the proof, we also assume that \(K\) and \(K'\) have the same dimensions, which is achieved by adding zero rows to \(K'\) as necessary (and the same holds for \(V\) and \(V'\)).
First, we have \[\begin{align} A - A' &= \text{softmax}(QK^T)V - \text{softmax}(QK^T)V' + \text{softmax}(QK^T)V' - \text{softmax}(QK'^T)V'\\ &= \text{softmax}(QK^T)(V - V') + (\text{softmax}(QK^T) - \text{softmax}(QK'^T))V' \end{align}\] and hence using the triangle inequality for norms, we get \[\begin{align} \lVert A - A'\rVert_F &\le \lVert\text{softmax}(QK^T)(V - V')\rVert_F + \lVert(\text{softmax}(QK^T) - \text{softmax}(QK'^T))V'\rVert_F \end{align}\] Next, if \(\lVert \cdot\rVert_\text{op}\) is the operator norm, then we have that \(\lVert AB\rVert_F\le \lVert A\rVert_\text{op}\lVert B\rVert_F\). Applying this to the two terms on the RHS above, we get \[\begin{align} \lVert A - A'\rVert_F &\le \lVert \text{softmax}(QK^T)\rVert_\text{op}\lVert V - V'\rVert_F + \lVert \text{softmax}(QK^T) - \text{softmax}(QK'^T)\rVert_\text{op}\lVert V'\rVert_F \\ &\le \lVert \text{softmax}(QK^T)\rVert_\text{op}\lVert V - V'\rVert_F + (\lVert\text{softmax}(QK^T) \rVert_\text{op} + \lVert\text{softmax}(QK'^T)\rVert_\text{op})\lVert V'\rVert_F \end{align}\] where in the second step above, we have again used the triangle inequality for norms. Now, we know that both \(\text{softmax}(QK^T)\) and \(\text{softmax}(QK'^T)\) are row-stochastic matrices; since every entry of a row-stochastic matrix lies in \([0, 1]\) and each row sums to 1, its squared Frobenius norm is at most \(n\), and hence the \(\lVert\cdot\rVert_\text{op}\) norm of both these matrices is at most \(\sqrt{n}\). So, we get that \[\begin{align} \lVert A - A'\rVert_F \le \sqrt{n}\lVert V - V'\rVert_F + 2\sqrt{n}\lVert V'\rVert_F \end{align}\] completing the proof of the claim. ◻
LongBench [8] is a set of multitasks designed to assess the ability of LLMs to handle long-context problems requiring deep understanding and reasoning. Tasks include single-document question answering, multi-document question answering, summarization, few-shot learning, synthetic tasks and code generation. Below we present a detailed overview of all tasks:
Single-Doc QA: Models are required to obtain the answer from a single source document. Test samples are derived from a variety of datasets, including NarrativeQA [20], Qasper [21], and MultiFieldQA [22]. Templates for these tasks can be found in Table 5.
Multi-Doc QA: Models are required to extract and combine information from multiple source documents to obtain the answer. Data samples are derived from three multi-hop QA datasets: HotpotQA [23], 2WikiMultihopQA [24] and MuSiQue [25]. Templates for these tasks can be found in Table 6.
Summarization: These tasks require a comprehensive understanding of the context. The data samples are obtained from the GovReport [26] and QMSum [27] datasets. Templates for these tasks can be found in Table 7.
Few-shot Learning: These tasks have classification, summarization and reading-comprehension tasks integrated within them to maintain task diversity. The TREC dataset [28] is used for classification problems. The SAMSum dataset [29] is used for summarization tasks. Finally, TriviaQA [30] is utilized for comprehension tasks. Templates for these tasks can be found in Table 8.
Synthetic Tasks: These tasks assess a model’s ability to handle specific scenarios and patterns. In LongBench, the PassageRetrieval-en and PassageCount datasets are used. Templates for these tasks can be found in Table 9.
Code Completion Tasks: These tasks are meant to assist users by completing code based on a user’s input and context. The LCC dataset from the original Long Code Completion dataset [31] as well as the RepoBench-P dataset [32] are used. Templates for these tasks can be found in Table 10.
NarrativeQA | Task Template: |
You are given a story, which can be either a novel or a movie script, and a question. Answer the question as concisely as you can, using a single phrase if possible. Do not provide any explanation. | |
Story: {context} | |
Now, answer the question based on the story as concisely as you can, using a single phrase if possible. Do not provide any explanation. | |
Question: {question} | |
Qasper | Task Template: |
You are given a scientific article and a question. Answer the question as concisely as you can, using a single phrase or sentence if possible. If the question cannot be answered based on the information in the article, write "unanswerable". If the question is a yes/no question, answer "yes", "no" or "unanswerable". Do not provide any explanation. | |
Article: {context} | |
Answer the question based on the above article as concisely as you can, using a single phrase or sentence if possible. If the question cannot be answered based on the information in the article, write "unanswerable". If the question is a yes/no question, answer "yes", "no" or "unanswerable". Do not provide any explanation | |
Question: {question} | |
MultifieldQA EN | Task Template: Read the following text and answer briefly. |
{context} | |
Now, answer the following question based on the above text, only give me the answer and do not output any other words. | |
Question: {question} |
HotpotQA | Task Template: |
Answer the question based on the given passages. Only give me the answer and do not output any other words. | |
The following are given passages. | |
{context} | |
Answer the question based on the given passages. Only give me the answer and do not output any other words. | |
Question: {question} | |
2WikimQA | Task Template: |
Answer the question based on the given passages. Only give me the answer and do not output any other words. | |
The following are given passages. | |
{context} | |
Answer the question based on the given passages. Only give me the answer and do not output any other words. | |
Question: {question} | |
Musique | Task Template: |
Answer the question based on the given passages. Only give me the answer and do not output any other words. | |
The following are given passages. | |
{context} | |
Answer the question based on the given passages. Only give me the answer and do not output any other words. | |
Question: {question} |
Gov Report | Task Template: |
You are given a report by a government agency. Write a one-page summary of the report. | |
Report: | |
{context} | |
Now, write a one-page summary of the report. | |
QMSum | Task Template: |
You are given a meeting transcript and a query containing a question or instruction. Answer the query in one or more sentences. | |
Transcript | |
{context} | |
Now, answer the query based on the above meeting transcript in one or more sentences. | |
Query: {question} | |
Multi News | Task Template: |
You are given several news passages. Write a one-page summary of all news. | |
News | |
{context} | |
Now, write a one-page summary of all the news. |
TREC | Task Template: |
Please determine the type of the question below. Here are some examples of questions. | |
{context} | |
{question} | |
TriviaQA | Task Template: |
Answer the question based on the given passage. Only give me the answer and do not output any other words. The following are some examples. | |
{context} | |
{question} | |
SAMSum | Task Template: |
Summarize the dialogue into a few short sentences. The following are some examples. | |
{context} | |
{question} |
Passage Count | Task Template: |
There are some paragraphs below sourced from Wikipedia. Some of them may be duplicates. Please carefully read these paragraphs and determine how many unique paragraphs there are after removing duplicates. In other words, how many non-repeating paragraphs are there in total? | |
{context} | |
Please enter the final count of unique paragraphs after removing duplicates. The output format should only contain the number, such as 1, 2, 3, and so on. | |
Passage Retrieval EN | Task Template: |
Here are 30 paragraphs from Wikipedia, along with an abstract. Please determine which paragraph the abstract is from. | |
{context} | |
The following is an abstract. | |
{question} | |
Please enter the number of the paragraph that the abstract is from. The answer format must be like “Paragraph 1”, “Paragraph 2”, etc. |
Lcc | Task Template: |
Please complete the code given below. | |
{context} | |
Next line of code: | |
Repobench-P | Task Template: |
Please complete the code given below. | |
{context} | |
{question} | |
Next line of code: |
Ruler [9] is a collection of synthetic examples to evaluate long-context language models, containing four task categories to test behaviours beyond simple retrieval from context. Here is a detailed description of each task category:
Single Needle-In-A-Haystack (S-): In these tasks, a keyword sentence, called the "needle", is embedded within a lengthy text, called the "haystack". The objective of the task is to retrieve the needle from the context.
Multi-keys NIAH (MK-): These tasks are similar to S- tasks, with the difference being the presence of multiple "needles" inserted into the "haystack". However, the task is to retrieve only one of them.
Multi-values NIAH (MV-): Here, multiple "needles" in the form of key-value pairs are hidden in the "haystack". The task is to retrieve all values associated with a single key.
Multi-queries NIAH (MQ-): Here, multiple "needles" in the form of key-value pairs are inserted into the "haystack". All "needles" corresponding to the keys specified in the queries are to be retrieved.
Templates for these tasks can be found in Tables 11 and 12.
S-1 | Task Template: |
Some special magic numbers are hidden within the following text. Make sure to memorize it. I will quiz you about the numbers afterwards. | |
The grass is green. The sky is blue. The sun is yellow. Here we go. There and back again. | |
...... One of the special magic numbers for {word} is: {number}. ...... | |
What is the special magic number for {word} mentioned in the provided text? | |
The special magic number for {word} mentioned in the provided text is | |
S-2 | Task Template: |
Some special magic numbers are hidden within the following text. Make sure to memorize it. I will quiz you about the numbers afterwards. | |
Paul Graham Essays. | |
...... One of the special magic numbers for {word} is: {number}. ...... | |
What is the special magic number for {word} mentioned in the provided text? | |
The special magic number for {word} mentioned in the provided text is | |
S-3 | Task Template: |
Some special magic numbers are hidden within the following text. Make sure to memorize it. I will quiz you about the numbers afterwards. | |
Paul Graham Essays. | |
...... One of the special magic numbers for {word} is: {number}. ...... | |
What is the special magic number for {word} mentioned in the provided text? | |
The special magic number for {word} mentioned in the provided text is | |
MK-1 | Task Template: |
Some special magic numbers are hidden within the following text. Make sure to memorize it. I will quiz you about the numbers afterwards. | |
Paul Graham Essays. | |
...... One of the special magic numbers for {word-1} is: {number-1}. ...... | |
...... One of the special magic numbers for {word-2} is: {number-2}. ...... | |
...... One of the special magic numbers for {word-3} is: {number-3}. ...... | |
...... One of the special magic numbers for {word-4} is: {number-4}. ...... | |
What is the special magic number for {word-4} mentioned in the provided text? | |
The special magic number for {word-4} mentioned in the provided text is | |
MK-2 | Task Template: |
Some special magic numbers are hidden within the following text. Make sure to memorize it. I will quiz you about the numbers afterwards. | |
Paul Graham Essays. | |
...... One of the special magic numbers for {word-1} is: {number-1}. ...... | |
...... One of the special magic numbers for {word-2} is: {number-2}. ...... | |
...... One of the special magic numbers for {word-x} is: {number-x}. ...... | |
...... One of the special magic numbers for {word-n-1} is: {number-n-1}. ...... | |
...... One of the special magic numbers for {word-n} is: {number-n}. ...... | |
What is the special magic number for {word-x} mentioned in the provided text? | |
The special magic number for {word-x} mentioned in the provided text is | |
MK-3 | Task Template: |
Some special uuids are hidden within the following text. Make sure to memorize it. I will quiz you about the uuids afterwards. | |
Paul Graham Essays. | |
...... One of the special magic uuids for {uuid-1} is: {uuid-1}. ...... | |
...... One of the special magic uuid for {uuid-2} is: {uuid-2}. ...... | |
...... One of the special magic uuid for {uuid-x} is: {uuid-x}. ...... | |
...... One of the special magic uuid for {uuid-n-1} is: {uuid-n-1}. ...... | |
...... One of the special magic uuid for {uuid-n} is: {uuid-n}. ...... | |
What is the special magic number for {uuid-x} mentioned in the provided text? | |
The special magic number for {uuid-x} mentioned in the provided text is |
MV | Task Template: |
Some special magic numbers are hidden within the following text. Make sure to memorize it. I will quiz you about the numbers afterwards. | |
Paul Graham Essays. | |
...... One of the special magic numbers for {word} is: {number-1}. ...... | |
...... One of the special magic numbers for {word} is: {number-2}. ...... | |
...... One of the special magic numbers for {word} is: {number-3}. ...... | |
...... One of the special magic numbers for {word} is: {number-4}. ...... | |
What are all the special magic numbers for {word} mentioned in the provided text? | |
The special magic numbers for {word} mentioned in the provided text are | |
MQ | Task Template: |
Some special magic numbers are hidden within the following text. Make sure to memorize it. I will quiz you about the numbers afterwards. | |
Paul Graham Essays. | |
...... One of the special magic numbers for {word-1} is: {number-1}. ...... | |
...... One of the special magic numbers for {word-2} is: {number-2}. ...... | |
...... One of the special magic numbers for {word-3} is: {number-3}. ...... | |
...... One of the special magic numbers for {word-4} is: {number-4}. ...... | |
What are all the special magic numbers for {word-1}, {word-2}, {word-3}, and {word-4} mentioned in the provided text? | |
The special magic numbers for {word-1}, {word-2}, {word-3}, and {word-4} mentioned in the provided text are |
LLaMA-3.1-8B-Instruct results on LongBench at 50% and 70% compression. Non-adaptive methods: ChunkKV, Knorm, Streaming LLM, SnapKV, CurDKV; adaptive methods: AdaSnapKV, AdaCurDKV.

Compression | Task | ChunkKV | Knorm | Streaming LLM | SnapKV | CurDKV | AdaSnapKV | AdaCurDKV
---|---|---|---|---|---|---|---|---
50% | NrtvQA | 28.8 | 28.6 | 25.2 | 29.2 | 30.4 | 29.2 | 30.6 |
Qasper | 39.6 | 40.8 | 37.1 | 41.1 | 49.0 | 43.4 | 48.1 | |
MF-en | 43.7 | 49.3 | 32.0 | 45.4 | 54.9 | 51.5 | 52.4 | |
HotpotQA | 55.8 | 55.3 | 50.1 | 57.3 | 57.2 | 56.0 | 55.6 | |
2WikiMQA | 49.6 | 44.7 | 38.0 | 50.6 | 49.1 | 51.1 | 49.3 | |
Musique | 27.9 | 26.4 | 23.7 | 30.7 | 34.7 | 30.0 | 32.0 | |
Gov Report | 32.7 | 32.0 | 30.7 | 31.9 | 33.4 | 32.0 | 32.9 | |
QMSum | 23.0 | 24.3 | 21.9 | 23.8 | 24.7 | 24.3 | 25.2 | |
Multi News | 25.6 | 25.4 | 25.6 | 25.4 | 26.5 | 25.9 | 26.6 | |
TREC | 23.0 | 60.8 | 31.0 | 36.5 | 68.0 | 36.0 | 67.5 | |
TriviaQA | 84.9 | 88.4 | 92.0 | 85.5 | 92.3 | 86.1 | 92.9 | |
SAMSum | 39.8 | 34.3 | 37.5 | 40.0 | 40.7 | 40.5 | 41.2 | |
Pcount | 10.6 | 10.6 | 7.5 | 11.1 | 12.8 | 11.1 | 11.1 | |
Pre | 97.0 | 86.5 | 54.0 | 100.0 | 98.5 | 100.0 | 99.0 | |
Lcc | 50.9 | 32.6 | 48.8 | 53.3 | 49.9 | 51.1 | 50.4 | |
RB-P | 49.2 | 49.3 | 48.5 | 47.7 | 50.4 | 47.4 | 52.3 | |
Average | 42.6 | 43.1 | 37.7 | 44.3 | 48.3 | 44.7 | 47.9 | |
70% | NrtvQA | 30.9 | 25.9 | 24.2 | 27.3 | 30.5 | 29.1 | 30.4 |
Qasper | 31.6 | 32.5 | 27.8 | 30.8 | 43.2 | 34.9 | 44.2 | |
MF-en | 34.6 | 42.7 | 26.7 | 37.0 | 45.1 | 41.9 | 46.5 | |
HotpotQA | 53.6 | 49.1 | 43.5 | 54.8 | 51.7 | 55.6 | 54.6 | |
2WikiMQA | 38.5 | 33.3 | 30.9 | 46.9 | 45.2 | 47.5 | 45.5 | |
Musique | 25.8 | 19.7 | 20.8 | 27.5 | 29.3 | 29.9 | 30.7 | |
Gov Report | 30.6 | 29.4 | 29.2 | 29.9 | 31.2 | 29.3 | 30.7 | |
QMSum | 22.3 | 23.5 | 21.1 | 22.6 | 24.4 | 22.9 | 24.5 | |
Multi News | 23.5 | 23.8 | 24.2 | 23.8 | 26.0 | 24.4 | 25.9 | |
TREC | 15.0 | 54.5 | 31.5 | 39.0 | 58.0 | 34.0 | 63.7 | |
TriviaQA | 84.0 | 74.9 | 91.5 | 84.5 | 92.0 | 85.3 | 91.9 | |
SAMSum | 36.3 | 27.7 | 36.3 | 41.5 | 41.7 | 41.8 | 41.2 | |
Pcount | 7.6 | 9.1 | 6.3 | 10.1 | 10.8 | 10.1 | 9.7 | |
Pre | 92.5 | 58.5 | 36.5 | 93.0 | 98.5 | 96.0 | 98.5 | |
Lcc | 51.0 | 28.5 | 50.9 | 53.2 | 46.7 | 50.6 | 46.4 | |
RB-P | 49.4 | 48.6 | 49.9 | 47.8 | 51.6 | 45.9 | 52.8 | |
Average | 39.2 | 36.4 | 34.4 | 41.9 | 45.4 | 42.4 | 46.1 |
Mistral-7B-Instruct results on LongBench at 50% and 70% compression. Non-adaptive methods: ChunkKV, Knorm, Streaming LLM, SnapKV, CurDKV; adaptive methods: AdaSnapKV, AdaCurDKV.

Compression | Task | ChunkKV | Knorm | Streaming LLM | SnapKV | CurDKV | AdaSnapKV | AdaCurDKV
---|---|---|---|---|---|---|---|---
50% | NrtvQA | 24.6 | 20.5 | 23.5 | 24.3 | 25.6 | 24.4 | 25.1 |
Qasper | 30.4 | 29.0 | 30.8 | 32.6 | 39.4 | 30.3 | 39.2 | |
MF-en | 44.1 | 42.5 | 30.8 | 47.1 | 51.3 | 48.5 | 49.4 | |
HotpotQA | 47.0 | 45.5 | 40.8 | 47.2 | 42.7 | 48.5 | 44.8 | |
2WikiMQA | 35.0 | 31.3 | 30.5 | 36.5 | 39.5 | 36.8 | 37.7 | |
Musique | 25.1 | 22.5 | 18.6 | 23.0 | 27.7 | 26.0 | 26.9 | |
Gov Report | 32.4 | 30.0 | 31.7 | 31.6 | 33.2 | 31.7 | 31.4 | |
QMSum | 23.5 | 23.2 | 22.9 | 23.7 | 24.9 | 24.0 | 24.9 | |
Multi News | 25.2 | 24.7 | 25.1 | 25.2 | 26.6 | 25.5 | 26.7 | |
TREC | 48.8 | 40.7 | 56.0 | 50.0 | 69.0 | 51.0 | 72.0 | |
TriviaQA | 86.8 | 86.9 | 69.9 | 86.5 | 87.4 | 87.7 | 85.2 | |
SAMSum | 22.3 | 36.6 | 21.3 | 21.4 | 45.7 | 22.7 | 45.4 | |
Pcount | 7.0 | 4.2 | 1.5 | 4.7 | 5.9 | 5.8 | 6.6 | |
Pre | 96.0 | 67.5 | 53.5 | 96.5 | 78.0 | 97.5 | 68.5 | |
Lcc | 52.4 | 32.8 | 52.4 | 52.2 | 47.8 | 52.1 | 57.4 | |
RB-P | 55.5 | 54.3 | 53.9 | 56.8 | 55.3 | 57.3 | 58.7 | |
Average | 41.0 | 37.0 | 35.2 | 41.2 | 43.8 | 41.9 | 43.7 | |
70% | NrtvQA | 21.3 | 16.5 | 20.4 | 23.0 | 21.02 | 23.2 | 19.1 |
Qasper | 21.4 | 19.2 | 21.6 | 24.2 | 35.1 | 25.1 | 35.0 | |
MF-en | 38.2 | 34.1 | 27.8 | 41.9 | 46.8 | 42.1 | 46.0 | |
HotpotQA | 44.1 | 38.9 | 38.1 | 43.9 | 38.2 | 45.5 | 34.3 | |
2WikiMQA | 33.6 | 26.0 | 26.9 | 29.2 | 32.3 | 31.7 | 33.9 | |
Musique | 20.9 | 18.9 | 16.7 | 21.2 | 18.0 | 21.9 | 15.3 | |
Gov Report | 29.9 | 27.1 | 29.6 | 29.7 | 27.0 | 29.2 | 23.3 | |
QMSum | 21.9 | 21.6 | 22.1 | 22.2 | 23.6 | 22.4 | 23.1 | |
Multi News | 23.5 | 22.6 | 23.6 | 23.7 | 25.4 | 24.3 | 25.1 | |
TREC | 38.3 | 36.0 | 46.0 | 46.0 | 65.5 | 48.8 | 66.0 | |
TriviaQA | 87.6 | 86.3 | 62.8 | 86.8 | 59.4 | 87.0 | 52.4 | |
SAMSum | 25.4 | 39.6 | 25.7 | 22.2 | 42.4 | 23.0 | 42.1 | |
Pcount | 5.6 | 3.3 | 2.5 | 5.9 | 5.1 | 4.2 | 5.7 | |
Pre | 91.0 | 38.3 | 36.0 | 92.8 | 25.5 | 94.5 | 11.0 | |
Lcc | 54.1 | 29.5 | 52.2 | 53.54 | 43.6 | 53.0 | 55.3 | |
RB-P | 55.3 | 55.0 | 53.1 | 56.0 | 52.2 | 56.5 | 57.0 | |
Average | 38.2 | 32.0 | 31.6 | 38.9 | 35.1 | 39.5 | 34.0 |
Non-adaptive methods: ChunkKV, Knorm, Streaming LLM, SnapKV, CurDKV; adaptive methods: AdaSnapKV, AdaCurDKV.

Compression | SubTask | ChunkKV | Knorm | Streaming LLM | SnapKV | CurDKV | AdaSnapKV | AdaCurDKV
---|---|---|---|---|---|---|---|---
50% | S-1 | 100.0 | 100.0 | 38.7 | 100.0 | 100.0 | 100.0 | 100.0 |
S-2 | 100.0 | 96.5 | 40.4 | 98.3 | 100.0 | 100.0 | 100.0 | |
S-3 | 88.2 | 9.8 | 51.0 | 11.8 | 25.5 | 60.8 | 54.9 | |
MK-1 | 95.8 | 70.8 | 56.3 | 100.0 | 100.0 | 100.0 | 100.0 | |
MK-2 | 40.0 | 6.7 | 48.9 | 40.0 | 82.2 | 93.3 | 95.6 | |
MK-3 | 29.8 | 12.8 | 36.2 | 19.2 | 61.7 | 46.8 | 68.1 | |
MQ | 95.2 | 71.6 | 50.5 | 85.6 | 99.0 | 98.6 | 98.1 | |
MV | 84.2 | 80.3 | 48.0 | 80.9 | 96.7 | 100.0 | 100.0 | |
Average | 79.2 | 56.1 | 46.2 | 67.0 | 83.1 | 87.4 | 89.6 | |
70% | S-1 | 100.0 | 100.0 | 16.1 | 93.6 | 100.0 | 100.0 | 100.0 |
S-2 | 89.5 | 50.9 | 26.3 | 86.0 | 98.3 | 100.0 | 98.3 | |
S-3 | 70.6 | 2.0 | 31.4 | 3.9 | 9.8 | 15.7 | 13.7 | |
MK-1 | 79.2 | 33.3 | 35.4 | 75.0 | 87.5 | 97.9 | 97.9 | |
MK-2 | 20.0 | 0.0 | 22.2 | 24.4 | 44.4 | 53.3 | 57.8 | |
MK-3 | 17.0 | 8.5 | 19.2 | 8.5 | 23.4 | 0.0 | 4.3 | |
MQ | 79.8 | 26.4 | 33.7 | 49.5 | 70.2 | 94.7 | 94.7 | |
MV | 56.6 | 40.8 | 25.0 | 42.8 | 62.5 | 92.8 | 91.5 | |
Average | 64.1 | 32.7 | 26.2 | 48.0 | 62.0 | 69.3 | 69.8 |
Non-adaptive methods: ChunkKV, Knorm, Streaming LLM, SnapKV, CurDKV; adaptive methods: AdaSnapKV, AdaCurDKV.

Compression | SubTask | ChunkKV | Knorm | Streaming LLM | SnapKV | CurDKV | AdaSnapKV | AdaCurDKV
---|---|---|---|---|---|---|---|---
50% | S-1 | 87.1 | 95.2 | 37.1 | 50.0 | 69.4 | 69.4 | 71.0 |
S-2 | 86.0 | 0.0 | 40.4 | 45.6 | 93.0 | 56.1 | 89.5 | |
S-3 | 33.3 | 0.0 | 51.0 | 2.0 | 21.6 | 2.0 | 15.7 | |
MK-1 | 77.1 | 0.0 | 50.0 | 20.8 | 79.2 | 43.8 | 77.1 | |
MK-2 | 35.6 | 0.0 | 48.9 | 24.4 | 15.6 | 53.3 | 33.3 | |
MK-3 | 25.5 | 0.0 | 14.9 | 12.8 | 2.1 | 38.3 | 10.6 | |
MQ | 79.8 | 0.5 | 40.9 | 16.4 | 70.7 | 33.7 | 70.2 | |
MV | 79.0 | 0.7 | 47.4 | 19.1 | 82.2 | 32.9 | 81.6 | |
Average | 62.9 | 12.0 | 41.3 | 23.9 | 54.2 | 41.2 | 56.1 | |
70% | S-1 | 88.7 | 80.7 | 14.5 | 45.2 | 19.4 | 69.4 | 12.9 |
S-2 | 56.1 | 0.0 | 26.3 | 24.6 | 68.4 | 40.4 | 52.6 | |
S-3 | 11.8 | 0.0 | 31.4 | 2.0 | 0.0 | 2.0 | 0.0 | |
MK-1 | 35.4 | 0.0 | 27.1 | 18.8 | 35.4 | 20.8 | 37.5 | |
MK-2 | 13.3 | 0.0 | 17.8 | 4.4 | 0.0 | 24.4 | 2.2 | |
MK-3 | 17.0 | 0.0 | 10.6 | 8.5 | 0.0 | 19.2 | 0.0 | |
MQ | 48.6 | 0.0 | 26.0 | 12.0 | 64.4 | 13.9 | 26.9 | |
MV | 44.7 | 0.0 | 25.0 | 15.1 | 63.8 | 17.8 | 43.4 | |
Average | 39.5 | 10.1 | 22.3 | 16.3 | 31.4 | 26.0 | 21.9 |
Compression | Task | Key | Key+random | Key-value | Key-value+random | Value | Value+random
---|---|---|---|---|---|---|---
30% | NrtvQA | 24.0 | 24.9 | 31.6 | 31.9 | 30.6 | 30.1 |
Qasper | 24.8 | 27.5 | 49.5 | 48.1 | 48.4 | 48.5 | |
MF-en | 40.9 | 43.2 | 57.1 | 55.6 | 56.3 | 56.9 | |
HotpotQA | 51.6 | 54.5 | 58.0 | 56.9 | 57.0 | 58.0 | |
2WikiMQA | 38.7 | 44.9 | 48.6 | 48.4 | 50.3 | 50.7 | |
Musique | 23.4 | 28.3 | 31.0 | 34.2 | 33.2 | 32.7 | |
Gov Report | 30.9 | 32.0 | 33.7 | 33.6 | 34.6 | 34.3 | |
QMSum | 24.3 | 24.4 | 24.8 | 25.2 | 25.1 | 24.7 | |
Multi News | 10.6 | 11.1 | 26.6 | 26.8 | 26.7 | 26.9 | |
TREC | 33.0 | 25.0 | 65.5 | 54.5 | 66.5 | 62.5 | |
TriviaQA | 91.0 | 90.9 | 92.2 | 92.1 | 92.4 | 91.7 | |
SAMSum | 36.8 | 38.8 | 41.6 | 39.9 | 40.2 | 40.0 | |
Pcount | 2.5 | 7.0 | 8.0 | 9.2 | 11.4 | 12.1 | |
Pre | 81.5 | 92.5 | 99.0 | 98.0 | 100.0 | 99.0 | |
Lcc | 52.9 | 53.5 | 51.1 | 51.6 | 50.7 | 51.4 | |
RB-P | 51.7 | 50.4 | 50.1 | 50.0 | 49.9 | 49.4 | |
Average | 38.7 | 40.6 | 48.0 | 47.2 | 48.3 | 48.0 | |
90% | NrtvQA | 20.4 | 21.7 | 28.5 | 20.9 | 27.8 | 25.4 |
Qasper | 15.3 | 15.6 | 28.5 | 25.8 | 29.7 | 24.9 | |
MF-en | 26.7 | 28.4 | 53.3 | 53.2 | 34.7 | 34.3 | |
HotpotQA | 25.1 | 28.0 | 38.4 | 39.7 | 47.2 | 46.9 | |
2WikiMQA | 20.9 | 21.7 | 33.3 | 30.4 | 30.7 | 31.1 | |
Musique | 9.5 | 10.8 | 20.0 | 18.7 | 19.7 | 21.9 | |
Gov Report | 14.7 | 19.1 | 24.1 | 24.0 | 26.8 | 25.9 | |
QMSum | 18.9 | 19.9 | 21.5 | 21.8 | 22.3 | 22.3 | |
Multi News | 8.6 | 15.2 | 22.1 | 21.9 | 23.1 | 22.8 | |
TREC | 1.5 | 6.5 | 7.0 | 15.3 | 22.5 | 16.0 | |
TriviaQA | 87.0 | 88.4 | 89.9 | 90.6 | 90.7 | 91.7 | |
SAMSum | 32.5 | 32.2 | 30.4 | 27.9 | 37.2 | 33.2 | |
Pcount | 0.0 | 0.0 | 1.0 | 1.0 | 5.7 | 5.6 | |
Pre | 3.5 | 6.0 | 25.0 | 26.0 | 61.0 | 49.0 | |
Lcc | 33.5 | 32.0 | 39.7 | 38.2 | 34.9 | 35.3 | |
RB-P | 55.3 | 55.1 | 54.0 | 53.6 | 49.6 | 51.9 | |
Average | 23.3 | 25.0 | 32.3 | 31.8 | 35.2 | 33.7 |
We evaluate the performance of CurDKV and AdaCurDKV against baseline compression methods at intermediate compression ratios of 50% and 70%, using both LLaMA-3.1-8B and Mistral-7B (Tables 13 and 14, respectively).
On LLaMA at 50% compression, CurDKV achieves the highest average accuracy (48.3), outperforming all non-adaptive baselines, including SnapKV (44.3), Knorm (43.1), and Streaming LLM (37.7). The adaptive variant AdaCurDKV also performs competitively, with an average score of 47.9, surpassing AdaSnapKV (44.7). At 70% compression, CUR-based approaches continue to perform favorably: on LLaMA, CurDKV again achieves the highest average (45.4), ahead of SnapKV (41.9) and Knorm (36.4), confirming its robustness across moderate compression levels. AdaCurDKV further improves upon AdaSnapKV (42.4) with an average of 46.1, marking consistent gains across adaptive strategies.
On Mistral at 50% compression, CurDKV again delivers the best average score (43.8), outperforming SnapKV (41.2) and ChunkKV (41.0). AdaCurDKV (43.7) also surpasses AdaSnapKV (41.9), benefiting in particular from adaptive budget reallocation on heterogeneous tasks such as TREC (72.0) and SAMSum (45.4). Across both models, CUR-based methods show superior stability, particularly in structured reasoning and information-dense settings.
Tables 15 and 16 report the results of the different KV compression methods with the LLaMA and Mistral models, respectively, at 50% and 70% compression ratios.
With LLaMA at 50% compression, CurDKV achieves a strong average of 83.1, outperforming all non-adaptive baselines, including ChunkKV (79.2) and SnapKV (67.0). The adaptive variant, AdaCurDKV, further improves performance to 89.6, benefiting from its dynamic budget allocation across heads. Notably, AdaCurDKV achieves perfect accuracy (100.0) on 4 of the 8 subtasks and exceeds 95 on two more, demonstrating its ability to preserve high-salience tokens even at moderate compression. At 70% compression, the relative advantage of CUR-based methods persists: CurDKV scores an average of 62.0, substantially ahead of SnapKV (48.0) and Knorm (32.7), though falling short of the best non-adaptive baseline, ChunkKV (64.1). AdaCurDKV continues to perform best overall with an average of 69.8, highlighting its robustness in token-starved regimes. These results reinforce the effectiveness of CUR-based compression for preserving the semantically critical tokens required for pinpoint retrieval.
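For readers unfamiliar with head-adaptive budgeting, the sketch below shows one common way such an allocation can be realized: a single global top-k over all heads' token scores, which implicitly gives heads with more salient tokens a larger share of the cache. This is a hypothetical illustration in the spirit of the AdaKV-style press referenced in the footnotes; the exact allocation used by AdaCurDKV may differ.

```python
import torch

def adaptive_head_budgets(scores: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Hypothetical AdaKV-style allocation: one global top-k over all heads'
    token scores, so heads with more salient tokens keep more entries.

    scores: (num_heads, num_tokens) importance scores (e.g., leverage scores).
    Returns a boolean keep-mask of the same shape; its row sums are the
    per-head budgets induced by the global selection.
    """
    h, n = scores.shape
    total_keep = int(keep_ratio * h * n)
    top = torch.topk(scores.flatten(), total_keep).indices
    mask = torch.zeros(h * n, dtype=torch.bool, device=scores.device)
    mask[top] = True
    return mask.view(h, n)
```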
On Mistral at 50% compression, CurDKV achieves an average score of 54.2, considerably higher than SnapKV (23.9), Knorm (12.0), and Streaming LLM (41.3). AdaCurDKV pushes this further to 56.1, showing consistent gains across nearly all subtasks.
Despite minor performance drops in some settings, these experiments confirm that CUR-based selection, especially when adaptively guided, provides a robust and effective mechanism for maintaining retrieval fidelity under tight memory constraints, outperforming both norm-based and attention-based KV selection across two architectures and multiple compression levels.
We perform two ablation studies to investigate the key design decisions behind CurDKV: (1) the module on which leverage scores are computed, and (2) the effect of using randomized projections for score approximation. While our algorithm uses combined leverage scores computed from both the key and value vectors (line 5 of Algorithm 3), in our ablations we study the effect of using only the key or only the value vectors to compute leverage scores. We also run experiments in which leverage scores are computed on the original key/value matrices without the Gaussian projection (i.e., we drop lines 3–4 of Algorithm 3).
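To make these two ablation axes concrete, the following minimal PyTorch sketch shows how sketched leverage scores and the combined key-value selection could be implemented. The function names, the default projection rank of 32, and the QR-based estimator are our own illustrative choices, not the paper's released implementation.

```python
import torch

def approx_leverage_scores(M: torch.Tensor, rank: int = 32) -> torch.Tensor:
    """Estimate row leverage scores of M (tokens x head_dim) from a Gaussian sketch.

    Illustrative sketch: project M to `rank` dimensions with a Gaussian random
    matrix (the step removed in the "no projection" ablation), orthonormalize
    the sketch, and use squared row norms of the basis as leverage estimates.
    """
    M32 = M.float()                                   # QR prefers fp32
    G = torch.randn(M32.shape[1], rank, device=M.device) / rank ** 0.5
    Q, _ = torch.linalg.qr(M32 @ G)                   # orthonormal basis, (tokens, rank)
    return (Q ** 2).sum(dim=-1)                       # one score per token

def select_tokens(K: torch.Tensor, V: torch.Tensor, budget: int) -> torch.Tensor:
    """Return indices of the `budget` tokens with the largest combined score."""
    scores = approx_leverage_scores(K) * approx_leverage_scores(V)  # elementwise key-value product
    # The key-only / value-only ablation variants replace `scores` with a single factor.
    return torch.topk(scores, budget).indices.sort().values
```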
We first ablate the module on which leverage scores are computed. Figure 4 shows the average performance (task-level scores reported in Table 17) when using key-only, value-only, and the default key-value product (elementwise product of key and value leverage scores). All three variants perform similarly at 30% compression. However, at 90% compression, the differences become more pronounced: key-only scores degrade significantly, while value and key-value product maintain higher median accuracy and robustness. This confirms that value information is more predictive of output quality, and that combining key and value structure yields the most reliable token selection strategy.
Next, we evaluate the role of randomized leverage score approximation. This determines whether the key and value matrices are projected to a low-rank space via Gaussian random matrices before computing leverage scores. We observe marginal differences between the two variants. At 30% compression, both versions perform similarly in median accuracy, though random projection slightly improves the worst-case performance. At 90% compression, however, random projections result in modest improvements in median and lower-quartile accuracy. These gains suggest that random projections can help stabilize leverage score estimation under extreme compression, but their effect is secondary to the choice of leverage module.
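For completeness, the no-projection variant studied here can be sketched as exact leverage scores from a thin SVD of the unprojected matrix; again, this is an illustrative reimplementation under the same assumptions as the sketch above, not the authors' code.

```python
import torch

def exact_leverage_scores(M: torch.Tensor, rank: int = 32) -> torch.Tensor:
    """Exact rank-`rank` leverage scores: squared row norms of the top-`rank`
    left singular vectors of the unprojected matrix (no Gaussian sketch)."""
    U, _, _ = torch.linalg.svd(M.float(), full_matrices=False)
    return (U[:, :rank] ** 2).sum(dim=-1)

# Swapping this in for the sketched estimator above reproduces the
# "no random projection" ablation, at the cost of a full SVD per head.
```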
We analyze the runtime behavior of CurDKV and SnapKV under varying compression ratios and input lengths, as illustrated in Figures 6 and 7. We focus on three key metrics: KV cache size, prefill time, and generation time.
Both methods exhibit a nearly linear reduction in KV cache size as compression increases. For example, with a 128K-token input, the cache size shrinks from approximately 16 GB at 0% compression to under 2 GB at 80% compression for both CurDKV and SnapKV (Figures 6a and 7a). This confirms that CUR-based token selection inherits the cache-sparsity advantages of heuristic methods like SnapKV.
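As a sanity check on these magnitudes, a back-of-the-envelope estimate, assuming LLaMA-3.1-8B-style GQA with 32 layers, 8 KV heads, head dimension 128, and fp16 storage, reproduces the roughly 16 GB uncompressed cache:

```python
# Back-of-the-envelope KV cache size for a 128K-token context, assuming
# LLaMA-3.1-8B-style GQA: 32 layers, 8 KV heads, head_dim 128, fp16 storage.
tokens, layers, kv_heads, head_dim, bytes_fp16 = 128 * 1024, 32, 8, 128, 2
cache_bytes = tokens * layers * kv_heads * head_dim * 2 * bytes_fp16  # x2 for keys and values
print(f"{cache_bytes / 2**30:.0f} GiB")  # -> 16 GiB at 0% compression
# Retaining only a fraction of the tokens shrinks the cache roughly in
# proportion to the kept budget, matching the trend in Figures 6a and 7a.
```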
SnapKV maintains flat prefill latency regardless of compression; for instance, at 128K tokens, prefill time remains consistently around 10 seconds across all compression levels (Figure 7b). CurDKV, by contrast, incurs higher latency due to leverage score computation, rising from 10 seconds at 0% compression to about 14–15 seconds at 60–80% compression for 128K tokens (Figure 6b). For shorter sequences such as 4K or 16K tokens, the difference is marginal, under 1 second for both methods.
CurDKV demonstrates a stronger reduction in generation time, particularly for long sequences. For 128K-token inputs, generation latency drops from \(\sim\)12 seconds at 0% compression to just \(\sim\)4.5 seconds at 80% compression (Figure 6c). SnapKV also sees a drop, from \(\sim\)15 seconds to \(\sim\)6 seconds, but exhibits irregular behavior for mid-length sequences (e.g., 64K shows a flat profile between 40–80% compression in Figure 7c).
While SnapKV is faster to prefill, CurDKV yields larger reductions in generation time, especially for long sequences (e.g., 128K tokens) and high compression levels. Its additional prefill cost, roughly 3–5 seconds at worst, is offset by smoother and more aggressive runtime scaling, making it the more efficient choice for long-context inference when both accuracy and latency matter.
We have uploaded the source code and datasets as supplementary material; we are committed to releasing them upon acceptance of the paper.
The Gaussian projection is motivated by the construction in Theorem 19 of [17].
Refer to https://github.com/NVIDIA/kvpress/blob/main/kvpress/presses/adakv_press.py for an implementation of the adaptive compression logic.