Speeding up long-context LLMs by indexing once — "CLSA" cross-layer sparse attention
Long-context inference is bottlenecked by decoding efficiency, especially for reasoning models that emit long chains of thought. Existing sparse-attention methods face an efficiency-quality trade-off. CLSA, built on KV-sharing architectures such as YOCO, shares not just the KV cache but also the routing index across layers, computing top-k selection once and reusing it. At 128K context it reaches up to 7.6x decoding speedup and 17.1x overall throughput improvement.
Paper overview (our summary)
- Field (arXiv category): cs.CL (+2)
- Authors: Yutao Sun, Yanqi Zhang, Li Dong, et al. (5)
- Submitted: 2026-06-04
- arXiv ID: 2606.06467v1
Key points
- Targets the decoding-efficiency bottleneck of long-context LLMs (especially reasoning models)
- Shares the routing index (not just the KV cache) across layers, on top of KV-sharing (YOCO)
- A single indexer computes top-k once and reuses it — keeping selectivity while amortizing routing
- Jointly improves pre-filling, KV-cache storage, and long-context decoding
- Up to 7.6x decoding speedup and 17.1x overall throughput at 128K context
This work (CLSA) speeds up inference for LLMs handling long contexts.
Long-context inference in modern LLMs is increasingly constrained by decoding efficiency, especially in reasoning-heavy settings where models generate long intermediate chains of thought. Existing sparse-attention methods face a practical efficiency-quality trade-off: structured block-sparse methods accelerate strongly but lose noticeable quality, while token-sparse methods are usually more accurate yet deliver limited end-to-end speedup because top-k routing over the full cache remains expensive.
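The quality side of this trade-off can be illustrated with a toy comparison. The sketch below is not from the paper: the function names, the relevance scores, and the choice to score a block by its maximum token score are all assumptions made for illustration. Real block-sparse methods typically score pooled block representations rather than every token, which is where their speed comes from; here the block scores are derived from token scores only to keep the example short.

```python
import heapq

def token_topk(scores, k):
    """Token-sparse selection: keep the k highest-scoring positions."""
    best = heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)
    return sorted(best)

def block_topk(scores, block_size, n_blocks):
    """Block-sparse selection: keep whole contiguous blocks.

    Assumption: a block is scored by its maximum token score.
    """
    starts = range(0, len(scores), block_size)
    block_scores = [max(scores[s:s + block_size]) for s in starts]
    best = heapq.nlargest(n_blocks, range(len(block_scores)),
                          key=block_scores.__getitem__)
    selected = []
    for b in sorted(best):
        end = min((b + 1) * block_size, len(scores))
        selected.extend(range(b * block_size, end))
    return selected

# Hypothetical relevance scores for 12 cached tokens (3 blocks of 4).
scores = [0.9, 0.1, 0.0, 0.2,
          0.1, 0.8, 0.1, 0.0,
          0.7, 0.0, 0.6, 0.1]

tokens = token_topk(scores, 4)      # [0, 5, 8, 10]
blocks = block_topk(scores, 4, 1)   # [0, 1, 2, 3]
```

With the same budget of four cached tokens, the token-level selection picks the four most relevant positions wherever they sit, while the block-level selection must take one contiguous block and so spends part of its budget on low-scoring neighbors. The price of the token-level result is that every cached token has to be scored at each step.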
The authors propose cross-layer sparse attention (CLSA), built on top of KV-sharing architectures such as YOCO. The core idea is to share not only the KV cache across cross-decoder layers, but also the routing index. A single indexer computes token-level top-k selection once and reuses the resulting index across layers, preserving the fine-grained selectivity of token-sparse attention while amortizing the routing overhead. The resulting architecture improves all major inference bottlenecks jointly — pre-filling, KV-cache storage, and long-context decoding.
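The shared-index idea can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `decode_step`, `index_query`, and `index_keys` are hypothetical, the indexer is assumed to score tokens with a simple dot product, and random vectors stand in for real activations. The summary does not describe the actual indexer design.

```python
import heapq
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def topk_indices(scores, k):
    """Indices of the k largest scores, returned in position order."""
    best = heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)
    return sorted(best)

def sparse_attention(query, keys, values, index):
    """Softmax attention restricted to the cached positions in `index`."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, keys[i]) * scale for i in index]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, index)) / total
            for d in range(dim)]

def decode_step(layer_queries, index_query, index_keys, keys, values, k):
    """One decoding step: route once, then reuse the index in every layer."""
    # Single indexer pass over the full cache (assumed dot-product scoring).
    scores = [dot(index_query, ik) for ik in index_keys]
    shared_index = topk_indices(scores, k)
    # Each layer attends over the same shared KV cache at the same positions.
    outputs = [sparse_attention(q, keys, values, shared_index)
               for q in layer_queries]
    return shared_index, outputs

# Toy sizes and random data, purely for illustration.
random.seed(0)
n_tokens, dim, n_layers, k = 64, 8, 4, 16
rand_vec = lambda: [random.gauss(0.0, 1.0) for _ in range(dim)]
keys = [rand_vec() for _ in range(n_tokens)]
values = [rand_vec() for _ in range(n_tokens)]
index_keys = [rand_vec() for _ in range(n_tokens)]
layer_queries = [rand_vec() for _ in range(n_layers)]

shared_index, outputs = decode_step(
    layer_queries, rand_vec(), index_keys, keys, values, k)
```

In this sketch the scan over all `n_tokens` cached entries happens once per decoding step instead of once per layer, and each layer then touches only `k` entries. That is the amortization the paragraph above describes; how much it saves in practice depends on details this toy version leaves out.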
Across short- and long-context benchmarks, CLSA is both accurate and efficient, achieving up to 7.6x decoding speedup and 17.1x overall throughput improvement at 128K context. The authors present it as a more complete architectural solution for long-context LLMs, one that jointly advances model quality and inference efficiency.
Why it matters
This research is directly relevant to cutting inference cost for long-context and reasoning models. For readers tracking KV-cache optimization, sparse attention, and LLM inference efficiency, it is a useful read on designs that balance quality and speed.
FAQ
What is sparse attention?
Why does it help reasoning models?
Sources (primary)
Source: arXiv (descriptive metadata is CC0 public domain). Summaries are our own; see arXiv for the original text and PDF.
- arXiv abstract page (original, official)
- PDF (arXiv)
- arXiv ID: 2606.06467