cs.CL cs.AI cs.LG

Speeding up long-context LLMs by indexing once — "CLSA" cross-layer sparse attention

cs.CL Yutao Sun, Yanqi Zhang, Li Dong, et al. (5) Jun 2026

Long-context inference is bottlenecked by decoding efficiency, especially for reasoning models that emit long chains of thought. Existing sparse attention faces an efficiency-quality trade-off. CLSA, built on KV-sharing (YOCO), shares not just the KV cache but the routing index across layers — computing top-k selection once and reusing it. At 128K context it reaches up to 7.6x decoding speedup and 17.1x overall throughput.

Paper overview (our summary)

  • Field (arXiv category): cs.CL (+2)
  • Authors: Yutao Sun, Yanqi Zhang, Li Dong, et al. (5)
  • Submitted: 2026-06-04
  • arXiv ID: 2606.06467v1

Key points

  • Targets the decoding-efficiency bottleneck of long-context LLMs (especially reasoning models)
  • Shares the routing index (not just the KV cache) across layers, on top of KV-sharing (YOCO)
  • A single indexer computes top-k once and reuses it — keeping selectivity while amortizing routing
  • Jointly improves pre-filling, KV-cache storage, and long-context decoding
  • Up to 7.6x decoding speedup and 17.1x overall throughput at 128K context

This work (CLSA) speeds up inference for LLMs handling long contexts.

Long-context inference in modern LLMs is increasingly constrained by decoding efficiency, especially in reasoning-heavy settings where models generate long intermediate chains of thought. Existing sparse-attention methods face a practical efficiency-quality trade-off: structured block-sparse methods deliver large speedups but lose noticeable quality, while token-sparse methods are usually more accurate yet deliver limited end-to-end speedup because top-k routing over the full cache remains expensive.

The authors propose cross-layer sparse attention (CLSA), built on top of KV-sharing architectures such as YOCO. The core idea is to share not only the KV cache across cross-decoder layers, but also the routing index. A single indexer computes token-level top-k selection once and reuses the resulting index across layers, preserving the fine-grained selectivity of token-sparse attention while amortizing the routing overhead. The resulting architecture improves all major inference bottlenecks jointly — pre-filling, KV-cache storage, and long-context decoding.
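The route-once, reuse-everywhere idea can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the dot-product indexer, and the toy dimensions are all hypothetical, and the per-layer queries are drawn independently here, whereas in a real model each layer's query depends on the previous layer's output.

```python
import numpy as np

def softmax(x):
    x = x - x.max()                       # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum()

def shared_topk_index(q_index, k_index, k):
    """Score every cached token once and return the indices of the k best."""
    scores = k_index @ q_index            # shape: (context_len,)
    k = min(k, scores.shape[0])
    # argpartition finds the k largest scores without a full sort
    return np.argpartition(scores, -k)[-k:]

def sparse_attention(q, keys, values, index):
    """Attend only to the cached tokens selected by `index`."""
    k_sel = keys[index]                   # (k, d)
    v_sel = values[index]                 # (k, d)
    weights = softmax(k_sel @ q / np.sqrt(q.shape[0]))
    return weights @ v_sel                # (d,)

def decode_step(layer_queries, q_index, k_index, keys, values, k):
    """One decoding step: route once, then reuse the index in every layer."""
    index = shared_topk_index(q_index, k_index, k)   # computed a single time
    outputs = [sparse_attention(q, keys, values, index) for q in layer_queries]
    return outputs, index

# Toy setup (all sizes are hypothetical). `keys` and `values` stand in for
# the KV cache shared across layers; `k_index` holds the indexer's keys.
rng = np.random.default_rng(0)
context_len, d, n_layers, k = 1024, 16, 4, 32
keys = rng.standard_normal((context_len, d))
values = rng.standard_normal((context_len, d))
k_index = rng.standard_normal((context_len, d))
q_index = rng.standard_normal(d)
layer_queries = [rng.standard_normal(d) for _ in range(n_layers)]

outputs, index = decode_step(layer_queries, q_index, k_index, keys, values, k)
```

In this sketch the only operation that touches all `context_len` cached tokens is the single call to `shared_topk_index`; every layer then works on just `k` selected tokens, which is the amortization the paragraph above describes.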

Across short- and long-context benchmarks, the authors report that CLSA is both accurate and efficient, achieving up to 7.6x decoding speedup and 17.1x overall throughput improvement at 128K context. They present it as a more complete architectural solution for long-context LLMs, one that jointly advances model quality and inference efficiency.

Why it matters

This research is directly relevant to cutting inference cost for long-context and reasoning models. For readers tracking KV-cache optimization, sparse attention, and LLM inference efficiency, it is a useful read on designs that balance quality and speed.

FAQ

What is sparse attention?
Instead of attending to every token, sparse attention attends to a selected subset to cut computation. It helps with long contexts, but how tokens are selected determines the quality-speed balance.

Why does it help reasoning models?
Reasoning models generate long chains of thought, which makes decoding expensive. Reusing the routing index across layers reduces per-layer overhead and speeds up long-context generation.
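
As a rough illustration of why reusing the index reduces overhead, the following sketch counts how many cached tokens must be scored for routing while decoding one token. It is a back-of-the-envelope model only: the layer count is a hypothetical value, and the resulting ratio covers routing work alone, so it is not comparable to the measured speedups quoted above.

```python
def routing_scores_per_token(context_len, n_sparse_layers, shared_index):
    """Count the cached-token scores evaluated for routing per decoded token."""
    # With a shared index, routing runs once; otherwise once per sparse layer.
    routing_passes = 1 if shared_index else n_sparse_layers
    return routing_passes * context_len

context_len = 128 * 1024      # 128K-token context, as in the summary
n_sparse_layers = 16          # hypothetical layer count, for illustration only

per_layer = routing_scores_per_token(context_len, n_sparse_layers, shared_index=False)
shared = routing_scores_per_token(context_len, n_sparse_layers, shared_index=True)
ratio = per_layer / shared    # equals n_sparse_layers in this simple model
```

Under this model the routing work shrinks by a factor equal to the number of layers that reuse the index, while the attention over the selected tokens is unchanged.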

Sources (primary)

Source: arXiv (descriptive metadata is CC0 public domain). Summaries are our own; see arXiv for the original text and PDF.
