cs.CL cs.AI cs.LG

Speeding up long-context LLMs by indexing once — "CLSA" cross-layer sparse attention

cs.CL ・Yutao Sun, Yanqi Zhang, Li Dong, et al. (5) ・Jun 2026

Long-context inference is bottlenecked by decoding efficiency, especially for reasoning models that emit long chains of thought. Existing sparse attention faces an efficiency-quality trade-off. CLSA, built on KV-sharing (YOCO), shares not just the KV cache but the routing index across layers — computing top-k selection once and reusing it. At 128K context it reaches up to 7.6x decoding speedup and 17.1x overall throughput.

Paper overview (our summary)

Field (arXiv category): cs.CL (+2)
Authors: Yutao Sun, Yanqi Zhang, Li Dong, et al. (5)
Submitted: 2026-06-04
arXiv ID: 2606.06467v1

Key points

Targets the decoding-efficiency bottleneck of long-context LLMs (especially reasoning models)
Shares the routing index (not just the KV cache) across layers, on top of KV-sharing (YOCO)
A single indexer computes top-k once and reuses it — keeping selectivity while amortizing routing
Jointly improves pre-filling, KV-cache storage, and long-context decoding
Up to 7.6x decoding speedup and 17.1x overall throughput at 128K context

This work (CLSA) speeds up inference for LLMs handling long contexts.

Long-context inference in modern LLMs is increasingly constrained by decoding efficiency, especially in reasoning-heavy settings where models generate long intermediate chains of thought. Existing sparse-attention methods face a practical efficiency-quality trade-off: structured block-sparse methods accelerate strongly but lose noticeable quality, while token-sparse methods are usually more accurate yet deliver limited end-to-end speedup because top-k routing over the full cache remains expensive.
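The trade-off described above can be illustrated with a toy sketch, which is not taken from the paper. The function names `token_topk` and `block_topk`, the max-per-block scoring rule, and the score values are all assumptions chosen for illustration. With the same budget of four tokens, token-level selection picks the four relevant positions, while block-level selection must keep one whole contiguous block.

```python
from typing import List


def token_topk(scores: List[float], k: int) -> List[int]:
    """Token-sparse routing: keep the k highest-scoring token positions.

    Every cached token is scored and ranked, which is why routing over
    a long cache is expensive.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


def block_topk(scores: List[float], block_size: int, num_blocks: int) -> List[int]:
    """Block-sparse routing: rank contiguous blocks, keep whole blocks.

    Each block is scored by its maximum token score (an assumption made
    for this sketch; real methods use various block summaries).
    """
    starts = range(0, len(scores), block_size)
    block_scores = [(max(scores[s:s + block_size]), s) for s in starts]
    kept = sorted(block_scores, reverse=True)[:num_blocks]
    selected: List[int] = []
    for _, s in kept:
        selected.extend(range(s, min(s + block_size, len(scores))))
    return sorted(selected)


# Hypothetical relevance scores for 12 cached tokens; the relevant
# tokens (0.9, 0.8, 0.7, 0.6) are scattered across three blocks.
scores = [0.1, 0.9, 0.1, 0.1, 0.8, 0.1, 0.1, 0.1, 0.7, 0.1, 0.1, 0.6]

tokens = token_topk(scores, k=4)
blocks = block_topk(scores, block_size=4, num_blocks=1)

print("token-level:", tokens)  # positions 1, 4, 8, 11
print("block-level:", blocks)  # positions 0, 1, 2, 3
```

In this toy case the block-level choice covers only one of the four relevant tokens, which hints at why coarse selection can cost quality, while the token-level choice has to rank every cached position.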

The authors propose cross-layer sparse attention (CLSA), built on top of KV-sharing architectures such as YOCO. The core idea is to share not only the KV cache across cross-decoder layers, but also the routing index. A single indexer computes token-level top-k selection once and reuses the resulting index across layers, preserving the fine-grained selectivity of token-sparse attention while amortizing the routing overhead. The resulting architecture improves all major inference bottlenecks jointly — pre-filling, KV-cache storage, and long-context decoding.
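A minimal sketch of the index-sharing idea follows, under stated assumptions rather than as the authors' implementation. The names `topk_index`, `sparse_attention`, and `decode_step` are hypothetical, the indexer is assumed to score tokens with a plain dot product, and the per-layer queries are passed in directly (in a real model each layer's query depends on the previous layer's output).

```python
import math
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def topk_index(query: Vector, index_keys: List[Vector], k: int) -> List[int]:
    """Indexer: score every cached token once, keep the k best positions."""
    scores = [dot(query, key) for key in index_keys]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


def sparse_attention(query: Vector, keys: List[Vector],
                     values: List[Vector], index: List[int]) -> Vector:
    """Softmax attention restricted to the token positions in `index`."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, keys[i]) * scale for i in index]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, index)) / total
            for d in range(dim)]


def decode_step(layer_queries: List[Vector], shared_keys: List[Vector],
                shared_values: List[Vector], index_query: Vector,
                k: int) -> Tuple[List[int], List[Vector]]:
    """One decoding step: route once, then reuse the index in every layer."""
    # Assumption: the indexer scores against the shared keys themselves.
    index = topk_index(index_query, shared_keys, k)
    outputs = [sparse_attention(q, shared_keys, shared_values, index)
               for q in layer_queries]
    return index, outputs


# Toy shared KV cache with five tokens (hypothetical values).
shared_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
shared_values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
layer_queries = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # one query per layer

index, outputs = decode_step(layer_queries, shared_keys, shared_values,
                             index_query=[1.0, 0.5], k=2)
print(index)         # the same two positions are used by all three layers
print(len(outputs))  # one attention output per layer
```

The design point the sketch tries to show is that `topk_index` runs once per decoding step regardless of how many layers consume the result, while each layer still attends over individually selected tokens rather than coarse blocks.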

Across short- and long-context benchmarks, CLSA is both accurate and efficient, achieving up to 7.6x decoding speedup and 17.1x overall throughput improvement at 128K context — a more complete architectural solution for long-context LLMs that jointly advances model quality and inference efficiency.

Why it matters

This research bears directly on cutting inference cost for long-context and reasoning models. For readers tracking KV-cache optimization, sparse attention, and LLM inference efficiency, it is a useful read on designs that balance quality and speed.

FAQ

What is sparse attention?

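Instead of attending to every token, the model attends only to a selected subset, which cuts computation. This helps with long contexts, but how the subset is chosen sets the quality-speed balance: coarse block-level selection is faster but loses accuracy, while token-level selection is more accurate but pays for top-k routing over the full cache.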

Why does it help reasoning models?

Reasoning models generate long chains of thought, so decoding becomes the bottleneck. Reusing the routing index across layers amortizes the top-k selection cost; the paper reports up to a 7.6x decoding speedup at 128K context.
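As a rough back-of-the-envelope illustration of the amortization, the sketch below counts only how many token scores the router computes per generated token. The function `routing_scores_per_token` and the layer count of 16 are hypothetical and not taken from the paper, and the resulting ratio covers routing work only, so it should not be read as the reported end-to-end speedup.

```python
def routing_scores_per_token(context_len: int, sparse_layers: int,
                             shared: bool) -> int:
    """Token scores the router computes to decode one new token."""
    passes = 1 if shared else sparse_layers
    return context_len * passes


context_len = 128 * 1024  # 128K cached tokens, matching the headline setting
sparse_layers = 16        # hypothetical layer count, not from the paper

per_layer = routing_scores_per_token(context_len, sparse_layers, shared=False)
shared_once = routing_scores_per_token(context_len, sparse_layers, shared=True)

print(per_layer)                 # 2097152 scores when every layer routes
print(shared_once)               # 131072 scores when the index is shared
print(per_layer // shared_once)  # 16, i.e. the number of sharing layers
```

Under these assumptions the routing work shrinks by a factor equal to the number of layers that share the index, and the saving recurs for every generated token, which is why long chains of thought benefit.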

Sources (primary)

Source: arXiv (descriptive metadata is CC0 public domain). Summaries are our own; see arXiv for the original text and PDF.

arXiv abstract page (original, official)
PDF (arXiv)
arXiv ID: 2606.06467

#AI #arXiv #Research paper #LLM #Long context #Inference efficiency
