Training RNNs without recurrence — "Supervised Memory Training (SMT)," parallelizable across time
Standard RNN training with backpropagation through time (BPTT) is sequential in time, hard to parallelize, and suffers from vanishing or exploding gradients over long ranges. SMT reduces RNN training to supervised learning on one-step memory-transition labels (m_t, x_{t+1}) → m_{t+1}, sidestepping recurrent credit propagation entirely. This enables time-parallel training with an O(1)-length gradient path between any two tokens. The authors report that it outperforms BPTT on language and pixel-sequence modeling (MIT, Isola lab).
Paper overview (our summary)
- Field (arXiv category): cs.LG (+1)
- Authors: Akarsh Kumar, Phillip Isola
- Submitted: 2026-06-04
- arXiv ID: 2606.06479v1
Key points
- Addresses BPTT being sequential, hard to parallelize, and prone to vanishing/exploding gradients
- Reduces RNN training to supervised learning on one-step memory-transition labels (SMT)
- Memory labels from a Transformer encoder trained on a predictive-state objective
- Time-parallel training with an O(1) gradient path between any two tokens, no unrolling
- Reported to outperform BPTT when pretraining various RNN architectures on language and pixel-sequence modeling (MIT, Isola lab)
This work (SMT) trains recurrent neural networks (RNNs) without using recurrent credit propagation.
Training RNNs requires assigning credit across long sequences of computation. Standard backpropagation through time (BPTT) handles this poorly: it is sequential in time, limiting parallelism, and suffers from vanishing or exploding gradients, making long-range associations hard to learn.
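The vanishing and exploding behaviour can be seen in a toy setting. The sketch below is our own illustration, not code from the paper: it assumes a scalar RNN, and the function name `end_to_start_gradient`, the weights, and the sequence length are hypothetical choices. It multiplies the per-step chain-rule factors that BPTT has to pass through to relate the last state to the first.

```python
import math
import random

def end_to_start_gradient(w, xs, nonlinear=True, h0=0.0):
    """Return d h_T / d h_0 for a scalar RNN (toy model, an assumption of this sketch).

    nonlinear=True : h_{t+1} = tanh(w * h_t + x_{t+1})
    nonlinear=False: h_{t+1} = w * h_t + x_{t+1}
    """
    h, grad = h0, 1.0
    for x in xs:
        pre = w * h + x
        if nonlinear:
            h = math.tanh(pre)
            grad *= w * (1.0 - h * h)  # chain rule: one factor per time step
        else:
            h = pre
            grad *= w
    return grad

rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(100)]

# Each tanh factor has magnitude at most |w| = 0.5, so the product shrinks
# at least as fast as 0.5 ** 100.
vanishing = end_to_start_gradient(0.5, xs, nonlinear=True)

# In the linear case every factor equals w = 1.5, so the product grows as 1.5 ** 100.
exploding = end_to_start_gradient(1.5, xs, nonlinear=False)

print(f"vanishing: {vanishing:.3e}")
print(f"exploding: {exploding:.3e}")
```

The gradient linking step 0 to step T is a product of T factors, so its size depends exponentially on the distance between the two steps. This is the long-range credit assignment problem described above.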
The proposed Supervised Memory Training (SMT) sidesteps recurrent credit propagation entirely by reducing RNN training to supervised learning on one-step memory-transition labels (m_t, x_{t+1}) → m_{t+1}. These memory labels are obtained by training a Transformer-based encoder on a predictive-state objective, which retains only the information from the past that is needed to predict the future. By decoupling what to remember from how to update memory, SMT enables time-parallel RNN training with a stable O(1)-length gradient path between any two tokens, without ever unrolling the RNN.
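The one-step supervised stage can be sketched in a few lines. This is a minimal illustration under strong assumptions, not the authors' implementation. The memory is a single scalar and the cell is a tanh unit with three parameters. Most importantly, the memory labels come from a hypothetical hidden "teacher" cell (`make_labels`), standing in for the Transformer encoder, which is not reproduced here. All function names are ours.

```python
import math
import random

def make_labels(T, seed=0):
    """Stand-in memory labels (ASSUMPTION: SMT obtains these from a trained
    encoder; here a hidden teacher cell generates them for illustration)."""
    rng = random.Random(seed)
    a, b, c = 0.8, 0.5, -0.1                          # hypothetical teacher parameters
    xs = [rng.uniform(-1.0, 1.0) for _ in range(T)]   # inputs x_1 .. x_T
    ms = [0.0]                                        # memory label m_0
    for x in xs:
        ms.append(math.tanh(a * ms[-1] + b * x + c))  # labels m_1 .. m_T
    return xs, ms

def cell(params, m, x):
    """Student RNN cell: maps (m_t, x_{t+1}) to a prediction of m_{t+1}."""
    a, b, c = params
    return math.tanh(a * m + b * x + c)

def one_step_loss_and_grad(params, xs, ms):
    """Mean squared error over all one-step transitions, with its gradient.

    Each term takes the label m_t as input, never the student's own previous
    output, so the T terms are independent of one another and no gradient
    flows through more than one cell application.
    """
    T = len(xs)
    loss, grad = 0.0, [0.0, 0.0, 0.0]
    for t in range(T):  # independent per step; written as a loop for clarity
        m_t, x_next, target = ms[t], xs[t], ms[t + 1]
        pred = cell(params, m_t, x_next)
        err = pred - target
        d_pre = 2.0 * err * (1.0 - pred * pred)  # d(loss term) / d(pre-activation)
        loss += err * err
        grad[0] += d_pre * m_t
        grad[1] += d_pre * x_next
        grad[2] += d_pre
    return loss / T, [g / T for g in grad]

def train(xs, ms, steps=2000, lr=0.2):
    """Plain gradient descent on the one-step loss (the RNN is never unrolled)."""
    params = [0.0, 0.0, 0.0]
    for _ in range(steps):
        _, grad = one_step_loss_and_grad(params, xs, ms)
        params = [p - lr * g for p, g in zip(params, grad)]
    return params

def rollout_error(params, xs, ms):
    """Run the trained cell recurrently on its own outputs and report the
    largest deviation from the memory labels."""
    m, worst = ms[0], 0.0
    for t, x in enumerate(xs):
        m = cell(params, m, x)
        worst = max(worst, abs(m - ms[t + 1]))
    return worst

xs, ms = make_labels(T=200)
initial_loss, _ = one_step_loss_and_grad([0.0, 0.0, 0.0], xs, ms)
params = train(xs, ms)
final_loss, _ = one_step_loss_and_grad(params, xs, ms)
rollout_err = rollout_error(params, xs, ms)
print(f"one-step loss: {initial_loss:.4f} -> {final_loss:.2e}")
print(f"max rollout deviation from labels: {rollout_err:.2e}")
```

Recurrence appears only in `rollout_error`, at evaluation time. During training each transition is an ordinary supervised example, so a vectorized implementation could process all time steps in one batched call. The sketch says nothing about how good memory labels are produced, which in SMT is the job of the encoder.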
The authors find that SMT outperforms BPTT when pretraining various RNN architectures on tasks like language modeling and pixel-sequence modeling. SMT enables nonlinear RNNs to better capture long-range dependencies and train in parallel, potentially unlocking the scaling of models that build temporal abstractions of past experience.
Why it matters
Relevant to long-sequence processing and efficient training of sequence models (RNNs and state-space models). Replacing BPTT with time-parallel training is a useful signal for anyone tracking the scaling of non-Transformer architectures.
FAQ
Why does parallelism matter?
Does it replace Transformers?
Sources (primary)
Source: arXiv (descriptive metadata is CC0 public domain). Summaries are our own; see arXiv for the original text and PDF.
- arXiv abstract page (original, official)
- PDF (arXiv)
- arXiv ID: 2606.06479