cs.LG cs.AI

Training RNNs without recurrence — "Supervised Memory Training (SMT)," parallelizable across time

cs.LG · Akarsh Kumar, Phillip Isola · Jun 2026

Standard RNN training with backpropagation through time (BPTT) is sequential in time, hard to parallelize, and prone to vanishing or exploding gradients over long ranges. SMT reduces RNN training to supervised learning on one-step memory-transition labels (m_t, x_{t+1}) → m_{t+1}, sidestepping recurrent credit propagation entirely. This enables time-parallel training with an O(1)-length gradient path between any two tokens. The authors report that it outperforms BPTT on language and pixel-sequence modeling (MIT, Isola lab).

Paper overview (our summary)

Field (arXiv category): cs.LG (+1)
Authors: Akarsh Kumar, Phillip Isola
Submitted: 2026-06-04
arXiv ID: 2606.06479v1

Key points

Addresses BPTT being sequential, hard to parallelize, and prone to vanishing/exploding gradients
Reduces RNN training to supervised learning on one-step memory-transition labels (SMT)
Memory labels from a Transformer encoder trained on a predictive-state objective
Time-parallel training with an O(1) gradient path between any two tokens, no unrolling
Beats BPTT on language and pixel-sequence modeling (MIT, Isola lab)

This work (SMT) trains recurrent neural networks (RNNs) without using recurrent credit propagation.

Training RNNs requires assigning credit across long sequences of computation. Standard backpropagation through time (BPTT) handles this poorly: it is sequential in time, limiting parallelism, and suffers from vanishing or exploding gradients, making long-range associations hard to learn.
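The vanishing and exploding behaviour can be seen in a minimal sketch. It assumes a scalar linear recurrence h_t = w * h_{t-1} + x_t, which is a toy chosen for illustration and not a model from the paper; the function name is hypothetical. In this toy, the gradient of the final state with respect to the initial state is w multiplied by itself once per unrolled step.

```python
def bptt_gradient_factor(w: float, steps: int) -> float:
    """Return d h_T / d h_0 for the toy recurrence h_t = w * h_{t-1} + x_t."""
    factor = 1.0
    for _ in range(steps):
        # Each unrolled step multiplies the gradient by the local Jacobian, here just w.
        factor *= w
    return factor


# Assumed illustrative values: 100 time steps, weights slightly below and above 1.
shrinking = bptt_gradient_factor(0.9, 100)
growing = bptt_gradient_factor(1.1, 100)

print(f"w=0.9 over 100 steps: {shrinking:.3e}")
print(f"w=1.1 over 100 steps: {growing:.3e}")
```

A weight slightly below 1 drives the factor toward zero and a weight slightly above 1 makes it grow without bound, which is why a gradient path whose length scales with the sequence is fragile.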

The proposed Supervised Memory Training (SMT) sidesteps recurrent credit propagation entirely by reducing RNN training to supervised learning on one-step memory-transition labels (m_t, x_{t+1}) → m_{t+1}. These memory labels are acquired by training a Transformer-based encoder on a predictive-state objective — retaining only information from the past necessary to predict the future. By decoupling what to remember from how to update memory, SMT enables time-parallel RNN training with a stable O(1)-length gradient path between any two tokens, without ever unrolling the RNN.
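The one-step reduction can be sketched in a few lines. This is a toy under loud assumptions, not the authors' implementation: `encoder_label` is a hypothetical stand-in for the Transformer-based encoder (here just a decaying sum of the prefix), the RNN cell is a two-parameter linear map, and training is plain gradient descent on squared error. What it tries to show is only the shape of the idea: memory labels are computed from prefixes, and the cell is fitted on independent (m_t, x_{t+1}) → m_{t+1} examples without ever being unrolled.

```python
import random


def encoder_label(prefix, decay=0.5):
    """Hypothetical stand-in for the memory encoder: summarize a prefix as one number."""
    n = len(prefix)
    return sum(x * decay ** (n - 1 - k) for k, x in enumerate(prefix))


def transition_pairs(xs):
    """Build one-step examples (m_t, x_{t+1}, m_{t+1}); each label uses only its own prefix."""
    labels = [encoder_label(xs[:t]) for t in range(len(xs) + 1)]
    return [(labels[t], xs[t], labels[t + 1]) for t in range(len(xs))]


def cell(a, b, m, x):
    """Toy linear RNN cell; a and b are the only trainable parameters."""
    return a * m + b * x


def mean_squared_error(pairs, a, b):
    return sum((cell(a, b, m, x) - target) ** 2 for m, x, target in pairs) / len(pairs)


def train_one_step(pairs, lr=0.5, epochs=300):
    """Fit the cell on one-step pairs; the gradient passes through one cell call only."""
    a, b = 0.0, 0.0
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for m, x, target in pairs:  # examples are independent of one another
            err = cell(a, b, m, x) - target
            grad_a += 2.0 * err * m
            grad_b += 2.0 * err * x
        a -= lr * grad_a / len(pairs)
        b -= lr * grad_b / len(pairs)
    return a, b


random.seed(0)
xs = [random.uniform(-1.0, 1.0) for _ in range(64)]
pairs = transition_pairs(xs)
a, b = train_one_step(pairs)
print(f"learned a={a:.3f}, b={b:.3f}, loss={mean_squared_error(pairs, a, b):.2e}")
```

In this toy the labels satisfy m_{t+1} = 0.5 * m_t + x_{t+1} exactly, so the fitted parameters should approach a = 0.5 and b = 1.0. The split between `encoder_label` (what to remember) and `cell` (how to update memory) mirrors the decoupling described above.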

The authors find that SMT outperforms BPTT when pretraining various RNN architectures on tasks like language modeling and pixel-sequence modeling. SMT enables nonlinear RNNs to better capture long-range dependencies and train in parallel, potentially unlocking the scaling of models that build temporal abstractions of past experience.

Why it matters

This paper is relevant to long-sequence processing and to the efficient training of sequence models such as RNNs and state-space models. Training in parallel without BPTT is a useful signal for anyone tracking how non-Transformer architectures scale.

FAQ

Why does parallelism matter?

RNNs normally compute step by step in time, which makes training slow at scale. SMT trains all time steps in parallel without unrolling the RNN, which could bring RNN training closer to Transformer-like efficiency.
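A small sketch of why one-step supervision parallelizes, again under stated assumptions: the linear `cell`, the parameter values, and the way the stand-in labels are produced are all hypothetical and chosen only for illustration. A sequential rollout must visit positions in order, while the one-step errors read only the given labels, so they can be evaluated in any order (and therefore concurrently).

```python
import random


def cell(m, x, a, b=1.0):
    """Toy linear RNN cell (hypothetical, for illustration only)."""
    return a * m + b * x


def rollout(xs, a):
    """Sequential pass: state t+1 cannot be computed before state t."""
    states = [0.0]
    for x in xs:
        states.append(cell(states[-1], x, a))
    return states


def one_step_errors(labels, xs, a, order):
    """Teacher-forced pass: position t reads only labels[t], xs[t], and labels[t + 1]."""
    return {t: (cell(labels[t], xs[t], a) - labels[t + 1]) ** 2 for t in order}


random.seed(0)
xs = [random.uniform(-1.0, 1.0) for _ in range(32)]
labels = rollout(xs, a=0.5)  # stand-in memory labels, assumed to be given before training

forward_order = list(range(len(xs)))
shuffled_order = forward_order[:]
random.shuffle(shuffled_order)

# Evaluate a mismatched cell (a=0.3) against the labels in two different orders.
forward = one_step_errors(labels, xs, 0.3, forward_order)
shuffled = one_step_errors(labels, xs, 0.3, shuffled_order)
print(forward == shuffled)
```

The two dictionaries compare equal because no position depends on another position's prediction, which is the property that lets every time step be processed at once.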

Does it replace Transformers?

That is not the focus. The work is about training RNNs efficiently, and it is foundational for the training efficiency and long-range modeling of stateful sequence models (e.g., state-space models).

Sources (primary)

Source: arXiv (descriptive metadata is CC0 public domain). Summaries are our own; see arXiv for the original text and PDF.

arXiv abstract page (original, official)
PDF (arXiv)
arXiv ID: 2606.06479
