cs.RO cs.AI

A robot policy with controllable speed — "TempoVLA," a speed-controllable Vision-Language-Action model

cs.RO ・Dong Jing, Jingchen Nie, Tianqi Zhang, et al. (7) ・Jun 2026

Manipulation alternates between low-risk transit, which can be fast, and high-risk contact, which must be slow and precise, yet existing Vision-Language-Action models (VLAs) inherit a single fixed speed from their demonstrations. TempoVLA builds on the observation that the magnitude of each predicted action already governs speed, and sets execution speed through an explicit condition. It combines data-side variable-speed trajectory augmentation (VSTA) with model-side speed conditioning to support both acceleration and deceleration.

Paper overview (our summary)

Field (arXiv category): cs.RO (+1)
Authors: Dong Jing, Jingchen Nie, Tianqi Zhang, et al. (7)
Submitted: 2026-06-04
arXiv ID: 2606.06491v1

Key points

Addresses VLAs being locked to a single demonstration speed; deceleration in particular is almost unexplored
Sets speed through an explicit condition, building on the observation that predicted-action magnitude governs speed (TempoVLA)
Data side: variable-speed trajectory augmentation (VSTA) re-times demonstrations while preserving motion semantics
Model side: a conditioning mechanism that feeds the target speed to the policy
Speed control in both directions; paired with a large multimodal model, it accelerates low-risk phases and decelerates high-risk ones

This work (TempoVLA) lets a robot policy vary its motion speed according to the situation.

Robot manipulation alternates between low-risk transit phases that call for fast execution and high-risk contact stages that demand slow, precise motion. Yet existing Vision-Language-Action models (VLAs) only inherit a single fixed speed from training demonstrations. Prior efforts to accelerate VLAs (model compression, KV-cache reuse, reinforcement learning) only shift the policy from one fixed speed to another, leaving deceleration almost unexplored.

The authors observe that the magnitude of each predicted action already governs how fast the robot moves, opening a direct route to controllable execution speed. TempoVLA, a single VLA whose speed is set by an explicit condition, combines two coupled components: (1) a data-side Variable-Speed Trajectory Augmentation (VSTA) that re-times demonstrations to any target speed by merging or splitting actions while preserving motion semantics; and (2) a model-side conditioning mechanism that feeds the speed to the policy. The authors report statistics showing that VSTA reaches the requested speed with negligible motion error.
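
The merge-and-split idea behind VSTA can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes each action is a relative (delta) end-effector displacement executed at a fixed control rate, restricts the speed factor to an integer k, and ignores rotations and gripper commands. The names speed_up, slow_down, and total_displacement are hypothetical.

```python
from typing import List

# Assumption: one action is a delta displacement applied in one control step.
Action = List[float]


def speed_up(actions: List[Action], k: int) -> List[Action]:
    """Merge every k consecutive delta actions into one (roughly k times faster)."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    merged = []
    for start in range(0, len(actions), k):
        chunk = actions[start:start + k]
        # Summing deltas per dimension keeps the net motion of the chunk.
        merged.append([sum(dim) for dim in zip(*chunk)])
    return merged


def slow_down(actions: List[Action], k: int) -> List[Action]:
    """Split each delta action into k equal parts (roughly k times slower)."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    split = []
    for action in actions:
        part = [value / k for value in action]
        split.extend(list(part) for _ in range(k))
    return split


def total_displacement(actions: List[Action]) -> List[float]:
    """Net displacement of a trajectory, used here as a simple sanity check."""
    return [sum(dim) for dim in zip(*actions)]


# Toy demonstration: two steps along x, then two steps along y.
demo = [[0.01, 0.0], [0.01, 0.0], [0.0, 0.02], [0.0, 0.02]]
fast = speed_up(demo, 2)   # half as many steps, larger per-step actions
slow = slow_down(demo, 2)  # twice as many steps, smaller per-step actions
```

Under these assumptions, both re-timed trajectories cover the same net displacement as the original while using half or twice as many control steps, which is the sense in which larger per-step actions mean faster motion.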

Experiments in simulation and on real-world tasks show that TempoVLA achieves flexible speed control in both directions, and that VSTA also improves performance at the default (1×) speed through better data utilization. Paired with a large multimodal model, TempoVLA performs dynamic speed control, accelerating through low-risk phases and decelerating for high-risk ones.

Why it matters

Relevant to readers working on robot manipulation, VLAs, embodied AI, and industrial automation. Phase-appropriate speed control is one route to manipulation policies that are both safer and more efficient.

FAQ

Why is speed control important?

Different phases need different speeds: fast for transit, slow and precise for contact (e.g., insertion). Matching speed to the phase balances safety and efficiency better than a single-speed VLA can.

What is a VLA?

A Vision-Language-Action model that outputs a robot's actions directly from images and language instructions.

Sources (primary)

Source: arXiv (descriptive metadata is CC0 public domain). Summaries are our own; see arXiv for the original text and PDF.

arXiv abstract page (original, official)
PDF (arXiv)
arXiv ID: 2606.06491
