Scaling Short-Term Memory of Visuomotor Policies for Long-Horizon Tasks

Abstract

Many robotic tasks require short-term memory, whether it's retrieving an object that is no longer visible or turning off an appliance after a set period. Yet most visuomotor policies trained via imitation learning remain myopic, relying only on immediate sensory input without using past experience to guide decisions. We present PRISM, a transformer-based architecture that equips visuomotor policies with effective short-term memory via two key components: (i) gated attention, which selectively filters retrieved information to suppress irrelevant details, improving performance by reducing spurious correlations between the history and the current action prediction, and (ii) a hierarchical architecture that first compresses local interactions into compact tokens and then integrates them to capture temporally extended dependencies, reducing its compute and memory footprint. Together, these mechanisms enable us to scale short-term memory in visuomotor policies to span up to two minutes, an order of magnitude longer than previous approaches. To systematically evaluate memory in visuomotor control, we introduce ReMemBench, a benchmark of eight diverse household manipulation tasks spanning four categories of short-term memory, designed to foster general memory mechanisms rather than siloed, task-specific solutions. PRISM consistently outperforms prior works, including transformer-based visuomotor policies with short-term memory, recurrent architectures, and other transformer variants, achieving an absolute improvement of 5%–12% over the strongest baseline on ReMemBench and in real-world evaluations. On the standard RoboCasa and LIBERO benchmarks, it achieves absolute improvements of 11%–15% over its no-memory variant and over fine-tuned Vision-Language-Action baselines such as GR00T-N1-2B and OpenVLA, despite not leveraging any large-scale pretraining. Together, PRISM and ReMemBench establish a foundation for developing and evaluating short-term memory–augmented visuomotor policies that scale to long-horizon tasks.

Method Overview

PRISM applies gated attention to filter information retrieved from the history and hierarchical summarization to scale attention over long interaction histories. Built on causal transformer policies trained with behavior cloning, these two mechanisms improve robustness to noisy histories while reducing computation.
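
To make the gating concrete, the sketch below shows one way such a mechanism can be implemented in PyTorch: a cross-attention layer retrieves from memory tokens, and a learned sigmoid gate, conditioned on the current tokens and the retrieved context, decides how much of the retrieved content enters the residual stream. The module name, gate parameterization, and dimensions are illustrative assumptions, not PRISM's exact design.

```python
# Minimal sketch (PyTorch) of gated cross-attention over retrieved memory tokens.
# The sigmoid output gate used here is one common choice (an assumption) for
# suppressing irrelevant retrieved content before it enters the residual stream.
import torch
import torch.nn as nn


class GatedMemoryAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Gate conditioned on both the current tokens and the retrieved context.
        self.gate = nn.Linear(2 * d_model, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        """x: (B, T, D) current observation/action tokens.
        memory: (B, M, D) tokens summarizing past interactions."""
        retrieved, _ = self.attn(query=x, key=memory, value=memory)
        # Per-dimension gate in [0, 1]: ~0 discards the retrieved content,
        # ~1 lets it flow into the residual stream.
        g = torch.sigmoid(self.gate(torch.cat([x, retrieved], dim=-1)))
        return self.norm(x + g * retrieved)


if __name__ == "__main__":
    layer = GatedMemoryAttention(d_model=256, n_heads=8)
    x = torch.randn(2, 16, 256)        # current chunk of tokens
    memory = torch.randn(2, 128, 256)  # retrieved history tokens
    print(layer(x, memory).shape)      # torch.Size([2, 16, 256])
```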

PRISM Architecture
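
The hierarchical component can be sketched in a similar spirit: local windows of tokens are first compressed into a handful of summary tokens via learned query tokens, and a second transformer then attends only over these compact summaries to capture long-range dependencies. The window size, summary-token count, and layer counts below are illustrative assumptions rather than PRISM's actual hyperparameters.

```python
# Minimal sketch (PyTorch) of hierarchical summarization: compress each local
# window into a few summary tokens, then run a transformer over the summaries.
import torch
import torch.nn as nn


class HierarchicalSummarizer(nn.Module):
    def __init__(self, d_model=256, n_heads=8, window=16, n_summary=2):
        super().__init__()
        self.window = window
        self.summary_queries = nn.Parameter(torch.randn(n_summary, d_model))
        self.local_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.global_encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        """tokens: (B, T, D) with T divisible by the window size."""
        B, T, D = tokens.shape
        windows = tokens.reshape(B * (T // self.window), self.window, D)
        queries = self.summary_queries.unsqueeze(0).expand(windows.size(0), -1, -1)
        # Stage 1: compress each window into n_summary tokens.
        summaries, _ = self.local_attn(query=queries, key=windows, value=windows)
        summaries = summaries.reshape(B, -1, D)
        # Stage 2: model long-range dependencies over the compact summaries.
        return self.global_encoder(summaries)


if __name__ == "__main__":
    model = HierarchicalSummarizer()
    history = torch.randn(2, 256, 256)  # 256 timesteps of 256-d tokens
    print(model(history).shape)         # torch.Size([2, 32, 256])
```

With a window of 16 and 2 summary tokens per window, the global stage attends over a sequence 8x shorter than the raw history, which is where the compute and memory savings of the hierarchy come from.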

ReMemBench Categories

ReMemBench is designed to evaluate short-term memory in visuomotor policies. Guided by the cognitive science literature, we decompose short-term memory into several functional categories. This diversity of categories encourages the development of general memory mechanisms rather than custom, non-generalizable solutions for a particular task. Each category is instantiated with two household manipulation tasks. Videos of each category are shown below.

Spatial Memory

Prospective Memory

Object-Associative Memory

Object-Set Memory

Real-World Rollouts

We evaluate PRISM on a real-world adaptation of the 'Wash and Return to Container' task from ReMemBench. Below is a visualization of two successful rollouts of PRISM.

Results Overview

How does PRISM compare to prior works on imitation learning with partial observability?

PRISM significantly outperforms prior approaches on ReMemBench, achieving 41% average success rate compared to Long Short-Term Memory (12%), Mamba (25%), Transformer XL (15%), Gated Transformer XL (26%), Linear Attention (22%), Past-Token Prediction (15%), and Scene Memory Transformer (36%). Recurrent models struggle with long-horizon credit assignment due to vanishing gradients, while attention-based methods fail to effectively filter irrelevant information from memory.

| Group | Method | Spatial | Prospective | Object-Associative | Object-Set | Avg |
| --- | --- | --- | --- | --- | --- | --- |
| ReMemBench (Four Tasks) | Long Short-Term Memory | 0.12 | 0.10 | 0.12 | 0.13 | 0.12 ± 0.04 |
| | Mamba | 0.20 | 0.30 | 0.23 | 0.27 | 0.25 ± 0.05 |
| | Transformer XL | 0.15 | 0.13 | 0.17 | 0.15 | 0.15 ± 0.02 |
| | Gated Transformer XL | 0.23 | 0.30 | 0.25 | 0.27 | 0.26 ± 0.03 |
| | Linear Attention | 0.20 | 0.23 | 0.22 | 0.23 | 0.22 ± 0.02 |
| | Past-Token Prediction | 0.15 | 0.13 | 0.17 | 0.15 | 0.15 ± 0.02 |
| | Scene Memory Transformer | 0.41 | 0.33 | 0.33 | 0.33 | 0.36 ± 0.04 |
| | PRISM | 0.33 | 0.77 | 0.33 | 0.22 | 0.41 ± 0.04 |
| ReMemBench (ALL) | Past-Token Prediction | 0.15 | 0.13 | 0.17 | 0.15 | 0.16 ± 0.03 |
| | SAM2Act++ | 0.18 | 0.15 | 0.24 | 0.30 | 0.19 ± 0.03 |
| | PRISM | 0.39 | 0.43 | 0.23 | 0.23 | 0.32 ± 0.08 |

Table II: Success rate across ReMemBench task categories (20 trials × 3 seeds).

How much improvement does PRISM provide on standard benchmarks that do not explicitly test memory?

PRISM with memory (n=256) demonstrates substantial improvements on standard visuomotor benchmarks, even when memory is not explicitly required. On RoboCasa, PRISM improves from 32% (no memory) to 43% (+11 points), also exceeding GR00T-N1-2B by 11 points. On LIBERO, PRISM achieves an average success rate of 89%, improving by 12 points over its no-memory variant (77%) and outperforming Open Vision-Language-Action by 15 points, indicating that short-term memory helps disambiguate visually identical states requiring different actions.

RoboCasa

| Method | Success Rate |
| --- | --- |
| GR00T-N1-2B | 0.32 |
| Diffusion Policy | 0.26 |
| PRISM (no memory) | 0.32 |
| PRISM (n=256) | 0.43 |
LIBERO

| Method | LIBERO-90 | LIBERO-10 | Object | Spatial | Goal | Avg |
| --- | --- | --- | --- | --- | --- | --- |
| Open Vision-Language-Action | 0.62 | 0.54 | 0.88 | 0.85 | 0.79 | 0.74 |
| Diffusion Policy | 0.72 | – | 0.93 | 0.78 | 0.68 | – |
| PRISM (no memory) | 0.77 | 0.75 | 0.89 | 0.85 | 0.61 | 0.77 |
| PRISM (n=256) | 0.85 | 0.81 | 0.93 | 0.92 | 0.95 | 0.89 |

Table III: Success rates on RoboCasa and LIBERO. PRISM with memory (n=256) outperforms its no-memory variant and strong pretrained baselines.

How does PRISM's performance scale with increasing memory capacity?

PRISM exhibits strong scalability with increasing memory size on ReMemBench, with success rate improving by 26 points as the memory window expands from n=1 to n=512. This steady improvement without saturation demonstrates PRISM's ability to extract useful information from temporally extended memory, making it well suited for tasks that require reasoning over long temporal horizons.

Memory scalability results

Performance scaling with increasing memory capacity (n=1 to n=512).