Scaling Short-Term Memory of Visuomotor Policies for Long-Horizon Tasks

Abstract

Many robotic tasks demand short-term memory, whether it's retrieving objects that are no longer visible or turning off an appliance after a certain amount of time. Yet most visuomotor policies remain myopic, relying only on immediate sensory input without leveraging past experience to guide decisions. We present PRISM, a transformer-based architecture that enables visuomotor policies to use short-term memory effectively via two key components: (i) gated attention, which selectively filters retrieved information to suppress irrelevant details, and (ii) a hierarchical architecture that first compresses local interactions into compact tokens and then integrates them to capture temporally extended dependencies. Together, these mechanisms enable us to scale short-term memory in visuomotor policies to up to two minutes at five frames per second, an order of magnitude longer than previous approaches. To systematically evaluate memory in visuomotor control, we introduce ReMemBench, a benchmark of eight diverse household manipulation tasks spanning four categories of short-term memory, designed to foster general memory mechanisms rather than siloed, task-specific solutions. PRISM consistently outperforms prior work, including transformer-based visuomotor policies with short-term memory, recurrent architectures, and other short-term memory-management strategies. Across ReMemBench and real-world evaluations, it achieves two times the success rate, and on the RoboCasa benchmark, it yields a 14-percentage-point gain over the strongest baseline. Together, PRISM and ReMemBench establish a foundation for developing and evaluating short-term memory-augmented visuomotor policies that scale to long-horizon tasks.

Method Overview

PRISM applies gated attention to filter historical context and a hierarchical architecture to scale attention over long interaction histories. Together, these components help causal transformer policies trained with behavior cloning handle noisy histories while reducing computation.
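To make the idea of gated attention concrete, here is a minimal NumPy sketch of one causal attention head whose retrieved context is scaled by a learned sigmoid gate. This is an illustration under stated assumptions, not PRISM's implementation: the function name `gated_causal_attention`, the parameters `w_g` and `b_g`, and the choice to compute the gate from the current token are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; masked (-inf) entries become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def gated_causal_attention(x, w_q, w_k, w_v, w_g, b_g):
    """Single-head causal attention with an elementwise output gate.

    x: (T, d) sequence of history tokens, oldest first.
    w_q, w_k, w_v, w_g: (d, d) projection matrices (hypothetical parameters).
    b_g: (d,) gate bias. Returns an array of shape (T, d).
    """
    T, d = x.shape
    q, k, v = x @ w_q, x @ w_k, x @ w_v

    # Causal mask: token t may only attend to tokens 0..t.
    scores = (q @ k.T) / np.sqrt(d)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)

    retrieved = softmax(scores) @ v  # context retrieved from the history

    # Assumed gate design: a sigmoid of the current token decides, per
    # feature, how much of the retrieved context is passed on.
    gate = 1.0 / (1.0 + np.exp(-(x @ w_g + b_g)))
    return gate * retrieved
```

When the gate saturates near zero, the retrieved history contributes almost nothing to the output, which is one simple way a policy could suppress irrelevant past details.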

PRISM Architecture
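The hierarchical component can be illustrated with a rough sketch of why compressing local chunks into compact tokens reduces attention cost. Everything specific here is an assumption for illustration: mean pooling stands in for whatever learned compression PRISM uses, the chunk size of 10 is hypothetical, and the pair counts ignore causal masking. Only the history length (two minutes at five frames per second) comes from the text.

```python
import numpy as np


def compress_chunks(frame_tokens, chunk_size):
    """Summarize each chunk of consecutive frame tokens with one token.

    Mean pooling is a stand-in for a learned compression step.
    frame_tokens: (T, d). Returns (ceil(T / chunk_size), d).
    """
    T = frame_tokens.shape[0]
    return np.stack([
        frame_tokens[i:i + chunk_size].mean(axis=0)
        for i in range(0, T, chunk_size)
    ])


def flat_pairs(T):
    # Query-key pairs if every frame token attends over the whole history.
    return T * T


def hierarchical_pairs(T, chunk_size):
    # Local attention inside each chunk, plus attention among chunk summaries.
    sizes = [min(chunk_size, T - i) for i in range(0, T, chunk_size)]
    return sum(s * s for s in sizes) + len(sizes) ** 2


history_frames = 120 * 5  # two minutes at five frames per second
tokens = np.ones((history_frames, 8))  # one 8-dim token per frame (assumed)
summaries = compress_chunks(tokens, chunk_size=10)

print(summaries.shape)                        # (60, 8)
print(flat_pairs(history_frames))             # 360000
print(hierarchical_pairs(history_frames, 10)) # 9600
```

Under these assumptions, a 600-frame history needs 60 summary tokens at the upper level, and the number of attended pairs drops from 360,000 to 9,600.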

ReMemBench Categories

ReMemBench is designed to evaluate short-term memory in visuomotor policies. Guided by the cognitive science literature, we decompose short-term memory into four functional categories. This diversity encourages the development of general memory mechanisms rather than custom, non-generalizable solutions for a particular task. Each category is instantiated with two household manipulation tasks. Below are videos for each category.

Spatial Memory

Prospective Memory

Object-Associative Memory

Object-Set Memory

Real-World Rollouts

We evaluate PRISM on a real-world adaptation of the 'Wash and Return to Container' task from ReMemBench. Below are visualizations of two successful PRISM rollouts.