PRISM combines gated attention, which filters information retrieved from history, with hierarchical summarization, which scales attention to long interaction histories. Applied to causal transformer policies trained with behavior cloning, it improves robustness to noisy histories while reducing computation.
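To make these two components concrete, here is a minimal PyTorch sketch of a gated cross-attention block reading from a hierarchically summarized history. This is not the released PRISM implementation: the module names, the mean-pooling chunk size, and the sigmoid gating form are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): gated retrieval over a summarized history.
import torch
import torch.nn as nn


class HierarchicalSummarizer(nn.Module):
    """Pools the raw history into coarser summary tokens so attention cost grows
    with the number of summaries rather than the raw history length (assumed form)."""

    def __init__(self, dim: int, chunk: int = 8):
        super().__init__()
        self.chunk = chunk
        self.proj = nn.Linear(dim, dim)

    def forward(self, history: torch.Tensor) -> torch.Tensor:
        # history: (batch, T, dim) -> summaries: (batch, ceil(T / chunk), dim)
        b, t, d = history.shape
        pad = (-t) % self.chunk
        if pad:
            history = torch.cat([history, history.new_zeros(b, pad, d)], dim=1)
        chunks = history.view(b, -1, self.chunk, d).mean(dim=2)  # mean-pool each chunk
        return self.proj(chunks)


class GatedMemoryAttention(nn.Module):
    """Cross-attends from current-step features to the summarized history and gates
    the retrieved memory so irrelevant history can be suppressed (assumed form)."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, query: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # query: (batch, 1, dim) current-step features; memory: (batch, M, dim)
        retrieved, _ = self.attn(query, memory, memory)
        g = self.gate(torch.cat([query, retrieved], dim=-1))  # per-dim gate in [0, 1]
        return query + g * retrieved  # gated residual update of the current features
```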
ReMemBench is designed to evaluate short-term memory in visuomotor policies. Guided by the cognitive science literature, we decompose short-term memory into several functional categories; this diversity encourages general memory mechanisms rather than custom, non-generalizable solutions to a single task. Each category is instantiated with two household-manipulation tasks. Videos of each category are shown below.
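The layout implied here (categories, each with two tasks) can be sketched as a simple mapping. The category names below come from Table II; the individual task identifiers are placeholders, not the real ReMemBench task names.

```python
# Sketch of the benchmark layout: four categories x two tasks each.
# Task identifiers are placeholders, not the actual ReMemBench task names.
REMEMBENCH = {
    "Spatial":            ["<task_1>", "<task_2>"],
    "Prospective":        ["<task_1>", "<task_2>"],
    "Object-Associative": ["<task_1>", "<task_2>"],
    "Object-Set":         ["<task_1>", "<task_2>"],
}


def iter_tasks():
    """Yield (category, task) pairs so an evaluation loop covers every category."""
    for category, tasks in REMEMBENCH.items():
        for task in tasks:
            yield category, task
```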
We evaluate PRISM on a real-world adaptation of the 'Wash and Return to Container' task from ReMemBench. Two successful PRISM rollouts are visualized below.
PRISM significantly outperforms prior approaches on ReMemBench, achieving a 41% average success rate compared to Long Short-Term Memory (12%), Mamba (25%), Transformer XL (15%), Gated Transformer XL (26%), Linear Attention (22%), Past-Token Prediction (15%), and Scene Memory Transformer (36%). Recurrent models struggle with long-horizon credit assignment due to vanishing gradients, while attention-based methods fail to effectively filter irrelevant information from memory.
| Group | Method | Spatial | Prospective | Object-Associative | Object-Set | Avg |
|---|---|---|---|---|---|---|
| ReMemBench (Four Tasks) | Long Short-Term Memory | 0.12 | 0.10 | 0.12 | 0.13 | 0.12 ± 0.04 |
| | Mamba | 0.20 | 0.30 | 0.23 | 0.27 | 0.25 ± 0.05 |
| | Transformer XL | 0.15 | 0.13 | 0.17 | 0.15 | 0.15 ± 0.02 |
| | Gated Transformer XL | 0.23 | 0.30 | 0.25 | 0.27 | 0.26 ± 0.03 |
| | Linear Attention | 0.20 | 0.23 | 0.22 | 0.23 | 0.22 ± 0.02 |
| | Past-Token Prediction | 0.15 | 0.13 | 0.17 | 0.15 | 0.15 ± 0.02 |
| | Scene Memory Transformer | 0.41 | 0.33 | 0.33 | 0.33 | 0.36 ± 0.04 |
| | PRISM | 0.33 | 0.77 | 0.33 | 0.22 | 0.41 ± 0.04 |
| ReMemBench (ALL) | Past-Token Prediction | 0.15 | 0.13 | 0.17 | 0.15 | 0.16 ± 0.03 |
| | SAM2Act++ | 0.18 | 0.15 | 0.24 | 0.30 | 0.19 ± 0.03 |
| | PRISM | 0.39 | 0.43 | 0.23 | 0.23 | 0.32 ± 0.08 |
Table II: Success rate across ReMemBench task categories (20 trials × 3 seeds).
PRISM with memory (n=256) demonstrates substantial improvements on standard visuomotor benchmarks, even when memory is not explicitly required. On RoboCasa, PRISM improves from 32% (no memory) to 43% (+11 points), also exceeding GR00T-N1-2B by 11 points. On LIBERO, PRISM achieves an average success rate of 89%, improving by 12 points over its no-memory variant (77%) and outperforming Open Vision-Language-Action by 15 points, indicating that short-term memory helps disambiguate visually identical states requiring different actions.
**RoboCasa**

| Method | Success Rate |
|---|---|
| GR00T-N1-2B | 0.32 |
| Diffusion Policy | 0.26 |
| PRISM (no memory) | 0.32 |
| PRISM (n=256) | 0.43 |
**LIBERO**

| Method | LIBERO-90 | LIBERO-10 | Object | Spatial | Goal | Avg |
|---|---|---|---|---|---|---|
| Open Vision-Language-Action | 0.62 | 0.54 | 0.88 | 0.85 | 0.79 | 0.74 |
| Diffusion Policy | – | 0.72 | 0.93 | 0.78 | 0.68 | – |
| PRISM (no memory) | 0.77 | 0.75 | 0.89 | 0.85 | 0.61 | 0.77 |
| PRISM (n=256) | 0.85 | 0.81 | 0.93 | 0.92 | 0.95 | 0.89 |
Table III: Success rates on RoboCasa and LIBERO. PRISM with memory (n=256) outperforms its no-memory variant and strong pretrained baselines.
PRISM scales well with memory size on ReMemBench: success rate increases by 0.26 as the memory window grows from n=1 to n=512. This steady improvement without saturation shows that PRISM extracts useful information from temporally extended memory, making it well suited to tasks that require reasoning over long horizons.
Performance scaling with increasing memory capacity (n=1 to n=512).
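The memory window n swept in this experiment can be thought of as the capacity of a rolling buffer of past step features that the policy attends over. Below is a minimal sketch of such a buffer; the class name, feature dimension, and the intermediate window sizes in the usage comment are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of the memory-window knob: keep the most recent n step features
# as the history exposed to the policy's memory module (names are hypothetical).
from collections import deque

import torch


class RollingMemory:
    """Fixed-capacity buffer of past step features; n=1 keeps only the last step,
    larger n exposes a longer history to the policy."""

    def __init__(self, n: int, dim: int):
        self.buffer = deque(maxlen=n)
        self.dim = dim

    def append(self, feat: torch.Tensor) -> None:
        # feat: (dim,) features for the current step
        self.buffer.append(feat)

    def as_tensor(self) -> torch.Tensor:
        # (M, dim) with M <= n, ready to be summarized and attended over
        if not self.buffer:
            return torch.zeros(0, self.dim)
        return torch.stack(list(self.buffer), dim=0)


# Sweeping the window as in the scaling experiment (endpoints n=1 and n=512 are
# from the text; the intermediate sizes here are placeholders):
# for n in (1, 8, 64, 256, 512):
#     memory = RollingMemory(n=n, dim=256)
#     ... roll out the policy and record the success rate ...
```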