¹Google DeepMind   ²Carnegie Mellon University   ³Princeton University
Many useful robot tasks require attending to the history of past observations. For example, finding an item in a room requires remembering which places have already been searched. However, the best-performing robot policies typically condition only on the current observation, limiting their applicability to such tasks.
Naïvely conditioning on past observations often fails due to spurious correlations: policies latch onto incidental features of training histories that do not generalize to out-of-distribution trajectories upon deployment.
In this paper, we analyze why policies latch onto these spurious correlations. We find that the problem stems from limited coverage of the space of possible histories during training: this space grows exponentially with the task horizon, so training data covers only a small fraction of it. Existing regularization techniques provide inconsistent benefits across tasks because they do not fundamentally address this coverage problem.
Motivated by these findings, we propose Big Picture Policies (BPP), an approach that conditions on a minimal set of meaningful keyframes detected by a vision-language model. By projecting diverse rollouts onto a compact representation of task-relevant events, BPP substantially reduces the distribution shift between training and deployment without sacrificing expressivity. We evaluate BPP on four challenging real-world manipulation tasks and three simulation tasks, all of which require history conditioning. BPP achieves 70% higher success rates than the best-performing baseline in real-world evaluations.
Conditioning policies on subsampled past observations seems natural, but it leads to failures across diverse tasks. Policies latch onto spurious correlations in training histories that do not generalize to out-of-distribution rollouts at deployment time.
We ran simulation experiments to understand why this happens. To probe how well a policy understands the current history state, we train a separate prediction head on top of the policy's features, with a stop-gradient so the probe does not affect the policy, and compare its accuracy across settings. We find that methods conditioning on subsampled past histories predict the history state accurately on unseen but in-distribution trajectories, yet fail on trajectories from the policy's own rollouts, which are slightly out-of-distribution.
History understanding degrades on policy rollouts (out-of-distribution).
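For concreteness, below is a minimal sketch of how such a probe could look in PyTorch. The `policy_encoder` interface, the discrete `history_state` labels, and the batch format are illustrative assumptions, not the paper's exact setup.

```python
import torch
import torch.nn as nn


class HistoryStateProbe(nn.Module):
    """Linear probe that predicts a discrete history state (e.g., which
    drawers have already been searched) from the policy's internal features."""

    def __init__(self, feature_dim: int, num_history_states: int):
        super().__init__()
        self.head = nn.Linear(feature_dim, num_history_states)

    def forward(self, policy_features: torch.Tensor) -> torch.Tensor:
        # Stop-gradient: the probe trains on frozen policy features, so its
        # accuracy measures what the policy already encodes without changing
        # the policy itself.
        return self.head(policy_features.detach())


@torch.no_grad()
def probe_accuracy(probe: HistoryStateProbe, policy_encoder, batch) -> float:
    """Probe accuracy on a batch; compare held-out demonstrations against the
    policy's own rollouts to measure the out-of-distribution gap."""
    features = policy_encoder(batch["observations"])  # assumed encoder interface
    preds = probe(features).argmax(dim=-1)
    return (preds == batch["history_state"]).float().mean().item()
```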
Regularization techniques like Past Token Prediction (PTP) can help, but the problem persists in both history understanding and evaluation performance (see results below).
Regularization does not fix lack of coverage over possible histories.
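As a rough illustration of this family of regularizers, the sketch below adds an auxiliary loss that asks the policy to reconstruct past actions from its current representation. The `policy.encode`, `action_head`, and `past_head` interfaces are assumptions for illustration; the original PTP formulation may differ in its targets and weighting.

```python
import torch.nn as nn


def imitation_loss_with_past_prediction(policy, batch, aux_weight: float = 0.1):
    """Behavior cloning loss plus a PTP-style auxiliary term that reconstructs
    past actions from the current representation (sketch, not the exact method)."""
    features = policy.encode(batch["observations"])

    # Standard imitation term: predict the current/future actions.
    bc_loss = nn.functional.mse_loss(policy.action_head(features), batch["actions"])

    # Auxiliary term: reconstruct past actions, encouraging the representation
    # to retain history information instead of discarding it.
    past_loss = nn.functional.mse_loss(policy.past_head(features), batch["past_actions"])

    return bc_loss + aux_weight * past_loss
```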
The key idea of BPP is to train policies conditioned on a small set of behaviorally salient events instead of on raw histories. These salient events are detected by a VLM using a task-specific criterion and very simple prompting: for example, for the Drawer Search task we ask the VLM "has a drawer just been opened?". A schematic of the BPP pipeline is shown below.
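A minimal sketch of how such a keyframe detector might be wired up is shown below. The `query_vlm(image, prompt)` helper, the prompt table, and the per-frame query rate are illustrative assumptions, not the exact BPP implementation.

```python
# Task-specific yes/no criteria, one prompt per task (illustrative).
KEYFRAME_PROMPTS = {
    "drawer_search": "Has a drawer just been opened? Answer yes or no.",
}


def detect_keyframes(frames, task: str, query_vlm):
    """Return indices of frames the VLM flags as behaviorally salient."""
    prompt = KEYFRAME_PROMPTS[task]
    keyframes = []
    for t, frame in enumerate(frames):
        answer = query_vlm(frame, prompt)  # assumed helper returning text, e.g. "yes"
        if answer.strip().lower().startswith("yes"):
            keyframes.append(t)
    return keyframes
```

The policy then conditions on only these detected frames rather than the full observation history.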
To account for VLM response latency, during training we mask out keyframes detected within the last 3 seconds.
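A minimal sketch of this masking step, assuming each detected keyframe carries a hypothetical `detected_at` timestamp in seconds:

```python
def mask_recent_keyframes(keyframes, current_time: float, latency_window: float = 3.0):
    """Drop keyframes detected within the last `latency_window` seconds, so the
    training-time context matches what would actually be available at deployment
    given the VLM's response latency."""
    return [k for k in keyframes if current_time - k["detected_at"] > latency_window]
```

Applying the same window at training and deployment keeps the keyframe context consistent with what the VLM can actually provide in real time.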
BPP achieves the highest success rates across all four real-world manipulation tasks.
| Task | Current Obs | Naive History | PTP | BPP (Ours) |
|---|---|---|---|---|
| Drawer Search | 11.1% | 0.0% | 0.0% | 33.3% |
| Marshmallows | 40.0% | 25.0% | 35.0% | 65.0% |
| Mug Replacement | 0.0% | 5.0% | 40.0% | 60.0% |
| Stacking Puzzle | 6.5% | 21.0% | 52.0% | 56.0% |
| Average | 14.4% | 12.8% | 31.8% | 53.6% |
BPP also outperforms baselines on simulation tasks requiring history conditioning.
Side-by-side comparisons across methods for each real-world task.
@article{sobolmark2026bpp,
title={Long-Context Robot Imitation Learning by Focusing on Key History Frames},
author={Sobol Mark, Max and Liang, Jacky and Attarian, Maria and Fu, Chuyuan and Dwibedi, Debidatta and Kumar, Aviral and Shah, Dhruv},
journal={arXiv preprint},
year={2026}
}