¹Google DeepMind   ²Carnegie Mellon University   ³Princeton University
Many useful robot tasks require attending to the history of past observations. For example, finding an item in a room requires remembering which places have already been searched. However, the best-performing robot policies typically condition only on the current observation, limiting their applicability to such tasks.
Naïvely conditioning on past observations often fails due to spurious correlations: policies latch onto incidental features of training histories that do not generalize to out-of-distribution trajectories upon deployment.
In this paper, we analyze why policies latch onto these spurious correlations. We find that the problem stems from limited coverage of the space of possible histories during training: this space grows exponentially with the task horizon, so training data covers only a small fraction of it. Existing regularization techniques provide inconsistent benefits across tasks because they do not fundamentally address this coverage problem.
Motivated by these findings, we propose Big Picture Policies (BPP), an approach that conditions on a minimal set of meaningful keyframes detected by a vision-language model. By projecting diverse rollouts onto a compact representation of task-relevant events, BPP substantially reduces the distribution shift between training and deployment without sacrificing expressivity. We evaluate BPP on four challenging real-world manipulation tasks and three simulation tasks, all of which require history conditioning. BPP achieves 70% higher success rates than the best-performing baseline in real-world evaluations.
Conditioning policies on subsampled past observations seems natural, but it leads to failures across diverse tasks. Policies latch onto spurious correlations in training histories that do not generalize to out-of-distribution rollouts at deployment time.
We ran simulation experiments to understand why this happens. To probe how well a policy understands the current history state, we train a separate prediction head on top of the policy's features, with a stop-gradient so the probe does not affect the policy, and compare its accuracy across settings. We find that methods conditioning on subsampled past histories predict the history state accurately on unseen but in-distribution trajectories, yet fail on trajectories from the policy's own rollouts, which are slightly out-of-distribution.
History understanding degrades on policy rollouts (out-of-distribution).
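For concreteness, below is a minimal sketch of how such a probe could look in PyTorch. The `policy_encoder` interface, the discrete `history_state` labels, and the batch format are illustrative assumptions, not the paper's exact setup.

```python
import torch
import torch.nn as nn


class HistoryStateProbe(nn.Module):
    """Linear probe that predicts a discrete history state (e.g., which
    drawers have already been searched) from the policy's internal features."""

    def __init__(self, feature_dim: int, num_history_states: int):
        super().__init__()
        self.head = nn.Linear(feature_dim, num_history_states)

    def forward(self, policy_features: torch.Tensor) -> torch.Tensor:
        # Stop-gradient: the probe trains on frozen policy features, so its
        # accuracy measures what the policy already encodes without changing
        # the policy itself.
        return self.head(policy_features.detach())


@torch.no_grad()
def probe_accuracy(probe: HistoryStateProbe, policy_encoder, batch) -> float:
    """Probe accuracy on a batch; compare held-out demonstrations against the
    policy's own rollouts to measure the out-of-distribution gap."""
    features = policy_encoder(batch["observations"])  # assumed encoder interface
    preds = probe(features).argmax(dim=-1)
    return (preds == batch["history_state"]).float().mean().item()
```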
Regularization techniques like Past Token Prediction (PTP) can help, but the problem persists in both history understanding and evaluation performance (see results below).
Regularization does not fix lack of coverage over possible histories.
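As a rough illustration of this family of regularizers, the sketch below adds an auxiliary loss that asks the policy to reconstruct past actions from its current representation. The `policy.encode`, `action_head`, and `past_head` interfaces are assumptions for illustration; the original PTP formulation may differ in its targets and weighting.

```python
import torch.nn as nn


def imitation_loss_with_past_prediction(policy, batch, aux_weight: float = 0.1):
    """Behavior cloning loss plus a PTP-style auxiliary term that reconstructs
    past actions from the current representation (sketch, not the exact method)."""
    features = policy.encode(batch["observations"])

    # Standard imitation term: predict the current/future actions.
    bc_loss = nn.functional.mse_loss(policy.action_head(features), batch["actions"])

    # Auxiliary term: reconstruct past actions, encouraging the representation
    # to retain history information instead of discarding it.
    past_loss = nn.functional.mse_loss(policy.past_head(features), batch["past_actions"])

    return bc_loss + aux_weight * past_loss
```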
The key idea of BPP is to train policies conditioned on a small set of behaviorally salient events instead of on raw histories. These salient events are detected by a VLM using a task-specific criterion and very simple prompting: for example, for the Drawer Search task we ask the VLM "has a drawer just been opened?". A schematic of the BPP pipeline is shown below.
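A minimal sketch of how such a keyframe detector might be wired up is shown below. The `query_vlm(image, prompt)` helper, the prompt table, and the per-frame query rate are illustrative assumptions, not the exact BPP implementation.

```python
# Task-specific yes/no criteria, one prompt per task (illustrative).
KEYFRAME_PROMPTS = {
    "drawer_search": "Has a drawer just been opened? Answer yes or no.",
}


def detect_keyframes(frames, task: str, query_vlm):
    """Return indices of frames the VLM flags as behaviorally salient."""
    prompt = KEYFRAME_PROMPTS[task]
    keyframes = []
    for t, frame in enumerate(frames):
        answer = query_vlm(frame, prompt)  # assumed helper returning text, e.g. "yes"
        if answer.strip().lower().startswith("yes"):
            keyframes.append(t)
    return keyframes
```

The policy then conditions on only these detected frames rather than the full observation history.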
To account for VLM response latency, during training we mask out keyframes detected within the last 3 seconds.
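A minimal sketch of this masking step, assuming each detected keyframe carries a hypothetical `detected_at` timestamp in seconds:

```python
def mask_recent_keyframes(keyframes, current_time: float, latency_window: float = 3.0):
    """Drop keyframes detected within the last `latency_window` seconds, so the
    training-time context matches what would actually be available at deployment
    given the VLM's response latency."""
    return [k for k in keyframes if current_time - k["detected_at"] > latency_window]
```

Applying the same window at training and deployment keeps the keyframe context consistent with what the VLM can actually provide in real time.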
BPP achieves the highest success rates across all four real-world manipulation tasks.
| Task | Current Obs | Naive History | PTP | BPP (Ours) |
|---|---|---|---|---|
| Drawer Search | 11.1% | 0.0% | 0.0% | 33.3% |
| Marshmallows | 40.0% | 25.0% | 35.0% | 65.0% |
| Mug Replacement | 0.0% | 5.0% | 40.0% | 60.0% |
| Stacking Puzzle | 6.5% | 21.0% | 52.0% | 56.0% |
| Average | 14.4% | 12.8% | 31.8% | 53.6% |
BPP also outperforms baselines on simulation tasks requiring history conditioning.
Side-by-side comparisons across methods for each real-world task.
@article{sobolmark2026bpp,
title={Long-Context Robot Imitation Learning by Focusing on Key History Frames},
author={Sobol Mark, Max and Liang, Jacky and Attarian, Maria and Fu, Chuyuan and Dwibedi, Debidatta and Kumar, Aviral and Shah, Dhruv},
journal={arXiv preprint},
year={2026}
}