Iterative Compositional Data Generation
for Robot Control

1University of Pennsylvania   2Stony Brook University

Transactions on Machine Learning Research (TMLR), 2026

Abstract

Collecting robotic manipulation data is expensive, making it impractical to acquire demonstrations for the combinatorially large space of tasks that arise in multi-object, multi-robot, and multi-environment settings.

We propose a semantic compositional diffusion transformer that factorizes transitions into robot-, object-, obstacle-, and objective-specific components and learns their interactions through attention. Trained on a limited subset of tasks, our model can zero-shot generate high-quality transitions from which we learn control policies for unseen task combinations. We then introduce an iterative self-improvement procedure in which synthetic data is validated via offline reinforcement learning and incorporated into subsequent training rounds. Our approach substantially improves zero-shot performance, ultimately solving nearly all held-out tasks and demonstrating the emergence of compositional structure in the learned representations.

Robotic tasks grow combinatorially, but collecting demonstrations still scales linearly. Given access to only a limited subset of tasks, we learn a semantic compositional diffusion transformer that generates expert-style transitions for unseen task combinations, then iteratively validates the synthetic data with offline RL and snowballs coverage across the task space, all without collecting new data.

Setting

To ground this idea, we use CompoSuite (Mendez et al., 2022), which provides 4×4×4×4 = 256 tasks by composing one element from each axis: robot, object, obstacle, and objective. Our main experiments study a 14-task training regime where the robot is fixed to IIWA (the remaining three axes vary), providing access to 4×4×4 = 64 possible task combinations. We evaluate our approach on 32 held-out tasks that are not used during training.

(a) IIWA, Box, None, PickPlace
(b) Jaco, Dumbbell, ObjectWall, Push
(c) Panda, Plate, ObjectDoor, Shelf
(d) Kinova3, Hollowbox, GoalWall, Trashcan

Four example CompoSuite tasks, each defined by selecting one element from each axis.

Task indicator encoding
Overview of the 16-dimensional task indicator encoding each task compositionally.
Training tasks
Visualization of the 14 training tasks (IIWA-only split, tasks 1-14).
Held-out test tasks
Visualization of the 32 held-out test tasks for zero-shot evaluation.
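The 16-dimensional task indicator can plausibly be built by concatenating one one-hot block per compositional axis (4 axes × 4 options = 16 dimensions). The sketch below assumes this concatenated one-hot layout; the axis element names are taken from the example tasks above, but the exact encoding details are an assumption, not the paper's reference implementation.

```python
import numpy as np

# Axis vocabularies from CompoSuite (4 options per axis).
AXES = {
    "robot": ["IIWA", "Jaco", "Panda", "Kinova3"],
    "object": ["Box", "Dumbbell", "Plate", "Hollowbox"],
    "obstacle": ["None", "ObjectWall", "ObjectDoor", "GoalWall"],
    "objective": ["PickPlace", "Push", "Shelf", "Trashcan"],
}

def task_indicator(robot, obj, obstacle, objective):
    """Concatenate one one-hot block per axis into a 16-dim vector
    (assumed layout; one block of 4 entries per compositional axis)."""
    parts = []
    for axis, choice in zip(AXES, (robot, obj, obstacle, objective)):
        one_hot = np.zeros(len(AXES[axis]))
        one_hot[AXES[axis].index(choice)] = 1.0
        parts.append(one_hot)
    return np.concatenate(parts)

vec = task_indicator("IIWA", "Box", "None", "PickPlace")
```

Under this layout, changing a single axis (e.g. swapping the object) flips exactly one block of the indicator, which is what lets the generator condition on unseen combinations of familiar components.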

How the method works

Generate data for unseen tasks with a semantic, compositional diffusion transformer

  • Transformers as compositional graphs: Instead of hard-coding task structure, which requires substantial domain knowledge and may be suboptimal, we propose to learn the graph structure directly from data. A transformer's self-attention naturally implements message passing over a graph where each component (robot, object, obstacle, objective, action, reward) is a node. By learning attention weights, the transformer discovers which components influence each other and learns the compositional structure automatically.
  • Semantic, compositional tokenization: We factorize transitions into component-specific tokens, where each element has its own encoder-decoder. Self-attention enables each component to attend to others at every step, implementing graph-compositional inference. Crucially, this ensures synthetic data generation only updates the components involved in each task, preventing cross-task corruption while enabling generalization to unseen combinations.
Semantic compositional transformer architecture
Visualization of our semantic compositional transformer architecture. We factorize each transition into state factors, actions, reward, and terminal indicators. Each state factor has its own encoder-decoder pair. The encoded tokens are processed by diffusion transformer layers, which use adaptive statistics conditioned on timestep and task indicators. Output tokens are decoded by factor-specific decoders.
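The tokenize-attend-decode pattern described above can be sketched as follows. This is a minimal numpy illustration, not the paper's architecture: the per-component dimensionalities are invented, single linear maps stand in for the learned encoder-decoder networks, and one attention step stands in for the full diffusion transformer with adaptive conditioning.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 32  # shared token width (illustrative)

# Hypothetical per-component input sizes.
factors = {"robot": 7, "object": 14, "obstacle": 14,
           "objective": 3, "action": 8, "reward": 1}

# One encoder/decoder pair per component (linear here for brevity).
enc = {k: rng.normal(size=(d, D)) / np.sqrt(d) for k, d in factors.items()}
dec = {k: rng.normal(size=(D, d)) / np.sqrt(D) for k, d in factors.items()}

def self_attention(tokens):
    """Single-head scaled dot-product attention over component tokens:
    message passing on a fully connected graph whose edge weights are
    the attention scores."""
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ tokens, weights

def denoise_step(transition):
    # Tokenize: each component gets its own token via its own encoder.
    tokens = np.stack([transition[k] @ enc[k] for k in factors])
    mixed, attn = self_attention(tokens)  # components exchange information
    # Decode each token with its factor-specific decoder.
    out = {k: mixed[i] @ dec[k] for i, k in enumerate(factors)}
    return out, attn

transition = {k: rng.normal(size=d) for k, d in factors.items()}
out, attn = denoise_step(transition)
```

The attention matrix `attn` is exactly the learned graph over components: inspecting its rows is what the intervention and attention analyses later in the page do at scale.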

Iterative compositional data generation

  1. Train the generator on real transitions from a fraction of tasks (~20% in our main setting) using component-wise tokenization so each robot, object, and scene factor has its own encoder–decoder.
  2. Generate synthetic rollouts for held-out task combinations conditioned on task indicators.
  3. Validate by training offline RL (TD3-BC) on generated data and rolling out short episodes; only datasets that pass a success-quality threshold enter the pool.
  4. Iterate: add admitted synthetic data to training and repeat, growing reliable coverage while avoiding cross-task corruption via localized component updates.
Method overview
Iterative Compositional Data Generation.
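The four steps above can be sketched as a control loop. The callables `fit_generator`, `generate`, and `offline_rl_success` are placeholders for the paper's components (diffusion-transformer training, conditional sampling, and TD3-BC evaluation rollouts respectively); their names and signatures are assumptions made for illustration.

```python
def iterative_generation(real_data, train_tasks, heldout_tasks,
                         fit_generator, generate, offline_rl_success,
                         threshold=0.5, rounds=4):
    """Skeleton of the iterative self-improvement procedure."""
    # Pool of admitted datasets, seeded with the real training tasks.
    pool = {t: real_data[t] for t in train_tasks}
    for _ in range(rounds):
        gen = fit_generator(pool)              # 1. train on current pool
        for task in heldout_tasks:
            if task in pool:
                continue                       # already admitted earlier
            synthetic = generate(gen, task)    # 2. conditional rollouts
            # 3. validate via offline RL; admit only above-threshold data.
            if offline_rl_success(task, synthetic) >= threshold:
                pool[task] = synthetic
        # 4. next round retrains the generator on the grown pool
    return pool
```

The key property is that only validated datasets re-enter training, so coverage snowballs without feeding the generator its own low-quality samples.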

Main results

Policies trained on our synthetic data outperform monolithic and standard DiT baselines and, after several refinement rounds, surpass strong multi-task RL baselines, all without collecting new real trajectories.

Zero-shot generalization

Zero-shot generalization
Zero-shot success rates for non-iterative RL baselines and for RL policies trained on synthetic data from each generative model at iteration 0, together with the best success rate after 4 iterations of self-improvement, evaluated on 32 held-out tasks. Overall, the semantic compositional architecture yields large improvements in both the RL and generative-model settings.

Iterative self-improvement (14 training tasks / 64 possible combinations)

Iterative self-improvement performance
Performance of different diffusion architectures over iterations of our self-improvement procedure. From left to right, top to bottom: best success rate, per-iteration success rate, best success rate separated by initial task difficulty, percentage of solved tasks, and dataset coverage across 4 iterations.

Environment interaction efficiency

Iterative compositional generation still needs simulator rollouts to score candidate policies, but these rollouts only evaluate datasets; they do not constitute new data collection or online learning.

We compare to RLPD (Ball et al., 2023), which combines offline and online RL, starting from the same 14-task expert pool and allowing up to 100k on-policy steps per held-out task. Our method reaches much higher success and return with only ~20k evaluation interactions per task after four refinement rounds, demonstrating that a small amount of structured evaluation can validate massive parallel synthetic generation (~1M model-generated transitions per task).

Note that our approach is not tied to this particular choice of data evaluation and any suitable scoring function that assesses the utility of generated data could be used instead.

Environment interaction efficiency
Environment interaction efficiency on 32 held-out tasks. Our approach achieves substantially higher success rate and return using only ~20k environment interactions vs RLPD's 100k.

Utility of rare successful synthetic trajectories

While our iterative generation process often produces datasets with high success rates, some tasks yield synthetic datasets with relatively low success rates. We show that even datasets containing only occasional successful trajectories provide valuable signal for downstream reinforcement learning. Using RLPD (Ball et al., 2023) initialized with the synthetic datasets that had the lowest non-zero success rates (only 8-12% offline RL success), we reach 80-100% success within 500,000 environment interactions, whereas standard online SAC fails even after 5 million interactions. Rare successful trajectories thus significantly accelerate online learning.

SAC vs RLPD with rare synthetic data
Learning curves comparing RLPD initialized with rare-success synthetic datasets (8% and 12% offline success) vs. online SAC trained without initial offline data. Despite extremely low success rates in the synthetic data, RLPD rapidly solves both tasks, demonstrating that even occasional successful trajectories provide valuable signal for online RL.
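A central mechanism in RLPD (Ball et al., 2023) is symmetric sampling: each training batch is drawn half from the offline buffer (here, the rare-success synthetic data) and half from the online replay buffer, so the few successful synthetic trajectories keep influencing every update. The sketch below illustrates that sampling scheme only; buffer representations and names are assumptions.

```python
import numpy as np

def symmetric_sample(offline_buffer, online_buffer, batch_size, rng):
    """Draw half of each batch from the (rare-success) synthetic data
    and half from online experience, as in RLPD-style symmetric sampling."""
    k = batch_size // 2
    off = [offline_buffer[i] for i in rng.integers(0, len(offline_buffer), k)]
    on = [online_buffer[i]
          for i in rng.integers(0, len(online_buffer), batch_size - k)]
    return off + on

rng = np.random.default_rng(0)
# Toy buffers: integers stand in for stored transitions.
batch = symmetric_sample(list(range(100)), list(range(100, 200)), 256, rng)
```

Because the offline half is sampled uniformly regardless of buffer size, even an 8% success-rate dataset contributes successful transitions to a constant fraction of every batch.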

What structure does the model learn?

Intervention tests and attention maps reveal emergent dependencies between task components that differ from prior hand-designed compositional architectures and remain stable across transformer depth and iterative training.

Intervention influence
Intervention influence showing how masking encoders affects outputs. Strong diagonal effects with cross-factor interactions.
Attention weights depth 1
Attention weights ordering: robot strongest, followed by goal, object, then obstacle.
Attention across iterations
Attention weights across transformer layers and iterations. The learned dependency structure remains stable with robot as strongest attention focus throughout.

Scaling to larger compositional task spaces

In a 56-task training regime (all four compositional axes vary, for 256 possible combinations in total), our method's performance remains consistent as the number of varied compositional axes increases.

Scaling to larger compositional spaces
Performance in the 56/256-task regime. From left to right, top to bottom: best success rate, per-iteration success rate, best success rate separated by initial task difficulty, percentage of solved tasks, and dataset coverage across 4 iterations. Semantic compositional model consistently achieves higher success and coverage while maintaining data quality.

When does monolithic generation fall short?

Monolithic score gap
Performance gap vs. ground-truth data as the training-task count grows; the gap widens sharply in the low-data regime. At 56/256 tasks (roughly 20% of all tasks), the monolithic model is unable to zero-shot generalize meaningfully. This is a data regime similar to the 14/64 IIWA-only split investigated in our main experiments.

BibTeX

@article{pham2026iterative,
  title   = {Iterative Compositional Data Generation for Robot Control},
  author  = {Pham, Anh-Quan and Hussing, Marcel and Patankar, Shubhankar P. and Bassett, Dani S. and Mendez-Mendez, Jorge and Eaton, Eric},
  journal = {Transactions on Machine Learning Research},
  year    = {2026}
}

Please use the camera-ready citation from OpenReview / the published TMLR entry when available.