Transactions on Machine Learning Research (TMLR), 2026
Collecting robotic manipulation data is expensive, making it impractical to acquire demonstrations for the combinatorially large space of tasks that arise in multi-object, multi-robot, and multi-environment settings.
We propose a semantic compositional diffusion transformer that factorizes transitions into robot-, object-, obstacle-, and objective-specific components and learns their interactions through attention. Trained on a limited subset of tasks, our model can zero-shot generate high-quality transitions from which we learn control policies for unseen task combinations. We then introduce an iterative self-improvement procedure in which synthetic data is validated via offline reinforcement learning and incorporated into subsequent training rounds. Our approach substantially improves zero-shot performance, ultimately solving nearly all held-out tasks and demonstrating the emergence of compositional structure in the learned representations.
Robotic task spaces grow combinatorially, but demonstration collection still scales linearly. Given access to only a limited subset of tasks, we learn a semantic compositional diffusion transformer that generates expert-style transitions for unseen task combinations - then iteratively validates the synthetic data with offline RL and snowballs coverage across the task space - all without new data collection.
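The generate-validate-retrain cycle described above can be sketched as a simple loop. This is a minimal illustration, not the paper's implementation: all function names (`generate_transitions`, `offline_rl_score`, `retrain`) and the acceptance threshold are hypothetical placeholders standing in for the diffusion transformer, the offline-RL validation step, and the next round of training.

```python
# Hedged sketch of the iterative self-improvement loop. Every function here
# is an illustrative stub, not the paper's actual API.

def generate_transitions(model, task, n):
    # Stand-in for sampling transitions from the compositional
    # diffusion transformer; returns dummy records for illustration.
    return [{"task": task, "idx": i, "model": model} for i in range(n)]

def offline_rl_score(dataset):
    # Stand-in for validating a synthetic dataset via offline RL
    # evaluation; a real scorer would return an estimated success rate.
    return 1.0 if dataset else 0.0

def retrain(model, pool):
    # Stand-in for retraining the generator on the expanded data pool.
    return model

def self_improve(model, train_tasks, heldout_tasks, rounds=4, threshold=0.5):
    # Start from data for the available training tasks only.
    pool = {t: generate_transitions(model, t, 10) for t in train_tasks}
    for _ in range(rounds):
        for task in heldout_tasks:
            synthetic = generate_transitions(model, task, 10)
            # Only validated synthetic datasets join the training pool,
            # so coverage snowballs across the task space over rounds.
            if offline_rl_score(synthetic) >= threshold:
                pool[task] = synthetic
        model = retrain(model, pool)
    return model, pool
```

The key design point mirrored here is that the loop never collects new real trajectories: the pool only grows through model-generated data that passes validation.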
To ground this idea, we use CompoSuite (Mendez et al., 2022), which provides 4×4×4×4 = 256 tasks by composing one element from each axis: robot, object, obstacle, and objective. Our main experiments study a 14-task training regime where the robot is fixed to IIWA (the remaining three axes vary), providing access to 4×4×4 = 64 possible task combinations. We evaluate our approach on 32 held-out tasks that are not used during training.
Robot     Object     Obstacle    Objective
IIWA      Box        None        PickPlace
Jaco      Dumbbell   ObjectWall  Push
Panda     Plate      ObjectDoor  Shelf
Kinova3   Hollowbox  GoalWall    Trashcan

Four example CompoSuite tasks, each defined by selecting one element from each axis.
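The compositional task space above is just the Cartesian product of the four axes, which makes the task counts easy to verify. A small sketch, using the axis elements listed on this page:

```python
from itertools import product

# The four CompoSuite axes, with the elements shown in the table above.
robots     = ["IIWA", "Jaco", "Panda", "Kinova3"]
objects_   = ["Box", "Dumbbell", "Plate", "Hollowbox"]
obstacles  = ["None", "ObjectWall", "ObjectDoor", "GoalWall"]
objectives = ["PickPlace", "Push", "Shelf", "Trashcan"]

# Every task is one choice per axis: 4 x 4 x 4 x 4 = 256 tasks.
all_tasks = list(product(robots, objects_, obstacles, objectives))

# Fixing the robot to IIWA (the main 14-task training regime) leaves
# the 4 x 4 x 4 = 64 possible task combinations studied there.
iiwa_tasks = [t for t in all_tasks if t[0] == "IIWA"]
```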
Policies trained on our synthetic data outperform monolithic and standard DiT baselines and, after several refinement rounds, surpass strong multi-task RL baselines - without collecting new real trajectories.
Iterative compositional generation still requires simulator rollouts to score candidate policies, but these rollouts only evaluate datasets; they do not constitute new data collection or online learning.
We compare to RLPD (Ball et al., 2023) (offline + online RL) starting from the same 14-task expert pool, allowing up to 100k on-policy steps per held-out task. Our method reaches much higher success and return with only ~20k evaluation interactions per task after four refinement rounds - demonstrating that a small amount of structured evaluation can validate massive parallel synthetic rollouts (~1M model-generated transitions per task).
Note that our approach is not tied to this particular choice of data evaluation and any suitable scoring function that assesses the utility of generated data could be used instead.
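Since the approach only assumes some function that scores the utility of a generated dataset, the validation step can be framed as a pluggable interface. A minimal sketch, assuming a toy success-rate scorer (the names `ScoreFn`, `success_rate_score`, and `accept` are illustrative, not from the paper; offline-RL evaluation would be one concrete instance of `ScoreFn`):

```python
from typing import Callable, Sequence

# A scoring function maps a synthetic dataset to a utility estimate in [0, 1].
ScoreFn = Callable[[Sequence[dict]], float]

def success_rate_score(dataset: Sequence[dict]) -> float:
    """Toy scorer: fraction of transitions flagged as successful."""
    if not dataset:
        return 0.0
    return sum(1 for tr in dataset if tr.get("success")) / len(dataset)

def accept(dataset: Sequence[dict], score_fn: ScoreFn,
           threshold: float = 0.5) -> bool:
    """Keep a synthetic dataset only if its estimated utility clears
    the threshold; any ScoreFn can be swapped in here."""
    return score_fn(dataset) >= threshold
```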
While our iterative generation process often produces datasets with high success rates, some tasks yield synthetic datasets with relatively low success rates. We show that even datasets containing only occasional successful trajectories provide valuable signal for downstream reinforcement learning. Using RLPD (Ball et al., 2023) initialized with the synthetic datasets that have the lowest non-zero success rates (only 8-12% offline RL success), we reach 80-100% success within 500,000 environment interactions, whereas standard online SAC fails even after 5 million interactions - demonstrating that rare successful trajectories substantially accelerate online learning.
Intervention tests and attention maps show emergent dependencies between task components - different from prior hand-designed compositional stacks and stable across depth and iterative training.
In a 56-task training regime (all four compositional axes varied, 256 possible combinations in total), performance scales consistently as the number of axes increases.
@article{pham2026iterative,
title = {Iterative Compositional Data Generation for Robot Control},
author = {Pham, Anh-Quan and Hussing, Marcel and Patankar, Shubhankar P. and Bassett, Dani S. and Mendez-Mendez, Jorge and Eaton, Eric},
journal = {Transactions on Machine Learning Research},
year = {2026}
}
Please use the camera-ready citation from OpenReview / the published TMLR entry when available.