Our framework efficiently scales up the generation of language-labeled robot data and effectively distills this data down into a robust, multi-task, language-conditioned visuomotor policy.
For scaling up data generation, we use a language model for high-level planning and sampling-based robot planners to generate rich and diverse manipulation trajectories (b). To robustify this data-collection process, the language model also infers a code snippet for each task's success condition, which lets the data-collection process detect failures, retry, and automatically label trajectories with success/failure (c).
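As a rough sketch of how these pieces fit together (all function names below are illustrative placeholders, not our codebase's API), the data-collection loop plans with the language model, executes with sampling-based planners, and uses the inferred success condition to retry and label each episode:

```python
# Hypothetical sketch of the language-guided data-collection loop.
# Function and attribute names are illustrative, not the actual codebase API.

def collect_episode(env, task_description, llm, planner, max_retries=3):
    # The language model decomposes the task into high-level steps over
    # the planner's primitives.
    plan = llm.generate_plan(task_description, api_docs=planner.api_docs)

    # It also writes a small success-checking snippet over privileged sim state.
    success_fn = llm.generate_success_condition(task_description)

    trajectory = []
    for attempt in range(max_retries):
        for step in plan:
            # Each step is grounded by a sampling-based planner (e.g. grasp
            # sampling + motion planning), which returns low-level commands.
            trajectory += planner.execute(env, step)

        if success_fn(env.get_sim_state()):
            return trajectory, True   # label: success
        # Failure detected: keep the experience and retry from the current state.
    return trajectory, False          # label: failure
```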
For distilling this data down into a policy for real-world deployment (d), we extend Diffusion Policy, a single-task behavior-cloning approach, to multi-task settings with language conditioning.
We use a language model to predict each task's success condition code snippet, which allows the robot to retry failed tasks.
The result is demonstrations of robust behavior that teach the policy to recover from failed attempts, yielding more successful trajectories when the policy is given more time.
On their own, language-model planners are limited in their ability to perform rich, 6 DoF manipulation: much of what a robotic system needs to understand, like geometry and articulation structure, is difficult to describe in natural language. That is where sampling-based planners come in.
| Approach | 6 DoF Manipulation | Common-sense | No Sim State |
|---|---|---|---|
| Sampling-based Planners | ✔ | ✘ | ✘ |
| LLM Planners | ✘ | ✔ | ✔ |
| Our Data Generation | ✔ | ✔ | ✘ |
| Our Policy | ✔ | ✔ | ✔ |
We introduce a new multi-task benchmark to test long-horizon behavior, common-sense reasoning, tool-use, and intuitive physics. Running our language-guided skill learning framework in the benchmark gives an unbounded amount of language-labeled robot experience.
Using domain randomization, our diffusion policy can be deployed on a real robot with no fine-tuning.
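As a loose illustration (simulator attribute names and parameter ranges below are hypothetical), domain randomization during data collection can be as simple as perturbing appearance and camera pose every episode:

```python
import numpy as np

def randomize_visuals(sim, rng=None):
    # Hypothetical hooks on a simulator wrapper; attribute names and
    # parameter ranges are illustrative placeholders.
    rng = rng or np.random.default_rng()
    sim.set_light_direction(rng.uniform(-1.0, 1.0, size=3))
    sim.set_table_color(rng.uniform(0.0, 1.0, size=3))
    for obj in sim.objects:
        obj.set_texture(rng.choice(sim.texture_pool))
    # Small camera pose jitter so the policy tolerates imperfect calibration.
    sim.camera.jitter_pose(position_std=0.01, rotation_std_deg=2.0)
```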
@inproceedings{ha2023scalingup,
title={Scaling Up and Distilling Down: Language-Guided Robot Skill Acquisition},
author={Huy Ha and Pete Florence and Shuran Song},
year={2023},
booktitle={Proceedings of the 2023 Conference on Robot Learning},
}
If you have any questions, please contact Huy Ha.
The framework uses privileged simulation state information for the data-generation process, which is why the language model can infer good reward functions from simulation contact and joint information. While we have demonstrated a domain-randomized policy on a real-world transport task, there is still room to improve Sim2Real transfer; this is an exciting challenge and our current main focus.
At data collection time, the language model also predicts a success condition used to label its experience with success or failure. The distilled policy filters the replay buffer using this automatically generated success label, learning from only successful experiences.
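For concreteness, here is a hypothetical example of the kind of success condition the language model might write, and of how distillation filters the replay buffer with the resulting labels (names are illustrative, not our exact API):

```python
def check_success(sim_state) -> bool:
    # Hypothetical success condition the language model might write for a
    # "place the block in the bin" task, over privileged simulation state.
    block_pos = sim_state.get_position("block")
    bin_bounds = sim_state.get_bounding_box("bin")
    return bin_bounds.contains(block_pos)

def successful_episodes(replay_buffer):
    # Distillation keeps only episodes this condition labeled as successful.
    return [ep for ep in replay_buffer if ep.success]
```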
Tasks in our benchmark are contact-rich and require fine-grained, 6 DoF behavior to solve. Instead of getting language models to output actions for such tasks directly, we use them for high-level planning over API calls to sampling-based planners, such as rapidly-exploring random trees and grasp samplers.
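Concretely, the language model only writes short, high-level programs against a small planner API; a hypothetical generated plan for a pick-and-place step might look like the following (primitive names are illustrative, not the exact codebase API):

```python
def pick_and_place_mug(planner):
    # Hypothetical LLM-generated plan over a small planner API; sample_grasp,
    # sample_placement, and plan_motion (RRT-style) are illustrative primitives.
    grasp = planner.sample_grasp(obj="mug")                 # 6 DoF grasp pose
    planner.execute(planner.plan_motion(target=grasp))      # collision-free approach
    planner.close_gripper()

    placement = planner.sample_placement(obj="mug", region="dish_rack")
    planner.execute(planner.plan_motion(target=placement))  # carry to placement pose
    planner.open_gripper()
```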
The result is a data-generation approach that combines the best of both worlds: the low-level geometric reasoning and diverse trajectories of sampling-based planners, and the flexibility of a language model.
Our policy builds on Diffusion Policy, a behavior-cloning approach for learning from diverse, multi-modal demonstrations. Each action inference is sampled from a denoising diffusion process over action sequences. The action sequence samples are visualized here as lines, where blue marks the start of the action sequence and red marks the end.
You can generate them yourself too! Check out our codebase for visualization instructions.
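For intuition, sampling one action sequence follows a standard denoising loop over the whole sequence. The sketch below assumes a diffusers-style scheduler interface and hypothetical network/encoder names, not our exact implementation:

```python
import torch

@torch.no_grad()
def sample_action_sequence(noise_pred_net, obs_emb, lang_emb, scheduler,
                           horizon=16, action_dim=7, device="cpu"):
    # Start from pure Gaussian noise over an entire action sequence.
    actions = torch.randn(1, horizon, action_dim, device=device)
    # Condition on both the visual observation and the language instruction.
    cond = torch.cat([obs_emb, lang_emb], dim=-1)

    # Iteratively denoise; each step removes a bit of noise from the sequence.
    for t in scheduler.timesteps:
        noise_pred = noise_pred_net(actions, t, cond)
        actions = scheduler.step(noise_pred, t, actions).prev_sample
    return actions  # the denoised action sequence to execute
```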
These prior works use language models as zero-shot planners and policies, which bounds their inference-time performance by the language model's planning robustness. It also means they do not improve with more experience.
In contrast, our approach uses language models as zero-shot data collection policies, supplied with an API to sampling-based robot planners. The generated data is then distilled into a robust, multi-task visuomotor policy, which performs better than its data collection policy.