Scaling Up and Distilling Down

Language-Guided Robot Skill Acquisition

Huy Ha¹, Pete Florence², Shuran Song¹,³

¹Columbia University, ²Google DeepMind, ³Stanford University

www.cs.columbia.edu/~huy/scalingup/

Skill Learning

a large set of reusable and robust skills

plan

novel scenarios and new tasks

How can we scalably acquire robot skills?

Behavior Cloning

BC-Z, Jang et al, CoRL 2021.

Diffusion Policy, Chi et al, RSS 2023.

✅ reliably produces robust robot skills

❌ reliance on human demonstration collection

Scalable Skill Learning

Reinforcement Learning

Levine et al, ISER 2017

Gu et al, ICRA 2017

Morgan et al, ICRA 2021

Qin et al, CoRL 2022

✅ automatic data collection and policy learning

❌ exploration in sparse reward and long-horizon

How can we get a Unified Framework for Scalable Skill Learning?

Language-guided Robot Data Generation & Language-conditioned Robot Policy Learning

Scale up Language-labelled Robot Data Generation

✅ rich manipulation skills

✅ flexibility to novel tasks & domains

sampling-based planners generate diverse robot behavior

(succeed some of the time)

Verify-and-Retry

def bus_is_balanced_on_the_block(state) -> bool:
    return is_on_top_of("bus", "block")

✅ Increases success rate

✅ Demonstrates retrying behavior
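
The verify-and-retry loop can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the system's actual implementation: `attempt_task`, `verify_and_retry`, and the toy dictionary state are hypothetical names, the random success model stands in for a sampling-based planner that only succeeds some of the time, and `is_on_top_of` here takes the state explicitly, unlike the version on the slide.

```python
import random
from typing import Callable, Dict, List, Tuple

State = Dict[str, str]  # hypothetical toy state: object name -> what it rests on


def is_on_top_of(state: State, top: str, bottom: str) -> bool:
    """Toy stand-in for a simulator query (assumed, not the real API)."""
    return state.get(top) == bottom


def attempt_task(state: State, rng: random.Random, p_success: float) -> State:
    """Hypothetical planner attempt that succeeds with probability p_success."""
    new_state = dict(state)
    new_state["bus"] = "block" if rng.random() < p_success else "table"
    return new_state


def verify_and_retry(
    state: State,
    success_fn: Callable[[State], bool],
    rng: random.Random,
    p_success: float = 0.3,
    max_attempts: int = 10,
) -> Tuple[State, bool, List[bool]]:
    """Attempt, check the inferred success condition, and retry on failure.

    Returns the final state, whether the task succeeded, and the attempt log.
    """
    log: List[bool] = []
    for _ in range(max_attempts):
        state = attempt_task(state, rng, p_success)
        ok = success_fn(state)
        log.append(ok)
        if ok:
            return state, True, log
    return state, False, log


final_state, succeeded, log = verify_and_retry(
    {"bus": "table"},
    success_fn=lambda s: is_on_top_of(s, "bus", "block"),
    rng=random.Random(0),
)
print(succeeded, log)
```

Every entry in the log before the last is a failed attempt followed by a retry, which is exactly the behavior the generated trajectories end up demonstrating.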

Distill Down to a Language-conditioned Visuo-motor Policy

Evaluation

High Entropy, Precise Actions

Distilled Common Sense

Data Generation Results

Distilled Retrying Behavior

Here's a plot showing success rate as a function of how much time each episode is given.
When our data generation policy's grasp or placement attempts fail, it detects this with its inferred success condition and retries. This means the more time you give it, the more successful it will be, giving this monotonically increasing line.
This does not just lead to higher success rates but also, crucially, *demonstrates* retrying behavior to the distilled policy. And indeed, we not only see that the distilled policy successfully inherits this behavior from its data, but also *improves upon* it thanks to success filtering.
In contrast, without verify & retry, the data collection policy achieves low success rates, halting after the first attempt. So, even when you take only the successful trajectories from such a data collection policy and distill out a policy, it inherits this brittle open-loop behavior.
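
One way to see why the retrying curve keeps increasing: under the simplifying assumption (mine, not the talk's) that each attempt succeeds independently with a fixed probability p, the chance of at least one success within k attempts is 1 - (1 - p)^k. The function name and the value p = 0.3 below are purely illustrative.

```python
def success_within(p: float, attempts: int) -> float:
    """Probability of at least one success in `attempts` independent tries."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    return 1.0 - (1.0 - p) ** attempts


p = 0.3  # assumed per-attempt success probability
curve = [success_within(p, k) for k in range(1, 6)]
no_retry = success_within(p, 1)  # a policy that halts after the first attempt

for k, rate in enumerate(curve, start=1):
    print(f"{k} attempt(s): {rate:.3f}")
```

Without retrying, the success rate stays at the single-attempt value no matter how much time is given, while each extra attempt the time budget allows pushes the retrying curve higher.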
Questions

Distilled Policy Results

Here, we compare against BC-Z and vary all policy learning configurations, including policy output design choices such as action representation (absolute or delta) and action sequence execution and prediction horizons.

The single most important design choice seems to be how the action is generated. The BC-Z policy's action generation is a feed-forward multi-layer perceptron, trained with a Huber loss. When learning from the diverse data generated by the sampling-based planners, it never achieved more than a 35.5% success rate.
Meanwhile, our policy's action generation is a pseudo-random diffusion process, trained with a denoising loss. This allows it to handle high-entropy, multi-modal data much better.
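
A toy illustration of why a single-output regressor struggles with multi-modal data. Suppose the demonstrations go around an obstacle either to the left (action -1) or to the right (action +1), equally often. This is an assumed example, not the experiment from the talk; it finds the constant prediction that minimizes the mean Huber loss by a simple grid search.

```python
def huber(residual: float, delta: float = 1.0) -> float:
    """Huber loss: quadratic near zero, linear beyond `delta`."""
    a = abs(residual)
    return 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)


# Assumed bimodal demonstration data: half go left, half go right.
actions = [-1.0] * 50 + [1.0] * 50


def mean_huber(prediction: float) -> float:
    """Average Huber loss of one constant prediction over all demonstrations."""
    return sum(huber(prediction - a) for a in actions) / len(actions)


# Grid search over candidate constant predictions in [-1.5, 1.5].
candidates = [i / 100 for i in range(-150, 151)]
best = min(candidates, key=mean_huber)
print(best, mean_huber(best), mean_huber(1.0))
```

The loss-minimizing prediction sits between the two modes, at an action that never appears in the data. A policy that samples from the action distribution, as a diffusion process does, can instead commit to one of the modes.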
Questions

High Distilled Performance from Diverse Attempts

Verify & Retry

🚀 High data generation success rate

🔄 Retry after failure

Diverse Attempts

🎲 Sampling-based planners

🧠 Diffusion Policy

Real-world Deployment With No Fine-tuning

A Framework for Scalable Skill Learning

The Language-guided Automation Recipe

Language Models

✅ high-level flexibility

guide

External Tools

✅ low-level heavy-lifting

Google Bard, Google Blog

OpenAI GPT-4, Unite.AI

A Robot Skill Learning Workflow

Language-guided Robot Learning

Language to Rewards for Robotic Skill Synthesis, Yu et al

TidyBot, Wu et al, IROS 2023

Infinigen, Raistrick et al, CVPR 2023

How can we put robotics on the same scaling trends as large language models without compromising on robust manipulation and control?

Future Work

Limitations

Sim2Real. Online adaptation to novel visual & physical domains.
Trajectory Generation. More embodiments and tasks.
Asset/Environment Design. Procedural & learned approaches.

Opportunities

Policy Scalability. A trade-off between inductive biases and expressivity.
Data Investigation. Fix the algorithm, control data generation.

This work is not without limitations. I think there is still a lot to do on the Sim2Real front beyond domain randomization, to enable better online adaptation to novel domains, both visually and physically.
Further, which robot tasks and embodiments we can support is determined by which external tools we give to the language model. On the trajectory generation front, MPC might be all we need to extend this to more embodiments, such as bipeds and quadrupeds, and to dexterous, dynamic tasks.
After these two points, the last bottleneck of the system is asset and environment design, for which both state-of-the-art procedural and learned approaches hold a lot of promise.
Personally, though, I'm extremely excited about what automatic robot data generation as a tool enables.
We can now ask questions about policy learning scalability. When Vision Transformers came out, they outperformed Convolutional Neural Networks. It turns out that the convolutional inductive bias only helped in the low-data regime. So how much of the current state-of-the-art ranking of robot policy learning methods is an artifact of the low-data regime? Rather than hoping that our findings extrapolate to the high-data regime, we can now test them.
Lastly, with the opportunity to fully control our data generation process, we can gain a much better understanding of robot data. In this project, I've seen that just including retrying behavior helps performance significantly. If we fix the algorithm and dataset size, but include or remove some behaviors from the dataset, how does that affect downstream performance? In general, what makes good robot data?

Domain Randomization

Full Sim Results

Policy Ablations

Language Model Ablations

Error Analysis
