8 Papers Accepted to CoRL 2024

Researchers from the department presented their work at the Conference on Robot Learning (CoRL) in Munich, Germany. Since its inception in 2017, CoRL has showcased pioneering research and innovative applications at the intersection of robotics and machine learning.


D³Fields: Dynamic 3D Descriptor Fields for Zero-Shot Generalizable Rearrangement
Yixuan Wang Columbia University, Mingtong Zhang University of Illinois, Urbana-Champaign, Zhuoran Li National University of Singapore, Tarik Kelestemur Boston Dynamics AI Institute, Katherine Rose Driggs-Campbell University of Illinois, Urbana-Champaign, Jiajun Wu Stanford University, Li Fei-Fei Stanford University, Yunzhu Li Columbia University

Abstract:
Scene representation is a crucial design choice in robotic manipulation systems. An ideal representation is expected to be 3D, dynamic, and semantic to meet the demands of diverse manipulation tasks. However, previous works rarely satisfy all three properties simultaneously. In this work, we introduce D³Fields: dynamic 3D descriptor fields. These fields are implicit 3D representations that take in 3D points and output semantic features and instance masks. They can also capture the dynamics of the underlying 3D environments. Specifically, we project arbitrary 3D points in the workspace onto multi-view 2D visual observations and interpolate features derived from visual foundation models. The resulting fused descriptor fields allow for flexible goal specifications using 2D images with varied contexts, styles, and instances. To evaluate the effectiveness of these descriptor fields, we apply our representation to rearrangement tasks in a zero-shot manner. Through extensive evaluation in both real-world and simulated settings, we demonstrate that D³Fields are effective for zero-shot generalizable rearrangement tasks. We also compare D³Fields with state-of-the-art implicit 3D representations and show significant improvements in effectiveness and efficiency. Project page: https://robopil.github.io/d3fields/
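
The fusion step described above, projecting workspace points into each camera view and interpolating per-pixel features, lends itself to a compact sketch. The snippet below is a minimal NumPy illustration under assumed pinhole intrinsics/extrinsics and a nearest-pixel lookup; it is not the authors' implementation, and all names (fuse_descriptors, feat_maps, and so on) are placeholders.

```python
import numpy as np

def fuse_descriptors(points, feat_maps, intrinsics, extrinsics):
    """Project 3D workspace points into each camera and average the interpolated
    2D features across views (nearest-pixel lookup here; the real system
    interpolates foundation-model features)."""
    n_views, H, W, C = feat_maps.shape
    fused = np.zeros((len(points), C))
    counts = np.zeros(len(points))
    pts_h = np.concatenate([points, np.ones((len(points), 1))], axis=1)  # homogeneous coords
    for v in range(n_views):
        cam_pts = (extrinsics[v] @ pts_h.T).T            # world -> camera frame, (N, 3)
        pix = (intrinsics[v] @ cam_pts.T).T              # camera -> image plane
        uv = pix[:, :2] / pix[:, 2:3]
        valid = ((cam_pts[:, 2] > 0) & (uv[:, 0] >= 0) & (uv[:, 0] < W - 1)
                 & (uv[:, 1] >= 0) & (uv[:, 1] < H - 1))
        u = np.round(uv[valid, 0]).astype(int)
        vpix = np.round(uv[valid, 1]).astype(int)
        fused[valid] += feat_maps[v, vpix, u]
        counts[valid] += 1
    return fused / np.maximum(counts[:, None], 1)

# Toy usage: 2 views, 4x4 feature maps with 8 channels, 5 query points.
rng = np.random.default_rng(0)
pts = rng.uniform(-0.1, 0.1, size=(5, 3)) + np.array([0.0, 0.0, 1.0])
feats = rng.normal(size=(2, 4, 4, 8)).astype(np.float32)
K = np.tile(np.array([[2.0, 0, 2.0], [0, 2.0, 2.0], [0, 0, 1.0]]), (2, 1, 1))
E = np.tile(np.eye(4)[:3], (2, 1, 1))                    # both cameras at the origin
print(fuse_descriptors(pts, feats, K, E).shape)          # (5, 8)
```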


3D-ViTac: Learning Fine-Grained Manipulation with Visuo-Tactile Sensing
Binghao Huang Columbia University, Yixuan Wang Columbia University, Xinyi Yang University of Illinois, Urbana-Champaign, Yiyue Luo University of Washington, Yunzhu Li Columbia University

Abstract:
Tactile and visual perception are both crucial for humans to perform fine-grained interactions with their environment. Developing similar multi-modal sensing capabilities for robots can significantly enhance and expand their manipulation skills. This paper introduces 3D-ViTac, a multi-modal sensing and learning system designed for dexterous bimanual manipulation. Our system features tactile sensors equipped with dense sensing units, each covering an area of 3 mm². These sensors are low-cost and flexible, providing detailed and extensive coverage of physical contacts, effectively complementing visual information. To integrate tactile and visual data, we fuse them into a unified 3D representation space that preserves their 3D structures and spatial relationships. The multi-modal representation can then be coupled with diffusion policies for imitation learning. Through concrete hardware experiments, we demonstrate that even low-cost robots can perform precise manipulations and significantly outperform vision-only policies, particularly in safe interactions with fragile items and in long-horizon tasks involving in-hand manipulation. Our project page is available at https://binghao-huang.github.io/3D-ViTac/.
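
As a rough picture of the unified 3D representation described above, the sketch below converts a grid of tactile readings into 3D points and concatenates them with a visual point cloud. The function name, the single pressure channel, the cell pitch, and the pad-pose convention are illustrative assumptions, not details from the paper.

```python
import numpy as np

def fuse_visuo_tactile(visual_pts, tactile_grid, pad_pose, cell=0.003):
    """Merge a visual point cloud with tactile readings into one 3D point set.
    Each tactile cell (cell = assumed pitch in meters) becomes a 3D point at its
    location on the sensor pad, carrying its pressure value; visual points carry
    zero pressure. A toy stand-in for the unified representation described above."""
    rows, cols = tactile_grid.shape
    ys, xs = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    local = np.stack([xs * cell, ys * cell, np.zeros_like(xs, dtype=float)],
                     axis=-1).reshape(-1, 3)
    world = (pad_pose[:3, :3] @ local.T).T + pad_pose[:3, 3]     # pad frame -> world
    tactile_pts = np.concatenate([world, tactile_grid.reshape(-1, 1)], axis=1)
    visual = np.concatenate([visual_pts, np.zeros((len(visual_pts), 1))], axis=1)
    return np.concatenate([visual, tactile_pts], axis=0)         # rows of (x, y, z, pressure)

# Toy usage: 100 visual points plus a 16x16 tactile pad at the origin.
cloud = np.random.rand(100, 3)
pad = np.random.rand(16, 16)
print(fuse_visuo_tactile(cloud, pad, np.eye(4)).shape)           # (356, 4)
```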


RoboEXP: Action-Conditioned Scene Graph via Interactive Exploration for Robotic Manipulation
Hanxiao Jiang Columbia University, Binghao Huang Columbia University, Ruihai Wu Peking University, Zhuoran Li National University of Singapore, Shubham Garg Amazon, Hooshang Nayyeri Amazon, Shenlong Wang University of Illinois, Urbana-Champaign, Yunzhu Li Columbia University

Abstract:
We introduce the novel task of interactive scene exploration, wherein robots autonomously explore environments and produce an action-conditioned scene graph (ACSG) that captures the structure of the underlying environment. The ACSG accounts for both low-level information (geometry and semantics) and high-level information (action-conditioned relationships between different entities) in the scene. To this end, we present the Robotic Exploration (RoboEXP) system, which incorporates a Large Multimodal Model (LMM) and an explicit memory design to enhance our system’s capabilities. The robot reasons about what and how to explore an object, accumulating new information through the interaction process and incrementally constructing the ACSG. Leveraging the constructed ACSG, we illustrate the effectiveness and efficiency of our RoboEXP system in facilitating a wide range of real-world manipulation tasks involving rigid objects, articulated objects, nested objects, and deformable objects. Project page: https://jianghanxiao.github.io/roboexp-web/
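
To make the action-conditioned scene graph concrete, here is a minimal sketch of such a data structure: nodes carry low-level geometry and semantics, while edges record relations together with the action that makes them observable (for example, a spoon inside a drawer that only appears after opening the drawer). The class and field names are illustrative, not the RoboEXP implementation.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    """An object in the scene with low-level attributes (geometry and semantics)."""
    name: str
    semantic_label: str
    bbox: tuple                      # placeholder geometry (x, y, z, dx, dy, dz)

@dataclass
class Edge:
    """A relation between two objects, optionally conditioned on the action
    that makes it observable (None means directly observable)."""
    src: str
    dst: str
    relation: str
    enabling_action: Optional[str] = None

@dataclass
class ACSG:
    """Action-conditioned scene graph, built up incrementally during exploration."""
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_node(self, node: Node):
        self.nodes[node.name] = node

    def add_edge(self, edge: Edge):
        self.edges.append(edge)

# Toy example mirroring the kind of structure described above (names are illustrative).
g = ACSG()
g.add_node(Node("drawer_1", "drawer", (0.0, 0.0, 0.0, 0.4, 0.5, 0.2)))
g.add_node(Node("spoon_1", "spoon", (0.1, 0.1, 0.05, 0.02, 0.15, 0.02)))
g.add_edge(Edge("spoon_1", "drawer_1", "inside", enabling_action="open(drawer_1)"))
print([(e.src, e.relation, e.dst, e.enabling_action) for e in g.edges])
```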


Dynamic 3D Gaussian Tracking for Graph-Based Neural Dynamics Modeling
Mingtong Zhang University of Illinois, Urbana-Champaign, Kaifeng Zhang Columbia University, Yunzhu Li Columbia University

Abstract:
Videos of robots interacting with objects encode rich information about the objects’ dynamics. However, existing video prediction approaches typically do not explicitly account for the 3D information from videos, such as robot actions and objects’ 3D states, limiting their use in real-world robotic applications. In this work, we introduce a framework to learn object dynamics directly from multi-view RGB videos by explicitly considering the robot’s action trajectories and their effects on scene dynamics. We utilize the 3D Gaussian representation of 3D Gaussian Splatting (3DGS) to train a particle-based dynamics model using Graph Neural Networks. This model operates on sparse control particles downsampled from the densely tracked 3D Gaussian reconstructions. By learning the neural dynamics model on offline robot interaction data, our method can predict object motions under varying initial configurations and unseen robot actions. The 3D transformations of Gaussians can be interpolated from the motions of control particles, enabling the rendering of predicted future object states and achieving action-conditioned video prediction. The dynamics model can also be applied to model-based planning frameworks for object manipulation tasks. We conduct experiments on various kinds of deformable materials, including ropes, clothes, and stuffed animals, demonstrating our framework’s ability to model complex shapes and dynamics. Our project page is available at https://gaussian-gbnd.github.io/.
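
Two pieces of the pipeline above are easy to sketch: downsampling dense Gaussian centers to sparse control particles, and propagating the predicted control-particle motion back to every Gaussian. The NumPy sketch below uses farthest point sampling and inverse-distance weights as stand-ins (translations only); the learned graph neural network that actually predicts the control-particle motion is omitted, and all names are placeholders.

```python
import numpy as np

def farthest_point_sample(points, m):
    """Downsample dense Gaussian centers to m sparse control particles."""
    idx = [0]
    dists = np.linalg.norm(points - points[0], axis=1)
    for _ in range(m - 1):
        idx.append(int(dists.argmax()))
        dists = np.minimum(dists, np.linalg.norm(points - points[idx[-1]], axis=1))
    return np.array(idx)

def interpolate_motion(gaussians, ctrl_before, ctrl_after, k=4, eps=1e-8):
    """Propagate control-particle displacements to every Gaussian center using
    inverse-distance weights over the k nearest control particles. A stand-in for
    the interpolation step (translations only; the paper interpolates full 3D
    transformations, and the GNN that predicts ctrl_after is not shown)."""
    disp = ctrl_after - ctrl_before
    moved = np.empty_like(gaussians)
    for i, g in enumerate(gaussians):
        d = np.linalg.norm(ctrl_before - g, axis=1)
        nn = np.argsort(d)[:k]
        w = 1.0 / (d[nn] + eps)
        moved[i] = g + (w[:, None] * disp[nn]).sum(0) / w.sum()
    return moved

# Toy usage: 500 Gaussian centers, 20 control particles, a rigid shift as the "prediction".
pts = np.random.rand(500, 3)
ctrl = pts[farthest_point_sample(pts, 20)]
pred = ctrl + np.array([0.05, 0.0, 0.0])     # pretend the dynamics model predicted this
print(np.allclose(interpolate_motion(pts, ctrl, pred) - pts, [0.05, 0.0, 0.0]))  # True
```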


ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation
Wenlong Huang Stanford University, Chen Wang Stanford University, Yunzhu Li Columbia University, Ruohan Zhang Stanford University, Li Fei-Fei Stanford University

Abstract:
Representing robotic manipulation tasks as constraints that associate the robot and the environment is a promising way to encode desired robot behaviors. However, it remains unclear how to formulate the constraints such that they are 1) versatile to diverse tasks, 2) free of manual labeling, and 3) optimizable by off-the-shelf solvers to produce robot actions in real-time. In this work, we introduce Relational Keypoint Constraints (ReKep), a visually-grounded representation for constraints in robotic manipulation. Specifically, ReKep is expressed as Python functions mapping a set of 3D keypoints in the environment to a numerical cost. We demonstrate that by representing a manipulation task as a sequence of Relational Keypoint Constraints, we can employ a hierarchical optimization procedure to solve for robot actions (represented by a sequence of end-effector poses in SE(3)) with a perception-action loop at a real-time frequency. Furthermore, in order to circumvent the need for manual specification of ReKep for each new task, we devise an automated procedure that leverages large vision models and vision-language models to produce ReKep from free-form language instructions and RGB-D observations. We present system implementations on a mobile single-arm platform and a stationary dual-arm platform that can perform a large variety of manipulation tasks, featuring multi-stage, in-the-wild, bimanual, and reactive behaviors, all without task-specific data or environment models.
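
Since the abstract describes ReKep constraints as Python functions from 3D keypoints to a numerical cost, a concrete example helps fix ideas. The constraint below is illustrative only (the keypoint indices, the pouring scenario, and the 5 cm offset are assumptions, not from the paper): it returns zero when a "spout" keypoint sits directly above a "container opening" keypoint.

```python
import numpy as np

def pour_alignment_constraint(keypoints):
    """Illustrative ReKep-style constraint: keypoints is an (N, 3) array of
    tracked 3D keypoints; suppose index 0 is a container opening and index 1 is
    a spout. The cost is zero when the spout sits 5 cm above the opening and
    grows with any deviation."""
    opening, spout = keypoints[0], keypoints[1]
    target = opening + np.array([0.0, 0.0, 0.05])
    return float(np.linalg.norm(spout - target))

# A solver can then minimize the summed constraint costs over end-effector poses;
# here we simply evaluate the constraint on a toy keypoint set.
kps = np.array([[0.4, 0.0, 0.20],    # container opening
                [0.4, 0.0, 0.25]])   # spout, exactly 5 cm above -> cost 0
print(pour_alignment_constraint(kps))  # 0.0
```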


GenDP: 3D Semantic Fields for Category-Level Generalizable Diffusion Policy
Yixuan Wang Columbia University, Guang Yin University of Illinois, Urbana-Champaign, Binghao Huang Columbia University, Tarik Kelestemur Boston Dynamics AI Institute, Jiuguang Wang Boston Dynamics AI Institute, Yunzhu Li Columbia University

Abstract:
Diffusion-based policies have shown remarkable capability in executing complex robotic manipulation tasks but lack explicit characterization of geometry and semantics, which often limits their ability to generalize to unseen objects and layouts. To enhance the generalization capabilities of Diffusion Policy, we introduce a novel framework that incorporates explicit spatial and semantic information via 3D semantic fields. We generate 3D descriptor fields from multi-view RGBD observations with large vision foundation models, then compare these descriptor fields against reference descriptors to obtain semantic fields. The proposed method explicitly considers geometry and semantics, enabling strong generalization in tasks that require category-level generalization, resolution of geometric ambiguities, and attention to subtle geometric details. We evaluate our method across eight tasks involving articulated objects and instances with varying shapes and textures from multiple object categories. Our method demonstrates its effectiveness by increasing Diffusion Policy’s average success rate on unseen instances from 20% to 93%. Additionally, we provide a detailed analysis and visualization to interpret the sources of performance gain and explain how our method can generalize to novel instances. Project page: https://robopil.github.io/GenDP/
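
The comparison step described above, turning per-point descriptors into semantic channels by matching them against a handful of reference descriptors, can be sketched as a cosine-similarity lookup. This is a minimal stand-in, not the GenDP implementation; the function name, dimensions, and random inputs are placeholders.

```python
import numpy as np

def semantic_field(descriptor_field, reference_descriptors):
    """Turn a per-point descriptor field into per-point semantic channels by
    cosine similarity against a few reference descriptors (one per semantic
    part); higher values mean the point looks more like that part."""
    d = descriptor_field / np.linalg.norm(descriptor_field, axis=1, keepdims=True)
    r = reference_descriptors / np.linalg.norm(reference_descriptors, axis=1, keepdims=True)
    return d @ r.T                    # shape (num_points, num_reference_parts)

# Toy usage: 1000 points with 64-dim descriptors and 3 reference part descriptors.
field = np.random.randn(1000, 64)
refs = np.random.randn(3, 64)
sem = semantic_field(field, refs)
print(sem.shape)                      # (1000, 3) -- channels that can condition the policy
```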


Dreamitate: Real-World Visuomotor Policy Learning via Video Generation
Junbang Liang Columbia University, Ruoshi Liu Columbia University, Ege Ozguroglu Columbia University, Sruthi Sudhakar Columbia University, Achal Dave Toyota Research Institute, Pavel Tokmakov Toyota Research Institute, Shuran Song Stanford University, Carl Vondrick Columbia University

Abstract:
A key challenge in manipulation is learning a policy that can robustly generalize to diverse visual environments. A promising mechanism for learning robust policies is to leverage video generative models, which are pretrained on large-scale datasets of internet videos. In this paper, we propose a visuomotor policy learning framework that fine-tunes a video diffusion model on human demonstrations of a given task. At test time, we generate an example of an execution of the task conditioned on images of a novel scene, and use this synthesized execution directly to control the robot. Our key insight is that using common tools allows us to effortlessly bridge the embodiment gap between the human hand and the robot manipulator. We evaluate our approach on 4 tasks of increasing complexity and demonstrate that capitalizing on internet-scale generative models allows the learned policy to achieve a significantly higher degree of generalization than existing behavior cloning approaches.


Differentiable Robot Rendering
Ruoshi Liu Columbia University, Alper Canberk Columbia University, Shuran Song Stanford University, Carl Vondrick Columbia University

Abstract:
Vision foundation models trained on massive amounts of visual data have shown unprecedented reasoning and planning skills in open-world settings. A key challenge in applying them to robotic tasks is the modality gap between visual data and action data. We introduce differentiable robot rendering, a method allowing the visual appearance of a robot body to be directly differentiable with respect to its control parameters. Our model integrates a kinematics-aware deformable model and Gaussian Splatting and is compatible with arbitrary robot form factors and degrees of freedom. We demonstrate its capability and usage in applications including reconstructing robot poses from images and controlling robots through vision-language models. Quantitative and qualitative results show that our differentiable rendering model provides effective gradients for robotic control directly from pixels, setting the foundation for future applications of vision foundation models in robotics.
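
As a toy picture of what "gradients for robotic control directly from pixels" means, the sketch below renders a 2-link planar arm as Gaussian blobs on an image and then recovers the joint angles behind a target image by gradient descent on the pixel loss. The renderer, the finite-difference gradients, and all parameters are simplifications standing in for the paper's kinematics-aware Gaussian Splatting model and automatic differentiation.

```python
import numpy as np

def render(joints, res=32):
    """Toy 'renderer': a 2-link planar arm drawn as two Gaussian blobs on an
    image grid, used only to illustrate pixel-to-joint gradients."""
    t1, t2 = joints
    p1 = 0.5 * np.array([np.cos(t1), np.sin(t1)])                 # elbow position
    p2 = p1 + 0.5 * np.array([np.cos(t1 + t2), np.sin(t1 + t2)])  # end-effector position
    ys, xs = np.mgrid[0:res, 0:res] / (res - 1) * 2 - 1           # image plane in [-1, 1]
    img = np.zeros((res, res))
    for c in (p1, p2):                                            # splat one blob per joint
        img += np.exp(-((xs - c[0]) ** 2 + (ys - c[1]) ** 2) / 0.08)
    return img

def loss_and_grad(joints, target, eps=1e-4):
    """Pixel-space L2 loss and its gradient w.r.t. the joint angles, via central
    finite differences here (the actual model obtains it through autodiff)."""
    loss = lambda q: float(np.sum((render(q) - target) ** 2))
    grad = np.array([(loss(joints + eps * e) - loss(joints - eps * e)) / (2 * eps)
                     for e in np.eye(2)])
    return loss(joints), grad

# Recover a target arm pose from pixels alone by gradient descent on the joints.
target = render(np.array([0.9, -0.4]))
q = np.array([0.5, 0.0])
for _ in range(800):
    _, g = loss_and_grad(q, target)
    q = q - 5e-4 * g
print(np.round(q, 2))   # converges toward the target joints [0.9, -0.4]
```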