
LLaRA: Supercharging Robot Learning Data for Vision-Language Policy

Published on Jun 28 · Submitted by variante on Jul 1

Abstract

Large Language Models (LLMs) equipped with extensive world knowledge and strong reasoning skills can tackle diverse tasks across domains, often by posing them as conversation-style instruction-response pairs. In this paper, we propose LLaRA: Large Language and Robotics Assistant, a framework that formulates robot action policy as conversations and provides improved responses when trained with auxiliary data that complements policy learning. LLMs with visual inputs, i.e., Vision Language Models (VLMs), have the capacity to process state information as visual-textual prompts and generate optimal policy decisions in text. To train such action policy VLMs, we first introduce an automated pipeline to generate diverse high-quality robotics instruction data from existing behavior cloning data. A VLM finetuned on the resulting collection of datasets, which follows a conversation-style formulation tailored for robotics tasks, can generate meaningful robot action policy decisions. Our experiments across multiple simulated and real-world environments demonstrate the state-of-the-art performance of the proposed LLaRA framework. The code, datasets, and pretrained models are available at https://github.com/LostXine/LLaRA.

Community


Introduction

In this paper, we propose LLaRA, a framework that turns robot expert trajectories into conversation-style data and other auxiliary data for instruction tuning. We then finetune a pretrained vision-language model on this data, turning it into a strong robot manipulation policy. So how did we do that?

Visuomotor Instruction Tuning

First, we transform a typical behavior cloning dataset into an instruction-tuning dataset and finetune a VLM (LLaVA) on it. The resulting LLaRA framework benefits from the broad world knowledge already embedded in the VLM, enabling better visuomotor task learning.
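Conceptually, each step of an expert trajectory becomes a single-turn visual conversation: the image and language instruction form the prompt, and the action, serialized as text, forms the response. The sketch below illustrates this conversion under our own assumptions; the prompt template, JSON field names, and the encoding of a pick-and-place action as normalized image coordinates are illustrative, not the exact format used in LLaRA.

```python
# Minimal sketch: turn one behavior-cloning step into a LLaVA-style
# conversation sample. Field names, prompt wording, and the action
# encoding are illustrative assumptions, not LLaRA's exact format.
import json

def action_to_text(action):
    """Serialize a pick-and-place action as text, here as normalized
    2D image coordinates rounded to two decimals (an assumption)."""
    (px, py), (qx, qy) = action["pick"], action["place"]
    return f"Pick at ({px:.2f}, {py:.2f}) and place at ({qx:.2f}, {qy:.2f})."

def bc_step_to_conversation(step, sample_id):
    """Convert one (image, instruction, action) step into a single-turn
    visual instruction-tuning sample."""
    return {
        "id": sample_id,
        "image": step["image_path"],  # current observation
        "conversations": [
            {"from": "human",
             "value": f"<image>\n{step['instruction']} "
                      "What action should the robot take next?"},
            {"from": "gpt", "value": action_to_text(step["action"])},
        ],
    }

if __name__ == "__main__":
    demo_step = {
        "image_path": "episode_0001/frame_000.png",
        "instruction": "Put the red block into the green bowl.",
        "action": {"pick": (0.31, 0.62), "place": (0.74, 0.40)},
    }
    print(json.dumps(bc_step_to_conversation(demo_step, "demo-0"), indent=2))
```

Because the action is just text in the response, the same finetuning recipe used for visual instruction tuning applies unchanged; at inference time the predicted text is parsed back into an executable action.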

Supercharging Visuomotor Instruction Dataset

Then we create auxiliary robotics instruction-tuning datasets from the same source to enhance the VLM policy. The idea is that the auxiliary datasets drive the VLM to learn a better spatio-temporal understanding of the scene, which in turn benefits robot learning.
Note that these auxiliary datasets are constructed from the same robot expert trajectories in a self-supervised fashion: the process relies only on object detection and does not use any external data.
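As one illustration of how such auxiliary data can be generated, the sketch below builds a spatial-relation question-answer pair directly from detected bounding boxes, with the answer derived from box geometry rather than human labels. The specific task, wording, and helper names are our assumptions; LLaRA defines its own set of auxiliary tasks on top of the detection results.

```python
# Minimal sketch: build one auxiliary instruction-tuning sample from object
# detections on the same expert trajectory. The "spatial relation" task and
# its phrasing are illustrative, not LLaRA's exact task definitions.

def center(box):
    """Center (x, y) of a box given as (x_min, y_min, x_max, y_max)."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)

def spatial_relation_sample(image_path, detections, obj_a, obj_b):
    """Ask whether obj_a is left of or right of obj_b, with the answer
    derived automatically from the detected boxes (no human labels)."""
    ax, _ = center(detections[obj_a])
    bx, _ = center(detections[obj_b])
    answer = "left of" if ax < bx else "right of"
    return {
        "image": image_path,
        "conversations": [
            {"from": "human",
             "value": f"<image>\nIs the {obj_a} left of or right of the {obj_b}?"},
            {"from": "gpt",
             "value": f"The {obj_a} is {answer} the {obj_b}."},
        ],
    }

if __name__ == "__main__":
    dets = {"red block": (40, 80, 90, 130), "green bowl": (200, 60, 300, 160)}
    print(spatial_relation_sample("episode_0001/frame_000.png",
                                  dets, "red block", "green bowl"))
```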

Experiments (Real-world and simulated)

We extensively studied best practices for constructing auxiliary datasets and found that they can significantly enhance VLM policy performance, especially when the original data is limited. On VIMA-Bench, our method consistently outperforms the RT-2 style baseline.
We also ran multiple types of real-world robot experiments and found that our method, trained on only 8k simulated examples, performs strongly in unseen real-world settings. In addition, with minimal in-domain finetuning, the model achieves a 91.6% average success rate.

Conclusion

In conclusion, LLaRA turns an instruction-tuned vision-language model (VLM) into a robot policy using curated instruction-tuning datasets, and it shows great potential.

For more details:
Paper: http://arxiv.org/abs/2406.20095
Code: https://github.com/LostXine/LLaRA
