---
license: mit
pipeline_tag: robotics
---
# Octo Small

See https://github.com/octo-models/octo for instructions on using this model.

Octo Small is trained with a window size of 2, predicting 7-dimensional actions 4 steps into the future using a diffusion policy. The model is a Transformer with 27M parameters (equivalent to a ViT-S). Images are preprocessed by a lightweight convolutional encoder and then grouped into 16x16 patches. Language instructions are tokenized with the T5 tokenizer and embedded with the T5-Base language encoder.
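
For example, loading the checkpoint through the `octo` package might look like this (a minimal sketch following the repository README; the `hf://rail-berkeley/octo-small` checkpoint path is an assumption and may differ for your release):

```python
# Minimal loading sketch, following the octo repository README.
# Assumes the octo package and its JAX dependencies are installed;
# the checkpoint path "hf://rail-berkeley/octo-small" is an assumption.
from octo.model.octo_model import OctoModel

model = OctoModel.load_pretrained("hf://rail-berkeley/octo-small")
```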

Observations and tasks conform to the following spec:

Observations: 

```
{
    image_primary: ('batch', 'history_window', 256, 256, 3),
    image_wrist: ('batch', 'history_window', 128, 128, 3),
}
```

Tasks: 
```
{
    image_primary: ('batch', 256, 256, 3),
    image_wrist: ('batch', 128, 128, 3),
    language_instruction: {
        attention_mask: ('batch', 16),
        input_ids: ('batch', 16),
    },
}
```
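
As a concrete illustration of this spec, the sketch below builds dummy observation and task dictionaries with NumPy (the array contents are placeholders; only the key names and shapes are taken from the spec above):

```python
import numpy as np

batch, window = 1, 2  # history window of 2 timesteps

# Observation dict: per-timestep camera images over the history window.
observation = {
    "image_primary": np.zeros((batch, window, 256, 256, 3), dtype=np.uint8),
    "image_wrist": np.zeros((batch, window, 128, 128, 3), dtype=np.uint8),
}

# Task dict: goal images and/or a tokenized language instruction.
task = {
    "image_primary": np.zeros((batch, 256, 256, 3), dtype=np.uint8),
    "image_wrist": np.zeros((batch, 128, 128, 3), dtype=np.uint8),
    "language_instruction": {
        "attention_mask": np.ones((batch, 16), dtype=np.int32),
        "input_ids": np.zeros((batch, 16), dtype=np.int32),
    },
}
```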

At inference, you may pass in any subset of these observation and task keys, with a history window of up to 2 timesteps.
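
Putting this together, an inference call might look like the sketch below. `create_tasks` and `sample_actions` are the entry points documented in the repository README; the padding-mask key name (`timestep_pad_mask` here) has varied across releases, so treat these details as assumptions and check the repo for your version:

```python
import jax
import numpy as np
from octo.model.octo_model import OctoModel

model = OctoModel.load_pretrained("hf://rail-berkeley/octo-small")  # path assumed

# Only a subset of observation keys is required; here, just the primary
# camera with the full 2-step history window.
observation = {
    "image_primary": np.zeros((1, 2, 256, 256, 3), dtype=np.uint8),
    "timestep_pad_mask": np.array([[True, True]]),  # both history steps are valid
}

# A language-conditioned task; goal images could be passed instead.
task = model.create_tasks(texts=["pick up the spoon"])

# Diffusion policy sampling: returns (batch, 4, 7) actions,
# i.e. 4 future steps of 7-dimensional actions.
actions = model.sample_actions(observation, task, rng=jax.random.PRNGKey(0))
```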


This model was trained on a mixture of datasets from the Open X-Embodiment collection, sampled in the proportions below.

| Dataset                                                  | Proportion of batch |
|----------------------------------------------------------|---------------------|
| Fractal (Brohan et al., 2022)                            | 17.0%               |
| Kuka (Kalashnikov et al., 2018)                          | 17.0%               |
| Bridge (Walke et al., 2023)                              | 17.0%               |
| BC-Z (Jang et al., 2022)                                 | 9.1%                |
| Stanford Hydra Dataset (Belkhale et al., 2023)           | 6.0%                |
| Language Table (Lynch et al., 2023)                      | 5.9%                |
| Taco Play (Rosete-Beas et al., 2022; Mees et al., 2023)  | 3.6%                |
| Furniture Bench Dataset (Heo et al., 2023)               | 3.3%                |
| UTAustin Mutex (Shah et al., 2023)                       | 3.0%                |
| Austin Sailor Dataset (Nasiriany et al., 2022)           | 2.9%                |
| Roboturk (Mandlekar et al., 2018)                        | 2.8%                |
| Toto (Zhou et al., 2023)                                 | 2.4%                |
| Austin Sirius Dataset (Liu et al., 2023)                 | 2.3%                |
| Berkeley Autolab UR5 (Chen et al.)                       | 1.5%                |
| IAMLab CMU Pickup Insert (Saxena et al., 2023)           | 1.2%                |
| Viola (Zhu et al., 2023)                                 | 1.2%                |
| Berkeley Fanuc Manipulation (Zhu et al., 2023)           | 1.0%                |
| NYU Franka Play Dataset (Cui et al., 2022)               | 0.9%                |
| Jaco Play (Dass et al., 2023)                            | 0.6%                |
| Berkeley Cable Routing (Luo et al., 2023)                | 0.3%                |
| Austin Buds Dataset (Zhu et al., 2022)                   | 0.3%                |
| CMU Stretch (Mendonca et al., 2023)                      | 0.2%                |
| NYU Door Opening (Pari et al., 2021)                     | 0.1%                |
| DLR EDAN Shared Control (Quere et al., 2020)             | 0.1%                |
| UCSD Kitchen Dataset (Ge Yan and Wang, 2023)             | <0.1%               |