# DnD-Transformer: ✨ A Spark of Vision-Language Intelligence
<p align="center">
🤗 <a href="https://huggingface.co/leonardPKU/DnD-Transformer">Model</a>   | 🤗 <a href=""> Dataset (Coming Soon)</a>  |   📑 <a href="https://arxiv.org/abs/2410.01912">Paper</a> |   💻<a href="https://github.com/chenllliang/DnD-Transformer"> Github</a>
</p>
## Updates 🎈
- 2024-10-8: Release models and inference code
- 2024-10-4: Release paper
<br>
## What's New?
1. A better autoregressive (AR) image generation paradigm and transformer architecture based on 2D autoregression: it generates images of higher quality without increasing the computation budget.
2. A first spark of vision-language intelligence: unconditional rich-text image generation that outperforms diffusion models such as DDPM and Stable Diffusion on dedicated rich-text image datasets, highlighting the distinct advantage of autoregressive models for multimodal modeling.
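To make the 2D (depth-wise) autoregression concrete, here is a minimal NumPy sketch. It is purely illustrative and is not the released implementation; all names, sizes, and the use of independent linear heads are assumptions. The idea it shows: each spatial position's hidden state is decoded into several code depths at once, so a 16x16x2 code shape still costs only 16x16 sequence positions.

```python
import numpy as np

# Illustrative sketch only (not the released model): decode DEPTH codes per
# spatial position from one shared hidden state, keeping sequence length H*W.
H, W, DEPTH = 16, 16, 2        # matches the 16x16x2 code shape used below
DIM, VOCAB = 64, 1024          # hypothetical hidden size and codebook size

rng = np.random.default_rng(0)
hidden = rng.standard_normal((H * W, DIM))                 # one state per position
heads = [rng.standard_normal((DIM, VOCAB)) for _ in range(DEPTH)]

# each head predicts one depth of the code stack for every spatial position
logits = np.stack([hidden @ W_d for W_d in heads], axis=1)  # (H*W, DEPTH, VOCAB)
codes = logits.argmax(axis=-1)                              # (H*W, DEPTH) token ids
```

The payoff is that the sequence length seen by the transformer stays `H*W` rather than growing to `H*W*DEPTH`, which is how deeper code stacks can be predicted without increasing the computation budget.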
## Models
### DnD-Tokenizers (VQ)
*Text-Image*
| Code Size | Link |
|:---:|:---:|
| 24x24x1 | [🤗](https://huggingface.co/leonardPKU/DnD-Transformer/tree/main/2d_tokenizer_text_image) |
*ImageNet*
| Code Size | Link | rFID |
|:---:|:---:|:---:|
| 16x16x2 | [🤗](https://huggingface.co/leonardPKU/DnD-Transformer/tree/main/2d_tokenzier_imagenet) | 0.92 |
*arXiv-Image*
Coming soon~
### DnD-Transformers (GPT)
*Text-Image*
| Code Shape | Model Size | Link |
|:---:|:---:|:---:|
| 24x24x1 | XXL | [🤗](https://huggingface.co/leonardPKU/DnD-Transformer/tree/main/trained_dnd_transformer_text_image_1layer/XXL) |
*ImageNet*
| Code Shape | Model Size | Link | gFID |
|:---:|:---:|:---:|:---:|
| 16x16x2 | XXL | [🤗](https://huggingface.co/leonardPKU/DnD-Transformer/tree/main/trained_dnd_transformer_imagenet_2layer/XXL) | 2.58 (cfg=2) |
| 16x16x2 | XXXL | [🤗](https://huggingface.co/leonardPKU/DnD-Transformer/tree/main/trained_dnd_transformer_imagenet_2layer/XXXL) | 2.21 (cfg=1.7) |
*arXiv-Image*
Coming soon~
## Setup
```bash
conda create -n DnD python=3.10
conda activate DnD
pip install -r requirements.txt
```
## Inference
*Sampling Text-Image Examples*
```bash
cd ./src
bash ./scripts/sampling_dnd_transformer_text_image.sh # edit the paths to the VQ tokenizer checkpoint and the DnD-Transformer checkpoint in the script first
```
*Sampling ImageNet Examples*
```bash
cd ./src
bash ./scripts/sampling_dnd_transformer_imagenet.sh # edit the paths to the VQ tokenizer checkpoint and the DnD-Transformer checkpoint in the script first
# An .npz file is saved after generating 50k images; follow https://github.com/openai/guided-diffusion/tree/main/evaluations to compute the FID of the generated samples.
```
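Before running the FID evaluation, it can help to sanity-check the saved sample batch. A minimal sketch, assuming the script saves an unnamed array via `np.savez` (so the key is NumPy's default `arr_0`) in uint8 HWC format; the actual key, shape, and filename in the released script may differ, and the toy batch here just stands in for the real 50k samples:

```python
import numpy as np

# Toy stand-in for the saved sample batch (4 images instead of 50k).
# "arr_0" is np.savez's default key for unnamed arrays -- an assumption
# about how the sampling script saves its output.
np.savez("samples.npz", np.zeros((4, 256, 256, 3), dtype=np.uint8))

batch = np.load("samples.npz")["arr_0"]
print(batch.shape, batch.dtype)  # expect (N, 256, 256, 3) uint8 for 256x256 RGB
```

If the shape or dtype looks wrong at this point, the FID evaluator will likely reject or mis-score the batch, so it is cheaper to catch the mismatch here.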
## Training
Training code and dataset are coming soon!
## Reference
```bib
@misc{chen2024sparkvisionlanguageintelligence2dimensional,
title={A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegrained Image Generation},
author={Liang Chen and Sinan Tan and Zefan Cai and Weichu Xie and Haozhe Zhao and Yichi Zhang and Junyang Lin and Jinze Bai and Tianyu Liu and Baobao Chang},
year={2024},
eprint={2410.01912},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2410.01912},
}
```