|
# DnD-Transformer: ✨ A Spark of Vision-Language Intelligence
|
|
|
|
|
|
|
<p align="center"> |
|
🤗 <a href="https://huggingface.co/leonardPKU/DnD-Transformer">Model</a>   | 🤗 <a href=""> Dataset (Coming Soon)</a>  |   📃 <a href="https://arxiv.org/abs/2410.01912">Paper</a> |   💻<a href="https://github.com/chenllliang/DnD-Transformer"> Github</a>
|
</p> |
|
|
|
|
|
|
|
|
## Updates
|
|
|
- 2024-10-8: Release models and inference code |
|
- 2024-10-4: Release paper |
|
|
|
<br> |
|
|
|
|
|
## What's New?
|
|
|
1. A better AR image generation paradigm and transformer architecture based on 2D autoregression, generating higher-quality images without increasing the computation budget.
|
|
|
2. A spark of vision-language intelligence: for the first time, unconditional rich-text image generation that outperforms diffusion models such as DDPM and Stable Diffusion on dedicated rich-text image datasets, highlighting the distinct advantage of autoregressive models for multimodal modeling.
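The 2D autoregression idea in point 1 can be sketched as follows: instead of emitting one code per step, the model emits a small stack of depth codes at each spatial position, so a deeper code grid costs no extra autoregressive steps. Everything below (`predict_depth_codes`, the toy sizes) is a hypothetical stand-in for illustration, not this repo's API:

```python
import numpy as np

rng = np.random.default_rng(0)

H, W, D = 4, 4, 2   # toy code grid: 4x4 spatial positions, 2 depth codes each
VOCAB = 16          # toy codebook size

def predict_depth_codes(prefix, rng):
    """Stand-in for the transformer's D prediction heads: given the codes
    generated so far, emit D codes for the next spatial position at once.
    (The real model conditions on the prefix; this stub just samples.)"""
    return rng.integers(0, VOCAB, size=D)

codes = []                   # one depth-code vector per spatial position
for step in range(H * W):    # one AR step per position, NOT one per code
    codes.append(predict_depth_codes(codes, rng))

grid = np.stack(codes).reshape(H, W, D)
print(grid.shape)            # (4, 4, 2): H*W steps produced H*W*D codes
```

The point of the sketch is the loop bound: decoding takes `H*W` steps regardless of `D`, which is where the "no extra computation budget" claim comes from.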
|
|
|
|
|
|
|
|
|
|
## Models |
|
|
|
### DnD-Tokenizers (VQ) |
|
|
|
|
|
|
|
*Text-Image* |
|
|
|
| Code Size | Link | |
|
|:---:|:---:| |
|
| 24x24x1 | [🤗](https://huggingface.co/leonardPKU/DnD-Transformer/tree/main/2d_tokenizer_text_image) |
|
|
|
*ImageNet* |
|
|
|
| Code Size | Link | rFID | |
|
|:---:|:---:|:---:| |
|
| 16x16x2 | [🤗](https://huggingface.co/leonardPKU/DnD-Transformer/tree/main/2d_tokenzier_imagenet) | 0.92 |
|
|
|
*arXiv-Image* |
|
|
|
coming soon~ |
|
|
|
### DnD-Transformers (GPT) |
|
|
|
|
|
|
|
*Text-Image* |
|
|
|
| Code Shape | Model Size | Link | |
|
|:---:|:---:|:---:| |
|
| 24x24x1 | XXL | [🤗](https://huggingface.co/leonardPKU/DnD-Transformer/tree/main/trained_dnd_transformer_text_image_1layer/XXL) |
|
|
|
|
|
*ImageNet* |
|
|
|
| Code Shape | Model Size | Link | gFID | |
|
|:---:|:---:|:---:|:---:| |
|
| 16x16x2 | XXL | [🤗](https://huggingface.co/leonardPKU/DnD-Transformer/tree/main/trained_dnd_transformer_imagenet_2layer/XXL) | 2.58 (cfg=2) |
|
| 16x16x2 | XXXL | [🤗](https://huggingface.co/leonardPKU/DnD-Transformer/tree/main/trained_dnd_transformer_imagenet_2layer/XXXL) | 2.21 (cfg=1.7) |
|
|
|
|
|
*arXiv-Image* |
|
|
|
coming soon~ |
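As a reading aid for the Code Size / Code Shape columns above: a shape like 16x16x2 is assumed here to denote a 16x16 spatial grid with 2 depth codes per position, so the transformer takes `H*W` autoregressive steps while the tokenizer reconstructs from `H*W*D` codes:

```python
# Code shape "16x16x2": 16x16 spatial grid, depth 2 (assumed reading).
H, W, D = 16, 16, 2
positions = H * W            # autoregressive steps the transformer takes
total_codes = H * W * D      # codes the tokenizer decodes the image from
print(positions, total_codes)  # 256 512
```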
|
|
|
|
|
## Setup |
|
|
|
```bash |
|
conda create -n DnD python=3.10 |
|
conda activate DnD |
|
pip install -r requirements.txt |
|
``` |
|
|
|
|
|
## Inference |
|
|
|
*Sampling Text-Image Examples* |
|
```bash |
|
cd ./src |
|
bash ./scripts/sampling_dnd_transformer_text_image.sh # first edit the paths to the VQ tokenizer and DnD-Transformer checkpoints in the script
|
``` |
|
|
|
*Sampling ImageNet Examples* |
|
```bash |
|
cd ./src |
|
bash ./scripts/sampling_dnd_transformer_imagenet.sh # first edit the paths to the VQ tokenizer and DnD-Transformer checkpoints in the script

# An .npz file is saved after generating 50k images; follow
# https://github.com/openai/guided-diffusion/tree/main/evaluations to compute the FID.
|
``` |
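For reference, the guided-diffusion evaluator linked above reads the first array in the sample archive and expects a `uint8` batch of shape `[N, H, W, 3]`. A minimal sketch of producing such a file (the tiny batch here is a placeholder for the real 50k samples):

```python
import numpy as np

# Placeholder for generated samples: uint8 HWC images.
# In practice this would be (50000, 256, 256, 3).
samples = np.zeros((4, 256, 256, 3), dtype=np.uint8)

# np.savez stores an unnamed array under the key "arr_0",
# which is what the evaluator loads.
np.savez("samples.npz", samples)

arr = np.load("samples.npz")["arr_0"]
print(arr.shape, arr.dtype)   # (4, 256, 256, 3) uint8
```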
|
|
|
|
|
|
|
|
|
|
|
|
|
## Training |
|
|
|
Training code and dataset are coming soon!
|
|
|
|
|
|
|
## Reference |
|
|
|
```bibtex
|
@misc{chen2024sparkvisionlanguageintelligence2dimensional, |
|
title={A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegrained Image Generation}, |
|
author={Liang Chen and Sinan Tan and Zefan Cai and Weichu Xie and Haozhe Zhao and Yichi Zhang and Junyang Lin and Jinze Bai and Tianyu Liu and Baobao Chang}, |
|
year={2024}, |
|
eprint={2410.01912}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.CV}, |
|
url={https://arxiv.org/abs/2410.01912}, |
|
} |
|
``` |
|
|
|
|
|
|