# Emu: An Open Multimodal Generalist

**Generative Pretraining in Multimodality**

[Quan Sun](https://github.com/Quan-Sun)1*, [Qiying Yu](https://yqy2001.github.io)2,1*, [Yufeng Cui]()1*, [Fan Zhang]()1*, [Xiaosong Zhang](https://github.com/zhangxiaosong18)1*, [Yueze Wang]()1, [Hongcheng Gao]()1, [Jingjing Liu](https://air.tsinghua.edu.cn/en/info/1046/1194.htm)2, [Tiejun Huang](https://scholar.google.com/citations?user=knvEK4AAAAAJ&hl=en)1,3, [Xinlong Wang](https://www.xloong.wang/)1

1 [BAAI](https://www.baai.ac.cn/english.html), 2 [THU](https://air.tsinghua.edu.cn), 3 [PKU](https://english.pku.edu.cn/)

\* Equal Contribution | [Paper](https://arxiv.org/abs/2307.05222) | [Demo(tmp)](http://218.91.113.230:9002/)
**Emu** is a Large Multimodal Model (LMM) trained with a unified autoregressive objective, *i.e.*, predict-the-next-element, where the next element can be either a visual embedding or a textual token. Trained under this objective, **Emu** can serve as a generalist interface for diverse multimodal tasks, such as image captioning, image/video question answering, and text-to-image generation, together with new abilities like in-context text and image generation, and image blending.

## Setup

Clone the GitHub repository and install the required packages:

```shell
git clone https://github.com/baaivision/Emu
cd Emu
pip install -r requirements.txt
```

## Model Weights

We release the pretrained and instruction-tuned weights of **Emu**. Our weights are subject to LLaMA's [license](https://github.com/facebookresearch/llama/blob/main/LICENSE). An example download command is sketched at the end of this README.

| Model name | Weight |
| ---------- | ------------------------------------------------------- |
| **Emu**    | [🤗 HF link](https://huggingface.co/BAAI/Emu/blob/main/Emu-pretrain.pt) (27GB) |
| **Emu-I**  | [🤗 HF link](https://huggingface.co/BAAI/Emu/blob/main/Emu-instruct.pt) (27GB) |

## Model Usage

At present, we provide inference code for image captioning and visual question answering (a concrete invocation is sketched after the Citation section):

```sh
python emu_inference.py --instruct --ckpt-path $Instruct_CKPT_PATH
```

## Citation

If you find Emu useful for your research and applications, please consider citing:

```
@article{Emu,
  title={Generative Pretraining in Multimodality},
  author={Sun, Quan and Yu, Qiying and Cui, Yufeng and Zhang, Fan and Zhang, Xiaosong and Wang, Yueze and Gao, Hongcheng and Liu, Jingjing and Huang, Tiejun and Wang, Xinlong},
  publisher={arXiv:2307.05222},
  year={2023},
}
```
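
## Example: Downloading and Running (Sketch)

The snippets below are a minimal, illustrative walkthrough of the sections above. The local file names are arbitrary, and the `resolve/main` URLs are simply the direct-download form of the `blob/main` links in the Model Weights table:

```shell
# Fetch the pretrained and instruction-tuned checkpoints (~27GB each)
# from the Hugging Face Hub; -O sets the local file name.
wget https://huggingface.co/BAAI/Emu/resolve/main/Emu-pretrain.pt -O Emu-pretrain.pt
wget https://huggingface.co/BAAI/Emu/resolve/main/Emu-instruct.pt -O Emu-instruct.pt
```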
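
With the checkpoints in place, the instruction-tuned model (**Emu-I**) can be run through the inference script from the Model Usage section; the path below is just the illustrative file name used in the download step and plays the role of `$Instruct_CKPT_PATH` above:

```sh
# Image captioning / visual question answering with the instruction-tuned checkpoint.
python emu_inference.py --instruct --ckpt-path ./Emu-instruct.pt
```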