# General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model

[GitHub](https://github.com/Ucas-HaoranWei/GOT-OCR2.0/) | [Paper](https://arxiv.org/abs/2409.01704)

[Haoran Wei*](https://scholar.google.com/citations?user=J4naK0MAAAAJ&hl=en), Chenglong Liu*, Jinyue Chen, Jia Wang, Lingyu Kong, Yanming Xu, [Zheng Ge](https://joker316701882.github.io/), Liang Zhao, [Jianjian Sun](https://scholar.google.com/citations?user=MVZrGkYAAAAJ&hl=en), [Yuang Peng](https://scholar.google.com.hk/citations?user=J0ko04IAAAAJ&hl=zh-CN&oi=ao), Chunrui Han, [Xiangyu Zhang](https://scholar.google.com/citations?user=yuB-cfoAAAAJ&hl=en)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6653eee7a2d7a882a805ab95/dAOiYlj3Hdl4_G1KpahcB.png)

## Usage

Inference using Hugging Face transformers on NVIDIA GPUs. Requirements tested on Python 3.10:

```
torch==2.0.1
torchvision==0.15.2
transformers==4.37.2
megfile==3.1.2
```

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('ucaslcl/GOT-OCR2_0', trust_remote_code=True)
model = AutoModel.from_pretrained('ucaslcl/GOT-OCR2_0', trust_remote_code=True, low_cpu_mem_usage=True, device_map='cuda', use_safetensors=True, pad_token_id=tokenizer.eos_token_id)
model = model.eval().cuda()

# input your test image
image_file = 'xxx.jpg'

# plain texts OCR
res = model.chat(tokenizer, image_file, ocr_type='ocr')

# format texts OCR:
# res = model.chat(tokenizer, image_file, ocr_type='format')

# fine-grained OCR:
# res = model.chat(tokenizer, image_file, ocr_type='ocr', ocr_box='')
# res = model.chat(tokenizer, image_file, ocr_type='format', ocr_box='')
# res = model.chat(tokenizer, image_file, ocr_type='ocr', ocr_color='')
# res = model.chat(tokenizer, image_file, ocr_type='format', ocr_color='')

# multi-crop OCR:
# res = model.chat_crop(tokenizer, image_file=image_file)

# render the formatted OCR results:
# res = model.chat(tokenizer, image_file, ocr_type='format', ocr_box='', ocr_color='', render=True, save_render_file='./demo.html')

print(res)
```

More details about 'ocr_type', 'ocr_box', 'ocr_color', and 'render' can be found on our GitHub.

## More Multimodal Projects

👏 Welcome to explore more multimodal projects of our team:

[Vary](https://github.com/Ucas-HaoranWei/Vary) | [Fox](https://github.com/ucaslcl/Fox) | [OneChart](https://github.com/LingyvKong/OneChart)

## Citation

If you find our work helpful, please consider citing our papers 📝 and liking this project ❤️!

```bib
@article{wei2024general,
  title={General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model},
  author={Wei, Haoran and Liu, Chenglong and Chen, Jinyue and Wang, Jia and Kong, Lingyu and Xu, Yanming and Ge, Zheng and Zhao, Liang and Sun, Jianjian and Peng, Yuang and others},
  journal={arXiv preprint arXiv:2409.01704},
  year={2024}
}
@article{wei2023vary,
  title={Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models},
  author={Wei, Haoran and Kong, Lingyu and Chen, Jinyue and Zhao, Liang and Ge, Zheng and Yang, Jinrong and Sun, Jianjian and Han, Chunrui and Zhang, Xiangyu},
  journal={arXiv preprint arXiv:2312.06109},
  year={2023}
}
```
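
As a follow-up to the Usage section, here is a minimal sketch of running the same `model.chat` call over a folder of images. The helper name `ocr_folder`, its `patterns` argument, and the assumption that `model.chat` returns the recognized text are ours for illustration and are not part of the released API; only `model.chat(tokenizer, image_file, ocr_type=...)` comes from the usage snippet.

```python
from pathlib import Path


def ocr_folder(model, tokenizer, folder, ocr_type="ocr", patterns=("*.jpg", "*.png")):
    """Hypothetical helper: call model.chat on every matching image in `folder`.

    Returns a dict mapping each file name to whatever model.chat returns
    (assumed here to be the recognized text).
    """
    # Collect the matching files once, sorted so the order is deterministic.
    paths = sorted(p for pattern in patterns for p in Path(folder).glob(pattern))

    results = {}
    for path in paths:
        # Same call as in the Usage section, one image at a time.
        results[path.name] = model.chat(tokenizer, str(path), ocr_type=ocr_type)
    return results
```

With `model` and `tokenizer` loaded as in the Usage section, `ocr_folder(model, tokenizer, './scans', ocr_type='format')` would return one entry per `.jpg` or `.png` file in the (hypothetical) `./scans` directory.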