---
license: apache-2.0
language:
- en
pipeline_tag: image-text-to-text
tags:
- multimodal
library_name: transformers
---

# Qwen2-VL-7B-Instruct-abliterated

## Introduction

Abliterated version of [Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct), an advanced multimodal large language model. Weight orthogonalization has been applied to inhibit the model's ability to express refusals while preserving its text and multimodal capabilities. Nonetheless, the model may still refuse your request, misunderstand your intent, or provide unsolicited advice regarding ethics or safety.

## Requirements

If you encounter errors such as `KeyError: 'qwen2_vl'` or `ImportError: cannot import name 'Qwen2VLForConditionalGeneration' from 'transformers'`, try building transformers from source with `pip install git+https://github.com/huggingface/transformers`.

## Quickstart

```python
from PIL import Image
import requests
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

path = "natong19/Qwen2-VL-7B-Instruct-abliterated"

# Load the model in half-precision on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    path, torch_dtype="auto", device_map="auto"
)

# Bound the resized image area, which controls the number of visual tokens per image
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(path, min_pixels=min_pixels, max_pixels=max_pixels)

# Image
url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preprocess the inputs
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
# Expected output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n<|im_start|>assistant\n'

inputs = processor(
    text=[text_prompt], images=[image], padding=True, return_tensors="pt"
)
inputs = inputs.to("cuda")

# Inference: generate the output, then strip the prompt tokens before decoding
output_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, output_ids)
]
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
)
print(output_text)
```

The above code runs within 24 GB of VRAM. For more usage examples, such as multi-image inference, video inference and batch inference, please refer to the [Qwen2-VL-7B-Instruct repo](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct); a short multi-image sketch is also included at the end of this card.

## Evaluation

Evaluation frameworks: lm-evaluation-harness 0.4.2 and lmms-eval 0.2.1

| Datasets | Qwen2-VL-7B-Instruct | Qwen2-VL-7B-Instruct-abliterated |
| :--- | :---: | :---: |
| _**Text benchmarks**_ | | |
| ARC (25-shot) | 57.8 | 57.8 |
| MMLU (5-shot) | 69.7 | 68.4 |
| TruthfulQA (0-shot) | 49.5 | 45.4 |
| Winogrande (5-shot) | 72.6 | 72.8 |
| _**Multimodal benchmarks**_ | | |
| AI2D (lite) | 78.8 | 79.8 |
| GQA (lite) | 73.2 | 73.6 |
| MMBench (EN dev, lite) | 84.1 | 82.6 |
| MMMU (val) | 50.8 | 51.6 |
| OCRBench | 77.7 | 78.1 |
| VQAv2 (val, lite) | 79.9 | 79.8 |
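
## Multi-image inference (sketch)

The upstream [Qwen2-VL-7B-Instruct repo](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct) documents multi-image, video and batch inference in full; the sketch below only adapts the single-image Quickstart pattern above to two images. The demo image is loaded twice as a stand-in for two distinct inputs and the prompt text is an arbitrary example, so substitute your own images and question.

```python
from PIL import Image
import requests
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

path = "natong19/Qwen2-VL-7B-Instruct-abliterated"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    path, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(path)

# Placeholder inputs: the same demo image is loaded twice; replace with your own images
url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
images = [Image.open(requests.get(url, stream=True).raw) for _ in range(2)]

# One "image" content entry per image; the processor pairs them with the `images` list in order
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "image"},
            {"type": "text", "text": "What do these two images have in common?"},
        ],
    }
]

# Preprocess both images together with the prompt
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(
    text=[text_prompt], images=images, padding=True, return_tensors="pt"
)
inputs = inputs.to("cuda")

# Generate, strip the prompt tokens, and decode
output_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, output_ids)
]
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```

Video and batch inference follow the same overall pattern; rely on the upstream repo for those exact call signatures rather than on this sketch.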