---
license: apache-2.0
language:
- en
pipeline_tag: image-text-to-text
tags:
- multimodal
library_name: transformers
---

# Qwen2-VL-7B-Instruct-abliterated

## Introduction

Abliterated version of [Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct), an advanced multimodal large language model. Weight orthogonalization has been applied to inhibit the model's ability to express refusals while preserving the model's text and multimodal capabilities. Nonetheless, the model may still refuse your request, misunderstand your intent, or provide unsolicited advice regarding ethics or safety.
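The exact procedure used for this model is not documented here. As a rough illustration of what weight orthogonalization means in general, the toy sketch below removes the component of a weight matrix that writes along a given direction `r`, i.e. `W' = W - r (rᵀ W)`. The function name `orthogonalize`, the matrix values, and the direction are all hypothetical; real implementations operate on the model's tensors and on a direction estimated from activations.

```python
import math


def orthogonalize(weight, direction):
    """Return W' = W - r (r^T W) for the unit-normalized direction r.

    `weight` is a list of rows with shape (d_out, d_in); `direction` has
    length d_out, i.e. it lives in the space the layer writes to.
    """
    norm = math.sqrt(sum(x * x for x in direction))
    r = [x / norm for x in direction]
    n_rows, n_cols = len(weight), len(weight[0])
    # r^T W: how strongly each input column writes along r
    along_r = [
        sum(r[i] * weight[i][j] for i in range(n_rows)) for j in range(n_cols)
    ]
    # Subtract that component so the layer can no longer write along r
    return [
        [weight[i][j] - r[i] * along_r[j] for j in range(n_cols)]
        for i in range(n_rows)
    ]


# Toy example with made-up numbers: 3-dimensional outputs, 2-dimensional inputs
W = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
r = [1.0, 1.0, 0.0]
W_new = orthogonalize(W, r)

# Component of each column of W_new along r (numerically zero)
residual = [sum(r[i] * W_new[i][j] for i in range(3)) for j in range(2)]
print(residual)
```

Rows of `W` that are already orthogonal to `r` (the third row here) are left unchanged, which is why the edit can leave most of the model's behavior intact.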

## Requirements

If you encounter errors such as `KeyError: 'qwen2_vl'` or `ImportError: cannot import name 'Qwen2VLForConditionalGeneration' from 'transformers'`, try building transformers from source with the command `pip install git+https://github.com/huggingface/transformers`.
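One way to tell in advance whether the installed transformers build includes Qwen2-VL support is to attempt the import named in the second error message. The helper name `qwen2_vl_available` is made up for this sketch.

```python
def qwen2_vl_available() -> bool:
    """Return True if the installed transformers exposes the Qwen2-VL model class."""
    try:
        from transformers import Qwen2VLForConditionalGeneration  # noqa: F401
    except Exception:
        # ImportError when transformers is missing or too old; other exception
        # types can surface if a backend dependency fails while importing.
        return False
    return True


if not qwen2_vl_available():
    print("Qwen2-VL support not found; consider installing transformers from source.")
```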

## Quickstart

```python
from PIL import Image
import requests
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

path = "natong19/Qwen2-VL-7B-Instruct-abliterated"

# Load the model in half-precision on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    path, torch_dtype="auto", device_map="auto"
)

# Bounds on the number of pixels each image is resized to
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(path, min_pixels=min_pixels, max_pixels=max_pixels)

# Image
url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preprocess the inputs
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
# Expected output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n<|im_start|>assistant\n'

inputs = processor(
    text=[text_prompt], images=[image], padding=True, return_tensors="pt"
)
# Move the inputs to the model's device instead of hard-coding "cuda"
inputs = inputs.to(model.device)

# Inference: generation of the output
output_ids = model.generate(**inputs, max_new_tokens=128)
# Drop the prompt tokens so that only the newly generated tokens are decoded
generated_ids = [
    out_ids[len(in_ids):]
    for in_ids, out_ids in zip(inputs.input_ids, output_ids)
]
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
)
print(output_text)
```

The code above can be run with 24 GB of VRAM. For more usage examples, such as multi-image inference, video inference and batch inference, please refer to the [Qwen2-VL-7B-Instruct repo](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct).
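The `min_pixels` and `max_pixels` values in the Quickstart are written as multiples of `28*28`. A rough way to read them, assuming each 28×28 pixel region of the resized image becomes one visual token, is as a per-image token budget. The helper names below are hypothetical, and the estimate ignores the rounding to multiples of 28 and the aspect-ratio handling done by the real processor.

```python
REGION = 28  # assumed side length (in pixels) of the region behind one visual token

min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28


def token_budget(pixels: int, region: int = REGION) -> int:
    """Approximate number of visual tokens for a given pixel count."""
    return pixels // (region * region)


def estimate_tokens(width: int, height: int) -> int:
    """Approximate visual tokens after clamping the image area to the pixel bounds."""
    clamped = max(min_pixels, min(width * height, max_pixels))
    return token_budget(clamped)


print(token_budget(min_pixels), token_budget(max_pixels))  # 256 1280
print(estimate_tokens(1920, 1080))  # large image, capped at the upper bound
print(estimate_tokens(224, 224))    # small image, raised to the lower bound
```

Lowering `max_pixels` is therefore a simple lever for trading image detail against memory use and speed.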

## Evaluation

Evaluation framework: lm-evaluation-harness 0.4.2 and lmms-eval 0.2.1

| Datasets | Qwen2-VL-7B-Instruct | Qwen2-VL-7B-Instruct-abliterated |
| :--- | :---: | :---: |
| _**Text benchmarks**_ | | |
| ARC (25-shot) | 57.8 | 57.8 |
| MMLU (5-shot) | 69.7 | 68.4 |
| TruthfulQA (0-shot) | 49.5 | 45.4 |
| Winogrande (5-shot) | 72.6 | 72.8 |
| _**Multimodal benchmarks**_ | | |
| AI2D (lite) | 78.8 | 79.8 |
| GQA (lite) | 73.2 | 73.6 |
| MMBench (EN dev, lite) | 84.1 | 82.6 |
| MMMU (val) | 50.8 | 51.6 |
| OCRBench | 77.7 | 78.1 |
| VQAv2 (val, lite) | 79.9 | 79.8 |
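
To make the differences easier to scan, the short script below recomputes the per-benchmark change (abliterated minus original) from the scores in the table. It only restates the numbers above and adds no new results.

```python
# Scores copied from the table above: (Qwen2-VL-7B-Instruct, abliterated)
scores = {
    "ARC (25-shot)": (57.8, 57.8),
    "MMLU (5-shot)": (69.7, 68.4),
    "TruthfulQA (0-shot)": (49.5, 45.4),
    "Winogrande (5-shot)": (72.6, 72.8),
    "AI2D (lite)": (78.8, 79.8),
    "GQA (lite)": (73.2, 73.6),
    "MMBench (EN dev, lite)": (84.1, 82.6),
    "MMMU (val)": (50.8, 51.6),
    "OCRBench": (77.7, 78.1),
    "VQAv2 (val, lite)": (79.9, 79.8),
}

# Change in score after abliteration, rounded to the table's precision
deltas = {name: round(abl - base, 1) for name, (base, abl) in scores.items()}

# Print from the largest drop to the largest gain
for name, delta in sorted(deltas.items(), key=lambda item: item[1]):
    print(f"{name:<24} {delta:+.1f}")
```

The largest change is the 4.1-point drop on TruthfulQA; every other benchmark moves by 1.5 points or less.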