---
license: apache-2.0
language:
- en
pipeline_tag: image-text-to-text
tags:
- multimodal
library_name: transformers
---
# Qwen2-VL-7B-Instruct-abliterated

## Introduction

Abliterated version of [Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct), an advanced multimodal large language model. Weight orthogonalization has been applied to inhibit the model's ability to express refusals while preserving the model's text and multimodal capabilities. Nonetheless, the model may still refuse your request, misunderstand your intent, or provide unsolicited advice regarding ethics or safety.
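
As a rough illustration of what weight orthogonalization means, the sketch below removes the component along a single direction from a toy weight matrix, so that the layer's output can no longer move along that direction. It is a minimal, dependency-free sketch under stated assumptions: the direction `r`, the matrix `W`, and the helper names are hypothetical, and this card does not specify how the direction was estimated or which weights were modified in this model.

```python
import math

def normalize(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def orthogonalize(weight, direction):
    """Return W' = W - r r^T W, where r is the unit-normalized direction.

    weight:    d_out x d_in matrix given as a list of rows (toy stand-in for
               a layer that writes into the residual stream).
    direction: length-d_out vector (hypothetical "refusal direction").
    """
    r = normalize(direction)
    d_out, d_in = len(weight), len(weight[0])
    # r^T W: how strongly each input column writes along r.
    coeffs = [sum(r[i] * weight[i][j] for i in range(d_out)) for j in range(d_in)]
    # Subtract that contribution from every entry.
    return [[weight[i][j] - r[i] * coeffs[j] for j in range(d_in)]
            for i in range(d_out)]

def matvec(weight, x):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]

# Toy values, chosen only for illustration.
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
r = [1.0, 1.0, 0.0]

W_abl = orthogonalize(W, r)
y = matvec(W_abl, [0.5, -2.0])

# The output of the modified layer has (numerically) no component along r.
component = sum(a * b for a, b in zip(normalize(r), y))
print(abs(component) < 1e-9)
```

In practice the same projection would be applied with tensor operations to the real model weights; the list-based version here only keeps the example self-contained.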

## Requirements
If you encounter errors such as `KeyError: 'qwen2_vl'` or `ImportError: cannot import name 'Qwen2VLForConditionalGeneration' from 'transformers'`, your installed transformers version likely predates Qwen2-VL support. Try installing transformers from source with `pip install git+https://github.com/huggingface/transformers`.


## Quickstart

```python
from PIL import Image
import requests
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

path = "natong19/Qwen2-VL-7B-Instruct-abliterated"

# Load the model in half-precision on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    path, torch_dtype="auto", device_map="auto"
)

min_pixels = 256*28*28
max_pixels = 1280*28*28
processor = AutoProcessor.from_pretrained(path, min_pixels=min_pixels, max_pixels=max_pixels)

# Image
url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preprocess the inputs
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
# Expected output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n<|im_start|>assistant\n'

inputs = processor(
    text=[text_prompt], images=[image], padding=True, return_tensors="pt"
)
# Move the inputs to the device the model was placed on by device_map="auto"
inputs = inputs.to(model.device)

# Inference: generate, then strip the prompt tokens from each output sequence
output_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids = [
    out_ids[len(in_ids) :]
    for in_ids, out_ids in zip(inputs.input_ids, output_ids)
]
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
)
print(output_text)
```
The above code runs on a GPU with 24 GB of VRAM. For more usage examples, such as multi-image inference, video inference and batch inference, please refer to the [Qwen2-VL-7B-Instruct repo](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct).
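
The `min_pixels` and `max_pixels` values in the quickstart can be read as a visual-token budget. The sketch below works through that arithmetic, assuming (as described in the upstream Qwen2-VL documentation) that each visual token covers roughly a 28×28 pixel region. The helper names are hypothetical, and the estimate ignores the processor's exact resizing and rounding rules.

```python
PATCH = 28  # assumed side length, in pixels, of the region behind one visual token

def pixel_budget(tokens):
    """Pixel count corresponding to a given number of visual tokens."""
    return tokens * PATCH * PATCH

def approx_visual_tokens(width, height, min_tokens=256, max_tokens=1280):
    """Rough token estimate for an image, clamped to the configured range."""
    raw = (width * height) // (PATCH * PATCH)
    return max(min_tokens, min(max_tokens, raw))

min_pixels = pixel_budget(256)    # same value as 256*28*28 in the quickstart
max_pixels = pixel_budget(1280)   # same value as 1280*28*28 in the quickstart
print(min_pixels, max_pixels)     # 200704 1003520

# A 1920x1080 image exceeds max_pixels, so it would be downscaled to the cap.
print(approx_visual_tokens(1920, 1080))  # 1280
```

Lowering `max_pixels` reduces memory use and latency at the cost of image detail; raising `min_pixels` upsamples small images so they receive more tokens.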

## Evaluation

Evaluation framework: lm-evaluation-harness 0.4.2 and lmms-eval 0.2.1

| Datasets | Qwen2-VL-7B-Instruct | Qwen2-VL-7B-Instruct-abliterated |
| :--- | :---: | :---: |
| _**Text benchmarks**_ | | |
| ARC (25-shot) | 57.8 | 57.8 |
| MMLU (5-shot) | 69.7 | 68.4 |
| TruthfulQA (0-shot) | 49.5 | 45.4 |
| Winogrande (5-shot) | 72.6 | 72.8 |
| _**Multimodal benchmarks**_ | | |
| AI2D (lite) | 78.8 | 79.8 |
| GQA (lite) | 73.2 | 73.6 |
| MMBench (EN dev, lite) | 84.1 | 82.6 |
| MMMU (val) | 50.8 | 51.6 |
| OCRBench | 77.7 | 78.1 |
| VQAv2 (val, lite) | 79.9 | 79.8 |
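
To make the comparison easier to read, the following sketch computes the per-benchmark difference (abliterated minus original) from the scores in the table above. The scores are copied from the table; the variable names are only illustrative.

```python
# (original, abliterated) scores copied from the table above.
scores = {
    "ARC (25-shot)": (57.8, 57.8),
    "MMLU (5-shot)": (69.7, 68.4),
    "TruthfulQA (0-shot)": (49.5, 45.4),
    "Winogrande (5-shot)": (72.6, 72.8),
    "AI2D (lite)": (78.8, 79.8),
    "GQA (lite)": (73.2, 73.6),
    "MMBench (EN dev, lite)": (84.1, 82.6),
    "MMMU (val)": (50.8, 51.6),
    "OCRBench": (77.7, 78.1),
    "VQAv2 (val, lite)": (79.9, 79.8),
}

# Positive delta: the abliterated model scored higher.
deltas = {name: round(abl - orig, 1) for name, (orig, abl) in scores.items()}

for name, delta in deltas.items():
    print(f"{name:<24} {delta:+.1f}")

# Benchmark with the largest drop after abliteration.
worst = min(deltas, key=deltas.get)
print(worst, deltas[worst])  # TruthfulQA (0-shot) -4.1
```

Most differences are within about one and a half points in either direction; TruthfulQA shows the largest change.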