	---
	license: apache-2.0
	language:
	- en
	pipeline_tag: image-text-to-text
	tags:
	- multimodal
	library_name: transformers
	---
	# Qwen2-VL-7B-Instruct-abliterated

	## Introduction

	Abliterated version of [Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct), an advanced multimodal large language model. Weight orthogonalization has been applied to inhibit the model's ability to express refusals while preserving the model's text and multimodal capabilities. Nonetheless, the model may still refuse your request, misunderstand your intent, or provide unsolicited advice regarding ethics or safety.
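As a rough illustration of what weight orthogonalization means, the sketch below removes a single direction from the output space of a weight matrix. It is a minimal example under stated assumptions, not the procedure used to produce this model: the matrix `W`, the direction `r`, and the helper `orthogonalize` are all hypothetical, and in practice the direction would be estimated from model activations rather than drawn at random.

```python
import numpy as np


def orthogonalize(weight: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Project `direction` out of the output space of `weight` (shape: d_out x d_in)."""
    unit = direction / np.linalg.norm(direction)
    # W' = W - u u^T W, so u^T (W' x) = 0 for every input x
    return weight - np.outer(unit, unit) @ weight


# Hypothetical toy data: a random 8x4 weight matrix and a random direction
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))
r = rng.normal(size=8)

W_abl = orthogonalize(W, r)
x = rng.normal(size=4)

# The output of the modified matrix has no component along r
print(np.allclose(r @ (W_abl @ x), 0.0))
```

After the projection, the layer can no longer write anything along `r`, while components orthogonal to `r` are left unchanged.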

	## Requirements
	If you encounter errors such as `KeyError: 'qwen2_vl'` or `ImportError: cannot import name 'Qwen2VLForConditionalGeneration' from 'transformers'`, your installed transformers version likely predates Qwen2-VL support. Try installing transformers from source: `pip install git+https://github.com/huggingface/transformers`
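A quick way to check for the errors above before loading the model is to test whether the installed transformers package exposes the Qwen2-VL class. The helper name `supports_qwen2_vl` is made up for this sketch; it only checks that the attribute exists and does not validate the rest of the environment.

```python
import importlib
import importlib.util


def supports_qwen2_vl() -> bool:
    """Return True if the installed transformers exposes the Qwen2-VL model class."""
    if importlib.util.find_spec("transformers") is None:
        return False  # transformers is not installed at all
    try:
        transformers = importlib.import_module("transformers")
        return hasattr(transformers, "Qwen2VLForConditionalGeneration")
    except Exception:
        return False  # a broken install is treated as "not supported"


print("Qwen2-VL available:", supports_qwen2_vl())
```

If this prints `False`, upgrading or installing transformers from source is the likely fix.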


	## Quickstart

	```python
	from PIL import Image
	import requests
	from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

	path = "natong19/Qwen2-VL-7B-Instruct-abliterated"

	# Load the model in half-precision on the available device(s)
	model = Qwen2VLForConditionalGeneration.from_pretrained(
	    path, torch_dtype="auto", device_map="auto"
	)

	# Bounds on the number of pixels per image after resizing
	min_pixels = 256 * 28 * 28
	max_pixels = 1280 * 28 * 28
	processor = AutoProcessor.from_pretrained(path, min_pixels=min_pixels, max_pixels=max_pixels)

	# Image
	url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
	image = Image.open(requests.get(url, stream=True).raw)

	conversation = [
	    {
	        "role": "user",
	        "content": [
	            {"type": "image"},
	            {"type": "text", "text": "Describe this image."},
	        ],
	    }
	]

	# Preprocess the inputs
	text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
	# Expected output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n<|im_start|>assistant\n'

	inputs = processor(
	    text=[text_prompt], images=[image], padding=True, return_tensors="pt"
	)
	inputs = inputs.to("cuda")

	# Inference: Generation of the output
	output_ids = model.generate(**inputs, max_new_tokens=128)
	# Strip the prompt tokens so that only the newly generated tokens are decoded
	generated_ids = [
	    out_ids[len(in_ids):]
	    for in_ids, out_ids in zip(inputs.input_ids, output_ids)
	]
	output_text = processor.batch_decode(
	    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
	)
	print(output_text)
	```
	The above code can be run with 24 GB of VRAM. For more usage examples, such as multi-image inference, video inference, and batch inference, please refer to the [Qwen2-VL-7B-Instruct repo](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct).
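The `min_pixels` and `max_pixels` arguments passed to `AutoProcessor` in the Quickstart bound the resolution of each image after resizing. The sketch below assumes the upstream Qwen2-VL convention that one visual token corresponds to a 28x28 pixel patch, so a pixel budget is a token count multiplied by 28 * 28. The constant `PATCH_SIDE` and the helper `pixel_budget` are names introduced here for illustration only.

```python
PATCH_SIDE = 28  # assumed side length, in pixels, of the patch covered by one visual token


def pixel_budget(num_tokens: int, patch_side: int = PATCH_SIDE) -> int:
    """Convert a visual-token budget into a pixel budget."""
    return num_tokens * patch_side * patch_side


# Budgets for a range of 256 to 1280 visual tokens per image
min_pixels = pixel_budget(256)
max_pixels = pixel_budget(1280)
print(min_pixels, max_pixels)
```

Lowering the upper token count reduces memory use at the cost of image detail, so it is a reasonable first knob to turn when VRAM is tight.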

	## Evaluation

	Evaluation framework: lm-evaluation-harness 0.4.2 and lmms-eval 0.2.1

	| Datasets | Qwen2-VL-7B-Instruct | Qwen2-VL-7B-Instruct-abliterated |
	| :--- | :---: | :---: |
	| _Text benchmarks_ | | |
	| ARC (25-shot) | 57.8 | 57.8 |
	| MMLU (5-shot) | 69.7 | 68.4 |
	| TruthfulQA (0-shot) | 49.5 | 45.4 |
	| Winogrande (5-shot) | 72.6 | 72.8 |
	| _Multimodal benchmarks_ | | |
	| AI2D (lite) | 78.8 | 79.8 |
	| GQA (lite) | 73.2 | 73.6 |
	| MMBench (EN dev, lite) | 84.1 | 82.6 |
	| MMMU (val) | 50.8 | 51.6 |
	| OCRBench | 77.7 | 78.1 |
	| VQAv2 (val, lite) | 79.9 | 79.8 |
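To summarize the table, the following sketch computes the per-benchmark score change (abliterated minus original) and the mean change within each group. The scores are copied from the table above; the variable names and the `mean_delta` helper are introduced here for illustration, and an unweighted mean across benchmarks is only a rough summary.

```python
# (original, abliterated) scores copied from the evaluation table
text_scores = {
    "ARC (25-shot)": (57.8, 57.8),
    "MMLU (5-shot)": (69.7, 68.4),
    "TruthfulQA (0-shot)": (49.5, 45.4),
    "Winogrande (5-shot)": (72.6, 72.8),
}
multimodal_scores = {
    "AI2D (lite)": (78.8, 79.8),
    "GQA (lite)": (73.2, 73.6),
    "MMBench (EN dev, lite)": (84.1, 82.6),
    "MMMU (val)": (50.8, 51.6),
    "OCRBench": (77.7, 78.1),
    "VQAv2 (val, lite)": (79.9, 79.8),
}


def mean_delta(scores: dict) -> float:
    """Unweighted mean of (abliterated - original) over the benchmarks in `scores`."""
    deltas = [abliterated - original for original, abliterated in scores.values()]
    return sum(deltas) / len(deltas)


for name, (original, abliterated) in {**text_scores, **multimodal_scores}.items():
    print(f"{name}: {abliterated - original:+.1f}")

print(f"Mean text delta: {mean_delta(text_scores):+.2f}")
print(f"Mean multimodal delta: {mean_delta(multimodal_scores):+.2f}")
```

The largest single change is on TruthfulQA, while the multimodal scores move by at most 1.5 points in either direction.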