
Quantized Deployment

To lower the cost of running XuanYuan locally and reduce GPU memory requirements, we provide pre-quantized 8-bit and 4-bit versions of the XuanYuan-13B-Chat model.

8-bit model:

For 8-bit quantization, we use the bitsandbytes library, which is widely used in the community.

import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

model_name_or_path = "/your/model/path"
# Load the tokenizer and the pre-quantized 8-bit checkpoint.
tokenizer = LlamaTokenizer.from_pretrained(model_name_or_path, use_fast=False, legacy=True)
model = LlamaForCausalLM.from_pretrained(model_name_or_path, torch_dtype=torch.float16, device_map="auto")

# Example prompt: "Question: Which dynasty did Li Shizhen live in? Answer:"
inputs = tokenizer("问题:李时珍是哪一个朝代的人?回答:", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=64, repetition_penalty=1.1)
# Decode only the newly generated tokens, skipping the prompt.
outputs = tokenizer.decode(outputs.cpu()[0][len(inputs.input_ids[0]):], skip_special_tokens=True)
print(outputs)
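
If you only have the full-precision checkpoint, a comparable 8-bit setup can be obtained on the fly through transformers' BitsAndBytesConfig. The following is a minimal sketch under that assumption; the full-precision model path is a placeholder, and results may differ slightly from the pre-quantized release:

import torch
from transformers import LlamaForCausalLM, LlamaTokenizer, BitsAndBytesConfig

# Hypothetical path to a full-precision (non-quantized) XuanYuan-13B-Chat checkpoint.
model_name_or_path = "/your/full-precision/model/path"
tokenizer = LlamaTokenizer.from_pretrained(model_name_or_path, use_fast=False, legacy=True)

# Ask bitsandbytes to quantize the weights to 8 bit while loading.
quant_config = BitsAndBytesConfig(load_in_8bit=True)
model = LlamaForCausalLM.from_pretrained(model_name_or_path, quantization_config=quant_config, device_map="auto")

inputs = tokenizer("问题:李时珍是哪一个朝代的人?回答:", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=64, repetition_penalty=1.1)
print(tokenizer.decode(outputs.cpu()[0][len(inputs.input_ids[0]):], skip_special_tokens=True))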

4-bit model:

For 4-bit quantization, we use the auto-gptq toolkit.

import torch
from transformers import LlamaTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_name_or_path = "/your/model/path"
# Load the tokenizer and the GPTQ-quantized 4-bit checkpoint.
tokenizer = LlamaTokenizer.from_pretrained(model_name_or_path, use_fast=False, legacy=True)
model = AutoGPTQForCausalLM.from_quantized(model_name_or_path, torch_dtype=torch.float16, device_map="auto")

# Example prompt: "Question: Which dynasty did Li Shizhen live in? Answer:"
inputs = tokenizer("问题:李时珍是哪一个朝代的人?回答:", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=64, repetition_penalty=1.1)
# Decode only the newly generated tokens, skipping the prompt.
outputs = tokenizer.decode(outputs.cpu()[0][len(inputs.input_ids[0]):], skip_special_tokens=True)
print(outputs)
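
For reference, GPTQ checkpoints like this one are produced by calibrating the full-precision model on a small set of tokenized samples. The sketch below illustrates that general auto-gptq workflow; it is not the exact recipe used for this release, and the paths and calibration text are placeholders:

from transformers import LlamaTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

# Hypothetical paths: full-precision input and quantized output.
fp16_model_path = "/your/full-precision/model/path"
quantized_model_path = "/your/output/4bit/path"

tokenizer = LlamaTokenizer.from_pretrained(fp16_model_path, use_fast=False, legacy=True)

# 4-bit GPTQ settings; group_size=128 is a common default.
quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)
model = AutoGPTQForCausalLM.from_pretrained(fp16_model_path, quantize_config)

# A tiny calibration set of tokenized examples (placeholder text; real calibration uses more data).
examples = [tokenizer("问题:李时珍是哪一个朝代的人?回答:明朝。")]
model.quantize(examples)

model.save_quantized(quantized_model_path)
tokenizer.save_pretrained(quantized_model_path)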

Using the 4-bit model with vLLM:

Running the GPTQ-quantized 4-bit model through an ordinary Hugging Face inference script is slow and not practical. Recent versions of vLLM, however, support loading several kinds of quantized models, including GPTQ. With its quantized acceleration kernels, PagedAttention, continuous batching, and scheduling mechanisms, vLLM can deliver at least a 10x improvement in inference throughput. You can install the latest vLLM and run our 4-bit quantized model with the following script:

from vllm import LLM, SamplingParams

# Sampling configuration for generation.
sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=256)
# Load the GPTQ 4-bit checkpoint with vLLM.
llm = LLM(model="/your/model/path", quantization="gptq", dtype="float16")

# Example prompt: "Question: Which era did Li Shizhen live in? Answer:"
prompts = "问题:李时珍是哪一个时代的人?回答:"
result = llm.generate(prompts, sampling_params)
# Collect the generated text and token ids for each request.
result_output = [[output.outputs[0].text, output.outputs[0].token_ids] for output in result]

print('generated_result', result_output[0])
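
Because vLLM batches requests continuously, its throughput advantage shows up mainly when many prompts are submitted at once. A minimal sketch of batched generation, reusing the llm and sampling_params objects above (the prompts are placeholders):

# Pass a list of prompts; vLLM schedules them together via continuous batching.
batch_prompts = [
    "问题:李时珍是哪一个时代的人?回答:",
    "问题:本草纲目的作者是谁?回答:",
]
batch_results = llm.generate(batch_prompts, sampling_params)
for req in batch_results:
    print(req.prompt, req.outputs[0].text)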