---
license: apache-2.0
language:
- es
tags:
- quantization
- gptq
pipeline_tag: text-generation
library_name: transformers
inference: false
---

# Llama-2-7b-ft-instruct-es-gptq-4bit

[Llama 2 (7B)](https://huggingface.co/meta-llama/Llama-2-7b) fine-tuned on [Clibrain](https://huggingface.co/clibrain)'s Spanish instructions dataset and **optimized** using **GPTQ**.

## Model Details

Llama 2 is a collection of pre-trained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. This repository contains the 7B model, fine-tuned for instruction following in Spanish and quantized to 4-bit with GPTQ.

## About GPTQ (from HF Blog)

Quantization methods usually belong to one of two categories:

1. Post-Training Quantization (PTQ): a pre-trained model is quantized using moderate resources, such as a calibration dataset and a few hours of computation.
2. Quantization-Aware Training (QAT): quantization is performed during training or further fine-tuning, so the model learns to compensate for the quantization error.

GPTQ falls into the PTQ category, which is particularly interesting for massive models, for which full training or even fine-tuning can be very expensive.

Specifically, GPTQ adopts a mixed int4/fp16 quantization scheme where weights are quantized as int4 while activations remain in float16. During inference, weights are dequantized on the fly and the actual compute is performed in float16.

The benefits of this scheme are twofold:

- Memory savings close to 4x for int4 quantization, as the dequantization happens close to the compute unit in a fused kernel rather than in GPU global memory.
- Potential speedups thanks to the time saved on data communication due to the lower bitwidth used for weights.
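
To make the weight-only scheme concrete, here is a minimal sketch of plain round-to-nearest 4-bit quantization with per-row scales and zero-points. It illustrates the storage/dequantization idea only and is not the GPTQ algorithm; the helper names are hypothetical:

```py
import torch

def quantize_int4(w: torch.Tensor):
    """Asymmetric round-to-nearest 4-bit quantization, one scale/zero-point per row."""
    w_min = w.min(dim=1, keepdim=True).values
    w_max = w.max(dim=1, keepdim=True).values
    scale = ((w_max - w_min) / 15).clamp(min=1e-8)  # 15 = 2**4 - 1 levels
    zero = torch.round(-w_min / scale)              # integer that dequantizes to 0.0
    q = torch.clamp(torch.round(w / scale) + zero, 0, 15)
    return q.to(torch.uint8), scale, zero           # uint8 here; real kernels pack two per byte

def dequantize_int4(q, scale, zero):
    # In a GPTQ kernel this happens on the fly, fused with the float16 matmul.
    return (q.float() - zero) * scale

w = torch.randn(4, 8)
w_hat = dequantize_int4(*quantize_int4(w))
print((w - w_hat).abs().max())  # small per-weight reconstruction error
```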

The GPTQ paper tackles the layer-wise compression problem:

Given a layer \\(l\\) with weight matrix \\(W_{l}\\) and layer input \\(X_{l}\\), we want to find a quantized version of the weight \\(\hat{W}_{l}\\) that minimizes the mean squared error (MSE):

\\({\hat{W}_{l}}^{*} = \mathrm{argmin}_{\hat{W}_{l}} \|W_{l}X_{l}-\hat{W}_{l}X_{l}\|^{2}_{2}\\)

Once this is solved per layer, a solution to the global problem can be obtained by combining the layer-wise solutions.

In order to solve this layer-wise compression problem, the authors use the Optimal Brain Quantization framework ([Frantar et al., 2022](https://arxiv.org/abs/2208.11580)). The OBQ method starts from the observation that the above equation can be written as the sum of the squared errors over each row of \\(W_{l}\\):

\\( \sum_{i=0}^{d_{row}} \|W_{l[i,:]}X_{l}-\hat{W}_{l[i,:]}X_{l}\|^{2}_{2} \\)

This means that we can quantize each row independently. This is called per-channel quantization. For each row \\(W_{l[i,:]}\\), OBQ quantizes one weight at a time while always updating all not-yet-quantized weights, in order to compensate for the error incurred by quantizing a single weight. The update on the selected weights has a closed-form formula, utilizing Hessian matrices.
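
As a rough illustration of that inner loop, the following hypothetical sketch quantizes one row with error compensation. It mirrors the OBQ/GPTQ update in simplified form (fixed left-to-right order, as GPTQ uses); `quant` stands for any round-to-grid function, and the damping constant is an assumption:

```py
import torch

def quantize_row(w: torch.Tensor, H_inv: torch.Tensor, quant):
    """Quantize one weight row, compensating the not-yet-quantized weights.

    H_inv is the inverse Hessian of the layer-wise objective; the update on
    the remaining weights is the closed-form OBQ compensation step.
    """
    w = w.clone()
    q = torch.empty_like(w)
    for i in range(w.numel()):
        q[i] = quant(w[i])                     # quantize one weight
        err = (w[i] - q[i]) / H_inv[i, i]      # normalized quantization error
        w[i + 1:] -= err * H_inv[i, i + 1:]    # compensate the rest of the row
    return q

X = torch.randn(16, 64)                        # calibration inputs (features x samples)
H = 2 * X @ X.T                                # Hessian of the per-row objective
H_inv = torch.linalg.inv(H + 0.01 * torch.eye(16))      # damped inverse (assumed damping)
q = quantize_row(torch.randn(16), H_inv, torch.round)   # integer grid for simplicity
```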

The GPTQ paper improves this framework by introducing a set of optimizations that reduce the complexity of the quantization algorithm while retaining the accuracy of the model.

Compared to OBQ, the quantization step itself is also faster with GPTQ: it takes 2 GPU-hours to quantize a BERT model (336M) with OBQ, whereas with GPTQ, a BLOOM model (176B) can be quantized in less than 4 GPU-hours.

To learn more about the exact algorithm and the different benchmarks on perplexity and speedups, check out the original [paper](https://arxiv.org/pdf/2210.17323.pdf).

## Example of Usage

```sh
pip install transformers optimum
```
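
Depending on your `transformers` version, loading GPTQ checkpoints may additionally require the `auto-gptq` package (`pip install auto-gptq`).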

```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

model_id = "clibrain/Llama-2-7b-ft-instruct-es-gptq-4bit"

# GPTQ weights are dequantized on the fly; compute still runs in float16.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)


def create_instruction(instruction, input_data=None, context=None):
    """Build the Alpaca-style Spanish prompt the model was fine-tuned on."""
    sections = {
        "Instrucción": instruction,
        "Entrada": input_data,
        "Contexto": context,
    }

    system_prompt = "A continuación hay una instrucción que describe una tarea, junto con una entrada que proporciona más contexto. Escriba una respuesta que complete adecuadamente la solicitud.\n\n"
    prompt = system_prompt

    for title, content in sections.items():
        if content is not None:
            prompt += f"### {title}:\n{content}\n\n"

    prompt += "### Respuesta:\n"

    return prompt


def generate(
    instruction,
    input_data=None,
    context=None,
    max_new_tokens=128,
    temperature=0.1,
    top_p=0.75,
    top_k=40,
    num_beams=4,
    **kwargs
):
    prompt = create_instruction(instruction, input_data, context)
    print(prompt.replace("### Respuesta:\n", ""))
    inputs = tokenizer(prompt, return_tensors="pt")
    input_ids = inputs["input_ids"].to(model.device)
    attention_mask = inputs["attention_mask"].to(model.device)
    generation_config = GenerationConfig(
        temperature=temperature,
        top_p=top_p,
        top_k=top_k,
        num_beams=num_beams,
        **kwargs,
    )
    with torch.no_grad():
        generation_output = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            generation_config=generation_config,
            return_dict_in_generate=True,
            output_scores=True,
            max_new_tokens=max_new_tokens,
            early_stopping=True,
        )
    s = generation_output.sequences[0]
    output = tokenizer.decode(s, skip_special_tokens=True)
    # Return only the text generated after the "### Respuesta:" marker.
    return output.split("### Respuesta:")[1].lstrip("\n")


instruction = "Dame una lista de lugares a visitar en España."
print(generate(instruction))
```