TheBloke committed
Commit 5b8b77a
1 Parent(s): 56f9eb4

Initial GPTQ model commit

Files changed (1): README.md (+189 -0)
README.md ADDED
---
inference: false
license: other
---

<!-- header start -->
<div style="width: 100%;">
<img src="https://i.imgur.com/EBdldam.jpg" alt="TheBlokeAI" style="width: 100%; min-width: 400px; display: block; margin: auto;">
</div>
<div style="display: flex; justify-content: space-between; width: 100%;">
<div style="display: flex; flex-direction: column; align-items: flex-start;">
<p><a href="https://discord.gg/Jq4vkcDakD">Chat & support: my new Discord server</a></p>
</div>
<div style="display: flex; flex-direction: column; align-items: flex-end;">
<p><a href="https://www.patreon.com/TheBlokeAI">Want to contribute? TheBloke's Patreon page</a></p>
</div>
</div>
<!-- header end -->

# LmSys' Vicuna 7B v1.3 GPTQ

These files are GPTQ 4-bit model files for [LmSys' Vicuna 7B v1.3](https://huggingface.co/lmsys/vicuna-7b-v1.3).

They are the result of quantising to 4-bit using [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa).

## Repositories available

* [4-bit GPTQ models for GPU inference](https://huggingface.co/TheBloke/vicuna-7B-v1.3-GPTQ)
* [2, 3, 4, 5, 6 and 8-bit GGML models for CPU+GPU inference](https://huggingface.co/TheBloke/vicuna-7B-v1.3-GGML)
* [Unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/lmsys/vicuna-7b-v1.3)

## How to easily download and use this model in text-generation-webui

Please make sure you're using the latest version of text-generation-webui.

1. Click the **Model tab**.
2. Under **Download custom model or LoRA**, enter `TheBloke/vicuna-7B-v1.3-GPTQ`.
3. Click **Download**.
4. The model will start downloading. Once it's finished, it will say "Done".
5. In the top left, click the refresh icon next to **Model**.
6. In the **Model** dropdown, choose the model you just downloaded: `vicuna-7B-v1.3-GPTQ`.
7. The model will automatically load, and is now ready for use!
8. If you want any custom settings, set them and then click **Save settings for this model** followed by **Reload the Model** in the top right.
   * Note that you do not need to, and should not, set manual GPTQ parameters any more. These are set automatically from the file `quantize_config.json`.
9. Once you're ready, click the **Text Generation tab** and enter a prompt to get started!

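If you'd rather script the download than use the UI, a minimal sketch using `huggingface_hub` is below. The `local_dir` path is just an illustrative choice, not something these instructions specify:

```python
# Hedged sketch: download the full model repo with huggingface_hub.
# snapshot_download fetches every file in the repo (weights, tokenizer files,
# quantize_config.json) so the folder can be used directly.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="TheBloke/vicuna-7B-v1.3-GPTQ",
    local_dir="models/vicuna-7B-v1.3-GPTQ",  # hypothetical target folder
)
print(f"Downloaded to: {local_dir}")
```
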
## How to use this GPTQ model from Python code

First make sure you have [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) installed:

`pip install auto-gptq`

Then try the following example code:

```python
from transformers import AutoTokenizer, pipeline, logging
from auto_gptq import AutoGPTQForCausalLM

model_name_or_path = "TheBloke/vicuna-7B-v1.3-GPTQ"
model_basename = "vicuna-7b-v1.3-GPTQ-4bit-128g.no-act.order"

use_triton = False

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

# Load the 4-bit quantised weights. With quantize_config=None, the
# quantisation parameters are read from the repo's quantize_config.json.
model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
        model_basename=model_basename,
        use_safetensors=True,
        trust_remote_code=False,
        device="cuda:0",
        use_triton=use_triton,
        quantize_config=None)

# Note: check that the prompt template is correct for this model.
prompt = "Tell me about AI"
prompt_template = f'''### Human: {prompt}
### Assistant:'''

print("\n\n*** Generate:")

input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids, temperature=0.7, max_new_tokens=512)
print(tokenizer.decode(output[0]))

# Inference can also be done using transformers' pipeline.

# Prevent printing spurious transformers error when using pipeline with AutoGPTQ.
logging.set_verbosity(logging.CRITICAL)

print("*** Pipeline:")
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15
)

print(pipe(prompt_template)[0]['generated_text'])
```

## Provided files

**vicuna-7b-v1.3-GPTQ-4bit-128g.no-act.order.safetensors**

This will work with AutoGPTQ and CUDA versions of GPTQ-for-LLaMa. There are reports of issues with the Triton mode of recent GPTQ-for-LLaMa builds; if you have issues, please use AutoGPTQ instead.

It was created with group_size 128 to increase inference accuracy, but without --act-order (desc_act), to increase compatibility and improve inference speed.

* `vicuna-7b-v1.3-GPTQ-4bit-128g.no-act.order.safetensors`
  * Works with AutoGPTQ in CUDA or Triton modes.
  * Works with GPTQ-for-LLaMa in CUDA mode. May have issues with GPTQ-for-LLaMa Triton mode.
  * Works with text-generation-webui, including one-click-installers.
  * Parameters: Groupsize = 128. Act Order / desc_act = False.

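These parameters are also recorded in the repo's `quantize_config.json`, which AutoGPTQ and text-generation-webui read automatically. A minimal sketch of inspecting them yourself (the exact set of keys can vary between AutoGPTQ versions):

```python
# Hedged sketch: print the quantisation parameters shipped with the repo.
import json
from huggingface_hub import hf_hub_download

config_path = hf_hub_download("TheBloke/vicuna-7B-v1.3-GPTQ", "quantize_config.json")
with open(config_path) as f:
    quantize_config = json.load(f)

# Expect values matching the list above, e.g. bits=4, group_size=128, desc_act=False.
print(json.dumps(quantize_config, indent=2))
```
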
<!-- footer start -->
## Discord

For further support, and discussions on these models and AI in general, join us at:

[TheBloke AI's Discord server](https://discord.gg/Jq4vkcDakD)

## Thanks, and how to contribute

Thanks to the [chirper.ai](https://chirper.ai) team!

I've had a lot of people ask if they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding into new projects like fine tuning/training.

If you're able and willing to contribute, it will be most gratefully received and will help me to keep providing more models, and to start work on new AI projects.

Donators will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, plus other benefits.

* Patreon: https://patreon.com/TheBlokeAI
* Ko-Fi: https://ko-fi.com/TheBlokeAI

**Special thanks to**: Luke from CarbonQuill, Aemon Algiz, Dmitriy Samsonov.

**Patreon special mentions**: vamX, K, Jonathan Leane, Lone Striker, Sean Connelly, Chris McCloskey, WelcomeToTheClub, Nikolai Manek, John Detwiler, Kalila, David Flickinger, Fen Risland, subjectnull, Johann-Peter Hartmann, Talal Aujan, John Villwock, senxiiz, Khalefa Al-Ahmad, Kevin Schuppel, Alps Aficionado, Derek Yates, Mano Prime, Nathan LeClaire, biorpg, trip7s trip, Asp the Wyvern, chris gileta, Iucharbius, Artur Olbinski, Ai Maven, Joseph William Delisle, Luke Pendergrass, Illia Dulskyi, Eugene Pentland, Ajan Kanaga, Willem Michiel, Space Cruiser, Pyrater, Preetika Verma, Junyu Yang, Oscar Rangel, Spiking Neurons AB, Pierre Kircher, webtim, Cory Kujawski, terasurfer, Trenton Dambrowitz, Gabriel Puliatti, Imad Khwaja, Luke.

Thank you to all my generous patrons and donators!

<!-- footer end -->

# Original model card: LmSys' Vicuna 7B v1.3

# Vicuna Model Card

## Model Details

Vicuna is a chat assistant trained by fine-tuning LLaMA on user-shared conversations collected from ShareGPT.

- **Developed by:** [LMSYS](https://lmsys.org/)
- **Model type:** An auto-regressive language model based on the transformer architecture.
- **License:** Non-commercial license
- **Finetuned from model:** [LLaMA](https://arxiv.org/abs/2302.13971).

### Model Sources

- **Repository:** https://github.com/lm-sys/FastChat
- **Blog:** https://lmsys.org/blog/2023-03-30-vicuna/
- **Paper:** https://arxiv.org/abs/2306.05685
- **Demo:** https://chat.lmsys.org/

## Uses

The primary use of Vicuna is research on large language models and chatbots.
The primary intended users of the model are researchers and hobbyists in natural language processing, machine learning, and artificial intelligence.

## How to Get Started with the Model

- Command line interface: https://github.com/lm-sys/FastChat#vicuna-weights
- APIs (OpenAI API, Huggingface API): https://github.com/lm-sys/FastChat/tree/main#api

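Prompt format matters when calling the weights directly (see the note in the Python example above). Below is a minimal sketch of building a Vicuna-style prompt with FastChat's own conversation helpers, assuming the `fschat` package is installed and that v1.3 follows the v1.1 conversation template as FastChat's docs describe:

```python
# Hedged sketch: construct the Vicuna conversation prompt using FastChat's
# template registry. The template name "vicuna_v1.1" is taken from FastChat's
# docs for v1.1+ weights; verify it against your installed FastChat version.
from fastchat.conversation import get_conv_template

conv = get_conv_template("vicuna_v1.1")
conv.append_message(conv.roles[0], "Tell me about AI")  # user turn
conv.append_message(conv.roles[1], None)                # leave assistant turn open
print(conv.get_prompt())
```
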
## Training Details

Vicuna v1.3 is fine-tuned from LLaMA with supervised instruction fine-tuning.
The training data is around 140K conversations collected from ShareGPT.com.
See more details in the "Training Details of Vicuna Models" section in the appendix of this [paper](https://arxiv.org/pdf/2306.05685.pdf).

## Evaluation

Vicuna is evaluated with standard benchmarks, human preference, and LLM-as-a-judge. See more details in this [paper](https://arxiv.org/pdf/2306.05685.pdf).

## Difference between different versions of Vicuna

See [vicuna_weights_version.md](https://github.com/lm-sys/FastChat/blob/main/docs/vicuna_weights_version.md).