cmh committed on
Commit 80e1249
1 Parent(s): 376b17a

Update README.md

Files changed (1): README.md (+191 -0)

---
license: llama2
language: fr
pipeline_tag: text-generation
inference: false
tags:
- LLM
- llama-2
- finetuned
---

# Vigogne 2 13B Chat - GGUF
- Model creator: [bofenghuang](https://huggingface.co/bofenghuang)
- Original model: [Vigogne 2 13B Chat](https://huggingface.co/bofenghuang/vigogne-2-13b-chat)

<!-- description start -->
## Description

This repo contains GGUF format model files for [bofenghuang's Vigogne 2 13B Chat](https://huggingface.co/bofenghuang/vigogne-2-13b-chat).

<p align="center" width="100%">
<img src="https://huggingface.co/bofenghuang/vigogne-2-13b-chat/resolve/main/logo_v2.jpg" alt="Vigogne" style="width: 30%; min-width: 300px; display: block; margin: auto;">
</p>

# Vigogne-2-13B-Chat: A Llama-2-based French Chat LLM

Vigogne-2-13B-Chat is a French chat LLM based on [LLaMA-2-13B](https://ai.meta.com/llama), optimized to generate helpful and coherent responses in conversations with users.

Check out our [release blog](https://github.com/bofenghuang/vigogne/blob/main/blogs/2023-08-17-vigogne-chat-v2_0.md) and [GitHub repository](https://github.com/bofenghuang/vigogne) for more information.

**Usage and License Notices**: Vigogne-2-13B-Chat follows Llama-2's [usage policy](https://ai.meta.com/llama/use-policy). A significant portion of the training data is distilled from GPT-3.5-Turbo and GPT-4, so please use the model with care to avoid violating OpenAI's [terms of use](https://openai.com/policies/terms-of-use).

## Prompt Template

We use the prefix tokens `<|user|>:` and `<|assistant|>:` to distinguish between user and assistant utterances. As the output below shows, the rendered prompt also begins with a default system message introduced by `<|system|>:`.

You can apply this formatting with the tokenizer's [chat template](https://huggingface.co/docs/transformers/main/chat_templating) through the `apply_chat_template()` method.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bofenghuang/vigogne-2-13b-chat")

# A multi-turn conversation in French (English glosses in the comments).
conversation = [
    # "Hello! How are you today?"
    {"role": "user", "content": "Bonjour ! Comment ça va aujourd'hui ?"},
    # "Hello! I'm an AI, so I don't have feelings, but I'm ready to help you. How can I assist you today?"
    {"role": "assistant", "content": "Bonjour ! Je suis une IA, donc je n'ai pas de sentiments, mais je suis prêt à vous aider. Comment puis-je vous assister aujourd'hui ?"},
    # "How tall is the Eiffel Tower?"
    {"role": "user", "content": "Quelle est la hauteur de la Tour Eiffel ?"},
    # "The Eiffel Tower is about 330 meters tall."
    {"role": "assistant", "content": "La Tour Eiffel mesure environ 330 mètres de hauteur."},
    # "How do I get to the top?"
    {"role": "user", "content": "Comment monter en haut ?"},
]

# Render the prompt as a string and append the assistant prefix for generation.
print(tokenizer.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True))
```

This prints:

```
<s><|system|>: Vous êtes Vigogne, un assistant IA créé par Zaion Lab. Vous suivez extrêmement bien les instructions. Aidez autant que vous le pouvez.
<|user|>: Bonjour ! Comment ça va aujourd'hui ?
<|assistant|>: Bonjour ! Je suis une IA, donc je n'ai pas de sentiments, mais je suis prêt à vous aider. Comment puis-je vous assister aujourd'hui ?</s>
<|user|>: Quelle est la hauteur de la Tour Eiffel ?
<|assistant|>: La Tour Eiffel mesure environ 330 mètres de hauteur.</s>
<|user|>: Comment monter en haut ?
<|assistant|>:
```

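Runtimes that take a raw prompt string instead of a chat template (for example, llama.cpp running the GGUF files in this repo) need the format built by hand. The sketch below reproduces the layout of the printed output above; it is inferred from that output rather than from the template source, and the function name `build_vigogne_prompt` is hypothetical.

```python
from typing import Dict, List

# Default system message, copied from the printed template output above.
DEFAULT_SYSTEM = (
    "Vous êtes Vigogne, un assistant IA créé par Zaion Lab. "
    "Vous suivez extrêmement bien les instructions. Aidez autant que vous le pouvez."
)


def build_vigogne_prompt(
    conversation: List[Dict[str, str]],
    system: str = DEFAULT_SYSTEM,
    bos: str = "<s>",
    eos: str = "</s>",
) -> str:
    """Render a conversation in the <|system|>/<|user|>/<|assistant|> layout.

    Hypothetical helper: the layout is inferred from the printed output of
    apply_chat_template(), not taken from the template itself.
    """
    lines = [f"{bos}<|system|>: {system}"]
    for message in conversation:
        if message["role"] == "user":
            lines.append(f"<|user|>: {message['content']}")
        elif message["role"] == "assistant":
            # Completed assistant turns are closed with the EOS token.
            lines.append(f"<|assistant|>: {message['content']}{eos}")
        else:
            raise ValueError(f"unsupported role: {message['role']!r}")
    # Generation prompt: the model continues after this prefix.
    lines.append("<|assistant|>:")
    return "\n".join(lines)
```

Runtimes that prepend the BOS token themselves (llama.cpp typically does) should be called with `bos=""` to avoid a doubled `<s>`; treating `</s>` and `<|user|>:` as stop strings keeps the model from writing the next user turn.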

## Usage

<!-- ### Inference using the quantized versions

The quantized versions of this model are generously provided by [TheBloke](https://huggingface.co/TheBloke)!

- AWQ for GPU inference: [TheBloke/Vigogne-2-13B-Chat-AWQ](https://huggingface.co/TheBloke/Vigogne-2-13B-Chat-AWQ)
- GPTQ for GPU inference: [TheBloke/Vigogne-2-13B-Chat-GPTQ](https://huggingface.co/TheBloke/Vigogne-2-13B-Chat-GPTQ)
- GGUF for CPU+GPU inference: [TheBloke/Vigogne-2-13B-Chat-GGUF](https://huggingface.co/TheBloke/Vigogne-2-13B-Chat-GGUF)

These versions facilitate testing and development with various popular frameworks, including [AutoAWQ](https://github.com/casper-hansen/AutoAWQ), [vLLM](https://github.com/vllm-project/vllm), [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ), [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa), [llama.cpp](https://github.com/ggerganov/llama.cpp), [text-generation-webui](https://github.com/oobabooga/text-generation-webui), and more. -->

### Inference using the unquantized model with 🤗 Transformers

```python
from typing import Dict, List, Optional

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig, TextStreamer

model_name_or_path = "bofenghuang/vigogne-2-13b-chat"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, padding_side="right", use_fast=False)
model = AutoModelForCausalLM.from_pretrained(model_name_or_path, torch_dtype=torch.float16, device_map="auto")

# Print tokens to stdout as they are generated, without echoing the prompt.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)


def chat(
    query: str,
    history: Optional[List[Dict]] = None,
    temperature: float = 0.7,
    top_p: float = 1.0,
    top_k: int = 0,
    repetition_penalty: float = 1.1,
    max_new_tokens: int = 1024,
    **kwargs,
):
    if history is None:
        history = []

    # Add the new user turn, then render the whole conversation with the chat template.
    history.append({"role": "user", "content": query})

    input_ids = tokenizer.apply_chat_template(history, add_generation_prompt=True, return_tensors="pt").to(model.device)
    input_length = input_ids.shape[1]

    generated_outputs = model.generate(
        input_ids=input_ids,
        generation_config=GenerationConfig(
            temperature=temperature,
            # Greedy decoding when temperature is 0, sampling otherwise.
            do_sample=temperature > 0.0,
            top_p=top_p,
            top_k=top_k,  # 0 disables top-k filtering
            repetition_penalty=repetition_penalty,
            max_new_tokens=max_new_tokens,
            pad_token_id=tokenizer.eos_token_id,
            **kwargs,
        ),
        streamer=streamer,
        return_dict_in_generate=True,
    )

    # Keep only the newly generated tokens (drop the prompt).
    generated_tokens = generated_outputs.sequences[0, input_length:]
    generated_text = tokenizer.decode(generated_tokens, skip_special_tokens=True)

    # Record the assistant reply so the next call sees the full conversation.
    history.append({"role": "assistant", "content": generated_text})

    return generated_text, history


# 1st round: "A snail covers 100 meters in 5 hours. What is its speed?"
response, history = chat("Un escargot parcourt 100 mètres en 5 heures. Quelle est sa vitesse ?", history=None)

# 2nd round: "When can it overtake the rabbit?"
response, history = chat("Quand il peut dépasser le lapin ?", history=history)

# 3rd round: "Write an imaginative story featuring a race between a snail and a rabbit."
response, history = chat("Écris une histoire imaginative qui met en scène une compétition de course entre un escargot et un lapin.", history=history)
```

You can also try the model in the Google Colab notebook below:

<a href="https://colab.research.google.com/github/bofenghuang/vigogne/blob/main/notebooks/infer_chat.ipynb" target="_blank"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Inference using the unquantized model with vLLM

Set up an OpenAI-compatible server with the following command:

```bash
# Install vLLM
# This may take 5-10 minutes.
# pip install vllm

# Start server for Vigogne-Chat models
python -m vllm.entrypoints.openai.api_server --model bofenghuang/vigogne-2-13b-chat

# List models
# curl http://localhost:8000/v1/models
```

Query the model using the `openai` Python package. The snippet below uses the module-level API of `openai<1.0`, which was removed in `openai` 1.0.

```python
# Requires the legacy openai<1.0 API, e.g. `pip install "openai<1.0"`.
import openai

# Point the client at vLLM's API server; the key is a placeholder.
openai.api_key = "EMPTY"
openai.api_base = "http://localhost:8000/v1"

# Use the first model served by the server.
models = openai.Model.list()
model = models["data"][0]["id"]

# Chat completion API
chat_completion = openai.ChatCompletion.create(
    model=model,
    messages=[
        {"role": "user", "content": "Parle-moi de toi-même."},  # "Tell me about yourself."
    ],
    max_tokens=1024,
    temperature=0.7,
)
print("Chat completion results:", chat_completion)
```

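Because `openai` 1.0 removed the module-level calls used above, here is a dependency-free sketch that talks to the same OpenAI-compatible endpoints with Python's standard library. It assumes the vLLM server started above is listening on `localhost:8000`; the helper names `build_chat_request`, `send`, and `first_model_id` are hypothetical.

```python
import json
import urllib.request
from typing import Dict, List

# Assumption: the vLLM server from the command above runs on localhost:8000.
BASE_URL = "http://localhost:8000/v1"


def build_chat_request(
    model: str,
    messages: List[Dict[str, str]],
    max_tokens: int = 1024,
    temperature: float = 0.7,
    base_url: str = BASE_URL,
) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible /chat/completions route."""
    payload = {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 120.0) -> dict:
    """Send a request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def first_model_id(base_url: str = BASE_URL) -> str:
    """Return the id of the first model listed by GET /models (needs a running server)."""
    models = send(urllib.request.Request(f"{base_url}/models"))
    return models["data"][0]["id"]
```

With the server running, `send(build_chat_request(first_model_id(), [{"role": "user", "content": "Parle-moi de toi-même."}]))["choices"][0]["message"]["content"]` should return the reply text, following the OpenAI chat-completions response shape.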

## Limitations

Vigogne is still under development, and many limitations remain to be addressed. The model may generate harmful or biased content, incorrect information, or generally unhelpful answers.