This is a quantized CTranslate2 version of Llama2-7B, fine-tuned on the LIMA (Less Is More for Alignment) dataset, available at `GAIR/lima` on the HuggingFace Hub. To get started with this model, you'll need to install `transformers` (for the tokenizer) and `ctranslate2` (for the model). You'll also need `huggingface_hub` to easily download the weights.

```
pip install -U transformers ctranslate2 huggingface_hub
```

Next, download this repository from the Hub. You can download the files manually and place them in a folder, or use the `huggingface_hub` library to download them programmatically. Here, we're putting them in a local directory called `Llama2_TaylorAI`.

```python
from huggingface_hub import snapshot_download

snapshot_download(repo_id="TaylorAI/Llama2-7B-SFT-LIMA-ct2", local_dir="Llama2_TaylorAI")
```
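
If you prefer the command line, recent versions of `huggingface_hub` also ship a CLI that does the same thing (this assumes the `huggingface-cli` entry point is on your PATH):

```
huggingface-cli download TaylorAI/Llama2-7B-SFT-LIMA-ct2 --local-dir Llama2_TaylorAI
```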

Then, you can perform inference as follows. Note that the model was trained with the separator `\n\n###\n\n` between the prompt/instruction and the model's response, so to get the expected result, you'll want to append this separator to your prompt. The model was also trained to finish its output with the suffix `@@@`, so you can stop generating tokens once you reach this suffix, or use it to split the completion and keep the relevant part. All of this is shown in the example below.

```python
from typing import Any

from ctranslate2 import Generator
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TaylorAI/Llama2-7B-SFT-LIMA-ct2")
# Point this wherever you stored this repository. If you have a GPU, use device="cuda"; otherwise, "cpu".
model = Generator("Llama2_TaylorAI", device="cuda")

# Unlike normal Transformers models, CTranslate2 operates on actual "tokens"
# (little subword strings), not token ids (integers).
def tokenize_for_ct2(
    prompt: str,
    prompt_suffix: str,
    tokenizer: Any,
):
    full_prompt = prompt + prompt_suffix
    input_ids = tokenizer.encode(full_prompt)
    input_tokens = tokenizer.convert_ids_to_tokens(input_ids)
    return input_tokens

example_input = "What is the meaning of life?"
example_input_tokens = tokenize_for_ct2(example_input, prompt_suffix="\n\n###\n\n", tokenizer=tokenizer)

# The model returns an iterator, from which we can lazily stream tokens.
result = []
it = model.generate_tokens(
    example_input_tokens,
    max_length=1024,
    sampling_topp=0.9,
    sampling_temperature=1.0,
    repetition_penalty=1.5,
)
stop_sequence = "@@@"
for step in it:
    result.append(step.token_id)
    # Stop early once the generated text ends with the stop suffix.
    output_so_far = tokenizer.decode(result, skip_special_tokens=True)
    if output_so_far.endswith(stop_sequence):
        break

output = tokenizer.decode(result, skip_special_tokens=True).split(stop_sequence)[0]
print(output)
```
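
If you don't need to stream tokens, a minimal non-streaming sketch using CTranslate2's batch API (`generate_batch`) could look like the following, reusing the `tokenizer`, `model`, and `example_input_tokens` defined above:

```python
# generate_batch returns completed GenerationResult objects instead of an iterator.
results = model.generate_batch(
    [example_input_tokens],
    max_length=1024,
    sampling_topp=0.9,
    sampling_temperature=1.0,
    repetition_penalty=1.5,
    include_prompt_in_result=False,  # return only the completion, not the prompt
)
output_ids = results[0].sequences_ids[0]
output = tokenizer.decode(output_ids, skip_special_tokens=True).split("@@@")[0]
print(output)
```

Note that the batch API can't stop early at the `@@@` suffix, so generation may run until `max_length`; splitting on the suffix afterward still recovers the relevant part.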