This is a quantized version of Llama2-7B, fine-tuned on the LIMA (Less is More for Alignment) dataset, available at `GAIR/lima` on HuggingFace.
To get started with this model, you'll need to install `transformers` (for the tokenizer) and `ctranslate2` (for the model). You'll
also need `huggingface_hub` to easily download the weights.

```bash
pip install -U transformers ctranslate2 huggingface_hub
```
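
To sanity-check the installation, you can print the installed versions (a quick check, nothing model-specific):

```python
import ctranslate2
import huggingface_hub
import transformers

# If any of these imports fail, the corresponding package is missing.
print("ctranslate2:", ctranslate2.__version__)
print("transformers:", transformers.__version__)
print("huggingface_hub:", huggingface_hub.__version__)
```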

Next, download this repository from the Hub. You can download the files manually and place them in a folder, or use the `huggingface_hub` library
to download them programmatically. Here, we're putting them in a local directory called `Llama2_TaylorAI`.

```python
from huggingface_hub import snapshot_download
snapshot_download(repo_id="TaylorAI/Llama2-7B-SFT-LIMA-ct2", local_dir="Llama2_TaylorAI")
```
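
The model can run on either CPU or GPU. If you're not sure whether CUDA will be available at runtime, `ctranslate2` exposes a device-count helper you can use to pick automatically (a minimal sketch; the example below simply hardcodes `device="cuda"`):

```python
import ctranslate2

# Use the GPU if CTranslate2 detects one, otherwise fall back to CPU.
device = "cuda" if ctranslate2.get_cuda_device_count() > 0 else "cpu"
```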

Then, you can perform inference as follows. Note that the model was trained with the separator `\n\n###\n\n` between the prompt/instruction
and the model's response, so to get the expected result, you'll want to append this to your prompt. The model was also trained to finish its
output with the suffix `@@@`, so you can stop generating tokens once you reach this suffix, or use it to split the completion and keep the
relevant part. All of this is shown in the example below.

```python
from typing import Any

from ctranslate2 import Generator
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TaylorAI/Llama2-7B-SFT-LIMA-ct2")
# Point this wherever you stored this repository. If you have a GPU,
# use device="cuda"; otherwise use device="cpu".
model = Generator("Llama2_TaylorAI", device="cuda")

# Unlike normal Transformers models, CTranslate2 operates on actual "tokens"
# (little subword strings), not token ids (integers).
def tokenize_for_ct2(
    prompt: str,
    prompt_suffix: str,
    tokenizer: Any,
):
    full_prompt = prompt + prompt_suffix
    input_ids = tokenizer.encode(full_prompt)
    input_tokens = tokenizer.convert_ids_to_tokens(input_ids)
    return input_tokens

example_input = "What is the meaning of life?"
example_input_tokens = tokenize_for_ct2(
    example_input, prompt_suffix="\n\n###\n\n", tokenizer=tokenizer
)

# The model returns an iterator, from which we can lazily stream tokens.
completion_tokens = []
it = model.generate_tokens(
    example_input_tokens,
    max_length=1024,
    sampling_topp=0.9,
    sampling_temperature=1.0,
    repetition_penalty=1.5,
)
stop_sequence = "@@@"
for step in it:
    completion_tokens.append(step.token_id)
    # Stop early once we have generated the stop suffix.
    output_so_far = tokenizer.decode(completion_tokens, skip_special_tokens=True)
    if output_so_far.endswith(stop_sequence):
        break

# Drop the stop suffix and keep only the relevant part of the completion.
output = tokenizer.decode(completion_tokens, skip_special_tokens=True).split(stop_sequence)[0]
print(output)
```
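
For repeated use, it's convenient to wrap these steps in a single helper. This is a sketch of our own (reusing the `tokenizer`, `model`, and `tokenize_for_ct2` defined above; the `generate` function is not part of this repository):

```python
def generate(prompt: str, max_length: int = 1024) -> str:
    """Complete `prompt`, stopping at the model's `@@@` suffix."""
    input_tokens = tokenize_for_ct2(prompt, prompt_suffix="\n\n###\n\n", tokenizer=tokenizer)
    completion_tokens = []
    for step in model.generate_tokens(
        input_tokens,
        max_length=max_length,
        sampling_topp=0.9,
        sampling_temperature=1.0,
        repetition_penalty=1.5,
    ):
        completion_tokens.append(step.token_id)
        text = tokenizer.decode(completion_tokens, skip_special_tokens=True)
        if text.endswith("@@@"):
            break
    # Keep only the part of the completion before the stop suffix.
    return tokenizer.decode(completion_tokens, skip_special_tokens=True).split("@@@")[0]

print(generate("Write a haiku about the ocean."))
```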