---
library_name: transformers
license: mit
datasets:
- HuggingFaceH4/ultrachat_200k
- Open-Orca/OpenOrca
language:
- en
---

# Model Card for Phi-2-chat-v05

Phi-2-chat-v05 is a fine-tuned version of Phi-2, trained to improve the model's understanding of instructions and multi-turn conversations.
In essence: it now has a concept of stopping after an answer is given, instead of switching into random-generator mode.

Fine-tuning used 25k records from the `HuggingFaceH4/ultrachat_200k` dataset; a sketch of loading such a subset follows.
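As a rough illustration, selecting a 25k-record subset of `ultrachat_200k` could look like the sketch below. The exact split, seed, and preprocessing used for this model aren't documented here, so treat those choices as assumptions.

```python
# Minimal sketch: sample 25k records from ultrachat_200k.
# The split ("train_sft") and seed are assumptions, not the exact recipe used for this model.
from datasets import load_dataset

dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
subset = dataset.shuffle(seed=42).select(range(25_000))
print(subset)
```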


# Prompt format

```
<|system|>
You are a helpful assistant....
<|user|>
Why is the sky blue?
<|assistant|>
The sky appears blue because of a phenomenon called Rayleigh scattering. When sunlight enters the Earth's atmosphere [...]
<|user|>
Who was the phenomenon named after?
<|assistant|>
```

The model generates its output after the special token `<|assistant|>`; that token needs to be present at the end of the prompt for a reliable response.
Alternatively, you can use the tokenizer's `chat_template`, as shown below.
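If you build the prompt string by hand instead of using the chat template, it might look like the following sketch. The exact whitespace between turns is an assumption; when in doubt, prefer `apply_chat_template`.

```python
# Hand-built prompt following the format above. The trailing <|assistant|> tag is what
# tells the model to produce a reply. The newlines between turns are an assumption;
# tokenizer.apply_chat_template is the safer option.
prompt = (
    "<|system|>\n"
    "You are a helpful assistant.\n"
    "<|user|>\n"
    "Why is the sky blue?\n"
    "<|assistant|>\n"
)
```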

# How to use it?

Dependencies:
```
pip install -U torch transformers einops
```

Code for inference:


```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "WeeRobots/phi-2-chat-v05"

model = AutoModelForCausalLM.from_pretrained(model_id, device_map={"": 0}, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False, trust_remote_code=True)

payload = tokenizer.apply_chat_template([
    { 'role': 'system', 'content': '''You are a state machine. The user will add state slot values and you'll keep track of them.''' },
    { 'role': 'user', 'content': '''Place 15 into slot apple''' },
    { 'role': 'assistant', 'content': '''Roger that.''' },
    { 'role': 'user', 'content': '''Bananas slot should be 20''' },
    { 'role': 'assistant', 'content': '''Certainly''' },
    { 'role': 'user', 'content': '''What is the value of Apple + Banana?''' },
], tokenize=False, add_generation_prompt=True,)
device = "cuda"
model_input = tokenizer(payload, return_tensors="pt").to(device)
with torch.no_grad():
  # IMPORTANT: always set eos_token_id in this call. The model is trained to emit the EOS token
  # at the right time, but without it generation may continue with irrelevant text; setting it
  # makes the model stop at the right place.
  model_response = model.generate(**model_input, max_new_tokens=512, eos_token_id=tokenizer.eos_token_id)
  print(tokenizer.decode(model_response[0], skip_special_tokens=False))
```
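To print only the assistant's reply rather than the full conversation, you can slice off the prompt tokens before decoding. This is a generic `transformers` pattern, not something specific to this model:

```python
# Decode only the newly generated tokens (the assistant's reply), dropping the prompt.
# Reuses model_input and model_response from the snippet above.
prompt_length = model_input["input_ids"].shape[1]
reply = tokenizer.decode(model_response[0][prompt_length:], skip_special_tokens=True)
print(reply)
```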

# Non-production quality
Be aware that this fine-tune wasn't thoroughly tested and isn't meant to be used in production; it's intended only for experimentation and hobby projects.