---
library_name: transformers
license: mit
datasets:
  - HuggingFaceH4/ultrachat_200k
  - Open-Orca/OpenOrca
language:
  - en
---

# Model Card for Phi-2-chat-v05

Phi-2-chat-v05 is a finetuned version of Phi-2 that improves the model's understanding of instructions and multi-turn conversations. In essence, it now has a concept of shutting up after an answer is given, as opposed to just switching into random-generation mode.

Finetuning used 25k records from the `HuggingFaceH4/ultrachat_200k` dataset.

# Prompt format

```
<|system|>
You are a helpful assistant....
<|user|>
Why is the sky blue?
<|assistant|>
The sky appears blue because of a phenomenon called Rayleigh scattering. When sunlight enters the Earth's atmosphere [...]
<|user|>
Who was the phenomenon named after?
<|assistant|>
```

The model generates its output after the special token `<|assistant|>`, so you need to include that token at the end of the input to get a reliable response. Alternatively, you can use the tokenizer's `chat_template`, as shown below.

# How to use it?

Dependencies:

```
pip install -U torch transformers einops
```

Code for inference:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "WeeRobots/phi-2-chat-v05"

model = AutoModelForCausalLM.from_pretrained(model_id, device_map={"": 0}, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False, trust_remote_code=True)

payload = tokenizer.apply_chat_template(
    [
        {"role": "system", "content": "You are a state machine. The user will add state slot values and you'll keep track of them."},
        {"role": "user", "content": "Place 15 into slot apple"},
        {"role": "assistant", "content": "Roger that."},
        {"role": "user", "content": "Bananas slot should be 20"},
        {"role": "assistant", "content": "Certainly"},
        {"role": "user", "content": "What is the value of Apple + Banana?"},
    ],
    tokenize=False,
    add_generation_prompt=True,
)

device = "cuda"
model_input = tokenizer(payload, return_tensors="pt").to(device)

with torch.no_grad():
    # IMPORTANT: always set eos_token_id in this call. The model is trained to emit the
    # eos_token at the right time, but without it generation may continue with irrelevant
    # text; passing it makes generation stop at the right place.
    model_response = model.generate(
        **model_input,
        max_new_tokens=512,
        eos_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(model_response[0], skip_special_tokens=False))
```

# Non-production quality

Be aware that this fine-tune wasn't thoroughly tested and isn't meant to be used in production; use it only for experimentation or hobby projects.
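
# Decoding only the reply

If you only want the assistant's answer rather than the full sequence, you can slice off the prompt tokens before decoding. This is a minimal sketch, assuming the `model_input`, `model_response`, and `tokenizer` variables from the inference example above:

```python
# For decoder-only models, generate() returns the prompt tokens followed by the
# newly generated tokens, so dropping the first `prompt_length` tokens leaves
# only the reply. Assumes `model_input` and `model_response` from the example above.
prompt_length = model_input["input_ids"].shape[-1]
reply_ids = model_response[0][prompt_length:]
print(tokenizer.decode(reply_ids, skip_special_tokens=True))
```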