---
library_name: transformers
license: mit
datasets:
- HuggingFaceH4/ultrachat_200k
- Open-Orca/OpenOrca
language:
- en
---
# Model Card for phi-2-chat-v05
Phi-2-chat-v05 is a fine-tuned version of Phi-2 that improves the model's understanding of instructions and multi-turn conversations.
In essence: it now has a concept of shutting up after an answer is given, as opposed to switching into random-generator mode.
Fine-tuning used 25k records from the `HuggingFaceH4/ultrachat_200k` dataset.
# Prompt format
```
<|system|>
You are a helpful assistant....
<|user|>
Why is the sky blue?
<|assistant|>
The sky appears blue because of a phenomenon called Rayleigh scattering. When sunlight enters the Earth's atmosphere [...]
<|user|>
Who was the phenomenon named after?
<|assistant|>
```
The model generates its output after the special token `<|assistant|>`. That token must be present in the input to get a reliable response.
Alternatively, you can use the tokenizer's `chat_template`, as shown below.
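If you assemble the prompt by hand instead of using `chat_template`, it looks roughly like this. This is a minimal sketch: `build_prompt` is a hypothetical helper, and the exact whitespace around the special tokens is an assumption — the tokenizer's `chat_template` remains the authoritative formatting.

```python
def build_prompt(messages):
    """Hypothetical helper: assemble the <|role|> prompt format by hand.

    Assumes each turn is '<|role|>' on its own line followed by the content;
    the tokenizer's chat_template is the authoritative source of formatting.
    """
    parts = [f"<|{m['role']}|>\n{m['content']}" for m in messages]
    # End with the <|assistant|> token so the model knows to answer next.
    return "\n".join(parts) + "\n<|assistant|>\n"

prompt = build_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Why is the sky blue?"},
])
```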
# How to use it?
Dependencies
```
pip install -U torch transformers einops
```
Code for inference:
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "WeeRobots/phi-2-chat-v05"
model = AutoModelForCausalLM.from_pretrained(model_id, device_map={"": 0}, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False, trust_remote_code=True)

payload = tokenizer.apply_chat_template([
    { 'role': 'system', 'content': '''You are a state machine. The user will add state slot values and you'll keep track of them.''' },
    { 'role': 'user', 'content': '''Place 15 into slot apple''' },
    { 'role': 'assistant', 'content': '''Roger that.''' },
    { 'role': 'user', 'content': '''Bananas slot should be 20''' },
    { 'role': 'assistant', 'content': '''Certainly''' },
    { 'role': 'user', 'content': '''What is the value of Apple + Banana?''' },
], tokenize=False, add_generation_prompt=True)

device = "cuda"
model_input = tokenizer(payload, return_tensors="pt").to(device)

with torch.no_grad():
    # IMPORTANT: always set eos_token_id in this call. The model is trained to emit
    # the eos token at the right time, but without this it may continue generating
    # irrelevant text; setting it makes generation stop at the right place.
    model_response = model.generate(**model_input, max_new_tokens=512, eos_token_id=tokenizer.eos_token_id)

print(tokenizer.decode(model_response[0], skip_special_tokens=False))
```
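Note that `model.generate` returns the prompt tokens followed by the completion, so decoding the full sequence echoes the whole conversation back. If you only want the newly generated answer, you can slice off the prompt length first (a small sketch; `completion_ids` is a hypothetical helper name):

```python
def completion_ids(output_ids, prompt_length):
    # generate() output = prompt tokens + newly generated tokens;
    # keep only the part after the prompt.
    return output_ids[prompt_length:]

# Usage with the snippet above (assumption: batch size 1):
# answer_ids = completion_ids(model_response[0], model_input["input_ids"].shape[-1])
# print(tokenizer.decode(answer_ids, skip_special_tokens=True))
```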
# Not production quality
Be aware that this model tuning wasn't thoroughly tested and isn't meant for production use — only for experimentation and hobby projects.