---
library_name: transformers
tags: ["gemma","chatml"]
---
# ChatML Tokenizer for Gemma
This repository provides a fast tokenizer for [google/gemma-7b](https://huggingface.co/google/gemma-7b) that uses the ChatML format. The tokenizer was created by replacing the string values of the original tokens with ids `106` (`<start_of_turn>`) and `107` (`<end_of_turn>`) with the ChatML tokens `<|im_start|>` and `<|im_end|>`.
No new tokens were added in the process, so the original model's embedding matrix does not need to be modified.
_Note: this tokenizer is not 100% ChatML compliant, since [google/gemma-7b](https://huggingface.co/google/gemma-7b) appears to always require the original `<bos>` token at the start of the input. The resulting chat template is therefore `<bos>` + ChatML + `<eos>`._
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("philschmid/gemma-tokenizer-chatml")
messages = [
{"role": "system", "content": "You are Gemma."},
{"role": "user", "content": "Hello, how are you?"},
{"role": "assistant", "content": "I'm doing great. How can I help you today?"},
]
chatml = tokenizer.apply_chat_template(messages, add_generation_prompt=False, tokenize=False)
print(chatml)
# <bos><|im_start|>system
# You are Gemma.<|im_end|>
# <|im_start|>user
# Hello, how are you?<|im_end|>
# <|im_start|>assistant
# I'm doing great. How can I help you today?<|im_end|>\n<eos>
```
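The template logic above can be sketched in plain Python. This is a hypothetical re-implementation for illustration only (the `render_chatml` helper is not part of the repository; the real logic lives in the tokenizer's `chat_template`): each turn is wrapped as `<|im_start|>{role}\n{content}<|im_end|>\n`, with `<bos>` prepended and `<eos>` appended.

```python
def render_chatml(messages, bos="<bos>", eos="<eos>"):
    """Render a list of {"role", "content"} dicts as <bos> + ChatML turns + <eos>.

    Hypothetical helper mirroring what this tokenizer's chat template produces.
    """
    body = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    return bos + body + eos

messages = [
    {"role": "system", "content": "You are Gemma."},
    {"role": "user", "content": "Hello, how are you?"},
]
print(render_chatml(messages))
# <bos><|im_start|>system
# You are Gemma.<|im_end|>
# <|im_start|>user
# Hello, how are you?<|im_end|>
# <eos>
```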
## Test
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("philschmid/gemma-tokenizer-chatml")
original_tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b-it")
# get special tokens
print(tokenizer.special_tokens_map)
print(original_tokenizer.special_tokens_map)
# check length of vocab
assert len(tokenizer) == len(original_tokenizer), "tokenizers do not have the same length"
# tokenize messages
messages = [
{"role": "user", "content": "Hello, how are you?"},
{"role": "assistant", "content": "I'm doing great. How can I help you today?"},
]
chatml = tokenizer.apply_chat_template(messages, add_generation_prompt=False, tokenize=False)
google_format = original_tokenizer.apply_chat_template(messages, add_generation_prompt=False, tokenize=False)
print(f"ChatML: \n{chatml}\n-------------------\nGoogle: \n{google_format}")
```