Why are "add_bos_token" and "add_eos_token" missing in tokenizer_config.json?

#140
by ekurtic

Without these two fields in tokenizer_config.json, I find it impossible to initialize the Llama-3 tokenizer with BOS-token addition disabled.

This behaves as expected:

from transformers import AutoTokenizer
llama2_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
llama3_tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

>>> llama2_tok("hello")
{'input_ids': [1, 22172], 'attention_mask': [1, 1]}

>>> llama3_tok("hello")
{'input_ids': [128000, 15339], 'attention_mask': [1, 1]}

As we can see here, the BOS token is added correctly by both tokenizers.
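For reference, converting the ids back to tokens confirms which special tokens were inserted (a quick check, assuming the stock vocab files):

>>> llama2_tok.convert_ids_to_tokens([1, 22172])
['<s>', '▁hello']
>>> llama3_tok.convert_ids_to_tokens([128000, 15339])
['<|begin_of_text|>', 'hello']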

Now let's try to disable BOS-token addition and enable EOS-token addition:

from transformers import AutoTokenizer
llama2_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf", add_bos_token=False, add_eos_token=True)
llama3_tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B", add_bos_token=False, add_eos_token=True)

>>> llama2_tok("hello")
{'input_ids': [22172, 2], 'attention_mask': [1, 1]}             <----- Good. BOS token not added, EOS token added.

>>> llama3_tok("hello")
{'input_ids': [128000, 15339], 'attention_mask': [1, 1]}        <----- Not good. BOS token still added, EOS token not added.

As can be seen, the Llama-3 tokenizer completely ignored the given add_bos_token and add_eos_token.
From what I have been able to trace, this might be because add_bos_token and add_eos_token are missing from the Llama-3 model's tokenizer_config.json.
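A stopgap that works regardless of which keys the config exposes is to bypass the built-in special-token handling with add_special_tokens=False and append the EOS id by hand; a minimal sketch (assuming the base model's EOS is <|end_of_text|>, id 128001):

from transformers import AutoTokenizer

llama3_tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

# Skip the baked-in post-processing entirely, then add EOS manually.
ids = llama3_tok("hello", add_special_tokens=False)["input_ids"]
ids.append(llama3_tok.eos_token_id)

>>> ids
[15339, 128001]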

Meta Llama org

Hey! This is unfortunately expected for now, and the template processor should be updated. If you check, the tokenizer class used for the two models is not the same.
More details here: https://github.com/huggingface/transformers/issues/30947#issuecomment-2128057992
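In the meantime, one can overwrite the fast tokenizer's post-processor directly; a minimal sketch using the tokenizers library's TemplateProcessing (assuming the base model's <|end_of_text|> EOS; not an official fix):

from transformers import AutoTokenizer
from tokenizers.processors import TemplateProcessing

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
eos = tok.eos_token  # "<|end_of_text|>" for the base model

# Replace the template baked into tokenizer.json: no BOS, append EOS.
tok.backend_tokenizer.post_processor = TemplateProcessing(
    single=f"$A {eos}",
    pair=f"$A {eos} $B:1 {eos}:1",
    special_tokens=[(eos, tok.eos_token_id)],
)

>>> tok("hello")
{'input_ids': [15339, 128001], 'attention_mask': [1, 1]}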
