Chat template during finetuning?

#2
by wvangils - opened

Hi, did you use the original Llama-3 chat template while finetuning? The template is now missing from the tokenizer config, so it defaults to ChatML. When I apply a chat template and use the model to follow an instruction, inference times are long. Does this sound familiar?
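For reference, this is roughly the call in question, a minimal sketch (the model id is a placeholder):

```python
# Minimal sketch of the call in question; the model id is a placeholder.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ReBatch/Llama-3-8B-dutch")  # placeholder

messages = [{"role": "user", "content": "Vat de volgende tekst samen: ..."}]

# With no chat_template in tokenizer_config.json, older transformers
# versions fall back to a ChatML-style default template here.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```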

ReBatch org

I did not use the Llama-3 chat template; the model is trained on the ChatML template.
I don't fully understand the question: applying a chat template should not lead to longer inference time per token. Unless your conversation is very long, only the first token can take a bit longer.
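If the template is missing from tokenizer_config.json, you can also pin the standard ChatML template on the tokenizer yourself; a minimal sketch, assuming the usual `<|im_start|>`/`<|im_end|>` markers:

```python
# Sketch: set the standard ChatML Jinja template explicitly, assuming
# the model was trained with <|im_start|>/<|im_end|> markers.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ReBatch/Llama-3-8B-dutch")  # placeholder

tokenizer.chat_template = (
    "{% for message in messages %}"
    "{{ '<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n' }}"
    "{% endfor %}"
    "{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"
)
```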

Okay, thanks. I know, this should not be the case. However, I use an instruction to summarize (400 tokens) and I supply context (1,000 tokens). On the original Llama-3 8B model, inference is done within seconds. When I use the Dutch-finetuned model, inference takes quite long (30+ seconds). I will take another look at this later today to see if something is different in the parameters.
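One thing I will check is whether generation actually stops on the ChatML end token; if the eos token is not set correctly, decoding always runs to max_new_tokens, which would explain the 30+ seconds at identical per-token speed. A sketch of the check (the model id is a placeholder, and it assumes `<|im_end|>` is in the vocabulary):

```python
# Sketch: force generation to stop on the ChatML end token.
# Assumptions: placeholder model id; <|im_end|> exists in the vocab.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ReBatch/Llama-3-8B-dutch"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Vat de volgende tekst samen: ..."}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=400,
    # If this is unset or wrong, generation only stops at max_new_tokens,
    # which makes every call slow regardless of the answer length.
    eos_token_id=tokenizer.convert_tokens_to_ids("<|im_end|>"),
)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```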

ReBatch org

That is weird; the model architecture is exactly the same, only the weights are different.
