128K usage?

#1 by Downtown-Case

Hello. I see this model is configured for a max context size of 128K... but upon trying it at 128K, it doesn't seem coherent at all.

I noticed that Qwen 2.5 Instruct requires YaRN scaling for >32K usage, but this base model's card makes no mention of that.

Is this just an oversight? Is YaRN required for long-context usage with this base model as well?

@Downtown-Case The context length of the base model was determined by perplexity (PPL) evaluation: we did not observe significant PPL degradation at 128K context length. However, this does not imply that the base model can generate coherent content at such long contexts. We recommend using the instruct model for generation tasks.
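As a rough illustration of that kind of check (not the exact evaluation pipeline used here), a long-context PPL measurement can be as simple as scoring one long document; the checkpoint name and file path below are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B"  # placeholder; use the base model you are testing
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

# Any document that tokenizes to roughly the context length you want to test.
long_text = open("long_document.txt").read()
ids = tokenizer(long_text, return_tensors="pt").input_ids.to(model.device)
ids = ids[:, :131072]  # cap at the advertised 128K context length

# A single full-length forward pass needs a lot of memory; in practice you
# would chunk the sequence or use an efficient attention implementation.
with torch.no_grad():
    # labels=input_ids makes the model return the mean next-token loss
    loss = model(ids, labels=ids).loss

print(f"tokens: {ids.shape[1]}, perplexity: {torch.exp(loss).item():.2f}")
```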

If you use YaRN on the base model, you should also get better long-text results, but coherence between different parts of the text still isn't guaranteed, since the model hasn't been trained for that.
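For example, with transformers the rope_scaling override can be applied at load time. This is a minimal sketch following the YaRN recipe documented for the Qwen2.5 instruct models; the checkpoint name is a placeholder and the exact values should be checked against the model card you are using:

```python
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B"  # placeholder base checkpoint

config = AutoConfig.from_pretrained(model_name)
# Same rope_scaling block the instruct card documents for >32K contexts.
config.rope_scaling = {
    "type": "yarn",
    "factor": 4.0,                               # 32768 * 4 = 131072 tokens
    "original_max_position_embeddings": 32768,
}

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    config=config,
    torch_dtype="auto",
    device_map="auto",
)
```

Adding the same rope_scaling block to config.json directly has the same effect; the snippet above is just the equivalent override at load time.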

Thanks. In some of my use cases (raw novel continuation), base models sometimes work better.

I'll try it some more; I might just be holding it wrong. Otherwise I'll just use the instruct model.

Downtown-Case changed discussion status to closed
