Why is ChatGLM3 better than GLM-4 on the LVEval benchmark?

#48
by AnaRhisT - opened

Hi,

I'm testing both chatglm3-6b (32K context length) and glm-4-9b-chat (128K context length) on LVEval (using a 32K context length for glm-4 as well),
and ChatGLM3's results are much better than GLM-4's.

Any ideas why this is happening?

Knowledge Engineering Group (KEG) & Data Mining at Tsinghua University org

We haven't tested this dataset, so could you please verify whether GLM-4-9b-chat performs well for everyday long-text usage (e.g., document Q&A)? Here is a demo you can try: https://github.com/THUDM/GLM-4/blob/main/composite_demo/README_en.md. If incorrect usage has been ruled out as the cause, read on.

Unlike GLM3, GLM-4-9b-chat has been deeply optimized for user scenarios such as everyday document Q&A. This may affect its scores on some benchmarks, but we believe that optimizing for real user scenarios results in a better overall experience. We look forward to hearing whether GLM-4-9b-chat improves yours.
