Why is ChatGLM3 better than GLM-4 on the LVEval benchmark?

#48
by AnaRhisT - opened

Hi,

I'm testing both chatglm3-6b (32K context length) and glm-4-9b-chat (128K context length) on LVEval (using a 32K context length for glm-4 as well),
and ChatGLM3's results are much better than GLM-4's.

Any ideas why this is happening?

Knowledge Engineering Group (KEG) & Data Mining at Tsinghua University org

We haven't tested this dataset, so could you please verify whether GLM-4-9b-chat performs well for everyday long-text usage (e.g., document Q&A)? Here is a demo you can try: https://github.com/THUDM/GLM-4/blob/main/composite_demo/README_en.md. If incorrect usage has been ruled out as the cause, read on.

Unlike GLM3, GLM-4-9b-chat has been deeply optimized for user scenarios such as everyday document Q&A. This may affect its scores on some benchmarks, but we believe that optimizing for real user scenarios results in a better overall experience. We look forward to hearing whether GLM-4-9b-chat improves yours.
