There's a HUGE drop in popular knowledge from v2 to v2.5.

by phil111 - opened

Qwen2 72b scored 73.9 on my popular knowledge test (movies, songs, games, sports...) compared to 77.9 for Llama 3.1 70b, and between 62 and 64 for Llama 3.1 8b & Gemma 2 9b.

However, Qwen2.5 72b scored about the same as Qwen2 7b (~50). I did a vibe check with Qwen2.5 32b & 7b and the same thing is happening (there's a huge drop in general knowledge compared to v2). Thanks for the v2.5 family, but I fail to see how hallucinating like crazy about what people care most about (popular knowledge) in order to achieve small bumps in test scores (e.g. MMLU) is a reasonable compromise.

Example: "Who sang the modern hit song Dear Future Husband? What album is it from? And what year was it released?"

Even small LLMs get this one right, including tiny little Gemma 2 2b (Meghan Trainor, Album name is Title, year is 2015) and your 10x smaller Qwen2 7b.

But all Qwen2.5s, including the 72b version (below), got it wrong, along with most other similar questions about movies, shows, sports, games, and other areas of non-academic popular knowledge. I tested locally w/ GGUFs, but it still got it wrong on LMsys, so it's unlikely I configured something incorrectly. Qwen2.5 72b has about the same amount of popular knowledge as Gemma 2 2b despite being ~36x bigger.
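If anyone wants to reproduce this locally, something like the following works (a rough sketch, assuming llama-cpp-python; the GGUF filename and expected-answer strings are just placeholders for my actual test setup):

```python
# Rough sketch: ask a pop-culture question against a local GGUF
# and check the reply for the expected facts. Assumes llama-cpp-python
# is installed; the model path below is a placeholder.
from llama_cpp import Llama

llm = Llama(model_path="qwen2.5-72b-instruct-q4_k_m.gguf", n_ctx=2048, verbose=False)

question = ("Who sang the modern hit song Dear Future Husband? "
            "What album is it from? And what year was it released?")
expected = ["Meghan Trainor", "Title", "2015"]  # ground-truth facts

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": question}],
    temperature=0.0,  # deterministic so the result is reproducible
    max_tokens=256,
)
answer = out["choices"][0]["message"]["content"]

# Naive substring match is enough to flag a miss like "Mandy Moore"
hits = [fact for fact in expected if fact.lower() in answer.lower()]
print(answer)
print(f"Matched {len(hits)}/{len(expected)} expected facts: {hits}")
```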

Response: "The song "Dear Future Husband" was performed by Mandy Moore. It's from her album Silver Lining, which was released in 2003.

However, it's worth noting that while this version of the song is well-known, a more recent and widely popular version was covered by the band Fifth Harmony. Their version appeared on their album 7/27 and was released in 2016..."

It's crazy how censored this model is. Even censorship lovers from California didn't go that far.


@wcde I agree, it's definitely too censored, but not much more than Llama 3.1, and far less than Microsoft's Phi 3.5, which is the most censored LLM I've ever come across.

I have not done a ton of testing but this model is awesome. I am super interested in the 32B Coder model. The 7B is very nice already.

@gefit I'm sure the coding versions are good at coding, and the math versions are good at math.

However, a general purpose instruct LLM needs to answer random questions about very popular domains of knowledge that random people come across, such as about top movies, games, music, TV shows, sports, celebrities, popular literature, and so on.

And it's not just about Q&A. How are you supposed to chat with an LLM that's so profoundly ignorant about what the majority of the population cares the most about (it's called pop culture for a reason, it's popular)? How are you supposed to write stories if you don't know the basic facts about what you're writing about?

I really wish LLM makers would stop chasing test scores and bragging about them (e.g. 'we're only 72b parameters but are achieving test scores comparable to a 405b model'). Did you train on better data than Meta? Did you train for longer with more compute? Did you do anything special? Better? Or did you just focus on the data that overlaps with the standardized LLM tests at the expense of the rest of humanity's knowledge? Qwen2.5 72b is still too error-prone and low IQ for academic use, so the small boost in its academic performance really doesn't make it any more usable.

You may have boosted the test scores of Qwen Instruct, but it's now effectively useless as a general purpose LLM to anybody with half a brain and diverse interests.

I can report the same; it doesn't have as good world knowledge. But its translation is the best I've seen from an open-source model, really unrivalled except maybe by Llama 405B, which I haven't tried. The question I always ask is "Who is Victor Tsoi and what songs did he release?" The output quality on this question is about as good as WizardLM-2-8x22B: it gets a lot of the songs right but makes up a good chunk of them.


@nonetrix Qwen2.5 72b does a very good job with the top 0.01% of pop culture (e.g. the cast of the TV show Friends), but there's a sudden drop-off, and Viktor is arguably the most famous Russian singer, so I'm not surprised Qwen2.5 does well on this question, although it only appeared to get 4 of the 8 songs it listed correct. I tested Llama 3.1 70b Q4 with the same question and it got 7/7 correct (it initially listed only 4, so I asked for a few more), but it did make a couple of translation errors (e.g. Quiet vs Calm Night).

Llama 3.1 70b: "Viktor Robertovich Tsoi was a Soviet musician, singer, and songwriter who co-founded the Russian rock band Kino. He is widely regarded as one of the most influential and iconic musicians in Russian history.

Some of his most popular songs include:

  1. "Khochu Peremen" ("I Want Changes")
  2. "Zvezda Po Imeni Solntse" ("A Star Called Sun")
  3. "Gruppa Krovi" ("Blood Type")
  4. "Posledniy Geroi" ("The Last Hero")

These songs are considered some of the greatest hits of Kino and Viktor Tsoi's solo career."

Llama 3.1 70b (list a few more): "..."Pachka Sigaret" ("Pack of Cigarettes")
"Sledi Za Soboi" ("Traces Behind")
"Spokoynaya Noch" ("Quiet Night")..."

Regardless, I ask a large number of these types of questions and even Llama 3.1 8b does far better than Qwen2.5 72b (64.9 vs ~52), which is very odd because Qwen2 72b did far better than L3.1 8b (73.9). There's simply no excuse for a 7b parameter LLM, let alone a 72b parameter one, to be so profoundly ignorant about the lion's share of pop culture.

> However, a general purpose instruct LLM needs to answer random questions about very popular domains of knowledge that random people come across, such as about top movies, games, music, TV shows, sports, celebrities, popular literature, and so on.
>
> And it's not just about Q&A. How are you supposed to chat with an LLM that's so profoundly ignorant about what the majority of the population cares the most about (it's called pop culture for a reason, it's popular)? How are you supposed to write stories if you don't know the basic facts about what you're writing about?

Can't echo this enough; it's a shame how much data like this has been disregarded in recent LLM releases... Surely it's possible to include such things without any great detriment to reasoning/coding - and even then, I would be able to put up with that given the existence of dedicated math/code models.

It would be a great differentiator between Qwen and other open models if this were to change in the future.

Previous Qwen needed pre-fill and other hints to write normally. I'm not sure this one is more censored. It wrote like a channer in the demo no problem. It seemed to know who Genshin characters and VTubers were. Then again, I can't argue with your benchmark tests.

It took finetuning into turbocat/magnum/etc. to make 2.0 shine and produce decent prose. As released, it was similarly meh.

@jackboot Massively popular cultural information (the top ~0.01%) can be recovered from Qwen2.5 72b with perfect accuracy, such as the entire casts, and their respective character names, from the TV shows Friends and The Big Bang Theory (most watched shows globally). Same goes for Genshin (wildly popular in China), music legends like Madonna & Michael Jackson, and so on.

Normally LLMs have a smooth knowledge slope, with hallucinations gradually increasing as you ask about progressively less popular culture. This is the case with Qwen2 72b, Llama 3.1 70b, Gemma 2, Mistrals (e.g. Small & Mixtral), and so on.

In contrast, some models use a highly curated corpus to maximize test scores at a given size, most notably Phi 3, Yi, and InternLM. All three have the highest HF scores in their size ranges and can perfectly recall the top ~0.01% of cultural information, but then suddenly have a huge spike in hallucinations, with their ~8b LLMs scoring only 35-40 on my pop culture test (Llama 3.1 & Gemma 2 ~8b score 62+). I pasted Yi1.5-9b's Corner Gas cast output below as an example of how far off they are when it comes to reasonably popular culture (Corner Gas was a top 5 show in Canada). But again, it got the entire Friends and The Big Bang Theory casts right without a single error. Even Yi1.5-34b only scores 52.2 (not much better despite being much larger than Yi 9b).

Qwen2.5 is now doing the same. They've decided to sacrifice the large bulk of popular culture information in order to add more tokens that have a higher probability of showing up on standardized LLM tests like the MMLU, and train on them longer.

Yi1.5-9b's output:

  1. Hank Pomerleau - Percy Blain
  2. Norm MacDonald - Norm Udstrand
  3. Edie Clifton - Lorna Wilson
  4. Carla Pomerleau - Krysten Henderson
  5. Arthur Pratt - Paul Hermann Goldsmith (He was a voice actor for Arthur's character)
  6. Gordon Bell - Andrew Millar

Should be...

Brent Leroy (Brent Butt) - Main Character
Lacey Burrows (Gabrielle Miller) - Restaurant Owner
Hank Yarbo (Fred Ewanuick) - Friend
Oscar Leroy (Eric Peterson) - Father
Emma Leroy (Janet Wright) - Mother
Davis Quinton (Lorne Cardinal) - Cop
Karen Pelly (Tara Spencer-Nairn) - Cop
Wanda Dollard (Nancy Robertson) - Employee
Fitzy (Cavan Cunningham) - Mayor
