m-ric
posted an update 12 days ago
๐“๐ซ๐š๐ง๐ฌ๐Ÿ๐จ๐ซ๐ฆ๐ž๐ซ๐ฌ ๐€๐ ๐ž๐ง๐ญ๐ฌ ๐ซ๐ž๐š๐œ๐ก๐ž๐ฌ ๐ญ๐ก๐ž ๐ญ๐จ๐ฉ ๐จ๐Ÿ ๐†๐€๐ˆ๐€ ๐ฅ๐ž๐š๐๐ž๐ซ๐›๐จ๐š๐ซ๐! ๐Ÿฅณ

We've been improving Transformers Agents a lot lately.

So @sergeipetrov and I set out to prove that it's the best agent framework out there.

To do so, we took on the GAIA leaderboard, the most comprehensive benchmark for evaluating LLM agents.
Its questions make you explore different flavours of pain:

🛠️ Require using tools, at least a web browser
🔢 Rigorous logic, many questions having strong math aspects
🖼️ Multimodal, the agent has to handle all file types: 🔊, 🖼️, 🎬...
👣 Multi-step, with many questions requiring over 10 steps to be solved.

Some Level 3 questions are crazy hard 😳
> "In NASA's Astronomy Picture of the Day on 2006 January 21, two astronauts are visible, with one appearing much smaller than the other. As of August 2023, out of the astronauts in the NASA Astronaut Group that the smaller astronaut was a member of, which one spent the least time in space, and how many minutes did he spend in space, rounded to the nearest minute?"
(no file attached, of course: the agent has to find all the info)

โžก๏ธ We used Transformers Agents' React Code Agent, that writes its actions in code. We created a new planning component that we'll incorporate in the framework. More info soon in a blog post!

๐‘๐ž๐ฌ๐ฎ๐ฅ๐ญ๐ฌ:
๐Ÿš€ Our submission scores #2 overall on the test set and #1 on the validation set. On both sets we're the leading submission based on a public framework, beating Microsoft's Autogen.
๐Ÿฅ‡ On both sets we are #1 on the hardest Level 3 questions, reaching nearly 20%.

๐™‚๐™ค ๐™˜๐™๐™š๐™˜๐™  ๐™ค๐™ช๐™ฉ ๐™ฉ๐™๐™š ๐™ก๐™š๐™–๐™™๐™š๐™ง๐™—๐™ค๐™–๐™ง๐™™ ๐Ÿ‘‰ gaia-benchmark/leaderboard

Congrats! I'm excited to recreate it locally and look at how it works under the hood.

I believe this is where the code for the benchmark run lives?
https://github.com/aymeric-roucher/agent_reasoning_benchmark

I haven't been able to get it to run properly though. I'm aware it depends on unstable versions of Transformers Agents.

I'd love to be able to run it, and I can also push some fixes I had to make to get it running!
