m-ric
posted an update 12 days ago
๐“๐ซ๐š๐ง๐ฌ๐Ÿ๐จ๐ซ๐ฆ๐ž๐ซ๐ฌ ๐€๐ ๐ž๐ง๐ญ๐ฌ ๐ซ๐ž๐š๐œ๐ก๐ž๐ฌ ๐ญ๐ก๐ž ๐ญ๐จ๐ฉ ๐จ๐Ÿ ๐†๐€๐ˆ๐€ ๐ฅ๐ž๐š๐๐ž๐ซ๐›๐จ๐š๐ซ๐! ๐Ÿฅณ

We've been improving Transformers Agents a lot lately.

So @sergeipetrov and I set out to prove that it's the best agent framework out there.

To do so, we took on the GAIA leaderboard, the most comprehensive benchmark for evaluating LLM agents.
Its questions make you explore different flavours of pain:

🛠️ Require using tools, at least a web browser
🔢 Rigorous logic, many questions having strong math aspects
🖼️ Multimodal, the agent has to handle all file types: 🔊, 🖼️, 🎬...
👣 Multi-step, with many questions requiring over 10 steps to be solved.

Some Level 3 questions are crazy hard 😳
> "In NASA's Astronomy Picture of the Day on 2006 January 21, two astronauts are visible, with one appearing much smaller than the other. As of August 2023, out of the astronauts in the NASA Astronaut Group that the smaller astronaut was a member of, which one spent the least time in space, and how many minutes did he spend in space, rounded to the nearest minute?"
(no file attached, of course: the agent has to find all the info)

โžก๏ธ We used Transformers Agents' React Code Agent, that writes its actions in code. We created a new planning component that we'll incorporate in the framework. More info soon in a blog post!

๐‘๐ž๐ฌ๐ฎ๐ฅ๐ญ๐ฌ:
๐Ÿš€ Our submission scores #2 overall on the test set and #1 on the validation set. On both sets we're the leading submission based on a public framework, beating Microsoft's Autogen.
๐Ÿฅ‡ On both sets we are #1 on the hardest Level 3 questions, reaching nearly 20%.

๐™‚๐™ค ๐™˜๐™๐™š๐™˜๐™  ๐™ค๐™ช๐™ฉ ๐™ฉ๐™๐™š ๐™ก๐™š๐™–๐™™๐™š๐™ง๐™—๐™ค๐™–๐™ง๐™™ ๐Ÿ‘‰ gaia-benchmark/leaderboard

Congrats! I'm excited to recreate it locally and look at how it works under the hood.

I believe this is where the code for the benchmark run lives?
https://github.com/aymeric-roucher/agent_reasoning_benchmark

I haven't been able to get it to run properly though. I'm aware it depends on unstable versions of Transformers Agents.

I'd love to be able to run it, and I can also push some fixes I had to make to get it running!
