Aymeric Roucher

m-ric

AI & ML interests

MLE at Hugging Face 🤗 LLMs, Agents, RAG, Multimodal.

๐—ฌ๐—ผ๐˜‚ ๐—ฑ๐—ผ๐—ป'๐˜ ๐—ป๐—ฒ๐—ฒ๐—ฑ "๐—ณ๐˜‚๐—ป๐—ฐ๐˜๐—ถ๐—ผ๐—ป ๐—ฐ๐—ฎ๐—น๐—น๐—ถ๐—ป๐—ด ๐—ณ๐—ถ๐—ป๐—ฒ-๐˜๐˜‚๐—ป๐—ถ๐—ป๐—ด" ๐˜๐—ผ ๐—ฏ๐˜‚๐—ถ๐—น๐—ฑ ๐—ด๐—ผ๐—ผ๐—ฑ ๐—ฎ๐—ด๐—ฒ๐—ป๐˜๐˜€ โ›”

It's trendy to share models "fine-tuned for function calling", but from my observations, this fine-tuning is neither necessary nor sufficient to build good agent systems.
To name only a few:
๐Ÿฆโ€โฌ› Nexusflow/๐—ก๐—ฒ๐˜…๐˜‚๐˜€๐—ฅ๐—ฎ๐˜ƒ๐—ฒ๐—ป-๐—ฉ๐Ÿฎ-๐Ÿญ๐Ÿฏ๐—•
โŒ˜ CohereForAI/๐—ฐ๐Ÿฐ๐—ฎ๐—ถ-๐—ฐ๐—ผ๐—บ๐—บ๐—ฎ๐—ป๐—ฑ-๐—ฟ-๐—ฝ๐—น๐˜‚๐˜€
โ›ต๏ธ mistralai/๐— ๐—ถ๐˜…๐˜๐—ฟ๐—ฎ๐—น-๐Ÿด๐˜…๐Ÿฎ๐Ÿฎ๐—•-๐—œ๐—ป๐˜€๐˜๐—ฟ๐˜‚๐—ฐ๐˜-๐˜ƒ๐Ÿฌ.๐Ÿญ
"Fine-tuned for function-calling" generally means "fine-tuned to generate function calls in correct JSON for extremely simple tasks". In other terms, it means "improve the formatting of the tool calls".

Yet I discovered two things while improving Transformers Agents:
๐Ÿง Even when used as JSON agents, these fine-tuned models don't perform very well
๐Ÿ… ๐™‚๐™ค๐™ค๐™™ ๐™—๐™–๐™จ๐™š ๐™ข๐™ค๐™™๐™š๐™ก๐™จ ๐™ฅ๐™š๐™ง๐™›๐™ค๐™ง๐™ข ๐™—๐™š๐™ฉ๐™ฉ๐™š๐™ง ๐™ฌ๐™ž๐™ฉ๐™๐™ค๐™ช๐™ฉ ๐™–๐™ฃ๐™ฎ ๐™›๐™ž๐™ฃ๐™š-๐™ฉ๐™ช๐™ฃ๐™ž๐™ฃ๐™œ, ๐™Ÿ๐™ช๐™จ๐™ฉ ๐™ฅ๐™ก๐™–๐™ž๐™ฃ ๐™ฅ๐™ง๐™ค๐™ข๐™ฅ๐™ฉ๐™ž๐™ฃ๐™œ. (Llama-3-70B-Instruct, GPT-4o, Claude-3.5-Sonnet)

👇 The graph below shows the count of errors for my GPT-4o validation run on the GAIA benchmark: AgentParsingError and AgentExecutionError are the ones caused by incorrect formatting.
➤ As you can see, their count is already close to 0!
And given that GPT-4o is certainly not fine-tuned for our Code tool calling format, this shows that "function calling fine-tuning" is not necessary!
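The tallying behind such a graph is straightforward. Here is a hedged sketch of counting error categories over per-task run logs; the log structure and the error label `AgentReasoningError` are hypothetical stand-ins, not the framework's actual log format.

```python
from collections import Counter

# Hypothetical per-task logs from a validation run. The first two error
# names mirror the formatting-error categories mentioned above; the
# reasoning-error label is an invented example of a non-formatting failure.
run_logs = [
    {"task": 1, "error": None},
    {"task": 2, "error": "AgentParsingError"},
    {"task": 3, "error": "AgentReasoningError"},
    {"task": 4, "error": None},
]

# Tally only the failed tasks, grouped by error type.
error_counts = Counter(log["error"] for log in run_logs if log["error"])
print(error_counts)  # e.g. Counter({'AgentParsingError': 1, 'AgentReasoningError': 1})
```

The point of the real graph is that the formatting buckets are nearly empty, while planning failures dominate.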

The hardest thing to get right in an agent is still to plan good task-solving trajectories over several steps.
To improve this, we could:
- Use more powerful base models
- Make tool calling datasets with complex solving trajectories
- Use RL! cc @lvwerra
๐“๐ซ๐š๐ง๐ฌ๐Ÿ๐จ๐ซ๐ฆ๐ž๐ซ๐ฌ ๐€๐ ๐ž๐ง๐ญ๐ฌ ๐ซ๐ž๐š๐œ๐ก๐ž๐ฌ ๐ญ๐ก๐ž ๐ญ๐จ๐ฉ ๐จ๐Ÿ ๐†๐€๐ˆ๐€ ๐ฅ๐ž๐š๐๐ž๐ซ๐›๐จ๐š๐ซ๐! ๐Ÿฅณ

We've been improving Transformers Agents a lot lately.

So with @sergeipetrov we set out to prove that it's the best agent framework out there.

To prove this, we set out to beat the GAIA leaderboard, the most comprehensive benchmark for evaluating LLM agents.
Its questions make you explore different flavours of pain:

๐Ÿ› ๏ธ ๐—ฅ๐—ฒ๐—พ๐˜‚๐—ถ๐—ฟ๐—ฒ ๐˜‚๐˜€๐—ถ๐—ป๐—ด ๐˜๐—ผ๐—ผ๐—น๐˜€, at least a web browser
๐Ÿ”ข ๐—ฅ๐—ถ๐—ด๐—ผ๐—ฟ๐—ผ๐˜‚๐˜€ ๐—น๐—ผ๐—ด๐—ถ๐—ฐ, many questions having strong math aspects
๐Ÿ–ผ๏ธ ๐— ๐˜‚๐—น๐˜๐—ถ๐—บ๐—ผ๐—ฑ๐—ฎ๐—น, the agent had to handle all file types: ๐Ÿ”Š, ๐Ÿ–ผ๏ธ, ๐ŸŽฌ...
๐Ÿ‘ฃ ๐— ๐˜‚๐—น๐˜๐—ถ-๐˜€๐˜๐—ฒ๐—ฝ, with many questions requiring over 10 steps to be solved.

Some Level 3 questions are crazy hard 😳
> "In NASA's Astronomy Picture of the Day on 2006 January 21, two astronauts are visible, with one appearing much smaller than the other. As of August 2023, out of the astronauts in the NASA Astronaut Group that the smaller astronaut was a member of, which one spent the least time in space, and how many minutes did he spend in space, rounded to the nearest minute?"
(no file attached of course, the agent has to find all the info)

โžก๏ธ We used Transformers Agents' React Code Agent, that writes its actions in code. We created a new planning component that we'll incorporate in the framework. More info soon in a blog post!

๐‘๐ž๐ฌ๐ฎ๐ฅ๐ญ๐ฌ:
๐Ÿš€ Our submission scores #2 overall on the test set and #1 on the validation set. On both sets we're the leading submission based on a public framework, beating Microsoft's Autogen.
๐Ÿฅ‡ On both sets we are #1 on the hardest Level 3 questions, reaching nearly 20%.

๐™‚๐™ค ๐™˜๐™๐™š๐™˜๐™  ๐™ค๐™ช๐™ฉ ๐™ฉ๐™๐™š ๐™ก๐™š๐™–๐™™๐™š๐™ง๐™—๐™ค๐™–๐™ง๐™™ ๐Ÿ‘‰ gaia-benchmark/leaderboard