Aymeric Roucher

m-ric

AI & ML interests

MLE at Hugging Face 🤗 LLMs, Agents, RAG, Multimodal.

๐—ฌ๐—ผ๐˜‚ ๐—ฑ๐—ผ๐—ป'๐˜ ๐—ป๐—ฒ๐—ฒ๐—ฑ "๐—ณ๐˜‚๐—ป๐—ฐ๐˜๐—ถ๐—ผ๐—ป ๐—ฐ๐—ฎ๐—น๐—น๐—ถ๐—ป๐—ด ๐—ณ๐—ถ๐—ป๐—ฒ-๐˜๐˜‚๐—ป๐—ถ๐—ป๐—ด" ๐˜๐—ผ ๐—ฏ๐˜‚๐—ถ๐—น๐—ฑ ๐—ด๐—ผ๐—ผ๐—ฑ ๐—ฎ๐—ด๐—ฒ๐—ป๐˜๐˜€ โ›”

It's trendy to share models "fine-tuned for function calling", but from my observations, this fine-tuning is neither necessary nor sufficient to build good agent systems.
To name only a few:
๐Ÿฆโ€โฌ› Nexusflow/๐—ก๐—ฒ๐˜…๐˜‚๐˜€๐—ฅ๐—ฎ๐˜ƒ๐—ฒ๐—ป-๐—ฉ๐Ÿฎ-๐Ÿญ๐Ÿฏ๐—•
โŒ˜ CohereForAI/๐—ฐ๐Ÿฐ๐—ฎ๐—ถ-๐—ฐ๐—ผ๐—บ๐—บ๐—ฎ๐—ป๐—ฑ-๐—ฟ-๐—ฝ๐—น๐˜‚๐˜€
โ›ต๏ธ mistralai/๐— ๐—ถ๐˜…๐˜๐—ฟ๐—ฎ๐—น-๐Ÿด๐˜…๐Ÿฎ๐Ÿฎ๐—•-๐—œ๐—ป๐˜€๐˜๐—ฟ๐˜‚๐—ฐ๐˜-๐˜ƒ๐Ÿฌ.๐Ÿญ
"Fine-tuned for function-calling" generally means "fine-tuned to generate function calls in correct JSON for extremely simple tasks". In other terms, it means "improve the formatting of the tool calls".

Yet I discovered two things while improving Transformers Agents:
๐Ÿง Even when used as JSON agents, these fine-tuned models don't perform very well
๐Ÿ… ๐™‚๐™ค๐™ค๐™™ ๐™—๐™–๐™จ๐™š ๐™ข๐™ค๐™™๐™š๐™ก๐™จ ๐™ฅ๐™š๐™ง๐™›๐™ค๐™ง๐™ข ๐™—๐™š๐™ฉ๐™ฉ๐™š๐™ง ๐™ฌ๐™ž๐™ฉ๐™๐™ค๐™ช๐™ฉ ๐™–๐™ฃ๐™ฎ ๐™›๐™ž๐™ฃ๐™š-๐™ฉ๐™ช๐™ฃ๐™ž๐™ฃ๐™œ, ๐™Ÿ๐™ช๐™จ๐™ฉ ๐™ฅ๐™ก๐™–๐™ž๐™ฃ ๐™ฅ๐™ง๐™ค๐™ข๐™ฅ๐™ฉ๐™ž๐™ฃ๐™œ. (Llama-3-70B-Instruct, GPT-4o, Claude-3.5-Sonnet)

👇 The graph below shows the count of errors for my GPT-4o validation run on the GAIA benchmark: AgentParsingError and AgentExecutionError are the ones caused by incorrect formatting.
➤ As you can see, their count is already close to 0!
And given that GPT-4o is certainly not fine-tuned for our Code tool calling format, this shows that "function calling fine-tuning" is not necessary!
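The tallying behind such a graph is straightforward. Here is a hedged sketch of counting error categories over per-task run logs; the log structure and the error label `AgentReasoningError` are hypothetical stand-ins, not the framework's actual log format.

```python
from collections import Counter

# Hypothetical per-task logs from a validation run. The first two error
# names mirror the formatting-error categories mentioned above; the
# reasoning-error label is an invented example of a non-formatting failure.
run_logs = [
    {"task": 1, "error": None},
    {"task": 2, "error": "AgentParsingError"},
    {"task": 3, "error": "AgentReasoningError"},
    {"task": 4, "error": None},
]

# Tally only the failed tasks, grouped by error type.
error_counts = Counter(log["error"] for log in run_logs if log["error"])
print(error_counts)  # e.g. Counter({'AgentParsingError': 1, 'AgentReasoningError': 1})
```

The point of the real graph is that the formatting buckets are nearly empty, while planning failures dominate.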

The hardest thing to get right in an agent is still to plan good task-solving trajectories over several steps.
To improve this, we could:
- Use more powerful base models
- Make tool calling datasets with complex solving trajectories
- Use RL! cc @lvwerra
๐“๐ซ๐š๐ง๐ฌ๐Ÿ๐จ๐ซ๐ฆ๐ž๐ซ๐ฌ ๐€๐ ๐ž๐ง๐ญ๐ฌ ๐ซ๐ž๐š๐œ๐ก๐ž๐ฌ ๐ญ๐ก๐ž ๐ญ๐จ๐ฉ ๐จ๐Ÿ ๐†๐€๐ˆ๐€ ๐ฅ๐ž๐š๐๐ž๐ซ๐›๐จ๐š๐ซ๐! ๐Ÿฅณ

We've been improving Transformers Agents a lot lately.

So with @sergeipetrov we set out to prove that it's the best agent framework out there.

To prove this, we set out to beat the GAIA leaderboard, the most comprehensive benchmark for evaluating LLM agents.
Its questions make you explore different flavours of pain:

๐Ÿ› ๏ธ ๐—ฅ๐—ฒ๐—พ๐˜‚๐—ถ๐—ฟ๐—ฒ ๐˜‚๐˜€๐—ถ๐—ป๐—ด ๐˜๐—ผ๐—ผ๐—น๐˜€, at least a web browser
๐Ÿ”ข ๐—ฅ๐—ถ๐—ด๐—ผ๐—ฟ๐—ผ๐˜‚๐˜€ ๐—น๐—ผ๐—ด๐—ถ๐—ฐ, many questions having strong math aspects
๐Ÿ–ผ๏ธ ๐— ๐˜‚๐—น๐˜๐—ถ๐—บ๐—ผ๐—ฑ๐—ฎ๐—น, the agent had to handle all file types: ๐Ÿ”Š, ๐Ÿ–ผ๏ธ, ๐ŸŽฌ...
๐Ÿ‘ฃ ๐— ๐˜‚๐—น๐˜๐—ถ-๐˜€๐˜๐—ฒ๐—ฝ, with many questions requiring over 10 steps to be solved.

Some Level 3 questions are crazy hard 😳
> "In NASA's Astronomy Picture of the Day on 2006 January 21, two astronauts are visible, with one appearing much smaller than the other. As of August 2023, out of the astronauts in the NASA Astronaut Group that the smaller astronaut was a member of, which one spent the least time in space, and how many minutes did he spend in space, rounded to the nearest minute?"
(no file attached of course, the agent has to find all the info)

โžก๏ธ We used Transformers Agents' React Code Agent, that writes its actions in code. We created a new planning component that we'll incorporate in the framework. More info soon in a blog post!

๐‘๐ž๐ฌ๐ฎ๐ฅ๐ญ๐ฌ:
๐Ÿš€ Our submission scores #2 overall on the test set and #1 on the validation set. On both sets we're the leading submission based on a public framework, beating Microsoft's Autogen.
๐Ÿฅ‡ On both sets we are #1 on the hardest Level 3 questions, reaching nearly 20%.

๐™‚๐™ค ๐™˜๐™๐™š๐™˜๐™  ๐™ค๐™ช๐™ฉ ๐™ฉ๐™๐™š ๐™ก๐™š๐™–๐™™๐™š๐™ง๐™—๐™ค๐™–๐™ง๐™™ ๐Ÿ‘‰ gaia-benchmark/leaderboard