m-ric (Aymeric Roucher)

posted an update about 14 hours ago

New cookbook!

I show how to make agentic RAG using Transformers Agents.

Compared to vanilla RAG, agentic RAG can:
✅ Reformulate the query
✅ Critique the retrieved content to re-retrieve if needed

➡️ Score increase of 8.5%! 💪 (Llama-3-70B-judge)

Read it here 👉 https://huggingface.co/learn/cookbook/agent_rag
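The two abilities listed above boil down to a small loop. Here is a minimal sketch, not the cookbook's code: `retrieve`, `is_relevant` and `reformulate` are hypothetical stand-ins for a real retriever and real LLM calls, and the two-document corpus is made up for the demo.

```python
from typing import Callable, List


def agentic_retrieve(
    query: str,
    retrieve: Callable[[str], List[str]],
    is_relevant: Callable[[str, List[str]], bool],
    reformulate: Callable[[str], str],
    max_tries: int = 3,
) -> List[str]:
    """Retrieve, critique the result, and re-retrieve with a new query if needed."""
    docs: List[str] = []
    for _ in range(max_tries):
        docs = retrieve(query)
        if is_relevant(query, docs):  # critique step
            return docs
        query = reformulate(query)  # reformulation step
    return docs  # best effort once the retry budget is spent


# Toy stand-ins (hypothetical): a keyword "retriever" over two documents.
CORPUS = {
    "push": "Use push_to_hub to upload a model.",
    "load": "Use from_pretrained to load a model.",
}


def toy_retrieve(query: str) -> List[str]:
    return [text for key, text in CORPUS.items() if key in query.lower()]


def toy_is_relevant(query: str, docs: List[str]) -> bool:
    return len(docs) > 0


def toy_reformulate(query: str) -> str:
    return "how to push a model"


result = agentic_retrieve(
    "How do I share my model?", toy_retrieve, toy_is_relevant, toy_reformulate
)
```

The first query finds nothing, so the loop reformulates it and retrieves again. In a real agent, the critique and the reformulation would both be LLM calls.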

replied to their post 5 days ago

It's not using GPT-4o for evaluation: evaluation is done with exact string match!

posted an update 7 days ago

𝗬𝗼𝘂 𝗱𝗼𝗻'𝘁 𝗻𝗲𝗲𝗱 "𝗳𝘂𝗻𝗰𝘁𝗶𝗼𝗻 𝗰𝗮𝗹𝗹𝗶𝗻𝗴 𝗳𝗶𝗻𝗲-𝘁𝘂𝗻𝗶𝗻𝗴" 𝘁𝗼 𝗯𝘂𝗶𝗹𝗱 𝗴𝗼𝗼𝗱 𝗮𝗴𝗲𝗻𝘁𝘀 ⛔

It's trendy to share models "fine-tuned for function calling"; but from my observations, this fine-tuning is neither necessary nor sufficient to build good agent systems.
To name only a few:
🐦‍⬛ Nexusflow/𝗡𝗲𝘅𝘂𝘀𝗥𝗮𝘃𝗲𝗻-𝗩𝟮-𝟭𝟯𝗕
⌘ CohereForAI/𝗰𝟰𝗮𝗶-𝗰𝗼𝗺𝗺𝗮𝗻𝗱-𝗿-𝗽𝗹𝘂𝘀
⛵️ mistralai/𝗠𝗶𝘅𝘁𝗿𝗮𝗹-𝟴𝘅𝟮𝟮𝗕-𝗜𝗻𝘀𝘁𝗿𝘂𝗰𝘁-𝘃𝟬.𝟭
"Fine-tuned for function-calling" generally means "fine-tuned to generate function calls in correct JSON for extremely simple tasks". In other terms, it means "improve the formatting of the tool calls".

Yet I discovered two things while improving Transformers Agents:
🧐 Even when used as JSON agents, these fine-tuned models don't perform very well
🏅 𝙂𝙤𝙤𝙙 𝙗𝙖𝙨𝙚 𝙢𝙤𝙙𝙚𝙡𝙨 𝙥𝙚𝙧𝙛𝙤𝙧𝙢 𝙗𝙚𝙩𝙩𝙚𝙧 𝙬𝙞𝙩𝙝𝙤𝙪𝙩 𝙖𝙣𝙮 𝙛𝙞𝙣𝙚-𝙩𝙪𝙣𝙞𝙣𝙜, 𝙟𝙪𝙨𝙩 𝙥𝙡𝙖𝙞𝙣 𝙥𝙧𝙤𝙢𝙥𝙩𝙞𝙣𝙜. (Llama-3-70B-Instruct, GPT-4o, Claude-3.5-Sonnet)

👇 The graph below shows the count of errors for my GPT-4o validation run on the GAIA benchmark: 𝙰𝚐𝚎𝚗𝚝𝙿𝚊𝚛𝚜𝚒𝚗𝚐𝙴𝚛𝚛𝚘𝚛 and 𝙰𝚐𝚎𝚗𝚝𝙴𝚡𝚎𝚌𝚞𝚝𝚒𝚘𝚗𝙴𝚛𝚛𝚘𝚛 are the ones caused by incorrect formatting.
➤ As you can see, their count is already close to 0!
And given that GPT-4o is certainly not fine-tuned for our Code tool calling format, this shows that "function calling fine-tuning" is not necessary!

The hardest thing to get right in an agent is still to 𝙥𝙡𝙖𝙣 𝙜𝙤𝙤𝙙 𝙩𝙖𝙨𝙠-𝙨𝙤𝙡𝙫𝙞𝙣𝙜 𝙩𝙧𝙖𝙟𝙚𝙘𝙩𝙤𝙧𝙞𝙚𝙨 𝙤𝙫𝙚𝙧 𝙨𝙚𝙫𝙚𝙧𝙖𝙡 𝙨𝙩𝙚𝙥𝙨.
To improve this, we could:
- Use more powerful base models
- Make tool calling datasets with complex solving trajectories
- Use RL! cc @lvwerra

posted an update 12 days ago

𝐓𝐫𝐚𝐧𝐬𝐟𝐨𝐫𝐦𝐞𝐫𝐬 𝐀𝐠𝐞𝐧𝐭𝐬 𝐫𝐞𝐚𝐜𝐡𝐞𝐬 𝐭𝐡𝐞 𝐭𝐨𝐩 𝐨𝐟 𝐆𝐀𝐈𝐀 𝐥𝐞𝐚𝐝𝐞𝐫𝐛𝐨𝐚𝐫𝐝! 🥳

We've been improving Transformers Agents a lot lately.

So with @sergeipetrov we set out to prove that it's the best agent framework out there.

To prove this, we went to beat the 𝗚𝗔𝗜𝗔 𝗹𝗲𝗮𝗱𝗲𝗿𝗯𝗼𝗮𝗿𝗱, the most comprehensive benchmark out there for evaluating LLM agents.
Its questions make you explore different flavours of pain:

🛠️ 𝗥𝗲𝗾𝘂𝗶𝗿𝗲 𝘂𝘀𝗶𝗻𝗴 𝘁𝗼𝗼𝗹𝘀, at least a web browser
🔢 𝗥𝗶𝗴𝗼𝗿𝗼𝘂𝘀 𝗹𝗼𝗴𝗶𝗰, many questions having strong math aspects
🖼️ 𝗠𝘂𝗹𝘁𝗶𝗺𝗼𝗱𝗮𝗹, the agent had to handle all file types: 🔊, 🖼️, 🎬...
👣 𝗠𝘂𝗹𝘁𝗶-𝘀𝘁𝗲𝗽, with many questions requiring over 10 steps to be solved.

Some Level 3 questions are crazy hard 😳
> "In NASA’s Astronomy Picture of the Day on 2006 January 21, two astronauts are visible, with one appearing much smaller than the other. As of August 2023, out of the astronauts in the NASA Astronaut Group that the smaller astronaut was a member of, which one spent the least time in space, and how many minutes did he spend in space, rounded to the nearest minute?"
(𝘯𝘰 𝘧𝘪𝘭𝘦 𝘢𝘵𝘵𝘢𝘤𝘩𝘦𝘥 𝘰𝘧 𝘤𝘰𝘶𝘳𝘴𝘦, 𝘵𝘩𝘦 𝘢𝘨𝘦𝘯𝘵 𝘩𝘢𝘴 𝘵𝘰 𝘧𝘪𝘯𝘥 𝘢𝘭𝘭 𝘵𝘩𝘦 𝘪𝘯𝘧𝘰)

➡️ We used Transformers Agents' React Code Agent, that writes its actions in code. We created a new planning component that we'll incorporate in the framework. More info soon in a blog post!

𝐑𝐞𝐬𝐮𝐥𝐭𝐬:
🚀 Our submission scores #2 overall on the test set and #1 on the validation set. On both sets we're the leading submission based on a public framework, beating Microsoft's Autogen.
🥇 On both sets we are #1 on the hardest Level 3 questions, reaching nearly 20%.

𝙂𝙤 𝙘𝙝𝙚𝙘𝙠 𝙤𝙪𝙩 𝙩𝙝𝙚 𝙡𝙚𝙖𝙙𝙚𝙧𝙗𝙤𝙖𝙧𝙙 👉 gaia-benchmark/leaderboard

replied to their post 21 days ago

Great idea! Can I build it @victor or you'd like to make it yourself?

posted an update 21 days ago

💰 𝗚𝗲𝘁 𝘁𝗵𝗲 𝗽𝗿𝗶𝗰𝗲 𝗼𝗳 𝗮𝗻𝘆 𝗟𝗟𝗠 𝗔𝗣𝗜 𝗿𝗲𝗾𝘂𝗲𝘀𝘁 ⇒ 𝘁𝗼𝗸𝗲𝗻𝗰𝗼𝘀𝘁

I've just found out about 𝙰𝚐𝚎𝚗𝚝𝙾𝚙𝚜-𝙰𝙸/𝚝𝚘𝚔𝚎𝚗𝚌𝚘𝚜𝚝 (https://github.com/AgentOps-AI/tokencost).
𝗧𝗵𝗶𝘀 𝗹𝗶𝗯𝗿𝗮𝗿𝘆 𝗴𝗶𝘃𝗲𝘀 𝘆𝗼𝘂 𝘁𝗵𝗲 𝗽𝗿𝗶𝗰𝗲 𝗼𝗳 𝘆𝗼𝘂𝗿 𝗰𝗮𝗹𝗹𝘀 𝘁𝗼 𝗮𝗻𝘆 𝗟𝗟𝗠 𝗔𝗣𝗜: OpenAI, Anthropic, Mistral, AWS or Databricks...

For any model, you can use as input either string prompts or messages, and get as outputs either the price or token count.

Congrats to the AgentOps-AI team: this will be very useful when trying to get a ballpark estimate of a project's price, to compare APIs, or for precise monitoring of usage!
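For intuition, the arithmetic behind such an estimate is simple. The sketch below does not use tokencost's API: `request_cost` is a hypothetical helper, and the prices in the table are placeholders, not real quotes for any model.

```python
from decimal import Decimal

# Hypothetical prices in USD per 1M tokens: placeholders, not real quotes.
PRICES_PER_MILLION = {
    "example-model": {"input": Decimal("5.00"), "output": Decimal("15.00")},
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> Decimal:
    """Price of one request = tokens x per-token price, summed over input and output."""
    price = PRICES_PER_MILLION[model]
    total = price["input"] * input_tokens + price["output"] * output_tokens
    return total / Decimal(1_000_000)


cost = request_cost("example-model", input_tokens=1_200, output_tokens=300)
```

The hard parts that a library takes off your hands are counting tokens with the right tokenizer and keeping the price table up to date.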

✨ Daily reminder: 𝗿𝘂𝗻𝗻𝗶𝗻𝗴 𝗮𝗻 𝗔𝟭𝟬𝟬 𝗰𝗼𝘀𝘁𝘀 𝘆𝗼𝘂 𝗲𝘅𝗮𝗰𝘁𝗹𝘆 $𝟬.𝟬𝟬/𝗵𝗼𝘂𝗿 (or 0.00€ in current exchange rates) on a HF space with ZeroGPU!
Learn more on ZeroGPU 👉 https://www.datacenterdynamics.com/en/news/hugging-face-launches-zerogpu-project-to-democratize-ai-gives-away-10-million-worth-of-compute/

posted an update about 1 month ago

𝗛𝗼𝘄 𝗱𝗼𝗲𝘀 𝗮𝗻 𝗮𝗴𝗲𝗻𝘁𝗶𝗰 𝘄𝗼𝗿𝗸𝗳𝗹𝗼𝘄 𝘂𝘀𝗲 𝗶𝘁𝘀 𝗟𝗟𝗠 𝗲𝗻𝗴𝗶𝗻𝗲 𝘁𝗼 𝘀𝗼𝗹𝘃𝗲 𝘁𝗮𝘀𝗸𝘀?

➡️ I made my first ever 𝘮𝘢𝘯𝘪𝘮 video to show just that:

𝗪𝗮𝘁𝗰𝗵 𝗯𝗲𝗹𝗼𝘄 𝗵𝗼𝘄 𝗮 𝗥𝗲𝗮𝗰𝘁 𝗔𝗴𝗲𝗻𝘁 𝘀𝗼𝗹𝘃𝗲𝘀 𝗮 𝘀𝗶𝗺𝗽𝗹𝗲 𝘁𝗮𝘀𝗸, by leveraging its memory to iterate on previous actions! 🎬👇

Read our blog post on Agents: https://huggingface.co/blog/agents
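If you prefer code to video, here is a minimal sketch of such a loop. It is not the Transformers Agents implementation: `react_loop`, the scripted "LLM" and the `power` tool are hypothetical, and a real agent would call an actual model and parse its output.

```python
from typing import Callable, Dict, List, Tuple


def react_loop(
    task: str,
    llm: Callable[[List[str]], Tuple[str, str]],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 5,
) -> str:
    """Think -> act -> observe, appending every step to memory."""
    memory = [f"Task: {task}"]
    for _ in range(max_steps):
        action, argument = llm(memory)  # the LLM sees the full memory each time
        if action == "final_answer":
            return argument
        observation = tools[action](argument)
        memory.append(f"Action: {action}({argument!r})")
        memory.append(f"Observation: {observation}")
    return "No answer within the step budget."


def power(expression: str) -> str:
    """Toy tool: evaluates 'a ** b' for integers."""
    base, exponent = expression.split("**")
    return str(int(base) ** int(exponent))


def scripted_llm(memory: List[str]) -> Tuple[str, str]:
    """Stand-in for a real LLM: calls the tool once, then answers from memory."""
    prefix = "Observation: "
    observations = [m for m in memory if m.startswith(prefix)]
    if not observations:
        return ("power", "2 ** 10")
    return ("final_answer", observations[-1][len(prefix):])


answer = react_loop("What is 2 to the power 10?", scripted_llm, {"power": power})
```

The key point is that `memory` grows at each step, so the next LLM call can build on previous actions and observations.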

posted an update about 1 month ago

𝙒𝙧𝙞𝙩𝙞𝙣𝙜 𝙩𝙤𝙤𝙡 𝙘𝙖𝙡𝙡𝙨 𝙞𝙣 𝙘𝙤𝙙𝙚 𝙟𝙪𝙨𝙩 𝙬𝙤𝙧𝙠𝙨 𝙗𝙚𝙩𝙩𝙚𝙧 𝙩𝙝𝙖𝙣 𝙅𝙎𝙊𝙉 💪

I was really happy to learn today from @sergeipetrov that the paper 𝘌𝘹𝘦𝘤𝘶𝘵𝘢𝘣𝘭𝘦 𝘊𝘰𝘥𝘦 𝘈𝘤𝘵𝘪𝘰𝘯𝘴 𝘌𝘭𝘪𝘤𝘪𝘵 𝘉𝘦𝘵𝘵𝘦𝘳 𝘓𝘓𝘔 𝘈𝘨𝘦𝘯𝘵𝘴 was accepted at ICML 2024!

As a reminder, an agent is a system in which you embed an LLM engine, to let it call tools.

These tools work like an Iron Man suit: they supplement the LLM in areas where it is weak.
🧑‍💻 For instance your friendly LLM may be terrible at calculating powers of floating-point numbers ("What is X ^0.2947 ?"), so it should use a calculator.
🔎 It may be terrible at knowing precise facts ("What was the date of the Golden Bull?") so it should use a web browser.

So the agent system will prompt an agent with "Now you can use these tools: calculator, search,..."

But 𝙝𝙤𝙬 𝙨𝙝𝙤𝙪𝙡𝙙 𝙩𝙝𝙚 𝙖𝙜𝙚𝙣𝙩 𝙚𝙭𝙥𝙧𝙚𝙨𝙨 𝙞𝙩𝙨 𝙖𝙘𝙩𝙞𝙤𝙣𝙨?

All well known frameworks let agents write their actions as JSON strings.

We 𝗽𝗿𝗲𝗳𝗲𝗿𝗿𝗲𝗱 𝘁𝗼 𝗴𝗼 𝘄𝗶𝘁𝗵 𝗳𝗼𝗿𝗺𝘂𝗹𝗮𝘁𝗶𝗻𝗴 𝗮𝗰𝘁𝗶𝗼𝗻𝘀 𝗶𝗻 𝗖𝗼𝗱𝗲, 𝘄𝗵𝗶𝗰𝗵 𝗶𝘀 𝗺𝘂𝗰𝗵 𝗺𝗼𝗿𝗲 𝘃𝗲𝗿𝘀𝗮𝘁𝗶𝗹𝗲 𝗮𝗻𝗱 𝗰𝗼𝗻𝗰𝗶𝘀𝗲, 𝗮𝗻𝗱 𝗮𝗹𝗹𝗼𝘄𝘀 𝘁𝗼 𝗰𝗵𝗮𝗶𝗻 𝗮𝗰𝘁𝗶𝗼𝗻𝘀 𝘀𝗲𝗮𝗺𝗹𝗲𝘀𝘀𝗹𝘆: see the picture attached for an example where Code formulation really shines.

And the paper confirms our choice: researchers show that 𝗰𝗼𝗺𝗽𝗮𝗿𝗲𝗱 𝘁𝗼 𝗝𝗦𝗢𝗡 𝗼𝗿 𝗽𝗹𝗮𝗶𝗻 𝘁𝗲𝘅𝘁, 𝗖𝗼𝗱𝗲 𝗶𝘀 𝗯𝗲𝘁𝘁𝗲𝗿 𝗯𝗼𝘁𝗵 𝗶𝗻 𝗰𝗼𝗻𝗰𝗶𝘀𝗲𝗻𝗲𝘀𝘀 𝗮𝗻𝗱 𝗽𝗲𝗿𝗳𝗼𝗿𝗺𝗮𝗻𝗰𝗲:
➤ Up to 30% fewer steps for the same actions (much more concise)
➤ Up to 20% higher performance on benchmarks

And we find additional benefits, for instance a natural handling of variables.
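To make the difference concrete, here is a minimal sketch under stated assumptions: the `search` and `add` tools are hypothetical (the search returns a canned value), and the `exec` call is only an illustration, not a secure sandbox and not how our framework executes code.

```python
import json


def search(query: str) -> str:
    """Hypothetical tool: returns a canned fact instead of querying the web."""
    return {"golden bull year": "1356"}.get(query.lower(), "unknown")


def add(a: int, b: int) -> int:
    return a + b


TOOLS = {"search": search, "add": add}


def run_json_action(blob: str):
    """Execute one JSON tool call of the form {"tool": ..., "arguments": {...}}."""
    call = json.loads(blob)
    return TOOLS[call["tool"]](**call["arguments"])


# JSON formulation: one action per tool call, so chaining needs a second round trip.
year = run_json_action('{"tool": "search", "arguments": {"query": "Golden Bull year"}}')
json_answer = run_json_action(
    json.dumps({"tool": "add", "arguments": {"a": int(year), "b": 100}})
)

# Code formulation: a single action chains both calls and keeps the result in a variable.
code_action = 'answer = add(int(search("Golden Bull year")), 100)'
namespace = {"__builtins__": {"int": int}, **TOOLS}  # only expose what the action needs
exec(code_action, namespace)
code_answer = namespace["answer"]
```

Both formulations reach the same result, but the code version does it in one step, which is where the savings in steps come from.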

Read the paper here 📖 Executable Code Actions Elicit Better LLM Agents (2402.01030)
Get your ReactCodeAgent running with our Agents framework! 👉 https://huggingface.co/learn/cookbook/agents

posted an update about 2 months ago

𝐍𝐞𝐰 𝐠𝐮𝐢𝐝𝐞 𝐢𝐧 𝐨𝐮𝐫 𝐎𝐩𝐞𝐧-𝐒𝐨𝐮𝐫𝐜𝐞 𝐀𝐈 𝐜𝐨𝐨𝐤𝐛𝐨𝐨𝐤: 𝙎𝙩𝙧𝙪𝙘𝙩𝙪𝙧𝙚𝙙 𝙜𝙚𝙣𝙚𝙧𝙖𝙩𝙞𝙤𝙣! ✨

Many LLM use cases involve generating outputs with a specific structure.

➡️ For instance when using an LLM as a judge to evaluate another model's outputs, you need it to give you not only a score, but also the rationale for this score, and maybe a confidence level.
So you need not just "score: 1", but rather a dictionary like:

{
     "rationale": "The answer does not match the true answer at all.",
     "score": 1,
     "confidence_level": 0.85
}

🤔 How to force your LLM to generate such a structured output?

🏗️ 𝗖𝗼𝗻𝘀𝘁𝗿𝗮𝗶𝗻𝗲𝗱 𝗱𝗲𝗰𝗼𝗱𝗶𝗻𝗴 is a great technique to generate structured output: you can specify a grammar (=set of rules) that the output should follow, and 𝗰𝗼𝗻𝘀𝘁𝗿𝗮𝗶𝗻𝗲𝗱 𝗱𝗲𝗰𝗼𝗱𝗶𝗻𝗴 𝘁𝗵𝗲𝗻 𝗳𝗼𝗿𝗰𝗲𝘀 𝘁𝗵𝗲 𝗱𝗲𝗰𝗼𝗱𝗲𝗿 𝘁𝗼 𝗼𝗻𝗹𝘆 𝗽𝗶𝗰𝗸 𝘁𝗼𝗸𝗲𝗻𝘀 𝘁𝗵𝗮𝘁 𝗿𝗲𝘀𝗽𝗲𝗰𝘁 𝘆𝗼𝘂𝗿 𝗴𝗿𝗮𝗺𝗺𝗮𝗿.

I've created a guide to show you how to use it, both via our Inference API and locally using 𝘰𝘶𝘵𝘭𝘪𝘯𝘦𝘴!
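To give a feel for the mechanism, here is a toy sketch of constrained decoding. Everything in it is an assumption made for illustration: the "grammar" is just a list of allowed token sets per position, and `fake_model_scores` stands in for real LLM logits. Real libraries work on the tokenizer vocabulary with far richer grammars.

```python
from typing import Dict, List, Set

# A very simple "grammar": the set of allowed tokens at each output position.
GRAMMAR: List[Set[str]] = [{'{"score": '}, {"1", "2", "3", "4"}, {"}"}]


def fake_model_scores(prefix: List[str]) -> Dict[str, float]:
    """Hypothetical stand-in for LLM logits: unconstrained, it would start with 'Sure'."""
    return {"Sure": 3.0, '{"score": ': 1.0, "1": 0.2, "2": 0.1, "3": 0.9, "4": 0.4, "}": 0.5}


def constrained_greedy_decode() -> str:
    output: List[str] = []
    for allowed in GRAMMAR:
        scores = fake_model_scores(output)
        # Mask every token the grammar forbids, then pick the best remaining one.
        masked = {token: score for token, score in scores.items() if token in allowed}
        output.append(max(masked, key=masked.get))
    return "".join(output)


result = constrained_greedy_decode()
```

The model still chooses among the allowed tokens (here it picks the score it prefers), but it can no longer produce anything that breaks the format.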

👉 Read it here: https://huggingface.co/learn/cookbook/structured_generation

Thank you @stevhliu for your great help in improving it!

posted an update 2 months ago

💰❌ 𝐑𝐞𝐬𝐞𝐚𝐫𝐜𝐡 𝐟𝐨𝐫 𝐭𝐡𝐞 𝐯𝐞𝐫𝐲 𝐆𝐏𝐔 𝐏𝐨𝐨𝐫 - 𝐒𝐜𝐚𝐥𝐢𝐧𝐠 𝐥𝐚𝐰𝐬 𝐫𝐞𝐩𝐥𝐢𝐜𝐚𝐭𝐢𝐨𝐧

🎆 Good news: 𝘆𝗼𝘂 𝗰𝗮𝗻 𝗱𝗼 𝗰𝘂𝘁𝘁𝗶𝗻𝗴-𝗲𝗱𝗴𝗲 𝗿𝗲𝘀𝗲𝗮𝗿𝗰𝗵 𝘄𝗶𝘁𝗵 𝗮 𝗰𝗮𝗹𝗰𝘂𝗹𝗮𝘁𝗼𝗿 𝗮𝗻𝗱 𝗠𝗶𝗰𝗿𝗼𝘀𝗼𝗳𝘁 𝗣𝗮𝗶𝗻𝘁 𝟮𝟬𝟬𝟲!

The Chinchilla experiments (by Google DeepMind) ran hundreds of pre-trainings with models >1B parameters (I do not want to imagine how much that cost) to 𝗳𝗶𝗻𝗱 𝘁𝗵𝗲 𝗼𝗽𝘁𝗶𝗺𝗮𝗹 𝗿𝗮𝘁𝗶𝗼 𝗼𝗳 𝗺𝗼𝗱𝗲𝗹 𝘀𝗶𝘇𝗲 𝘃𝘀 𝘁𝗿𝗮𝗶𝗻𝗶𝗻𝗴 𝘁𝗼𝗸𝗲𝗻𝘀. Why is this question so important?
Well, you only ever have access to a fixed compute budget, counted in FLOPs (floating point operations). So if your model is bigger, you will have less compute to train on many tokens, and if you want to train on more tokens, your model will have to be smaller. When model trainings cost millions, you absolutely need to get this right.

The new paper "Chinchilla Scaling: A replication attempt" by Epoch AI sets out on the ambitious goal of reproducing this.

But since the authors do not have infinite money, they decided to run their computations directly on the data from DeepMind's own experiments! They took the figure from the last experiment (cf slide below), measured point positions, picked color codes, and ended up reconstructing the underlying data.

💥 They then just fit the scaling laws proposed by the Chinchilla Authors, but arrived at wildly different results! They find that as a rough rule of thumb, you should use 20 training tokens for each parameter in your model, instead of the 70 obtained in the original paper. They also point out inconsistencies in the paper, and unrealistically narrow confidence intervals.
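To see what such a rule of thumb implies, here is a small sketch. It relies on an assumption not stated in this post: the common approximation that training compute is about 6 × parameters × tokens. The function name and the 1e21 FLOPs budget are only illustrative.

```python
import math


def compute_optimal_allocation(flops_budget: float, tokens_per_param: float = 20.0):
    """Split a FLOPs budget C between parameters N and tokens D.

    Assumes C ~= 6 * N * D (a common approximation) and D = tokens_per_param * N,
    which gives N = sqrt(C / (6 * tokens_per_param)).
    """
    n_params = math.sqrt(flops_budget / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


n_params, n_tokens = compute_optimal_allocation(1e21)
```

With a budget of 1e21 FLOPs and 20 tokens per parameter, this gives roughly 2.9B parameters trained on roughly 58B tokens. Changing the ratio moves the split a lot, which is why getting it right matters.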

➡️ This only contradicts the results from the last (out of 3) experiments in the Chinchilla paper. And the model trained at the end of the Chinchilla paper still seems properly scaled.

✅ But it does show that a tiny bit more theoretical work can go a long way, especially given the huge financial costs that such an error can have!

posted an update 3 months ago

𝐏𝐚𝐩𝐞𝐫 𝐑𝐞𝐯𝐢𝐞𝐰: 𝐑𝐡𝐨-𝟏 - 𝐃𝐨 𝐧𝐨𝐭 𝐮𝐬𝐞 𝐚𝐥𝐥 𝐭𝐨𝐤𝐞𝐧𝐬 𝐞𝐪𝐮𝐚𝐥𝐥𝐲 𝐢𝐧 𝐲𝐨𝐮𝐫 𝐭𝐫𝐚𝐢𝐧𝐢𝐧𝐠! ⚖️⛔️

A new paper topping Daily papers questions a hidden assumption in LLM training:

🤔 𝙎𝙝𝙤𝙪𝙡𝙙 𝙬𝙚 𝙧𝙚𝙖𝙡𝙡𝙮 𝙪𝙨𝙚 𝙖𝙡𝙡 𝙩𝙤𝙠𝙚𝙣𝙨 𝙚𝙦𝙪𝙖𝙡𝙡𝙮 𝙞𝙣 𝙤𝙪𝙧 𝙇𝙇𝙈'𝙨 𝙩𝙧𝙖𝙞𝙣𝙞𝙣𝙜 ?

Some tokens are more relevant than others, and some are mostly noise (just look up the history of 𝘚𝘰𝘭𝘪𝘥𝘎𝘰𝘭𝘥𝘔𝘢𝘨𝘪𝘬𝘢𝘳𝘱).

So this paper introduces 𝗦𝗲𝗹𝗲𝗰𝘁𝗶𝘃𝗲 𝗟𝗮𝗻𝗴𝘂𝗮𝗴𝗲 𝗠𝗼𝗱𝗲𝗹𝗶𝗻𝗴, which is actually really simple:
➡️ A specific metric measures the relevance of each token. Then during training, only the top k% tokens for this relevance metric count in the loss calculation.

Authors test this method by training models on the difficult MATH dataset (only competition mathematics problems).
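The selection rule described above fits in a few lines. This is a minimal sketch, not the paper's code: the relevance scores are taken as given inputs (this post does not define the metric), and `selective_loss` is a hypothetical name.

```python
from typing import List


def selective_loss(
    token_losses: List[float], relevance_scores: List[float], keep_ratio: float
) -> float:
    """Average the loss over the top `keep_ratio` fraction of tokens, ranked by relevance."""
    n_keep = max(1, int(len(token_losses) * keep_ratio))
    ranked = sorted(
        range(len(token_losses)), key=lambda i: relevance_scores[i], reverse=True
    )
    kept = ranked[:n_keep]
    # Tokens outside `kept` contribute nothing to the loss (hence no gradient).
    return sum(token_losses[i] for i in kept) / n_keep


losses = [2.0, 0.5, 3.0, 1.0]
scores = [0.9, 0.1, 0.8, 0.2]
loss = selective_loss(losses, scores, keep_ratio=0.5)
```

Here only the two most relevant tokens (the first and the third) count, so the loss is the mean of 2.0 and 3.0.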

➡️ Their technique seems like a new must-do in LLM training: Training is much faster and reaches an impressive performance!

𝐑𝐞𝐬𝐮𝐥𝐭𝐬:
◆ ⏱️ Training is x5 to x10 faster to reach equivalent performance compared to standard language modeling.
◆ 💪 Their 1B model achieves close to GPT4 Chain-of-Thought performance on MATH!
◆ 🚀 Their 7B model matches the performance of the state-of-the-art DeepSeek model of the same size, while being trained on only 3% of the tokens

𝐀𝐝𝐝𝐢𝐭𝐢𝐨𝐧𝐚𝐥 𝐢𝐧𝐬𝐢𝐠𝐡𝐭𝐬 💡
◆ Datasets used for pre-training, even after pre-filtering, still contain a large proportion of noisy tokens 😖
◆ Authors show that when you reduce loss on noisy tokens, you actually reduce accuracy (Figure 7). So Selective Language Modeling seems fundamental! ✅

Find great reads in @akhaliq 's Daily Papers 👉 https://huggingface.co/papers
Paper added to my collection 👉 m-ric/spinning-up-in-llms-659e698f9dd5a71bd3f579a7

posted an update 3 months ago

𝗡𝗲𝘄 𝗦𝗽𝗮𝗰𝗲: 𝘼𝙄 𝙏𝙧𝙖𝙫𝙚𝙡 𝙥𝙡𝙖𝙣𝙣𝙚𝙧 🗺️🏕️ Plan your next vacation in a few minutes!

I wanted to test whether a powerful LLM like Mixtral-8x7b had geographical reasoning capabilities.
So I built a small space that prompts the LLM to provide a JSON list of places based on a user input.

And the result was impressive! 🤯

⇒ 𝗜𝘁 𝘀𝗲𝗲𝗺𝘀 𝗹𝗶𝗸𝗲 𝗠𝗶𝘅𝘁𝗿𝗮𝗹 𝗵𝗮𝘀 𝗮 𝗴𝗿𝗮𝘀𝗽 𝗼𝗳 𝗴𝗲𝗼𝗴𝗿𝗮𝗽𝗵𝗶𝗰𝗮𝗹 𝗰𝗼𝗻𝗰𝗲𝗽𝘁𝘀 𝗹𝗶𝗸𝗲 𝗡𝗼𝗿𝘁𝗵 - 𝗦𝗼𝘂𝘁𝗵, 𝗼𝗿 𝘀𝗽𝗮𝘁𝗶𝗮𝗹 𝗮𝗹𝗶𝗴𝗻𝗺𝗲𝗻𝘁.🧭 Not just describing these concepts, but really applying them in practice, for instance to successfully answer "give me 4 European cities that are aligned on the map". This is a 𝗻𝗶𝗰𝗲 𝗲𝘅𝗮𝗺𝗽𝗹𝗲 𝗼𝗳 𝗮𝗻 𝗲𝗺𝗲𝗿𝗴𝗲𝗻𝘁 𝗰𝗮𝗽𝗮𝗯𝗶𝗹𝗶𝘁𝘆, since nothing in the LLM's training data should prepare it for this specific task.

Anyway, I added API calls and a nice visualization on top of the LLM, streaming output, caching for the answers and locations... and ta-da! ✨ I got the 𝗔𝗜 𝗧𝗿𝗮𝘃𝗲𝗹 𝗣𝗹𝗮𝗻𝗻𝗲𝗿.

𝙔𝙤𝙪 𝙘𝙖𝙣 𝙙𝙚𝙨𝙘𝙧𝙞𝙗𝙚 𝙞𝙩 𝙮𝙤𝙪𝙧 𝙩𝙧𝙞𝙥, 𝙖𝙣𝙙 𝙞𝙩 𝙬𝙞𝙡𝙡 𝙘𝙤𝙢𝙚 𝙪𝙥 𝙬𝙞𝙩𝙝 𝙣𝙞𝙘𝙚 𝙖𝙣𝙙 𝙘𝙤𝙣𝙫𝙚𝙣𝙞𝙚𝙣𝙩 𝙡𝙤𝙘𝙖𝙩𝙞𝙤𝙣𝙨!

𝙏𝙧𝙮 𝙞𝙩 𝙝𝙚𝙧𝙚 👉 m-ric/ai-travel-planner

Thank you @freddyaboulton for the 𝚐𝚛𝚊𝚍𝚒𝚘_𝚏𝚘𝚕𝚒𝚞𝚖 component, and @clem , @pngwn , @abidlabs for your ideas and support!

posted an update 3 months ago

[𝐍𝐞𝐰 𝐏𝐚𝐩𝐞𝐫] 𝐀𝐥𝐥 𝐭𝐨𝐤𝐞𝐧𝐬 𝐬𝐡𝐨𝐮𝐥𝐝 𝐧𝐨𝐭 𝐫𝐞𝐪𝐮𝐢𝐫𝐞 𝐭𝐡𝐞 𝐬𝐚𝐦𝐞 𝐞𝐟𝐟𝐨𝐫𝐭 𝐭𝐨 𝐜𝐨𝐦𝐩𝐮𝐭𝐞! ⇒ 𝐌𝐢𝐱𝐭𝐮𝐫𝐞 𝐨𝐟 𝐝𝐞𝐩𝐭𝐡𝐬 🫧🐠

Google Researchers were unhappy with the way current decoding generally works: all tokens go through the same layers, thus requiring exactly the same effort to compute.

Whereas in reality, completing the answer to a difficult math problem for instance should be more computationally intense than completing the text of the Declaration of Independence: 𝗻𝗼𝘁 𝗮𝗹𝗹 𝘁𝗼𝗸𝗲𝗻𝘀 𝗮𝗿𝗲 𝗰𝗿𝗲𝗮𝘁𝗲𝗱 𝗲𝗾𝘂𝗮𝗹!

➡️ 𝗧𝗵𝗲𝘆 𝗵𝗮𝗱 𝘁𝗵𝗶𝘀 𝗴𝗲𝗻𝗶𝘂𝘀 𝗶𝗱𝗲𝗮: 💡 𝗵𝗮𝘃𝗶𝗻𝗴 𝗮 𝘁𝗼𝗸𝗲𝗻 𝗴𝗼 𝘁𝗵𝗿𝗼𝘂𝗴𝗵 𝗮 𝗯𝗹𝗼𝗰𝗸 𝘀𝗵𝗼𝘂𝗹𝗱 𝗯𝗲 𝗼𝗽𝘁𝗶𝗼𝗻𝗮𝗹. The token can go through the block (thus undergoing expensive self-attention computation) or avoid it through a skip connection.
The routing decision is taken on the block level: each block selects from the total sequence the top-k tokens that will go through it, and the other tokens will skip it. 𝘛𝘩𝘪𝘴 𝘢𝘭𝘭𝘰𝘸𝘴 𝘵𝘰 𝘤𝘩𝘰𝘰𝘴𝘦 𝘵𝘩𝘦 𝘦𝘹𝘢𝘤𝘵 𝙘𝙖𝙥𝙖𝙘𝙞𝙩𝙮 𝘰𝘧 𝘢 𝘣𝘭𝘰𝘤𝘬, 𝘪.𝘦. 𝘵𝘩𝘦 𝘱𝘳𝘰𝘱𝘰𝘳𝘵𝘪𝘰𝘯 𝘰𝘧 𝘵𝘰𝘬𝘦𝘯𝘴 𝘵𝘩𝘢𝘵 𝘨𝘰 𝘵𝘩𝘳𝘰𝘶𝘨𝘩 𝘪𝘵, 𝘸𝘩𝘪𝘤𝘩 𝘥𝘪𝘳𝘦𝘤𝘵𝘭𝘺 𝘪𝘯𝘧𝘭𝘶𝘦𝘯𝘤𝘦𝘴 𝘵𝘩𝘦 𝘤𝘰𝘮𝘱𝘶𝘵𝘢𝘵𝘪𝘰𝘯𝘢𝘭 𝘪𝘯𝘵𝘦𝘯𝘴𝘪𝘵𝘺 𝘰𝘧 𝘵𝘩𝘦 𝘧𝘰𝘳𝘸𝘢𝘳𝘥 𝘱𝘢𝘴𝘴.
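The routing rule can be sketched in a few lines. This is a toy illustration under simplifying assumptions: hidden states are plain floats, the router scores are given as inputs, and `mod_block` is a hypothetical name. It leaves out details of the paper's actual method.

```python
from typing import Callable, List


def mod_block(
    tokens: List[float],
    router_scores: List[float],
    capacity: float,
    block: Callable[[float], float],
) -> List[float]:
    """Send only the top-k tokens through `block`; the rest take the skip connection."""
    k = max(1, int(len(tokens) * capacity))
    ranked = sorted(range(len(tokens)), key=lambda i: router_scores[i], reverse=True)
    routed = set(ranked[:k])
    # Routed tokens get x + block(x); skipped tokens pass through unchanged.
    return [x + block(x) if i in routed else x for i, x in enumerate(tokens)]


hidden = [1.0, 2.0, 3.0, 4.0]
scores = [0.1, 0.9, 0.3, 0.2]
out = mod_block(hidden, scores, capacity=0.25, block=lambda x: 10.0 * x)
```

With a capacity of 25%, only the token with the highest router score pays for the expensive block; the three others are copied as-is.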

This yields Mixture-of-Depths (MoD), with spectacular results.

✨ 𝗥𝗲𝘀𝘂𝗹𝘁𝘀:
🎚️ 𝗖𝗮𝗽𝗮𝗰𝗶𝘁𝘆 𝗰𝗮𝗻 𝗯𝗲 𝘁𝘂𝗻𝗲𝗱 𝗮𝗹𝗹 𝘁𝗵𝗲 𝘄𝗮𝘆 𝗱𝗼𝘄𝗻 𝘁𝗼 𝟭𝟮.𝟱% for every second block: thus 87.5% of tokens just skip the block!
🚀 For the same training time and performance, >𝟲𝟬% 𝗶𝗻𝗳𝗲𝗿𝗲𝗻𝗰𝗲 𝘀𝗽𝗲𝗲𝗱!
🤝 𝗖𝗮𝗻 𝗯𝗲 𝗰𝗼𝗺𝗯𝗶𝗻𝗲𝗱 𝘄𝗶𝘁𝗵 𝗠𝗶𝘅𝘁𝘂𝗿𝗲-𝗼𝗳-𝗘𝘅𝗽𝗲𝗿𝘁𝘀 for further improvements.

📄 𝗣𝗮𝗽𝗲𝗿 𝗵𝗲𝗿𝗲 👉 Mixture-of-Depths: Dynamically allocating compute in transformer-based language models (2404.02258)
📚 I added it to my paper collection 👉 m-ric/spinning-up-in-llms-659e698f9dd5a71bd3f579a7

posted an update 3 months ago

𝟐𝟎𝟐𝟒, 𝐭𝐡𝐞 𝐲𝐞𝐚𝐫 𝐨𝐟 𝐚𝐠𝐞𝐧𝐭 𝐰𝐨𝐫𝐤𝐟𝐥𝐨𝐰𝐬 🔧🦾🤖

I've just watched Andrew Ng's talk at Sequoia last week.
If you're interested in Agents, you should really watch it!

𝗪𝗵𝘆 𝘂𝘀𝗲 𝗮𝗴𝗲𝗻𝘁 𝘄𝗼𝗿𝗸𝗳𝗹𝗼𝘄𝘀?
The current LLM task solving workflow is not very intuitive:
We ask it “write an essay all in one shot, without ever using backspace.”

Why not allow the LLM a more similar process to what we would do?
- “Write an essay outline”
- “Do you need web research?”
- “Write a first draft”
- “Consider improvements”
…

This is called an Agentic workflow. Existing ones bring a huge performance boost. With HumanEval: GPT-4 zero-shot gets 67% score, agentic with either one of tool use or reflection goes over 90%, and the combination of the two scores even higher!
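As a rough sketch of the idea (not code from the talk), here is a draft, critique, revise loop. `write_with_reflection` and the two toy functions are hypothetical stand-ins for LLM calls.

```python
from typing import Callable, Optional


def write_with_reflection(
    task: str,
    generate: Callable[[str, Optional[str]], str],
    critique: Callable[[str], Optional[str]],
    max_rounds: int = 3,
) -> str:
    """Draft once, then revise for as long as the critic has feedback."""
    draft = generate(task, None)
    for _ in range(max_rounds):
        feedback = critique(draft)
        if feedback is None:  # the critic is satisfied
            break
        draft = generate(task, feedback)
    return draft


def toy_generate(task: str, feedback: Optional[str]) -> str:
    return "essay" if feedback is None else "essay with sources"


def toy_critique(draft: str) -> Optional[str]:
    return None if "sources" in draft else "add sources"


final = write_with_reflection("Write an essay", toy_generate, toy_critique)
```

The same LLM can play both roles with different prompts. The loop is what lets it use "backspace".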

𝗔𝗴𝗲𝗻𝘁𝗶𝗰 𝗿𝗲𝗮𝘀𝗼𝗻𝗶𝗻𝗴 𝗱𝗲𝘀𝗶𝗴𝗻 𝗽𝗮𝘁𝘁𝗲𝗿𝗻𝘀
On the following two points, the tech is robust:

⚙️ 𝗥𝗲𝗳𝗹𝗲𝗰𝘁𝗶𝗼𝗻: for instance, add a critique step after the writing step
🛠️ 𝗧𝗼𝗼𝗹 𝘂𝘀𝗲: extends the capabilities of the LLM by allowing it to call tools, like search or calculator

The next two will be needed to go further, but the tech for them is more emerging and not reliable yet:
🗺️ 𝗣𝗹𝗮𝗻𝗻𝗶𝗻𝗴 forward to decompose task into subtasks. This allows great behaviours like an AI Agent re-routing after a failure
🐝 𝗠𝘂𝗹𝘁𝗶-𝗮𝗴𝗲𝗻𝘁 𝗰𝗼𝗹𝗹𝗮𝗯𝗼𝗿𝗮𝘁𝗶𝗼𝗻: Program a flock of agents with tasks.
Improving the two above points will unlock huge performance boosts!

Andrew Ng says research agents are already part of his workflow!

𝗖𝗹𝗼𝘀𝗶𝗻𝗴 𝘁𝗵𝗼𝘂𝗴𝗵𝘁𝘀
Andrew speculates that through agentic workflows, maybe generating many tokens fast from a small LLM will give better results than slower throughput from a powerful LLM like GPT-5.

🎬 Watch the talk here 👉 https://www.youtube.com/watch?v=sal78ACtGTc
📚 I've added his recommended reads to m-ric/agents-65ba776fbd9e29f771c07d4e

posted an update 3 months ago

𝐓𝐡𝐞 𝐫𝐞𝐭𝐮𝐫𝐧 𝐨𝐟 𝐭𝐡𝐞 𝐑𝐍𝐍𝐬 ⚔ 𝐍𝐞𝐰 𝐌𝐚𝐦𝐛𝐚-𝐛𝐚𝐬𝐞𝐝 𝐚𝐫𝐜𝐡𝐢𝐭𝐞𝐜𝐭𝐮𝐫𝐞 "𝐉𝐚𝐦𝐛𝐚"

Since the release of BERT by Google in 2018, Transformer architectures have taken over machine learning thanks to their 𝗮𝘁𝘁𝗲𝗻𝘁𝗶𝗼𝗻 𝗺𝗲𝗰𝗵𝗮𝗻𝗶𝘀𝗺, which gives them the ability to focus on important points of the input. But 𝙖𝙩𝙩𝙚𝙣𝙩𝙞𝙤𝙣 𝙘𝙤𝙢𝙥𝙪𝙩𝙖𝙩𝙞𝙤𝙣 𝙞𝙨 𝙦𝙪𝙖𝙙𝙧𝙖𝙩𝙞𝙘 𝙞𝙣 𝙩𝙝𝙚 𝙞𝙣𝙥𝙪𝙩 𝙡𝙚𝙣𝙜𝙩𝙝.

💫 The Mamba paper, published in December 2023, announced the return of the RNNs: it has no attention, but integrates a selection mechanism, which should be able to reproduce the “focus” ability of attention, in an architecture for which the compute requirements 𝗴𝗿𝗼𝘄 𝗼𝗻𝗹𝘆 𝗹𝗶𝗻𝗲𝗮𝗿𝗹𝘆 𝗶𝗻 𝗶𝗻𝗽𝘂𝘁 𝗹𝗲𝗻𝗴𝘁𝗵!
🤔 Would this work? We had yet to see a large Mamba model recovering the performance of Attention-based Transformers.

💥 But now it's done! A (Mamba + Transformers) hybrid just beat Transformers!

The AI21 Labs team just released Jamba.
They insert a few Transformer layers to inject some attention in a big pile of Mamba layers, thus getting the best of both worlds.

𝙏𝙇;𝘿𝙍:
🏗️ 𝗡𝗲𝘄 𝗠𝗼𝗘 𝗮𝗿𝗰𝗵𝗶𝘁𝗲𝗰𝘁𝘂𝗿𝗲: 4 Jamba blocks, each of these being 7 Mamba layers for 1 Transformer.
🏋️ 𝟱𝟮𝗕 𝗽𝗮𝗿𝗮𝗺𝗲𝘁𝗲𝗿𝘀, 𝟭𝟮𝗕 𝗮𝗰𝘁𝗶𝘃𝗲 𝗮𝘁 𝗶𝗻𝗳𝗲𝗿𝗲𝗻𝗰𝗲: This reduction is enabled by Mixture of Experts, and similar to Mixtral (47B parameters - 13B active).
🏎️ 𝗦𝗽𝗲𝗲𝗱: 𝘅𝟯 𝘁𝗵𝗿𝗼𝘂𝗴𝗵𝗽𝘂𝘁. Jamba is much faster than similar-sized Transformer models on long contexts.
📏 𝗖𝗼𝗻𝘁𝗲𝘅𝘁 𝗹𝗲𝗻𝗴𝘁𝗵: 𝟭𝟰𝟬𝗞 𝘁𝗼𝗸𝗲𝗻𝘀 on a single 80GB A100!
💪 𝗣𝗲𝗿𝗳𝗼𝗿𝗺𝗮𝗻𝗰𝗲: 𝘀𝘁𝗮𝘁𝗲-𝗼𝗳-𝘁𝗵𝗲-𝗮𝗿𝘁 𝗳𝗼𝗿 𝘁𝗵𝗶𝘀 𝘀𝗶𝘇𝗲. The small injection of attention seems sufficient since Jamba beats the open-source reference Mixtral-8x7B on many benchmarks!

Try it here 👉 ai21labs/Jamba-v0.1

posted an update 3 months ago

𝗛𝗼𝘄 𝗱𝗼𝗲𝘀 𝗯𝗲𝗮𝗺 𝘀𝗲𝗮𝗿𝗰𝗵 𝗱𝗲𝗰𝗼𝗱𝗶𝗻𝗴 𝘄𝗼𝗿𝗸? ➡️ 𝙉𝙚𝙬 𝙫𝙞𝙨𝙪𝙖𝙡𝙞𝙯𝙖𝙩𝙞𝙤𝙣 𝙩𝙤𝙤𝙡! 👀

In Decoder-type LLMs like GPT4 or Mistral-Large, the output is generated one token (=word part) at a time. That's why they're nicknamed "stochastic parrots": the "thinking" process only happens one step at a time, so it can seem really myopic.

𝐒𝐨 𝐡𝐨𝐰 𝐢𝐬 𝐭𝐡𝐞 𝐧𝐞𝐱𝐭 𝐭𝐨𝐤𝐞𝐧 𝐬𝐞𝐥𝐞𝐜𝐭𝐞𝐝?

📊 Given its input sentence like "𝘞𝘩𝘢𝘵 𝘪𝘴 𝘵𝘩𝘦 7𝘵𝘩 𝘍𝘪𝘣𝘰𝘯𝘢𝘤𝘤𝘪 𝘯𝘶𝘮𝘣𝘦𝘳? 𝘛𝘩𝘦 7𝘵𝘩 𝘍𝘪𝘣𝘰𝘯𝘢𝘤𝘤𝘪 𝘯𝘶𝘮𝘣𝘦𝘳", the Decoder LLM generates, for each token in its vocabulary, a score that represents this token's probability of coming next.
For instance: "𝙞𝙨" gets score 0.56, and "𝙘𝙖𝙣" gets score 0.35.

🤑 𝐆𝐫𝐞𝐞𝐝𝐲 𝐝𝐞𝐜𝐨𝐝𝐢𝐧𝐠 is the naive option where you simply take the most probable next token at each step. But this creates paths that maximize very short-term rewards, and thus may overlook better long-term paths (like that time you played FIFA all evening and arrived unprepared for your school exam the next day).
In our example, the highest-scoring next token might be "𝙞𝙨", but this will strongly bias the LLM towards giving a hasty response. By contrast, a start with "𝙘𝙖𝙣" could have been completed with "𝘣𝘦 𝘰𝘣𝘵𝘢𝘪𝘯𝘦𝘥 𝘧𝘳𝘰𝘮 𝘤𝘰𝘮𝘱𝘶𝘵𝘪𝘯𝘨 𝘱𝘳𝘦𝘷𝘪𝘰𝘶𝘴 𝘍𝘪𝘣𝘰𝘯𝘢𝘤𝘤𝘪 𝘯𝘶𝘮𝘣𝘦𝘳𝘴 𝘧𝘪𝘳𝘴𝘵", which steers the LLM towards correct reasoning!

🗺️ 𝐁𝐞𝐚𝐦 𝐬𝐞𝐚𝐫𝐜𝐡 improves on greedy decoding by generating at each step several paths - called beams - instead of one. This allows the generation to explore a much larger space, thus find better completions. In our example, both the "𝙞𝙨" and the "𝙘𝙖𝙣" completion could be tested. ✅
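Here is a toy sketch of both strategies. The probability table is made up for illustration (only the 0.56 and 0.35 scores come from the example above), and `beam_search` is a bare-bones version without end-of-sequence handling or length normalization. Greedy decoding is simply the special case with a single beam.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical next-token probabilities, keyed by the tokens generated so far.
NEXT: Dict[Tuple[str, ...], Dict[str, float]] = {
    (): {"is": 0.56, "can": 0.35},
    ("is",): {"8": 0.4, "13": 0.3},
    ("can",): {"be computed": 0.9},
}


def beam_search(
    next_probs: Dict[Tuple[str, ...], Dict[str, float]], num_beams: int, steps: int
) -> Tuple[str, ...]:
    """Keep the `num_beams` most probable partial sequences at every step."""
    beams: List[Tuple[Tuple[str, ...], float]] = [((), 0.0)]  # (tokens, log-probability)
    for _ in range(steps):
        candidates = []
        for tokens, logp in beams:
            for token, p in next_probs.get(tokens, {}).items():
                candidates.append((tokens + (token,), logp + math.log(p)))
        if not candidates:
            break
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
    return beams[0][0]


greedy = beam_search(NEXT, num_beams=1, steps=2)  # greedy decoding = one beam
beam = beam_search(NEXT, num_beams=2, steps=2)
```

Greedy commits to "is" and ends with a sequence of probability 0.56 × 0.4 = 0.224, while two beams keep "can" alive and find the more probable sequence at 0.35 × 0.9 = 0.315.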

👉 I've created a tool to let you visualize it, thank you @joaogante for your great help!
𝙏𝙧𝙮 𝙞𝙩 𝙝𝙚𝙧𝙚: m-ric/beam_search_visualizer

posted an update 4 months ago

𝗨𝘀𝗶𝗻𝗴 𝗟𝗟𝗠-𝗮𝘀-𝗮-𝗷𝘂𝗱𝗴𝗲 🧑‍⚖️ 𝗳𝗼𝗿 𝗮𝗻 𝗮𝘂𝘁𝗼𝗺𝗮𝘁𝗲𝗱 𝗮𝗻𝗱 𝘃𝗲𝗿𝘀𝗮𝘁𝗶𝗹𝗲 𝗲𝘃𝗮𝗹𝘂𝗮𝘁𝗶𝗼𝗻

Evaluating LLM outputs is often hard, since many tasks require open-ended answers for which no deterministic metrics work: for instance, when asking a model to summarize a text, there could be hundreds of correct ways to do it. The most versatile way to grade these outputs is then human evaluation, but it is very time-consuming, thus costly.

🤔 Then 𝘄𝗵𝘆 𝗻𝗼𝘁 𝗮𝘀𝗸 𝗮𝗻𝗼𝘁𝗵𝗲𝗿 𝗟𝗟𝗠 𝘁𝗼 𝗱𝗼 𝘁𝗵𝗲 𝗲𝘃𝗮𝗹𝘂𝗮𝘁𝗶𝗼𝗻, by providing it relevant rating criteria? 👉 This is the idea behind LLM-as-a-judge.

⚙️ To implement an LLM judge correctly, you need a few tricks.
✅ So 𝗜'𝘃𝗲 𝗷𝘂𝘀𝘁 𝗽𝘂𝗯𝗹𝗶𝘀𝗵𝗲𝗱 𝗮 𝗻𝗲𝘄 𝗻𝗼𝘁𝗲𝗯𝗼𝗼𝗸 𝘀𝗵𝗼𝘄𝗶𝗻𝗴 𝗵𝗼𝘄 𝘁𝗼 𝗶𝗺𝗽𝗹𝗲𝗺𝗲𝗻𝘁 𝗶𝘁 𝗽𝗿𝗼𝗽𝗲𝗿𝗹𝘆 𝗶𝗻 𝗼𝘂𝗿 𝗛𝘂𝗴𝗴𝗶𝗻𝗴 𝗙𝗮𝗰𝗲 𝗖𝗼𝗼𝗸𝗯𝗼𝗼𝗸! (you can run it instantly in Google Colab)
➡️ 𝗟𝗟𝗠-𝗮𝘀-𝗮-𝗷𝘂𝗱𝗴𝗲 𝗰𝗼𝗼𝗸𝗯𝗼𝗼𝗸: https://huggingface.co/learn/cookbook/llm_judge
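To give the flavour of it, here is a minimal sketch of a judge. The prompt wording, the 1 to 4 scale, and the `fake_judge` function are assumptions made for this example, not the notebook's code: in practice the judge would be a real LLM call.

```python
import re
from typing import Optional

# Illustrative prompt: asks for a rationale first, then a rating in a fixed format.
JUDGE_PROMPT = """You will be given a question and an answer.
Rate the answer from 1 (bad) to 4 (excellent).
Reply as:
Evaluation: <your rationale>
Total rating: <number from 1 to 4>

Question: {question}
Answer: {answer}"""


def parse_rating(judge_output: str) -> Optional[int]:
    """Extract the rating, or return None if the judge did not follow the format."""
    match = re.search(r"Total rating:\s*([1-4])", judge_output)
    return int(match.group(1)) if match else None


def fake_judge(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call."""
    return "Evaluation: The answer is correct but terse.\nTotal rating: 3"


prompt = JUDGE_PROMPT.format(question="What is 2+2?", answer="4")
rating = parse_rating(fake_judge(prompt))
```

Asking for the rationale before the score, and using a small integer scale with a fixed output format, makes the rating both easier to parse and easier to audit.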

The Cookbook is a great collection of notebooks demonstrating recipes (thus the "cookbook") for common LLM usages. I recommend you go take a look!
➡️ 𝗔𝗹𝗹 𝗰𝗼𝗼𝗸𝗯𝗼𝗼𝗸𝘀: https://huggingface.co/learn/cookbook/index

Thank you @MariaK for your support!

posted an update 4 months ago

Interesting paper: 𝐆𝐚𝐋𝐨𝐫𝐞: 𝐭𝐫𝐚𝐢𝐧 𝟕𝐁 𝐦𝐨𝐝𝐞𝐥𝐬 𝐨𝐧 𝐜𝐨𝐧𝐬𝐮𝐦𝐞𝐫-𝐠𝐫𝐚𝐝𝐞 𝐆𝐏𝐔𝐬 💪
It's now possible to 𝙛𝙪𝙡𝙡𝙮 𝙥𝙧𝙚-𝙩𝙧𝙖𝙞𝙣 a 7B model on a consumer-grade GPU with 24GB of memory, without any performance loss!

The memory usage of training models has always been an acute issue. For instance full pre-training of a 7B model used to eat ~50GB of memory!

The common workarounds to reduce memory load are:
- split models across multiple GPUs ("sharding")
- quantize models: encode weights on fewer bits

Another technique is to 𝙥𝙧𝙤𝙟𝙚𝙘𝙩 𝙩𝙝𝙚 𝙬𝙚𝙞𝙜𝙝𝙩 𝙢𝙖𝙩𝙧𝙞𝙭 𝙩𝙤 𝙡𝙤𝙬𝙚𝙧-𝙧𝙖𝙣𝙠 𝙨𝙥𝙖𝙘𝙚𝙨, (since sometimes the weights do not really vary on all dimensions): this can save a lot of space!
This low-rank projection can be done on adapters to preserve the original weights (go check out LoRA), but it still generally hurts the performance too much for pre-training.

➡️ Enter the authors of 𝘎𝘢𝘓𝘰𝘳𝘦: 𝘔𝘦𝘮𝘰𝘳𝘺-𝘌𝘧𝘧𝘪𝘤𝘪𝘦𝘯𝘵 𝘓𝘓𝘔 𝘛𝘳𝘢𝘪𝘯𝘪𝘯𝘨 𝘣𝘺 𝘎𝘳𝘢𝘥𝘪𝘦𝘯𝘵 𝘓𝘰𝘸-𝘙𝘢𝘯𝘬 𝘗𝘳𝘰𝘫𝘦𝘤𝘵𝘪𝘰𝘯. They gather (and prove) interesting insights:
⛔ The weight matrix does not reliably converge to lower ranks during training.
✅ But the gradient matrix does!

Based on these insights, 𝘁𝗵𝗲𝘆 𝗯𝘂𝗶𝗹𝗱 𝗚𝗮𝗟𝗼𝗿𝗲, that projects the gradient to lower ranks.
🗺️ 𝗚𝗿𝗲𝗮𝘁 𝗶𝗱𝗲𝗮: to leave the optimization free to explore more space, they periodically re-build the low-rank projection throughout the training (a nice illustration is in the paper).

🤝 This method can even be combined with previous ones such as 8-bit Adam (quantizing the optimizer states to 8-bit).
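A back-of-the-envelope count shows where the savings come from. This sketch rests on assumptions: an Adam-style optimizer that keeps two moment estimates, a gradient of shape m × n projected to r × n by an m × r matrix, and illustrative sizes (4096 and rank 128) that are not taken from the paper.

```python
from typing import Optional


def optimizer_state_floats(m: int, n: int, rank: Optional[int] = None) -> int:
    """Number of floats an Adam-style optimizer stores for one m x n weight matrix."""
    if rank is None:
        return 2 * m * n  # two moments, each the size of the full gradient
    # Two moments of the projected (rank x n) gradient, plus the m x rank projection matrix.
    return 2 * rank * n + m * rank


full = optimizer_state_floats(4096, 4096)
low_rank = optimizer_state_floats(4096, 4096, rank=128)
ratio = full / low_rank
```

For this layer, the optimizer state shrinks by a factor of about 21. The weights themselves stay full-rank, which is the difference with LoRA-style adapters.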

➡️ 𝐑𝐞𝐬𝐮𝐥𝐭𝐬:
📉 Of course, huge reduction in memory footprint allowing the training on consumer-grade GPU (cf figure).
💪 No reduction in performance: this scales well up to 7B parameters (and was independently confirmed since) ⇒ this is essential, it confirms that the method is viable!

Read the full paper here: GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection (2403.03507)

posted an update 5 months ago

📚🔎 If you're building RAG applications, you should check this out:

⚙️ I've built a new space to let you visualize the chunks you get with different text splitting methods!

➡️ Visualize your chunks here:
m-ric/chunk_visualizer
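For reference, the simplest splitting method is fixed-size chunks with overlap. The sketch below is a generic character-level splitter written for this example, not the code of the space or of any particular library.

```python
from typing import List


def split_text(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Cut `text` into chunks of `chunk_size` characters sharing `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop before starting a chunk that would contain only already-covered overlap.
    last_start = max(len(text) - overlap, 1)
    return [text[i : i + chunk_size] for i in range(0, last_start, step)]


chunks = split_text("abcdefghij", chunk_size=4, overlap=1)
```

Each chunk starts with the last character of the previous one. Smarter splitters cut on separators like paragraphs or sentences first, which is exactly the kind of difference the visualizer makes easy to see.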

Articles

Our Transformers Code Agent beats the GAIA benchmark!

Extracting Concepts from LLMs: Anthropic’s recent discoveries 📖

License to Call: Introducing Transformers Agents 2.0

Open-source LLMs as LangChain Agents
