How to alter the behavior of a Language Model without fine-tuning or prompting? Say hello to 🎤 yo-Llama 🦙!

Model: anakin87/yo-Llama-3-8B-Instruct

This experiment steers Llama-3-8B-Instruct to respond in a rap style.
How? By amplifying the rap direction in the activation space. 😎
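
For the curious, here is a minimal sketch of what "amplifying a direction" can look like in code: a PyTorch forward hook that adds a scaled vector to one decoder layer's residual stream at generation time. The layer index, the `rap_direction` placeholder, and the `alpha` scale below are illustrative assumptions, not the values behind the released model.

```python
# Minimal activation-steering sketch (assumptions: layer choice, direction, scale).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

layer_idx = 14                                         # hypothetical layer choice
rap_direction = torch.randn(model.config.hidden_size)  # placeholder: would come from the activation analysis
rap_direction = rap_direction / rap_direction.norm()
alpha = 4.0                                            # steering strength (assumption)

def steering_hook(module, inputs, output):
    # LlamaDecoderLayer returns a tuple; the first element is the hidden states
    hidden = output[0] + alpha * rap_direction.to(output[0].dtype).to(output[0].device)
    return (hidden,) + output[1:]

handle = model.model.layers[layer_idx].register_forward_hook(steering_hook)

input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Explain gravity."}],
    add_generation_prompt=True, return_tensors="pt",
).to(model.device)
out = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(out[0][input_ids.shape[-1]:], skip_special_tokens=True))

handle.remove()  # detach the hook when done
```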


𝐖𝐡𝐚𝐭 𝐬𝐩𝐚𝐫𝐤𝐞𝐝 𝐭𝐡𝐢𝐬 𝐢𝐝𝐞𝐚?

Lately, I've become interested in the mechanistic interpretability of LLMs.

💡 A recent paper, "Refusal in Language Models Is Mediated by a Single Direction," showed how to find the refusal direction in the activation space of Chat Language Models and either erase or amplify it.
A clever jailbreak method for open-weights models.
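
In activation terms, "erase" and "amplify" are both simple vector operations on the residual stream. A tiny illustrative sketch (the tensor shapes and the `alpha` scale are assumptions, not the paper's code):

```python
import torch

def erase_direction(h: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
    # Directional ablation: remove the component of the activations h along the unit direction r
    return h - (h @ r).unsqueeze(-1) * r

def amplify_direction(h: torch.Tensor, r: torch.Tensor, alpha: float = 4.0) -> torch.Tensor:
    # Activation addition: push the activations further along r
    return h + alpha * r

h = torch.randn(2, 10, 4096)   # (batch, seq_len, hidden_dim) residual-stream activations
r = torch.randn(4096)
r = r / r.norm()               # the "refusal" (or "rap") direction, unit-normalized
print(erase_direction(h, r).shape, amplify_direction(h, r).shape)  # both (2, 10, 4096)
```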

Then, @failspy took it a step further by modifying the models to amplify different traits, such as making a model seem grumpy or irritable.


𝐇𝐨𝐰 𝐝𝐢𝐝 𝐈 𝐜𝐫𝐞𝐚𝐭𝐞 𝐲𝐨-𝐋𝐥𝐚𝐦𝐚?
(📓 notebook in the HF repository, heavily inspired by Failspy's work; a condensed code sketch follows the list below)

1️⃣ Load the Llama-3-8B-Instruct model.
2️⃣ Load 1024 examples from Alpaca (instruction dataset).
3️⃣ Prepare a system prompt to make the original model act like a rapper.
4️⃣ Run inference on the examples, with and without the system prompt, and cache the activations.
5️⃣ Compute the rap feature directions (one for each layer) from the activations.
6️⃣ Apply the feature directions one by one, checking the results on some examples.
7️⃣ Pick the best-performing feature direction.
8️⃣ Apply this feature direction and voilà!
yo-Llama-3-8B-Instruct is born! 🥳🎶
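
For orientation, here is a condensed sketch of steps 2-7 using plain transformers hidden states. It is not the notebook's actual code (which builds on Failspy's abliterator approach), and the system prompt, dataset id, and variable names are assumptions.

```python
# Condensed sketch of steps 2-7: cache activations with/without a rap system prompt,
# take per-layer difference-of-means directions, then try them one by one.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

RAP_SYSTEM_PROMPT = "You are a rapper. Always answer with a rap."  # assumption, not the notebook's prompt

instructions = load_dataset("tatsu-lab/alpaca", split="train[:1024]")["instruction"]

@torch.no_grad()
def last_token_hidden_states(instruction, system_prompt=None):
    """Return a (num_layers, hidden_dim) tensor of last-token activations."""
    messages = ([{"role": "system", "content": system_prompt}] if system_prompt else []) + \
               [{"role": "user", "content": instruction}]
    ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True,
                                        return_tensors="pt").to(model.device)
    out = model(ids, output_hidden_states=True)
    # out.hidden_states is a tuple of (num_layers + 1) tensors of shape (1, seq, hidden);
    # skip the embedding output and keep the last position of each layer
    return torch.stack([h[0, -1, :].float().cpu() for h in out.hidden_states[1:]])

# 4) cache activations with and without the rap system prompt
rap_acts, plain_acts = [], []
for instr in instructions:
    rap_acts.append(last_token_hidden_states(instr, RAP_SYSTEM_PROMPT))
    plain_acts.append(last_token_hidden_states(instr))

# 5) one candidate "rap direction" per layer: difference of mean activations
rap_mean = torch.stack(rap_acts).mean(dim=0)      # (num_layers, hidden_dim)
plain_mean = torch.stack(plain_acts).mean(dim=0)
directions = rap_mean - plain_mean
directions = directions / directions.norm(dim=-1, keepdim=True)

# 6-7) each row of `directions` can be plugged into the steering hook from the
# first sketch; generate on a few prompts per layer and keep the one that raps best.
torch.save(directions, "rap_directions.pt")
```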

This was a fun experiment.


📚 Resources

Refusal in Language Models Is Mediated by a Single Direction - https://arxiv.org/abs/2406.11717

Uncensor any LLM with abliteration, a great practical blog post by @mlabonne - https://huggingface.co/blog/mlabonne/abliteration

Practical materials by @failspy
- abliterator library: https://github.com/FailSpy/abliterator
- Llama-3-8B-Instruct-MopeyMule model (+ notebook): failspy/Llama-3-8B-Instruct-MopeyMule