How to alter the behavior of a Language Model without fine-tuning or prompting? Say hello to yo-Llama!
Model anakin87/yo-Llama-3-8B-Instruct
This experiment steers Llama-3-8B-Instruct to respond in a rap style.
How? Amplifying the rap direction in the activation space.
What sparked this idea?
Lately, I became interested in the mechanistic interpretability of LLMs.
A recent paper, "Refusal in Language Models Is Mediated by a Single Direction," showed how to find the refusal direction in the activation space of chat language models and either erase or amplify it.
A clever jailbreak method for open-weights models.
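To make "erase or amplify" concrete, here is a minimal NumPy sketch on toy vectors. It assumes the direction is already known; the helper names (`erase_direction`, `amplify_direction`) and the `alpha` strength parameter are hypothetical and are not taken from the paper's code.

```python
import numpy as np


def unit(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)


def erase_direction(x, direction):
    """Remove the component of x that lies along direction."""
    r = unit(direction)
    return x - (x * r).sum(axis=-1, keepdims=True) * r


def amplify_direction(x, direction, alpha):
    """Shift x along direction by alpha (hypothetical strength parameter)."""
    return x + alpha * unit(direction)


# Toy activation and toy direction, standing in for real model activations.
x = np.array([3.0, 4.0, 0.0])
d = np.array([0.0, 2.0, 0.0])

erased = erase_direction(x, d)                  # component along d removed
amplified = amplify_direction(x, d, alpha=2.0)  # pushed further along d
```

Both helpers also accept a batch of activations with shape (n, d_model), since the projection is taken over the last axis.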
Then, @failspy took it a step further by modifying the models to amplify different traits, such as making a model seem grumpy or irritable.
How did I create yo-Llama?
(notebook in the HF repository, heavily inspired by @failspy's work)
1️⃣ Load the Llama-3-8B-Instruct model.
2️⃣ Load 1024 examples from Alpaca (instruction dataset).
3️⃣ Prepare a system prompt to make the original model act like a rapper.
4️⃣ Run inference on the examples, with and without the system prompt, and cache the activations.
5️⃣ Compute the rap feature directions (one for each layer) from the activations.
6️⃣ Apply the feature directions one by one, checking the results on some examples.
7️⃣ Pick the best-performing feature direction.
8️⃣ Apply this feature direction and voilà!
yo-Llama-3-8B-Instruct is born! 🥳
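Steps 4 and 5 can be pictured as a difference of means between the two sets of cached activations, which is the approach the refusal-direction paper describes. The sketch below runs on synthetic arrays rather than real Llama activations; the toy sizes, the simulated shift along axis 0, and the name `feature_directions` are assumptions for illustration, and the actual notebook may compute its directions differently.

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, n_prompts, d_model = 4, 64, 16  # toy sizes, not Llama's real dimensions

# Synthetic stand-ins for cached activations, shape (layer, prompt, d_model).
plain_acts = rng.normal(size=(n_layers, n_prompts, d_model))
# Pretend the rapper system prompt shifts every activation along axis 0.
style_shift = np.zeros(d_model)
style_shift[0] = 3.0
rap_acts = rng.normal(size=(n_layers, n_prompts, d_model)) + style_shift


def feature_directions(with_prompt, without_prompt):
    """Return one unit-norm difference-of-means direction per layer."""
    diff = with_prompt.mean(axis=1) - without_prompt.mean(axis=1)
    return diff / np.linalg.norm(diff, axis=-1, keepdims=True)


directions = feature_directions(rap_acts, plain_acts)  # shape (n_layers, d_model)

# Sanity check per layer: the two prompt sets should separate along the direction.
for layer, direction in enumerate(directions):
    gap = (rap_acts[layer] @ direction).mean() - (plain_acts[layer] @ direction).mean()
    print(f"layer {layer}: mean projection gap = {gap:.2f}")
```

The projection gap is only a rough numerical proxy. In the post, candidate directions are compared by applying them and reading the model's responses on some examples.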
This was a fun experiment.
Resources
Refusal in Language Models Is Mediated by a Single Direction - https://arxiv.org/abs/2406.11717
Uncensor any LLM with abliteration: great practical blog post by @mlabonne https://huggingface.co/blog/mlabonne/abliteration
Practical materials by @failspy
- abliterator library https://github.com/FailSpy/abliterator
- Llama-MopeyMule-3-8B-Instruct model (+ notebook) failspy/Llama-3-8B-Instruct-MopeyMule