Stefano Fiorucci

anakin87

AI & ML interests

Contributing to Haystack, the LLM Framework 🏗️. NLP / LLMs.


Posts 5

How to alter the behavior of a Language Model without fine-tuning or prompting? Say hello to 🎤 yo-Llama 🦙!

Model: anakin87/yo-Llama-3-8B-Instruct

This experiment steers Llama-3-8B-Instruct to respond in a rap style.
How? By amplifying the rap direction in the activation space. 😎


What sparked this idea?

Lately, I've become interested in the mechanistic interpretability of LLMs.

💡 A recent paper, "Refusal in Language Models Is Mediated by a Single Direction," showed how to find the refusal direction in the activation space of chat language models and either erase or amplify it.
A clever jailbreak method for open-weight models.
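
In a nutshell (my own toy sketch, not the paper's code): once you have a unit vector r_hat for the refusal direction, you can project it out of the residual-stream activations to erase the behavior, or add a scaled copy of it to amplify it. The names and the alpha value below are mine.

```python
# Toy sketch of the idea (not the paper's code).
# r_hat: unit vector for the "refusal direction" found in the activation space.
# x: residual-stream activations of shape (..., hidden_size).
import torch

def erase_direction(x: torch.Tensor, r_hat: torch.Tensor) -> torch.Tensor:
    # Remove the component of x along r_hat: x - (x · r_hat) r_hat
    return x - (x @ r_hat).unsqueeze(-1) * r_hat

def amplify_direction(x: torch.Tensor, r_hat: torch.Tensor, alpha: float = 5.0) -> torch.Tensor:
    # Push x further along r_hat (alpha is a strength I picked arbitrarily)
    return x + alpha * r_hat
```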

Then, @failspy took it a step further by modifying the models to amplify different traits, such as making a model seem grumpy or irritable.


How did I create yo-Llama?
(📓 notebook in the HF repository, heavily inspired by Failspy's work)

1️⃣ Load the Llama-3-8B-Instruct model.
2️⃣ Load 1024 examples from Alpaca (instruction dataset).
3️⃣ Prepare a system prompt to make the original model act like a rapper.
4️⃣ Run inference on the examples, with and without the system prompt, and cache the activations.
5️⃣ Compute the rap feature directions (one for each layer) from the activations (see the sketch below).
6️⃣ Apply the feature directions one by one, checking the results on some examples.
7️⃣ Pick the best-performing feature direction.
8️⃣ Apply this feature direction and voilà!
yo-Llama-3-8B-Instruct is born! 🥳🎶
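
For the curious, here's roughly what steps 5-8 boil down to (a minimal sketch of the idea, not the actual notebook code; the hook-based steering, the layer/alpha choices, and the helper names are my assumptions):

```python
# Minimal sketch, not the actual notebook. Assumes `model` is a Hugging Face
# LlamaForCausalLM and that per-layer activations were already cached (steps 4-5)
# as tensors of shape (n_examples, hidden_size).
import torch

def rap_direction(acts_with_prompt: torch.Tensor, acts_without_prompt: torch.Tensor) -> torch.Tensor:
    # Difference of mean activations (rapper system prompt vs. plain), normalized.
    direction = acts_with_prompt.mean(dim=0) - acts_without_prompt.mean(dim=0)
    return direction / direction.norm()

def steering_hook(direction: torch.Tensor, alpha: float = 8.0):
    # Forward hook that adds the scaled rap direction to a decoder layer's output.
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * direction.to(hidden.device, hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

# Hypothetical usage, with `layer` being the best-performing layer from step 7:
# direction = rap_direction(acts_with[layer], acts_without[layer])
# handle = model.model.layers[layer].register_forward_hook(steering_hook(direction))
# ... model.generate(...) now raps; call handle.remove() to restore normal behavior.
```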

This was a fun experiment.


📚 Resources

Refusal in Language Models Is Mediated by a Single Direction - https://arxiv.org/abs/2406.11717

Uncensor any LLM with abliteration: a great practical blog post by @mlabonne - https://huggingface.co/blog/mlabonne/abliteration

Practical materials by @failspy
- abliterator library: https://github.com/FailSpy/abliterator
- Llama-MopeyMule-3-8B-Instruct model (+ notebook): failspy/Llama-3-8B-Instruct-MopeyMule
🌌 Creating adventures with local LLMs

What if 🤔... Homer Simpson met Spider-Man and they went on a quest for donuts? 🍩
Or if Fred Astaire and Corporal Hicks teamed up to fight xenomorphs? 👾

In the words of Karpathy, LLMs are dream machines...
they seem specially made to simulate these wild scenarios!

Experimenting with this idea 👇
Nous Research / @teknium recently released NousResearch/CharacterCodex:
a massive dataset with information on 16k characters, both fictional and real.
I couldn't wait to play with it...

After a few attempts, I found that combining the information in this dataset with a good model (like meta-llama/Meta-Llama-3-8B-Instruct) opens the doors to a myriad of chat adventures.

🛠️ Stack:
🔹 Haystack for orchestration 🏗️
🔹 llamafile 🦙🗂️ to run our model locally (rough sketch below).
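
Roughly, the glue code looks like this (a hand-wavy sketch, not the actual notebook; the Character Codex field names, the llamafile port, and the served model name are my assumptions):

```python
# Hand-wavy sketch (not the actual notebook): llamafile exposes an OpenAI-compatible
# server, typically at http://localhost:8080/v1, so Haystack's OpenAIChatGenerator
# can talk to the local model directly.
from datasets import load_dataset
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage
from haystack.utils import Secret

codex = load_dataset("NousResearch/CharacterCodex", split="train")
# Field names below are my guess at the dataset schema; adjust to the real columns.
heroes = [row for row in codex if row["character_name"] in ("Homer Simpson", "Spider-Man")]

system_prompt = "You are the narrator of a chat adventure featuring:\n" + "\n".join(
    f"- {row['character_name']}: {row['description']}" for row in heroes
)

generator = OpenAIChatGenerator(
    api_key=Secret.from_token("sk-no-key-required"),  # the local server ignores the key
    model="LLaMA_CPP",                                # placeholder name for llamafile (assumption)
    api_base_url="http://localhost:8080/v1",          # default llamafile server address
)

result = generator.run(messages=[
    ChatMessage.from_system(system_prompt),
    ChatMessage.from_user("They team up on a quest for donuts. Start the adventure!"),
])
print(result["replies"][0].text)  # use .content on older Haystack 2.x versions
```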

📓 Check out the notebook: https://t.ly/y6jrZ
(includes a bonus 🕵️ Mystery Character Quiz)