I was excited to explore Llama 3.2, but as a simple 🇪🇺 EU guy, I don't have access to Meta's multimodal models 😿
🤔 So I thought: why not challenge the small 3B text model with Agentic RAG?
🎯 The plan:
- Build a system that tries to answer questions using a knowledge base.
- If the documents don't contain the answer, use web search for additional context.
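The fallback logic can be sketched in a few lines. Everything here (`KNOWLEDGE_BASE`, `retrieve`, `web_search`) is an illustrative stub, not a real library API; in the actual system the "documents are insufficient" decision is made by the LLM itself, not by a dictionary lookup:

```python
from typing import Optional

# Sketch of the agentic RAG control flow with stub components.
KNOWLEDGE_BASE = {
    "what is spectrum?": "Spectrum selects informative layers for fine-tuning.",
}

def retrieve(question: str) -> Optional[str]:
    """Look the question up in the local knowledge base (stub)."""
    return KNOWLEDGE_BASE.get(question.lower())

def web_search(question: str) -> str:
    """Stub standing in for a real web-search tool."""
    return f"[top web snippets for: {question}]"

def answer(question: str) -> str:
    context = retrieve(question)
    if context is None:
        # The documents don't contain the answer: fall back to web search.
        context = web_search(question)
    return f"Answer grounded in: {context}"
```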
Looking to fine-tune Language Models efficiently and save on computational resources?
One popular method is QLoRA, which quantizes the original model and trains low-rank adapters on top. It's quite effective and uses less GPU memory than full fine-tuning.
However, QLoRa applies Low-Rank Adaptation uniformly across the entire model.
What if we could identify the most informative layers and only fine-tune those? 🤔
This is exactly what Spectrum does! 🚀
🔬 Spectrum analyzes the weight matrices of all layers in a Language Model and computes a Signal-to-Noise Ratio (SNR) for each one. (It uses Random Matrix Theory and the Marchenko-Pastur distribution to distinguish signal from noise.)
🎯 Based on a chosen percentage (say, 25%), Spectrum selects the most informative layers of each type (mlp.down_proj, self_attn.o_proj, etc.).
You can then ❄️ freeze the rest of the model and focus your 🏋️‍♀️ training on the chosen layers.
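A toy version of the idea: for an n × m matrix of i.i.d. noise with standard deviation σ, the largest singular value concentrates near σ(√n + √m), the edge of the Marchenko-Pastur bulk, so singular values clearly above that edge can be treated as signal. This sketch assumes σ is known; Spectrum's actual estimator is more careful (e.g. it estimates the noise scale from the spectrum itself):

```python
import numpy as np

def signal_count(weight: np.ndarray, sigma: float = 1.0) -> int:
    """Count singular values above the Marchenko-Pastur bulk edge (toy version)."""
    n, m = weight.shape
    edge = sigma * (np.sqrt(n) + np.sqrt(m))  # asymptotic largest noise singular value
    s = np.linalg.svd(weight, compute_uv=False)
    return int((s > edge).sum())

def select_layers(layer_weights: dict, fraction: float = 0.25) -> list:
    """Rank layers by signal content and keep the top fraction."""
    scores = {name: signal_count(w) for name, w in layer_weights.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return ranked[:k]
```

A matrix with a planted low-rank component will score higher than pure noise, which is the property the layer ranking relies on.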
📊 Results/Evaluation
- Spectrum is competitive with full fine-tuning and beats QLoRA on benchmarks.
- While QLoRA is more memory-efficient on a single GPU, Spectrum shines in distributed training setups.
- Great models trained with Spectrum: the Dolphin models, Llama 3.1 Storm, numerous models by VAGO Solutions...
---
For a practical guide, check out the article above.
🎯 Targeted training with Spectrum
I used Spectrum, a relatively new technique for parameter-efficient fine-tuning. The idea is to train only the layers of the model with a high Signal-to-Noise Ratio (SNR) and ❄️ freeze the rest. I trained the top 30% of model layers.
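Once the layers are chosen, the freezing step itself is a one-liner over the model's parameters. A minimal sketch on a toy `torch` model (the marker strings would come from Spectrum's selection; here they are just substrings of parameter names):

```python
import torch.nn as nn

def freeze_except(model: nn.Module, trainable_markers: list) -> None:
    """Freeze every parameter whose name matches none of the markers."""
    for name, param in model.named_parameters():
        param.requires_grad = any(marker in name for marker in trainable_markers)

# Toy model standing in for an LLM: train only the last "layer".
model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 4), nn.Linear(4, 4))
freeze_except(model, ["2."])
```

The optimizer then only updates parameters with `requires_grad=True`, which is what keeps the memory and compute budget down.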
Lately, I've become interested in the mechanistic interpretability of LLMs.
💡 A recent paper, "Refusal in Language Models Is Mediated by a Single Direction," showed how to find the refusal direction in the activation space of Chat Language Models and either erase or amplify it. A clever jailbreak method for open-weights models.
Then, @failspy took it a step further by modifying the models to amplify different traits, such as making a model seem grumpy or irritable.
𝐇𝐨𝐰 𝐝𝐢𝐝 𝐈 𝐜𝐫𝐞𝐚𝐭𝐞 𝐲𝐨-𝐋𝐥𝐚𝐦𝐚? (📓 notebook in the HF repository, heavily inspired by Failspy's work)
1️⃣ Load the Llama-3-8B-Instruct model.
2️⃣ Load 1024 examples from Alpaca (instruction dataset).
3️⃣ Prepare a system prompt to make the original model act like a rapper.
4️⃣ Run inference on the examples, with and without the system prompt, and cache the activations.
5️⃣ Compute the rap feature directions (one for each layer) from the activations.
6️⃣ Apply the feature directions one by one, checking the results on some examples.
7️⃣ Pick the best-performing feature direction.
8️⃣ Apply this feature direction and voilà! yo-Llama-3-8B-Instruct is born! 🥳🎶
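On synthetic activations, the core of steps 5 and 8 — a difference-of-means direction, then adding it back at inference — looks roughly like this. Real code hooks into the model's residual stream at each layer; nothing below is Failspy's actual implementation:

```python
import numpy as np

def feature_direction(acts_persona: np.ndarray, acts_plain: np.ndarray) -> np.ndarray:
    """Difference of mean activations (persona prompt vs. no persona), unit norm."""
    d = acts_persona.mean(axis=0) - acts_plain.mean(axis=0)
    return d / np.linalg.norm(d)

def amplify(activations: np.ndarray, direction: np.ndarray, alpha: float = 4.0) -> np.ndarray:
    """Push activations along the feature direction (subtracting its projection would erase it)."""
    return activations + alpha * direction
```

The scale `alpha` is exactly the kind of knob step 6 tunes by eyeballing generations.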
What if 🤔... Homer Simpson met Spider-Man and they went on a quest for donuts? 🍩 Or if Fred Astaire and Corporal Hicks teamed up to fight xenomorphs? 👾
In the words of Karpathy, LLMs are dream machines... they seem specially made to simulate these wild scenarios!
𝐄𝐱𝐩𝐞𝐫𝐢𝐦𝐞𝐧𝐭𝐢𝐧𝐠 𝐰𝐢𝐭𝐡 𝐭𝐡𝐢𝐬 𝐢𝐝𝐞𝐚
Nous Research / @teknium recently released NousResearch/CharacterCodex: a massive dataset with information on 16k characters, both fictional and real. I couldn't wait to play with it...
After a few attempts, I found that combining the information in this dataset with a good model (like meta-llama/Meta-Llama-3-8B-Instruct) opens the door to a myriad of chat adventures.
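The basic recipe is just turning a dataset entry into a role-play system prompt. A minimal sketch — the field names mirror the kind of entry CharacterCodex contains, but treat them (and the sample entry) as assumptions:

```python
def persona_prompt(entry: dict) -> str:
    """Build a role-play system prompt from a character entry (illustrative fields)."""
    return (
        f"You are {entry['character_name']}. "
        f"Description: {entry['description']} "
        "Stay in character at all times."
    )

# A made-up entry in the dataset's style, for illustration only.
entry = {
    "character_name": "Homer Simpson",
    "description": "A donut-loving family man from Springfield.",
}
```

The same prompt, fed as the system message to the chat model, is all it takes to start one of these crossover conversations.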
🛠️ Stack:
🔹 Haystack for orchestration
🔹 llamafile 🦙 to run our model locally
👉 Check out the notebook: https://t.ly/y6jrZ (includes a bonus 🕵️ Mystery Character Quiz)
When evaluating LLMs' responses, 𝐩𝐫𝐨𝐩𝐫𝐢𝐞𝐭𝐚𝐫𝐲 𝐦𝐨𝐝𝐞𝐥𝐬 like GPT-4 are commonly used due to their strong performance. However, relying on closed models presents challenges related to data privacy, transparency, controllability, and cost 💸.
On the other hand, 𝐨𝐩𝐞𝐧 𝐦𝐨𝐝𝐞𝐥𝐬 typically do not correlate well with human judgments and lack flexibility.
🔥 Prometheus 2 is a new family of open-source models designed to address these gaps:
🔹 two variants: prometheus-eval/prometheus-7b-v2.0 and prometheus-eval/prometheus-8x7b-v2.0
🔹 trained on open-source data
🔹 high correlation with human evaluations and proprietary models
🔹 highly flexible: capable of performing direct assessments and pairwise rankings, and allowing the definition of custom evaluation criteria
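In practice, a direct assessment means handing the evaluator model the instruction, the response, and your custom rubric in one prompt. An illustrative prompt builder — this is not the exact Prometheus template (see the model cards for that), just the shape of the input:

```python
def direct_assessment_prompt(instruction: str, response: str, rubric: str) -> str:
    """Assemble a rubric-based evaluation prompt (illustrative format)."""
    return (
        "Evaluate the response below against the rubric, "
        "then output feedback and a score from 1 to 5.\n"
        f"### Instruction:\n{instruction}\n"
        f"### Response:\n{response}\n"
        f"### Rubric:\n{rubric}\n"
    )
```

Pairwise ranking works the same way, except the prompt contains two responses and asks which one better satisfies the rubric.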
See my experiments with RAG evaluation in the links above.
When building applications with LLMs, writing effective prompts is a long process of trial and error. Often, if you switch models, you also have to change the prompt. 😩 What if you could automate this process?
💡 That's where DSPy comes in - a framework designed to algorithmically optimize prompts for Language Models. By applying classical machine learning concepts (training and evaluation data, metrics, optimization), DSPy generates better prompts for a given model and task.
Recently, I explored combining DSPy with the robustness of Haystack Pipelines.
Here's how it works:
▶️ Start from a Haystack RAG pipeline with a basic prompt
🎯 Define a goal (in this case, get correct and concise answers)
📊 Create a DSPy program, define data and metrics
✨ Optimize and evaluate -> improved prompt
🚀 Build a refined Haystack RAG pipeline using the optimized prompt
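DSPy's real API is richer (signatures, modules, teleprompters/optimizers), but the underlying loop — score candidate prompts on labeled data with a metric and keep the best — can be sketched without it. `run_llm`, the candidate prompts, and the metric are all stand-ins:

```python
def exact_match(prediction: str, gold: str) -> float:
    """Toy metric: 1.0 if the normalized answers coincide, else 0.0."""
    return float(prediction.strip().lower() == gold.strip().lower())

def optimize_prompt(candidates, dataset, run_llm, metric=exact_match):
    """Pick the candidate prompt with the best average metric on the dataset.

    `dataset` is a list of (question, gold_answer) pairs;
    `run_llm(prompt, question)` returns the model's answer as a string.
    """
    def avg_score(prompt):
        return sum(metric(run_llm(prompt, q), a) for q, a in dataset) / len(dataset)
    return max(candidates, key=avg_score)
```

Real optimizers also rewrite and bootstrap the candidates instead of just picking among fixed ones, which is where most of DSPy's value comes from.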
𝐇𝐨𝐰 𝐢𝐭 𝐰𝐨𝐫𝐤𝐬
You provide a URL -> a multiple-choice quiz is instantly generated.
🔹 You can play the quiz yourself.
🔹 You can let the LLM play in two different ways:
📕 Closed book: the LLM answers knowing only the general topic, relying on its parametric knowledge and reasoning abilities.
🌐🔍 Web RAG: for each question, a Google search is performed and the top 3 snippets are included in the prompt for the LLM.
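The Web-RAG mode boils down to stuffing the top search snippets into the question prompt. A sketch with a stubbed search function (the real app calls an actual search API; every name here is illustrative):

```python
def search(query: str, top_k: int = 3) -> list:
    """Stub standing in for a Google search; returns fake snippets."""
    return [f"snippet {i + 1} about {query}" for i in range(top_k)]

def web_rag_prompt(question: str, choices: list) -> str:
    """Build the Web-RAG prompt: top snippets + question + choices."""
    snippets = "\n".join(search(question))
    options = "\n".join(f"- {c}" for c in choices)
    return (
        f"Context from the web:\n{snippets}\n\n"
        f"Question: {question}\nChoices:\n{options}\n"
        "Answer with the correct choice."
    )
```

The closed-book mode is the same prompt minus the context block, which makes the two modes easy to compare head-to-head.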