Fine-tuning LLMs with Singular Value Decomposition

Community Article Published June 2, 2024

This blog post presents a parameter-cheap way of fine-tuning language models by means of truncated Singular Value Decomposition (SVD). The technique is similar to LoRA but uses considerably fewer trainable parameters.

In the following, the main goal is to reproduce the results in the recent paper LoRA-XS: Low-Rank Adaptation with Extremely Small Number of Parameters and make this technique easily available as a Python package.

Training with Low Rank Adaptation (LoRA)

The main idea behind LoRA is to decompose the weights of each linear transformation into two parts: an untrainable matrix, which contains the initial weights, and a trainable pair of matrices that project to and from a lower-dimensional latent space, whose dimension is typically a small power of two.

[Figure: LoRA decomposition, with the frozen weight matrix and two trainable low-rank projection matrices]

In the image above, the original weight matrix is frozen and only the two projection matrices are trainable.

This technique has been incredibly successful at enabling fine-tuning on cheaper GPU setups. By confining training strictly to the low-rank projection matrices, one can fine-tune all of a model's layers using roughly 1/1000 of the original parameter count.
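
To make the idea concrete, below is a minimal PyTorch sketch of a LoRA-style layer. It is an illustration of the decomposition, not the peft implementation: the wrapped linear layer is frozen and only the two low-rank matrices A and B receive gradients.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA-style layer: y = x W^T + x (B A)^T, with W frozen."""
    def __init__(self, linear: nn.Linear, r: int = 8):
        super().__init__()
        self.linear = linear
        self.linear.weight.requires_grad_(False)  # freeze the original weights
        if self.linear.bias is not None:
            self.linear.bias.requires_grad_(False)
        # Trainable low-rank factors; B starts at zero so that B A = 0 initially
        self.lora_A = nn.Parameter(torch.randn(r, linear.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(linear.out_features, r))

    def forward(self, x):
        return self.linear(x) + x @ (self.lora_B @ self.lora_A).T

layer = LoRALinear(nn.Linear(3072, 3072), r=8)
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # 2 * 8 * 3072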

Training only the Singular Values

Singular Value Decomposition approximates a matrix through a linear algebra technique that decomposes it into three factors. Given an original $m \times n$ weight matrix $M$ and an integer $q$, truncated SVD builds the approximation

$$\tilde{M} = U\,\Sigma\,V$$

by computing the $m \times q$ matrix $U$, the $q \times n$ matrix $V$, and the diagonal square $q \times q$ matrix $\Sigma$.

[Figure: truncated SVD of the $m \times n$ matrix $M$ into the factors $U$, $\Sigma$, and $V$]

Notice that while $\Sigma$ is a square matrix, its only non-zero elements lie on the diagonal and are called singular values. The value of $q$ is the number of singular values kept in the approximation. The rank $r$ of the original $m \times n$ matrix is defined within this post as $\min(m, n)$.
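
As a concrete illustration, the truncated factorisation can be computed with torch.linalg.svd by keeping only the first $q$ singular values. This is a sketch of the mathematics, not code taken from the svd-training package; the matrix shape and rank fraction below are arbitrary.

import torch

def truncated_svd(M: torch.Tensor, q: int):
    """Return U (m x q), the q singular values S, and V (q x n), so that M ~ U diag(S) V."""
    U, S, Vh = torch.linalg.svd(M, full_matrices=False)
    return U[:, :q], S[:q], Vh[:q, :]

M = torch.randn(512, 2048)
q = int(0.1 * min(M.shape))        # rank fraction of 0.1 -> q = 51
U, S, V = truncated_svd(M, q)
M_tilde = U @ torch.diag(S) @ V    # low-rank approximation of M
print(M_tilde.shape, (torch.linalg.norm(M - M_tilde) / torch.linalg.norm(M)).item())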

This is the core idea of SVD-training: instead of training the whole weight matrix $M$, we only train the singular values.

More specifically, let us note that the matrix $M$ is equivalent to

$$M = M - U\,\Sigma\,V + U\,\Sigma\,V$$

which can be rewritten as

$$M = M' + U\,\Sigma\,V$$

where $M' = M - U\,\Sigma\,V$.

If we freeze the parameters of $M'$, $U$, and $V$, then only the singular values are updated in the training loop. Memory-wise this is cheaper than LoRA, since the matrices $U$ and $V$ remain unchanged.
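
A minimal PyTorch sketch of such a layer is shown below. This is an illustration of the idea rather than the actual svd-training implementation: $M'$, $U$, and $V$ are stored as frozen buffers, and the vector of singular values is the only trainable parameter.

import torch
import torch.nn as nn

class SVDLinear(nn.Module):
    """Sketch of a layer computing y = x (M' + U diag(s) V)^T, with only s trainable."""
    def __init__(self, weight: torch.Tensor, q: int):
        super().__init__()
        U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
        U, S, Vh = U[:, :q], S[:q], Vh[:q, :]
        self.register_buffer("U", U)    # frozen: buffers receive no gradients
        self.register_buffer("V", Vh)   # frozen
        self.register_buffer("M_res", weight - U @ torch.diag(S) @ Vh)  # M' = M - U Sigma V
        self.singular_values = nn.Parameter(S)  # the only trainable tensor

    def forward(self, x):
        W = self.M_res + self.U @ torch.diag(self.singular_values) @ self.V
        return x @ W.T

layer = SVDLinear(torch.randn(3072, 3072), q=307)
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # 307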

[Figure: SVD-training decomposition, with frozen $M'$, $U$, and $V$ and trainable singular values $\Sigma$]

In the image above, only the singular values are trainable. All the other parameters are frozen.

This method is the SVD-training technique, which is very similar to the method described in LoRA-XS. The main difference is that LoRA-XS directly uses $M$ instead of $M' = M - U\,\Sigma\,V$.

Number of trainable parameters

In the following, I conduct some experiments on the model Phi-3-mini-128k-instruct, developed by Microsoft Research.

Using the SVD-training technique, the number of trainable parameters depends on the rank fraction $q/r$ used in the approximation. The SVD decomposition is applied to all linear layers. The tables below compare typical hyperparameter choices for SVD-training and LoRA. These are common rank fractions to employ when using SVD-training:

| SVD rank fraction | Trainable parameters |
|---|---|
| 0.1 | 239,283 |
| 0.2 | 278,886 |
| 0.5 | 397,824 |

This needs to be compared against the following common LoRA configurations:

| Low-rank dimension | Trainable parameters |
|---|---|
| 8 | 12,864,000 |
| 16 | 25,728,000 |
| 32 | 51,456,000 |

Please note that the low-rank dimension is not directly comparable to the SVD rank fraction. This is because the SVD approximation is applied to all linear layers, each with different dimensions: a rank fraction of 0.1 yields a different value of $q$ for each weight matrix.

For ease of comparison, the number of trainable parameters above does not include the token embeddings in either case. That count is a constant, which for the model under consideration is 98,500,608.
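
For reference, the orders of magnitude above can be reproduced with a small helper that walks over the model's linear layers and applies $q = \text{rank fraction} \times \min(m, n)$ to each. This is an illustrative sketch, not part of the svd-training API; depending on which layers the package includes and how $q$ is rounded, the totals may differ slightly from the table.

import torch.nn as nn
from transformers import AutoModelForCausalLM

def count_svd_trainable(model: nn.Module, rank_fraction: float) -> int:
    """Sum the number of singular values kept across all linear layers."""
    total = 0
    for module in model.modules():
        if isinstance(module, nn.Linear):
            m, n = module.weight.shape
            total += int(rank_fraction * min(m, n))  # q for this layer
    return total

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-128k-instruct", trust_remote_code=True
)
for rf in (0.1, 0.2, 0.5):
    print(rf, count_svd_trainable(model, rf))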

SVD-training library

A simple library that implements SVD-training is svd-training:

pip install svd-training

This library requires the model to be pre-processed in order to generate the matrices $M'$, $U$, $\Sigma$, and $V$:

from transformers import AutoTokenizer, AutoModelForCausalLM
from svd_training.svd_model import SVDForCausalLM

filename = "mistralai/Mistral-7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(filename)
model = AutoModelForCausalLM.from_pretrained(filename)

svd_model = SVDForCausalLM.create_from_model(model, rank_fraction=0.1) # Create the SVD model

### Train the model using your favourite training loop
...
###

svd_model.merge()  # Merge the SVD layers back into the model
svd_model.save_pretrained("svd_model/")  # Save the model

Eventually, all the elements are merged back into the original weight matrix. After the merge operation the model can be saved and loaded like any other Hugging Face model.
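
For example, the merged checkpoint saved above can be reloaded with the usual from_pretrained call. This is a sketch: the svd_model/ path is the one used in the snippet above, and the tokenizer is simply taken from the base checkpoint.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Reload the merged model like any other Hugging Face checkpoint
model = AutoModelForCausalLM.from_pretrained("svd_model/")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

inputs = tokenizer("Hello, how are you?", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))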

Fine-tuning on Ultrachat

As a test, the Phi-3 model above is fine-tuned on the Ultrachat dataset. This test includes full model fine-tuning, 8-dim LoRA, and SVD-training with 0.1 rank fraction. To the best of my understanding, the Ultrachat dataset was not part of the original training set for Phi-3.

All these models are trained using the standard SFTTrainer with different learning rates. These are the values chosen for each model after a brief hyperparameter sweep:

| Model | Learning rate |
|---|---|
| Full model fine-tuning (except embeddings) | 5e-5 |
| 8-dim LoRA | 1e-4 |
| SVD-training with r=0.1 | 1e-2 |
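
For completeness, the training setup looks roughly like the sketch below. It assumes a recent version of trl (argument names such as dataset_text_field have moved between SFTTrainer and SFTConfig across releases) and the HuggingFaceH4/ultrachat_200k dataset; the output directory and batch size are placeholders.

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer
from svd_training.svd_model import SVDForCausalLM

base = "microsoft/Phi-3-mini-128k-instruct"
tokenizer = AutoTokenizer.from_pretrained(base)
svd_model = SVDForCausalLM.create_from_model(
    AutoModelForCausalLM.from_pretrained(base), rank_fraction=0.1
)

# Format each list of chat messages into a single training string
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
dataset = dataset.map(
    lambda ex: {"text": tokenizer.apply_chat_template(ex["messages"], tokenize=False)}
)

trainer = SFTTrainer(
    model=svd_model,
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="phi3-svd-ultrachat",   # placeholder output directory
        dataset_text_field="text",
        learning_rate=1e-2,                # value from the table above
        num_train_epochs=1,
        per_device_train_batch_size=1,
    ),
)
trainer.train()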

The evaluation losses on the Ultrachat dataset are as follows (lower is better)

| Model | Ultrachat eval_loss |
|---|---|
| Original Phi-3-mini-128k-instruct | 4.01 |
| Full model fine-tuning (except embeddings) | 2.60 |
| 8-dim LoRA | 2.78 |
| SVD-training with r=0.1 | 2.67 |

The best result comes from full model fine-tuning; the second best comes from the SVD-training technique.

Conclusions

The SVD-training method seems to perform on par with the more parameter-heavy LoRA. This is remarkable, since SVD-training uses two orders of magnitude fewer parameters. While more work is needed to confirm these results, the method is quite promising.

A model with fewer trainable parameters is ostensibly less data-hungry. This might imply that the SVD technique is particularly useful when fine-tuning on small datasets. Future work will better define the conditions that make this technique most successful.

References

[1] LoRA-XS: Low-Rank Adaptation with Extremely Small Number of Parameters

[2] The Phi-3 model SVD-tuned on Ultrachat