Low Rank Adaptation
2024/08/02
1663 words
8 mins

 

Parameter-Efficient LLM Finetuning With Low-Rank Adaptation (LoRA)

Pretrained models are often referred to as foundation models for a good reason: they perform well on various tasks, and we can use them as a foundation for finetuning on a target task.

Still, finetuning can be computationally very costly: the larger the model, the more expensive it is to update its layers.

As an alternative to updating all layers, parameter-efficient methods such as prefix tuning and adapters have been developed. This article focuses on one such parameter-efficient finetuning technique: low-rank adaptation (LoRA) by Hu et al.

What is LoRA? How does it work? And how does it compare to the other popular finetuning approaches? Let’s see.

Making Weight Updates More Efficient

The method proposes to decompose the weight changes, ΔW, into a lower-rank representation. (To be technically precise, LoRA does not decompose the matrices directly; instead, it learns the decomposed matrices via backpropagation.)

Before we take a closer look at LoRA, let’s briefly explain the training procedure during regular finetuning.

What are weight changes ΔW?

Suppose W represents the weight matrix in a given neural network layer. Then, using regular backpropagation, we can obtain the weight update ΔW, which is typically calculated as the negative gradient of the loss with respect to W times the learning rate:

ΔW = α (-∇ L_W).

Then, when we have ΔW, we can update the original weights as follows: W' = W + ΔW.

Alternatively, we can keep the weight update matrix separate and compute the outputs as follows: h = W x + ΔW x,

where x represents the inputs.
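
To make this concrete, here is a minimal PyTorch sketch of the two equivalent views; the tensor shapes and variable names are illustrative assumptions, not anything specific to LLaMA.

```python
import torch

torch.manual_seed(123)

x = torch.randn(10)            # example input vector
W = torch.randn(20, 10)        # original weight matrix of a layer
delta_W = torch.randn(20, 10)  # weight update obtained via backpropagation

# Option 1: fold the update into the weights, then compute the output
W_prime = W + delta_W
h1 = W_prime @ x

# Option 2: keep the update separate and sum the two partial outputs
h2 = W @ x + delta_W @ x

print(torch.allclose(h1, h2, atol=1e-6))  # True -- both views are equivalent
```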

Why would we do this?

When we train fully connected (i.e., “dense”) layers in a neural network, the weight matrices usually have full rank, which is a technical term meaning that a matrix does not have any linearly dependent (i.e., “redundant”) rows or columns. In contrast to full rank, low rank means that the matrix has redundant rows or columns.

So, while the weights of a pretrained model have full rank on the pretrained tasks, the LoRA authors point out that pretrained large language models have a low “intrinsic dimension” when they are adapted to a new task, according to Aghajanyan et al. (2020).

A low intrinsic dimension means the data can be effectively represented or approximated by a lower-dimensional space while retaining most of its essential information or structure. In other words, this means we can decompose the new weight matrix for the adapted task into lower-dimensional (smaller) matrices without losing too much important information.
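
To build some intuition for what “low rank” means here, the NumPy sketch below constructs a matrix that has rank 5 and reconstructs it almost perfectly from a rank-5 truncated SVD. This is purely illustrative and not part of LoRA itself, which learns its low-rank factors via backpropagation rather than computing an SVD.

```python
import numpy as np

rng = np.random.default_rng(0)

# Construct a 100 x 500 matrix that has rank 5 by construction
A = rng.standard_normal((100, 5)) @ rng.standard_normal((5, 500))

# Truncated SVD: keep only the top-r singular values and vectors
U, S, Vt = np.linalg.svd(A, full_matrices=False)
r = 5
A_approx = U[:, :r] @ np.diag(S[:r]) @ Vt[:r, :]

# The relative reconstruction error is essentially zero
print(np.linalg.norm(A - A_approx) / np.linalg.norm(A))
```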

For example, suppose ΔW is the weight update for an A × B weight matrix. Then, we can decompose the weight update matrix into two smaller matrices: ΔW = WA WB, where WA is an A × r-dimensional matrix and WB is an r × B-dimensional matrix. Here, we keep the original weights W frozen and only train the new matrices WA and WB. That's the LoRA method in a nutshell.
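
A minimal PyTorch sketch of this idea is shown below: the pretrained weight stays frozen, and only the two small factors (`lora_A` and `lora_B`, playing the roles of WA and WB) are trainable. The alpha scaling factor and the zero-initialization of one factor follow the LoRA paper, but treat the exact values and layer dimensions as illustrative assumptions rather than a faithful copy of any reference implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, rank=8, alpha=16):
        super().__init__()
        # Pretrained weight W, kept frozen during finetuning
        self.linear = nn.Linear(in_features, out_features, bias=False)
        self.linear.weight.requires_grad = False

        # Trainable low-rank factors: delta_W = lora_B @ lora_A
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        # h = W x + (WA WB) x, scaled by alpha / r
        return self.linear(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)


layer = LoRALinear(in_features=100, out_features=500, rank=5)
out = layer(torch.randn(1, 100))
print(out.shape)  # torch.Size([1, 500])
```

Because `lora_B` starts at zero, the layer initially behaves exactly like the pretrained layer, and training only gradually moves it away from the base model.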

Choosing the rank

Note that r in the formula above is a hyperparameter that specifies the rank of the low-rank matrices used for adaptation. A smaller r leads to simpler low-rank matrices, which results in fewer parameters to learn during adaptation. This can lead to faster training and potentially reduced computational requirements. However, with a smaller r, the capacity of the low-rank matrices to capture task-specific information decreases. This may result in lower adaptation quality, and the model might not perform as well on the new task as with a higher r. In summary, choosing r in LoRA involves a trade-off between model complexity, adaptation capacity, and the risk of underfitting or overfitting. It's thus important to experiment with different r values to find the right balance and achieve the desired performance on the new task.

Parameter efficiency

Now, how is this parameter efficient if we introduce new weight matrices? The new matrices WA and WB can be very small. For example, suppose A=100 and B=500; then the size of ΔW is 100 × 500 = 50,000. Now, suppose we decompose this into two smaller matrices: a 100 × 5-dimensional matrix WA and a 5 × 500-dimensional matrix WB. These two matrices only have 5 × 100 + 5 × 500 = 3,000 parameters in total.
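
The arithmetic is easy to double-check in a couple of lines:

```python
A, B, r = 100, 500, 5

full_update_params = A * B    # 50,000 parameters in delta_W
lora_params = A * r + r * B   # 100*5 + 5*500 = 3,000 parameters

print(full_update_params, lora_params, full_update_params / lora_params)
# 50000 3000 16.67 -> roughly 16x fewer trainable parameters
```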

Reducing inference overhead

Note that in practice, if we keep the original weights W and the matrices WA and WB separate after training as shown above, we will incur a small efficiency penalty during inference as this introduces an additional computation step. Instead, we can update the weights after training via W' = W + WA WB, which is analogous to W' = W + ΔW mentioned earlier.
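
Continuing the `LoRALinear` sketch from above, merging could look roughly like the hypothetical helper below; real implementations such as the one in Lit-LLaMA differ in the details.

```python
import torch

@torch.no_grad()
def merge_lora_weights(layer):
    # Fold the low-rank update into the frozen weight: W' = W + scaling * (lora_B @ lora_A)
    delta_W = layer.lora_B @ layer.lora_A            # shape: (out_features, in_features)
    layer.linear.weight += layer.scaling * delta_W
    # Zero the factor so the forward pass does not add the update a second time
    layer.lora_B.zero_()
```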

However, there can be practical advantages to keeping the weight matrices WA and WB separate. For example, imagine we want to keep our pretrained model as a base model for various use cases and create a separately finetuned model for each of them. In this case, we don't need to store a full set of updated weights W' = W + WA WB for every use case, which can be very costly since such models typically have billions to trillions of weight parameters. Instead, we can keep the single original model W and only need to store the new lightweight matrices WA and WB for each use case.

To illustrate this point with concrete numbers, a full 7B LLaMA checkpoint requires 23 GB of storage capacity, while the LoRA weights can be as small as 8 MB (if we choose a rank of r=8).
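
As a rough back-of-the-envelope check (assuming, as in the LoRA paper, that the low-rank matrices are only added to the query and value projections of each of LLaMA 7B's 32 transformer blocks, with hidden dimension 4096 and 16-bit weights):

```python
hidden_dim, rank = 4096, 8
n_layers, matrices_per_layer = 32, 2  # query and value projections per block

params_per_matrix = hidden_dim * rank + rank * hidden_dim  # WA and WB
total_lora_params = n_layers * matrices_per_layer * params_per_matrix

print(total_lora_params)                   # ~4.2M trainable parameters
print(total_lora_params * 2 / 1e6, "MB")   # ~8 MB at 2 bytes per weight
```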

How good is it in practice?

How good is LoRA in practice, and how does it compare to full finetuning and other parameter-efficient approaches? According to the LoRA paper, models using LoRA perform slightly better than models using adapters, prompt tuning, or prefix tuning across several task-specific benchmarks. Often, LoRA performs even better than finetuning all layers, as the annotated results in the LoRA paper show. (ROUGE is a metric for evaluating the quality of generated text, originally developed for summarization.)

Here, it’s worth noting that LoRA is orthogonal to the other finetuning methods, meaning it can also be combined with prefix tuning and adapters, for example.

In the next section, we will compare the 7B LLaMA base model with the 7B LLaMA base model finetuned using LoRA and LLaMA-Adapter. (Note that this requires a GPU with at least 24 GB of RAM.)

Computational Performance Benchmarks

The finetuning dataset is the Alpaca 52k instruction dataset described here.

The dataset itself was generated following the method described in the Self-Instruct paper and consists of 49,759 training examples and 2,000 validation examples. The Self-Instruct procedure can be summarized in 4 steps (a toy sketch follows the list below):

  1. Seed the task pool with a set of human-written instructions (175 in this case) and sample instructions from it
  2. Use a pretrained model to determine the task category
  3. Given the new instruction, let a pretrained model generate the response
  4. Collect, prune, and filter the responses before adding them to the task pool.
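
The toy sketch below walks through this loop with hypothetical placeholder functions (`generate_instruction`, `generate_response`, and `passes_filters` stand in for calls to a pretrained LLM and are not real APIs; step 2, the task-category classification, is omitted for brevity):

```python
import random

random.seed(0)

# Hypothetical stand-ins for calls to a pretrained LLM -- purely illustrative
def generate_instruction(example):
    return f"New instruction inspired by: {example}"

def generate_response(instruction):
    return f"Model-generated answer to: {instruction}"

def passes_filters(instruction, pool):
    return all(instruction != task for task, _ in pool)  # naive deduplication

# Step 1: seed the task pool with human-written instructions (175 in the real pipeline)
task_pool = [("Explain how weight decay works.", None), ("Summarize the article.", None)]

while len(task_pool) < 10:
    example = random.choice([task for task, _ in task_pool])  # sample from the pool
    instruction = generate_instruction(example)               # propose a new task
    response = generate_response(instruction)                 # step 3: generate an answer
    if passes_filters(instruction, task_pool):                # step 4: prune and filter
        task_pool.append((instruction, response))

print(len(task_pool))  # 10 instruction/response pairs
```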

Note that the Alpaca 52k dataset was collected using the automated self-instruct procedure above. However, you may also use (or compare it with) an alternative dataset. For example, an interesting candidate is the open-source databricks-dolly-15k dataset, which contains ~15k instruction/response finetuning records. The Lit-LLaMA repository contains a dataset preparation script in case you want to use the Dolly 15k dataset instead of the Alpaca 52k dataset; since it is smaller, finetuning will also finish quicker.

Given the following hyperparameter settings (block size, batch size, and LoRA r), both Adapter and LoRA can finetune the 7B-parameter LLaMA base model on a single GPU with 24 GB of RAM using bfloat16 mixed-precision training.
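
For reference, such a configuration might look roughly like the snippet below; these values are illustrative assumptions, not the exact settings used for the benchmark numbers that follow.

```python
# Illustrative hyperparameter settings (assumed values, not the exact benchmark config)
block_size = 512             # maximum context length per training example
micro_batch_size = 4         # per-step batch size that fits in 24 GB with bfloat16
gradient_accumulation = 32   # effective batch size = 4 * 32 = 128
lora_r = 8                   # rank of the low-rank update matrices
lora_alpha = 16              # LoRA scaling factor
learning_rate = 3e-4
```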

For comparison, Adapter uses about 22 GB of memory and finishes the 62,400 iterations in 161 min. LoRA uses 21 GB of memory and finishes in about 192 min. In sum, Adapter and LoRA use approximately the same amount of RAM and take roughly the same time to train.

For comparison, full finetuning (LLaMA 7B consists of 32 transformer blocks and 3 fully connected output layers) requires at least 2 GPUs with at least 30 GB each and fully sharded training to distribute the weights. Alternatively, you can use 4 GPUs with a maximum memory usage of 22 GB per GPU. Training on 4 GPUs takes much longer: 1,956 min. This would be at least 6,000 min on a single GPU, which would be 30-40x more expensive than the parameter-efficient LLaMA-Adapter or LoRA alternatives.

Next, let’s look at the model outputs after applying the different finetuning strategies.

Evaluating Modeling Performance

There are several metrics for evaluating the text generated by LLMs. For example, perplexity, BLEU, and ROUGE scores are some of the most common evaluation metrics used in natural language processing to assess the performance of models across various tasks. However, all of these metrics have substantial shortcomings, and human evaluations remain more accurate.

A Qualitative Comparison

To compare the models qualitatively, we will use the prompt "Please explain how weight decay works".

> Please explain how weight decay works
>
> In the simplest terms, weight decay removes features from the model (as a function of how important they are). It is also known as regularization.
>
> Time for inference 1: 4.18 sec total, 11.95 tokens/sec

The LoRA-finetuned model responds as follows:

> Weight decay is a way to penalize terms for their word-count. By choosing sentences with fewer words, a model is more likely to select those sentences, which may provide important information.
>
> Time for inference: 7.04 sec total, 14.21 tokens/sec
>
> Memory used: 13.56 GB

Note that the LoRA approach above currently uses the most memory. We can reduce this memory usage by merging the LoRA weights with the pretrained model weights.

This qualitative overview is only a thin slice of the capabilities of each of these models since evaluating quality is a big topic in itself.

Bottom line: LoRA can be used to finetune an LLM on an instruction dataset in a cost-effective manner.

Until something else comes out, I suppose.