Finetuning is an essential step in adapting large language models (LLMs) to specific tasks and domains. However, conventional finetuning methods can be computationally and memory-intensive. In this article, we explore the use of parameter-efficient finetuning methods.
Models are typically trained in two stages: pretraining and finetuning. Pretraining involves training the model on an unlabeled dataset, resulting in a foundation model with general capabilities. Finetuning involves training the pretrained model on a specific dataset or task to adapt it to the target domain.
Why Finetuning?
Finetuning allows the model to better adapt to specific domains or types of text that are not well-represented in its original training data. It can also improve the model's performance on the target task.
We will focus on three parameter-efficient finetuning paradigms: adapter-style methods, prefix tuning (both combined in LLaMA-Adapter), and low-rank adaptation (LoRA).
Parameter-efficient finetuning methods can finetune a 7B- to 13B-parameter model on a single GPU, 9 times faster than conventional finetuning and with 15x less GPU memory.
This method adds a small number of trainable tensors (parameters) to an existing LLM. Here, the idea is that only the new parameters are trained, whereas the original parameters are left frozen. This can save a lot of compute and memory during backpropagation.
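To make this concrete, here is a minimal PyTorch sketch of an adapter-style bottleneck module. The layer sizes and module structure are illustrative assumptions rather than the exact LLaMA-Adapter design; in practice, such a module is inserted into each transformer block of a frozen pretrained model.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Small trainable bottleneck added on top of a frozen transformer block."""
    def __init__(self, hidden_dim: int, bottleneck_dim: int):
        super().__init__()
        # Down-project, apply a nonlinearity, then up-project back to hidden_dim.
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck_dim, hidden_dim)

    def forward(self, x):
        # Residual connection: the frozen block's output passes through unchanged,
        # and the small trainable module only adds a correction on top of it.
        return x + self.up(self.act(self.down(x)))

# Illustrative sizes: a 4096-dim hidden state and a 16-dim bottleneck.
adapter = Adapter(hidden_dim=4096, bottleneck_dim=16)
frozen_block_output = torch.randn(1, 8, 4096)  # dummy activations from a frozen block
out = adapter(frozen_block_output)

# Only these adapter parameters (~135k here) are trained; the pretrained weights
# stay frozen, e.g. by setting param.requires_grad = False on all of them.
print(sum(p.numel() for p in adapter.parameters()))
```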
In a bit more detail, LLaMA-Adapter prepends tunable prompt tensors (prefixes) to the embedded inputs. In the LLaMA-Adapter method, these prefixes are learned and maintained within an embedding table rather than being provided externally. Each transformer block in the model has its own distinct learned prefix, allowing for more tailored adaptation across different model layers.
In addition, it introduces a zero-initialized attention mechanism coupled with gating. The motivation behind this so-called zero-init attention and gating is that adapters and prefix tuning could potentially disrupt the linguistic knowledge of the pretrained model by incorporating randomly initialized tensors (prefix prompts or adapter layers), resulting in unstable finetuning and high loss values during initial training phases.
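The sketch below illustrates these two ingredients, a learned per-block prefix and a zero-initialized gate, in simplified form. It omits the separate key/value projections of a real attention layer, and all names and dimensions are illustrative assumptions, not the actual LLaMA-Adapter implementation.

```python
import torch
import torch.nn as nn

class ZeroInitPrefixAttention(nn.Module):
    """One per transformer block: a learned prefix plus a zero-initialized gate."""
    def __init__(self, prefix_len: int, hidden_dim: int):
        super().__init__()
        # Learned prefix tokens, stored inside the model rather than supplied externally.
        self.prefix = nn.Parameter(torch.randn(prefix_len, hidden_dim) * 0.02)
        # Gating factor initialized to zero ("zero-init attention").
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, hidden, attn_out):
        # hidden: (batch, seq, dim) block input; attn_out: ordinary self-attention output.
        # Simplified dot-product attention of the input tokens over the learned prefix.
        scores = hidden @ self.prefix.t() / hidden.shape[-1] ** 0.5
        prefix_out = scores.softmax(dim=-1) @ self.prefix
        # The gate starts at zero, so the randomly initialized prefix contributes
        # nothing at first and training begins from the unmodified pretrained model.
        return attn_out + torch.tanh(self.gate) * prefix_out

block = ZeroInitPrefixAttention(prefix_len=10, hidden_dim=512)
hidden = torch.randn(2, 8, 512)
attn_out = torch.randn(2, 8, 512)
out = block(hidden, attn_out)
assert torch.allclose(out, attn_out)  # no change at initialization
```

Because the gate starts at zero, the randomly initialized prefix has no effect at the beginning of training, which is exactly the stabilization the zero-init attention is meant to provide.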
Conventional finetuning of a 7B-parameter model requires roughly six GPUs with ~40 GB of RAM each, such as the A100. Parameter-efficient finetuning methods only require ~16 GB of RAM, allowing users to finetune the model on a single high-end consumer-grade GPU.
When finetuning LLMs on text and instructions, the more recent LLaMA-Adapter v2 (Gao et al. 2023) increases the number of tunable parameters compared to LLaMA-Adapter v1 (Zhang et al. 2023). The first difference is that it adds bias units to the fully connected (linear) layers. Since it merely modifies the existing linear layers from input * weight to input * weight + bias, it only has a small impact on the finetuning and inference performance.
The second difference is that it makes the aforementioned RMSNorm layers trainable. While this has a small effect on the training performance due to updating additional parameters, it doesn’t impact the inference speed as it doesn’t add any new parameters to the network.
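A hedged sketch of how these two v2 changes could be applied in PyTorch is shown below: all pretrained weights are frozen, and only the linear-layer biases and normalization parameters are marked trainable. The helper name and the toy model are illustrative and not taken from the LLaMA-Adapter v2 codebase.

```python
import torch.nn as nn

def mark_v2_trainable(model: nn.Module) -> None:
    # Freeze every pretrained parameter first.
    for param in model.parameters():
        param.requires_grad = False
    for module in model.modules():
        # New bias units in the linear layers (input * weight + bias) are trained.
        if isinstance(module, nn.Linear) and module.bias is not None:
            module.bias.requires_grad = True
        # Normalization layers (RMSNorm in LLaMA-style models) become trainable as well.
        if "norm" in module.__class__.__name__.lower():
            for param in module.parameters():
                param.requires_grad = True

# Toy stand-in for a transformer block, just to show which parameters end up trainable.
toy = nn.Sequential(nn.Linear(8, 8, bias=True), nn.LayerNorm(8))
mark_v2_trainable(toy)
print([name for name, p in toy.named_parameters() if p.requires_grad])
# ['0.bias', '1.weight', '1.bias']
```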
Low-Rank Adaptation (Hu et al. 2021) is similar to the adapter methods above in that it adds a small number of trainable parameters to the model while the original model parameters remain frozen. However, the underlying approach differs.
LoRA represents the update to a weight matrix as the product of two much smaller low-rank matrices, which are the only new parameters that get trained.
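Conceptually, the finetuned weight becomes W + ΔW with ΔW = B·A, where A and B are the two small matrices. Below is a minimal PyTorch sketch of a LoRA-style linear layer; the rank, scaling, and layer sizes are assumed values typical of LoRA implementations, not the reference implementation from Hu et al.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank update B @ A."""
    def __init__(self, in_dim: int, out_dim: int, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim, bias=False)
        self.base.weight.requires_grad = False  # frozen pretrained weight W
        self.A = nn.Parameter(torch.randn(rank, in_dim) * 0.01)  # trainable, rank x in_dim
        self.B = nn.Parameter(torch.zeros(out_dim, rank))        # trainable, out_dim x rank
        self.scaling = alpha / rank

    def forward(self, x):
        # W x + scaling * B (A x), i.e. the low-rank update applied on top of the frozen layer.
        return self.base(x) + self.scaling * (x @ self.A.t() @ self.B.t())

layer = LoRALinear(4096, 4096, rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable:,} / total: {total:,}")  # two small matrices vs. the full weight
```

Initializing B to zero means ΔW is zero at the start of training, so, much like the zero-init gating above, finetuning begins from the unchanged pretrained model.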
Parameter-efficient finetuning methods achieve inference speeds similar to conventional finetuning, with LoRA performing slightly better.