Finetuning is an essential step in adapting large language models (LLMs) to specific tasks and domains. However, conventional finetuning methods can be computationally and memory-intensive. In this article, we explore the use of parameter-efficient finetuning methods.
Models are typically trained in two stages: pretraining and finetuning. Pretraining involves training the model on an unlabeled dataset, resulting in a foundation model with general capabilities. Finetuning involves training the pretrained model on a specific dataset or task to adapt it to the target domain.
Why Finetuning?
Finetuning allows the model to better adapt to specific domains or types of text that are not well-represented in its original training data. It can also improve the model's performance on the target task.
We will focus on three parameter-efficient finetuning paradigms: adapter-style methods, prefix tuning (both combined in LLaMA-Adapter), and low-rank adaptation (LoRA).
Parameter-efficient finetuning methods can finetune a 7B- to 13B-parameter model on a single GPU, 9 times faster than conventional finetuning and with 15x less GPU memory.
This method adds a small number of trainable tensors (parameters) to an existing LLM. Here, the idea is that only the new parameters are trained, whereas the original parameters are left frozen. This can save a lot of compute and memory during backpropagation.
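To make this concrete, here is a minimal PyTorch sketch of an adapter-style bottleneck module. The layer sizes and module structure are illustrative assumptions rather than the exact LLaMA-Adapter design; in practice, such a module is inserted into each transformer block of a frozen pretrained model.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Small trainable bottleneck added on top of a frozen transformer block."""
    def __init__(self, hidden_dim: int, bottleneck_dim: int):
        super().__init__()
        # Down-project, apply a nonlinearity, then up-project back to hidden_dim.
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck_dim, hidden_dim)

    def forward(self, x):
        # Residual connection: the frozen block's output passes through unchanged,
        # and the small trainable module only adds a correction on top of it.
        return x + self.up(self.act(self.down(x)))

# Illustrative sizes: a 4096-dim hidden state and a 16-dim bottleneck.
adapter = Adapter(hidden_dim=4096, bottleneck_dim=16)
frozen_block_output = torch.randn(1, 8, 4096)  # dummy activations from a frozen block
out = adapter(frozen_block_output)

# Only these adapter parameters (~135k here) are trained; the pretrained weights
# stay frozen, e.g. by setting param.requires_grad = False on all of them.
print(sum(p.numel() for p in adapter.parameters()))
```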
In a bit more detail, LLaMA-Adapter prepends tunable prompt tensors (prefixes) to the embedded inputs. In the LLaMA-Adapter method, these prefixes are learned and maintained within an embedding table rather than being provided externally. Each transformer block in the model has its own distinct learned prefix, allowing for more tailored adaptation across different model layers.
In addition, it introduces a zero-initialized attention mechanism coupled with gating. The motivation behind this so-called zero-init attention and gating is that adapters and prefix tuning could potentially disrupt the linguistic knowledge of the pretrained model by incorporating randomly initialized tensors (prefix prompts or adapter layers), resulting in unstable finetuning and high loss values during initial training phases.
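The sketch below illustrates these two ingredients, a learned per-block prefix and a zero-initialized gate, in simplified form. It omits the separate key/value projections of a real attention layer, and all names and dimensions are illustrative assumptions, not the actual LLaMA-Adapter implementation.

```python
import torch
import torch.nn as nn

class ZeroInitPrefixAttention(nn.Module):
    """One per transformer block: a learned prefix plus a zero-initialized gate."""
    def __init__(self, prefix_len: int, hidden_dim: int):
        super().__init__()
        # Learned prefix tokens, stored inside the model rather than supplied externally.
        self.prefix = nn.Parameter(torch.randn(prefix_len, hidden_dim) * 0.02)
        # Gating factor initialized to zero ("zero-init attention").
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, hidden, attn_out):
        # hidden: (batch, seq, dim) block input; attn_out: ordinary self-attention output.
        # Simplified dot-product attention of the input tokens over the learned prefix.
        scores = hidden @ self.prefix.t() / hidden.shape[-1] ** 0.5
        prefix_out = scores.softmax(dim=-1) @ self.prefix
        # The gate starts at zero, so the randomly initialized prefix contributes
        # nothing at first and training begins from the unmodified pretrained model.
        return attn_out + torch.tanh(self.gate) * prefix_out

block = ZeroInitPrefixAttention(prefix_len=10, hidden_dim=512)
hidden = torch.randn(2, 8, 512)
attn_out = torch.randn(2, 8, 512)
out = block(hidden, attn_out)
assert torch.allclose(out, attn_out)  # no change at initialization
```

Because the gate starts at zero, the randomly initialized prefix has no effect at the beginning of training, which is exactly the stabilization the zero-init attention is meant to provide.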
Conventional finetuning of a 7B-parameter model requires roughly six GPUs with ~40 GB of RAM each, such as the A100. Parameter-efficient finetuning methods only require ~16 GB of RAM, allowing users to finetune the model on a single high-end consumer-grade GPU.
When finetuning LLMs on text and instructions, the more recent LLaMA-Adapter v2 (Gao et al. 2023) increases the number of tunable parameters compared to LLaMA-Adapter v1 (Zhang et al. 2023). The first difference is that it adds bias units to the fully connected (linear) layers. Since it merely modifies the existing linear layers from input * weight to input * weight + bias, it only has a small impact on the finetuning and inference performance.
The second difference is that it makes the aforementioned RMSNorm layers trainable. While this has a small effect on the training performance due to updating additional parameters, it doesn’t impact the inference speed as it doesn’t add any new parameters to the network.
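A hedged sketch of how these two v2 changes could be applied in PyTorch is shown below: all pretrained weights are frozen, and only the linear-layer biases and normalization parameters are marked trainable. The helper name and the toy model are illustrative and not taken from the LLaMA-Adapter v2 codebase.

```python
import torch.nn as nn

def mark_v2_trainable(model: nn.Module) -> None:
    # Freeze every pretrained parameter first.
    for param in model.parameters():
        param.requires_grad = False
    for module in model.modules():
        # New bias units in the linear layers (input * weight + bias) are trained.
        if isinstance(module, nn.Linear) and module.bias is not None:
            module.bias.requires_grad = True
        # Normalization layers (RMSNorm in LLaMA-style models) become trainable as well.
        if "norm" in module.__class__.__name__.lower():
            for param in module.parameters():
                param.requires_grad = True

# Toy stand-in for a transformer block, just to show which parameters end up trainable.
toy = nn.Sequential(nn.Linear(8, 8, bias=True), nn.LayerNorm(8))
mark_v2_trainable(toy)
print([name for name, p in toy.named_parameters() if p.requires_grad])
# ['0.bias', '1.weight', '1.bias']
```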
Low-Rank Adaptation (Hu et al. 2021) is similar to the adapter methods above in that it adds a small number of trainable parameters to the model while the original model parameters remain frozen. However, the underlying approach differs.
LoRA represents the update to a weight matrix as the product of two much smaller low-rank matrices, which are the only new parameters that get trained.
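Conceptually, the finetuned weight becomes W + ΔW with ΔW = B·A, where A and B are the two small matrices. Below is a minimal PyTorch sketch of a LoRA-style linear layer; the rank, scaling, and layer sizes are assumed values typical of LoRA implementations, not the reference implementation from Hu et al.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank update B @ A."""
    def __init__(self, in_dim: int, out_dim: int, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim, bias=False)
        self.base.weight.requires_grad = False  # frozen pretrained weight W
        self.A = nn.Parameter(torch.randn(rank, in_dim) * 0.01)  # trainable, rank x in_dim
        self.B = nn.Parameter(torch.zeros(out_dim, rank))        # trainable, out_dim x rank
        self.scaling = alpha / rank

    def forward(self, x):
        # W x + scaling * B (A x), i.e. the low-rank update applied on top of the frozen layer.
        return self.base(x) + self.scaling * (x @ self.A.t() @ self.B.t())

layer = LoRALinear(4096, 4096, rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable:,} / total: {total:,}")  # two small matrices vs. the full weight
```

Initializing B to zero means ΔW is zero at the start of training, so, much like the zero-init gating above, finetuning begins from the unchanged pretrained model.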
Parameter-efficient finetuning methods achieve inference speeds similar to conventional finetuning, with LoRA performing slightly better.