Title: LoRA: Low-Rank Adaptation of Large Language Models
Authors: Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen
Published: 17th June 2021 (Thursday) @ 17:37:18
Link: http://arxiv.org/abs/2106.09685v2
Abstract
An important paradigm of natural language processing consists of large-scale pre-training on general domain data and adaptation to particular tasks or domains. As we pre-train larger models, full fine-tuning, which retrains all model parameters, becomes less feasible. Using GPT-3 175B as an example —deploying independent instances of fine-tuned models, each with 175B parameters, is prohibitively expensive. We propose Low-Rank Adaptation, or LoRA, which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks. Compared to GPT-3 175B fine-tuned with Adam, LoRA can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times. LoRA performs on-par or better than fine-tuning in model quality on RoBERTa, DeBERTa, GPT-2, and GPT-3, despite having fewer trainable parameters, a higher training throughput, and, unlike adapters, no additional inference latency. We also provide an empirical investigation into rank-deficiency in language model adaptation, which sheds light on the efficacy of LoRA. We release a package that facilitates the integration of LoRA with PyTorch models and provide our implementations and model checkpoints for RoBERTa, DeBERTa, and GPT-2 at https://github.com/microsoft/LoRA.
todo: circle back to the subspace similarity between different r once you have a better understanding of Singular Value Decomposition, and read §7.3 ("How does the adaptation matrix ∆W compare to W?")
- Builds on the finding from Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning (Aghajanyan et al., 2020) that, although the weight matrices in LM networks are typically full rank, the intrinsic dimension of the fine-tuning problem is quite small
- LoRA makes the same low-intrinsic-dimension assumption for the adaptation itself: the weight update ∆W learned during fine-tuning is hypothesised to have a low "intrinsic rank"
- Instead of full fine-tuning, adapters, or prefix tuning, learn a decomposed matrix update W = W₀ + ∆W = W₀ + BA, where W is the fine-tuned weight matrix, W₀ is the frozen pre-trained weight matrix, and ∆W is the update we learn during fine-tuning; ∆W gets decomposed into B ∈ ℝ^(d×r) and A ∈ ℝ^(r×k), with the rank r ≪ min(d, k) being the low-rank tunable hyperparameter (see the PyTorch sketch after this list)
- Reduction of memory and storage usage: during PEFT (parameter-efficient fine-tuning) there is no need to store optimiser states (e.g. Adam moments) for the frozen parameters, since they are not being optimised!
- up to 2/3 VRAM utilisation reduction if r ≪ d_model
- 25% speedup in training vs full fine-tuning - tested on GPT-3 175B
- Benchmarks: WikiSQL, MNLI-m, SAMSum
- tested on RoBERTa base/large and DeBERTa XXL (NLU); GPT-2 medium/large and GPT-3 175B (NLG)
- Which weight matrices in the Transformer should we apply LoRA to?
- used a fixed parameter budget of 18M params (roughly 35MB if stored in FP16) on GPT-3 175B
- best to adapt all four attention matrices (Wq, Wk, Wv, Wo), each with a smaller rank (e.g. r = 2 instead of r = 8 for a single matrix type); the budget arithmetic is sketched after this list
- What is the optimal rank r for LoRA?
- Table 6 shows that, surprisingly, LoRA already performs competitively with a very small r (more so for {Wq, Wv} than just Wq). This suggests the update matrix ∆W could have a very small “intrinsic rank”
- To our surprise, a rank as small as one suffices for adapting both Wq and Wv on these datasets while training Wq alone needs a larger r. - from the caption to Table 6
- Subspace similarity between different r: given A_{r=8} and A_{r=64}, which are the learned adaptation matrices with rank r = 8 and r = 64 using the same pre-trained model, we perform singular value decomposition and obtain the right-singular unitary matrices U_{A_{r=8}} and U_{A_{r=64}} (a rough sketch of the similarity computation follows this list)
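To make the W = W₀ + ∆W = W₀ + BA update from the bullets above concrete, here is a minimal PyTorch sketch of a LoRA-style linear layer. This is my own toy implementation based on the paper's description, not the API of the microsoft/LoRA package; the class name `LoRALinear`, the default r and alpha values, and the exact init scale are assumptions, while the alpha/r scaling and the "Gaussian A, zero B" init follow the paper.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Toy LoRA wrapper around a frozen nn.Linear: y = W0 x + (alpha / r) * B A x.
    Per the paper, A gets a random Gaussian init and B starts at zero, so the
    update Delta W = BA is zero when training begins."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)       # freeze the pre-trained W0
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # Gaussian init (scale is a guess)
        self.B = nn.Parameter(torch.zeros(d_out, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen pre-trained path plus the trainable low-rank path.
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)
```

Only A and B require gradients, so the optimiser keeps states (e.g. Adam moments) just for them; that is where the memory/VRAM savings in the bullets above come from. For deployment, the update can be merged into the frozen weight (W0 + (alpha / r) * B @ A), which is why LoRA adds no extra inference latency, unlike adapter layers.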
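A back-of-the-envelope check of the 18M parameter budget from the "which matrices" bullets. This is my own arithmetic, not a table from the paper; it uses the publicly documented GPT-3 175B shape (96 layers, d_model = 12288) and assumes each attention projection is a d_model × d_model matrix.

```python
# Each adapted d x d projection contributes two low-rank factors,
# B (d x r) and A (r x d), i.e. 2 * d * r trainable params per matrix per layer.
n_layers, d_model = 96, 12288        # GPT-3 175B

budget_single_r8 = n_layers * 1 * 2 * d_model * 8   # only Wq, rank 8
budget_all4_r2   = n_layers * 4 * 2 * d_model * 2   # Wq, Wk, Wv, Wo, rank 2

print(budget_single_r8, budget_all4_r2)   # both 18,874,368 ~= 18M params
print(budget_all4_r2 * 2 / 2**20)         # ~36 MB in FP16, close to the ~35MB noted above
```

The two settings land on the same budget, which is the point of the comparison: spreading a small rank over all four projection matrices beats spending a large rank on just one of them.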
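For the subspace-similarity analysis referenced in the last bullet (and in the TODO at the top): my reading is that the paper measures how much the span of the top-i right-singular vectors of A_{r=8} overlaps with the span of the top-j right-singular vectors of A_{r=64}, via a normalised squared Frobenius norm. A rough sketch of that measure, with the caveat that I may be missing details from that section of the paper:

```python
import torch


def subspace_similarity(A_small: torch.Tensor, A_large: torch.Tensor,
                        i: int, j: int) -> float:
    """Normalised overlap between the span of the top-i right-singular vectors
    of A_small and the top-j right-singular vectors of A_large. Returns a value
    in [0, 1]: 1 means the smaller subspace is contained in the larger one,
    0 means they are orthogonal."""
    # The A matrices are r x d, so the right-singular vectors (rows of Vh) live in R^d.
    _, _, Vh_small = torch.linalg.svd(A_small, full_matrices=False)
    _, _, Vh_large = torch.linalg.svd(A_large, full_matrices=False)
    U_i = Vh_small[:i]                        # top-i right-singular vectors, shape (i, d)
    U_j = Vh_large[:j]                        # top-j right-singular vectors, shape (j, d)
    overlap = (U_i @ U_j.T).pow(2).sum()      # squared Frobenius norm of the projection
    return (overlap / min(i, j)).item()


# e.g. compare the rank-8 and rank-64 adaptations of the same weight matrix:
# phi = subspace_similarity(A_r8, A_r64, i=8, j=64)
```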
This is an interesting baseline that extends prefix tuning, i.e. Prefix-Tuning: Optimizing Continuous Prompts for Generation from Li & Liang (2021):
Prefix-layer tuning (PreLayer) is an extension to prefix-embedding tuning. Instead of just learning the word embeddings (or equivalently, the activations after the embedding layer) for some special tokens, we learn the activations after every Transformer layer. The activations computed from previous layers are simply replaced by trainable ones. The resulting number of trainable parameters is |Θ| = L × d_model × (l_p + l_i), where L is the number of Transformer layers.
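As a quick sanity check on that formula (my own arithmetic with made-up token budgets, not numbers from the paper): for GPT-2 medium, L = 24 and d_model = 1024, so a hypothetical l_p = l_i = 4 special tokens would give |Θ| = 24 × 1024 × 8 ≈ 0.2M trainable parameters.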
Nice summary of the adapter techniques that were available at the time:
Adapter tuning as proposed in Houlsby et al. (2019) inserts adapter layers between the self-attention module (and the MLP module) and the subsequent residual connection. There are two fully connected layers with biases in an adapter layer with a nonlinearity in between. We call this original design AdapterH. Recently, Lin et al. (2020) proposed a more efficient design with the adapter layer applied only after the MLP module and after a LayerNorm. We call it AdapterL. This is very similar to another design proposed in Pfeiffer et al. (2021), which we call AdapterP. We also include another baseline called AdapterDrop (Rücklé et al., 2020), which drops some adapter layers for greater efficiency (AdapterD).
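To keep the variants straight, here is a minimal sketch of the bottleneck block these adapter designs share (my own illustration; the class name, bottleneck size, and choice of nonlinearity are assumptions, and the variants above differ mainly in where and how often the block is inserted):

```python
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    """Minimal Houlsby-style adapter: down-project, nonlinearity, up-project,
    plus a residual connection. Names and dimensions are illustrative only."""

    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)   # W_down, with bias
        self.up = nn.Linear(bottleneck, d_model)     # W_up, with bias
        self.act = nn.GELU()

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # The adapter output is added back to the incoming hidden states,
        # so the block stays close to the identity early in training.
        return h + self.up(self.act(self.down(h)))
```

Because these blocks sit in the forward pass at inference time, they add latency, which is the overhead LoRA avoids by merging its update into the frozen weights.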
LoRA matches or exceeds the fine-tuning baseline on all three datasets
Interesting that more parameters can hurt performance when using prefix tuning to do PEFT:
not all methods benefit monotonically from having more trainable parameters, as shown in Figure 2. We observe a significant performance drop when we use more than 256 special tokens for prefix-embedding tuning or more than 32 special tokens for prefix-layer tuning. This corroborates similar observations in Li & Liang (2021).
Figure 2 plots this relationship between capacity (number of trainable parameters) and performance (validation accuracy).