LoRA (Low Rank Adaptation): A Deeper Dive | Rajan Ghimire
Excerpt
Exploring and Implementing LoRA in PyTorch.
LoRA is a fast fine-tuning approach developed by Microsoft researchers for adapting huge models to specific tasks and datasets. The idea behind LoRA is that a single LLM can serve various tasks by relying on different neurons or features to handle each task. By identifying the appropriate features from a pool of many and amplifying them, we can obtain better outcomes for specific tasks.
Fine-tuning
Let,

$\mathcal{L}$ = Loss function
$X$, $y$ = Input and output data
$W_0$ = Weights from a pre-trained network

The task of fine-tuning a neural network can be expressed as:

$$\min_{\Delta W} \; \mathcal{L}(X, y; W_0 + \Delta W)$$

Our goal is to find the $\Delta W$ that minimizes $\mathcal{L}(X, y; W_0 + \Delta W)$. For the parameter $\Delta W$, its dimension is equal to that of $W_0$, i.e. $|\Delta W| = |W_0|$. If $W_0$ belongs to a very large-scale pre-trained model, then finding $\Delta W$ becomes computationally challenging.
During the training of fully connected layers in a neural network, the weight matrices are typically full rank, meaning that they do not have any redundant rows or columns. The authors of LoRA pointed out that while the weights of a pre-trained model have full rank for the pre-trained tasks, large language models have a low "intrinsic dimension". This means that the data can be represented or approximated effectively by a lower-dimensional space while retaining most of its essential information or structure. In simpler terms, this implies that we can break down the new weight matrix for the adapted task into lower-dimensional components.
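To build some intuition for this, here is a minimal sketch: a matrix that was secretly built from low-rank factors can be reconstructed almost perfectly from just a few singular directions, while storing far fewer numbers.

```python
import torch

torch.manual_seed(0)

# A 1000 x 1000 matrix that secretly has rank 8
W = torch.randn(1000, 8) @ torch.randn(8, 1000)

# Keep only the top-8 singular directions
U, S, Vh = torch.linalg.svd(W)
W_approx = U[:, :8] @ torch.diag(S[:8]) @ Vh[:8, :]

print(torch.linalg.matrix_rank(W))               # tensor(8)
print(torch.norm(W - W_approx) / torch.norm(W))  # ~0: near-perfect reconstruction
# We stored roughly 2 * 1000 * 8 = 16,000 numbers instead of 1,000,000.
```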
LoRA applies a simple matrix decomposition to each weight matrix update $\Delta W_i \in \mathbb{R}^{d \times k}$. Considering $\Delta W_i$ the update of the $i$-th weight matrix in the network, LoRA approximates it with:

$$\Delta W_i \approx A_i B_i$$

Where $A \in \mathbb{R}^{d \times r}$, $B \in \mathbb{R}^{r \times k}$, and the rank $r \ll \min(d, k)$. This means that the forward pass of the layer, originally $h = W_0 x$, is modified to $h = W_0 x + A B x$ (as shown in the figure above). Thus instead of learning $d \times k$ parameters we now need to learn $(d + k) \times r$, which is easily a lot smaller given the multiplicative aspect. A random Gaussian initialization is used for $A$, and $B$ is initially set to zero, so $\Delta W = AB = 0$ at the start of training. The update is additionally scaled with a factor $\frac{\alpha}{r}$, which can be interpreted as a learning rate for the LoRA update.
If we limit the rank $r$ to a small value in the middle, we can greatly reduce the number of trainable parameters and decrease the dimensionality of the features to $r$. This results in an overall trainable parameter count of $|\Theta| = 2 \times \hat{L}_{\text{LoRA}} \times d_{\text{model}} \times r$, where $\hat{L}_{\text{LoRA}}$ is the number of modules LoRA is applied to in the entire model.
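As a quick back-of-the-envelope check, using a hypothetical 1000 × 1000 weight matrix and a rank of 8:

```python
d, k, r = 1000, 1000, 8   # hypothetical layer size and rank

full_update_params = d * k        # full-rank update: 1,000,000 parameters
lora_update_params = (d + k) * r  # rank-8 LoRA update: 16,000 parameters

print(f"Full fine-tuning : {full_update_params:,} parameters")
print(f"LoRA (r={r})      : {lora_update_params:,} parameters")
print(f"Reduction         : {full_update_params / lora_update_params:.0f}x")
```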
Once the fine-tuning is done, we can simply update the weights in $W_0$ by adding each $W_{0,i}$ with its respective $\Delta W_i = \frac{\alpha}{r} A_i B_i$.
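Here is a minimal sketch of what that merge looks like for a single layer, using toy tensors and the same row-vector convention as the PyTorch code below: once $A$ and $B$ are folded into the weight, inference is again a single matrix multiplication with no adapter overhead.

```python
import torch

d_in, d_out, r, alpha = 64, 32, 4, 1.0

W0 = torch.randn(d_in, d_out)   # frozen pre-trained weight
A = torch.randn(d_in, r)        # LoRA factors (toy values here)
B = torch.randn(r, d_out)

x = torch.randn(8, d_in)        # a batch of inputs

# With the adapter: base output plus the low-rank correction
h_adapter = x @ W0 + (alpha / r) * (x @ A @ B)

# After merging: one weight matrix reproduces the same outputs
W_merged = W0 + (alpha / r) * (A @ B)
h_merged = x @ W_merged

print(torch.allclose(h_adapter, h_merged, atol=1e-4))  # True
```

This is exactly what the `update_original_weights` method of the adapter later in this post does.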
PyTorch Minimal Implementation
Let's train a simple linear regression implementation using PyTorch.

We will create simple training data using $y = X\theta$.

Then we will build a LinearRegressionModel to estimate the value of $\theta$. Let's assume it to be our pre-trained model.
```python
import math
import torch
import torch.nn as nn

# Define dimensions
n = 10000            # Total number of samples
d_in = 1001
d_out = 1000
hidden_dim = 1000

# Moving data to the device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Define the data
thetas = torch.randn(d_in, d_out).to(device)

X = torch.randn(n, d_in).to(device)
y = torch.matmul(X, thetas).to(device)

print(f"Shape of X : {X.shape}")
print(f"Shape of y : {y.shape}")
print(f"Shape of θ : {thetas.shape}")
```
```
Shape of X : torch.Size([10000, 1001])
Shape of y : torch.Size([10000, 1000])
Shape of θ : torch.Size([1001, 1000])
```
Now, let's define our LinearRegressionModel. It consists of two simple linear layers.
```python
class LinearRegressionModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(LinearRegressionModel, self).__init__()
        self.layer1 = nn.Linear(input_dim, hidden_dim, bias=False)
        self.layer2 = nn.Linear(hidden_dim, output_dim, bias=False)

    def forward(self, x):
        out = self.layer1(x)
        out = self.layer2(out)
        return out


def train(model, X, y, batch_size=128, epochs=100):
    opt = torch.optim.Adam(model.parameters())
    for epoch in range(epochs):
        # Randomly shuffle the input data
        permutation = torch.randperm(X.size()[0])
        for i in range(0, X.size()[0], batch_size):
            opt.zero_grad()
            indices = permutation[i:i+batch_size]
            batch_x, batch_y = X[indices], y[indices]
            outputs = model(batch_x)
            loss = torch.nn.functional.mse_loss(outputs, batch_y)
            loss.backward()
            opt.step()
        if epoch % 10 == 0:
            with torch.no_grad():
                outputs = model(X)
                loss = torch.nn.functional.mse_loss(outputs, y)
                print(f"Epoch : {epoch}/{epochs} Loss : {loss.item()}")
```
```python
# Define the model
model = LinearRegressionModel(d_in, hidden_dim, d_out).to(device)

train(model, X, y)
```
```
Epoch : 0/100 Loss : 868.3592529296875
Epoch : 10/100 Loss : 18.999113082885742
Epoch : 20/100 Loss : 1.2845144271850586
Epoch : 30/100 Loss : 0.1564238965511322
Epoch : 40/100 Loss : 0.028503887355327606
Epoch : 50/100 Loss : 0.006223085802048445
Epoch : 60/100 Loss : 0.0016892347484827042
Epoch : 70/100 Loss : 0.7939147353172302
Epoch : 80/100 Loss : 0.2283499538898468
Epoch : 90/100 Loss : 0.2333495020866394
```
Now that we have our pre-trained base model, let's assume that we have data from a slightly different distribution:
```python
thetas2 = thetas + 1

X2 = torch.randn(n, d_in).to(device)
y2 = torch.matmul(X2, thetas2).to(device)
```
Since this data comes from a different distribution, applying the base model to it won't give good results.
```python
loss = torch.nn.functional.mse_loss(model(X2), y2)
print(f"Loss on different distribution: {loss}")
```
```
Loss on different distribution: 1013.2288818359375
```
We now fine-tune our initial model. The distribution of the new data is only slightly different from the initial one: we simply added 1 to every entry of $\theta$, so the true weight update $\Delta\theta$ is a matrix of all ones, which has rank 1. This means that the weight update is not expected to be complex, and we shouldn't need a full-rank update in order to get good results.
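A quick sanity check makes this explicit (assuming the tensors defined above are still in scope):

```python
delta_theta = thetas2 - thetas                 # the ideal weight update: a matrix of ones
print(torch.linalg.matrix_rank(delta_theta))   # rank 1, so r=1 should be enough in principle
```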
```python
class LoRAAdapter(nn.Module):
    def __init__(self, model, r=16, alpha=1):
        super(LoRAAdapter, self).__init__()
        self.module_list = nn.ModuleList()
        self.scaling = alpha / r
        self.original_linears = []

        # Go through the layers of the model.
        # If a layer is a linear layer, add an adapter to it.
        for layer in model.children():
            if isinstance(layer, nn.Linear):
                # Keep a reference to the original linear layers;
                # we need them later to merge in the A and B parameters.
                self.original_linears.append(layer)

                # Create an adapted layer for each Linear layer
                adapted_layer = AdaptedLinear(layer, r, self.scaling)
                self.module_list.append(adapted_layer)
            else:
                # Keep other types of layers as they are
                self.module_list.append(layer)

    def forward(self, x):
        for layer in self.module_list:
            x = layer(x)
        return x

    def update_original_weights(self):
        # Merge the learned low-rank update back into the frozen weights:
        # W <- W + (A @ B)^T * scaling
        with torch.no_grad():
            adapted_layers = [m for m in self.module_list if isinstance(m, AdaptedLinear)]
            for adapted_layer, original_layer in zip(adapted_layers, self.original_linears):
                delta_theta = torch.matmul(adapted_layer.A, adapted_layer.B) * adapted_layer.scaling
                original_layer.weight.add_(delta_theta.t())


class AdaptedLinear(nn.Module):
    def __init__(self, linear, r, scaling) -> None:
        super().__init__()
        # Freeze the original weights; only A and B are trained.
        linear.requires_grad_(False)
        self.linear = linear
        self.A = nn.Parameter(torch.randn(linear.in_features, r))
        self.B = nn.Parameter(torch.zeros(r, linear.out_features))
        self.scaling = scaling

    def forward(self, x):
        return self.linear(x) + torch.matmul(x, torch.matmul(self.A, self.B) * self.scaling)
```
```python
lora_model = LoRAAdapter(model, r=1).to(device)
```
We have now initialized our LoRA model. For simplicity, let's put $r = 1$. Now, let's train the model.
```python
train(lora_model, X=X2, y=y2)
```
```
Epoch : 0/100 Loss : 1007.549072265625
Epoch : 10/100 Loss : 679.202880859375
Epoch : 20/100 Loss : 317.93316650390625
Epoch : 30/100 Loss : 124.77867889404297
Epoch : 40/100 Loss : 39.598350524902344
Epoch : 50/100 Loss : 9.39522933959961
Epoch : 60/100 Loss : 1.6521010398864746
Epoch : 70/100 Loss : 0.4204731583595276
Epoch : 80/100 Loss : 0.3215165138244629
Epoch : 90/100 Loss : 0.3118535876274109
```
Up to this point, we have only trained the $A$ and $B$ parameters, but we still haven't applied the update to the original weights, i.e. $W_0 \leftarrow W_0 + \frac{\alpha}{r} A B$. So the base model won't show any improvements.
```python
loss = torch.nn.functional.mse_loss(model(X2), y2)
print(f"Loss on different distribution: {loss}")
```
```
Loss on different distribution: 1013.2288818359375
```
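The adapter model itself, on the other hand (the frozen base weights plus the trained $A$ and $B$ factors), already fits the new data. A quick check, assuming the variables above are still in scope:

```python
loss = torch.nn.functional.mse_loss(lora_model(X2), y2)
print(f"Loss of lora_model on different distribution: {loss}")
# Expected to be close to the final training loss above (around 0.3)
```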
Now, after performing $W_0 \leftarrow W_0 + \frac{\alpha}{r} A_i B_i$ for each of the linear layers in the model, the loss converges, i.e. we have successfully fine-tuned our model on the new distribution.
```python
lora_model.update_original_weights()

loss = torch.nn.functional.mse_loss(model(X2), y2)
print(f"Loss on different distribution: {loss}")
```
```
Loss on different distribution: 0.3048411011695862
```
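To make the efficiency gain of this toy example concrete, we can compare how many parameters a full fine-tune would update against how many LoRA actually trained (a small check using the model and lora_model defined above):

```python
full_params = sum(p.numel() for p in model.parameters())
lora_trainable = sum(p.numel() for p in lora_model.parameters() if p.requires_grad)

print(f"Parameters a full fine-tune would update : {full_params:,}")    # ~2,001,000
print(f"Parameters LoRA (r=1) actually trained   : {lora_trainable:,}") # 4,001
```

With $r = 1$, only about 0.2% of the weights are trainable, which is where LoRA's compute and memory savings come from.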
Conclusion
To sum it all up: LoRA has two major applications. The first is to fine-tune large models with low compute, and the second is to adapt large models in a low-data regime. Transformer models are predominantly a smart arrangement of matrix multiplication operations. By applying LoRA exclusively to these layers, the cost of fine-tuning is significantly decreased, yet high performance is still achieved. The experiments detailing this can be found in the LoRA paper.