# In-Depth Explanation of LoRA (Low-Rank Adaptation) and Its Key Concepts

LoRA (Low-Rank Adaptation) is a technique for efficiently fine-tuning large pre-trained models, such as transformer-based models, without modifying the original weight matrix \( W \). Instead, LoRA introduces additional parameters that represent **low-rank updates**, allowing efficient training while preserving the integrity of the pre-trained model. This is done through two smaller matrices, \( A \) and \( B \), whose product approximates the update to the original weight matrix.
Fine-tuning a large model normally involves updating this large weight matrix \( W \) directly, which is expensive in both compute and memory.
LoRA addresses the problem of fine-tuning large models by introducing a **low-rank approximation** to the update of \( W \), instead of directly modifying \( W \). The idea is to keep the pre-trained model frozen and instead learn only a small number of additional parameters that represent the update to the weights.

- **Low-rank approximation**: Rather than updating \( W \) directly, LoRA approximates the change to \( W \) using two matrices, \( A \) and \( B \), such that:

$$
W_{\text{new}} = W + AB^T
$$

where:
- \( A \) has shape \( d_{\text{out}} \times r \),
- \( B \) has shape \( d_{\text{in}} \times r \),
- \( r \) is a small rank, typically a small number such as 4 or 8.
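This factorization can be sketched in a few lines of NumPy. The dimensions and initialization scales below are illustrative assumptions, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions): a 512 x 256 weight matrix
# adapted with rank r = 8.
d_out, d_in, r = 512, 256, 8

W = rng.normal(size=(d_out, d_in))            # frozen pre-trained weights
A = rng.normal(scale=0.01, size=(d_out, r))   # trainable factor, d_out x r
B = rng.normal(scale=0.01, size=(d_in, r))    # trainable factor, d_in x r

delta_W = A @ B.T                 # low-rank update, shape (d_out, d_in)
W_new = W + delta_W               # adapted weights; W itself is untouched

print(delta_W.shape)                     # (512, 256)
print(np.linalg.matrix_rank(delta_W))    # 8: the update has rank at most r
```

Even though \( AB^T \) has the same shape as \( W \), it only carries \( r(d_{\text{out}} + d_{\text{in}}) \) trainable parameters rather than \( d_{\text{out}} \cdot d_{\text{in}} \).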
This means that LoRA is **injecting two smaller matrices** (not modifying \( W \) directly):

- **Matrix \( A \)**: It has dimensions \( d_{\text{out}} \times r \), learning how to project from the low-rank \( r \)-dimensional space up to the output dimension.

- **Matrix \( B \)**: It has dimensions \( d_{\text{in}} \times r \), learning how to project the input dimension into the \( r \)-dimensional space.

- The product \( AB^T \) gives the low-rank update matrix:

$$
AB^T \quad \text{of size} \quad d_{\text{out}} \times d_{\text{in}}
$$

- This matrix \( AB^T \) is added to the original weight matrix \( W \), which results in:

$$
W_{\text{new}} = W + AB^T
$$

This update is what allows the model to adapt to new data or tasks, without changing the core pre-trained model.
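A useful consequence of the factored form (standard practice, though not stated above) is that the update can be applied without ever materializing the full \( d_{\text{out}} \times d_{\text{in}} \) matrix \( AB^T \). A minimal NumPy check, with illustrative dimensions:

```python
import numpy as np

rng = np.random.default_rng(1)
d_out, d_in, r = 64, 32, 4          # illustrative sizes (assumptions)

W = rng.normal(size=(d_out, d_in))
A = rng.normal(size=(d_out, r))
B = rng.normal(size=(d_in, r))
x = rng.normal(size=(d_in,))

# Naive form: materialize the full d_out x d_in update first.
y_full = (W + A @ B.T) @ x

# Factored form: apply B^T, then A, never building AB^T explicitly.
# This costs O(r * (d_in + d_out)) extra work instead of O(d_out * d_in).
y_factored = W @ x + A @ (B.T @ x)

print(np.allclose(y_full, y_factored))   # True
```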

---

Here’s how the process works at a high level:

1. **Initialization**: The original pre-trained weight matrix \( W \) is frozen. LoRA introduces two new matrices, \( A \) and \( B \), and \( r \) is set (e.g., 4). In practice, one factor is initialized with small random values and the other with zeros, so the update \( AB^T \) starts at zero and the model initially behaves exactly like the pre-trained one.
2. **Forward Pass**: During the forward pass, the model computes the output using \( W \) along with the low-rank update \( AB^T \). So, the output is:

$$
y = W_{\text{new}} \cdot x = (W + AB^T) \cdot x
$$

The original \( W \) remains unchanged; the change comes from the product \( AB^T \).
3. **Backpropagation**: During backpropagation, gradients are computed and used to update \( A \) and \( B \). Since \( W \) is frozen, only \( A \) and \( B \) are updated. This is the key aspect of LoRA — **the original model parameters are not modified**, only the low-rank update parameters \( A \) and \( B \) are.
4. **Final Model**: Once training is complete, the new weight matrix is:

$$
W_{\text{new}} = W + AB^T
$$

where \( AB^T \) represents the task-specific adjustment. This allows you to use the pre-trained model for the new task without modifying its original weights.
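The four steps above can be sketched end to end in NumPy. This is a toy least-squares example with made-up dimensions and hand-derived gradients, not an actual LoRA implementation (real fine-tuning relies on an autograd framework); the point it illustrates is that only \( A \) and \( B \) take gradient steps while \( W \) stays frozen:

```python
import numpy as np

rng = np.random.default_rng(2)
d_out, d_in, r, lr = 16, 8, 2, 1e-3   # toy sizes and learning rate (assumptions)

W = rng.normal(size=(d_out, d_in))          # frozen; never updated below
A = np.zeros((d_out, r))                    # zero init keeps A @ B.T = 0 at step 0
B = rng.normal(scale=0.1, size=(d_in, r))

x = rng.normal(size=(d_in,))                # toy input
y_target = rng.normal(size=(d_out,))        # toy regression target

def loss(A, B):
    err = W @ x + A @ (B.T @ x) - y_target
    return 0.5 * np.sum(err ** 2)

loss_before = loss(A, B)
for _ in range(200):
    err = W @ x + A @ (B.T @ x) - y_target  # forward pass + residual
    A -= lr * np.outer(err, B.T @ x)        # dL/dA, shape (d_out, r)
    B -= lr * np.outer(x, A.T @ err)        # dL/dB, shape (d_in, r); W untouched
loss_after = loss(A, B)

print(loss_after < loss_before)             # True: A and B alone adapt the model
```

The gradients follow from the chain rule on \( \tfrac{1}{2}\lVert Wx + A(B^T x) - y \rVert^2 \); in a framework like PyTorch they would come from autograd with `W.requires_grad = False`.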

---

- The matrices \( A \) and \( B \) serve to approximate the changes to the original weight matrix \( W \), with \( r \) being the key hyperparameter that constrains the complexity of these updates.
- **Choosing \( r \)** is a trade-off between computational efficiency and the model’s ability to express the necessary task-specific changes.
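To make this trade-off concrete, a small back-of-the-envelope calculation compares trainable-parameter counts at a few ranks. The 4096 × 4096 shape is an illustrative assumption, roughly the size of an attention projection in a large transformer:

```python
# Full fine-tuning of one weight matrix vs. a LoRA update at a few ranks.
d_out, d_in = 4096, 4096                 # illustrative shape (assumption)
full = d_out * d_in                      # 16,777,216 trainable parameters

for r in (4, 8, 64):
    lora = r * (d_out + d_in)            # params in A (d_out x r) + B (d_in x r)
    print(f"r={r}: {lora} params, {lora / full:.2%} of full fine-tuning")
```

At \( r = 8 \), the LoRA update trains well under 1% of the parameters of the full matrix, which is why small ranks are often sufficient in practice.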

The main advantage of LoRA is that it allows efficient fine-tuning of large pre-trained models without the need to modify the large weight matrices directly. Instead, you introduce a low-rank update that is computationally efficient and easy to learn.


This document is written in simple terms to help readers understand **LoRA**. The content is a mix of original writing and contributions from ChatGPT for better clarity.
