# In-Depth Explanation of LoRA (Low-Rank Adaptation) and Its Key Concepts

LoRA (Low-Rank Adaptation) is a technique for efficiently fine-tuning large pre-trained models, such as transformer-based models, without modifying the original weight matrix \( W \). Instead, LoRA introduces additional parameters that represent **low-rank updates**, allowing efficient training while preserving the integrity of the pre-trained model. This is done through two smaller matrices, \( A \) and \( B \), whose product approximates the update to the original weight matrix.
Fine-tuning a large model normally involves updating this large weight matrix \( W \) directly, which is expensive in both compute and memory.
LoRA addresses the problem of fine-tuning large models by introducing a **low-rank approximation** to the update of \( W \), instead of directly modifying \( W \). The idea is to keep the pre-trained model frozen and instead learn only a small number of additional parameters that represent the update to the weights.

- **Low-rank approximation**: Rather than updating \( W \) directly, LoRA approximates the change to \( W \) using two matrices, \( A \) and \( B \), such that:

$$
W_{\text{new}} = W + AB^T
$$

where:
- \( A \) has shape \( d_{\text{out}} \times r \),
- \( B \) has shape \( d_{\text{in}} \times r \),
- \( r \) is a small rank, typically a small number such as 4 or 8.
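This factorization can be sketched in a few lines of NumPy. The dimensions and initialization scales below are illustrative assumptions, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions): a 512 x 256 weight matrix
# adapted with rank r = 8.
d_out, d_in, r = 512, 256, 8

W = rng.normal(size=(d_out, d_in))            # frozen pre-trained weights
A = rng.normal(scale=0.01, size=(d_out, r))   # trainable factor, d_out x r
B = rng.normal(scale=0.01, size=(d_in, r))    # trainable factor, d_in x r

delta_W = A @ B.T                 # low-rank update, shape (d_out, d_in)
W_new = W + delta_W               # adapted weights; W itself is untouched

print(delta_W.shape)                     # (512, 256)
print(np.linalg.matrix_rank(delta_W))    # 8: the update has rank at most r
```

Even though \( AB^T \) has the same shape as \( W \), it only carries \( r(d_{\text{out}} + d_{\text{in}}) \) trainable parameters rather than \( d_{\text{out}} \cdot d_{\text{in}} \).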
This means that LoRA is **injecting two smaller matrices** (not modifying \( W \) directly):

- **Matrix \( A \)**: It has dimensions \( d_{\text{out}} \times r \), learning how to project from the low-rank \( r \)-dimensional space up to the output dimension.

- **Matrix \( B \)**: It has dimensions \( d_{\text{in}} \times r \), learning how to project the input dimension into the \( r \)-dimensional space.

- The product \( AB^T \) gives the low-rank update matrix:

$$
AB^T \quad \text{of size} \quad d_{\text{out}} \times d_{\text{in}}
$$

- This matrix \( AB^T \) is added to the original weight matrix \( W \), which results in:

$$
W_{\text{new}} = W + AB^T
$$

This update is what allows the model to adapt to new data or tasks, without changing the core pre-trained model.
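A useful consequence of the factored form (standard practice, though not stated above) is that the update can be applied without ever materializing the full \( d_{\text{out}} \times d_{\text{in}} \) matrix \( AB^T \). A minimal NumPy check, with illustrative dimensions:

```python
import numpy as np

rng = np.random.default_rng(1)
d_out, d_in, r = 64, 32, 4          # illustrative sizes (assumptions)

W = rng.normal(size=(d_out, d_in))
A = rng.normal(size=(d_out, r))
B = rng.normal(size=(d_in, r))
x = rng.normal(size=(d_in,))

# Naive form: materialize the full d_out x d_in update first.
y_full = (W + A @ B.T) @ x

# Factored form: apply B^T, then A, never building AB^T explicitly.
# This costs O(r * (d_in + d_out)) extra work instead of O(d_out * d_in).
y_factored = W @ x + A @ (B.T @ x)

print(np.allclose(y_full, y_factored))   # True
```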

---

Here’s how the process works at a high level:

1. **Initialization**: The original pre-trained weight matrix \( W \) is frozen. LoRA introduces two new matrices, \( A \) and \( B \), and \( r \) is set (e.g., 4). In practice, one factor is initialized with small random values and the other with zeros, so the update \( AB^T \) starts at zero and the model initially behaves exactly like the pre-trained one.
2. **Forward Pass**: During the forward pass, the model computes the output using \( W \) along with the low-rank update \( AB^T \). So, the output is:

$$
y = W_{\text{new}} \cdot x = (W + AB^T) \cdot x
$$

The original \( W \) remains unchanged; the change comes from the product \( AB^T \).
3. **Backpropagation**: During backpropagation, gradients are computed and used to update \( A \) and \( B \). Since \( W \) is frozen, only \( A \) and \( B \) are updated. This is the key aspect of LoRA — **the original model parameters are not modified**, only the low-rank update parameters \( A \) and \( B \) are.
4. **Final Model**: Once training is complete, the new weight matrix is:

$$
W_{\text{new}} = W + AB^T
$$

where \( AB^T \) represents the task-specific adjustment. This allows you to use the pre-trained model for the new task without modifying its original weights.
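The four steps above can be sketched end to end in NumPy. This is a toy least-squares example with made-up dimensions and hand-derived gradients, not an actual LoRA implementation (real fine-tuning relies on an autograd framework); the point it illustrates is that only \( A \) and \( B \) take gradient steps while \( W \) stays frozen:

```python
import numpy as np

rng = np.random.default_rng(2)
d_out, d_in, r, lr = 16, 8, 2, 1e-3   # toy sizes and learning rate (assumptions)

W = rng.normal(size=(d_out, d_in))          # frozen; never updated below
A = np.zeros((d_out, r))                    # zero init keeps A @ B.T = 0 at step 0
B = rng.normal(scale=0.1, size=(d_in, r))

x = rng.normal(size=(d_in,))                # toy input
y_target = rng.normal(size=(d_out,))        # toy regression target

def loss(A, B):
    err = W @ x + A @ (B.T @ x) - y_target
    return 0.5 * np.sum(err ** 2)

loss_before = loss(A, B)
for _ in range(200):
    err = W @ x + A @ (B.T @ x) - y_target  # forward pass + residual
    A -= lr * np.outer(err, B.T @ x)        # dL/dA, shape (d_out, r)
    B -= lr * np.outer(x, A.T @ err)        # dL/dB, shape (d_in, r); W untouched
loss_after = loss(A, B)

print(loss_after < loss_before)             # True: A and B alone adapt the model
```

The gradients follow from the chain rule on \( \tfrac{1}{2}\lVert Wx + A(B^T x) - y \rVert^2 \); in a framework like PyTorch they would come from autograd with `W.requires_grad = False`.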

---

- The matrices \( A \) and \( B \) serve to approximate the changes to the original weight matrix \( W \), with \( r \) being the key hyperparameter that constrains the complexity of these updates.
- **Choosing \( r \)** is a trade-off between computational efficiency and the model’s ability to express the necessary task-specific changes.
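To make this trade-off concrete, a small back-of-the-envelope calculation compares trainable-parameter counts at a few ranks. The 4096 × 4096 shape is an illustrative assumption, roughly the size of an attention projection in a large transformer:

```python
# Full fine-tuning of one weight matrix vs. a LoRA update at a few ranks.
d_out, d_in = 4096, 4096                 # illustrative shape (assumption)
full = d_out * d_in                      # 16,777,216 trainable parameters

for r in (4, 8, 64):
    lora = r * (d_out + d_in)            # params in A (d_out x r) + B (d_in x r)
    print(f"r={r}: {lora} params, {lora / full:.2%} of full fine-tuning")
```

At \( r = 8 \), the LoRA update trains well under 1% of the parameters of the full matrix, which is why small ranks are often sufficient in practice.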

The main advantage of LoRA is that it allows efficient fine-tuning of large pre-trained models without the need to modify the large weight matrices directly. Instead, you introduce a low-rank update that is computationally efficient and easy to learn.


This document is written in simple terms to help readers understand **LoRA**. The content is a mix of original writing and contributions from ChatGPT for better clarity.
