Update 5. Understanding ULMA.md
shaheennabi authored Nov 23, 2024
1 parent 55f0788 commit cbd8cb8
Showing 1 changed file with 17 additions and 9 deletions.
docs/5. Understanding ULMA.md: 26 changes (17 additions & 9 deletions)
@@ -70,17 +70,25 @@ After instruction fine-tuning, the model goes through a process of **Reinforcement Learning from Human Feedback (RLHF)**
- **Loss Function**:
The model is updated using **Proximal Policy Optimization (PPO)**. The reward function is used to guide the model's behavior:

$$
L_{\text{PPO}} = - \mathbb{E}_{\pi_\theta} \left[ \text{reward}(s_t, a_t) \right]
$$

Rather than optimizing the expected reward directly, PPO maximizes a clipped surrogate objective, defined as:

$$
L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \text{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]
$$

Where:
- $\pi_\theta$ is the policy (language model)
- $s_t$ is the state (the prompt)
- $a_t$ is the action (the response)
- $\text{reward}(s_t, a_t)$ is the reward assigned to the response
- $\hat{A}_t$ is the estimated advantage of the response, i.e., how much better it is than expected under the reward signal
- $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}$ is the probability ratio between the current and the previous (old) policy
- $\epsilon$ is a hyperparameter that controls the clipping range (typically 0.2)

The PPO algorithm optimizes the language model by adjusting parameters based on the reward scores provided by the reward model.

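To make the clipped objective concrete, the following is a minimal sketch of the computation, written as a hypothetical PyTorch snippet for this explanation (it is not code from this repository). Given log-probabilities of the sampled responses under the current policy and the frozen old policy, plus advantage estimates derived from the reward model's scores, it computes the ratio $r_t(\theta)$, clips it, and returns the negative of $L^{\text{CLIP}}$ as a loss to minimize.

```python
import torch

def ppo_clip_loss(logp_new: torch.Tensor,
                  logp_old: torch.Tensor,
                  advantages: torch.Tensor,
                  epsilon: float = 0.2) -> torch.Tensor:
    """Negative clipped surrogate objective, averaged over the batch."""
    # r_t(theta) = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t), computed in log space for stability
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # Objective is E_t[min(unclipped, clipped)]; return its negative as a loss to minimize
    return -torch.min(unclipped, clipped).mean()

# Dummy example values; in RLHF these come from the policy and the reward model
logp_new = torch.tensor([-1.2, -0.8, -2.0], requires_grad=True)
logp_old = torch.tensor([-1.0, -0.9, -1.8])
advantages = torch.tensor([0.5, -0.2, 1.0])

loss = ppo_clip_loss(logp_new, logp_old, advantages)
loss.backward()  # gradients flow only through the current policy's log-probabilities
print(loss.item())
```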

---
