Update 5. Understanding ULMA.md
shaheennabi authored Nov 23, 2024
1 parent 55f0788 commit cbd8cb8
Showing 1 changed file with 17 additions and 9 deletions.
docs/5. Understanding ULMA.md: 26 changes (17 additions & 9 deletions)
@@ -70,17 +70,25 @@ After instruction fine-tuning, the model goes through a process of **Reinforcement Learning from Human Feedback (RLHF)**
- **Loss Function**:
The model is updated using **Proximal Policy Optimization (PPO)**. The reward function is used to guide the model's behavior:

$$
L_{\text{PPO}} = - \mathbb{E}_{\pi_\theta} \left[ \text{reward}(s_t, a_t) \right]
$$

Rather than optimizing the expected reward directly, PPO maximizes a clipped surrogate objective, defined as:

$$
L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \text{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]
$$

Where:
- $\pi_\theta$ is the policy (language model)
- $s_t$ is the state (the prompt)
- $a_t$ is the action (the response)
- $\text{reward}(s_t, a_t)$ is the reward assigned to the response
- $\hat{A}_t$ is the estimated advantage of the response, i.e., how much better it is than expected under the reward signal
- $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}$ is the probability ratio between the current and the previous (old) policy
- $\epsilon$ is a hyperparameter that controls the clipping range (typically 0.2)

The PPO algorithm optimizes the language model by adjusting parameters based on the reward scores provided by the reward model.

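To make the clipped objective concrete, the following is a minimal sketch of the computation, written as a hypothetical PyTorch snippet for this explanation (it is not code from this repository). Given log-probabilities of the sampled responses under the current policy and the frozen old policy, plus advantage estimates derived from the reward model's scores, it computes the ratio $r_t(\theta)$, clips it, and returns the negative of $L^{\text{CLIP}}$ as a loss to minimize.

```python
import torch

def ppo_clip_loss(logp_new: torch.Tensor,
                  logp_old: torch.Tensor,
                  advantages: torch.Tensor,
                  epsilon: float = 0.2) -> torch.Tensor:
    """Negative clipped surrogate objective, averaged over the batch."""
    # r_t(theta) = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t), computed in log space for stability
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # Objective is E_t[min(unclipped, clipped)]; return its negative as a loss to minimize
    return -torch.min(unclipped, clipped).mean()

# Dummy example values; in RLHF these come from the policy and the reward model
logp_new = torch.tensor([-1.2, -0.8, -2.0], requires_grad=True)
logp_old = torch.tensor([-1.0, -0.9, -1.8])
advantages = torch.tensor([0.5, -0.2, 1.0])

loss = ppo_clip_loss(logp_new, logp_old, advantages)
loss.backward()  # gradients flow only through the current policy's log-probabilities
print(loss.item())
```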

---
