## Reward Model Working

![Reinforcement-Learning-1](https://github.com/user-attachments/assets/be55dc2c-02b4-4779-b5e4-7d5e0a00ca45)

So, before diving deep into this project, I will explain some important concepts about how **Large Language Models** are actually trained. In this file, I will focus on the **reward model**.

As you can see in the image, when we prompt a large language model like ChatGPT based on GPT-4 or other models running in the backend, we get a response. For example, I might ask ChatGPT: *I like to play cricket*, and it responds with:

(Response 1): *Ok! It's good to know that you like cricket.*

Now, observe here: when ChatGPT was initially launched in November-December 2022, it was good at many things but sometimes gave misleading or suboptimal answers. For instance, in the example above, the response *Ok! It's good to know that you like cricket* might not be the best response. A better response could be:

(Response 2): *Wow! Cricket is a great game as it improves your health and your mental ability.*

Can you see the difference between these responses? Response 2 is far better than Response 1. This posed a challenge for researchers: ensuring LLMs generate responses that are more human-preferred (aligned with human expectations). This is where the **Reward Model** comes into play.

The reward model is distinct from the main LLM (e.g., any GPT model). What it does is assign a score (a.k.a. reward signal) to the responses generated by the main LLM. Let me describe this briefly:

## How the Reward Model Works

The reward model is a **separate model**, trained independently on the responses generated by the main LLM (e.g., GPT).

For example, let's ask ChatGPT:

**Prompt:** Hi, how are you?

**Response 1:** I am good.

And a few more responses:

**Response 2:** Hi there, I am doing great!

**Response 3:** Thanks for asking, I am doing great. How have you been?

Now, observe these responses. We can see that **Response 3** is far better and more human-aligned than **Response 2** and **Response 1**. We can assign rankings like this: **Response 3 > Response 2 > Response 1**.

This is exactly how the reward model works internally. When the main LLM generates responses, these responses are collected, and human labelers (in this case, researchers) rank them based on how well they align with human preferences (i.e., what humans expect).

As seen in the figure, the responses from the LLM are labeled by human annotators, who rank them. For each **response**, a score is assigned. For example:

- **Response 3** = 0.95
- **Response 2** = 0.7
- **Response 1** = 0.4

We want the main LLM to generate responses like **Response 3**. So, we pair the prompt and response as input and use the score as the output to train the reward model.
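
To make this concrete, here is a minimal sketch, in Python, of what such labeled training data could look like once prompt, response, and human score are paired together (the variable name `training_examples` and the exact record structure are just my illustration, not a real dataset format):

```python
# Hypothetical labeled examples for reward-model training.
# Each record pairs a prompt and one candidate response (the model inputs)
# with the human-assigned preference score (the training target).
training_examples = [
    {
        "prompt": "Hi, how are you?",
        "response": "Thanks for asking, I am doing great. How have you been?",
        "score": 0.95,
    },
    {
        "prompt": "Hi, how are you?",
        "response": "Hi there, I am doing great!",
        "score": 0.7,
    },
    {
        "prompt": "Hi, how are you?",
        "response": "I am good.",
        "score": 0.4,
    },
]
```

A real dataset contains many prompts, each with several ranked responses, but the shape of each record is the same: (prompt, response) in, score out.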

### Training the Reward Model

The reward model is trained using a transformer-based architecture, similar to how we train deep neural networks. The prompt and response serve as **independent variables**, and the **score** is the dependent variable (the target the model has to predict).

For example:
Input:
- **Prompt:** Hi, how are you?
- **Response:** Thanks for asking, I am doing great. How have you been?
- **Score:** 0.95

The reward model is trained on many such input-output pairs. Once trained, given a prompt and a response, it outputs a **scalar score**.
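
Putting these pieces together, here is a minimal, illustrative sketch of such a model in PyTorch. This is **not** how production reward models (e.g., OpenAI's) are actually built; the tiny transformer, the `encode` hash-tokenizer, the hyperparameters, and the plain MSE regression on the score are all simplifications I am assuming just to mirror the description above (prompt and response in, scalar score out):

```python
# A minimal, illustrative reward-model sketch (not a real implementation).
import torch
import torch.nn as nn

VOCAB_SIZE, D_MODEL = 5000, 64

def encode(prompt: str, response: str) -> torch.Tensor:
    # Hypothetical whitespace "tokenizer" just for this sketch; a real
    # reward model would reuse the LLM's tokenizer and pretrained weights.
    words = (prompt + " " + response).split()
    ids = [hash(w) % VOCAB_SIZE for w in words]
    return torch.tensor([ids])  # shape: (1, seq_len)

class ToyRewardModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
        layer = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.score_head = nn.Linear(D_MODEL, 1)  # emits a single scalar reward

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.encoder(self.embed(token_ids))  # (1, seq_len, D_MODEL)
        pooled = hidden.mean(dim=1)                   # (1, D_MODEL)
        return self.score_head(pooled).squeeze(-1)    # (1,) scalar score

model = ToyRewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# One training step on the example above: push the predicted score
# toward the human-assigned 0.95.
tokens = encode("Hi, how are you?",
                "Thanks for asking, I am doing great. How have you been?")
target = torch.tensor([0.95])

predicted = model(tokens)
loss = loss_fn(predicted, target)
optimizer.zero_grad()
loss.backward()
optimizer.step()

print(f"predicted score: {predicted.item():.3f}, loss: {loss.item():.3f}")
```

In practice, reward models are usually initialized from the main LLM itself (so they reuse its tokenizer and pretrained weights) and are often trained with a pairwise ranking loss over preferred vs. rejected responses rather than direct score regression, but the core idea is the same: a network that reads a prompt-response pair and emits one scalar reward.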

### Summary

The reward model is trained like any deep neural network and provides a scalar output (**score**). Once trained, the reward model is used to fine-tune the main LLM (e.g., ChatGPT or GPT model).
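
As a small taste of how that fine-tuning signal is consumed (the `score_candidates` helper and the stand-in scorer below are purely hypothetical; the actual PPO machinery is covered in the next file), the trained reward model simply acts as a judge over the candidate responses the main LLM produces:

```python
# Hypothetical helper: a trained reward model is treated as a callable that
# maps a (prompt, response) pair to a scalar score; during fine-tuning, every
# candidate response the main LLM generates gets scored this way.
def score_candidates(reward_model, prompt, candidates):
    return {response: reward_model(prompt, response) for response in candidates}

# Stand-in scorer just for illustration; a real run would call the trained network.
def toy_reward_model(prompt, response):
    return 0.95 if "Thanks for asking" in response else 0.4

print(score_candidates(
    toy_reward_model,
    "Hi, how are you?",
    ["I am good.", "Thanks for asking, I am doing great. How have you been?"],
))
```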

**In the next file named: RLHF with PPO, I will explain how this Reward Model is used to fine-tune or guide the main LLM.**
