[Reward] initial CLoud Reward trainer #2432

kashif · 2024-12-03T11:52:09Z

What does this PR do?

Adds an option to the RewardTrainer to implement the CLoud method.

HuggingFaceDocBuilderDev · 2024-12-03T12:17:35Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

qgallouedec · 2024-12-08T13:19:59Z

docs/source/reward_trainer.mdx

+
+[Critique-out-Loud reward models](https://huggingface.co/papers/2408.11791) are reward models that can reason explicitly about the quality of an input by producing Chain-of-Thought-like critiques of the input before predicting a reward. In classic reward model training, the reward model is trained as a reward head initialized on top of the base LLM. Without LM capabilities, classic reward models act as encoders and must predict rewards within a single forward pass through the model, meaning reasoning must happen implicitly. In contrast, CLoud reward models are trained to both produce explicit reasoning about quality and to score based on these critique reasoning traces.
+
+To train a Critique-out-Loud reward model, you can use the `feedback_method="teacher"` and set the `lm_weight` to a high value. 


How do you set `lm_weight` to a high value? Maybe add a ref or a code example.
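
For illustration, here is a minimal sketch of what "a high `lm_weight`" could mean. It assumes (this is not confirmed by the diff shown here) that `lm_weight` scales a language-modeling loss on the critique tokens, which is then added to the usual pairwise reward loss. The function names `pairwise_reward_loss` and `combined_loss` are hypothetical and not part of TRL.

```python
import math


def pairwise_reward_loss(chosen_reward: float, rejected_reward: float) -> float:
    """Bradley-Terry style loss: -log(sigmoid(chosen - rejected))."""
    margin = chosen_reward - rejected_reward
    # Numerically stable softplus(-margin) = log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def combined_loss(
    chosen_reward: float,
    rejected_reward: float,
    lm_loss: float,
    lm_weight: float,
) -> float:
    """ASSUMPTION: total loss = reward loss + lm_weight * LM loss on the critiques."""
    return pairwise_reward_loss(chosen_reward, rejected_reward) + lm_weight * lm_loss


# With lm_weight = 0 only the reward term remains; a larger value makes
# the critique (language-modeling) term dominate the objective.
reward_only = combined_loss(1.0, 0.0, lm_loss=2.0, lm_weight=0.0)
lm_heavy = combined_loss(1.0, 0.0, lm_loss=2.0, lm_weight=10.0)
```

Under this assumption, "high" is relative to the scale of the reward loss, so the docs would ideally state the default and a suggested range.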

qgallouedec · 2024-12-08T13:25:55Z

docs/source/reward_trainer.mdx

+
+To train a Critique-out-Loud reward model, you can use the `feedback_method="teacher"` and set the `lm_weight` to a high value. 
+
+The dataset should contain the columns `"prompt"`, `"chosen"`, `"rejected"` as well as the `"chosen_feedback"` and `"rejected_feedback"` columns, which contain the Chain-of-Thought-like critiques for the chosen and rejected responses respectively, generated by the same model or a different model. A script to generate such a dataset from a dataset of preferences is available in the `examples/datasets/critique_out_loud_vllm.py` file.


Add a link to the example
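
A short illustration of the expected layout might also help readers. The sketch below only mirrors the five column names listed in the docs text above; the row contents and the helper `missing_columns` are made up for illustration and are not part of TRL or of this PR.

```python
# Column names taken from the docs text in this PR.
REQUIRED_COLUMNS = (
    "prompt",
    "chosen",
    "rejected",
    "chosen_feedback",
    "rejected_feedback",
)


def missing_columns(dataset_dict: dict) -> list:
    """Return the required columns that are absent, in declaration order."""
    return [name for name in REQUIRED_COLUMNS if name not in dataset_dict]


# Hypothetical single-row dataset in column-oriented form.
dummy_dataset = {
    "prompt": ["What is 2 + 2?"],
    "chosen": ["4"],
    "rejected": ["5"],
    "chosen_feedback": ["The answer is arithmetically correct."],
    "rejected_feedback": ["The answer is wrong: 2 + 2 equals 4, not 5."],
}

# A plain preference dataset lacks the two feedback columns.
plain_preferences = {"prompt": [], "chosen": [], "rejected": []}
```

A check like this makes it easy to see which columns a plain preference dataset would still need before it can be used with the feedback option.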

qgallouedec · 2024-12-08T13:34:47Z

tests/test_reward_trainer.py

+    def test_train_with_feedback(self):
+        with tempfile.TemporaryDirectory() as tmp_dir:
+            # Create a dummy dataset with feedback
+            dummy_dataset_dict = {


OK, now I get why you want a tiny version: the zen dataset is missing two columns.
I'll think about the best way to do it.

kashif added 3 commits December 3, 2024 12:51

initial Cloud Reward trainer

b91b135

fix imports

3985c88

fix typo

713c6ba

kashif added 8 commits December 6, 2024 12:01

initial dataset creation script

1200477

tokenizer needs feedback method

0faf183

lm_weight default

69e9ce3

add docs

c9158a5

add example script

27330d1

fix RewardDataCollatorWithPadding

0246f6a

tests still failing

346d8ba

fix test

ca8b970

qgallouedec reviewed Dec 8, 2024
