custom reward function support for ppo trainer #2540

August-murr · 2025-01-03T09:41:10Z

What does this PR do?

Adds support for a custom reward function in the PPO trainer.

How it works

Write a custom function that takes a list of texts as input, representing a batch of responses, and outputs a list of scores.

def custom_reward_function(texts: list) -> list:
    """
    Custom reward function that applies a given reward logic to each text in the batch.

    Args:
        texts (list): Batch of response texts to evaluate.

    Returns:
        list: One reward per text, in the same order as the input.
    """
    # reward_logic is a placeholder for your own per-text scoring function.
    rewards = [reward_logic(text) for text in texts]
    return rewards
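As a concrete illustration of the template above, here is a minimal sketch of a reward function with the same shape (a list of texts in, a list of scores out). The function name `length_target_reward` and the constant `TARGET_WORDS` are hypothetical and not part of this PR; the scoring rule is only an example.

```python
from typing import List

# Hypothetical target length, chosen only for illustration.
TARGET_WORDS = 20


def length_target_reward(texts: List[str]) -> List[float]:
    """Score each response by how close its word count is to TARGET_WORDS."""
    rewards = []
    for text in texts:
        n_words = len(text.split())
        if n_words == 0:
            # Empty responses get a fixed penalty.
            rewards.append(-1.0)
        else:
            # 1.0 at the target length, falling linearly to 0.0.
            distance = abs(n_words - TARGET_WORDS) / TARGET_WORDS
            rewards.append(1.0 - min(distance, 1.0))
    return rewards
```

The function returns exactly one score per input text and preserves order, so each reward can be matched back to its response in the batch.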

I will add more documentation and explanations later after running several tests to make sure the implementation is functional.

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines.
Did you write any new necessary tests?

Who can review?

@qgallouedec

August-murr · 2025-01-03T09:43:34Z

trl/trainer/utils.py

@@ -1049,14 +1049,20 @@ def first_true_indices(bools: torch.Tensor, dtype=torch.long):


 def get_reward(


This is where the primary changes are: the get_reward function is modified to work with both an nn.Module and a Callable.
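The diff itself is not shown here, so the following is only a rough sketch of the kind of dispatch the comment describes, not the code in this PR. To stay dependency-free it uses a hypothetical `StubRewardModel` class in place of an nn.Module, and the names `score_batch` and `decode` are assumptions as well.

```python
from typing import Callable, List, Sequence


class StubRewardModel:
    """Hypothetical stand-in for an nn.Module-based reward model."""

    def score_token_ids(self, token_ids: Sequence[Sequence[int]]) -> List[float]:
        # Toy scoring: the reward is the sequence length.
        return [float(len(ids)) for ids in token_ids]


def score_batch(
    reward_source,
    token_ids: Sequence[Sequence[int]],
    decode: Callable[[Sequence[int]], str],
) -> List[float]:
    """Score a batch with either a model-like object or a plain callable."""
    # Check the model type first. A real nn.Module is itself callable,
    # so testing callable() first would never reach the model branch.
    if isinstance(reward_source, StubRewardModel):
        return reward_source.score_token_ids(token_ids)
    if callable(reward_source):
        # A custom reward function works on decoded text, not token ids.
        texts = [decode(ids) for ids in token_ids]
        rewards = list(reward_source(texts))
        if len(rewards) != len(texts):
            raise ValueError("reward function must return one score per text")
        return [float(r) for r in rewards]
    raise TypeError("reward_source must be a reward model or a callable")
```

The ordering of the two checks is the main design point: the type check has to come before the generic `callable` check, and the callable branch validates that the batch size is preserved.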

HuggingFaceDocBuilderDev · 2025-01-03T09:44:45Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

custom reward function support for ppo trainer

a7b91ba

August-murr requested a review from qgallouedec January 3, 2025 09:41
