Conversation

@devpatelio (Collaborator) commented Jan 19, 2026

GDPO is an extension of GRPO for multi-reward settings: each reward function is normalized group-wise before the advantage is computed, and the resulting advantages are then batch-normalized across all prompts belonging to a given batch (GDPO Paper).
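
A minimal sketch of those two normalization steps, assuming rewards arrive as a `(num_rewards, batch_size)` tensor and `group_ids` maps each sample to its prompt group. The function name `gdpo_advantage`, the `group_ids` layout, the `eps` guard, and the choice to sum across rewards are all illustrative assumptions, not this PR's actual API:

```python
import torch


def gdpo_advantage(rewards: torch.Tensor, group_ids: torch.Tensor,
                   eps: float = 1e-6) -> torch.Tensor:
    """Group-wise normalize each reward function, aggregate, then batch-normalize."""
    normalized = torch.zeros_like(rewards)

    # Step 1: normalize each reward function independently within each prompt group.
    for gid in group_ids.unique():
        mask = group_ids == gid
        group = rewards[:, mask]                        # (num_rewards, group_size)
        mean = group.mean(dim=1, keepdim=True)
        std = group.std(dim=1, keepdim=True, unbiased=False)
        normalized[:, mask] = (group - mean) / (std + eps)

    # Step 2: aggregate per-reward scores into one advantage per sample
    # (summing across rewards is an assumption of this sketch).
    advantages = normalized.sum(dim=0)                  # (batch_size,)

    # Step 3: batch-normalize the advantages across all prompts in the batch.
    return (advantages - advantages.mean()) / (advantages.std(unbiased=False) + eps)
```

For example, with two reward functions and two prompt groups of size 4, `gdpo_advantage(rewards, torch.tensor([0, 0, 0, 0, 1, 1, 1, 1]))` returns one normalized advantage per sample. Normalizing each reward within its group before aggregating keeps differently scaled rewards comparable, which is the point of doing the group-wise step per reward function rather than on the summed reward.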

Points of clarification:

TODOs:

  • add multi-reward functionality
  • test GDPO performance against GRPO

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces support for Group-wise Distributional Policy Optimization (GDPO) advantage estimation by adding a new advantage estimator, compute_gdpo_outcome_advantage, to the PPO utilities. The review identified a medium-severity issue in the new implementation: the group-wise normalization step can divide by zero when a group's rewards have zero standard deviation. It also flagged a broken link to the reference paper in the docstring.
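
Regarding the division-by-zero finding, a minimal sketch of the kind of guard being suggested, where `safe_group_normalize` and the `eps` value are hypothetical names rather than the PR's actual code:

```python
import torch


def safe_group_normalize(group_rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize one group's rewards without dividing by zero."""
    mean = group_rewards.mean()
    # unbiased=False keeps std at 0.0 (rather than NaN) for single-sample groups;
    # adding eps prevents division by zero when all rewards in the group are equal.
    std = group_rewards.std(unbiased=False)
    return (group_rewards - mean) / (std + eps)
```

For a group whose rewards are all equal, the unguarded version would return inf/NaN advantages, while the guarded version returns zeros.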
