MaxRL is a slick one-line change to GRPO that optimizes a maximum likelihood objective instead of expected reward.
— alphaXiv (@askalphaxiv) February 19, 2026
Scaling advantages with 1/μ as opposed to traditional REINFORCE or GRPO’s 1/σ allows the model to make better progress from low initial pass rates and better… pic.twitter.com/sJCAlsezeq
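To make the 1/μ versus 1/σ contrast concrete, here is a minimal sketch of the two scalings applied to one group of sampled rewards. It assumes binary (pass/fail) rewards and a group-mean baseline. The function names are hypothetical, and the tweet does not spell out the exact MaxRL estimator, so treat this as an illustration of the scaling idea only.

```python
from statistics import fmean, pstdev


def grpo_advantages(rewards, eps=1e-8):
    """Center rewards on the group mean and divide by the group std (1/sigma)."""
    mu = fmean(rewards)
    sigma = pstdev(rewards)  # population std over the sampled group
    return [(r - mu) / (sigma + eps) for r in rewards]


def mean_scaled_advantages(rewards):
    """Center rewards on the group mean and divide by the mean itself (1/mu).

    Assumption: when no sample succeeds (mu == 0) there is no signal,
    so all advantages are set to zero.
    """
    mu = fmean(rewards)
    if mu == 0:
        return [0.0 for _ in rewards]
    return [(r - mu) / mu for r in rewards]


# A hard prompt: 1 success out of 16 samples (pass rate 1/16).
rewards = [1.0] + [0.0] * 15
grpo = grpo_advantages(rewards)
mean_scaled = mean_scaled_advantages(rewards)

print(f"success advantage, 1/sigma scaling: {grpo[0]:.3f}")
print(f"success advantage, 1/mu scaling:    {mean_scaled[0]:.3f}")
```

For a pass rate p, the lone success gets an advantage of (1 - p)/p under 1/μ scaling but only sqrt((1 - p)/p) under 1/σ scaling, so rare successes on hard prompts are weighted much more heavily as p shrinks.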