AI Dynamics

Global AI News Aggregator

MaxRL: One-Line GRPO Improvement for Better Training Scaling

MaxRL is a slick one-line change to GRPO that optimizes a maximum likelihood objective instead of expected reward. Scaling advantages by 1/μ, rather than leaving them unscaled as in traditional REINFORCE or dividing by σ as in GRPO, allows the model to make better progress from low initial pass rates and better...
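To make the scaling difference concrete, here is a minimal sketch of the three advantage variants the post mentions, assuming binary pass/fail rewards for a group of rollouts sampled from one prompt. The function name `group_advantages`, the `mode` strings, and the `eps` guard are illustrative assumptions, not taken from the post or from any MaxRL implementation. Here μ is taken to be the group mean reward (the empirical pass rate) and σ the group standard deviation.

```python
from statistics import mean, pstdev


def group_advantages(rewards, mode="maxrl", eps=1e-8):
    """Return per-rollout advantages for one prompt's group of rewards.

    Hypothetical helper for illustration only:
      "reinforce": r - mu            (mean baseline, no scaling)
      "grpo":      (r - mu) / sigma  (scaled by group std)
      "maxrl":     (r - mu) / mu     (scaled by group mean, per the post)
    """
    mu = mean(rewards)
    centered = [r - mu for r in rewards]
    if mode == "reinforce":
        scale = 1.0
    elif mode == "grpo":
        scale = pstdev(rewards) + eps  # eps guards against zero variance
    elif mode == "maxrl":
        scale = mu + eps  # eps guards against an all-fail group
    else:
        raise ValueError(f"unknown mode: {mode}")
    return [c / scale for c in centered]


if __name__ == "__main__":
    # One success out of eight rollouts: a low initial pass rate.
    rewards = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
    for mode in ("reinforce", "grpo", "maxrl"):
        print(mode, group_advantages(rewards, mode)[0])
```

With a pass rate of 1/8, the single successful rollout gets an advantage of about 2.65 under the 1/σ scaling and about 7 under the 1/μ scaling, so rare successes are weighted more heavily. One way to read the 1/μ factor is that the gradient of the log pass rate equals the gradient of the pass rate divided by the pass rate itself; how closely this sketch matches the actual MaxRL objective is an assumption.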

→ View original post on X — @askalphaxiv