Trending on alphaXiv: the proper way to do multi-reward GRPO, from NVIDIA. GRPO’s “normalize the summed reward” step can wash out real tradeoffs by collapsing different reward combinations into the same advantage value, e.g. where “got 1…
NVIDIA GRPO Multi-Reward Optimization Technical Advances
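The collapse described in the teaser can be reproduced with a few lines of arithmetic. The sketch below is an illustration only, not NVIDIA's code: it assumes the common GRPO form of the advantage (each rollout's summed reward minus the group mean, divided by the group's population standard deviation), and the helper names `group_advantages` and `summed_reward_advantages` and the toy reward tuples are hypothetical.

```python
from statistics import mean, pstdev


def group_advantages(totals):
    """Group-normalize scalar rewards: (r - mean) / std.

    Assumption: population standard deviation; real implementations
    may use the sample estimate or add a small epsilon instead.
    """
    mu = mean(totals)
    sigma = pstdev(totals)
    if sigma == 0:
        # Every rollout scored the same, so there is no learning signal.
        return [0.0 for _ in totals]
    return [(r - mu) / sigma for r in totals]


def summed_reward_advantages(reward_vectors):
    """Sum each rollout's individual rewards first, then normalize."""
    totals = [sum(v) for v in reward_vectors]
    return group_advantages(totals)


# Each tuple holds two binary rewards for one rollout (toy values).
# Group A: the better rollout satisfies one of the two rewards.
group_a = [(0, 0), (1, 0)]
# Group B: the better rollout satisfies both rewards.
group_b = [(0, 0), (1, 1)]

adv_a = summed_reward_advantages(group_a)
adv_b = summed_reward_advantages(group_b)
# Normalization rescales both groups to the same pair of advantages,
# so "one of two" and "two of two" become indistinguishable.
print(adv_a, adv_b)

# Within one group, rollouts that trade one reward for the other
# also receive identical advantages, because only the sum is kept.
group_c = [(1, 0), (0, 1), (0, 0)]
adv_c = summed_reward_advantages(group_c)
print(adv_c)
```

Under these assumptions, both effects come from discarding information before normalization: the sum hides which reward was earned, and dividing by the group's standard deviation hides how many were earned.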
