"Co-Evolving Policy Distillation" A lot of post-training today follows a simple recipe where you train separate experts, then distill them into one model. But the problem is, by the time distillation starts, the expert and the student have already drifted too far apart, so a
Co-Evolving Policy Distillation Tackles Expert-Student Drift in Post-Training
By
–
