Training purely on a model's own synthetic data adds no new information, so there is little reason the model *should* improve. When evals do go up after “self-distillation”, the gain often comes from a less visible tradeoff, e.g. mode collapse in exchange for a better score on an individual eval.
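A toy sketch of the argument, not a claim about any real training run: assume the "model" is just a categorical distribution over tokens, and "self-distillation" means sampling from it and refitting by empirical frequencies. The names `self_distill` and `run`, the vocabulary size, and the sample count are all illustrative assumptions. Because no outside data enters the loop, a token that receives zero samples gets zero probability and can never return, so diversity can only be lost.

```python
import math
import random
from collections import Counter


def entropy(probs):
    """Shannon entropy (in nats) of a dict mapping token -> probability."""
    return -sum(p * math.log(p) for p in probs.values() if p > 0)


def self_distill(probs, n_samples, rng):
    """One generation: sample from the current model, refit on those samples.

    The refit is the maximum-likelihood categorical fit, i.e. the empirical
    frequencies. Tokens that were never sampled are dropped entirely.
    """
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    samples = rng.choices(tokens, weights=weights, k=n_samples)
    counts = Counter(samples)
    return {t: c / n_samples for t, c in counts.items()}


def run(vocab_size=50, n_samples=100, generations=30, seed=0):
    """Return a list of (support_size, entropy) pairs, one per generation."""
    rng = random.Random(seed)
    # Hypothetical starting point: a uniform model over the whole vocabulary.
    probs = {t: 1.0 / vocab_size for t in range(vocab_size)}
    history = [(len(probs), entropy(probs))]
    for _ in range(generations):
        probs = self_distill(probs, n_samples, rng)
        history.append((len(probs), entropy(probs)))
    return history


if __name__ == "__main__":
    for gen, (support, h) in enumerate(run()):
        print(f"gen {gen:2d}  support={support:2d}  entropy={h:.3f}")
```

In this setup the support size is non-increasing by construction, which is the toy analogue of mode collapse. A metric that only rewards the surviving high-probability tokens could still go up while this happens.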
Synthetic Data Training Limitations and Mode Collapse in Self-Distillation
