The way I read the paper, it was the intermediate model that generated the SFT data, and then they just reused that data for distillation. But please correct me if there is some counterpoint to that in the paper.
Intermediate model SFT data reuse for distillation