Also super interesting in terms of SFT: a two-stage setup outperforms a single stage trained on the same data. Curious to see whether that still holds after preference alignment. In general, it feels like a missed opportunity that there's no experiment with DPO.
Two-stage SFT outperforms single-stage setup in preference alignment