RLHF Training: Avoiding Model Drift Through Gradient Mixing

Here's some free alpha: if we run RL for too long after pretraining, we will eventually overwrite parameters and the model will start to forget things. In the original InstructGPT paper, their best model mixed pretraining gradients into the RLHF updates to avoid exactly this model-drift issue, yet almost no one is doing this.
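For concreteness, the InstructGPT trick (their "PPO-ptx" variant) amounts to adding a pretraining language-modeling term to the RL objective, loss = loss_PPO + gamma * loss_pretrain, so every update also pulls the policy back toward the pretraining distribution. Below is a minimal PyTorch sketch of that mixing step; ppo_loss, rl_loader, and pretrain_loader are hypothetical stand-ins for a real PPO setup, which would also need rollouts, a reward model, and a KL penalty.

    # Minimal sketch of InstructGPT-style "PPO-ptx" gradient mixing.
    # ppo_loss, rl_loader, and pretrain_loader are hypothetical placeholders;
    # the full PPO machinery (rollouts, reward model, value head) is elided.

    import torch
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained("gpt2")
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

    # Pretraining-loss coefficient; the InstructGPT paper reports gamma = 27.8.
    gamma = 27.8

    def pretraining_lm_loss(model, batch):
        """Standard next-token cross-entropy on a batch of pretraining text."""
        out = model(input_ids=batch["input_ids"], labels=batch["input_ids"])
        return out.loss

    for rl_batch, ptx_batch in zip(rl_loader, pretrain_loader):  # hypothetical loaders
        loss_rl = ppo_loss(model, rl_batch)              # hypothetical PPO objective
        loss_ptx = pretraining_lm_loss(model, ptx_batch)
        # Mixing gradients = optimizing a joint loss: both terms backprop into
        # the same parameters, anchoring the policy to the pretraining distribution.
        loss = loss_rl + gamma * loss_ptx
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

The key design point is that the pretraining term shares the optimizer step with the RL term, rather than being a separate fine-tuning phase, so the policy never gets a chance to drift far before being pulled back.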