Self-training: F(x, y) is a filtering/ranking function, e.g., what we call a reward or return. The inputs x may be chosen by humans, but the model generates the candidate outputs y, and F ranks them and selects which ones feed further rounds of self-training. F can be explicit (a computed score) or implicit (a human in the loop, as in RLHF).
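The loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a reference implementation: `select_for_training`, `ToyModel`, `reward`, and `self_train` are hypothetical names, the "model" is a one-parameter toy (y = w * x + noise), and F is an explicit score that prefers y close to 2 * x. For an implicit F, the `F` callable would be replaced by human rankings.

```python
import random
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[float, float]


def select_for_training(
    xs: Sequence[float],
    generate: Callable[[float], float],
    F: Callable[[float, float], float],
    n_samples: int = 8,
    keep: int = 2,
) -> List[Pair]:
    """For each input x, let the model propose n_samples outputs y,
    rank them by F(x, y), and keep the top `keep` as training pairs."""
    selected: List[Pair] = []
    for x in xs:
        candidates = [generate(x) for _ in range(n_samples)]
        ranked = sorted(candidates, key=lambda y: F(x, y), reverse=True)
        selected.extend((x, y) for y in ranked[:keep])
    return selected


class ToyModel:
    """Hypothetical stand-in for a generative model: y = w * x + noise."""

    def __init__(self, w: float = 1.0, noise: float = 1.0, seed: int = 0):
        self.w = w
        self.noise = noise
        self.rng = random.Random(seed)

    def generate(self, x: float) -> float:
        return self.w * x + self.rng.gauss(0.0, self.noise)

    def fit(self, pairs: Sequence[Pair]) -> None:
        # Least-squares fit of w through the origin on the selected pairs.
        den = sum(x * x for x, _ in pairs)
        if den > 0:
            self.w = sum(x * y for x, y in pairs) / den


def reward(x: float, y: float) -> float:
    """Explicit F (an assumption for this toy): higher when y is near 2 * x."""
    return -abs(y - 2.0 * x)


def self_train(model: ToyModel, xs: Sequence[float],
               F: Callable[[float, float], float], rounds: int = 5) -> ToyModel:
    """Repeat: generate, rank and filter with F, then retrain on the survivors."""
    for _ in range(rounds):
        pairs = select_for_training(xs, model.generate, F)
        model.fit(pairs)
    return model
```

Calling `self_train(ToyModel(), [1.0, 2.0, 3.0], reward)` runs a few rounds in which the model is refit only on its own outputs that F ranked highest. The humans supply x, the model supplies y, and F decides what survives.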
Self-Training: Filtering Functions and Model Ranking Systems