Good question. In the original fine-tuning, they train a reward model on relative preferences, i.e., rankings among multiple responses to the same prompt. User feedback, by contrast, is just a single thumbs up or down per response, so it doesn't directly give you the pairs a ranking loss needs. You could probably still use it for supervised fine-tuning, e.g., by training only on the thumbs-up responses.
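To make the contrast concrete, here's a minimal sketch in PyTorch. Everything in it is illustrative, not from the original setup: the linear `reward_head` stands in for a scoring head on top of a real language model, the 768-dim embeddings are dummies, and `pairwise_ranking_loss` is just the standard Bradley-Terry style objective that ranking data supports, shown next to the simpler option of filtering thumbs-up responses into an SFT set.

```python
import torch
import torch.nn.functional as F

# Toy stand-in for a reward model: maps a response embedding to a scalar score.
# In practice this head sits on top of a pretrained language model.
reward_head = torch.nn.Linear(768, 1)

def pairwise_ranking_loss(emb_chosen, emb_rejected):
    """Bradley-Terry style loss: push the chosen response's score above
    the rejected one's. This is what ranking/preference data supports."""
    r_chosen = reward_head(emb_chosen)
    r_rejected = reward_head(emb_rejected)
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Example with dummy embeddings standing in for model outputs:
chosen, rejected = torch.randn(4, 768), torch.randn(4, 768)
loss = pairwise_ranking_loss(chosen, rejected)

# Thumbs up/down gives no pairs to compare, but it can filter SFT data:
feedback = [("prompt A", "response A", 1), ("prompt B", "response B", 0)]
sft_data = [(p, r) for p, r, thumb in feedback if thumb == 1]  # keep liked responses
```

The point of the sketch: the ranking loss needs two responses to the same prompt to compare, which thumbs-up/down data never provides, while the SFT route only needs per-response labels.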