For those without context, I’m referring to this sort of thing. There’s a competition going on right now over whether we need “RLHF” (reinforcement learning from human feedback, which I think is just reinforcement learning with a trained reward model) or whether we can do “DPO” (Direct Preference Optimization, essentially supervised learning). No one really knows: https://x.com/yoavartzi/status/1730252149370548598
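To make the contrast concrete, here is a minimal Python sketch of the objectives involved. It is not taken from the linked thread. It assumes each response's log-probability has already been summed into a single float, and it uses an illustrative β (`beta`) of 0.1. All function names are hypothetical. RLHF first fits a reward model with a pairwise Bradley-Terry loss, then runs RL against that learned reward minus a KL penalty toward a reference model. DPO skips both stages and minimizes a single logistic loss directly on the policy's log-probability ratios over the same preference pairs, which is why it looks like supervised learning.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


# RLHF, stage 1 (hypothetical helper): train a reward model on preference pairs.
# Bradley-Terry loss: -log sigmoid(r(x, y_chosen) - r(x, y_rejected)).
def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    return -log_sigmoid(r_chosen - r_rejected)


# RLHF, stage 2 (hypothetical helper): the RL step (e.g. PPO) maximizes the learned
# reward minus a KL-style penalty that keeps the policy near the reference model:
# r(x, y) - beta * (log pi(y|x) - log pi_ref(y|x)).
def kl_shaped_reward(reward: float, logp_policy: float, logp_ref: float, beta: float) -> float:
    return reward - beta * (logp_policy - logp_ref)


# DPO (hypothetical helper): one supervised-style loss on the same pairs.
# No reward model and no sampling. Only policy and frozen-reference log-probs are used.
def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float,
) -> float:
    # Implicit reward margin: how much more the policy has raised the chosen
    # response's log-ratio (vs. the reference) than the rejected one's.
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    return -log_sigmoid(margin)


if __name__ == "__main__":
    beta = 0.1  # illustrative value, not from the source
    # Policy identical to the reference: margin is 0, so the loss is ln 2.
    print(f"{dpo_loss(-5.0, -7.0, -5.0, -7.0, beta):.4f}")  # prints 0.6931
    # Policy shifted toward the chosen response: the loss drops below ln 2.
    print(f"{dpo_loss(-4.0, -8.0, -5.0, -7.0, beta):.4f}")
    # Reward model that already ranks the pair correctly: small loss.
    print(f"{reward_model_loss(2.0, -1.0):.4f}")
```

The design point is that DPO's loss needs only log-probabilities from the policy and a frozen reference model on a fixed dataset. It has no RL sampling loop and no separate reward network. Whether that is enough in practice is exactly the open question in the post.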
RLHF vs DPO: Training Methods for AI Models Compared