10/ SimPO – a simpler and more effective approach to preference optimization built on a reference-free reward. It uses the average log probability of a sequence as the implicit reward, so no reference model is required, which makes it more compute- and memory-efficient.
SimPO: Reference-Free Preference Optimization for Language Models
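To make the "average log probability as an implicit reward" idea concrete, here is a minimal sketch in plain Python. It assumes you already have per-token log probabilities of each response under the policy being trained. The scaling factor `beta` and the target margin `gamma` are not mentioned in the summary above; they follow the paper's formulation as I understand it, and the default values here are purely illustrative. The function names `simpo_reward` and `simpo_loss` are hypothetical.

```python
import math


def simpo_reward(token_logprobs, beta=2.0):
    """Implicit reward: beta times the mean per-token log probability.

    Dividing by the sequence length is what makes the reward
    length-normalized; no reference model is involved.
    """
    return beta * sum(token_logprobs) / len(token_logprobs)


def simpo_loss(chosen_logprobs, rejected_logprobs, beta=2.0, gamma=0.5):
    """Pairwise loss -log(sigmoid(r_chosen - r_rejected - gamma)).

    beta and gamma are illustrative defaults (assumptions), not tuned values.
    """
    margin = (
        simpo_reward(chosen_logprobs, beta)
        - simpo_reward(rejected_logprobs, beta)
        - gamma
    )
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch on the sign of m
    # so that exp() never receives a large positive argument.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Toy per-token log probabilities for a preferred and a dispreferred response.
chosen = [-0.1, -0.2, -0.3]
rejected = [-1.0, -2.0]
loss = simpo_loss(chosen, rejected)
```

The loss shrinks as the preferred response's average log probability pulls ahead of the dispreferred one by more than `gamma`. Because only the policy's own log probabilities appear, a training loop built on this would need a single forward pass per response, which is where the compute and memory savings come from.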