4/ Self-Rewarding Models – proposes a self-alignment method in which the model itself acts as the judge, via LLM-as-a-Judge prompting, to provide its own rewards during training; the resulting preference pairs are then used for instruction-following training with Iterative DPO.
Self-Rewarding Models: LLM Self-Alignment Training Method
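To make the preference-pair step concrete, here is a minimal sketch of how one pair might be built from self-judged candidates. It assumes the model samples several responses per prompt and scores each one itself; `build_preference_pair`, `generate`, `judge`, and the toy stand-ins are hypothetical names for illustration, not an API from the paper or any library. Picking the highest- and lowest-scored responses is one plausible selection rule, assumed here.

```python
from typing import Callable, Dict, List, Optional


def build_preference_pair(
    prompt: str,
    generate: Callable[[str], List[str]],
    judge: Callable[[str, str], float],
) -> Optional[Dict[str, str]]:
    """Return a {prompt, chosen, rejected} record, or None if no usable pair.

    `generate` and `judge` are hypothetical callables; in a self-rewarding
    setup both would be backed by the same model.
    """
    candidates = generate(prompt)
    if not candidates:
        return None

    # Score every candidate with the judge (assumed: higher is better).
    scored = [(judge(prompt, response), response) for response in candidates]
    best_score, best = max(scored, key=lambda item: item[0])
    worst_score, worst = min(scored, key=lambda item: item[0])

    # Skip prompts where the judge cannot separate the candidates.
    if best_score == worst_score:
        return None
    return {"prompt": prompt, "chosen": best, "rejected": worst}


# Toy stand-ins so the sketch runs without a model (assumptions only).
def toy_generate(prompt: str) -> List[str]:
    return [prompt + " ok", prompt + " a longer, more detailed answer"]


def toy_judge(prompt: str, response: str) -> float:
    return float(len(response))  # placeholder score, not a real reward


pair = build_preference_pair("Explain DPO.", toy_generate, toy_judge)
```

Records of this shape could then be passed to a DPO training step, with the updated model serving as both generator and judge in the next iteration.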
