I wonder if it just learned to prefer least-to-most output style implicitly from RLHF because it leads to better answers.
Did RLHF implicitly teach least-to-most output style?