Naive weak supervision isn't enough: current techniques, like RLHF, won't be sufficient for future superhuman models. But we also show that it's feasible to drastically improve weak-to-strong generalization, making iterative empirical progress on a core challenge of superalignment.
Weak-to-Strong Generalization: Beyond RLHF for Superalignment