"Scaling Self-Play with Self-Guidance" The main problem with self-play for theorem proving is that generator usually learns to reward hack, which produces messy hard problems that do not help the solver. This paper suggests by adding a Guide model that scores generated
Scaling Self-Play with Self-Guidance for Theorem Proving
By
–
Leave a Reply