DeepMind showcases iterative self-improvement for natural language generation. The approach, which it calls Reinforced Self-Training (ReST), essentially takes the "human" out of the Reinforcement Learning from Human Feedback (RLHF) cycle.
DeepMind’s Reinforced Self-Training removes human from RLHF
