Research from @OpenAI on improving math reasoning through RLHF, using a reward model trained on 800k human-generated chain-of-thought data (which @scale_AI partnered with @OpenAI on!). RLHF seems to be a scalable technique for making LLMs smarter in many ways.
OpenAI Improves Math Reasoning with RLHF and Chain-of-Thought