Thanks man, it's a good thread. I feel like people don't explain PPO very well in these threads; that's the main part that always feels handwavy. I'd also like an idea of the order of magnitude of fine-tuning data needed for InstructGPT, but I haven't seen good numbers.
PPO Explanation Gaps and InstructGPT Fine-tuning Data Requirements