Fine-tuning and On-Policy RL for Model Optimization

We first fine-tune the model to follow instructions, stay within guardrails, and keep language consistent. Then we run on-policy RL to improve search accuracy and tool efficiency while preserving those behaviors.
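The two stages above can be illustrated with a deliberately tiny sketch: a two-action policy (a toy stand-in for a language model) is first trained supervised toward a labeled "correct" action, then refined with on-policy REINFORCE against a reward. All names here (`TinyPolicy`, the reward function, learning rates) are hypothetical illustrations, not the actual training setup.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

class TinyPolicy:
    """Two-action policy with one logit per action (toy stand-in for an LLM)."""
    def __init__(self):
        self.logits = [0.0, 0.0]

    def probs(self):
        return softmax(self.logits)

    def sft_step(self, target, lr=0.5):
        # Stage 1, supervised fine-tuning: gradient ascent on log p(target).
        p = self.probs()
        for a in range(2):
            grad = (1.0 if a == target else 0.0) - p[a]
            self.logits[a] += lr * grad

    def reinforce_step(self, reward_fn, rng, lr=0.5, baseline=0.5):
        # Stage 2, on-policy RL (REINFORCE): sample from the *current* policy,
        # then reinforce actions whose reward beats a fixed baseline.
        p = self.probs()
        action = 0 if rng.random() < p[0] else 1
        advantage = reward_fn(action) - baseline
        for a in range(2):
            grad = (1.0 if a == action else 0.0) - p[a]
            self.logits[a] += lr * advantage * grad

rng = random.Random(0)
policy = TinyPolicy()

# Stage 1: SFT toward action 0 (think "follow the instruction format").
for _ in range(20):
    policy.sft_step(target=0)

# Stage 2: on-policy RL; the reward favors action 0 (think "efficient tool call").
for _ in range(200):
    policy.reinforce_step(lambda a: 1.0 if a == 0 else 0.0, rng)

print(policy.probs()[0])  # probability of the desired action after both stages
```

Because the RL rollouts are sampled from the policy being updated (on-policy), the second stage refines behavior the SFT stage already installed rather than drifting to off-distribution actions, which is the property the paragraph above relies on.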