Key takeaways:
• terminals > GUIs for stable agent control
• task design inspired by SWE-bench, but with a more general abstraction
• eval + RL need the same “rollout” substrate, so they created Harbor
• Harbor = a unified framework for scalable parallel deployment
• TB2 is
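
To make the "same rollout substrate" point concrete, here is a minimal Python sketch. It is not Harbor's actual API: every name in it (`Rollout`, `run_rollout`, `collect`, `toy_policy`) is hypothetical, and the sandboxed terminal session is replaced by a stub. It only illustrates the idea that one parallel rollout collector can feed both an eval score and an RL training batch.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Rollout:
    task_id: str
    commands: list[str]  # terminal commands the agent issued
    reward: float        # 1.0 if the task's check passed, else 0.0


async def run_rollout(task_id, policy):
    # Hypothetical stand-in for running an agent in a sandboxed terminal.
    commands = policy(task_id)
    await asyncio.sleep(0)  # yield control, as a real environment call would
    # Assumed toy success check: the episode ends with a clean exit.
    reward = 1.0 if commands and commands[-1] == "exit 0" else 0.0
    return Rollout(task_id, commands, reward)


async def collect(task_ids, policy, max_parallel=4):
    # Run rollouts concurrently, capped at max_parallel at a time.
    sem = asyncio.Semaphore(max_parallel)

    async def bounded(task_id):
        async with sem:
            return await run_rollout(task_id, policy)

    # gather returns results in the same order as task_ids.
    return await asyncio.gather(*(bounded(t) for t in task_ids))


def evaluate(rollouts):
    # Eval consumer: mean success rate over the rollouts.
    return sum(r.reward for r in rollouts) / len(rollouts)


def to_training_batch(rollouts):
    # RL consumer: (trajectory, reward) pairs from the same rollouts.
    return [(r.commands, r.reward) for r in rollouts]


def toy_policy(task_id):
    # Toy policy: "solves" only tasks whose id ends in "ok".
    return ["ls", "exit 0"] if task_id.endswith("ok") else ["ls"]


rollouts = asyncio.run(collect(["a-ok", "b-fail", "c-ok", "d-ok"], toy_policy))
print(evaluate(rollouts))
print(to_training_batch(rollouts)[0])
```

Both consumers read the same `Rollout` records, so the environment and the parallel launcher only need to be built once. In this sketch that shared layer is the whole design point.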
Terminal-Based Agent Control and Harbor Evaluation Framework