Our research team just dropped a few behind-the-scenes blog posts on scaling agentic SWE-bench evaluation, including the failure modes we hit and what finally worked. I'm curious to hear your thoughts on our work.
AI21 Labs Shares Research on Scaling Agentic SWE-bench Evaluation