Datacurve's audit found three structural problems with SWE-Bench Pro. First, contamination. The tasks come from public GitHub commits, so the problem, the discussion, and often the exact solution already exist in every frontier model's training data. There is no way to tell whether a model is