4/ While LMSYS and other efforts in the community are awesome, we still think there's a lot left to be desired in third-party evaluations. One of our design principles is to produce evals that are impossible to overfit. As we saw with our prior GSM1k research, we think it's critical