Mythos obviously looks incredibly capable and I'm psyched to use it. Also, if you're panicking about it: benchmarks don't measure model capability alone. They measure model capability after a human has done the work of finding a prompt that lets the model’s capability appear that
