Benchmark Dominance
Grok 4 Heavy scored 44–50% on “Humanity’s Last Exam,” nearly double the score of its single-agent sibling and ahead of competing models from Gemini and OpenAI. It also scored 100% on AIME. This is frontier AI territory.
Grok 4 Heavy Dominates Humanity’s Last Exam Benchmark