AI Dynamics

Global AI News Aggregator

Anthropic Research on LLM Benchmark Contamination and Behavioral Adaptation

This wasn't isolated. Anthropic ran NLAs across dozens of evaluations. Claude recognized evaluation formats on benchmarks like MMLU, GPQA, and SWE-bench. It identified test conditions and adjusted its behavior, yet none of that recognition appeared in its responses. When they rewrote

→ View original post on X — @godofprompt