Benchmark scores reported by researchers are not the same thing as the answers you get from the API. Researchers state when a score was obtained with techniques such as n-shot prompting or chain-of-thought (CoT). If they hid additional prompting techniques used to obtain those scores, that would be considered fraudulent.
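To make the difference concrete, here is a minimal sketch of how a plain prompt differs from an n-shot CoT prompt. The function names and prompt wording are hypothetical, chosen only for illustration; they are not taken from any particular benchmark harness or API.

```python
from typing import List, Tuple

# Hypothetical helpers for illustration; no real harness or API is assumed.


def zero_shot_prompt(question: str) -> str:
    """Roughly what a plain API call sends: only the question."""
    return f"Q: {question}\nA:"


def n_shot_cot_prompt(question: str, examples: List[Tuple[str, str, str]]) -> str:
    """Prepend n worked examples (question, reasoning, answer) to the question."""
    parts = []
    for ex_question, ex_reasoning, ex_answer in examples:
        parts.append(f"Q: {ex_question}\nA: {ex_reasoning} The answer is {ex_answer}.")
    # Cue step-by-step reasoning on the actual question (assumed CoT wording).
    parts.append(f"Q: {question}\nA: Let's think step by step.")
    return "\n\n".join(parts)


if __name__ == "__main__":
    shots = [("What is 2 + 3?", "2 plus 3 equals 5.", "5")]
    print(zero_shot_prompt("What is 7 + 8?"))
    print("---")
    print(n_shot_cot_prompt("What is 7 + 8?", shots))
```

The model receives very different input in the two cases, so a score obtained with the second kind of prompt says little about what the first will return.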