The same can be done with any model, which is why researchers specify the evaluation setup: n-shot, chain-of-thought (CoT), etc. The model behind an API may well be very useful, and could in theory even be better than any other model. But this comparison is wrong nonetheless.
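To make the point about n-shot and CoT concrete, here is a minimal sketch of how the prompting setup changes what a model actually sees. The names `EvalSetup`, `build_prompt`, and `comparable` are hypothetical and made up for illustration; they are not taken from any real benchmark harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalSetup:
    """Hypothetical record of the prompting conditions for one benchmark run."""
    n_shot: int              # number of worked examples placed before the question
    chain_of_thought: bool   # whether the examples include written-out reasoning


def build_prompt(question, examples, setup):
    """Build a prompt from (question, reasoning, answer) example triples."""
    parts = []
    for q, reasoning, answer in examples[:setup.n_shot]:
        if setup.chain_of_thought:
            parts.append(f"Q: {q}\nA: {reasoning} The answer is {answer}.")
        else:
            parts.append(f"Q: {q}\nA: {answer}")
    # The question under test always comes last, with the answer left open.
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


def comparable(setup_a, setup_b):
    """Treat two reported scores as comparable only if the setups match."""
    return setup_a == setup_b
```

Under this sketch, a score obtained with `EvalSetup(5, True)` and one obtained with `EvalSetup(0, False)` come from different prompts, so `comparable` returns `False` for them. With a model behind an API the setup may not be visible at all, in which case it cannot be matched.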
Model Comparison Methodologies and API Evaluation Fairness