It really depends on what you mean by production settings, though. LLMs are used for many things other than being a ChatGPT. Overall I agree with your thread, and benchmark results should be interpreted with caution. (But they can still be useful sometimes.)