friendly reminder to everyone that there isn't yet a good & proper systematic blind eval/benchmark of LLMs, especially one on real-world data/use cases. if i were in academia this is something i'd work on immediately.
Systematic Blind Evaluation Benchmarks for LLMs Urgently Needed