Any reasonably competent LLM should be sufficient as the evaluator, since LLMs are a lot more robust at comparing two texts than at evaluating an answer in isolation. I'd say one could even just use cosine distance on sentence embeddings if we want to avoid having an LLM evaluate an LLM.
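For the embedding route, here is a rough sketch of what scoring a candidate answer against a reference could look like. Note the assumptions: `toy_embed` is a hypothetical bag-of-words stand-in so the snippet runs with the standard library only, and `score_answer` is just an illustrative name. In practice you would swap `toy_embed` for a real sentence-embedding model; only the cosine distance part carries over as-is.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for a sentence-embedding model.

    Returns sparse word counts; a real setup would return a dense vector
    from an embedding model instead.
    """
    return Counter(text.lower().split())


def cosine_distance(a: Counter, b: Counter) -> float:
    """Cosine distance (1 - cosine similarity) between two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        # An empty text has no direction; treat it as maximally distant.
        return 1.0
    # Clamp to guard against tiny negative values from float rounding.
    return max(0.0, 1.0 - dot / (norm_a * norm_b))


def score_answer(candidate: str, reference: str) -> float:
    """Lower is better: 0.0 means same direction, 1.0 means no overlap."""
    return cosine_distance(toy_embed(candidate), toy_embed(reference))
```

You would then pick a distance threshold (an assumption you have to tune on your own data) below which the candidate counts as matching the reference.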
LLM Evaluation Methods: Comparing Text Robustness Techniques