22/ Say you've trained a perfect LLaMA 65B reproduction and evaluated it with the EAI harness (score 0.488). Compared to the published number (evaluated with the original implementation, score 0.637), the published score is about 30% higher, so you're likely thinking "Oh no". But these numbers are…
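To make the "30%" figure concrete, here is a minimal sketch of the arithmetic using only the two scores quoted above. The function and variable names are hypothetical, chosen for illustration; nothing here comes from either evaluation codebase.

```python
def relative_gap(reference: float, other: float) -> float:
    """Return the difference (other - reference) as a fraction of reference."""
    return (other - reference) / reference


# Scores quoted in the text above.
harness_score = 0.488    # reproduction, evaluated with the EAI harness
published_score = 0.637  # published number, original implementation

# Absolute gap in score points.
absolute_gap = published_score - harness_score

# The size of the relative gap depends on which score is the baseline.
gap_vs_harness = relative_gap(harness_score, published_score)
gap_vs_published = relative_gap(published_score, harness_score)

print(f"absolute gap:              {absolute_gap:.3f}")
print(f"published vs. harness:     {gap_vs_harness:+.1%}")
print(f"harness vs. published:     {gap_vs_published:+.1%}")
```

The "30%" reading takes the harness score as the baseline. Measured against the published score instead, the same absolute gap is a drop of roughly 23%, so it helps to state the baseline whenever a relative difference is reported.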
LLaMA 65B Evaluation Discrepancies Between Implementations