Why is this interesting? It lets us train on question-answer pairs without needing the full chain-of-thought input or even a perfect ground-truth output for SFT. Graders can assign nuanced scores (0–1), improving model outputs incrementally, even when answers are partially
Question-Answer Pairs Training Without Perfect Ground Truth