"TIP: Token Importance in On-Policy Distillation" This paper introduces selective token training for on-policy distillation, relying on student entropy to find high-signal tokens. A key point is that entropy misses confident mistakes, so they add teacher-student divergence to
Token Importance in On-Policy Distillation with Selective Training
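To make the summary concrete, here is a minimal sketch of why entropy alone can miss a confident mistake and how a divergence term can catch it. Nearly everything in it is an assumption made for illustration, not taken from the paper: the function names, the thresholds `h_min` and `d_min`, the choice of KL(student || teacher) as the divergence, and the simple OR rule for combining the two signals. The paper's actual selection rule may differ.

```python
import math

def entropy(p):
    """Shannon entropy (nats) of one next-token distribution."""
    return -sum(x * math.log(x) for x in p if x > 0.0)

def kl(p, q, eps=1e-12):
    """KL(p || q) in nats; eps guards against log(0) in q."""
    return sum(x * math.log(x / max(y, eps)) for x, y in zip(p, q) if x > 0.0)

def select_tokens(student, teacher, h_min=0.5, d_min=0.5):
    """Return the positions kept for training.

    Hypothetical rule (an assumption, not the paper's): keep a position
    if the student is uncertain (entropy >= h_min) OR the student
    disagrees with the teacher (KL(student || teacher) >= d_min).
    """
    keep = []
    for t, (ps, pt) in enumerate(zip(student, teacher)):
        if entropy(ps) >= h_min or kl(ps, pt) >= d_min:
            keep.append(t)
    return keep

# Toy 3-token vocabulary, three positions in a student-generated sequence.
student = [
    [0.34, 0.33, 0.33],  # position 0: student is uncertain
    [0.98, 0.01, 0.01],  # position 1: confident, teacher agrees
    [0.98, 0.01, 0.01],  # position 2: confident, teacher disagrees
]
teacher = [
    [0.40, 0.30, 0.30],
    [0.97, 0.02, 0.01],
    [0.01, 0.98, 0.01],
]

# Entropy-only selection misses position 2, the confident mistake.
entropy_only = [t for t, ps in enumerate(student) if entropy(ps) >= 0.5]
combined = select_tokens(student, teacher)

print(entropy_only)  # [0]
print(combined)      # [0, 2]
```

Position 2 has the same low entropy as position 1, so an entropy filter treats them identically. Only the comparison with the teacher separates the confident mistake from the confident correct prediction.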