Now given that next-word prediction is multi-task learning, we can write the overall loss is the weighted sum of loss of individual tasks. When overall loss improves smoothly, do all individual tasks improve smoothly, or do some improve at different rates than others?
Multi-task Learning in Next-word Prediction: Task-specific Loss Dynamics
By
–
Leave a Reply