Here are the juicy lessons from the 1T model Tele-FLM-1T:
– Trained progressively in 3 stages: 52B -> 102B -> 1T parameters on ~2T tokens
– Depth growth technique: selects layers to duplicate based on their input-output distance metrics, prioritizing middle layers with smaller
Tele-FLM-1T: Progressive Training and Depth Growth Techniques
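To make the depth growth idea from the summary concrete, here is a rough sketch of distance-based layer selection. It is not the Tele-FLM-1T implementation. The summary is cut off before it says what should be "smaller", so this sketch assumes it means a smaller input-output distance, and it assumes cosine distance as the metric. The function names (`cosine_distance`, `pick_layers_to_duplicate`, `grow_depth`) and the toy vectors are hypothetical, and the sketch does not model any explicit preference for middle layers beyond what the distances themselves produce.

```python
import copy
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def pick_layers_to_duplicate(layer_inputs, layer_outputs, num_new_layers):
    """Pick the layers whose output stays closest to their input.

    ASSUMPTION: "smaller distance" is the selection criterion; the source
    text is truncated and does not confirm this.
    """
    distances = [
        cosine_distance(x, y) for x, y in zip(layer_inputs, layer_outputs)
    ]
    # Rank layer indices from smallest to largest input-output distance.
    ranked = sorted(range(len(distances)), key=lambda i: distances[i])
    return sorted(ranked[:num_new_layers])


def grow_depth(layers, chosen):
    """Return a deeper stack where each chosen layer is followed by a copy."""
    chosen = set(chosen)
    grown = []
    for i, layer in enumerate(layers):
        grown.append(layer)
        if i in chosen:
            grown.append(copy.deepcopy(layer))
    return grown


# Toy data: one representative hidden vector per layer (made up for illustration).
layer_inputs = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
layer_outputs = [[0.0, 1.0], [1.0, 0.1], [1.0, 0.2], [-1.0, 0.0]]

chosen = pick_layers_to_duplicate(layer_inputs, layer_outputs, num_new_layers=2)
grown = grow_depth(["L0", "L1", "L2", "L3"], chosen)
print(chosen)  # [1, 2]
print(grown)   # ['L0', 'L1', 'L1', 'L2', 'L2', 'L3']
```

In this toy setup the two middle layers change their input the least, so they are the ones duplicated. A real model would measure distances over many tokens and would also need to decide how the copied layers are initialized and trained, which this sketch leaves out.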
