7/ DoReMi – trains a small proxy model over domains to produce domain weights (mixture proportions) without knowledge of downstream tasks, then resamples the dataset with those weights to train a larger model. Weights from a 280M proxy model can be used to train an 8B model (30x larger) more efficiently.
DoReMi: Domain-Weighted Resampling for Efficient Model Training
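
The resampling step described above can be sketched roughly as follows. This is a minimal illustration that assumes the domain weights are already known; it does not cover how the proxy model learns them. The function name `resample_by_domain`, the toy corpus, and the weight values are all hypothetical.

```python
import random
from collections import Counter


def resample_by_domain(examples_by_domain, domain_weights, n_samples, seed=0):
    """Draw n_samples (domain, example) pairs, picking domains by weight."""
    rng = random.Random(seed)
    domains = list(examples_by_domain)
    weights = [domain_weights[d] for d in domains]
    # Pick a domain for every draw in proportion to its weight.
    picked = rng.choices(domains, weights=weights, k=n_samples)
    # Within the chosen domain, pick an example uniformly at random.
    return [(d, rng.choice(examples_by_domain[d])) for d in picked]


# Hypothetical toy corpus and weights, for illustration only.
corpus = {
    "web": ["w1", "w2", "w3"],
    "code": ["c1", "c2"],
    "books": ["b1"],
}
weights = {"web": 0.6, "code": 0.4, "books": 0.0}

sample = resample_by_domain(corpus, weights, n_samples=1000)
counts = Counter(domain for domain, _ in sample)
print(counts)
```

Sampling the domain first and the example second keeps the mixture tied to the weights rather than to how many examples each domain happens to contain.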
