Or https://
arxiv.org/abs/1503.02531, which describes how distbelief was used to train the baseline model (a quite large model for the circa 2014/2015 time frame), and later describes using the framework to train specialists and then distill them into a single model.
DistBelief: Distributed Training and Knowledge Distillation Framework
By
–
