The 300k+ hours of audio clips were used to train two models: a "generator model" that turns text into an intermediate representation, and a "cascader model" that uses this intermediate representation to produce high-quality audio.
Training generator and cascader models on 300k+ hours of audio