This is certainly a challenge. However, in some cases you can find an academic or open-source dataset for the language you're interested in fine-tuning on. Another option is to use a Wav2Vec2.0 checkpoint to transcribe YouTube videos and use the resulting transcripts as training data.
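As a rough sketch of the transcription idea, the snippet below pseudo-labels a folder of audio clips and writes the results to a CSV manifest. Everything specific in it is an assumption for illustration: the `pseudo_label` helper, the `clips` folder, the `pseudo_labels.csv` output name, and the `facebook/wav2vec2-base-960h` checkpoint (which is English-only, so you would swap in a checkpoint that covers your language). It also assumes the audio has already been downloaded and cut into files, and that `transformers`, a backend such as PyTorch, and ffmpeg are installed.

```python
import csv
from pathlib import Path
from typing import Callable, Iterable


def pseudo_label(
    audio_paths: Iterable[str],
    transcribe: Callable[[str], str],
    manifest_path: str,
    min_chars: int = 1,
) -> int:
    """Transcribe each file and write (path, text) rows to a CSV manifest.

    `transcribe` is any callable mapping an audio path to a transcript,
    so the model can be swapped out. Returns the number of rows kept.
    """
    kept = 0
    with open(manifest_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["path", "text"])
        for path in audio_paths:
            text = transcribe(path).strip()
            # Skip clips where the model produced (almost) nothing,
            # e.g. music or silence.
            if len(text) < min_chars:
                continue
            writer.writerow([path, text])
            kept += 1
    return kept


if __name__ == "__main__":
    # Imported here so the helper above can be used without transformers.
    from transformers import pipeline

    # Assumed checkpoint: English-only; replace with one for your language.
    asr = pipeline(
        "automatic-speech-recognition",
        model="facebook/wav2vec2-base-960h",
        chunk_length_s=30,  # split long recordings into chunks
    )
    # Hypothetical folder of already-downloaded audio clips.
    files = sorted(str(p) for p in Path("clips").glob("*.wav"))
    n = pseudo_label(files, lambda p: asr(p)["text"], "pseudo_labels.csv")
    print(f"wrote {n} pseudo-labelled clips")
```

Transcripts produced this way will contain errors, so it is probably worth spot-checking a sample by hand before fine-tuning on them.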
Fine-tuning Speech Recognition Models with Open Datasets