AI Dynamics

Global AI News Aggregator

Microsoft Open Sources VibeVoice-ASR for 60-Minute Speech Recognition

Microsoft just tackled a major speech recognition problem! They open sourced VibeVoice-ASR, a speech-to-text model that processes 60 minutes of audio in a single pass.

Here's the problem with most ASR models. They slice audio into short chunks, usually 30 seconds or less, process each chunk separately, and lose speaker context between segments. You get disconnected transcripts that can't track who said what across a full meeting.

VibeVoice-ASR handles 60 minutes of continuous audio without chunking. The model maintains global context across the entire hour.

The output is structured: who spoke, when they spoke, and what they said. Speaker diarization, timestamps, and transcription all come out of one pass.

Key features:
• 60-minute single-pass processing without chunking audio
• Structured output: speaker labels, timestamps, and content combined
• Customized hotwords: provide specific names or technical terms to improve accuracy
• Multilingual support: 50+ languages
• Joint ASR, diarization, and timestamping in one model

The model has 7B parameters. It is available on Hugging Face with finetuning code included. I've shared the repo link in the comments!
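To make the chunking problem and the structured output concrete, here is a minimal Python sketch. It does not use VibeVoice-ASR or any real ASR library. The names `chunk_boundaries`, `Segment`, and `format_segment` are hypothetical, and the `Segment` fields are only an assumed shape for "who spoke, when they spoke, what they said", not the model's actual output schema.

```python
from dataclasses import dataclass

CHUNK_SECONDS = 30          # chunk length mentioned in the post ("30 seconds or less")
MEETING_SECONDS = 60 * 60   # one hour of audio


def chunk_boundaries(total_seconds: int, chunk_seconds: int) -> list[tuple[int, int]]:
    """Return (start, end) second offsets for fixed-size chunks."""
    return [
        (start, min(start + chunk_seconds, total_seconds))
        for start in range(0, total_seconds, chunk_seconds)
    ]


@dataclass
class Segment:
    """One entry of a structured transcript (hypothetical schema)."""
    speaker: str   # who spoke
    start: float   # when they started, in seconds
    end: float     # when they stopped, in seconds
    text: str      # what they said


def format_segment(seg: Segment) -> str:
    """Render a segment as '[MM:SS-MM:SS] speaker: text'."""
    def mmss(t: float) -> str:
        minutes, seconds = divmod(int(t), 60)
        return f"{minutes:02d}:{seconds:02d}"

    return f"[{mmss(seg.start)}-{mmss(seg.end)}] {seg.speaker}: {seg.text}"


# A chunked pipeline sees one hour as 120 independent pieces, so a speaker
# label assigned in one chunk has no built-in link to labels in the next.
chunks = chunk_boundaries(MEETING_SECONDS, CHUNK_SECONDS)
print(len(chunks))

# Illustrative data only, not real model output.
print(format_segment(Segment("Speaker 1", 65.0, 72.5, "Hello")))
```

A single-pass model sidesteps the first half of this sketch entirely: there is one span covering the whole recording, so speaker identities can stay consistent from the first segment to the last.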

→ View original post on X — @sumanth_077, 2026-04-06 14:12 UTC
