AI Dynamics

Global AI News Aggregator

VoxCPM: Real-time Voice Cloning Without Tokenization

Clone a human voice in real time without tokenization! VoxCPM is an open-source text-to-speech system that models speech in continuous space instead of discrete tokens. Most TTS systems convert speech to discrete tokens before generation. This quantization creates a fundamental trade-off: tokens provide stability but lose acoustic details like breath, vocal texture, and subtle articulation. VoxCPM skips tokenization entirely. It models speech directly in continuous space using an end-to-end diffusion autoregressive architecture built on MiniCPM-4. The system uses hierarchical language modeling with two specialized components: a Text-Semantic Language Model that captures high-level prosody and structure, and a Residual Acoustic Model that recovers fine-grained acoustic details. This separation eliminates dependency on external speech tokenizers and prevents error accumulation from multi-stage pipelines. Two flagship capabilities: 1. Context-aware speech generation: The model comprehends text to infer appropriate prosody and speaking style. Explanations slow down naturally, emphasis appears in the right places, questions sound like questions. 2. Zero-shot voice cloning: With just 3-10 seconds of reference audio, it replicates speaker timbre, accent, emotional tone, rhythm, and pacing. Key features: • Tokenizer-free architecture with continuous speech modeling
• Context-aware prosody generation without manual tuning
• Zero-shot voice cloning from short reference audio
• Streaming synthesis support for real-time applications
• SFT and LoRA fine-tuning support It's 100% open source Link to the GitHub repo in the comments! [Translated from EN to English]

→ View original post on X — @sumanth_077, 2026-04-03 13:14 UTC

Commentaires

Leave a Reply

Your email address will not be published. Required fields are marked *