AI Dubbing: 2025 Market Update
- Jourik Ciesielski
- May 3
AI Startups Funding
Over the last couple of weeks, I've worked closely with the Dubformer team on a deep dive into the AI dubbing market. I've always been interested in AI dubbing, but that interest has now reached an all-time high.
While the dubbing industry and the language services industry haven't yet agreed on the definition of "speech-to-speech" or the feasibility of "real-time," one thing is clear: multimodal AI sounds like music to investors' ears. The companies in the table ⬇️ all have one thing in common: AI voice products. ElevenLabs leads the way with its massive $180m round at a staggering $3bn valuation. The presence of an MT company is noteworthy.

Cascaded Workflow
Nonetheless, multimedia localization still faces several challenges:
- Translated TTS-generated speech is often longer than the original audio track.
- Traditional AI dubbing is decoupled from the source video and doesn't optimize for audiovisual alignment or lip sync.
- Modifying visual elements (e.g., stretching or squeezing frames) for sync purposes affects quality, raises ethical concerns, and compromises the integrity of the original video.
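To make the first challenge concrete, here is a minimal sketch of the arithmetic behind fitting a dubbed segment into the original time slot. Everything in it is an assumption for illustration: the `Segment` type, the `fit_strategy` helper, and the 15% speed-up ceiling are hypothetical and not taken from any vendor's product.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    source_seconds: float  # duration of the original speech segment
    dubbed_seconds: float  # duration of the synthesized translation


def required_speedup(seg: Segment) -> float:
    """Playback-rate factor needed for the dub to fit the source slot."""
    return seg.dubbed_seconds / seg.source_seconds


def fit_strategy(seg: Segment, max_speedup: float = 1.15) -> str:
    """Pick a coarse strategy; max_speedup is an assumed tolerance."""
    rate = required_speedup(seg)
    if rate <= 1.0:
        return "fits"  # dub is no longer than the original
    if rate <= max_speedup:
        return "speed_up"  # a mild tempo change is enough
    return "shorten_translation"  # too long: rephrase rather than rush
```

For example, a 4.0-second source line with a 6.0-second dub needs a 1.5x speed-up, which this sketch rejects in favor of a shorter translation rather than touching the video.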
Ongoing research in the field is fascinating, from Amazon and Sony experimenting with duration and synchronization loss in model training to Meta's "massively multilingual and multimodal" SeamlessM4T model. Meanwhile, cascaded workflows (intelligent transcription → speaker designation and labeling → machine translation → voice casting → speech synthesis → mixing of voices, music and effects) are still a solid paradigm.
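The cascaded workflow above can be sketched as a simple chain of stages, each handing its output to the next. This is only a structural illustration under stated assumptions: `DubbingJob`, `make_stage`, and `run_pipeline` are hypothetical names, and each stage is a stub that records its name instead of calling a real transcription, MT, or TTS service.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class DubbingJob:
    video_path: str
    target_lang: str
    log: List[str] = field(default_factory=list)  # stages completed so far


Stage = Callable[[DubbingJob], DubbingJob]


def make_stage(name: str) -> Stage:
    """Build a placeholder stage; a real one would call an ASR/MT/TTS system."""
    def stage(job: DubbingJob) -> DubbingJob:
        job.log.append(name)
        return job
    return stage


# Stage order follows the workflow described in the text.
STAGE_NAMES = [
    "transcription",
    "speaker_labeling",
    "machine_translation",
    "voice_casting",
    "speech_synthesis",
    "mixing",
]
PIPELINE: List[Stage] = [make_stage(n) for n in STAGE_NAMES]


def run_pipeline(job: DubbingJob, stages: List[Stage] = PIPELINE) -> DubbingJob:
    """Run every stage in order, feeding each stage's output to the next."""
    for stage in stages:
        job = stage(job)
    return job
```

Keeping each step behind the same interface is what makes the cascade easy to reason about: any single stage can be swapped or corrected by a human without touching the others.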
💡 What's particularly interesting about cascaded workflows:
- Translation is the only workflow step where LLMs are deployed. While LLMs add value, they aren't the most critical component in the AI dubbing process, and translation quality isn't even considered to be the most challenging aspect.
- Text-to-speech (TTS) is now more commonly referred to as neural speech synthesis, focusing on pitch, speed, energy, intonation, and emotion. Controllability is the new differentiating factor.
- Certain disciplines such as voice cloning are relatively straightforward, perhaps even a bit overhyped?
The TMS Struggle
At the same time, while integrating OpenAI Whisper or Amazon Polly isn't necessarily more complex than integrating DeepL, the number of translation management systems (TMS) that let you start a project from a video is still shockingly low. The challenge for TMS is that AI dubbing involves far more than simple text input and output. That leaves room for specialized systems, equipped to accommodate both technical corrections and creative modifications, to take more and more of the video localization market.
Conclusion
No, none of this means the end of the human interpreter or voice actor, but one thing is clear: AI dubbing has evolved from a services market into a product market. It's ready to attack typical human services territory such as TV broadcasts, and it has the resources to do so.
👏 Kudos to the Dubformer team, and congrats on the $3.6m seed round!


