Mistral releases a new open-source model for speech generation

French AI firm Mistral launched a new open-source text-to-speech model on Thursday that can be used by voice AI assistants or in enterprise use cases like customer support. The model, which lets enterprises build voice agents for sales and customer engagement, puts Mistral in direct competition with the likes of ElevenLabs, Deepgram, and OpenAI.

The new model, called Voxtral TTS, supports nine languages: English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic.

“Our customers have been asking for a speech model. So we built a small-sized speech model that can fit on a smartwatch, a smartphone, a laptop, or other edge devices. The cost of it is a fraction of anything else on the market, but it offers state-of-the-art performance,” Pierre Stock, VP of science operations at Mistral AI, told TechCrunch during a phone interview.

Mistral said the new model can adapt a custom voice from a sample of less than five seconds, and can also capture characteristics like subtle accents, inflections, intonations, and irregularities in the flow of speech. The model, based on Ministral 3B, can switch between languages easily without losing the characteristics of the voice, which is useful for use cases like dubbing or real-time translation. Stock said the company wanted the model to sound human and not robotic.

The model has been built for real-time performance, according to the company. It has a time-to-first-audio (TTFA), a measure of when the model starts ‘speaking’ after receiving input, of 90ms for a 10-second sample of 500 characters. The model also has a real-time factor (RTF) of 6x, which means it can render a 10-second clip in roughly 1.6 seconds.
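To make the RTF figure concrete, here is a minimal Python sketch of the arithmetic. It assumes the speed factor is a multiple of real time, so 6x means six seconds of audio per second of compute, which is how the figure is used above. The function name `synthesis_seconds` is purely illustrative and is not part of any Mistral API.

```python
def synthesis_seconds(audio_seconds: float, speed_factor: float) -> float:
    """Estimate the wall-clock time needed to generate a clip.

    Assumption: speed_factor is a multiple of real time
    (6.0 means six seconds of audio per second of compute).
    """
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    if speed_factor <= 0:
        raise ValueError("speed_factor must be positive")
    return audio_seconds / speed_factor


if __name__ == "__main__":
    # Figures quoted in the article: a 10-second clip at 6x real time.
    print(f"{synthesis_seconds(10.0, 6.0):.2f} s")
```

Under that assumption, a 10-second clip at 6x works out to about 1.67 seconds, in line with the roughly 1.6 seconds the company cites.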

Earlier this year, Mistral released a pair of transcription models, one for large batch processing and the other for real-time use cases with low latency. With the new speech model, the company is likely aiming to offer a full suite of voice products to enterprises.

“We plan to have an end-to-end platform that can handle multimodal streams of input, including audio, text, and image and output as well. The main benefit of that is you get way more information with an end-to-end agentic system that supports audio as an input or output,” Stock said.

Mistral’s positioning is that its open-source approach and room for customization will help enterprises adopt its voice models over competitors’ offerings, as they can tune the models the way they want.
