ElevenLabs, a leader in voice AI technology, has announced the launch of the alpha version of its latest text-to-speech model, designated Eleven v3. The company describes this new iteration as its most expressive model to date, signaling a significant advancement in generating humanlike speech from text.
Built upon an entirely new architecture, Eleven v3 is designed to move beyond simple text recitation, focusing instead on delivering speech with enhanced vocal realism and nuanced emotional expressiveness. The stated goal is for the AI to perform the text, rather than merely reading it aloud, allowing for more dynamic and engaging audio content.
Key Innovations Driving Expressiveness
Among the standout features of Eleven v3 is its extensive language support, now covering over 70 languages. This broad linguistic capability aims to empower creators globally, enabling the generation of expressive speech across a vast array of content types and audiences.
A pivotal addition for guiding the AI’s performance is the introduction of emotional cues via audio tags. Users can now embed specific tags within the text, such as `[whispers]`, `[angry]`, and `[laughs]`, to prompt the AI to deliver the subsequent speech with the corresponding emotion or nonverbal sound. This level of granular control over vocal performance is a key differentiator.
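Because audio tags are embedded inline in the text itself, a small helper can compose a tagged script before it is sent to the API. The sketch below is illustrative only: the tag names `[whispers]` and `[laughs]` come from the announcement, while the helper function is a hypothetical convenience, not part of any ElevenLabs SDK.

```python
# Sketch: composing Eleven v3-style audio tags inline with script text.
# Only [whispers], [angry], and [laughs] are named in the announcement;
# other tag names would need to be checked against the official docs.

def tagged(tag: str, text: str) -> str:
    """Prefix a line of script with an inline audio tag."""
    return f"[{tag}] {text}"

script = " ".join([
    tagged("whispers", "Did you hear that?"),
    tagged("laughs", "Of course not."),
])

print(script)
# -> [whispers] Did you hear that? [laughs] Of course not.
```

The resulting string would then be supplied as the text of an ordinary text-to-speech request; since v3 is in alpha, the exact tag vocabulary and behavior may change.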
Furthermore, the model introduces a sophisticated dialogue mode. This feature is engineered to handle natural multi-speaker conversations, including the complexities of managing interruptions and smoothly shifting tones between different virtual speakers, creating a more fluid and believable interactive experience.
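One way to picture dialogue mode is as a script of turns, each pairing a voice with its line and an optional audio tag. The data layout below is an assumption for illustration, not the documented ElevenLabs request format; the voice names and the `render` helper are hypothetical.

```python
# Sketch of preparing a multi-speaker script for a dialogue-style request.
# The speaker/voice pairing and per-turn tags are illustrative only.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Turn:
    voice: str                 # hypothetical voice identifier
    text: str
    tag: Optional[str] = None  # optional audio tag, e.g. "whispers"

def render(turns: list) -> list:
    """Flatten turns into per-speaker entries, applying inline tags."""
    out = []
    for t in turns:
        line = f"[{t.tag}] {t.text}" if t.tag else t.text
        out.append({"voice": t.voice, "text": line})
    return out

dialogue = [
    Turn("narrator", "The door creaked open."),
    Turn("alice", "Who's there?", tag="whispers"),
    Turn("bob", "Just me!", tag="laughs"),
]

for entry in render(dialogue):
    print(entry["voice"], "->", entry["text"])
```

Keeping turns as structured data rather than one concatenated string makes it easier to reassign voices or adjust tags per line before synthesis.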
CEO’s Vision for Control and Performance
Mati Staniszewski, Co-Founder and CEO of ElevenLabs, highlighted the transformative control offered by the new model. According to Staniszewski, v3 provides users with “full control over emotions, delivery, and nonverbal cues.” He elaborated that this empowers users to guide the AI to “whisper, laugh, change accents, or even sing,” demonstrating the ambitious scope of v3’s capabilities and its potential to unlock new creative avenues in audio production.
Alpha Phase and Future Potential
Currently released as an alpha version, Eleven v3 requires more prompt engineering than ElevenLabs' previous models. However, the company states that this alpha phase will be used to refine the model and prepare it for broader application.
The aim of v3 is to deliver a substantial leap in realism and humanlike delivery. This advanced capability is expected to be particularly impactful for applications requiring deep character work and dynamic interactions, such as in the creation of immersive media, video game dialogue, audiobooks, and sophisticated multilingual storytelling. By enabling the generation of highly expressive and controllable voices, Eleven v3 seeks to set a new standard for AI-driven speech synthesis.