Welcome to this week's podcast analysis, where we explore the transformative power of voice technology through ElevenLabs and its implications for various sectors.
Identify the "One Big Thing":
The "One Big Thing" is that voice is rapidly becoming the primary interface for human-computer interaction, moving beyond static content creation to dynamic, personalized, and deeply integrated agentic experiences across diverse sectors like customer service, immersive media, and especially education. ElevenLabs is building the foundational models and product layers to enable this shift, differentiating itself through architectural breakthroughs and a product-led ecosystem approach rather than just scale.
Extract Themes:
- Theme 1: Voice as the Future Interface & the Challenge of Foundational Models
- Quote 1: "We build foundational audio models. So models in a space to help you create speech that sounds human, understand speech in a much better way or orchestrate all those components to make it interactive and then build products on top of that foundational models."
- Quote 2: "I think the main part that I think is different in audio space is that you don't need the scale as much as you need the architectural breakthroughs, the model breakthroughs to really to really make a a dent."
- Theme 2: The Agentic Revolution: Beyond Reactive Customer Support
- Quote 1: "The second exciting piece within that domain which is happening is the shift from effectively a reactive customer support... into more of like a proactive part of the experience customer support."
- Quote 2: "I think the one that's that I'm most excited about for the world and for the shift is going to be education where you will just be able to have like effectively a personal tutor on your headphone and you like actually study something in a in a in an amazing way."
- Theme 3: Product-Led Ecosystem as the Moat in a Commoditizing Model Landscape
- Quote 1: "Our superpower and our focus for a long time was building the foundational models to actually make that experience seamless... but our model and the way it's different to a lot of the use case specific ones is that our platform is relatively open where you can use pieces of that platform and not all of them for those different use cases."
- Quote 2: "Research is a head start. This gives us we can give advantage to the customer earlier and it's six to twelve months of advantage... But the kind of the thing that will really give that long-term value is the ecosystem that you create around whether that's the run and distribution whether that's the collection of voices you can have the collection of integrations you can build the workflows that you can build."
Synthesize Insights:
- Theme 1: Voice as the Future Interface & the Challenge of Foundational Models
- The Interface Shift: Voice is evolving from a niche content creation tool (like dubbing or narration) to the primary, natural interface for interacting with all technology, from smartphones to robots.
- Foundational Model Imperative: Achieving truly human-like, controllable, and interactive voice requires deep foundational model research, not just scaling existing solutions. Analogy: Just as a chef needs the best ingredients to make a gourmet meal, ElevenLabs focuses on building the best "audio ingredients" (foundational models) before assembling them into diverse "dishes" (products).
- Architectural Breakthroughs > Scale: Unlike some other AI domains (e.g., LLMs), audio quality and interactivity are more dependent on architectural innovation and specialized talent than sheer data scale. ElevenLabs claims to beat larger labs on benchmarks due to this focus.
- Real-time & Expressive: The next frontier involves low-latency, accurate speech-to-text, text-to-speech, and sophisticated orchestration to enable natural, emotionally contextualized conversations, moving beyond static content.
- Global Ambition: The initial insight came from the poor dubbing experience in Poland, highlighting a global need for high-quality, emotionally resonant voice translation and real-time communication (the "Babel Fish" idea).
- Theme 2: The Agentic Revolution: Beyond Reactive Customer Support
- Proactive Customer Experience: AI agents are moving from reactive problem-solving (e.g., "I want a refund") to proactive assistance, guiding users through discovery, purchases, and full experiences (e.g., Meesho's e-commerce assistant, Square's voice ordering).
- Immersive Media: Voice AI enables new forms of interactive content, allowing users to converse with characters (e.g., Darth Vader in Fortnite) or stories, blurring lines between passive consumption and active engagement.
- Personalized Education: The most transformative application is personalized AI tutors, offering on-demand, expert instruction (e.g., Hikaru Nakamura teaching chess, Chris Voss for negotiation). Analogy: Imagine having a personal tutor like a historical genius (e.g., Richard Feynman) in your ear, adapting to your learning style, much like a personal trainer customizes workouts.
- Government Transformation: Ukraine's "agentic government" initiative demonstrates the potential for voice AI to streamline public services, provide proactive citizen information, and deliver personalized education at a national scale.
- Theme 3: Product-Led Ecosystem as the Moat in a Commoditizing Model Landscape
- Research as a Head Start: Model advantages are temporary (6-12 months). The real long-term value comes from building a robust product layer and an ecosystem around that research.
- Open Platform Strategy: ElevenLabs offers an open platform, allowing enterprises to integrate specific components rather than forcing an all-or-nothing solution, catering to diverse needs and existing tech stacks.
- Voice Coaching & Personalization: Selecting the "right" voice is highly subjective and use-case dependent. ElevenLabs provides "voice coaches" and a celebrity marketplace, recognizing that voice choice is a critical, personalized branding decision. Analogy: Choosing a voice is like choosing a brand's spokesperson – it needs to resonate with the audience and convey the right tone, not just be technically "good."
- Ecosystem & Distribution: Defensibility comes from distribution, integrations, a diverse voice library, and established workflows, not just raw model quality. This allows them to serve both self-serve creators and large enterprises.
- Product-Research Parallelization: ElevenLabs integrates product and research teams, allowing them to build product features in parallel with research breakthroughs, accelerating time-to-market and extending the "head start."
Filter for Action:
- For Investors:
- Opportunity: Bet on companies building deep product layers and ecosystems around foundational AI models, especially in domains where architectural breakthroughs matter more than brute-force scale (like audio).
- Warning: Be wary of "model-only" plays; defensibility in AI is increasingly shifting to product, distribution, and specialized application.
- Opportunity: Look for companies enabling the shift to proactive, personalized, and immersive experiences, particularly in large, underserved markets like education and government services.
- For Builders:
- Action: Prioritize building robust product layers and seamless integrations on top of foundational models. Don't just chase model quality; focus on how users interact with and derive value from the AI.
- Action: Consider voice as the default interface for new applications, especially for agents and interactive experiences. Think beyond text-based interactions.
- Action: Invest in understanding and curating voice quality and personalization. This is a critical, often overlooked, aspect of user experience and brand identity in voice AI.
- Warning: Don't wait for perfect research. Parallelize product development with research initiatives to capitalize on temporary advantages and build momentum.
- Action: Explore "agentic" applications that move beyond reactive responses to proactive assistance and immersive experiences.
New Podcast Alert: No Priors Ep. 143 | With ElevenLabs Co-Founder Mati Staniszewski
By No Priors: AI, Machine Learning, Tech, & Startups
ElevenLabs, co-founded by Mati Staniszewski, is on a rocket ship, hitting a $300M ARR in just three years. Their mission? To redefine how humans interact with technology through voice. This isn't just about making voices; it's about building the foundational models and product layers for a future where voice is the primary, natural interface for everything from customer service to personalized education.
Voice: The Next Frontier of Human-Computer Interaction
"We build foundational audio models... to help you create speech that sounds human, understand speech in a much better way or orchestrate all those components to make it interactive and then build products on top of that foundational models."
- The Interface Evolution: Voice is rapidly transcending its role in static content (like dubbing) to become the default, natural interface for interacting with all technology – from smart devices to robots.
- Architectural Breakthroughs, Not Just Scale: Unlike some AI domains, superior audio quality and interactivity hinge more on architectural innovation and specialized talent than brute-force data scale. ElevenLabs claims to outcompete larger labs by focusing on these "model breakthroughs."
- Real-time & Expressive: The cutting edge is low-latency, accurate speech-to-text, text-to-speech, and sophisticated orchestration. This enables natural, emotionally contextualized conversations, moving beyond robotic responses to truly human-like interaction.
- Global Communication: The initial spark for ElevenLabs came from the poor dubbing experience in Poland, highlighting a universal need for high-quality, emotionally resonant voice translation and real-time communication – the "Babel Fish" dream.
The Agentic Revolution: Beyond Reactive Support
"The second exciting piece... is the shift from effectively a reactive customer support... into more of like a proactive part of the experience customer support."
- Proactive Customer Experience: AI agents are evolving from mere problem-solvers to proactive assistants, guiding users through discovery, purchases, and entire experiences. Think of an e-commerce agent that helps you navigate products and check out, not just process a refund.
- Immersive Media: Voice AI unlocks new forms of interactive content. Imagine conversing with Darth Vader in Fortnite or a character from your favorite book, blurring the lines between passive consumption and active engagement.
- Personalized Education: The most transformative application is the personalized AI tutor. This means having an expert (like chess grandmaster Hikaru Nakamura or negotiation guru Chris Voss) in your ear, adapting to your learning style, offering on-demand, tailored instruction.
- Government Transformation: Ukraine's "agentic government" initiative showcases voice AI's potential to streamline public services, provide proactive citizen information, and deliver personalized education at a national scale.
Ecosystem as the Enduring Moat
"Research is a head start... But the kind of the thing that will really give that long-term value is the ecosystem that you create around whether that's the run and distribution whether that's the collection of voices you can have the collection of integrations you can build the workflows that you can build."
- Temporary Advantages: Model quality advantages are fleeting, often lasting 6-12 months. The real long-term value and defensibility come from building a robust product layer and a comprehensive ecosystem around that research.
- Open Platform Strategy: ElevenLabs offers an open platform, allowing enterprises to integrate specific components rather than forcing an all-or-nothing solution. This flexibility caters to diverse needs and existing tech stacks.
- Voice Coaching & Personalization: Choosing the "right" voice is a critical, subjective branding decision. ElevenLabs provides "voice coaches" and a celebrity marketplace, recognizing that voice choice is as important as visual branding.
- Product-Research Parallelization: By integrating product and research teams, ElevenLabs builds product features in parallel with research breakthroughs, accelerating time-to-market and extending their "head start."
Key Takeaways:
- Strategic Shift: The future of human-computer interaction is voice-first, moving from static content to dynamic, personalized, and agentic experiences.
- Builder/Investor Note: Defensibility in AI is increasingly found in deep product layers, specialized architectural breakthroughs (especially in audio), and robust ecosystems, not just raw model scale.
- The "So What?": Over the next 6-12 months, expect to see significant advancements in proactive AI agents, immersive media, and personalized education, with voice as the core interface.
Podcast Link: https://www.youtube.com/watch?v=WHWAYiY_RnQ

ElevenLabs Co-Founder Mati Staniszewski reveals how architectural breakthroughs in audio AI, not just scale, are driving the next wave of human-computer interaction, challenging the dominance of generalist LLM providers and unlocking multi-segment growth.
ElevenLabs' Ascent: Voice as the Universal Interface
- ElevenLabs, founded in 2022, has achieved a $300 million ARR with 5 million monthly active users (MAU) on its creative platform.
- The company operates with 350 global employees and maintains a 50/50 split between self-serve (creators, subscriptions) and enterprise clients (sales-led agents platform).
- The initial insight stemmed from the "terrible experience" of single-voice Polish dubbing, sparking a vision for original voice, emotions, and intonation carried across languages.
- Mati argues voice is the interface of the future, transcending keyboards and screens for natural communication with smartphones, computers, and robots.
"Instead of having it just translated, could you have the original voice, original emotions, original intonation carried across?" – Mati Staniszewski
The "Lab" Approach: Sequencing Research and Product for New Markets
- The Voice Lab initially focused on high-quality voice recreation for narration and dubbing, quickly realizing existing models were "robotic."
- The Agent Lab then combined speech-to-text (STT), Large Language Models (LLMs – AI models trained on vast text data to understand and generate human-like language), and text-to-speech (TTS) to orchestrate interactive, low-latency, and accurate conversational agents (a minimal pipeline sketch follows this section).
- ElevenLabs also launched a Music Lab for licensed music creation and is integrating partner models for image and video, aiming for a multimodal content suite.
- Voice evaluation remains a complex challenge; ElevenLabs employs "voice silvers" (coaches) for enterprises and is building a celebrity marketplace to help clients select the right voice for specific use cases, demographics, and desired brand personas.
"You need that research expertise to be able to do that well." – Mati Staniszewski
Agent Platform: From Reactive Support to Proactive Engagement
- Proactive Customer Support: Companies like Meesho (India's largest e-commerce platform) and Square are deploying agents as front-end assistants, guiding users through discovery, navigation, and even checkout via voice.
- Immersive Media: ElevenLabs partnered with Epic Games to bring the voice of Darth Vader to Fortnite, allowing millions of players to interact live with the character, signaling a shift from static to immersive content.
- Personalized Education: Initiatives with Chess.com allow users to learn from AI-powered tutors voiced by grandmasters like Hikaru Nakamura or the Polgar sisters. MasterClass offers interactive negotiation practice with an AI Chris Voss (renowned FBI hostage negotiator).
- "Agentic Government": Mati highlights a groundbreaking project in Ukraine, where the Ministry of Transformation is building a fully agentic government for citizen services, proactive information dissemination, and personalized education.
"The whole breaking down language barriers that are the barriers to communication, to creation—all of that will break." – Mati Staniszewski
Defending Against Giants: Architectural Breakthroughs Over Scale
- ElevenLabs' "superpower" is building foundational audio models that are "seamless, human, and controllable," often beating benchmarks for text-to-speech and speech-to-text.
- Mati argues that in the audio space, architectural breakthroughs are more critical than sheer scale, allowing smaller, focused teams to achieve superior results.
- While open-source models will commoditize basic narration, the key differentiator for ElevenLabs is controllability and the complex orchestration required for real-time, interactive agents.
- The company views research as a "head start" (6-12 months), which is then amplified by parallel product development and a growing ecosystem of integrations, voices, and workflows.
"We are happily beating them on benchmarks with text-to-speech or speech-to-text or the orchestration mechanisms... It's just a mighty researchers just continuing their work." – Mati Staniszewski
The Future of Interaction: Super Assistants and Personalized Learning
- Mati envisions "Jarvis-like" super assistants that understand personal preferences, manage environments, and provide relevant information, rather than purely social companions.
- Voice will be a key interface for robotics, with most robot interactions likely to be personified.
- The most significant, yet-to-be-realized impact will be in education, where personalized, voice-driven AI tutors (potentially voiced by historical figures like Richard Feynman or Albert Einstein) will deliver tailored content on demand.
- This future will blend AI-driven learning with dedicated human-to-human interaction, fostering both knowledge acquisition and social development.
"Learning with AI will with voice where you it's like on your headphone or in a speaker it's just going to be such a big thing where you have like your own teacher on demand and who understands you very personified and kind of delivers the right content through your life." – Mati Staniszewski
Investor & Researcher Alpha
- Capital Shift: The market for foundational audio models is not solely about scale; specialized architectural breakthroughs are yielding superior results, suggesting a potential shift in investment towards focused, domain-expert AI labs over generalist LLM giants for specific modalities.
- New Bottleneck: Voice quality evaluation and the nuanced description of audio data remain significant challenges. Companies that can effectively benchmark, personalize, and interpret complex audio data will gain a competitive edge in deploying sophisticated voice AI.
- Research Direction: The "cascaded" approach (separate STT, LLM, TTS) remains reliable for enterprise use cases, but research into "fused" or "speech-to-speech" models that integrate emotional context and parallel processing represents the next frontier for truly expressive and human-like conversational AI, potentially rendering current sequential methods less competitive for advanced interactions.
Strategic Conclusion
ElevenLabs demonstrates that deep specialization in foundational audio AI, coupled with a robust product layer, can carve out significant market share even against larger players. The next step for the industry is to fully embrace voice as the primary interface, unlocking truly personalized and immersive experiences across all sectors.