
By a16z
Date: October 2023
Quick Insight: This summary unpacks ElevenLabs' audacious vision for voice AI, revealing how they're building the next fundamental human-computer interface. Investors and builders will grasp the strategic shifts in AI modality and organizational design driving their exponential growth.
This episode answers:
Mati, co-founder of ElevenLabs, sits down with a16z to discuss their journey from a Polish dubbing problem to a global AI voice powerhouse. They are not just building better text-to-speech; they are redefining how humans will interact with machines, moving beyond screens to a future where voice is the primary, emotionally resonant interface.
"Voice is poised to become the next fundamental interface for humans interacting with computers, just as the mouse, touchscreens, and keyboards were."
"If you train a general audio generation model, you're training on raw audio. If you can make a model that is smart in audio, you can imagine you can make a model that is smart in any raw data domain."
"We removed all the titles and it's a great way both initially of filtering for people who are very low ego."

We've been trying to make these human voices literally since the 1700s. Then in the early 1900s, we had the first digital synthesizers.
Then it shifted into Siri, which has a bit of back and forth and sounds more realistic, but again, it doesn't cross that threshold of actually sounding like a human and actually making you feel something.
Mati, it's so great to have you here at the headquarters of Andreessen Horowitz.
No, thanks so much for having me. It's incredible to be here and speak here together about some of the work we do.
You have said voice is poised to become the next fundamental interface for humans interacting with computers, just as the mouse, touchscreens, and keyboards were. Help us imagine what it looks like.
A lot of things are screen first. Most people will have the laptop, the phone most of the day in front of them. I think a lot of that will move into the background where you will be able to be a lot more present.
When I imagine, say, studying in a classroom in the future, you have headphones on and you can have the smartest physicist, mathematician, or historian helping you through learning the subject.
There will be an interesting shift where voice will be a big part of the technology where today when you go to other countries, other cultures, you cannot fully immerse inside the culture unless you know the language.
And with voice and with technology, suddenly this will become possible: you can speak any language in the world and fully understand not only what is said but how it's said, and feel like a closer part of it. That will be just an incredible future, where not only the language barriers but also the cultural barriers, or the things that we have never learned, can be overcome.
Let's start at the beginning. You and Piotr grew up in Poland. Tell us about the experience that sparked the idea for ElevenLabs.
In Poland, if you watch a foreign movie, all the voices, whether it's a male or female voice, are narrated by one single character. So, one voice speaks all the lines.
They ought to make the day that what? Well, it's 8:00 and it's not a good day. All the emotionality, all the intonation just disappears.
And then back in 2021, we realized that it's still happening. Piotr was at Google, I was at Palantir. We would explore different projects together on the weekends. We invited the first group of users then and started iterating a little bit deeper, and then we started getting good signal on which use cases would really resonate.
So when we launched in early January, we already had a few thousand people lined up that we knew were very interested in actually using the product. But then of course the few thousand turned into a few hundred thousand users, and that was probably an order of magnitude higher than we expected at first.
Introducing Voice Design v3. Introducing ElevenLabs Image and Video. Proudly introduces Studio 3.0.
What has been the guiding principle of the product philosophy?
It was always a combination of, one, where do we think we can deliver value with some of the research work and then layer the product on top, and two, where do we think there's actually a real problem?
There are companies that have the research and there are companies that have the product, and we try to have both. I think it's great because product can talk directly to research and provide feedback on what is needed, and research is then immediately able to iterate on that. They can also test their models directly in the product, and this way both sides accelerate.
Talking about the team, you went from just the two of you around the pre-seed time to, I believe, seven people when you were raising the Series A and did the launch, and then quickly to a few dozen a year later. How did you approach building the team? What qualities are you looking for when you're hiring?
We were especially in the early days hiring from very non-traditional backgrounds. So I did astrophysics in my undergrad and then applied physics in my masters.
Yeah. So I first met Mati when we did a hackathon together when we were 21. I was working at the White House for President Biden and an ElevenLabs investor told me that I should do everything in my power to try to go work there.
I was always pretty ambitious, but most of my ambition I put into video games. I have like 12,000 hours of Dota or something. I was actually ranked 250 or something on the European leaderboard.
We were, especially in the early days, trying to hire for some proof of excellence in what people had done. It could be an open source project. It could be doing something outside of work.
Yeah, I was doing my master's degree. I wasn't really going to university much. I was developing this text-to-speech project and like kind of like Piotr wore me through a guitar.
When I finished my thesis, I posted online one of the samples of the music generation model, and Piotr saw this example and contacted me.
So when I first joined, we had an 11-desk room. Now we have offices in over 11 cities, over 300 employees. We're doubling every 6 months.
But because we're remote first and we work in very small teams with high autonomy, you actually forget how big the company is.
We wanted to hire the best people in the world, and we don't think there are that many researchers in the world at that top level, especially in voice: maybe 50, maybe 100. So we wanted to hire them wherever they are.
There is, as you know, this very strong cultural obsession with being in person. How do you contrast these two different setups?
When we started the aspiration was very global both in what we wanted to create as a technology. We wanted to make it available across all languages, across all geographies.
ElevenLabs had a culture when I came in, and I think that was also what enticed me. I understood the vision that Mati and Piotr had for what type of company they want to build and the type of people that they are, which essentially is reflected in the culture.
Mati and Piotr, they're childhood best friends. They know each other super well. They're both incredible operators and they're high trust.
Honestly, what really got us excited about investing in the company was chatting with the founders, Mati and Piotr. They had a really unique vision of what the world could look like in the future that a lot of people didn't see yet.
Mati and Piotr are like yin and yang in a way. Piotr is very focused on the research. He's an absolute genius in that space. Working with him is very nice because he can go very deep technically.
The second smartest person I know is significantly less smart than him. Let's put it like that. It is a bit like good cop, bad cop. Maybe Mati is the good cop and Piotr is the bad cop.
Well, we'll start a little bit with thanking you for being here. You're a hard guy to catch. So where are you today?
Now I'm in Dubai.
How has your role evolved as you became a larger, more remote team?
Yeah, you definitely don't know all the engineers, which is sad: at some point you just will not know all the people in the company. Mati knew everybody at the previous offsite. I had already failed at the previous offsite, when there were 100 people.
If you have great people, there's very little effort needed to run the company because you can just trust these mini-founders: people that really take ownership and care about the company because you love working here and you love the product.
When the product is built out of love, then users can see that. Everyone has very high autonomy. There's low bureaucracy and a very flat, fuzzy hierarchy. They're doing whatever is needed to move the needle for the customers and to ship quickly.
We removed all the titles and it's a great way both initially of filtering for people who are very low ego. And so if you're coming in saying, yes, I want to be VP of blah blah blah, you're not going to get VP. And so it actually will turn off those people. But I'd argue that's a good thing.
There's no implicit bias around asking a question or asking for help or giving advice to someone or proposing ideas, because there's no explicit hierarchy.
Get access to a training cluster and train a model that you have an idea for. We apply a rigorous screen to ensure cultural fit before we bring someone in. And I think that's essential to being able to scale this quickly and still preserve culture.
In fact, when I first spoke about this publicly and we kind of launched the idea that we got rid of titles, I had someone that I used to work with reach out and she said like, "I heard that you got rid of titles. I love that notion. What roles do you have? I want to join." And she's now leading hiring, incredibly successful.
Currently, we have specialized models for audio, for sound effects, and for music. And I think the future of sound is kind of like having one model which can generate any kind of audio.
You could imagine saying something with your voice that is converted to music, or singing something and changing the singing into sound effects.
The new challenge we've really set ourselves is: can we be the first company to cross this threshold of the vocal Turing test? How do you have an AI which really sounds like a human, that you can interact back and forth with, but is super smart, super empathetic?
I think there's going to be a point where most of the communication we do with machines might be through audio, because, one, it's faster to communicate, but also because it's more information-rich.
There are things now that machines or LLMs are not capturing. If you train a model on text, you're basically using text units, tokens, that are created by humans, whereas if you train a general audio generation model, you're training on raw audio.
If you can make a model that is smart in audio, you can imagine you can make a model that is smart in any raw data domain. That I think is one of the most interesting things.
Voice is the only AI modality that can actually make you feel something. And so when you have text, yes, you can have a poem or a story, but it doesn't give you that same kind of emotive feel.
Whereas when you hear a voice, whether it's like an ASMR whispering voice or a deep, booming cinematic voice, it can really transport you and make you feel alive.
I love to end the conversation with this question. What drives you personally?
Definitely seeing people react is always one of the best moments. But I feel like I'm in just such a lucky position where I can work on a company with my best friends.
But now it feels like we have this incredible team, somewhere between a sports team and a family, where everybody is driven by the same passion and vision.
But I think now especially it's just so rare that you get a chance to be the voice of the change or voice of the technology and be able to be at the frontier and define how voice will be that interface for everybody around us.
It's just such a unique opportunity to create, and we are lucky and happy to be able to be part of it.