Voice's impending iPhone moment

Ben Thompson recently wrote in Stratechery about “The Impending VR Moment” (link).

I’d summarize his argument as:

Recent AI models (like Sora) show major improvements in generating complex, photorealistic video content that may be useful for virtual reality experiences (despite their limitations in how they represent the physical world)
Transformer-based models like Sora will run much faster on chips with architectures designed for their computational needs than on GPUs, as Groq has shown (link)
Thus, we’re heading towards an “iPhone moment” for VR, given the development of these chips and AI models, and of VR hardware like the Apple Vision Pro or Meta Quest

I think Ben’s right, but I also think his argument misses an equally significant moment: these same drivers will make voice the next dominant interface.

“So three things: a widescreen iPod with touch controls, a revolutionary mobile phone and a breakthrough internet communications device. An iPod, a phone, and an internet communicator… Are you getting it?” (Steve Jobs, introducing the iPhone in 2007)

The iPhone, and every smartphone that followed, collapsed all devices into one. It replaced paper maps (does anyone use them anymore?), personal music players like the iPod or Walkman, long-distance phone calls (now made through apps like WhatsApp), cameras, newspapers and magazines, the TV and radio… and so much more.

We went from many devices to one: a monochromatic rectangle with a touchscreen.

But language is our original interface, hundreds of thousands if not millions of years old. No wonder older people struggle with new technologies but can still solve problems by talking with others.

My former teacher, the inspiring Javier Cañada, recently spoke about why he thinks voice is the best interface. He described voice as:

Immaterial: it doesn’t take up space, unlike pixels on a screen
Freeing up the senses: you don’t need to use multiple senses simultaneously (unlike sight and gesture with graphical interfaces)
Accessible: it’s an interface almost everyone can use because you just need to speak
Emotive: voice has intonation, which conveys information about the meaning of our words
Personal: we all speak in a unique way

Voice is much more expressive, intuitive, and personal than gestures on a smartphone screen, and that is a key reason why I think the natural interface for AI agents is voice.

Reinventing hardware around voice

The above leads to my main predictions for the next 10 years:

Voice will be the dominant new interface of the next decade
This will create a long tail of hardware around voice

While we’re not quite there yet, better computation, leading to faster and more accurate LLM results, should enable this. Some early examples of this are the Rabbit R1 and the Humane AI Pin.

While smartphones will continue to exist because of the value of displaying information on screens (just like desktops and laptops coexisted with smartphones), these new devices will complement — or in some cases even substitute — smartphones.

I’m particularly excited about these devices eventually becoming substitutes for smartphones. Our phones are essential but invasive: we can’t imagine not having them, but their interruptions and distractions have few limits. Something like a screen-less smartwatch that can still do the essentials, like access our calendar, call our Uber, or make a call, can be a full replacement for smartphones for many people, especially those sensitive to smartphones’ invasiveness.

To extend this further, I think we’ll see the reverse of the smartphone trend: instead of collapsing all devices into one, we’ll see an explosion of a long tail of AI-powered devices. With an internet connection to make calls to an LLM, and maybe some additional hardware like cameras to understand the user’s environment, you can make devices more helpful. Imagine a mining hat with a camera that helps you understand the geological conditions around you, or security cameras that help you find the type of recorded footage you’re looking for.

Just like vertical SaaS helps you deliver more value to customers with more tailored products, “vertical” AI-powered hardware should provide more consumer value for specific use cases than generic alternatives.
