Amazon.com, Inc. (AMZN) Earnings Call Transcript & Summary

May 17, 2021

NASDAQ | US | Consumer Discretionary | Broadline Retail | Conference presentation | 18 min

Earnings Call Speaker Segments

Miriam Daniel

executive
#1

Hello, everyone. My name is Miriam Daniel, and I'm VP of Alexa and Echo Devices at Amazon. I'm excited to be here with you today to talk about something that I am very passionate about, designing new human-computer interaction and ambient experiences. Our original inspiration for Alexa and Echo came from the Star Trek computer in the TV series. It was a cloud-based service that you could speak to and have change the course of the starship or tell you about a distant galaxy. I still remember watching the Star Trek series as a child, and I was fascinated. It seemed like magic to me at that time. Over the last 6 years, it's been remarkable to see that magic come to life. Today, customers around the world are using their voice to ask for music, make phone calls, get information, control the smart home and more. Designing for human-computer interaction using voice as a medium was quite a challenge. Voice is the most natural form of communication and interaction for most humans. In fact, we hear our parents' voices before we are even born, and using our voices to communicate with the world comes naturally to us as humans. But there wasn't a precedent for how people would use voice to interact with technology in their homes. From day 1, we knew Alexa and Echo would introduce a new human-computer interaction paradigm. Our initial launch of Alexa and Echo taught us that voice interaction is a way to save time and simplify our lives. Voice enables more family time and healthier digital lives by letting us put our screens down. I am probably saying something a bit controversial here. I realize I'm speaking to an audience of diverse display technologists, but I promise you, I will talk more later about how we brought screens to augment the voice interaction over time. In the first few years, our focus was on making Alexa interactions natural, as if you're talking to a friend, a family member or a colleague. We focused on voice modulation and intonation.
Humans unconsciously modulate their voices, like I'm doing now, like when your voice goes up at the end of a question or when you put emphasis on certain words. You must also use voice intonation when you're speaking. There's something musical in the way people speak, depending on their emotion, tone or accent. We trained Alexa's voice to be more natural using such modulation and intonation. We also trained Alexa's voice to adapt to different speaking styles: a newscaster style for reading the news; a DJ style for announcing songs; or even celebrity voices like Samuel L. Jackson's. As Alexa became more natural, we entered a second phase in human-computer interaction. Our customers wanted Alexa to have more human-like traits. They wanted her to have personality and opinions. They expected her to be thoughtful. They wanted her to have some intuition, to be more conversational, and to be able to teach Alexa like you would other humans. I will now take you through how we taught Alexa to learn more human-like traits. Early on, we found that customers assumed Alexa would have a personality. They were asking her a lot of questions like "What's your favorite color?", "What's your favorite song?", et cetera. Customers told Alexa "I love you" more than 10 million times last year. She's even received millions of marriage proposals. It's really funny. Realizing the need to give Alexa some personality, we taught Alexa to have opinions on a variety of subjects. You can ask Alexa's favorite color, sport, actor and more, and her response will vary by country. In India, her favorite sport is cricket. Here, it's baseball. Humans are also thoughtful. If I walked into a room right now and you said, "Shh, the baby is sleeping," I would know what to do. I would lower my voice. Our teams taught Alexa how to respond based on context. Today, if you whisper something to Alexa, she will whisper back when she responds. Human brains have an interesting ability to connect the dots.
It's what we call intuition or a hunch. Now with state-of-the-art AI and sensor technology, we can replicate a human being's hunch. It's a big step forward for ambient computing because it requires multiple devices and sensors to work together. Your Echo device, Alexa and a smart home device like a smart lock or smart light might work together and produce a hunch. For example, I know some customers routinely say "Alexa, set an alarm" or "Alexa, good night." Every once in a while, Alexa will have a hunch. She might let you know if you left your kitchen lights on or if your back door is unlocked. Customers can also choose to have Alexa proactively act on her hunches. For example, customers can enable Alexa to adjust the thermostat or turn down the water heater when she has a hunch that you might be away from home. Humans can also follow the flow of a conversation by making connections to past references. It is the ability to answer and ask successive questions without being repetitive. We created a feature called follow-up mode. You can now make back-to-back requests to Alexa without needing to repeat the wake word. It's a much more conversational experience. And finally, last fall, we introduced the ability for Alexa to learn directly from our customers. For example, I can teach Alexa that when I say "Alexa, turn on movie mode," I mean dim the lights to 50%. Alexa will learn these new concepts and apply them anytime I say that phrase. This is a great example of how Alexa can learn and adapt to your unique needs so your home runs just the way you want it to. We're excited about how far we've come with our AI models to give Alexa more human-like attributes, but we know that the future of human-computer interaction is so much more than having an AI with human-like voice attributes. We believe the future is ambient, one in which technology all around you adapts to you.
It is a multidimensional experience that is proactive but not pushy; personal but also communal; and understands you and your home well enough to act on your behalf in meaningful ways. Let me explain what I mean. I've spent a lot of time on voice, but in 2017, we introduced screens to augment the voice experience. The combination of voice with easily glanceable visuals on Echo Show is truly powerful. It's also an excellent example of the ambient experience. Content is proactively shown to you in the background, ready when you need it. It can subtly remind you that a friend has a birthday coming up, recommend a new recipe for dinner or show you a favorite photo, all without you even needing to ask. Over time, we expanded our Echo Show portfolio, and we realized that putting information on the screen wasn't enough. You see, our screens are stationary, but we as humans are not. So we challenged ourselves to create yet another new paradigm, a screen that moves with you. Let me show you how it works. [Presentation]

Miriam Daniel

executive
#2

Echo Show 10 is an entirely new interaction model for humans and computers. You no longer need to choose between moving to look at a screen and hearing Alexa's voice response. Echo Show 10 turns to face you when you interact with Alexa. You can imagine how useful this is for video calls with loved ones or while you're cooking and need to be moving around the kitchen to prep. Getting the motion right on Echo Show 10 required us to study how humans interact with each other. There are many subtle nuances to how you or I move when talking to someone, how we twist our body or turn our head. Bringing Echo Show 10 to life required us to enable such subtle motion using a quiet brushless motor, sound source localization and computer vision. If a user speaks to the device, turning the display in the direction of the user indicates that the device heard the customer. Then if the interaction continues, it makes sense to keep moving to be in step with that user. The other challenges lie in anticipating which moves would delight the customer and which moves could be jarring to the customer. The current Echo Show 10 was the result of a lot of invention, and the team solved many hard problems along the way to get that human motion interaction model just right. As a result, we created a technology that adapts to humans versus humans adapting to technology. But as always, it's not just about the hardware. We thought about the ambient experiences we could enable with smart motion on that smart display. Let's talk about how this technology helps you stay connected while working its magic ambiently in the background. Imagine a scenario where you're doing something around the kitchen but also talking to your mom. Until now, you had to stay tethered to one spot while on that video call, hold out an arm to squeeze everyone into the field of view or try to prop up your device in just the right way. You don't have to do that anymore.
Not only does Echo Show 10 move with you as you're on a video call, but the camera also digitally pans, tilts and zooms to always keep you centered in the frame. Our customers tell us that they are absolutely in love with this feature. It makes them feel like they're in the same room as the other family member or friend. While voice makes it easy to turn lights on and off, achieving a truly ambient home means you might actually talk less to Alexa. And in more and more cases, you may not say anything at all. Nearly 1 in 5 smart home interactions are initiated by Alexa these days without customers having to say anything. That's thanks to more powerful hardware and more proactive features like Alexa Guard, smart routines and hunches, which I mentioned earlier. Alexa's understanding of context is incredibly important in delivering convenient features in that ambient home. For example, with Echo Show, you can set up a sunrise alarm that gradually brightens as you wake up in the morning or use occupancy-triggered routines to turn on your lights when you walk into a room. You don't even have to ask. It's great when Alexa's understanding of your environment helps simplify customers' lives, but it's even better when it gives customers peace of mind. You can use Echo Show 10 to see who's at the front door with your Ring doorbell and securely access a live feed of your Echo Show from another Echo Show device or even the Alexa app on your mobile. With the ability to remotely zoom or pan the display and the camera to see the entire room, you can see it all. It's an easy way to check and see if the dog is on the couch or if you left something for them. And with Alexa Guard, Alexa can notify customers when specific sounds, like a smoke alarm or broken glass, are detected while you're away, and it helps keep your home safe. Soon, Echo Show 10 will also periodically pan the room and send you a smart alert if it detects someone in its field of view.
This is the kind of technology you may not notice day-to-day, but it's there when you need it the most, giving you peace of mind. Last year, we introduced natural turn-taking, the ability for customers to speak to Alexa without using a wake word during the course of the conversation, even when multiple people are talking. I could explain the science behind it, but I think it'd be easier to just show it to you. [Presentation]

Miriam Daniel

executive
#3

To do this, we had to solve several challenges. For example, are people speaking to each other or should Alexa join that conversation? And if she does, who should she respond to? This requires us to go beyond voice-only understanding to multisensory artificial intelligence. Alexa uses a combination of acoustic, linguistic and visual cues with deep learning-based models running locally on the device. Once Alexa determines that a particular interaction turn is directed at her, she uses the context of the conversation to decide how to respond or what specific action to take. Natural turn-taking is a major step forward in conversational AI, enabling customers to interact with Alexa at their own pace. We look forward to bringing this experience to our customers later this year. So what's next? As we often say at Amazon, it's only day 1. We will continue to build an ambient experience and believe that better sensors, smarter AI, stronger connectivity, evolution of display technology and edge computing will come together to make customers' homes and neighborhoods smarter in more meaningful ways. I hope you can tell how excited we are by our ambient vision. And we look forward to working with many of you on the emerging innovations in the display technology industry, which will make the future of human-computer interactions even more compelling. Before I go, I thought I would leave you with one final experience that points to the future we are striving for, and maybe it gives you a few laughs along the way. Thank you for your time today, and enjoy the conference. [Presentation]
