Voice command systems are transforming user experience for a wide range of tasks by playing to the strengths of each interaction style. One night, I found myself attempting to log in to my Netflix account on my TV. I suffer from mild carpal tunnel syndrome, so instead of pressing three buttons on my remote and scrolling with my thumb, I elected to hold one button and say, “Netflix.” Instantly, I was logged in and on the home screen. The task was completed quickly and without the hassle of using my hands excessively. Devices like smartphones, tablets, and televisions are being enhanced with the addition of voice control systems. These devices all rely on a combination of voice and screen interaction. The question is: Which type of interaction do people generally prefer? And will people eventually become comfortable with a voice-dominant, or even voice-only, type of interaction?
In this article, I will explain the similarities, differences, and nuances of screen-first vs. voice-first interaction. I will also cover the ongoing debate: which type of interaction do people prefer, and which type will dominate in the future? We all know that voice and screen-based interaction are converging, but in the future, will we interact with a device “voice-first” or “screen-first”? Let’s find out!
We’ve divided this guide into the following sections:
- What is screen-first interaction?
- What is voice-first interaction?
- Voice vs. screen: The pros and cons of each
- Will voice interactions replace screens entirely?
1. What is screen-first interaction?
Ever since the first iPhone was released, the screen has been the primary medium we design for, and the main way we interact with devices. Tapping, scrolling, clicking, dragging, flipping, sliding, hovering, and zooming are examples of interaction techniques a user performs on a screen. Each of these techniques encourages an action, and users have grown accustomed to them. For example, sliding through a virtual photo gallery simulates flipping through a photo album in real life, which makes the interface feel intuitive, natural, and easy to use. If I wanted to “like” a photo on Instagram, I would tap a heart icon. If I wanted to save a photo, I would “press down” on it and save it onto my smartphone. There has always been a sense of security and comfort when interacting with devices on-screen.
An interaction technique that requires fewer actions to access information or complete a task is generally considered easier than one that demands more. With online shopping, entering payment information can be expedited by saving previous payment details. Since users are getting more comfortable with saving their private information on their phones, tapping, scrolling, and typing are becoming less prominent. Users can accomplish tasks, such as purchasing a book on Amazon, more quickly and with less on-screen interaction than before. As the demand for accomplishing tasks and ordering services increases, the emphasis of on-screen interaction and its techniques is changing and entering a new phase.
2. What is voice-first interaction?
Currently, voice-first interaction is mostly limited to personal and home use. As people become accustomed to it, businesses will adopt it as well. For example, imagine you’re trying to set up a conference room projector or navigate a phone system menu. What if you could just say ‘Show my screen’ or ‘Start the meeting’ and immediately be in a meeting with co-workers from across the world? Sounds fascinating, right?
So what makes voice-first interaction so appealing? Hands-free control lets users multitask, for example while driving, cooking, or doing laundry. It’s becoming increasingly common to use speech to access complex navigation menus, and to initiate and execute familiar tasks with known commands. Usually, a voice assistant like Siri, Google Assistant, or Amazon’s Alexa can initiate the first step of a task, and any later steps require the user to talk more. For example, to clarify a request, Siri might say, “Did you mean Filipino restaurants in San Francisco?” However, moving beyond that first step, for instance browsing the list of search results, often still requires on-screen input. So can we rely solely on voice interaction? My answer is: not yet!
3. Voice vs. screen: The pros and cons of each
Until very recently, most devices that combined screen and voice control were predominantly screen-first: smartphones that added voice programs, or agents, like Siri or Google Assistant.
These screen-first devices, like iPhones and Android phones, offer speech recognition and language processing, but the overall user experience has remained limited because of the gap between the voice program and the functionality of the device.
The introduction of smart speakers like Amazon’s Echo and Google Home ushered in a new era. These devices offer no visual display at all, and everyday interaction relies on audio for both input and output. Smart speakers allow true hands-free operation, enabling the user to complete a wide array of tasks, from checking the weather to setting an alarm and even purchasing an item on Amazon. However, the absence of a screen is a huge limitation for these speakers. For example, setting a timer with a voice command while cooking is easy and convenient, but users still don’t fully trust that the device set the timer. They might get up and check the stove to make sure. Hearing Echo summarize the weekly weather forecast is cool, but it’s actually easier and faster to check it at a glance on, say, an iPhone screen.
4. Will voice interactions replace screens entirely?
Despite all the hype about voice technology taking over, the future of how we design and interact with our favorite devices is likely to be more multimodal than purely voice-first.
Ultimately, going from on-screen interaction to solely voice-based interaction unnecessarily limits the usefulness of the device; it actually increases user frustration and creates more problems. I do not expect voice interaction to completely replace written communication, despite common belief and trends. What I do know is that human-to-machine communication is rapidly expanding to include both types of interaction. For example, if you wanted to order a ride with Uber, you might say, “Siri, call me an Uber.” The next step would be to see nearby Uber vehicles on a map on your smartphone screen to decide whether it’s worth the wait. A visual display is still a more efficient (and necessary) way of letting people access a large amount of information than audio-only output.
Combining voice and screen into a single system sounds like an obvious win, but there are many design challenges in successfully integrating two very different interaction modes. We have yet to realize the full benefits of both voice and screen interaction. Who knows which mode will dominate the way we interact with devices? All I know is that the demand for completing tasks and obtaining services will continue to increase, which will in turn drive demand for voice agents and programs that accomplish tasks more efficiently, quickly, and intuitively.
If you’re curious about designing for voice user interfaces, try out this free voice design short course. Keen to learn more about voice technology? Check out the following articles: