Voice technology — Public Digital

By Catherine Breslin

Talking is a natural way for people to interact with each other. Small children can speak long before they can read, write or type. People can carry on a conversation even when they’re busy doing other tasks.

The introduction of virtual assistants like Amazon’s Alexa, Apple’s Siri and Google Assistant has brought voice interfaces to the mainstream as a way for people to communicate with devices. These assistants are popular with customers, who love how easy they are to use. Voice holds the promise of providing a frictionless way to interact with computers, and organisations are looking for ways to take advantage.

Behind these virtual voice-activated assistants lies a pipeline of different machine learning technologies – speech recognition, language understanding, human-computer dialogue and text-to-speech. Machine learning is the broad term for a set of algorithms which learn their behaviour from data.

Machine learning algorithms are used in many different fields for tasks like spam email classification, face recognition, self-driving cars and fraud detection. Voice and language systems using machine learning algorithms are built from audio and text data, often several thousand hours of audio and many millions, or even billions, of words of text.

Still, as with all modern AI technology, voice systems remain specific to the task for which they were built. A speech recognition system trained to transcribe financial queries will fall over in a healthcare setting, where the vocabulary and the language are very different. This particular example might seem obvious, but there are other, more subtle ways this manifests.

A speech recognition system built entirely on audio from people living in Texas wouldn’t work as well in London, as both the pronunciation and the use of words differ between the two places. It would be hard to track down the cause in this case without in-depth knowledge of the data that had been used. Building the right system for your organisation requires getting multiple threads in sync with each other to lay the foundations for success – including the data, the design, and the evaluation.

Machine learning systems are heavily dependent on the data with which they are built. Getting your organisation’s data in an easily accessible location and format is a necessary first step to begin building with it. The ability to quickly use realistic data to build and test your system, and iterate to improve it, is key to achieving good performance.

This is true not just at the design stage, but also once your system has been launched and is in use. Adapting to real usage is an important part of tuning a machine learning model. Because this means making use of users’ data, it’s critical to pay attention to how your organisation handles it. Relevant laws give minimum requirements for best practices in data management. Europe’s General Data Protection Regulation (GDPR) sets out seven key principles to inform the processing of personal data, and the United States’ Health Insurance Portability and Accountability Act (HIPAA) protects individually identifiable health data.

A second key need is to factor uncertainty into the design from the outset. Machine learning systems make mistakes. Once a system is live, it can be hard to spot where those mistakes are being made. Understanding how your system might go wrong and allowing users to recover gracefully is an important part of the design. In systems which make decisions impacting people, it may be helpful to give information to those using the system to aid their understanding of when to trust the outcome of an automatic decision, and when to probe further.
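One common way to design for uncertainty is to let the system's own confidence decide how it responds. The sketch below is illustrative only: it assumes the language understanding component returns an intent with a confidence score between 0 and 1, and the names `Interpretation` and `choose_action`, along with the two thresholds, are hypothetical rather than taken from any particular product.

```python
from dataclasses import dataclass


@dataclass
class Interpretation:
    """What the system believes the user asked for (hypothetical structure)."""
    intent: str        # e.g. "check_balance"
    confidence: float  # assumed to be a score between 0 and 1


def choose_action(result: Interpretation,
                  accept_at: float = 0.85,
                  confirm_at: float = 0.50) -> str:
    """Pick a response based on how sure the system is.

    The thresholds are placeholder values; in practice they would be
    tuned against real usage data.
    """
    if result.confidence >= accept_at:
        # Confident enough to act without interrupting the user.
        return f"execute:{result.intent}"
    if result.confidence >= confirm_at:
        # Unsure: ask the user to confirm before acting.
        return f"confirm:{result.intent}"
    # Too uncertain to guess: ask the user to repeat or rephrase.
    return "reprompt"


print(choose_action(Interpretation("check_balance", 0.93)))
print(choose_action(Interpretation("check_balance", 0.62)))
print(choose_action(Interpretation("check_balance", 0.20)))
```

The middle branch is what gives users a chance to recover: a likely mistake becomes a short confirmation question rather than a wrong action.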

Thirdly, as with any piece of technology, it’s important to evaluate a voice system with a meaningful business metric. Where there’s a pipeline of technology, it’s typical to find separate teams building each part of the system.

The accuracy of those individual parts can be measured by the individual teams, but it’s also important to find a way to measure the end-to-end performance of the entire system. In a virtual assistant, the speech recognition system could make a mistake and mishear ‘What’s the bank balance?’ as ‘Please the bank balance?’. In this case it’s likely that the rest of the system can recover from the mistake.

A different mishearing, such as ‘What’s the ambulance?’, would be far harder to recover from. Here, measuring the performance of the speech recognition component in isolation doesn’t give a good measure of the overall user experience. In addition to end-to-end performance, it’s also illuminating to measure performance on subsets of your users, to monitor for bias that creeps in from the real-world data used to build the model.
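To make the component-level measurement concrete, here is a minimal sketch of word error rate (WER), a widely used metric for speech recognition. The article does not name a metric, so WER is an assumption here. The helper `wer_by_group` and the group labels are hypothetical, and the grouped data is a toy example rather than real results.

```python
from collections import defaultdict


def word_edit_distance(reference: str, hypothesis: str) -> int:
    """Minimum number of word substitutions, insertions and deletions
    needed to turn the reference transcript into the hypothesis."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    prev = list(range(len(hyp) + 1))  # distances from an empty reference
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref_len = len(reference.split())
    if ref_len == 0:
        raise ValueError("reference transcript must not be empty")
    return word_edit_distance(reference, hypothesis) / ref_len


def wer_by_group(samples):
    """Aggregate WER per user group from (group, reference, hypothesis) tuples."""
    errors = defaultdict(int)
    words = defaultdict(int)
    for group, reference, hypothesis in samples:
        errors[group] += word_edit_distance(reference, hypothesis)
        words[group] += len(reference.split())
    return {group: errors[group] / words[group] for group in words}


reference = "what's the bank balance"
print(word_error_rate(reference, "please the bank balance"))
print(word_error_rate(reference, "what's the ambulance"))

# Toy data with made-up group labels, purely to show the breakdown.
samples = [
    ("group_a", reference, "what's the bank balance"),
    ("group_b", reference, "what's the ambulance"),
]
print(wer_by_group(samples))
```

On the two mishearings above, the first scores 0.25 and the second 0.5. The numbers say how many words were wrong, but not whether the assistant could still answer the user's question, which is why an end-to-end measure is needed alongside them. The same per-group breakdown can be applied to whatever end-to-end metric you choose.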

Voice technology is improving rapidly, and being used in applications like virtual assistants, video and audio search, call centres, automatic subtitling, and more. By being careful about the data, design and evaluation, your organisation will have a good foundation from which to build its own voice and language applications.