
Communications of the ACM

ACM News

Decoding Speech


Brain activity can be captured by surgically implanting a device into speech-related areas of the brain, or by using non-invasive systems such as electroencephalographs.

Credit: A Health Blog

Instead of speaking to digital voice assistants such as Alexa and Siri, we soon could interact with our devices simply by thinking of what we want to say to them. To make this possible, researchers are trying to decode brain activity linked to speech by tapping into advancements in brain-computer interfaces (BCIs)—systems that capture brain signals, analyze them, and translate them into commands—and artificial intelligence (AI).

"[If we make progress in the next few years], I'm quite confident that we can drive these solutions towards real-world applications," says Maurice Rekrut, a researcher and head of the Cognitive Assistants BCI-Lab at the German Research Center for Artificial Intelligence (DFKI) in Saarbrücken, Germany.

Speech-decoding BCIs are also of particular interest to help people with certain conditions to communicate. Nerve cells that send messages to muscles involved in speech can become damaged due to diseases such as motor neuron disease (MND) and amyotrophic lateral sclerosis (ALS), for example, affecting a person's ability to speak. Patients often use gaze control systems combined with predictive text to type out what they want to communicate, but that can be a slow and frustrating process. "The important thing is to try and give people back not just naturalness of speech, but fluency and rapidity of speech," says Scott Wellington, a research assistant with the dSPEECH project at the University of Bath in the U.K. "That's what we can do with BCIs."

Brain activity can be captured by surgically implanting a device into speech-related areas of the brain, or by using non-invasive systems such as electroencephalographs, which pick up electrical signals in the same brain areas through sensors placed on the scalp (the tests they perform are known as electroencephalograms, or EEGs). However, there are many challenges to overcome before speech signals can be captured effectively.

For example, implants are currently more promising since they can be placed directly in areas of the brain that process speech, resulting in higher-resolution signals. However, existing devices cannot be removed without damaging the brain.

Many researchers are also experimenting with EEGs, but signals are heavily attenuated by the time they reach the scalp.

"We have to find some very clever ways of doing the signal processing to decompose that signal into a suite of statistical features of interest," says Wellington.
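To make the idea of "statistical features" concrete, here is a minimal sketch of turning one window of a single EEG channel into a few numbers a classifier could use. It is an illustration only, not the dSPEECH pipeline: the feature choice, the function names `band_power` and `window_features`, the 128 Hz sampling rate, and the conventional alpha (8 to 13 Hz) and beta (13 to 30 Hz) band edges are all assumptions, and a synthetic sine wave stands in for a real recording.

```python
import math

def band_power(samples, fs, lo_hz, hi_hz):
    """Mean power of the DFT bins whose frequency lies in [lo_hz, hi_hz)."""
    n = len(samples)
    offset = sum(samples) / n
    centered = [s - offset for s in samples]  # remove the DC offset
    total, bins = 0.0, 0
    for k in range(1, n // 2 + 1):
        if lo_hz <= k * fs / n < hi_hz:
            angle = 2 * math.pi * k / n
            re = sum(x * math.cos(angle * i) for i, x in enumerate(centered))
            im = sum(x * math.sin(angle * i) for i, x in enumerate(centered))
            total += (re * re + im * im) / (n * n)
            bins += 1
    return total / bins if bins else 0.0

def window_features(samples, fs):
    """Summarize one window as a small dictionary of features (hypothetical set)."""
    n = len(samples)
    mean = sum(samples) / n
    return {
        "variance": sum((s - mean) ** 2 for s in samples) / n,
        "alpha_power": band_power(samples, fs, 8, 13),   # assumed band edges
        "beta_power": band_power(samples, fs, 13, 30),   # assumed band edges
    }

# Synthetic stand-in for EEG: one second of a 10 Hz sine sampled at 128 Hz.
fs = 128
signal = [math.sin(2 * math.pi * 10 * t / fs) for t in range(fs)]
features = window_features(signal, fs)
```

Because the synthetic signal oscillates at 10 Hz, nearly all of its power lands in the alpha feature and almost none in the beta feature. Real scalp recordings are far noisier, which is why feature design matters so much.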

In recent work, Wellington and his colleagues investigated how well commercially available EEG headsets can decode speech from brainwave data. Their goal was to ascertain whether more sophisticated machine learning and signal-processing techniques could bring the headsets' decoding accuracy close to that of research-quality EEG devices.

For their experiment, they focused on 16 English phonemes—distinct units of sound such as p, b, d and t. Twenty-one participants were asked to wear off-the-shelf EEG headsets while hearing the phonemes, imagining them, and speaking them out loud. The brain activity picked up from the EEG sensors was recorded in each instance.

Using the data, the researchers then trained a classic machine learning model and a more complex deep learning model, a convolutional neural network (CNN), to decode different classes of phonemes. They were surprised to find that the traditional model performed better. "Time and time again, people doing research on decoding speech in the brain discover that the classical machine learning models still tend to perform reasonably well, even in comparison to the deep learning models," says Wellington. His team is not certain why this is the case, but they suspect it is because deep learning models typically require large amounts of data to be effective.

The classic machine learning model, however, was only able to distinguish between certain phonemes reasonably well, which is a far cry from successfully deciphering intelligible speech. Wellington says its performance could be improved significantly by incorporating a large language model such as OpenAI's GPT-3, an approach that is now common practice in the field. These models consider the probability of potential words depending on the context. "Given the rules of the English language and the statistics behind the distribution of all of the English phonemes, [a large language model] can say with a very high [degree of] confidence that the word you're trying to say is probably 'house', for example," says Wellington.
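The "house" example can be sketched in a few lines. This is a toy under stated assumptions, not how any particular system works: a three-word lexicon with made-up prior probabilities stands in for a large language model, the per-position phoneme probabilities in `decoder_output` are invented, and the phoneme labels borrow ARPAbet symbols (HH, M, AW, UW, S).

```python
import math

# Hypothetical decoder output: for each position, a probability per phoneme.
decoder_output = [
    {"HH": 0.45, "M": 0.55},   # the decoder slightly prefers "M" here
    {"AW": 0.60, "UW": 0.40},
    {"S": 0.70, "Z": 0.30},
]

# Hypothetical lexicon: word -> (phoneme sequence, prior probability).
# A real system would take the prior from a language model given the context.
lexicon = {
    "house": (["HH", "AW", "S"], 0.70),
    "mouse": (["M", "AW", "S"], 0.25),
    "moose": (["M", "UW", "S"], 0.05),
}

def signal_score(word):
    """Log-probability of the word's phonemes under the decoder output alone."""
    phonemes, _ = lexicon[word]
    if len(phonemes) != len(decoder_output):
        return float("-inf")
    return sum(math.log(dist.get(p, 1e-9))
               for dist, p in zip(decoder_output, phonemes))

def combined_score(word):
    """Decoder evidence plus the log of the word's prior probability."""
    _, prior = lexicon[word]
    return signal_score(word) + math.log(prior)

best_from_signal = max(lexicon, key=signal_score)
best_with_prior = max(lexicon, key=combined_score)
```

With these invented numbers the decoder alone picks "mouse", while adding the prior flips the choice to "house". That flip is the kind of correction Wellington describes, on a much smaller scale.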

Another issue is that speech-decoding systems often focus on signals from the nerve cells that move the articulators used to produce speech. These signals are inhibited in people with nerve damage from conditions such as MND and ALS, while in healthy individuals they simply lead to actual speech, so systems built on them would only be suitable for some people who have lost the ability to speak. "Cutting-edge research for the decoding of attempted speech has also shown that for individuals with loss of natural speech, attempting to speak can in fact be an increasingly exhaustive task to perform for extended periods," says Wellington.

Instead, decoding imagined speech—the content of our internal monologue or reading voice—could lead to a system that anyone could use and that would require less effort. Decoding imagined speech can be a challenge, though, for several reasons. Patterns of brain activity can be highly variable, for example, since different individuals often think about speaking in different ways: some people might imagine themselves speaking a word, while others form a mental image of moving their muscles while talking, resulting in different types of brain signals.

Furthermore, the background activity that results from our mental state, such as whether or not we slept well the night before, can affect the signals captured. This means that speech-related brain activity is not consistent for a single individual, either. "You will experience problems in applying a [machine learning] classifier [algorithm] that you've trained on day one, when [a participant] was really hyped, on day two, when they were really tired," says Rekrut.
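A small simulation shows why a day-one classifier can stumble on day two. Everything here is assumed for illustration: the single feature, the constant baseline shift `OFFSET` standing in for tiredness, and the remedy of standardizing each session on its own. Per-session standardization is a common, simple mitigation and is not presented as the approach Rekrut's team will take.

```python
from statistics import mean, pstdev

# Hypothetical one-dimensional feature with labels for two imagined sounds.
day1 = [(2.0, "A"), (2.2, "A"), (1.8, "A"),
        (-2.0, "B"), (-1.8, "B"), (-2.2, "B")]

OFFSET = 5.0  # assumed baseline shift on day two (for example, tiredness)
day2 = [(x + OFFSET, label) for x, label in day1]

def fit_threshold(session):
    """Place the decision threshold midway between the two class means."""
    mean_a = mean(x for x, label in session if label == "A")
    mean_b = mean(x for x, label in session if label == "B")
    return (mean_a + mean_b) / 2

def accuracy(session, threshold):
    """Fraction of samples classified correctly by the threshold rule."""
    hits = sum(("A" if x > threshold else "B") == label
               for x, label in session)
    return hits / len(session)

def standardize(session):
    """Z-score a session using its own mean and spread (labels are not used)."""
    values = [x for x, _ in session]
    mu, sigma = mean(values), pstdev(values)
    return [((x - mu) / sigma, label) for x, label in session]

# Train on raw day-one data, test on raw day-two data.
acc_raw = accuracy(day2, fit_threshold(day1))

# Train and test on data standardized within each session.
acc_std = accuracy(standardize(day2), fit_threshold(standardize(day1)))
```

In this toy, the raw classifier drops to chance on day two because every sample sits above the old threshold, while standardizing each session restores perfect accuracy. Real day-to-day changes are unlikely to be a simple constant shift, which is why long-term data collection is needed.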

In a project starting in October, Rekrut and his colleagues are aiming to tackle this problem by conducting speech-decoding studies with participants in different conditions, such as in the morning and at night, over a period of several weeks to a year. Collecting a large amount of data, as well as information from participants about their mental state, should allow them to home in on how various factors influence brain activity and the performance of BCIs. "We will try to provide all this data to a classifier and see if we can find patterns," says Rekrut. "When participants are tired, maybe we can find a certain pattern that we can then filter out from EEG activity and provide this knowledge to the community."

Imagined speech is also harder to decode than attempted speech since neural signals are more subtle. In a new project called dSPEECH, Wellington and his colleagues are therefore aiming to decipher it with much higher accuracy by investigating two different modalities that could eventually be combined: electrocorticography, an invasive approach that involves implanting electrodes beneath the skull to capture high-resolution signals on the surface of the brain, and stereoelectroencephalography (sEEG), a method that uses probes with sensors to tap into speech-related brain signals deep inside the brain.

They will also attempt to decode the 44 phonemes in the English language with a reasonable degree of accuracy by developing a signal processing and machine learning pipeline. Focusing on individual sounds that can be combined should dramatically increase the number of words that can be deciphered. Currently, the best systems can decode about 300 English words, which is insufficient even for basic communication.
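The arithmetic behind that claim is easy to check. The sketch below counts how many distinct phoneme sequences exist up to a given length, using the 44 phonemes and the roughly 300-word vocabulary mentioned above. The function name `sequences_up_to` is made up for this example, and the counts are upper bounds, since most phoneme sequences are not valid English words.

```python
PHONEMES = 44      # size of the English phoneme inventory cited above
WORD_VOCAB = 300   # approximate vocabulary of the best current systems

def sequences_up_to(length, inventory=PHONEMES):
    """Number of distinct phoneme sequences with 1..length phonemes."""
    return sum(inventory ** k for k in range(1, length + 1))

# Upper bounds on what phoneme-level decoding could spell out.
counts = {n: sequences_up_to(n) for n in (1, 2, 3)}
ratio_at_three = counts[3] / WORD_VOCAB
```

Even sequences of at most three phonemes give 87,164 combinations, hundreds of times more than a fixed 300-word vocabulary.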

Wellington's goal is to create a system in which people's ability to communicate is not restricted at all. It would also allow names to be deciphered, which is challenging for current systems since brain activity linked to each one would need to be recorded. For people who cannot speak, being able to address someone they have met by their name is an important element of communication, says Wellington.

"With phoneme-level decoding, you can suddenly say any word you want," he adds. "I'm sure that's the way forward."


Sandrine Ceurstemont is a freelance science writer based in London, U.K.

