How Automatic Speech Recognition could be used to protect language diversity
Automatic Speech Recognition (ASR) is a burgeoning technology that could become a valuable tool for documenting languages, particularly endangered languages, where we often lack the resources for human transcription: very few people speak the language, and they often live in remote places. With ASR we could enhance our ability to learn about these languages, and in doing so improve the quality of translation to and from an endangered language. The result would be better, more effective and more accurate communication, which would in turn enable indigenous peoples to equip themselves with better representation and more agency.
Language loss
Two years ago the UN launched the International Decade of Indigenous Languages, in response to some worrying predictions about the future of linguistic diversity. Currently, of the estimated 6,700 languages spoken globally, around 40% are endangered.
Languages represent complex systems of knowledge: when we lose a language, we also lose the culture and amassed knowledge that it belongs to. Furthermore, indigenous people are often already marginalised politically and socially within their own countries, due to both geographical remoteness and historical inequalities. Developing better communication with, and therefore understanding of, these peoples could be a vital tool in sustainable development and in forging stronger, more inclusive communities for all.
Linguistic diversity is vital to retain cultural diversity and with it, the rich tapestry of human existence. Indigenous languages protect and promote local cultures, traditions, customs and values; many of which are thousands of years old. Without them we would all be poorer — a rich plurality of perspectives leads to a deeper understanding of humanity.
Undoubtedly, globalisation and colonisation have had the greatest impact on language loss. However, in recent decades the pace of language loss has accelerated. Our communication is increasingly digital, and, put simply, the internet provides content in just a handful of the world’s languages, effectively excluding a large proportion of the world’s population from the platform.
What is Speech Recognition?
Specialised linguists trained in phonetic transcription work with phones. A phone is any distinct speech sound (or gesture), transcribed using a standard set of symbols that represent speech sounds, such as the International Phonetic Alphabet (IPA). Phones are relatively language independent, so they can (theoretically) be used to transcribe any set of speech sounds, regardless of whether the transcriber understands the language they are hearing.
Over time, linguists have mapped these recognised phones onto phonemes: speech units that can distinguish one word from another in a particular language. Phonemes can be represented in a writing system (i.e. an alphabet or script). And this, essentially, is how speech recognition works. Naturally, if the transcriber speaks or understands the language, they can be far more accurate in their transcription. As every translator will tell you: context is everything!
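To make that phone-to-writing step concrete, here is a deliberately simplified Python sketch (a toy illustration, not a real linguistic resource): a word captured as IPA phones is mapped, via a hand-written English correspondence table, onto Latin-alphabet spelling.

```python
# Toy illustration of phone -> phoneme -> writing system.
# The word "cats" as IPA phones, rendered with a heavily simplified
# English phoneme-to-grapheme mapping.
phones = ["k", "æ", "t", "s"]        # language-independent speech sounds
phoneme_to_grapheme = {              # simplified English correspondences
    "k": "c",
    "æ": "a",
    "t": "t",
    "s": "s",
}
spelling = "".join(phoneme_to_grapheme[p] for p in phones)
print(spelling)  # -> cats
```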
What about languages with no reliable audio data? Or no speech data at all?
What is Automatic Speech Recognition?
Like human linguists, ASR systems are trained using paired audio and transcription data in the languages they need to recognise. However, recent developments in multilingual speech recognition have made it possible to pre-train models on unlabelled audio and then fine-tune them on a smaller amount of labelled speech. OpenAI’s Whisper, which is currently leading the field in accuracy for transcription in English, is pre-trained using paired data. But if we were to train models to recognise phones, they could theoretically transcribe any given language.
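As a minimal sketch of what off-the-shelf ASR looks like in practice, the snippet below uses the openly released Whisper package to transcribe a recording; the checkpoint choice is illustrative and the file name is hypothetical.

```python
# Minimal ASR sketch using OpenAI's Whisper (pip install openai-whisper).
# "field_recording.wav" is a hypothetical file name.
import whisper

model = whisper.load_model("small")               # pre-trained multilingual checkpoint
result = model.transcribe("field_recording.wav")  # language is auto-detected by default
print(result["text"])                             # the transcription
```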
That said, the leading models in this area are currently showing a Word Error Rate (WER) of around 70%.
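For reference, WER is the standard ASR metric: the number of word substitutions, deletions and insertions needed to turn the system’s output into the reference transcript, divided by the length of the reference. A 70% WER means roughly seven errors for every ten words spoken. A small, self-contained implementation of the standard definition (not tied to any particular toolkit):

```python
# Word Error Rate: (substitutions + deletions + insertions) / reference length,
# computed with a word-level Levenshtein (edit) distance.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,           # deletion
                          d[i][j - 1] + 1,           # insertion
                          d[i - 1][j - 1] + cost)    # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 error / 6 words ≈ 0.17
```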
The problems we must solve
In order to convert phonemes into possible word sequences, ASR systems use language models (LMs), which are pre-trained with multilingual data, alongside a pronunciation guide (a lexicon). The size and data diversity of an LM are crucial for accuracy.
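The snippet below is a toy sketch of that lexicon-plus-language-model idea, with hypothetical entries and probabilities; production systems use far larger lexicons and neural or finite-state decoders, but the principle is the same.

```python
# Toy decoding sketch: a pronunciation lexicon maps phoneme sequences to words,
# and a (here, unigram) language model scores the resulting word sequence.
lexicon = {                                   # hypothetical hand-written entries
    ("h", "ə", "l", "oʊ"): "hello",
    ("w", "ɜː", "l", "d"): "world",
}
unigram_lm = {"hello": 0.6, "world": 0.4}     # toy word probabilities

def decode(phoneme_groups):
    words = [lexicon.get(tuple(g), "<unk>") for g in phoneme_groups]
    score = 1.0
    for w in words:
        score *= unigram_lm.get(w, 1e-6)      # unknown words are heavily penalised
    return words, score

print(decode([["h", "ə", "l", "oʊ"], ["w", "ɜː", "l", "d"]]))
# (['hello', 'world'], 0.24)
```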
Audio data for endangered languages is scarce or unavailable. Furthermore, text data can also be difficult to access in many endangered languages; in fact, some languages are purely oral and have no writing system. In others, the amount of data available is not sufficient to pre-train a model accurately.
The future of ASR and language diversity
In academia
Researchers at Carnegie Mellon University, whose focus is language preservation, want to expand the number of languages with automatic speech recognition tools available to them from around 200 to potentially 2,000. Their speech recognition model relies on information about how phones are shared between languages, reducing the need to build separate models for each language. Specifically, it pairs the model with a phylogenetic tree (a diagram that maps the relationships between languages), which helps with pronunciation rules. Through the model and the tree structure, the team can approximate a speech model for thousands of languages without audio data.
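The sketch below is purely illustrative of the idea rather than CMU’s actual code, and the language names and phone inventories are hypothetical: a language with no audio borrows a first-guess phone inventory from its nearest documented relatives in the tree.

```python
# Illustrative only: approximating a phone inventory for an undocumented
# language from its nearest relatives in a tiny, hypothetical family tree.
family_tree = {"lang_no_audio": ["relative_a", "relative_b"]}
phone_inventories = {
    "relative_a": {"p", "t", "k", "a", "i", "u"},
    "relative_b": {"p", "t", "k", "a", "e", "o"},
}

def approximate_inventory(language: str) -> set:
    # First guess: the union of the relatives' phones, which a shared
    # multilingual phone recogniser could then score against real audio.
    guess = set()
    for relative in family_tree[language]:
        guess |= phone_inventories[relative]
    return guess

print(sorted(approximate_inventory("lang_no_audio")))
```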
In big tech
Today, a typical voice AI system includes ASR modelling, which converts raw audio into text. This is followed by natural-language understanding (NLU) models, which determine the user’s intent and recognise any named entities. This is central to dialogue management, which routes commands to the proper services (e.g., “Hey Google, turn on Netflix”). Finally, there’s a text-to-speech model, which produces the spoken output (“Turning on Netflix”).
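A stubbed-out sketch of that pipeline is below; the function names and routing table are illustrative and do not correspond to any vendor’s actual API.

```python
# Stubbed voice-assistant pipeline: ASR -> NLU -> dialogue management -> TTS.
def asr(audio: bytes) -> str:
    return "turn on netflix"                              # raw audio -> text

def nlu(text: str) -> dict:
    return {"intent": "turn_on", "entity": "Netflix"}     # intent + named entity

def dialogue_manager(parse: dict) -> str:
    services = {"Netflix": "tv_service"}                  # route to the proper service
    _ = services[parse["entity"]]                         # (stub: would call the service here)
    return f"Turning on {parse['entity']}"

def tts(text: str) -> bytes:
    return text.encode()                                  # text -> synthesised audio (stubbed)

print(tts(dialogue_manager(nlu(asr(b"...")))))            # b'Turning on Netflix'
```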
Recent research has been looking into the possibilities of speech-to-meaning models. These could be trained to learn semantic representations of speech and map them to translated text, or to images. This could be a way of circumventing the need for a standardised writing system, enabling us to better understand oral languages.
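A rough sketch of how such a system might be queried, assuming we already had a model that embeds utterances and candidate meanings into a shared space (the vectors below are random stand-ins for real model outputs):

```python
# Speech-to-meaning retrieval sketch: pick the candidate meaning whose
# embedding is closest (by cosine similarity) to the utterance embedding.
import numpy as np

rng = np.random.default_rng(0)
utterance_embedding = rng.normal(size=512)      # embedding of a spoken utterance
candidate_meanings = {                          # pre-computed meaning embeddings
    "the river is rising": rng.normal(size=512),
    "the harvest is ready": rng.normal(size=512),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

best = max(candidate_meanings, key=lambda m: cosine(utterance_embedding, candidate_meanings[m]))
print(best)
```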
Multilingual applications are predicted to be highly profitable for the tech industry. Amazon have a stated goal of providing their virtual assistant in 1,000 languages. Google, likewise, has its own 1,000 Languages Initiative. Currently, Google’s Universal Speech Model (USM) is trained on 400 languages. Google claim this makes it “the largest language coverage seen in a speech model to date.”
Meta’s No Language Left Behind project has already developed a speech-to-speech translation system for the primarily oral language of Hokkien using a translated text intermediary.
These developments will be hugely important for our academic understanding of languages and language development (linguists are excited!). But more importantly, they will also have significant humanitarian uses, facilitating better communication with marginalised communities.
In our communities
It must be said that these tech giants are aiming to extend language coverage at the expense of human translators and linguists. They don’t wish to pay for skilled human transcribers and researchers, preferring instead to “data mine”.
This comes with the risk of further marginalising minority cultures with “AI colonialism.” In the quest to feed their AI models more data, tech companies have appropriated the faces, messages, voices and behaviours of people worldwide. These companies are enriching themselves with the data of marginalised communities, in much the same way as former colonial powers enriched themselves with appropriated resources. And, as with colonialism, the communities affected have little or no say in the development of AI, or how it will impact them.
There have been a number of initiatives to protect the flow of data. Notably, a Māori initiative has developed its own language AI tools and created mechanisms to collect, manage and protect the flow of Māori data, so that it won’t be used without the community’s consent or, worse, in ways that harm its people. Te Hiku Media, a Māori radio station that worked with its community to develop this ASR, has stressed the importance of data sovereignty for indigenous languages in particular.
Now, we will have to see how the tech giants respond.