Eliminating the Lack of Diversity in Text to Speech

March 2, 2022

The science that goes into making machines talk like humans is very complex because our speech patterns are so nuanced. So, it’s not surprising that it has taken well over 200 years for synthetic voices to get from the first speaking machine, which was able to simulate only a few recognizably human utterances, to a Samuel L. Jackson voice clone delivering the weather report on Alexa today. Talking assistants like Siri, Google Assistant, and Alexa now sound more human than ever before. Thanks to advances in AI, we’ve reached a point where it’s sometimes difficult to distinguish synthetic voices from real ones.

Despite such innovative developments, disparities persist in text to speech (TTS) systems, specifically in how certain accents and words in particular languages are delivered. For example, voice AI mispronounces words in many Asian, African, and Latin languages. The same pattern shows up in speech recognition: researchers at Stanford University recently published a study of five automatic speech recognition systems in which the average word error rate was 19 percent for white subjects compared to 35 percent for Black subjects speaking in African-American Vernacular English.
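
For readers unfamiliar with the metric, word error rate is commonly computed as the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the system's output, divided by the number of words in the reference. The sketch below is a minimal illustration of that standard definition, not the code used in the Stanford study; the function name and the example sentences are made up for this post.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences only: one substitution ("sat" -> "sit") and one
# deletion ("the") against a six-word reference.
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

A word error rate of 0.35 therefore means roughly one word in three was transcribed incorrectly, dropped, or spuriously added.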

So, why does such bias still exist in TTS systems?

While there are numerous possible reasons, a major cause boils down to the bias that prevails in the algorithms powering these TTS services. Many TTS systems are trained on limited sets of data, making them less effective. For example, the most commonly used dataset for TTS is LibriSpeech, a corpus of approximately 1,000 hours of 16 kHz read English speech. The data for LibriSpeech is derived from audiobooks read for the LibriVox project and has been carefully segmented and aligned. Although LibriSpeech was started in 2015 and LibriVox in 2005, the majority of the books used for the audio recordings were written in the 19th century or earlier. In essence, this means the source material is biased by a lack of characters of color in these stories. Furthermore, the speakers who contributed to the LibriVox project recordings were not necessarily African Americans, Latinos, or Asians.

Moreover, LibriVox contains enough English content for a speech processing corpus such as LibriSpeech to be built from it, but the same cannot be said for other languages. LibriVox comprises over 60 thousand hours of English content but comparatively less for other languages such as Greek, Latin, and Japanese. Another more recent dataset used by these voice AI systems, LibriTTS, addresses this issue.

LibriTTS is designed explicitly for text to speech. The new speech corpus is derived from the original audio and text materials of the LibriSpeech corpus and inherits its desired properties. The LibriTTS corpus consists of 585 hours of speech data at a 24 kHz sampling rate from 2,456 speakers, along with the corresponding texts. Despite the profusion of speech data in these datasets, among others, it is nevertheless difficult to find datasets that allow training the models on prosodic and emotional aspects. This serves as a second major hindrance.

Training an end-to-end speech synthesizer from scratch requires a large amount of clean speech data from a single speaker, which is not available for many low-resource languages. These languages also lack a reliable text normalization front-end and pronunciation lexicon, which means phoneme transcriptions are not available for their text either.

A third issue lies in the way the TTS systems are trained to recognize sound. At the phoneme level, these AI systems are tied to the phoneme set of the language a voice is built from. This is usually because modern speech synthesizers use recorded voice samples in a diphone database, allowing them to sound more natural. As a result, the phoneme set is restricted to the voice's language.
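
To make the phoneme-set restriction concrete, here is a small sketch of how one might check whether a word can be rendered by a given voice. Everything in it is an assumption for illustration: the function name is hypothetical, the inventory is a deliberately tiny toy set rather than a real phoneme set for any language, and the pronunciation is a rough transcription of the Spanish word "jota".

```python
def missing_phonemes(pronunciation, inventory):
    """Return the phonemes in `pronunciation` that the voice cannot produce,
    in order of first appearance and without duplicates."""
    seen = set()
    missing = []
    for phoneme in pronunciation:
        if phoneme not in inventory and phoneme not in seen:
            seen.add(phoneme)
            missing.append(phoneme)
    return missing


# Toy inventory (assumption): a handful of symbols standing in for the
# phoneme set of the language the voice was built from.
toy_voice_inventory = {"h", "k", "s", "t", "a", "o"}

# Rough transcription of Spanish "jota"; the velar fricative "x" is not
# in the toy inventory above.
jota = ["x", "o", "t", "a"]

gaps = missing_phonemes(jota, toy_voice_inventory)
```

When the list of gaps is non-empty, a system built this way typically has to substitute the nearest phoneme it does have, which is one route by which mispronunciations of foreign or accented words arise.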

Deriving correct prosody and pronunciation from written text is also a major challenge today. Written text carries no explicit emotion, and the pronunciation of certain words is sometimes highly irregular. At the low-level synthesis stage, the discontinuities and contextual effects in waveform concatenation methods are the most problematic.

Bridging the Digital Divide in Speech Synthesis AI

Eliminating bias means being meticulous with the data these TTS algorithms use and taking a more holistic approach that covers not just more diverse data but also phonetic fundamentals. It is equally important to evaluate, through focus groups, how the synthetic voices perform in different environments. In other words, data must represent not only different dialects but also different genders in order to reduce race and gender biases and improve accuracy.

Using technology that learns from multiple diverse datasets is one way to reduce the gaps and allow all voices to be heard. It’s important to emphasize that neither the deep learning technology nor the algorithm is biased upfront, but they can acquire biases if the dataset provided does not accurately represent a population.

In order to limit as many of these biases as possible within AI, it’s crucial to make sure you are representing a wide range of people with various demographics. Bias-free AI models are important so that they work for every person and create meaningful results that help solve problems, not create them.

To put it simply, whether you’re building a TTS model from scratch, or fine-tuning an existing one, quality data should include a balanced representation of different voices across gender, age groups, race, and unique speakers. With the right mix of diverse data, your AI model will perform as envisaged: able to respond to the real world with all its variations.
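
One simple way to start checking for balanced representation is to measure how much recorded audio each group contributes. The sketch below assumes a hypothetical metadata format (one dictionary per speaker with `gender`, `accent`, and `hours` fields) and an arbitrary 10 percent threshold; both are illustrative choices, not a standard.

```python
from collections import defaultdict


def duration_shares(speakers, attribute):
    """Fraction of total recorded hours contributed by each group."""
    hours_by_group = defaultdict(float)
    for speaker in speakers:
        hours_by_group[speaker[attribute]] += speaker["hours"]
    total_hours = sum(hours_by_group.values())
    if total_hours == 0:
        raise ValueError("dataset contains no recorded audio")
    return {group: hours / total_hours for group, hours in hours_by_group.items()}


def underrepresented_groups(speakers, attribute, min_share):
    """Groups whose share of recorded hours falls below `min_share`."""
    shares = duration_shares(speakers, attribute)
    return sorted(group for group, share in shares.items() if share < min_share)


# Made-up metadata for illustration only.
speakers = [
    {"speaker_id": "s1", "gender": "female", "accent": "US", "hours": 6.0},
    {"speaker_id": "s2", "gender": "male", "accent": "US", "hours": 5.0},
    {"speaker_id": "s3", "gender": "female", "accent": "Indian", "hours": 1.0},
    {"speaker_id": "s4", "gender": "male", "accent": "Nigerian", "hours": 0.5},
    {"speaker_id": "s5", "gender": "female", "accent": "US", "hours": 7.5},
]

low_accents = underrepresented_groups(speakers, "accent", min_share=0.10)
low_genders = underrepresented_groups(speakers, "gender", min_share=0.10)
```

Weighting by hours rather than by speaker count matters here: a dataset can list many speakers from a group and still contain very little of their audio. A check like this only flags gaps in the metadata it is given, so it complements listening tests with a broad group of people rather than replacing them.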

Building Voice Tech that Reflects Diversity

Adoption of and innovation in voice AI are accelerating at an unprecedented pace. For AI technology to be truly useful to the world at large, however, it has to be globally representative. It’s time we recognize that we naturally 'build for our own' and make a conscious decision to test our voice innovations with a much broader group of people.

In keeping with this focus on diversity in TTS, we at Murf have created a new range of male and female African American AI voices. Terrell is among the mature voices in Murf Studio and one of the most impactful voices in our newly launched collection of emotive AI voices. Whether you are looking for an impactful narration or creating awareness through a project, Terrell’s voice will fit right in. His voiceover style is inspirational and powerful, and works best for ads, documentaries, and motivational podcasts. Want to leave an impression? Use Terrell’s voice.

Marcus, on the other hand, is one of the most welcoming, firm, and authoritative voices. Marcus’ tone is casual and conversational, and is well suited to e-learning, commercial, and explainer content.

In addition, we have added three new female African-American voices to our range of emotive TTS voices: Brianna, Michelle, and Naomi.

Brianna's voice style adapts easily to scripts that are conversational or professional and is best suited to eLearning and explainer videos. Michelle is a more authoritative and mature voice that lends a formal tone to eLearning, corporate training, and employee onboarding content.

Naomi, on the other hand, is a more dynamic voice that has the power to inspire and convey empathy. Naomi's voice style works best for motivational advertisements, documentaries, and healthcare videos.

You can check out the new voices by logging into Murf Studio.

The diversity conversation has been slowly creeping into the world of AI voiceovers, but the historically anonymous nature of the industry, together with a very strong bias toward a particular population of people, made it seem like a fruitless cause. There is obviously still much that needs to be done, but this is a first step!