How realistic is text to speech?

March 14, 2021

While text to speech is not a new concept, it has gained immense popularity in the last few years. Speech synthesis technology has come a long way, and the possibilities are immense. From voice assistants that make restaurant reservation calls for you to audiobooks that help kids read better, many new applications have been built on the text to speech technology now available. The most recent addition to this list is voice-based shopping in home automation products.

This has also had a large impact on the way we make videos. Whether it is a product demo, an explainer video, or an e-learning video, the latest human-sounding neural text to speech lets us create professional-quality voice overs for business without recording or audio processing. This can save hundreds of dollars and many days of waiting for recorded voice overs.

Natural-sounding speech, powered by AI

Murf Studio is a text to speech tool that allows you to make realistic voice overs, with 120+ AI voices across 20+ languages. Best of all, you can time the audio precisely against your video clips or images within the studio itself. It has around 50 English voice actors across multiple accents that users can try when converting text to audio files.

Try AI voices in Murf

Benefits of TTS

Text to speech is used across a variety of applications including customer service, contact centers, e-learning, learning and development, podcasts, voice assistants, and more recently product promos and demos as well. Some of the ways users benefit from text to voice are as follows:

Save money in recording voice overs

Creating high quality voice overs typically involves paying for a voice over artist, a studio, recording equipment, post processing and so on. These days, a short 2-minute voice over can cost over a hundred dollars. Text to speech enables content creators to make professional-sounding voice overs at less than one-tenth the cost. In fact, some basic text to speech tools are free to use for users on a low budget.

Edit audio, just like you edit text

One of the biggest pain points of recording voice overs is the post processing and editing. The biggest challenge arises when a user wants to change the script, say a couple of words or sentences here and there. There is no easy way to achieve this with recorded voice audio files. With text to speech technology, however, editing is as simple as cut, copy, and paste in a Word document. It can be done in seconds, with no waiting time or added cost of re-recording.

Personalized customer interactions

Text to speech allows brands to create unique, personalized customer interactions, such as a custom voice for premium users, while keeping the customer experience consistent. Brands can choose from a range of available professional voices or even create their own custom voice. Cloud providers like Amazon Polly and Google Text-to-Speech offer brands the option to create their own unique voice, including celebrity voices, which can be used across customer touchpoints.

e.g. Check out Colonel Sanders’ AI generated voice clip for KFC in Alexa


Last year, Samuel L. Jackson’s Alexa voice created a wave of excitement across Echo users. Check out this video:

Global reach unlocked with languages and accents

Text to speech allows brands to create video content in multiple languages and voice accents. This builds a much more universal appeal at a much lower cost and helps connect with international customers effectively.

More power to educators

Text to speech has been used extensively in education in recent years to help kids who have difficulty reading. It also makes education far more inclusive by supporting learners with disabilities. It is worth noting that two factors hugely influence the effectiveness of synthetic voices in e-learning:

1. Quality of voices

Natural-sounding text to speech voices are known to work better with kids, as they are easier to listen to for longer periods of time. Custom voices are also incredibly useful for L&D content creators.

2. Reading speed

Some voice synthesis tools provide an option to set the reading pace, which needs to be in the range of 140–170 words per minute to be most effective for students (Cunningham, 2003; Cunningham, 2011).
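To make the reading-speed range concrete, here is a minimal Python sketch that estimates how long a script takes to narrate at a given pace and checks whether that pace falls in the 140–170 words per minute range mentioned above. The function names are hypothetical and chosen for illustration, and counting words by splitting on whitespace is a simplifying assumption.

```python
def narration_seconds(script: str, words_per_minute: float) -> float:
    """Estimate narration time in seconds for a script at a given pace."""
    if words_per_minute <= 0:
        raise ValueError("words_per_minute must be positive")
    # Assumption: a "word" is any whitespace-separated token.
    word_count = len(script.split())
    return word_count / words_per_minute * 60.0


def in_recommended_range(words_per_minute: float,
                         low: float = 140, high: float = 170) -> bool:
    """Check a pace against the 140-170 wpm range cited in the article."""
    return low <= words_per_minute <= high


if __name__ == "__main__":
    script = " ".join(["word"] * 300)  # stand-in for a 300-word lesson
    for pace in (140, 170):
        seconds = narration_seconds(script, pace)
        print(f"{pace} wpm: about {seconds:.0f} seconds")
```

For a 300-word script, the two ends of the range differ by roughly twenty seconds of audio, which is worth knowing when timing narration against slides.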

What is speech synthesis?

Speech synthesis, in simple words, is the process of training a computer to convert text into human sounding speech using a large database of human speech recordings. Some of the earlier simpler forms of speech synthesis technologies used a concatenation of words from actual recorded human voices to generate artificial speech. The more recent ones use neural networks to convert text into wave forms which are then used to generate synthetic speech resembling a human voice.
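The older concatenation approach described above can be illustrated with a toy Python sketch. Everything here is an assumption made for illustration: the "recordings" are short made-up lists of numbers standing in for real waveforms, and the names `RECORDED_UNITS` and `concatenate_speech` are hypothetical. Production systems select from far larger inventories of recorded units and smooth the joins between them.

```python
# Toy "recordings": each word maps to a short list of audio samples.
# These values are invented; a real system stores recorded human speech.
RECORDED_UNITS = {
    "hello": [0.1, 0.3, 0.2],
    "world": [0.4, 0.1],
}


def concatenate_speech(text, units, gap_samples=2):
    """Join per-word recordings into one waveform with short silent gaps."""
    silence = [0.0] * gap_samples
    waveform = []
    for word in text.lower().split():
        if word not in units:
            # A concatenative system can only say what it has recordings for.
            raise KeyError(f"no recording for {word!r}")
        if waveform:
            waveform.extend(silence)  # gap between consecutive words
        waveform.extend(units[word])
    return waveform


if __name__ == "__main__":
    print(concatenate_speech("Hello world", RECORDED_UNITS))
```

The sketch also shows the main limitation of this approach: any word without a recording cannot be spoken, which is one reason neural models that generate waveforms directly have taken over.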

The quality of the voice is judged on the basis of how natural and intelligible it is. Since the whole idea behind speech synthesis is to replicate a human speaking, it is not surprising that the success of the technology is also measured by its resemblance to human voices, in terms of how it sounds and how effectively it can communicate with a human mind.

Why now is the best time to try text to speech?

A recent breakthrough in this field has made it possible to generate speech in the voice of different speakers using deep neural networks for text to speech (TTS) synthesis, even in the voice of speakers who were not part of the initial training set. This is detailed in the following paper:

Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis

What’s really new for voice synthesis is that the model is able to transfer the knowledge of speaker variability and is able to synthesize natural speech from speakers that were not included in the training. This has opened up the possibilities of cloning authoritative voices, like celebrity voices.

Example of voice cloning technology

Check out this video with synthetic voice versions of US presidents talking about speech synthesis. This video was created by a YouTuber entirely using artificial intelligence technology and is meant for demonstration and entertainment purposes only.

Text to speech voices

The quality of text to speech has improved significantly with the introduction of neural voices across platforms in the last few years. These have also overcome the classic limitations of synthetic voices by allowing for a host of styling and customization features like changing pitch, speed of reading, pauses, tonality and even emphasis on certain parts of the script. Murf Studio is an advanced tool to create text to speech mp3 with natural speech.
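Styling controls like these are commonly expressed in SSML (Speech Synthesis Markup Language), a W3C standard. The Python sketch below builds a small SSML snippet using the standard `prosody`, `emphasis`, and `break` elements. The helper names are hypothetical, and this article does not state whether Murf Studio accepts SSML input; support for individual tags also varies between engines and voices, so treat this as an illustration of the idea rather than any tool's API.

```python
from xml.sax.saxutils import escape


def ssml_sentence(text, rate="medium", pitch="medium",
                  pause_ms=0, emphasize=None):
    """Wrap one sentence in SSML with speed, pitch, emphasis and a pause."""
    body = escape(text)  # escape &, < and > so the markup stays valid
    if emphasize:
        marked = f'<emphasis level="strong">{escape(emphasize)}</emphasis>'
        # Only the first occurrence of the phrase is emphasized.
        body = body.replace(escape(emphasize), marked, 1)
    ssml = f'<prosody rate="{rate}" pitch="{pitch}">{body}</prosody>'
    if pause_ms > 0:
        ssml += f'<break time="{pause_ms}ms"/>'
    return ssml


def ssml_document(sentences):
    """Combine styled sentences into a single <speak> document."""
    return "<speak>" + "".join(sentences) + "</speak>"


if __name__ == "__main__":
    doc = ssml_document([
        ssml_sentence("Welcome to the demo.", pause_ms=500),
        ssml_sentence("Save time & money", rate="slow", emphasize="time"),
    ])
    print(doc)
```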

Check out these samples from Murf Studio to get an idea of VO quality:

Quality comparison of TTS

Not all text to speech tools are high quality. In fact, some of them can sound a bit robotic. Always listen to a sample before making your choice.

To get an idea, check out these two samples, one anonymous standard voice from another cloud app and a Murf AI actor Ava introducing herself. Listen to the samples and feel free to make the choice for yourself.

Ava's Introduction - Murf natural sounding AI voice

Try Ava's voice in Murf Studio

Ava's Introduction - Standard TTS voice

Text to speech app

When it comes to choosing text to speech software, there are a lot of options available in the market. Some are free and some are paid, depending on the quality of the voices they offer. However, most of the offerings are built for the use case of a text to speech reader and only work well if you want them to read documents aloud. Most text to speech generators do not offer enough options to support creating a high quality video or presentation.

Text to speech for videos or presentations

If you are looking for text to speech software to create natural-sounding voice overs for videos and presentations, try out Murf AI Studio. It has a built-in speech synthesizer with 120+ natural-sounding voices and offers a range of customization options like pitch, speed of narration, pauses and even emphasis on certain words.

Not just that, one of the coolest things about Murf Studio is that it is really simple to achieve perfect timing of your generated audio with your visuals. You can create multiple blocks for each scene in the video and adjust timing simply by changing the size of the blocks.

Text to speech mp3 formats

Murf Studio also lets you convert text to speech as an mp3 if you only need a simple audio output. Our voice user interface lets you create as many blocks as you like, add pauses, adjust timing and so on, and finally stitch them all together with the render option. This converts the text to speech as an .mp3 file that can be exported easily and used across a variety of applications.

Text to speech download

The download options in Murf Studio are of two types:

Text to speech with download as mp3
Video download as mp4 combining speech with video clips or images

When it comes to download options, here are some of the choices available:

Audio channels – Mono/Stereo
Render quality – Full HD
Video size – Choice from a range of video aspect ratios and sizes

Pro Tips to create ultra-realistic text to speech

The ultimate goal of text to speech is to sound just like a human speaking in any kind of use case. We have now started seeing synthetic clones that can cry, scream with joy or even sing a song for you. But given where most of the technology stands today, it takes some effort on the part of the user to achieve the perfect audio for their requirements. Here are some tips that will help you create good quality text to speech:

Choose high quality natural sounding TTS
A lot of what the final audio sounds like depends on the quality of the training that has gone into a particular AI voice. There are a lot of text to speech tools online, but the quality varies hugely, ranging from natural speech in advanced voice synthesis tools like Murf Studio to the basic robotic-style voices offered by some free tools. So, if you are looking to create a good quality voice over, make sure you listen to the samples before you start.

Choose a voice that goes with your script
Once you start using TTS more frequently, you will realize that, just like human voices, different TTS voices suit different use cases. Some are trained to explain ideas in a professional way, some are good for storytelling, and some are best suited to selling. Depending on your use case, try out a few voices with your specific script before you decide which one to go for.

Split your speech text and add pauses
One of the common mistakes beginners make is to enter the entire script at once and just render the audio. A good practice is to break your script down into blocks; each block could even be a single sentence in some scripts. In Murf Studio, you can add as many blocks and paragraphs as you like and adjust the timing of each block yourself to create pauses. Simply put, breathers of a couple of seconds are great for the audience.

Punctuation matters
When working with text to speech voices, it is important that your sentences are punctuated correctly. Otherwise, the same sentence can have a very different meaning.

e.g. “Let’s eat, grandma” versus “Let’s eat grandma”.
Punctuation can potentially save lives!

Keep a consistent tone
Most speech text works best when the tonality of speaking is fairly consistent. If you absolutely need to introduce multiple tonalities in a script for character portrayal, try using voice styling options to increase the range. Another cool trick is to use a combination of voices in the same voice over.
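The tip above about splitting a script into blocks and adding pauses can be sketched in a few lines of Python. This is only an illustration of the idea, not how Murf Studio works internally: the function names are hypothetical, sentences are split with a simple regular expression, block length is estimated from an assumed reading pace of 150 words per minute, and the default pause of 2 seconds follows the "couple of seconds" suggestion.

```python
import re


def split_into_blocks(script):
    """Split a script into sentence-sized blocks at ., ! or ? boundaries."""
    sentences = re.split(r"(?<=[.!?])\s+", script.strip())
    return [s for s in sentences if s]


def build_timeline(blocks, words_per_minute=150, pause_seconds=2.0):
    """Assign start and end times to each block, with a pause between them."""
    timeline = []
    clock = 0.0
    for text in blocks:
        # Assumption: duration is estimated from word count and reading pace.
        duration = len(text.split()) / words_per_minute * 60.0
        timeline.append({"text": text, "start": clock, "end": clock + duration})
        clock += duration + pause_seconds
    return timeline


if __name__ == "__main__":
    script = "Welcome to the demo. Let's begin! Any questions?"
    for block in build_timeline(split_into_blocks(script)):
        print(f'{block["start"]:6.2f}s - {block["end"]:6.2f}s  {block["text"]}')
```

Planning the timeline this way makes it easy to see where each pause lands before rendering any audio.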

We at Murf are really excited about the future of speech synthesis technology and will keep sharing more tips and tricks on how to make the most of it for your videos and presentations. We hope you find this article helpful, and watch out for more updates from us.
