Speech Customization
Murf’s AI models not only generate natural-sounding speech quickly but also give you powerful customization controls to shape the output with precision and personality. Through intuitive controls, you can fine-tune every detail to bring your creative vision to life.
Voices
Murf offers a diverse collection of 150+ AI voices across different accents, genders, and speaking styles, designed to suit a wide range of use cases, from narration and marketing to training and conversation. The voiceId key is a required parameter in the Synthesize Speech operation’s request body and specifies which voice is used to generate the audio output. Each voice has its own tonal profile and supports different features, such as styles and MultiNative locales.
Styles
Murf Styles let developers adapt voice output to different contexts. Each voice supports multiple predefined styles that modify tone, emotional inflection, and delivery patterns. By passing the style parameter, you can programmatically shift a neutral voice toward a specific delivery, such as promotional, newscast, conversational, or inspirational, to meet your application’s requirements.
Use the style key to select which style to use for your audio generation.
You can explore all supported styles and hear audio samples in our Voice Library.
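As a minimal sketch, a request body that selects a voice and a style could be assembled as shown here. Only the key names text, voiceId, and style come from the description above; the helper name build_speech_request and the voice ID and style values are placeholders, so substitute real ones from the Voice Library.

```python
import json


def build_speech_request(text, voice_id, style=None):
    """Assemble a request body for the Synthesize Speech operation.

    Hypothetical helper: only the key names (text, voiceId, style)
    are taken from the documentation.
    """
    if not voice_id:
        # voiceId is described as a required parameter.
        raise ValueError("voiceId is required")
    body = {"text": text, "voiceId": voice_id}
    if style is not None:
        body["style"] = style
    return body


# Placeholder voice ID and style value.
request_body = build_speech_request(
    "Welcome to the show.", voice_id="YOUR_VOICE_ID", style="Conversational"
)
print(json.dumps(request_body, indent=2))
```

Leaving style unset omits the key entirely, so the voice is requested without an explicit style.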
Pronunciations
While our models are capable of handling complex pronunciations of heteronyms, acronyms, numbers, and proper nouns, you might sometimes need a specific pronunciation for certain words. Our custom pronunciation feature lets you adjust how words are spoken to match your context or accent preferences.
The pronunciationDictionary key in the Synthesize Speech operation’s request body is used to specify custom pronunciations.
You can specify a custom pronunciation either as an IPA transcription or as an alternate word. IPA (the International Phonetic Alphabet) is an internationally recognized set of phonetic symbols based on the principle of strict one-to-one correspondence between sounds and symbols.
Pronunciations are specified in a key-value pair format, where the key is the word that needs to be changed, and the value is an object that specifies the pronunciation type and value.
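The key-value layout described above might look like the following sketch. The field names type and pronunciation and the type labels IPA and SAY_AS are assumptions made for illustration, since the text only says the object carries a pronunciation type and value; confirm the exact spelling in the API reference before relying on them.

```python
import json


def pronunciation_entry(kind, value):
    """Build one dictionary value.

    Assumed field names ("type", "pronunciation"); not confirmed
    by the documentation above.
    """
    return {"type": kind, "pronunciation": value}


# Key: the word to change. Value: how it should be spoken.
pronunciation_dictionary = {
    # IPA transcription for the adjective "live" (as in "live show").
    "live": pronunciation_entry("IPA", "laɪv"),
    # Alternate wording for a number (assumed "SAY_AS" label).
    "2024": pronunciation_entry("SAY_AS", "twenty twenty four"),
}

request_body = {
    "text": "The concert is live in 2024.",
    "voiceId": "YOUR_VOICE_ID",  # placeholder
    "pronunciationDictionary": pronunciation_dictionary,
}
print(json.dumps(request_body, ensure_ascii=False, indent=2))
```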
MultiNative
MultiNative voices enable text-to-speech synthesis that sounds authentically native across multiple languages. A single voice can speak several languages while preserving the natural pronunciation patterns specific to each one, which largely removes the “foreign accent” effect common in conventional multilingual TTS systems.
Use the multiNativeLocale key to select which locale to use for your audio generation.
Make sure the locale that you send in multiNativeLocale is supported by your chosen voice. You can see the list of supported locales for each voice in the Voice Library.
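Because an unsupported locale is easy to send by mistake, one option is to check it on the client side first. The sketch below assumes you have copied the supported locales for your voice from the Voice Library into a set; the helper name, the voice ID, and the locale codes are illustrative placeholders.

```python
def with_multinative_locale(body, locale, supported_locales):
    """Return a copy of the request body with multiNativeLocale set.

    supported_locales is a set you maintain yourself from the
    Voice Library listing for the chosen voice (hypothetical data).
    """
    if locale not in supported_locales:
        raise ValueError(
            f"{locale!r} is not supported by voice {body.get('voiceId')!r}"
        )
    updated = dict(body)  # leave the caller's dict untouched
    updated["multiNativeLocale"] = locale
    return updated


# Placeholder voice ID and example locale codes.
base = {"text": "Bonjour tout le monde.", "voiceId": "YOUR_VOICE_ID"}
supported = {"en-US", "fr-FR", "es-ES"}

request_body = with_multinative_locale(base, "fr-FR", supported)
print(request_body["multiNativeLocale"])
```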
Pauses
Our models are capable of adding natural pauses based on the text and context. In some cases, you may want to adjust the pause duration between two words to achieve the desired effect in your speech.
In the Synthesize Speech operation, the text key of the request body holds the text to be synthesized. To add a pause between words in your script, insert Murf’s pause syntax into this text: [pause <duration>].
Replace the <duration> part of the syntax with the pause length in seconds, and the generated voiceover will contain silence for that duration. The pause duration can be between 0.1s and 5s.
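A small helper can build the pause tag and reject durations outside the documented range. This is a sketch: the helper name is hypothetical, and writing the duration with a trailing "s" (for example [pause 1.5s]) is an assumption based on how the range above is written, so verify the exact duration format against the API reference.

```python
def pause_tag(seconds):
    """Return a pause tag for the given duration in seconds.

    Enforces the documented 0.1 to 5 second range. The trailing "s"
    in the tag is an assumed format, not confirmed by the text above.
    """
    if not 0.1 <= seconds <= 5:
        raise ValueError("pause duration must be between 0.1 and 5 seconds")
    # ":g" drops a trailing ".0", so 1.0 is written as "1".
    return f"[pause {seconds:g}s]"


text = f"Ready? {pause_tag(1.5)} Let's begin."
print(text)
```

The resulting string is what you would send in the text key of the request body.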
Audio Duration
The audioDuration key in the Synthesize Speech operation’s request body lets you specify the desired length of the generated audio (in seconds), and the system adjusts the speech to fit this duration.
This can be useful for matching voiceovers with specific audio lengths or other time constraints. The system will try to match the duration of the generated audio to audioDuration as closely as possible.
If there’s a significant difference between the requested and actual duration, consider changing the text length or audioDuration value for better alignment.
- Valid values: A double value representing the time in seconds.
- Guideline: As a rule of thumb, ~150 words (about 1000 characters) of text generates around 60 seconds of audio.
- Availability: Only available for the Gen2 model.
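The rule of thumb above can be turned into a quick sanity check before you pick an audioDuration value. The sketch below simply applies the 150-words-per-60-seconds guideline; the function names are hypothetical, and the estimate is only as good as that guideline.

```python
WORDS_PER_MINUTE = 150  # rule of thumb: ~150 words is about 60 seconds


def estimate_duration_seconds(text):
    """Rough natural duration of the text, from the word count alone."""
    return len(text.split()) / WORDS_PER_MINUTE * 60


def with_audio_duration(body, seconds):
    """Return a copy of the request body with audioDuration set."""
    if seconds <= 0:
        raise ValueError("audioDuration must be a positive number of seconds")
    updated = dict(body)
    updated["audioDuration"] = float(seconds)  # documented as a double
    return updated


script = " ".join(["word"] * 75)  # stand-in for a 75-word script
estimate = estimate_duration_seconds(script)
body = with_audio_duration({"text": script, "voiceId": "YOUR_VOICE_ID"}, 32)
print(f"estimated {estimate:.1f}s, requested {body['audioDuration']:.1f}s")
```

If the requested value is far from the estimate, that is a hint to shorten or lengthen the text, or to pick a different audioDuration.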
Speed
The rate key in the Synthesize Speech operation’s request body controls the speed at which the voice speaks. Adjusting this parameter lets you make the voice output faster or slower.
Higher values mean higher speed, and lower values slow down the speech.
- Valid values: Any integer between -50 and 50
- Default value: 0
Pitch
The pitch key controls the tone or frequency of the generated voice. Increasing the pitch makes the voice sound higher (more treble), while decreasing it results in a deeper (more bass) voice.
- Valid values: Any integer between -50 and 50
- Default value: 0
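Since rate and pitch share the same range and default, one validation helper can cover both. This is a sketch with hypothetical helper names; only the key names rate and pitch, the integer range of -50 to 50, and the default of 0 come from the text above.

```python
def check_offset(name, value):
    """Validate a rate or pitch value: an integer from -50 to 50."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError(f"{name} must be an integer")
    if not -50 <= value <= 50:
        raise ValueError(f"{name} must be between -50 and 50")
    return value


def with_prosody(body, rate=0, pitch=0):
    """Return a copy of the request body with rate and pitch set."""
    updated = dict(body)
    updated["rate"] = check_offset("rate", rate)
    updated["pitch"] = check_offset("pitch", pitch)
    return updated


# Slightly faster and slightly deeper than the defaults.
request_body = with_prosody(
    {"text": "Breaking news.", "voiceId": "YOUR_VOICE_ID"}, rate=15, pitch=-10
)
print(request_body)
```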
Variations
Variation controls how much the generated voiceover varies along three primary parameters: pause, pitch, and speed. A higher variation value results in a more dynamic voice output, incorporating changes in speech delivery, pitch shifts, and pauses to make the audio sound more natural and less robotic.
Increasing the value adds more variation in voice style, with noticeable shifts in pause, pitch, and speed.
- Valid values: An integer between 0 and 5
- Default value: 1
- Availability: Only available for the Gen2 model
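To compare how different levels sound, you could prepare one request body per level and synthesize each. The sketch below assumes the request key is named variation, which the text above does not state, and it leaves out model selection because that parameter is not described here.

```python
def with_variation(body, level=1):
    """Return a copy of the request body with a variation level set.

    The key name "variation" is an assumption; the documented range
    is an integer from 0 to 5 with a default of 1.
    """
    if isinstance(level, bool) or not isinstance(level, int):
        raise TypeError("variation must be an integer")
    if not 0 <= level <= 5:
        raise ValueError("variation must be between 0 and 5")
    updated = dict(body)
    updated["variation"] = level
    return updated


base = {"text": "Once upon a time.", "voiceId": "YOUR_VOICE_ID"}

# One body per level, from flattest (0) to most dynamic (5).
bodies = [with_variation(base, level) for level in range(6)]
print([b["variation"] for b in bodies])
```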