Text to Speech

Learn how to convert text into natural-sounding speech using Murf AI’s Text to Speech API.

Murf provides a powerful Text to Speech API that allows you to generate high-quality, natural-sounding speech from text input. The API supports over 13 languages and 20 speaking styles across 130 voices to suit your application’s needs.

Quickstart

The Synthesize Speech endpoint lets you generate speech from text input. You can either use the REST API directly or use one of our official SDKs to interact with the API.

from murf import Murf

client = Murf(
    api_key="YOUR_API_KEY"  # Not required if you have set the MURF_API_KEY environment variable
)

res = client.text_to_speech.generate(
    text="There is much to be said",
    voice_id="en-US-terrell",
)

print(res.audio_file)

A link to the audio file will be returned in the response. You can use this link to download the audio file and use it wherever you need it. The audio file will be available for download for 24 hours after generation.
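Because the link expires, it is usually worth saving the file soon after generation. The following is a minimal sketch using only the Python standard library; `download_audio` is a hypothetical helper (not part of the Murf SDK) and the destination filename is an assumption.

```python
import shutil
import urllib.request
from pathlib import Path


def download_audio(url: str, dest: str) -> Path:
    """Stream the audio at `url` into the file `dest` and return its path."""
    dest_path = Path(dest)
    # Copy in chunks so large files are not held fully in memory.
    with urllib.request.urlopen(url) as response, dest_path.open("wb") as out:
        shutil.copyfileobj(response, out)
    return dest_path


# Example (assumes `res` comes from the quickstart call above):
# download_audio(res.audio_file, "speech.wav")
```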

Supported Output Formats

The API supports multiple output formats for the generated audio. The default format is WAV. You can choose from the following formats:

| Format | Description |
|--------|-------------|
| WAV | Uncompressed audio format, useful for low-latency applications as it eliminates the need for decoding. |
| MP3 | Compressed audio format, widely supported and suitable for applications where file size is a concern. |
| FLAC | Lossless compressed audio format, ideal for applications requiring high audio fidelity without the large file size of uncompressed formats. |
| ALAW | Compressed audio format commonly used in telephony, providing a good balance between audio quality and bandwidth usage. |
| ULAW | Another compressed audio format used in telephony, similar to ALAW but with slightly different compression characteristics. |

You can specify the output format using the format parameter in the request payload of the Synthesize Speech endpoint.

Furthermore, you can use the channelType and sampleRate keys to specify the channel type and sample rate for the generated audio. The API supports stereo and mono channels, and sample rates of 8000, 24000, 44100, and 48000 Hz.

from murf import Murf

client = Murf()
res = client.text_to_speech.generate(
    text="Hello world!",
    voice_id="en-US-julia",
    format="MP3",
    channel_type="STEREO",
    sample_rate=44100
)
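When calling the REST API directly, the same options travel as keys in the JSON request payload. The sketch below only builds that payload and does not send it. The keys `format`, `channelType`, and `sampleRate` are the names given above; `text` and `voiceId` are assumed camelCase equivalents of the SDK arguments, and `build_payload` is a hypothetical helper. The channel and sample-rate defaults simply mirror the SDK example and are not documented API defaults.

```python
import json


def build_payload(text, voice_id, fmt="WAV", channel_type="STEREO", sample_rate=44100):
    """Build a JSON-serializable request body for speech synthesis."""
    return {
        "text": text,          # assumed key name
        "voiceId": voice_id,   # assumed key name
        "format": fmt,
        "channelType": channel_type,
        "sampleRate": sample_rate,
    }


payload = build_payload("Hello world!", "en-US-julia", fmt="MP3")
print(json.dumps(payload, indent=2))
```

Check the API reference for the exact schema before sending a payload like this.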

ULAW and ALAW formats only support mono channel type and a sample rate of 8000 Hz. If you specify a different channel type or sample rate, the API will default to the supported values.
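To make the fallback explicit on the client side, the rule can be sketched as a small function. `effective_audio_settings` is a hypothetical helper; the uppercase strings (`"MONO"`, `"ALAW"`, and so on) are assumed to follow the `"MP3"` and `"STEREO"` values shown earlier, and raising `ValueError` for unsupported values is a local choice, since the API's behaviour in that case is not described here.

```python
SUPPORTED_SAMPLE_RATES = {8000, 24000, 44100, 48000}
SUPPORTED_CHANNELS = {"MONO", "STEREO"}
TELEPHONY_FORMATS = {"ALAW", "ULAW"}


def effective_audio_settings(fmt, channel_type, sample_rate):
    """Return the (format, channel_type, sample_rate) implied by the rules above."""
    fmt = fmt.upper()
    channel_type = channel_type.upper()
    if channel_type not in SUPPORTED_CHANNELS:
        raise ValueError(f"unsupported channel type: {channel_type}")
    if sample_rate not in SUPPORTED_SAMPLE_RATES:
        raise ValueError(f"unsupported sample rate: {sample_rate}")
    if fmt in TELEPHONY_FORMATS:
        # ALAW/ULAW are mono at 8000 Hz only; other values fall back to these.
        return fmt, "MONO", 8000
    return fmt, channel_type, sample_rate


print(effective_audio_settings("ULAW", "STEREO", 44100))  # ('ULAW', 'MONO', 8000)
```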

Base64 Encoding

You can receive the audio as a Base64-encoded string by setting the Base64 encoding parameter (encode_as_base_64 in the Python SDK) to true in the request payload. This is useful when you need to embed the audio directly in your application or store it in a database.

from murf import Murf

client = Murf()
res = client.text_to_speech.generate(
    text="Hello world!",
    voice_id="en-US-julia",
    encode_as_base_64=True
)

The response will include the audio file encoded in Base64 format, which you can decode and use as needed.

Response
{
    ...,
    "encodedAudio": "U29tZSB0ZXh0IHNob3cgd2l0aCB0aGF0Lg==...",
    ...
}
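Decoding the string back into audio bytes needs nothing beyond the standard library. A minimal sketch follows; `save_encoded_audio` is a hypothetical helper, and the destination filename is an assumption.

```python
import base64
from pathlib import Path


def save_encoded_audio(encoded_audio: str, dest: str) -> Path:
    """Decode a Base64 string and write the raw audio bytes to `dest`."""
    audio_bytes = base64.b64decode(encoded_audio)
    dest_path = Path(dest)
    dest_path.write_bytes(audio_bytes)
    return dest_path
```

With the SDK, the string would presumably be read from the response object (for example an `encoded_audio` attribute, assumed here by analogy with `audio_file`) and passed as the first argument.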

gzip Support

Responses from Murf API can be gzipped by including “gzip” in the accept-encoding header of your requests. This is especially beneficial if you choose to return the audio response as a Base64 encoded string.

from murf import Murf

client = Murf()

client.text_to_speech.generate(
    text="Hello, World!",
    voice_id="en-US-natalie",
    encode_as_base_64=True,
    request_options={
        'additional_headers': {
            'accept-encoding': 'gzip'
        }
    }
)
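The reason gzip pairs well with Base64 is that Base64 text uses only 64 distinct characters, so each byte carries about 6 bits of information and compresses readily. The rough illustration below uses random bytes as a stand-in for audio data, which is an assumption; real audio will compress by a different ratio.

```python
import base64
import gzip
import random

rng = random.Random(0)
# Stand-in for binary audio data (not real audio).
raw = bytes(rng.getrandbits(8) for _ in range(100_000))

encoded = base64.b64encode(raw)      # what a Base64 response body would carry
compressed = gzip.compress(encoded)  # what travels over the wire with gzip

print(len(raw), len(encoded), len(compressed))
```

The Base64 text is about one third larger than the raw bytes, and gzip recovers most of that overhead.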

FAQ

Audio formats define how sound data is stored and compressed. Choose MP3 for web streaming due to its small size, WAV for high-quality recordings, FLAC for lossless compression with reduced size, and ALAW/ULAW for telephony systems. Base64 encodes audio as text, making it useful for embedding in APIs or data transfers.

Audio channels define the number of sound signals in a recording.

  • Mono (1 channel): Best for voice calls, podcasts, and telephony—ensuring clarity.
  • Stereo (2 channels): Preferred for music, films, and immersive experiences where directional sound matters.

The sample rate (measured in Hz) determines audio detail:

  • 8000 Hz: Telephony & VoIP (mandatory for ALAW/ULAW).
  • 24000 Hz: Balanced for podcasts and e-learning.
  • 44100 Hz: CD-quality audio.
  • 48000 Hz: Industry standard for film and professional audio.

Higher sample rates improve quality but increase file size, so choose based on your needs.
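To get a feel for the trade-off, the size of uncompressed audio can be estimated as sample rate × channels × bytes per sample × duration. The sketch below assumes 16-bit samples for PCM audio (the bit depth the API actually uses is not stated here) and 8-bit samples for ALAW/ULAW, which is how those telephony codecs store data. `pcm_size_bytes` is a hypothetical helper.

```python
def pcm_size_bytes(sample_rate, channels, seconds, bits_per_sample=16):
    """Size of raw, uncompressed audio data in bytes (file headers excluded)."""
    return sample_rate * channels * (bits_per_sample // 8) * seconds


# One minute of stereo audio at 44100 Hz, 16-bit samples (assumed bit depth).
print(pcm_size_bytes(44100, 2, 60))                    # 10584000
# One minute of mono telephony audio at 8000 Hz, 8-bit samples.
print(pcm_size_bytes(8000, 1, 60, bits_per_sample=8))  # 480000
```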

Base64 encodes audio as text, which makes it useful for embedding audio in JSON or XML payloads and for transfers where binary data isn’t supported, such as some web-based applications. Because Base64 output is roughly one third larger than the original binary, it’s best used for compatibility rather than storage efficiency.
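One common way to embed Base64 audio in a web page is a data URI. The sketch below is an illustration only: `to_data_uri` is a hypothetical helper, and the `audio/wav` MIME type is an assumption that should match the format you actually requested.

```python
def to_data_uri(encoded_audio: str, mime_type: str = "audio/wav") -> str:
    """Wrap a Base64 string in a data URI usable as an <audio> element source."""
    return f"data:{mime_type};base64,{encoded_audio}"


print(to_data_uri("QUJD"))  # data:audio/wav;base64,QUJD
```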
