The Fastest, Most Efficient Text-to-Speech API in Production
Text-to-speech (TTS) systems operate within a complex set of trade-offs. They must produce speech that is expressive and humanlike, maintain very low latency, and do so within stringent cost and scalability limits. Improvements in one area, such as expressiveness or naturalness, often lead to inefficiencies in latency or concurrency. Balancing these factors requires more than incremental optimization. It calls for a fundamental rethinking of model architecture, data preprocessing, and inference strategy.
Murf Falcon introduces a new approach to streaming TTS.

- A compute-efficient proprietary neural architecture outperforms much larger systems in context awareness while delivering the speed benefits of a smaller model.
- Edge-level deployment minimizes network-hop variability, ensuring consistently low latency across regions.
- Dynamic selection of cost-efficient GPU infrastructure available in each location optimizes for both performance and cost.
This combination of architectural efficiency and intelligent deployment delivers significant latency improvements while reducing the overall token-processing cost of the model, making it exceptionally efficient across every parameter.
To demonstrate this, we benchmarked Murf Falcon against leading TTS APIs on both latency and voice quality. We plotted the benchmark data on efficiency quadrants to determine which model achieved the best balance of speed, quality, and cost in production.
Latency
This study measured time-to-first-audio (TTFA) for Murf Falcon and leading TTS APIs, including ElevenLabs, Google, OpenAI, Deepgram, and Cartesia, across multiple global regions to evaluate latency consistency and network variability.
TTFA is defined as the delay between the initiation of a synthesis request and the receipt of the first audio frame. We believe this end-to-end latency is a better measure of responsiveness because it is what customers actually experience and it is independently verifiable.
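For illustration, the snippet below shows one way to measure TTFA against a generic streaming HTTP endpoint. The URL, payload shape, and headers are placeholders, not the actual API surface of any of the systems tested.

```python
import time

import requests  # third-party: pip install requests


def measure_ttfa(url: str, payload: dict, headers: dict) -> float:
    """Return time-to-first-audio in milliseconds for one streaming request."""
    start = time.perf_counter()
    # stream=True lets us observe the first audio chunk as soon as it arrives
    with requests.post(url, json=payload, headers=headers, stream=True) as resp:
        resp.raise_for_status()
        for chunk in resp.iter_content(chunk_size=1024):
            if chunk:  # first non-empty chunk == first audio frame received
                return (time.perf_counter() - start) * 1000.0
    raise RuntimeError("stream ended before any audio arrived")


# Hypothetical usage; endpoint and payload are illustrative only
ttfa_ms = measure_ttfa(
    "https://api.example.com/v1/tts/stream",
    {"text": "Hello, world.", "voice": "en-US-1"},
    {"Authorization": "Bearer <API_KEY>"},
)
print(f"TTFA: {ttfa_ms:.0f} ms")
```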
Using apiping.io, a geo-distributed API relay, identical streaming TTS requests were triggered from 33 global edge locations for each API: Murf Falcon, ElevenLabs Flash v2.5, OpenAI 4o-mini-TTS, Cartesia Sonic Turbo, and Deepgram Aura 2.
The system recorded DNS resolution, connection setup, TLS handshake, TTFA, and total response time. Both per-region and average latencies were measured to assess performance consistency.
Average latency denotes the mean time-to-first-audio measured across 33 global locations. By sampling many measurement points, this metric provides a robust estimate of the end-to-end latency typically experienced in real-world production environments.
When measured globally, Murf Falcon delivers an average TTFA of 130ms, setting the benchmark for real-world performance. Falcon is 44% faster than the next fastest model, Cartesia Sonic Turbo, and is significantly ahead of ElevenLabs and Deepgram. In production, Falcon delivers the fastest text-to-speech performance.
Voice agents serve users across the globe, so TTS latency performance must hold consistently across geographies, not just at a single test location.
Across key regions, Murf Falcon consistently delivers around 130ms time-to-first-audio. In head-to-head benchmarks, Falcon leads in 9 of 10 key business centers. It outperforms ElevenLabs and Deepgram in all 10 regions, and Cartesia in 9 of 10.
It is not enough for latency to be low; it must also be delivered consistently. Latency consistency is measured using the coefficient of variation (CoV), which quantifies how much latency fluctuates across geographies. A lower CoV indicates greater consistency and stability in performance.
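Concretely, CoV is the standard deviation of per-region TTFA divided by the mean. A minimal sketch, using made-up latency figures:

```python
import statistics

# Hypothetical per-region TTFA measurements in milliseconds (illustrative values)
ttfa_by_region = {"us-east": 121, "eu-west": 138, "ap-south": 152, "sa-east": 129}

latencies = list(ttfa_by_region.values())
mean = statistics.mean(latencies)
cov = statistics.stdev(latencies) / mean  # sample standard deviation over mean
print(f"mean TTFA: {mean:.0f} ms, CoV: {cov:.2f}")
```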
Murf Falcon records the lowest CoV of 0.17 among all models, demonstrating that it is not only the fastest but also the most predictable and stable text-to-speech system in production.
Voice Quality
We benchmarked Murf Falcon’s voice quality across multiple languages, comparing it against leading text-to-speech models using the Voice Quality Metric (VQM).
VQM is a purpose-built evaluation metric for voice agents. It provides a unified score that measures how natural, accurate, and reliable a voice sounds in real-world production. Instead of asking “Does it sound nice?”, VQM asks, “Does it say the right thing, in the right way, across complex, real-world inputs?”
Voice agents often handle critical information such as prices, health data, IDs, or timelines, where even a single mispronounced unit or digit can cause confusion or, in some cases, serious errors.
The Voice Quality Metric combines five distinct sub-metrics: Naturalness, Numerical Accuracy, Domain Accuracy, Multilingual Accuracy, and Contextual Accuracy. Each sub-metric is normalized on a 0–1 scale, then weighted and aggregated to generate a final composite score.
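Mechanically, the composite is a weighted average of the normalized sub-scores. The sketch below illustrates the aggregation; the weights shown are placeholders, since the actual weightages are those assigned in the VQM table.

```python
# Placeholder weights -- the actual weightages are defined in the VQM table
WEIGHTS = {
    "naturalness": 0.30,
    "numerical_accuracy": 0.20,
    "domain_accuracy": 0.20,
    "multilingual_accuracy": 0.15,
    "contextual_accuracy": 0.15,
}


def vqm(sub_scores: dict[str, float]) -> float:
    """Weighted composite of sub-metrics, each already normalized to 0-1."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9  # weights must sum to 1
    return sum(WEIGHTS[name] * sub_scores[name] for name in WEIGHTS)
```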
We conducted comprehensive evaluations across all sub-metrics to arrive at the aggregate VQM score.
1. Naturalness
Study Overview
This study measured speech naturalness using UTMOS (UTokyo-SaruLab MOS), a neural MOS predictor developed by UTokyo SaruLab and trained on human-rated speech datasets. UTMOS estimates perceived naturalness on a 1–5 scale and has shown high correlation with human judgments in large-scale evaluations such as the VoiceMOS Challenge 2022, making it suitable for automated benchmarking.
Methodology
Each TTS model’s outputs were evaluated with UTMOS after preprocessing audio to the model’s native mono sample rate. Predictions were computed for every sample, and the median score per model was used to minimize outlier effects. Scores were normalized to a 0–1 scale by dividing by 5.
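Assuming a UTMOS predictor is available as a callable (any implementation that maps a waveform to a 1–5 MOS estimate will do), the scoring step reduces to:

```python
import statistics


def naturalness_score(waveforms, predict_utmos) -> float:
    """Median UTMOS across all samples, normalized to a 0-1 scale.

    `predict_utmos` is a stand-in for any UTMOS predictor returning a
    1-5 MOS estimate; the median damps the effect of outlier samples.
    """
    scores = [predict_utmos(w) for w in waveforms]
    return statistics.median(scores) / 5.0  # 1-5 MOS -> 0-1
```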
Dataset Construction
The test set comprised 3,000 sentences ranging from 3 to 12 seconds in length, representing typical voice-agent interactions such as prompts, confirmations, and short explanations. Language distribution: 200 US English, 100 UK English, 100 Spanish, 100 Hindi, and 100 French.
Illustrative Audio Samples
Thanks, John. For privacy, I can only discuss details with the patient. Do you happen to know Alex and can pass along a message?
Results and Analysis
Murf Falcon achieved a median naturalness score of 0.7, outperforming OpenAI, Cartesia, and Deepgram, and trailing slightly behind ElevenLabs at 0.73.

2. Numerical Accuracy
Study Overview
This study measured how accurately each TTS model verbalized numbers, digits, and measurement expressions in natural sentences. It covered currencies, quantities, dates, ranges, and units such as "$1.05", "37 °C", "2.5 kg", "120 km/h", "2025", "09:45", and "+1-415-555-0199".
Methodology
Each model’s output was evaluated on its ability to correctly speak numeric and symbolic expressions across categories such as integers, decimals, dates, and digit sequences within mixed text–number contexts. Outputs were compared to canonical spoken forms (for example, “one dollar and five cents”), with regional variants accepted if clear and documented in the rubric. Accuracy was computed as the percentage of correctly rendered items, verified through blind human review, and then normalized to a 0–1 scale to enable aggregation with other metrics.
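As a sketch of the scoring logic (the real rubric is larger and human-adjudicated), each test item maps to its canonical spoken form plus any accepted regional variants, and accuracy is the fraction of items rendered correctly:

```python
# Illustrative rubric entries: canonical spoken forms plus accepted variants
RUBRIC = {
    "$1.05": {"one dollar and five cents", "a dollar five"},
    "37 °C": {"thirty-seven degrees celsius"},
    "09:45": {"nine forty-five", "quarter to ten"},
}


def numerical_accuracy(rendered: dict[str, str]) -> float:
    """Fraction of items whose rendered spoken form matches an accepted variant."""
    correct = sum(
        rendered.get(item, "").strip().lower() in variants
        for item, variants in RUBRIC.items()
    )
    return correct / len(RUBRIC)  # already on a 0-1 scale
```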
Dataset Construction
The test set comprised 75 short to medium-length sentences representing common real-world voice-agent interactions involving prices, weights, temperatures, percentages, and dosage units.
Illustrative Audio Samples
Your PPO health insurance claim for $2,347.89, submitted on 09/15/2025, has been partially approved with a deductible of $500 and a co-insurance rate of 20 percent.
Results and Analysis
With a normalized score of 0.8, Murf Falcon delivers the highest numerical accuracy, clearly surpassing all other evaluated models.
3. Domain Accuracy
Study Overview
This study measured how accurately TTS models pronounced technical terms, abbreviations, and acronyms common in professional, medical, and enterprise contexts, including items like API, ECG, OTP, UPI, Wi-Fi, and SpO₂.
Methodology
Each model’s output was evaluated on pronunciation accuracy within contextual sentences from banking, healthcare, and enterprise domains. Ground truth was based on standardized spoken forms (e.g., “S-P-O-two,” “U-P-I”), with accepted variants like “Sequel” for SQL. Human reviewers validated the outputs through blind adjudication. Final accuracy was reported as the percentage of correctly pronounced terms and then normalized to a 0–1 scale to enable aggregation with other metrics.
Dataset Construction
The test set included 60 short to medium-length sentences drawn from key professional domains:
- Banking and finance: IFSC, UPI, OTP, PAN, KYC, EMI, ROI
- Medical: SpO₂, BMI, MRI, COVID-19
- Industry and technology: API, CPU, HTTP, SQL, AI, ML, SaaS, IoT
Illustrative Audio Samples
Please send your resume to hr@infotech.in
Results and Analysis
With a normalized score of 0.83, Murf Falcon surpasses ElevenLabs and Cartesia, trailing behind OpenAI and Deepgram by a small margin.

4. Multilingual Accuracy
Study Overview
This study measured how well each TTS model maintained correct pronunciation and fluency in code-switched sentences, which are single utterances that mix languages such as “Let’s meet mañana at 5” or “Send it to ड्राइव please.”
Methodology
Each model’s output was evaluated on its ability to switch smoothly between languages within a single utterance while preserving pronunciation and prosody. Evaluation focused on English sentences containing inline Hindi or Spanish phrases. Bilingual reviewers verified accuracy at the phoneme and word level. Final scores were computed as the percentage of correctly rendered code-switched items and then normalized to a 0–1 scale to enable aggregation with other metrics.
Dataset Construction
The dataset comprised 60 short to medium-length code-switched sentences in English–Hindi and English–Spanish, each reflecting natural conversational contexts with embedded foreign-language segments.
Illustrative Audio Samples
“आपकी last transaction ₹1,250 की थी, क्या आप उसे verify करना चाहेंगे?”
Results and Analysis
With a normalized score of 0.92, Murf Falcon outperforms every other production model on multilingual accuracy, making it ideal for real-world mixed-language conversations.

5. Contextual Accuracy
Study Overview
This study measured how well each TTS model pronounced homographs: words that share the same spelling but differ in pronunciation and meaning, such as lead (the metal) vs. lead (to guide) or read (past tense) vs. read (present tense).
Methodology
Each model’s output was evaluated on its ability to infer context and apply the correct pronunciation within short to medium-length sentences. The test set included examples such as lead, bow, row, wind, and minute, as well as stress-shift pairs like record (noun) and record (verb). Ground truth was defined at the phoneme level, and outputs were analyzed using forced alignment and ASR validation, with human adjudication to resolve ties. Final accuracy was reported as the percentage of correctly pronounced items and then normalized to a 0–1 scale to enable aggregation with other metrics.
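The scoring idea can be sketched as follows, with `align` standing in for the forced-alignment step and the rubric entries purely illustrative:

```python
# Illustrative homograph rubric: each (sentence, target word) pair maps to
# the context-correct ARPAbet pronunciation (hypothetical entries)
HOMOGRAPHS = [
    ("They will lead the tour.", "lead", "L IY D"),
    ("The pipe is made of lead.", "lead", "L EH D"),
]


def contextual_accuracy(align) -> float:
    """Fraction of homographs pronounced with the context-correct phonemes.

    `align` is a stand-in for a forced-alignment step that returns the
    phoneme sequence the model actually produced for the target word.
    """
    correct = sum(
        align(sentence, word) == expected
        for sentence, word, expected in HOMOGRAPHS
    )
    return correct / len(HOMOGRAPHS)
```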
Dataset Construction
The dataset comprised 60 contextual sentences designed to force disambiguation (for example, “They will lead the tour” vs “The pipe is made of lead”), each providing clear semantic cues requiring correct pronunciation for comprehension.
Illustrative Audio Samples
The microscope revealed a minute organism.
Results and Analysis
Murf Falcon, ElevenLabs, and OpenAI perform equally well on this test, ranking in the top tier for contextual accuracy and homograph disambiguation.

Aggregated Voice Quality Metric Score
The individual sub-scores are aggregated using assigned weightages, as shown in the table. Murf Falcon achieves a VQM score of 0.77, setting a new benchmark in this category. This reflects superior performance not only in voice naturalness but also in overall accuracy and reliability, outperforming the next best system by a significant margin.
Efficiency Quadrants
Most TTS models optimize for a single parameter: voice quality, latency, or cost. While that may work for content generation, voice agents require strength across all three. Using the benchmark data obtained earlier, we plotted efficiency quadrants to identify models that deliver on all dimensions versus those that force trade-offs.
Voice Quality vs. Price Performance
We plotted the VQM scores of the tested TTS models against their cost per generated minute. Murf Falcon stands out in the most efficient zone, delivering high voice quality at nearly one-third the cost of competing models. The other models exhibit clear trade-offs and are often not cost-efficient.
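For readers who want to reproduce the chart style, a minimal quadrant plot looks like the sketch below; Falcon's coordinates come from the figures above, while the competitor points are placeholders, not the benchmarked values.

```python
import matplotlib.pyplot as plt

# (cost per generated minute in USD, VQM score); competitor values are placeholders
models = {
    "Murf Falcon": (0.01, 0.77),
    "Model A": (0.03, 0.72),
    "Model B": (0.05, 0.70),
}

fig, ax = plt.subplots()
for name, (cost, score) in models.items():
    ax.scatter(cost, score)
    ax.annotate(name, (cost, score))

# Dashed lines split the plane into efficiency quadrants
ax.axvline(x=0.03, linestyle="--")
ax.axhline(y=0.73, linestyle="--")
ax.set_xlabel("Cost per generated minute (USD)")
ax.set_ylabel("VQM score")
ax.invert_xaxis()  # cheaper (more cost-efficient) models appear to the right
plt.show()
```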
Latency vs. Voice Quality Performance
When we plotted time-to-first-audio against VQM scores for the tested TTS models, Murf Falcon once again landed in the most efficient quadrant, combining ultra-low latency with high perceptual quality. Other models required compromises between the two.

This is why we call Falcon the consistently fastest and most efficient text-to-speech API in production. It delivers 130ms time-to-first-audio, industry-leading VQM scores, and stability proven across geographies, all at just 1 cent per minute, a fraction of competitors' cost.