
Advanced Settings


In real-time generation, input text often arrives in chunks. These advanced settings let you fine-tune text buffering to balance audio quality and Time to First Byte (TTFB).

Default Behavior

By default, our system waits until a sentence is complete—determined by punctuation—before sending it for voice generation.

min_buffer_size

When a sentence isn’t complete and no punctuation is detected, the system uses min_buffer_size to decide when to send the text. This parameter sets the minimum number of characters required before sending input for audio generation. A larger buffer provides better context for the model, leading to higher audio quality. Reducing this value can lower TTFB by enabling quicker responses.

  • Range: 40 to 160 characters

max_buffer_delay_in_ms

If a sentence is incomplete and the buffered text hasn’t yet reached min_buffer_size, max_buffer_delay_in_ms sets the maximum time (in milliseconds) the system will wait before processing the input. Once this delay elapses, the available text is sent, even if it’s below the buffer threshold.

  • Range: 0 to 1000 milliseconds
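The two documented ranges can be checked on the client before any settings are sent. The following is a minimal sketch, not part of any official SDK: the `BufferSettings` class and its `to_dict` helper are hypothetical names introduced for illustration, and only the parameter names and ranges come from this page. Where these fields belong in the WebSocket message is not specified here, so the sketch stops at producing a plain dictionary.

```python
from dataclasses import asdict, dataclass

# Ranges as documented on this page.
MIN_BUFFER_SIZE_RANGE = (40, 160)       # characters
MAX_BUFFER_DELAY_MS_RANGE = (0, 1000)   # milliseconds


@dataclass(frozen=True)
class BufferSettings:
    """Hypothetical client-side container for the two buffering settings."""

    min_buffer_size: int
    max_buffer_delay_in_ms: int

    def __post_init__(self) -> None:
        # Reject values outside the documented ranges early.
        lo, hi = MIN_BUFFER_SIZE_RANGE
        if not lo <= self.min_buffer_size <= hi:
            raise ValueError(
                f"min_buffer_size must be between {lo} and {hi} characters"
            )
        lo, hi = MAX_BUFFER_DELAY_MS_RANGE
        if not lo <= self.max_buffer_delay_in_ms <= hi:
            raise ValueError(
                f"max_buffer_delay_in_ms must be between {lo} and {hi} ms"
            )

    def to_dict(self) -> dict:
        # Plain dict keyed by the documented parameter names.
        return asdict(self)


settings = BufferSettings(min_buffer_size=60, max_buffer_delay_in_ms=500)
print(settings.to_dict())
```

Validating locally means an out-of-range value fails fast with a clear message instead of depending on how the server handles it.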

Example

Let’s say the first chunk of text received is: “I just wanted to say that”

  • This sentence is incomplete and has no punctuation, so the system doesn’t immediately send it for audio generation.
  • The current length is 25 characters (including spaces), which is below a min_buffer_size of, say, 60 characters.
  • The system waits for more input. If no additional text arrives within the time set by max_buffer_delay_in_ms (e.g., 500 ms), the system sends whatever text it has.

Result: After 500 ms, “I just wanted to say that” is sent for audio generation to avoid further delay. This caps the extra wait added by buffering at max_buffer_delay_in_ms, keeping TTFB low while still giving the model as much context as arrived within that window.
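The walkthrough above can be replayed with a small model of the three buffering rules. This is an illustrative sketch of the behavior described on this page, not the service's actual implementation: the `flush_reason` function is a hypothetical name, the set of sentence-ending punctuation is an assumption (the page does not list it), and the check only looks at the end of the buffered text.

```python
from typing import Optional

# Assumption: the real set of sentence-ending punctuation is not documented.
SENTENCE_END = (".", "!", "?")


def flush_reason(
    text: str,
    waited_ms: int,
    min_buffer_size: int,
    max_buffer_delay_in_ms: int,
) -> Optional[str]:
    """Return why the buffered text would be sent, or None to keep waiting."""
    if not text:
        return None
    # Default behavior: a sentence completed by punctuation is sent.
    if text.rstrip().endswith(SENTENCE_END):
        return "sentence_complete"
    # No punctuation: send once enough characters have accumulated.
    if len(text) >= min_buffer_size:
        return "min_buffer_size"
    # Still too short: send anyway once the maximum delay has elapsed.
    if waited_ms >= max_buffer_delay_in_ms:
        return "max_buffer_delay"
    return None


chunk = "I just wanted to say that"
print(len(chunk))                                # 25
print(flush_reason(chunk, 0, 60, 500))           # None (keeps waiting)
print(flush_reason(chunk, 500, 60, 500))         # max_buffer_delay
print(flush_reason(chunk + ".", 0, 60, 500))     # sentence_complete
```

With these example values, the chunk is held at 0 ms because it is unpunctuated and shorter than 60 characters, then released at 500 ms by the delay rule.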