In real-time generation, input text often arrives in chunks. These advanced settings let you fine-tune text buffering to balance audio quality and Time to First Byte (TTFB).
By default, our system waits until a sentence is complete—determined by punctuation—before sending it for voice generation.
min_buffer_size
When a sentence isn’t complete and no punctuation is detected, the system uses min_buffer_size to decide when to send the text. This parameter sets the minimum number of characters required before sending input for audio generation. A larger buffer provides better context for the model, leading to higher audio quality. Reducing this value can lower TTFB by enabling quicker responses.
max_buffer_delay_in_ms
If a sentence is incomplete and the buffered text hasn’t reached min_buffer_size either, max_buffer_delay_in_ms sets the maximum time (in milliseconds) the system will wait before processing the input. Once this delay is reached, the available text is sent, even if it’s below the buffer threshold.
Let’s say the first chunk of text received is: “I just wanted to say that”
This chunk is 25 characters long and has no closing punctuation, so it is an incomplete sentence that falls short of a min_buffer_size of, say, 60 characters.
If no more text arrives before max_buffer_delay_in_ms elapses (e.g., 500 ms), the system sends whatever text it has.
Result: After 500 ms, “I just wanted to say that” is sent for audio generation to avoid further delay. This mechanism keeps responses fast (good TTFB) without giving up too much naturalness or context.
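The decision order described above can be modeled in a few lines of Python. This is only a sketch of the documented rules, not the service's actual implementation: the names BufferSettings, should_flush, and SENTENCE_END are hypothetical, the check for a "complete sentence" is assumed to be trailing ".", "!" or "?", and "reaching" min_buffer_size is assumed to mean a length greater than or equal to it.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumption: a sentence counts as complete when the buffer ends in . ! or ?
SENTENCE_END = re.compile(r"[.!?]\s*$")


@dataclass
class BufferSettings:
    """Hypothetical container for the two settings described above."""
    min_buffer_size: int = 60          # characters
    max_buffer_delay_in_ms: int = 500  # milliseconds


def should_flush(buffered_text: str, waited_ms: int,
                 settings: BufferSettings) -> Optional[str]:
    """Return the reason the buffer would be sent, or None to keep waiting."""
    if SENTENCE_END.search(buffered_text):
        return "sentence_complete"
    if len(buffered_text) >= settings.min_buffer_size:
        return "min_buffer_size"
    if waited_ms >= settings.max_buffer_delay_in_ms:
        return "max_buffer_delay"
    return None


settings = BufferSettings(min_buffer_size=60, max_buffer_delay_in_ms=500)
chunk = "I just wanted to say that"

print(len(chunk))                          # 25
print(should_flush(chunk, 100, settings))  # None
print(should_flush(chunk, 500, settings))  # max_buffer_delay
```

With the example chunk, nothing is sent at 100 ms because the text is unpunctuated and only 25 characters long; at 500 ms the delay limit is the rule that triggers sending.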