Turbocharging Voice Agents: Speed, Quality, and Powerful New Features

Since Voiceflow voice agents were released in late 2024, our engineering team has been hard at work improving them! In a world where every millisecond counts, we’ve cut the base voice-to-voice latency of your projects (a real-world measure of how it feels to talk to an agent) by an average of ~700ms on Twilio and ~1200ms on our web voice widget. At the same time, we’ve improved conversational quality and added new features to help you build even more powerful agents!

Voice agents aren’t just a cool demo anymore: they’re becoming core to how modern businesses interact with users. Faster response times mean less user frustration and higher task completion. Better audio quality means higher transcription accuracy, fewer dropped utterances, and smoother experiences. These new upgrades make Voiceflow Voice agents production-ready for everything from outbound calls to in-app support, helping your team build quickly and reliably at scale, without compromising on the customer experience.

We’ve made significant improvements to our Voice offering, including:

  • Real-world voice latency reduced by up to 1200ms
  • Krisp noise cancellation for cleaner audio
  • Full support for custom ElevenLabs voices and synthesis settings
  • Answering machine detection, call recording, and call event webhooks
  • Plus tons of performance improvements to make your agents faster and smarter

In addition to showing off our newest features, we’re going to get technical and take a peek behind the curtain to learn a bit more about how we achieved some of these wins. Let’s get into it!

Turbocharging the Voice stack to reduce latency

Voice agents can be complicated to get working as there are a ton of moving parts. Let’s trace through how a voice agent works and see the improvements made at each stage:

  • Browser-based voice recording widget: we’ve switched recording encoding formats to remove encoding delays and also send new audio to Voiceflow quicker and more often, saving up to 500ms.
  • Automatic Speech Recognition (ASR) and transcription: we’ve implemented a new algorithm to reduce how long we need to wait before we’re confident in a transcription for a given audio chunk. This is done by looking at preliminary transcripts to evaluate how long to wait before deciding the user is done speaking, rather than waiting for a final transcript, then pausing again due to 'on punctuation time.' This change alone saves up to 700ms.
  • Voiceflow logic and LLMs: through the addition of the agent step, LLMs are able to stream their output and use tools, which lets us start generating and playing the audio for the start of the speech while the LLM is still thinking, rather than waiting until generation is complete. In advanced flows, not waiting for function calls to finish can save seconds!
  • Text to Speech (TTS): we’ve added support for ElevenLabs’ latest Flash v2.5 model, saving up to 250ms compared to Turbo, and we’ve sped up sentence aggregation, which now uses previously generated text as context to improve quality.
  • Browser-based audio: we’ve changed the format of audio being played in the client so it can be played natively, saving up to 200ms.
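
To make the ASR step above more concrete, here is a minimal sketch of end-of-speech detection driven by preliminary transcripts. It is an illustration under stated assumptions, not Voiceflow’s actual algorithm: the names (InterimTranscript, wait_budget_ms, user_is_done) and both thresholds are hypothetical. The idea is to wait only briefly when the interim text already reads like a finished sentence, and longer when it looks unfinished.

```python
from dataclasses import dataclass

# Hypothetical wait times in milliseconds; real values would need tuning.
SHORT_WAIT_MS = 300   # interim transcript already looks like a finished sentence
LONG_WAIT_MS = 1000   # interim transcript looks unfinished

SENTENCE_ENDINGS = (".", "!", "?")


@dataclass
class InterimTranscript:
    text: str          # preliminary transcript text from the ASR stream
    received_ms: int   # timestamp at which this interim result arrived


def wait_budget_ms(text: str) -> int:
    """Pick how long to wait for more speech, based on the interim text."""
    if text.strip().endswith(SENTENCE_ENDINGS):
        return SHORT_WAIT_MS
    return LONG_WAIT_MS


def user_is_done(latest: InterimTranscript, now_ms: int) -> bool:
    """True once no newer interim result has arrived within the wait budget."""
    if not latest.text.strip():
        return False  # nothing has been said yet
    return now_ms - latest.received_ms >= wait_budget_ms(latest.text)
```

With these example thresholds, "Book a table." is treated as complete after 300ms of quiet, while "Book a table for" keeps the agent listening for a full second.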

Altogether, this has reduced real-world speech-to-speech latency (the time between a user finishing talking and hearing the LLM’s answer) by up to ~1200ms. This brings us down to a total of ~1400ms in practice for a realistic project using an Agent Step with GPT-4o mini and ElevenLabs Flash v2.5 for TTS in Chrome.

With these improvements, Voiceflow's agents now respond within ~1.4s, faster than many leading production-ready voice platforms.

Adding Krisp for noise cancellation

Latency is one piece of the puzzle, but quality matters too.

Background noise, especially speech or music, can seriously throw off voice agents. ASR systems transcribe everything they hear, so voices in a coffee shop or lyrics from background music can easily get mistaken for the user’s input, leading to weird or incorrect responses. It can also confuse the agent into thinking the user isn’t done talking, delaying responses or interrupting playback. In short: noise kills both quality and speed.

That’s why we’ve added Krisp, a lightning-fast, production-grade noise cancellation algorithm, to our audio processing system.

We ran a test where a user is speaking with a podcast playing loudly in the background, simulating a crowded environment.

Here are two spectrograms, the upper one visualizing the audio that would be heard by ASR without Krisp, and the lower one showing the audio after having been processed with Krisp.

You can even listen to both recordings, which were run through Voiceflow’s real audio pipeline.

Without Krisp:

With Krisp:

In our tests, Krisp noise cancellation adds ~20ms of latency to the audio pipeline, but it drastically improves transcription accuracy and produces final transcriptions sooner. The result is a reduction in overall speech-to-speech latency of ~100ms.

And the best part: Krisp is automatically activated on all Voiceflow projects, for free! Enjoy!

Various ElevenLabs TTS Improvements

Switching over to the TTS side of the voice stack: ElevenLabs is by far the most popular TTS provider among Voiceflow builders. Thanks to your feedback, we’ve added lots of new settings to make your ElevenLabs experience in Voiceflow that much better! Here are three quick improvements:

1. Got a custom ElevenLabs voice? Now you can bring it straight into Voiceflow with your own API key in the Voice behavior settings! This lets you use custom voices beyond the selection we offer by default. You could even clone your own voice for your agent! Using your own ElevenLabs API key means that you won’t be billed for TTS in Voiceflow; instead, you’ll be billed directly through your ElevenLabs account.

2. Want more control? Voice synthesis settings are now fully customizable: tweak speed, stability, and more to match your brand voice. This is a powerful way to align a voice with your brand persona; for example, a slower-speaking, more patient agent for a doctor’s office.

3. We’ve also fine-tuned our ElevenLabs TTS generation by stripping out stray characters, like quotation marks ("), that LLMs sometimes add. These stray characters can cause noise artifacts. We also upgraded the algorithm we use to aggregate sentences from streamed LLM text (for example, from the agent step), giving ElevenLabs more context on the text that’s come before to make the synthesized audio more natural at no added latency. Basically, your generated voice will sound better!
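
As a rough illustration of the third point, here is a minimal sketch of stripping stray characters and aggregating sentences from streamed LLM chunks while keeping track of the text that came before. This is not Voiceflow’s implementation: the function names, the set of characters removed (the asterisk is an assumption; the post only mentions quotation marks), and the simple punctuation-based sentence splitting are all hypothetical.

```python
import re
from typing import Iterable, Iterator, Tuple

# Characters removed before synthesis. The double quote comes from the post;
# the asterisk is an assumption added for illustration.
STRAY_CHARS = re.compile(r'["*]')
# Simplified rule: a sentence ends at ., ! or ? followed by whitespace.
SENTENCE_END = re.compile(r'(?<=[.!?])\s+')


def clean(text: str) -> str:
    """Remove stray characters that could produce audio artifacts."""
    return STRAY_CHARS.sub("", text)


def aggregate_sentences(chunks: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (sentence, previous_text) pairs as soon as each sentence completes."""
    buffer = ""
    previous = ""
    for chunk in chunks:
        buffer += clean(chunk)
        parts = SENTENCE_END.split(buffer)
        # Every part except the last one is a complete sentence.
        for sentence in parts[:-1]:
            yield sentence, previous
            previous = (previous + " " + sentence).strip()
        buffer = parts[-1]
    # Flush whatever is left once the stream ends.
    if buffer.strip():
        yield buffer.strip(), previous
```

Each yielded sentence can be sent to TTS immediately, so audio for the first sentence can start while the LLM is still generating the rest, and the accompanying previous text gives the synthesizer context for more natural prosody.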

Telephony Improvements

Most Voiceflow builders are already deploying their voice agents into production through our Twilio integration, so we’ve also been hard at work adding features here!

If you haven’t seen it yet, Voiceflow has an outbound calling API that you can use to programmatically get your AI agent to make outbound calls! This is super powerful when your agent is part of a larger user experience and needs to proactively reach out to users.

We’ve also increased the maximum call length from 10 minutes to 30 minutes so that users can have longer conversations with your agent. To mitigate cases where the agent might be sitting on an empty line for 30 minutes without any input, we’ve also added a new 3-minute silence timeout. This works by disconnecting the call if nothing has been said by the agent or the user in 3 minutes, preventing unnecessary usage on your account.
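
The two limits described above can be sketched as a small piece of call state. This is only an illustration of the rule as stated (a 30-minute maximum and a 3-minute silence timeout that resets whenever the agent or the user speaks); the class and method names are hypothetical and do not reflect Voiceflow’s internals.

```python
from dataclasses import dataclass

MAX_CALL_SECONDS = 30 * 60        # 30-minute maximum call length
SILENCE_TIMEOUT_SECONDS = 3 * 60  # 3-minute silence timeout


@dataclass
class CallState:
    started_at: float      # time the call began, in seconds
    last_speech_at: float  # last time the agent or the user spoke, in seconds

    def on_speech(self, now: float) -> None:
        """Record that the agent or the user said something."""
        self.last_speech_at = now

    def should_disconnect(self, now: float) -> bool:
        """True if the call hit the maximum length or has been silent too long."""
        too_long = now - self.started_at >= MAX_CALL_SECONDS
        too_quiet = now - self.last_speech_at >= SILENCE_TIMEOUT_SECONDS
        return too_long or too_quiet
```

Because any speech from either side resets the silence clock, an active 25-minute conversation is never cut off early, while an empty line is dropped after 3 minutes.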

Voice agents also need a way to detect voicemails automatically. Answering machine detection now handles that by ending the call automatically when we detect a voicemail beep. This avoids having your voice agent sit on the line waiting for user input without hearing anything back.

On top of this, we’ve added two hotly requested features: call event webhooks and Twilio call recording!

Call event webhooks give you real-time notifications when calls start and end, so you can automate workflows, log data, or trigger actions across Twilio or web voice — all without polling or manual setup. Learn more in the docs here.
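
To show how a receiver for these notifications might look, here is a minimal sketch of the logic behind a webhook endpoint. The payload shape is an assumption made purely for illustration: the event names ("call.start", "call.end") and the "type" and "callId" fields are hypothetical, so check the docs for the real schema before relying on any of them.

```python
import json
from typing import Dict, List

# Hypothetical event names; the real ones are defined in the Voiceflow docs.
CALL_START = "call.start"
CALL_END = "call.end"


def handle_call_event(raw_body: bytes, active_calls: Dict[str, dict], log: List[str]) -> int:
    """Process one webhook POST body and return an HTTP status code."""
    try:
        event = json.loads(raw_body)
        event_type = event["type"]   # assumed field name
        call_id = event["callId"]    # assumed field name
    except (ValueError, KeyError, TypeError):
        return 400  # malformed JSON or missing fields

    if event_type == CALL_START:
        active_calls[call_id] = event
        log.append(f"started {call_id}")
    elif event_type == CALL_END:
        active_calls.pop(call_id, None)
        log.append(f"ended {call_id}")
    else:
        log.append(f"ignored {event_type}")
    return 200
```

Keeping the handler as a plain function makes it easy to plug into whichever web framework you already use, and it returns quickly so the sender is not kept waiting while downstream automation runs.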

We also added the ability to save call recordings for Twilio calls! This lets you learn more about your users’ real experience with your voice agent, beyond simply reading transcripts, and recordings are managed from your own Twilio account. Learn more about it in the docs here.

Try it out!

Phew! That’s a lot of new features! But we have one more quality of life upgrade left to mention.

To make it easier for you to test your voice agents, we added a “Send phone call” option to the top-right Run button. Enter your phone number and your agent will call you, all for free, without you even needing your own Twilio phone number.

Give it a try today at voiceflow.com!

We’re always adding new features to Voice and to the rest of the Voiceflow platform, so stay tuned for more! A great voice stack means you can now use the powerful Voiceflow agent design tools you’re used to from chat in a fully conversational voice experience.

We’d also like to thank everyone in the community who has been sharing great feedback about our voice products throughout the beta (call event webhooks wouldn’t exist without you!). Thank you!
