Creating the Voiceflow NLU

Building an NLU is an ambitious task. There are millions of assistants that need to be powered across different domains, languages, and use cases. As Voiceflow continued to grow, it became clear that while building a custom NLU was challenging, it was an important area we needed to invest in to deliver the best customer experience. We started working on the Voiceflow NLU in February 2022, and this is a small slice of our story.

Chapter 1: Understanding the goals and criteria

To start our research efforts, we looked at six important criteria:

NLU accuracy, for both intent and entity classifications
Latency, how long the model would take to respond
Training time, how long it would take to train a new model
Generalizability, we have customers across many verticals
Operational overhead, maintenance costs of the models
Cost $$$, we operate at a scale of 100,000s of models

We started with the first two criteria, and began exploring papers and blog posts on the topic. There were existing NLUs and libraries available, but as we dug deeper, the other four criteria started becoming a challenge. We ended up using a pre-trained transformer model and fine-tuning it for intent and entity tasks. Building our own NLU also allowed us to think through all the features that were important to our customers.

It was now March 2022. We were able to replicate the performance of several existing implementations and began thinking about how to deploy the NLU models. Thankfully, we had already built our ML deployment platform (ML CLI), which made deployment much easier.

Chapter 2: An integrated prototype

It was April and we had an internal engineering onsite. One of the demos was the VFNLU, now running in one of our Voiceflow dev environments. Below is the recording of the demo.

The VFNLU was no longer an academic exercise, but a feasible product.

Chapter 3: Feature completeness and Multilingual models

After the April excitement, the reality of preparing the models for production set in. Code was refactored, added, and changed to fit our existing systems. We also retrained the NLU several times, incorporating new techniques that boosted performance and adding multilingual support. Our ML platform setup let us A/B test these models, which was handy for comparisons. Summer turned into fall, and we began load testing.

Chapter 4: Real time constraints

“This is too slow”

Our ML platform had been built with realtime in mind, with latencies in the high 100s of ms for most requests. We built it with a queue system, but that technology was not designed for requests that should consistently return within 150 ms. Even though our NLU model itself was very fast (under 20 ms), the latency of the overall system was quite high, especially in the long tail of requests.

We had to redesign how our NLU handles requests, which kicked off a three-month code refactor. It was unexpected, but it was required to achieve the latency that our users expected.
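To illustrate why the long tail matters more than the typical request, here is a minimal sketch that compares the median with the 99th percentile. The latency numbers are made up for illustration and are not measurements from our platform; `percentile` is a hypothetical helper that uses the nearest-rank method.

```python
def percentile(samples, p):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100, which avoids floating point rounding.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: most requests are fast,
# but a few wait in a queue and take far longer.
latencies_ms = [20] * 95 + [400] * 5

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
print(f"p50={p50} ms, p99={p99} ms")
```

In this made-up sample the median looks healthy while the 99th percentile is far above a 150 ms target, which is the kind of gap that averages and medians hide.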

Chapter 5: Beta testing

In April 2023, almost a year after the initial VFNLU demo, we started rolling out the VFNLU to our free users. It was time to test how well the VFNLU performed in the real world and what bugs we could find. We also began migrating some early customers to help them avoid some of the limitations of our previous OEM NLU implementation. We implemented a feature flag that directed a percentage of users to train and run inference on our NLU. Eventually we completed a full rollout across all our free users, which helped us find some last-minute bugs and deficiencies in the systems.
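A percentage-based feature flag like the one described above can be sketched as follows. This is a minimal illustration that assumes deterministic hashing of a user ID into 100 buckets; the function names, the salt, and the bucketing scheme are hypothetical and are not our actual implementation.

```python
import hashlib


def rollout_bucket(user_id: str, salt: str = "vfnlu-rollout") -> int:
    """Map a user ID to a stable bucket in the range 0-99 (hypothetical scheme)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def uses_new_nlu(user_id: str, rollout_percent: int) -> bool:
    """Return True if this user falls inside the current rollout percentage."""
    return rollout_bucket(user_id) < rollout_percent


# The same user always gets the same answer for a given percentage,
# and raising the percentage only ever adds users to the rollout.
print(uses_new_nlu("user-123", 25))
```

Hashing rather than sampling randomly on each request keeps a user on the same NLU for both training and inference, and it lets the percentage be raised gradually until it reaches 100.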

Chapter 6: The VFNLU

And today we release the VFNLU. After 18 months of work, the VFNLU is ready, outperforming the most popular NLUs on open source and proprietary benchmarks. We’ve released an open source testing repo so you can test the performance yourself.

With 30+ languages supported and low training time, you can both power and prototype your next conversational assistant.

NLUs in an LLM world

In December 2022, ChatGPT came up in almost every conversation about conversational AI. It was a powerful new model that could solve many tasks, handle many languages, and respond to almost any user request. We ran an internal hackathon before the winter holidays to add a number of Gen AI features to Voiceflow.

Alongside the Gen AI features, we kept working on the VFNLU. There were a few reasons for this:

All our Enterprise Customers were continuing to use NLUs to power their assistants
NLUs are 100-1000x cheaper than LLMs and perform better on large projects
NLUs don’t hallucinate. They might respond with the wrong intent, but it will be clear when there’s a false positive.

With this in mind, each piece of technology has its place in the conversational AI world. Our goal as a platform is to let people experiment and build with both.

