Statefulness with Microsoft LUIS

Frank Gu
June 3, 2021
💡 TLDR: To add context-awareness to the stateless Microsoft LUIS NLU service, Voiceflow's engine distinguishes between regular intents and fulfillment intents. Each fulfillment intent maps 1-1 with the regular intent that it's fulfilling.

During inference, Voiceflow's engine makes two LUIS inference calls: one with the user's original utterance and another with the user's utterance prefixed with a sentinel string. The prefix strongly biases the machine learning model towards a narrow set of fulfillment intents, thereby achieving context-awareness.

In the past few months, Voiceflow's engineering team has been hard at work rolling out natural language processing (NLP) support for Voiceflow's own conversational agent. One of the exciting announcements from our recent Voiceflow V2 launch is NLP general availability. What does this mean for our users? Users are now able to customize their application's language model (LM) and build conversational applications hosted on Voiceflow and test their prototypes with higher fidelity.

What is an LM, you ask? A language model is a set of intents and entities that captures the "essence" of a user's input. Human language is often ambiguous and fuzzy; computers aren't smart enough to directly understand it yet! However, by discretizing user input (utterances) into a set of intents, we can capture what the users "intend" to do.

Behind the scenes, the workhorse of Voiceflow's NLP system is Microsoft's Language Understanding Intelligent Service (LUIS). LUIS is a managed NLP service that makes it easy for anyone to develop, train, and deploy a production-ready language model by defining a set of intents and entities, and uploading the associated labeled training data. We chose LUIS for its simplicity and ease of deployment. Each application on Voiceflow is mapped to a corresponding LUIS application.

Sometimes, you will want to extract some specific information from a user's utterance; such a piece of information is called an entity. LUIS achieves this feat by ingesting labeled training data that captures both a word's semantic context and its position in the sentence as hints for the machine learning model. If these concepts sound familiar, you have probably seen them in Voiceflow's Model manager; we refer to entities as "slots".

Intent Fulfillment

Our toy example here is a food ordering application that has a {% c-line %}pizza_order{% c-line-end %} intent that expects both a {% c-line %}size{% c-line-end %} and {% c-line %}topping{% c-line-end %} entity, and a {% c-line %}wings_order{% c-line-end %} intent that expects at least a {% c-line %}size{% c-line-end %} entity.

If the user says {% c-line %}I want to order a pizza{% c-line-end %}, the system will classify the user's utterance as a {% c-line %}pizza_order{% c-line-end %} intent, but realize that both the {% c-line %}size{% c-line-end %} and {% c-line %}topping{% c-line-end %} entities are missing. If you have defined these entities as required, the system will re-prompt the user to specify the {% c-line %}size{% c-line-end %} and {% c-line %}topping{% c-line-end %} options to fulfill the intent. This process is, unsurprisingly, called "intent fulfillment". Think of this feature as a clerk who helps your application gather all the important bits of information that the application needs in order to respond to a user's intent.
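The check for missing required entities can be sketched as follows. This is a minimal illustration, not Voiceflow's implementation: the `IntentDefinition` class, the `MODEL` dictionary, and the `first_missing_entity` helper are hypothetical names introduced here, and the required entities are taken from the toy example above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class IntentDefinition:
    """Hypothetical description of an intent and the entities it requires."""
    name: str
    required_entities: List[str] = field(default_factory=list)


# Toy model for the food-ordering example (illustrative only).
MODEL: Dict[str, IntentDefinition] = {
    "pizza_order": IntentDefinition("pizza_order", ["size", "topping"]),
    "wings_order": IntentDefinition("wings_order", ["size"]),
}


def first_missing_entity(intent_name: str, filled: Dict[str, str]) -> Optional[str]:
    """Return the first required entity that is not filled yet, or None."""
    for entity in MODEL[intent_name].required_entities:
        if entity not in filled:
            return entity
    return None


# "I want to order a pizza" carries no entities, so the size is requested first.
print(first_missing_entity("pizza_order", {}))
# Once every required entity is present, the intent is fulfilled.
print(first_missing_entity("wings_order", {"size": "large"}))
```

When the helper returns an entity name, the application would send the matching re-prompt; when it returns `None`, the intent is ready to be handled.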

To keep the conversation flow natural, the user should not be "locked" within an intent fulfillment context. They should be able to freely transfer to a completely different intent at any point in the conversation. A good example is when the user says {% c-line %}help{% c-line-end %} to invoke the Help intent in the middle of any conversation path.

The Problem - Statelessness of LUIS

Intent fulfillment fundamentally requires a stateful system to preserve the conversational context. After all, how could a stateless system know that it's missing information from a previous set of interactions without state management? Unfortunately, this is where LUIS's simplicity becomes a double-edged sword.

LUIS is a simple stateless NLP service that accepts a user's utterance string and outputs a list of intents, their associated prediction confidence scores, and a set of extracted entities from the top-matching intent. The LUIS application has no idea of any previous interactions with the user. To make matters worse, since LUIS accepts word length and word positions as cues to the machine learning model, the probability of matching single-word intents is very high.

Coming back to the aforementioned {% c-line %}pizza_order{% c-line-end %} intent, suppose the system asked {% c-line %}What size would you like your pepperoni pizza?{% c-line-end %}. If the user answers {% c-line %}large please{% c-line-end %}, three problems arise:

  1. How will the stateless application know that the user's answer is to fulfill an entity of the {% c-line %}pizza_order{% c-line-end %} intent?
  2. Since the model considers both word length and semantic similarity, how will it differentiate this answer from a similar utterance like {% c-line %}cheese please{% c-line-end %} (eg. to fulfill the {% c-line %}topping{% c-line-end %} entity)?
  3. How can the user change to a different intent context from within the intent fulfillment context? One may say {% c-line %}help{% c-line-end %} to invoke the Help intent amidst the {% c-line %}pizza_order{% c-line-end %} intent fulfillment process.

These problems point to the need for context-awareness in the LUIS application. The context-awareness enables the system to dynamically adjust the model's bias of certain utterances matching to certain intents within a given conversational context.

How can we build statefulness into a fundamentally stateless system?

Adding state to LUIS

To accomplish statefulness (context awareness) on LUIS, we have leveraged LUIS's regex-based entity extraction system to make two calls to LUIS whenever the user is in an intent fulfillment context: one call with a unique prefix to the user's utterance to bias LUIS towards a certain set of intents within the current conversational context, and another call with only the user's utterance to determine the user's intent as if the user were not in an intent fulfillment dialogue at all.

In our implementation, the prefix string is simply the {% c-line %}md5{% c-line-end %} hash of the name of the intent to be fulfilled; practically, any string that is never used in conversation, but provides a correlation to an intent (eg. another hash algorithm) can be used.
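A minimal sketch of building the two queries is shown below. The function names, the single-space separator between the prefix and the utterance, and the injected `predict` callable (standing in for whatever client sends a query to LUIS) are all assumptions made for illustration; only the use of an `md5` hash of the intent name comes from the description above.

```python
import hashlib
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def fulfillment_prefix(intent_name: str) -> str:
    """md5 hex digest of the intent name, used as the sentinel prefix."""
    return hashlib.md5(intent_name.encode("utf-8")).hexdigest()


def build_queries(utterance: str, intent_name: str) -> Tuple[str, str]:
    """Return (regular_query, prefixed_query) for one user utterance.

    Assumption: the prefix and the utterance are joined with a single space.
    """
    return utterance, f"{fulfillment_prefix(intent_name)} {utterance}"


def predict_both(predict: Callable[[str], T], utterance: str, intent_name: str) -> Tuple[T, T]:
    """Call the (hypothetical) prediction client once per query."""
    regular_query, prefixed_query = build_queries(utterance, intent_name)
    return predict(regular_query), predict(prefixed_query)


regular, prefixed = build_queries("large please", "pizza_order")
print(regular)
print(prefixed)
```

Because the hash is derived from the intent name, the same intent always yields the same prefix, so the prefix seen at training time matches the one sent at inference time.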

LUIS's regex matching based on the prefix string provides the machine learning model with a strong cue to match a fulfillment intent, instead of a regular intent. What is the difference between a regular and a fulfillment intent? To LUIS, they are all treated the same as plain old intents with their associated training data and entities. The differences are in the training data and their end-purpose:

  • Regular Intent: you can think of this as the "top level" intent that represents the user's initial intent. A good example is the {% c-line %}pizza_order{% c-line-end %} intent without any prefixed training data. These intents serve as a conversation path's entry point; they are more likely to be matched than a fulfillment intent due to the lack of the prefix string.
  • Fulfillment Intent: consider this as a context-dependent intent that is only likely to match whenever you are in an intent fulfillment context. The context distinction is important because simple user utterances such as {% c-line %}large{% c-line-end %} in response to a prompt for a pizza size can easily match ambiguously with any other single word utterance. Recall that LUIS uses both semantic context and utterance structure for intent classification and entity extraction. For low word-count utterances, the utterance structure dominates. Intents with short training utterances are usually more likely to be matched, so special provisions (ie. prefix string) should be in place to bias the intent classification context.
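To make the distinction concrete, here is a hedged sketch of how training data for a fulfillment intent might be derived from a regular intent. The `dm_<hash>_<intent>` naming follows the pattern used later in this article; the helper names, the sample utterances, and the space-separated prefix are assumptions, not Voiceflow's actual training pipeline.

```python
import hashlib
from typing import List


def intent_hash(intent_name: str) -> str:
    """md5 hex digest of the regular intent's name."""
    return hashlib.md5(intent_name.encode("utf-8")).hexdigest()


def fulfillment_intent_name(intent_name: str) -> str:
    """Name of the fulfillment intent paired 1-1 with a regular intent."""
    return f"dm_{intent_hash(intent_name)}_{intent_name}"


def fulfillment_utterances(intent_name: str, short_answers: List[str]) -> List[str]:
    """Prefix every short answer so it only resembles prefixed queries."""
    prefix = intent_hash(intent_name)
    return [f"{prefix} {answer}" for answer in short_answers]


# Regular intent: un-prefixed, full-sentence training data (illustrative).
regular_training = ["I want to order a pizza", "I want to order a large pizza"]

# Fulfillment intent: short answers, all carrying the sentinel prefix.
name = fulfillment_intent_name("pizza_order")
training = fulfillment_utterances("pizza_order", ["large", "large please", "pepperoni"])
print(name)
for utterance in training:
    print(utterance)
```

In this sketch an un-prefixed `large` never appears in any training set, which is what keeps short answers from competing with regular intents outside the fulfillment context.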

The results are compared and several possibilities are processed accordingly. Specifically:

1. An incoming user utterance to the {% c-line %}general-runtime{% c-line-end %} service is proxied to the LUIS runtime and classified as an intent, and entities are extracted from the utterance.

2. The {% c-line %}general-runtime{% c-line-end %} identifies the first unfulfilled but required entity and activates the intent fulfillment context.

3. The appropriate re-prompt message defined in the Voiceflow Model is sent to the user. The user may respond with either the requested entities, or with an utterance for a completely different intent.

4. The user utterance will pass through the regular {% c-line %}general-runtime{% c-line-end %} pipeline and produce an intent/entity extraction.

5. If the dialogue manager detects that the context is intent fulfillment, another {% c-line %}hash{% c-line-end %} prefixed call will be made to the LUIS runtime from the {% c-line %}general-runtime{% c-line-end %}, which is used to determine what the slots are if the user is responding to a reprompt. At this point, the user can also switch to a completely different intent (eg. {% c-line %}I want wings{% c-line-end %}, instead of specifying a pizza size). Source code

6. Once both LUIS responses have been obtained, the {% c-line %}general-runtime{% c-line-end %}'s dialogue manager will compare the results against a few potential branches (each branch should be evaluated in the order specified). The implementation can be found here.

CASE-A: The prefixed call results in an intent without the [dm_] prefix → not a fulfillment intent response

1. CASE-A1: If the prefixed and regular calls match the same intent that is different from the original intent → migrate user to the new intent, extract all the available entities, and determine if additional entities are needed from the user.

User - "I want to order a large pizza" (Original utterance)
Sys  - "What type of pizza?" (Reprompt)
User - "I want to order chicken wings" (Response with completely different intent)

2. CASE-A2 (rare): If the prefixed and regular calls do not match the same intent → Fallback intent.

User - "I want to order a large pizza" (Original utterance)
Sys  - "What type of pizza?" (Reprompt)
User - "The sky is blue" | "the type of chicken wings that tastes good" (Response that may be completely off-topic or ambiguous)

In the above example, the {% c-line %}hash{% c-line-end %}-prefixed call will most likely match either {% c-line %}None{% c-line-end %} or some other {% c-line %}dm_<hash>_<intent>{% c-line-end %}, while the regular call will return a {% c-line %}None{% c-line-end %} intent or perhaps some other nonsensical intent due to variations in the machine learning model. In this case, we would either get {% c-line %}None + None{% c-line-end %} for both calls and return {% c-line %}None{% c-line-end %} (fallback intent), or get {% c-line %}dm_<hash>_<intent> + None{% c-line-end %} (for example) and also return a fallback intent so the user can clarify.

CASE-B: The prefixed call matches the [dm_<hash>_<original intent>]

1. CASE-B1: If the prefixed and regular calls match the same intent, then use the entities extracted from the prefixed intent to overwrite any existing filled entities and determine if additional entities are needed from the user.

User - "I want to order a large pizza" (Original utterance)
Sys  - "What type of pizza?" (Reprompt)
User - "I want to order a small pizza" | "I want to order a large pepperoni pizza"

2. CASE-B2: If the prefixed and regular calls do {% c-line %}NOT{% c-line-end %} match the same intent.

   a) CASE-B2_1: If the regular intent has a higher confidence score than the prefixed intent → migrate the user to the regular intent, extract all the available entities, and determine if additional entities are needed from the user.

User - "I want to order a large pizza" (Original utterance)
Sys - "What type of pizza?" (Reprompt)
User - "I want to order a large chicken wings"

   The prefixed call may still match the original intent due to the semantic similarity of "I want to order a large pizza" and "I want to order a large chicken wings". However, if the training data for the {% c-line %}wings_order{% c-line-end %} intent includes this exact phrase, it's very likely that the regular call will match the {% c-line %}wings_order{% c-line-end %} intent with higher confidence. In this case, we should switch the user to the new intent.

   b) CASE-B2_2: If the prefixed intent has no entities extracted → migrate the user to the regular intent.

User - "I want to order a large pizza" (Original utterance)
Sys  - "What type of pizza?" (Re-prompt)
User - "I want chicken wings" (Response with no entities)

   In the following cases, the user's response is very likely to match a completely different intent with the regular call (due to the short response length). However, since the prefixed version has entities and matches the original intent call, these entities should be used to fill the current intent if they are valid.

   c) CASE-B2_3: If the prefixed intent only has entities that are in the intent's entity list, fill the intent with extracted entities.

User - "I want to order a large pizza" (Original utterance)
Sys  - "What type of pizza?" (Re-prompt)
User - "Pepperoni"

   d) CASE-B2_4: If the prefixed intent has entities that do not exist in the intent's entity list, return Fallback Intent.

User - "I want to order a large pizza" (Original utterance)
Sys  - "What type of pizza?" (Reprompt)
User - "Paris"

With these conditions, the system can easily determine how to proceed.
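The branches above can be sketched as a single decision function. This is an illustrative reading of the cases, not the actual `general-runtime` code: the `Prediction` class, the `resolve` function, and its return shape are hypothetical, and "matching the same intent" in CASE-B is interpreted as the regular call returning the original intent. Combinations the article does not describe are sent to the fallback, which is an assumption.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, Optional, Set, Tuple


@dataclass
class Prediction:
    """Hypothetical, simplified view of one LUIS result."""
    intent: str
    score: float
    entities: Dict[str, str] = field(default_factory=dict)


def fulfillment_intent_name(intent: str) -> str:
    digest = hashlib.md5(intent.encode("utf-8")).hexdigest()
    return f"dm_{digest}_{intent}"


Result = Tuple[str, Optional[str], Dict[str, str]]


def resolve(original: str, regular: Prediction, prefixed: Prediction,
            allowed: Set[str]) -> Result:
    """Return (action, intent, entities); action is 'switch', 'fill' or 'fallback'."""
    fallback: Result = ("fallback", None, {})

    # CASE-A: the prefixed call did not land on a fulfillment intent.
    if not prefixed.intent.startswith("dm_"):
        if (prefixed.intent == regular.intent
                and regular.intent not in (original, "None")):
            return ("switch", regular.intent, regular.entities)  # CASE-A1
        return fallback  # CASE-A2

    # Assumption: a fulfillment intent for some other intent is a fallback.
    if prefixed.intent != fulfillment_intent_name(original):
        return fallback

    # CASE-B: the prefixed call matched dm_<hash>_<original intent>.
    if regular.intent == original:
        return ("fill", original, prefixed.entities)  # CASE-B1
    if regular.score > prefixed.score:
        return ("switch", regular.intent, regular.entities)  # CASE-B2_1
    if not prefixed.entities:
        return ("switch", regular.intent, regular.entities)  # CASE-B2_2
    if set(prefixed.entities) <= allowed:
        return ("fill", original, prefixed.entities)  # CASE-B2_3
    return fallback  # CASE-B2_4


dm_pizza = fulfillment_intent_name("pizza_order")
print(resolve("pizza_order",
              Prediction("None", 0.3),
              Prediction(dm_pizza, 0.9, {"topping": "pepperoni"}),
              {"size", "topping"}))
```

The order of the checks mirrors the order in which the branches are listed, since the article notes that each branch should be evaluated in sequence.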

In summary

The combination of prefixed and non-prefixed LUIS predictions provides us with sufficient information to determine if the user's utterance is a response to an intent fulfillment context or a completely different or nonsensical response. The intent fulfillment logic is a powerful tool for conversational designers to ensure that their users provide all the information that their skill needs to proceed, without the need for any boilerplate logic.