Large Action Models change the way we build chatbots, again

Large Action Models (LAMs) are AI models trained to understand human intention and predict subsequent actions to take. Unlike LLMs, which predict what to say next, LAMs predict what to do next, breaking complex tasks such as booking a vacation or filing a tax return into smaller, executable steps. As LAMs advance, they will change the way we build chatbots, again.
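
To make the distinction concrete, here's a hypothetical sketch of the kind of structured plan a LAM might emit for a high-level intention. The action names and fields are invented for illustration and don't reflect any real model's output format.

```typescript
// Hypothetical sketch: where an LLM would predict the next *words*,
// a LAM predicts the next *steps*. All names here are invented.
interface Action {
  name: string;                  // an executable step, e.g. an API call
  args: Record<string, string>;
}

const intention = "Book me a week in Lisbon in June";

const plan: Action[] = [
  { name: "search_flights",  args: { destination: "LIS", month: "June" } },
  { name: "search_hotels",   args: { city: "Lisbon", nights: "7" } },
  { name: "present_options", args: { channel: "chat" } },
  { name: "book_selection",  args: { requiresUserConfirmation: "true" } },
];
```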

Digital Assistants have been around for decades; many of us remember Clippy, which launched in 1996. In their most primitive form, Digital Assistants Understand what a user wants, Decide what action to take, and generate a Response. This simple “Understand Decide Respond” framework encompasses most Assistants, which today are often called “Agents”.

Clippy was hardcoded across the entire Understand Decide Respond framework. Every keyword or button Clippy listened for, every action it took, and every response it gave was handwritten and predetermined by a human creator. Deterministic Agents like Clippy give their creators ultimate control and never risk hallucinated responses, but that control comes at the expense of the user experience.

The inverse of the deterministic Agent is the probabilistic, artificial intelligence (AI) powered Agent. AI Agents predict a confidence-ranked range of possible things to say or do next based on given inputs, whereas deterministic Agents can only do or say what was pre-programmed by their human creator.
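
A minimal sketch of the contrast, with illustrative names: the deterministic agent only recognizes what was pre-programmed, while the probabilistic one scores and ranks candidate intents.

```typescript
// Deterministic: exact keyword matching, Clippy-style.
function deterministicUnderstand(utterance: string): string | null {
  if (utterance.toLowerCase().includes("letter")) return "offer_letter_help";
  return null; // anything unanticipated falls through unhandled
}

// Probabilistic: an AI model scores every known intent.
interface RankedIntent { intent: string; confidence: number; }

function probabilisticUnderstand(utterance: string): RankedIntent[] {
  // In practice this would call an NLU model or LLM; scores are faked here.
  return [
    { intent: "offer_letter_help", confidence: 0.72 },
    { intent: "format_document",   confidence: 0.21 },
    { intent: "small_talk",        confidence: 0.07 },
  ].sort((a, b) => b.confidence - a.confidence);
}
```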

AI Agents produce more fluid, natural experiences because conversations are so vast in scope that it’s impossible to pre-script everything a user might say to an Agent. This conversational flexibility comes at the cost of unintended “hallucinated” responses that can be detrimental to a brand.

As AI has evolved, it’s been slowly integrated into Agents across the Understand Decide Respond model. First, Natural Language Understanding (NLU) AI models were introduced to replace keyword matching and help Agents Understand what users want. Then Large Language Models (LLMs) were adopted to generate Responses dynamically, replacing manually written human responses. 

Today, most companies use NLUs for Understanding with hardcoded Decision logic and handwritten Responses. Some companies are now experimenting with using LLMs for Responses as well, particularly for FAQs where the risk of hallucination is lower. Few are using AI to power the Decisions of their Agents in production, as the risk is high when giving AI the power to perform tasks like issuing a refund. Large Action Models (LAMs) could change this, completing the transition of Agents from full determinism to full probabilism that began with Clippy (and others) in the 90s.
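
A sketch of this common architecture today, with invented intent names and refund rules: an AI model handles Understanding, while the Decision stays hardcoded and the Responses stay handwritten.

```typescript
// AI Understanding feeds hardcoded Decision logic; the AI never
// issues the refund directly. Thresholds and rules are illustrative.
interface NluResult { intent: string; confidence: number; }

function decide(nlu: NluResult, orderAgeDays: number): string {
  if (nlu.intent === "request_refund" && nlu.confidence > 0.9) {
    if (orderAgeDays <= 30) return "issue_refund";
    return "escalate_to_human";
  }
  return "clarify";
}

// Handwritten Responses, predetermined by a human creator.
const responses: Record<string, string> = {
  issue_refund:      "Your refund has been processed.",
  escalate_to_human: "Let me connect you with a human agent.",
  clarify:           "Sorry, could you rephrase that?",
};
```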

The concept of an action-taking Agent is as old as robots themselves (think C-3PO in Star Wars), and LAMs have been attempted before. One of the transformer paper’s authors (Ashish Vaswani) co-founded Adept in 2021 and, in 2022, released the ACT-1 model, a transformer for actions. The industry has also iterated through tool-use models like Toolformer and Gorilla, and agent frameworks built on top of them.

LAMs as a framework are exciting and offer the potential to make Agents fully probabilistic end-to-end across Understanding, Decision, and Response. Given the complexity of human intentions, I believe we still have some time until LAMs, or models like them, can break down user intention effectively and reliably for enterprise-scale, critical applications. However, let’s imagine for a second that LAMs became widespread, effective, and reliable today: how would the way we build Agents change?

How Agent building will change with LAMs

Today, Decision logic is best represented and built as decision-tree conversation logic, which we’ve all seen in tools like Voiceflow or Visio. With LAMs, these decision trees may still exist to document the order of operations, but they would be abstracted to only the highest-order concepts (e.g. “order placed”). Think of a barista taking an order: the conversation flows in and out of topics naturally. Only when you place the order does rigid logic get introduced, logic that is unchanging and impossible to make probabilistic. No matter how good LAMs get, Starbucks will still perform a hard logic check to ensure customers pay before they get their orders. Today’s decision trees map out every possible conversation edge case and state; decision trees with LAMs will be abstracted down to the order of operations for major functions and hard logic checks, with the conversation state in between handled probabilistically and logged in a transcript.
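
A sketch of what such a hard logic check could look like, with hypothetical names: the LAM navigates the conversation freely, but a deterministic guard sits in front of the few actions that must never be probabilistic.

```typescript
// The LAM can propose fulfillment whenever it judges the conversation
// is ready; this guard enforces the one rule that never bends.
interface OrderState { items: string[]; paid: boolean; }

function fulfillOrder(state: OrderState): string {
  // Hard logic check: no payment, no order, regardless of what the
  // LAM predicts the conversation should do next.
  if (!state.paid) {
    throw new Error("BLOCKED: payment must clear before fulfillment");
  }
  return `Order placed: ${state.items.join(", ")}`;
}
```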

With LAMs, the dialog managers that track conversation state will also evolve from decision-tree structures to transcripts. In other words, the model of dialog management will change from remembering where the user is in a conversation (decision tree) to what has been said in the conversation (transcript). These “transcript managers” will be fairly complex: they must optimize transcript delivery for context windows without burning too many tokens, intelligently passing the LLM or LAM just enough transcript information to be useful but not wasteful. This is a significant departure from today’s dialog managers and could require large refactors for pre-LLM conversational AI platforms to adapt.
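
A minimal sketch of what a transcript manager might do, assuming a crude four-characters-per-token heuristic and invented names: keep the most recent turns that fit a token budget and drop the oldest.

```typescript
// Instead of tracking a position in a decision tree, track what was
// said, trimmed to a budget before each model call.
interface Turn { speaker: "user" | "agent"; text: string; }

function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4); // rough heuristic, not a real tokenizer
}

function packTranscript(turns: Turn[], budget: number): Turn[] {
  const kept: Turn[] = [];
  let used = 0;
  // Walk backwards: recent turns matter most, so the oldest drop first.
  for (let i = turns.length - 1; i >= 0; i--) {
    const cost = estimateTokens(turns[i].text);
    if (used + cost > budget) break;
    kept.unshift(turns[i]);
    used += cost;
  }
  return kept;
}
```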

Creating Agents in a world of LAMs will feel similar to designing an employee handbook. Conversation Designers will design the permission systems for where, and when, a LAM can call a particular function. This is the “hard logic” I talked about before, in the same way a Starbucks barista may not be given the key to the store room. We will design how customers are handled, the tone to use, and the actions to take, informed by our uniquely empathetic human experience. We’ll spend our time reviewing conversation transcripts and coaching Agents on how and where to improve, and toward what goal. In this sense, platforms like Voiceflow change from being a place to write conversations to a place to train and manage AI employees. Perhaps we’ll consider Voiceflow an AI HR platform…
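
A sketch of what such a permission system could look like, with all names invented: a designer-authored table that says which functions the LAM may call and under what hard conditions, denying by default.

```typescript
// The "employee handbook": designers write the rules, the LAM works
// within them. Functions, fields, and thresholds are illustrative.
interface CallContext { userAuthenticated: boolean; orderValue: number; }

interface Permission {
  function: string;
  allowed: (ctx: CallContext) => boolean;
}

const handbook: Permission[] = [
  { function: "check_menu",   allowed: () => true },
  { function: "place_order",  allowed: (ctx) => ctx.userAuthenticated },
  // Like the store-room key: tightly scoped, never granted broadly.
  { function: "issue_refund", allowed: (ctx) => ctx.userAuthenticated && ctx.orderValue < 50 },
];

function mayCall(fn: string, ctx: CallContext): boolean {
  const rule = handbook.find((p) => p.function === fn);
  return rule ? rule.allowed(ctx) : false; // deny by default
}
```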

For many companies, the benefit of being first to adopt the cutting edge of LAMs is not worth the risk. Ultimately, the conversational experiences produced by LAMs may not feel markedly different from today’s top Agents produced by leading conversational AI teams. At Voiceflow we’re excited for a future where conversational Agents are easier to create, and conversations more fluid thanks to LAMs. We’re preparing our platform and customers for this reality by releasing Voiceflow Functions, which are isolated modules of functionality that could be permissioned by hard logic and executed by LAMs as they navigate the conversation probabilistically. I expect Voiceflow to retain elements of its canvas design platform for order-of-operation functions, with conversation state designs becoming markedly simpler. We cannot wait to see what new technology will enable for the industry and Voiceflow customers.

Will conversation designers still be relevant in a world of Large Action Models? Check out our latest Context Series video by Peter Isaacs discussing just that.
