Large Action Models change the way we build chatbots, again

Large Action Models (LAMs) are AI models trained to understand human intention and predict subsequent actions to take. Unlike LLMs, which predict what to say next, LAMs predict what to do next, breaking complex tasks such as booking a vacation or filing a tax return into smaller, executable steps. As LAMs advance, they will change the way we build chatbots, again.
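
To make the distinction concrete, here's a hypothetical sketch of the kind of structured plan a LAM might emit for a high-level intention. The action names and fields are invented for illustration and don't reflect any real model's output format.

```typescript
// Hypothetical sketch: where an LLM would predict the next *words*,
// a LAM predicts the next *steps*. All names here are invented.
interface Action {
  name: string;                  // an executable step, e.g. an API call
  args: Record<string, string>;
}

const intention = "Book me a week in Lisbon in June";

const plan: Action[] = [
  { name: "search_flights",  args: { destination: "LIS", month: "June" } },
  { name: "search_hotels",   args: { city: "Lisbon", nights: "7" } },
  { name: "present_options", args: { channel: "chat" } },
  { name: "book_selection",  args: { requiresUserConfirmation: "true" } },
];
```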

Digital Assistants have been around for decades; many of us remember Clippy, which launched in 1996. In their most primitive form, Digital Assistants Understand what a user wants, Decide what action to take, and generate a Response. This simple “Understand Decide Respond” framework encompasses most Assistants, which today are often called “Agents”.

Clippy was hardcoded across the entire Understand Decide Respond framework. Every keyword or button Clippy listened for, every action it took, and every response it gave was handwritten and predetermined by a human creator. Deterministic Agents like Clippy give their creators ultimate control and never risk hallucinated responses, but that control comes at the expense of the user experience.

The inverse of the deterministic Agent is the probabilistic, artificial intelligence (AI) powered Agent. AI Agents predict a confidence-ranked range of possible things to say or do next based on given inputs, whereas deterministic Agents can only do or say what was pre-programmed by their human creator.
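
A minimal sketch of the contrast, with illustrative names: the deterministic agent only recognizes what was pre-programmed, while the probabilistic one scores and ranks candidate intents.

```typescript
// Deterministic: exact keyword matching, Clippy-style.
function deterministicUnderstand(utterance: string): string | null {
  if (utterance.toLowerCase().includes("letter")) return "offer_letter_help";
  return null; // anything unanticipated falls through unhandled
}

// Probabilistic: an AI model scores every known intent.
interface RankedIntent { intent: string; confidence: number; }

function probabilisticUnderstand(utterance: string): RankedIntent[] {
  // In practice this would call an NLU model or LLM; scores are faked here.
  return [
    { intent: "offer_letter_help", confidence: 0.72 },
    { intent: "format_document",   confidence: 0.21 },
    { intent: "small_talk",        confidence: 0.07 },
  ].sort((a, b) => b.confidence - a.confidence);
}
```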

AI Agents produce more fluid, natural experiences because conversations are so vast in scope that it’s impossible to pre-script everything a user might say to an Agent. This conversational flexibility comes at the cost of unintended “hallucinated” responses that can be detrimental to a brand.

As AI has evolved, it’s been slowly integrated into Agents across the Understand Decide Respond model. First, Natural Language Understanding (NLU) AI models were introduced to replace keyword matching and help Agents Understand what users want. Then Large Language Models (LLMs) were adopted to generate Responses dynamically, replacing manually written human responses. 

Today, most companies use NLUs for Understanding with hardcoded Decision logic and handwritten Responses. Some companies are now experimenting with using LLMs for Responses as well, particularly for FAQs where the risk of hallucination is lower. Few are using AI to power the Decisions of their Agents in production, as the risk is high when giving AI the power to perform tasks like issuing a refund. Large Action Models (LAMs) could change this, completing the transition of Agents from full determinism to full probabilism that began with Clippy (and others) in the 90s.
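
A sketch of this common architecture today, with invented intent names and refund rules: an AI model handles Understanding, while the Decision stays hardcoded and the Responses stay handwritten.

```typescript
// AI Understanding feeds hardcoded Decision logic; the AI never
// issues the refund directly. Thresholds and rules are illustrative.
interface NluResult { intent: string; confidence: number; }

function decide(nlu: NluResult, orderAgeDays: number): string {
  if (nlu.intent === "request_refund" && nlu.confidence > 0.9) {
    if (orderAgeDays <= 30) return "issue_refund";
    return "escalate_to_human";
  }
  return "clarify";
}

// Handwritten Responses, predetermined by a human creator.
const responses: Record<string, string> = {
  issue_refund:      "Your refund has been processed.",
  escalate_to_human: "Let me connect you with a human agent.",
  clarify:           "Sorry, could you rephrase that?",
};
```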

The concept of an action-taking Agent is as old as robots themselves (think C-3PO in Star Wars), and LAMs have been attempted before. One of the transformer paper’s authors (Ashish Vaswani) co-founded Adept in 2021 and, in 2022, released the ACT-1 model, a transformer for actions. The industry has also iterated through tool-use models like Toolformer and Gorilla, and agent frameworks built on top of them.

LAMs as a framework are exciting and offer the potential to make Agents fully probabilistic end-to-end across Understanding, Decision, and Response. Given the complexity of human intentions, I believe we still have some time until LAMs, or models like them, can break down user intention effectively and reliably for enterprise-scale, critical applications. However, let’s imagine for a second that LAMs became widespread, effective, and reliable today: how would the way we build Agents change?

How Agent building will change with LAMs

Today, Decision logic is best represented and built as decision-tree conversation logic, which we’ve all seen in tools like Voiceflow or Visio. With LAMs, these decision trees may still exist to document the order of operations, but they would be abstracted to only the highest-order concepts (e.g. “order placed”). Think of a barista taking an order: the conversation flows in and out of topics naturally. Only when you place the order does rigid logic get introduced, logic that is unchanging and impossible to make probabilistic. No matter how good LAMs get, Starbucks will still perform a hard logic check to ensure customers pay before they get their orders. Today’s decision trees map out every possible conversation edge case and state; decision trees with LAMs will be abstracted down to the order of operations for major functions and hard logic checks, with the conversation state in between handled probabilistically and logged in a transcript.
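
A sketch of what such a hard logic check could look like, with hypothetical names: the LAM navigates the conversation freely, but a deterministic guard sits in front of the few actions that must never be probabilistic.

```typescript
// The LAM can propose fulfillment whenever it judges the conversation
// is ready; this guard enforces the one rule that never bends.
interface OrderState { items: string[]; paid: boolean; }

function fulfillOrder(state: OrderState): string {
  // Hard logic check: no payment, no order, regardless of what the
  // LAM predicts the conversation should do next.
  if (!state.paid) {
    throw new Error("BLOCKED: payment must clear before fulfillment");
  }
  return `Order placed: ${state.items.join(", ")}`;
}
```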

With LAMs, the dialog managers that track conversation state will also evolve from decision-tree structures to transcripts. In other words, the model of dialog management will change from remembering where the user is in a conversation (decision tree) to what has been said in the conversation (transcript). These “transcript managers” will be fairly complex: they must optimize transcript delivery for context windows without burning too many tokens, intelligently passing the LLM or LAM just enough transcript information to be useful but not wasteful. This is a significant departure from today’s dialog managers and could require large refactors for pre-LLM conversational AI platforms to adapt.
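
A minimal sketch of what a transcript manager might do, assuming a crude four-characters-per-token heuristic and invented names: keep the most recent turns that fit a token budget and drop the oldest.

```typescript
// Instead of tracking a position in a decision tree, track what was
// said, trimmed to a budget before each model call.
interface Turn { speaker: "user" | "agent"; text: string; }

function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4); // rough heuristic, not a real tokenizer
}

function packTranscript(turns: Turn[], budget: number): Turn[] {
  const kept: Turn[] = [];
  let used = 0;
  // Walk backwards: recent turns matter most, so the oldest drop first.
  for (let i = turns.length - 1; i >= 0; i--) {
    const cost = estimateTokens(turns[i].text);
    if (used + cost > budget) break;
    kept.unshift(turns[i]);
    used += cost;
  }
  return kept;
}
```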

Creating Agents in a world of LAMs will feel similar to designing an employee handbook. Conversation Designers will design the permission systems for where, and when, a LAM can call a particular function. This is the “hard logic” I talked about before, in the same way a Starbucks barista may not be given the key to the store room. We will design how customers are handled, the tone to use, and the actions to take, informed by our uniquely empathetic human experience. We’ll spend our time reviewing conversation transcripts and coaching Agents on how and where to improve, and toward what goal. In this sense, platforms like Voiceflow change from being a place to write conversations to a place to train and manage AI employees. Perhaps we’ll consider Voiceflow an AI HR platform…
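
A sketch of what such a permission system could look like, with all names invented: a designer-authored table that says which functions the LAM may call and under what hard conditions, denying by default.

```typescript
// The "employee handbook": designers write the rules, the LAM works
// within them. Functions, fields, and thresholds are illustrative.
interface CallContext { userAuthenticated: boolean; orderValue: number; }

interface Permission {
  function: string;
  allowed: (ctx: CallContext) => boolean;
}

const handbook: Permission[] = [
  { function: "check_menu",   allowed: () => true },
  { function: "place_order",  allowed: (ctx) => ctx.userAuthenticated },
  // Like the store-room key: tightly scoped, never granted broadly.
  { function: "issue_refund", allowed: (ctx) => ctx.userAuthenticated && ctx.orderValue < 50 },
];

function mayCall(fn: string, ctx: CallContext): boolean {
  const rule = handbook.find((p) => p.function === fn);
  return rule ? rule.allowed(ctx) : false; // deny by default
}
```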

For many companies, the benefit of being first to adopt the cutting edge of LAMs is not worth the risk. Ultimately, the conversational experiences produced by LAMs may not feel markedly different from today’s top Agents produced by leading conversational AI teams. At Voiceflow we’re excited for a future where conversational Agents are easier to create, and conversations more fluid thanks to LAMs. We’re preparing our platform and customers for this reality by releasing Voiceflow Functions, which are isolated modules of functionality that could be permissioned by hard logic and executed by LAMs as they navigate the conversation probabilistically. I expect Voiceflow to retain elements of its canvas design platform for order-of-operation functions, with conversation state designs becoming markedly simpler. We cannot wait to see what new technology will enable for the industry and Voiceflow customers.

Will conversation designers still be relevant in a world of Large Action Models? Check out our latest Context Series video by Peter Isaacs discussing just that.
