Groq AI in 2026: Nvidia Deal, LPU Architecture, GroqCloud, and What It Means for Builders

Expert written and reviewed by Voiceflow team
Table of contents
    Don't get left behind in AI
    Get the latest AI news and industry shifts weekly.

    The story of Groq just changed. On December 24, 2025, Nvidia announced a non-exclusive licensing deal worth roughly $20 billion for Groq's Language Processing Unit (LPU) inference architecture, plus the move of founder Jonathan Ross and president Sunny Madra to Nvidia. Groq continues to operate as an independent company under new CEO Simon Edwards, but the competitive landscape that defined the AI chip conversation through 2024 and most of 2025 has reorganized around this deal.

    This guide covers what Groq actually is in 2026, what the Nvidia deal means for builders considering the platform, the LPU architecture that drove the whole thing, and how to use GroqCloud's current model lineup. If you're picking inference infrastructure or evaluating whether to build voice and real-time agents on Groq, this is the practical overview.

    What Is Groq?

    Groq is a US chipmaker focused on AI inference. The company designs and operates its own inference hardware (the LPU) and runs GroqCloud, a token-as-a-service platform that lets developers run open-source large language models like Llama 4 and DeepSeek R1 at speeds that conventional GPUs can't match.

    The headline number that gets repeated about Groq: Llama 4 Scout currently runs at over 460 tokens per second on GroqCloud, compared to roughly 100–150 tok/s for the same model on Nvidia H100 hardware. That speed advantage is the entire pitch.

    Groq Funding and Recent Events

    Before the Nvidia deal, Groq had raised roughly $1.75 billion in total equity across multiple rounds. The most recent equity event was a Series E in September 2025 that brought in $750 million at a $6.9 billion post-money valuation. The round was led by Disruptive, with participation from BlackRock, Neuberger Berman, Samsung, Cisco, DTCP, D1 Capital, Altimeter, 1789 Capital, Infinitum, and earlier backers including Tiger Global.

    Three months later, the Nvidia licensing deal added a $20 billion payment for non-exclusive access to Groq's inference IP. The capital structure is unusual: the $20 billion is a licensing payment that flowed to Groq's shareholders and the company, not a traditional acquisition price, and it's why the deal closed in under four months instead of the year-plus an M&A review would have taken.

    What the Nvidia $20 Billion Licensing Deal Means

    The deal has four moving parts that matter for anyone evaluating Groq:

    1. Non-exclusive license, not acquisition. Nvidia gets a license to Groq's inference architecture, but Groq retains the IP and continues operating GroqCloud as an independent business. Customers running on GroqCloud don't suddenly become Nvidia customers.
    2. Leadership transfer. Founder Jonathan Ross and president Sunny Madra moved to Nvidia. Simon Edwards is now Groq's CEO. The technical team that built the LPU mostly went to Nvidia along with the IP.
    3. Regulatory scrutiny. Senators Elizabeth Warren and Richard Blumenthal asked the FTC whether the deal violates Hart-Scott-Rodino premerger notification rules by using a licensing structure to sidestep antitrust review. As of May 2026 the deal stands, but the inquiry is open.
    4. Validation of the architecture. Nvidia, the dominant AI infrastructure company, paid $20 billion for non-exclusive access to a competitor's chip architecture. That's the loudest possible signal that LPU-style inference is becoming structurally important to the AI compute stack.

    For builders, the practical near-term implication is that GroqCloud continues to run, the model catalog continues to expand, and the speed advantage continues to hold. The medium-term question is whether Nvidia's licensed implementation of the LPU architecture will compete with GroqCloud directly. That's worth watching.

    Language Processing Units (LPUs), Explained

    Groq's primary chip is the Language Processing Unit (LPU), originally branded as the Tensor Streaming Processor (TSP). The LPU is purpose-built for inference (running trained models) as opposed to training, which is a different and more compute-heavy workload.

    The architecture difference vs. GPUs:

    • Deterministic execution. Each compute cycle is scheduled at compile time, not arbitrated at runtime. That eliminates the unpredictable latency GPUs introduce when batching variable-length inputs.
    • On-die SRAM, no external memory. Each LPU chip has up to 230 MB of SRAM with 80 TB/s on-die bandwidth. GPUs rely on slower external HBM memory; LPUs avoid the round-trip entirely.
    • Single-thread linear scaling. LPUs scale across multi-server and multi-rack deployments near-linearly without external switching fabric.

    These choices trade flexibility for speed. LPUs are not good general-purpose chips: they're inference chips, and they're optimized for the specific pattern of running a trained transformer model token-by-token. For that one workload, they're meaningfully faster.

    What Is AI Inference, and How Is It Different from Training?

    Inference is the act of running a trained model on new input to produce output. Training is the act of building the model in the first place by feeding it labeled data and iteratively adjusting weights. The two have different compute profiles:

    • Training is a one-time (or periodic) batch workload. It runs for days or weeks on thousands of GPUs and produces a model file.
    • Inference is a continuous online workload. It runs every time someone asks the model a question. For a popular model, inference compute totals dwarf training compute over the model's lifetime.

    Groq's bet from the start was that inference would become the dominant cost in AI, and that the chip best suited to inference would not be the same chip that trains models. The Nvidia deal validates that bet. Nvidia's H100 and Blackwell GPUs are training-optimized chips that also do inference. Groq's LPU is an inference-only chip that's faster at the inference job.

    Groq + Nvidia: How the Partnership Reshapes the Roadmap

    The pre-2026 framing of "Groq vs. Nvidia" was a clean rivalry. Groq the startup, faster but smaller. Nvidia the incumbent, with 80%+ of the AI accelerator market. After the December 2025 licensing deal, the framing flipped: Nvidia now has the IP, Groq still has the cloud business, and the question for builders is which one of them to deploy on.

    Two reasonable paths in 2026:

    • GroqCloud directly. Use Groq's own inference cloud. Fast, low cost, open-source-only catalog. Best for builders who need maximum speed and are comfortable with Llama / DeepSeek / Qwen / GPT-OSS models.
    • Nvidia-implemented LPU inference. Nvidia will deploy the licensed LPU architecture in its own DGX Cloud and through partners. For teams already locked into the Nvidia stack (CUDA, NIM, Triton), staying on Nvidia and getting LPU-style speed is a low-friction path.

    The pricing dynamics aren't fully settled yet (Nvidia hadn't published commercial LPU pricing as of May 2026), but Groq's "guarantees to beat published prices per million tokens by other providers for equivalent models" stance from 2024 remains in place. For most builders, GroqCloud directly is the right starting point until Nvidia's commercial implementation is clearer.

    GroqCloud Model Catalog

    Groq's catalog is open-source-only. No GPT-4, no Claude, no Gemini. If you need proprietary frontier models, you're going to another provider for those (and many builders run a hybrid: Groq for fast open-source inference, Anthropic or OpenAI for frontier reasoning).

    As of May 2026, the supported model list includes:

    • Llama 4 Scout and Llama 4 Maverick (Meta's latest, day-zero access on GroqCloud)
    • Llama 3.3 70B and Llama 3.1 8B
    • DeepSeek R1 Distill 70B (see DeepSeek for context on the underlying model)
    • GPT-OSS 120B (OpenAI's open-weight release)
    • Kimi K2
    • Qwen3 32B and Qwen QwQ 32B (reasoning model)
    • Whisper Large v3 (speech-to-text)
    • Mistral Saba 24B (see Mistral AI for the broader model line)
    • Allam 2 7B (Arabic-focused)

    Model availability changes frequently. Check the live Groq supported models page for the current catalog, and the model deprecation page before standardizing on a specific model.

    Free Tier and Pricing

    Groq runs three tiers:

    • Free. No card required. Rate-limited (around 30 requests per minute, 6,000 tokens per minute on most models, ~14,400 requests per day). Ideal for prototyping and small-scale experimentation.
    • On-Demand (pay per token). Higher rate limits, priority support, billed per million tokens. Pricing varies by model, so see the live pricing page for current rates since the catalog and per-model pricing rotate frequently.
    • Business. Custom rate limits, fine-tuned model hosting, dedicated SLAs.

    Get a free API key at console.groq.com/keys.

    When to Pick Groq

    Use cases where Groq's speed advantage actually changes what you can ship:

    • Voice agents and real-time conversational AI. Phone calls have a sub-second latency budget per turn. Faster LLM inference is the single biggest win.
    • Streaming-heavy chat interfaces. Users perceive 400 tok/s as instantaneous; 100 tok/s feels sluggish on long responses.
    • Agentic loops with many sequential LLM calls. Agentic AI workflows that chain 5–20 LLM calls before responding amplify the per-call latency advantage.
    • High-throughput batch inference. Document summarization, classification at scale, RAG indexing. Groq's per-token cost is competitive and the speed gets you through the corpus faster.
    • Cost-sensitive open-source-only deployments. If you're committed to Llama / DeepSeek / Qwen and don't need proprietary models, Groq is usually cheaper per million tokens than the equivalent Nvidia-cloud inference.

    Cases where Groq is the wrong pick:

    • You need GPT-5, Claude, or Gemini specifically. Use the provider directly or a hybrid setup.
    • You need fine-tuning on a proprietary model.
    • You need vision or video on a model that isn't in Groq's catalog.

    Why Groq Matters for Voice Agents

    Voice is where latency matters most. A phone-based AI agent has roughly 500–800 ms of end-to-end latency budget per turn before callers notice "weird pauses." That budget has to cover speech-to-text, LLM inference, text-to-speech, and network round-trips.

    On a typical GPU inference stack, the LLM call alone consumes 400–600 ms for a moderate response. That leaves almost nothing for STT and TTS. The result is the well-known phone-agent feel of slow turn-taking.

    On Groq, the same LLM call completes in 100–150 ms. That puts the full turn comfortably under the 800 ms threshold, and the agent feels natural. For AI phone calls and AI call-center agents, this isn't a "nice to have." It's the difference between an agent customers tolerate and one they hang up on.

    Conversation design for voice is also easier with Groq's speed. Designers can write longer, more nuanced agent responses without worrying about whether the TTS will start before the LLM finishes generating.

    Using Groq in Voiceflow

    Voiceflow integrates Groq as a first-party LLM provider in the Agent Builder. That means you can pick Groq-hosted Llama 4 or DeepSeek R1 from the model dropdown the same way you'd pick GPT or Claude, without a custom integration. The provider list includes Anthropic, OpenAI, Google, Groq, and others, plus an OpenRouter test tier for experimenting with anything else.

    What that gives you:

    • Build once, swap models freely. Test the same agent design on Claude, GPT, and Groq-hosted Llama 4 from a dropdown. Voice agents often start on a proprietary model and migrate to Groq for production once the prompt is stable.
    • Knowledge Base + Playbooks on a fast backend. Voiceflow's Knowledge Base (RAG) and Playbooks (LLM-driven sub-agents that use tools) run on whichever model you select. Pointing them at Groq-hosted models is the fastest path to a low-latency agent framework on a no-code canvas.
    • Voice channel out of the box. Voiceflow's native voice channel plus Groq's LPU speed plus Twilio for telephony is the standard low-latency stack for production phone agents.

    Frequently Asked Questions

    Why is Nvidia buying Groq?

    Nvidia paid roughly $20 billion in December 2025 for a non-exclusive license to Groq's LPU architecture, plus the move of CEO Jonathan Ross and president Sunny Madra to Nvidia. The structure is licensing-plus-acqui-hire rather than acquisition, which let the deal close in under four months without a full M&A review. The strategic logic: LPU-style deterministic inference is becoming structurally important to AI compute, and Nvidia wanted access to the IP and the team rather than letting Groq capture inference share at GPU's expense. Senators Warren and Blumenthal are probing whether the licensing structure improperly sidesteps Hart-Scott-Rodino review.

    Is Groq AI free?

    Yes for evaluation, with rate limits. Groq Chat (the consumer-facing interface) is free. The Groq API has a free tier with around 30 requests per minute and ~14,400 requests per day on most models. Production usage typically needs the paid On-Demand or Business tier.

    Who owns Groq?

    Groq is a privately held company. Original founders Jonathan Ross (former Google engineer, designed the original Google TPU) and Douglas Wightman led the founding team in 2016. Equity is held by employees and a long list of institutional investors, including Tiger Global, Disruptive, BlackRock, Samsung, Cisco, D1 Capital, and Altimeter. After the Nvidia licensing deal, Simon Edwards is the CEO.

    Is Groq publicly traded?

    No. Groq is a private company. Shares are not listed on any stock exchange, and the company is not required to disclose financials. Pre-IPO secondary-market platforms occasionally offer Groq shares to accredited investors.

    How do I invest in Groq?

    As a private company, Groq isn't accessible through public markets. The realistic paths for individual investors: pre-IPO secondary platforms (Forge, EquityZen), venture and private-equity funds that hold Groq positions, or waiting for a future IPO. Most retail investors are better off treating Groq as a strategic player to watch rather than an investable position.

    Groq AI vs. Grok by Elon Musk: what's the difference?

    They're different products with confusingly similar names. Groq (with a Q) is the AI inference hardware and cloud company covered in this article. Grok (with a K) is the xAI conversational AI assistant Elon Musk built for X (formerly Twitter). Groq predates Grok by years and trademarked its name first.

    Conclusion

    The Groq story in 2026 is a different story than the Groq story in 2024. The chip architecture that drove the company forward has been validated by the loudest possible signal: Nvidia paid $20 billion for a license to it. And the company continues as an independent inference cloud with new leadership and a current open-source model catalog.

    For builders, the practical decision is whether you need the speed. If you're shipping voice agents, real-time conversational AI, or agentic loops, the LPU advantage is concrete and measurable. If you're shipping non-real-time batch text processing, conventional GPU inference is fine.

    The Voiceflow integration makes the trial trivial. Pick Groq from the model dropdown, point your existing agent design at Llama 4 or DeepSeek R1, and measure the latency difference for yourself.

    background lines
    background lines