The production AI agent stack: tools and infrastructure you actually need

There is a well-known gap between a demo and a production system. With AI agents that gap is wider than usual, because the failure modes in production are not just bugs — they are actions. An agent that goes wrong in a notebook produces a bad output. An agent that goes wrong in production sends emails, moves money, and modifies data.

The tooling ecosystem for AI agents has matured significantly in the last two years. There are now good options at every layer of the stack. The challenge is understanding what each layer is actually responsible for, which tools are genuinely production-ready, and where the gaps still are. This post walks through the full stack.

The model layer

Everything starts with the LLM. For production agents, the model choice involves more than benchmark scores. You are optimizing across capability, latency, cost, and context window — and those four dimensions rarely move in the same direction.

OpenAI GPT-4o and o-series

GPT-4o remains the most widely deployed model for agentic workloads. It has strong tool-calling reliability, broad ecosystem support, and a large enough context window (128k tokens) for most retrieval-augmented workloads. The o-series models (o3, o4-mini) are worth evaluating for tasks requiring extended reasoning, though they add significant latency per step — something that compounds across a multi-step agent loop.

Anthropic Claude

Claude Sonnet has become the model of choice for agents handling long documents or complex multi-turn reasoning. Its 200k token context window and strong instruction-following under long context make it particularly well-suited for research and analysis agents. Claude also tends to be more conservative about tool use — it will often ask for clarification rather than guess, which can be a feature or a friction point depending on your application.

Google Gemini

Gemini 2.0 Flash has emerged as a strong option for latency-sensitive agents. Its speed and cost profile make it worth benchmarking for agents that make many small tool calls, where a cheaper-per-token model compounds into meaningful savings at scale.

In practice, most production teams end up with a two-model strategy: a faster, cheaper model for high-frequency decisions and routine tool calls, and a more capable model for steps that require complex reasoning. Routing between them based on task complexity is a real optimization — not just a cost-cutting measure.

Orchestration frameworks

The orchestration layer is where your agent logic lives: how tools are defined, how the agent decides what to do next, and how multi-step or multi-agent workflows are structured. The major frameworks have converged on similar primitives but make very different tradeoffs.

LangChain

LangChain is the most widely adopted framework, which means it has the largest ecosystem of integrations and the most answers on Stack Overflow. Its LangGraph extension is now the recommended approach for building stateful, multi-step agents — it models agent execution as a directed graph where nodes are functions and edges are conditional transitions. This gives you explicit control over agent flow that you don't get from a simple ReAct loop.

langchain + langgraph

from langgraph.graph import StateGraph, END
from langchain_openai import ChatOpenAI
from langchain_core.tools import tool

@tool
def issue_refund(order_id: str, amount: float) -> str:
    """Issue a refund for an order."""
    # ... your implementation
    return f"Refund of {amount} issued for {order_id}"

llm = ChatOpenAI(model="gpt-4o")
llm_with_tools = llm.bind_tools([issue_refund])

# Build a graph instead of a flat ReAct loop
graph = StateGraph(AgentState)
graph.add_node("agent", call_model)
graph.add_node("tools", call_tools)
graph.add_conditional_edges("agent", route_tools)

CrewAI

CrewAI takes a higher-level approach: you define agents with roles, goals, and tools, and assign them tasks. CrewAI handles the coordination between agents without requiring you to wire the execution graph yourself. This makes it fast to get started for multi-agent workflows, at the cost of less control over exactly how coordination happens. It runs on top of LangChain under the hood, so most LangChain integrations work.

OpenAI Agents SDK

OpenAI's Agents SDK is the newest major entry. It is deliberately minimal: agents, tools, and handoffs between agents are the core primitives. If you are building on GPT-4o and want something closer to the metal than LangChain, this is worth a look. The tracing built into the SDK and its compatibility with the OpenAI platform make it a natural fit if you are already heavily invested in the OpenAI ecosystem.

LlamaIndex

LlamaIndex is the best framework specifically for retrieval-augmented agents — agents that need to query large document corpora as part of their tool loop. Its data connectors, query engines, and retrieval primitives are more mature than LangChain's equivalent. If your agent's primary capability is answering questions over a large document corpus or knowledge base, LlamaIndex is the right starting point.

AutoGen / AG2

AG2 (the community fork of Microsoft's AutoGen) is designed for conversational multi-agent systems where agents debate, verify each other's work, or collaborate on complex tasks. It is the right framework when the core pattern is agent-to-agent dialogue rather than agent-to-tool execution. Less common in production agentic applications, more common in research and automation workflows.

How to choose

Framework choice matters less than people think at the start, and more than people think once they are deep into a codebase. A practical heuristic:

·Document-heavy retrieval workloads → LlamaIndex
·Multi-agent role-based workflows → CrewAI
·Complex stateful agent graphs with explicit control → LangGraph
·OpenAI-native, minimal abstraction → OpenAI Agents SDK
·Custom agent loop, full control → plain Python + raw API calls

All major frameworks support the same underlying model providers, tool-calling conventions, and memory patterns. Switching later is painful but possible. Pick based on the primary shape of your problem, not on ecosystem momentum.

Memory and retrieval

Context windows are not memory. They are a buffer. For agents that need to remember anything beyond a single session — user history, domain knowledge, prior decisions — you need an external memory system.

Vector stores

Vector stores are the standard solution for semantic retrieval: embed documents and queries into a shared vector space, retrieve by similarity. The major options:

·Pinecone — managed, scales to hundreds of millions of vectors without operational overhead. Best choice if you do not want to manage infrastructure.
·Chroma — open source, runs in-process or as a server. Excellent for development and small-to-medium production deployments.
·Weaviate — open source with a managed cloud option. Strong support for hybrid search (vector + keyword), which often outperforms pure semantic search in production.
·pgvector — a Postgres extension that adds vector similarity search. If you are already on Postgres, this keeps your operational footprint small. Does not scale as well as dedicated vector databases at very high query volumes.

Structured memory

Not everything belongs in a vector store. User preferences, account state, prior decisions — these are structured facts that are better stored relationally and injected into context selectively. A common production pattern is a two-tier memory system: a relational database for structured facts, a vector store for semantic retrieval of unstructured content. The agent query layer decides which tier to consult based on the type of lookup required.

Tool integrations and the MCP standard

Tools are what make agents useful and what make them dangerous. The definition of a tool in any major framework is essentially a function with a schema: a name, a description the LLM uses to decide when to call it, and typed parameters.

The emerging standard for tool interoperability is Model Context Protocol (MCP), introduced by Anthropic and now supported by most major frameworks. MCP defines a standard wire format for tools, resources, and prompts — the goal being that a tool server built once can be consumed by any MCP-compatible agent runtime. Early adoption is solid in developer tooling; broader production adoption is still maturing.

For most teams today, tools are implemented directly as Python functions and registered with the framework of choice. The practical advice is to keep tool implementations thin: the function validates inputs, calls an underlying service, and returns a result. Business logic that determines whether a tool should have been called at all belongs outside the tool implementation.

Observability and tracing

This is the layer most teams underinvest in early and regret later. An agent that works in testing and fails silently in production, for reasons you cannot reconstruct, is a production incident waiting to happen.

LangSmith

LangSmith is the observability product from the LangChain team and integrates zero-configuration with LangChain and LangGraph. It traces every LLM call, tool call, and chain step. For teams on LangChain, this is the natural starting point: set two environment variables and you have full execution traces in a UI.

Langfuse

Langfuse is framework-agnostic, open source (self-hostable), and increasingly the choice for teams that want observability without vendor lock-in. It supports tracing via OpenTelemetry, a Python SDK, and direct integrations with LangChain and the OpenAI SDK. If you need to self-host your traces for compliance reasons, Langfuse is the most mature option.

Arize Phoenix

Arize Phoenix sits at the intersection of observability and evaluation. Beyond tracing, it provides built-in evaluators for hallucination, relevance, and toxicity — useful when you want to understand not just what your agent did, but how well it did it. Good choice for teams doing continuous evaluation in production.

Arden

LangSmith, Langfuse, and Phoenix all answer the same question: what did the agent do? They trace the happy path — LLM calls made, tools invoked, outputs returned. They do not record what the agent tried to do but was blocked from doing, what a human approved or denied, or what the session cost in tokens across every model call. Arden fills that gap. Every tool call — allowed, blocked, or pending human approval — is logged with its full arguments and decision reason. Token usage and estimated LLM cost are captured automatically for LangChain, CrewAI, and the OpenAI Agents SDK with no extra instrumentation. Sessions are tagged with an ID so you can replay exactly what an agent did across an entire conversation.

The distinction matters most during incidents. When a customer disputes a charge your agent processed, you need to know exactly which tool was called, with what arguments, at what time, and whether a human approved it. A trace that shows the LLM decided to process a refund is not the same as an audit log that shows the refund tool was called withamount=850.00, matched the policy, and was approved by a named reviewer at a specific timestamp. The latter is what compliance and customer support actually need.

Runtime safety and guardrails

Observability tells you what happened. Runtime safety governs what is allowed to happen in the first place. Most teams need both, but they are different layers with different tools.

What the framework-level options provide

Most frameworks include some guardrail primitives. LangChain has output parsers and validators. The OpenAI Agents SDK has input/output guardrails as first-class concepts. NVIDIA NeMo Guardrails provides a DSL for defining conversational rails. These are all text-layer controls: they operate on what the model says, not on what the model's decisions actually cause.

For an agent that only generates text, that is sufficient. For an agent that calls functions with real-world consequences — issuing refunds, sending emails, modifying records — it is not. Text-layer guardrails do not intercept the function call. They evaluate the model's representation of its intent. By the time the function executes, the guardrail has already passed.

Runtime enforcement at the tool call layer

The alternative is to intercept tool calls before they execute: evaluate the function name and actual arguments against a policy, and block or pause the call regardless of what the model intended. This approach operates on structured data — typed values, not natural language — which makes it deterministic and auditable in a way that model-based guardrails are not.

agent.py

import ardenpy as arden

# One call at startup. Every tool call is now policy-checked.
arden.configure(api_key="arden_live_...")

# For LangChain, CrewAI, OpenAI Agents SDK:
# configure() intercepts automatically — no changes to tool code.

# For custom agents, wrap explicitly:
safe_refund = arden.guard_tool("issue_refund", issue_refund)
result = safe_refund(amount=850.0, order_id="FF-4210")
# → blocked if amount > policy threshold
# → paused for human approval if configured
# → logged regardless of outcome

Arden is built around this pattern. It auto-patches LangChain, CrewAI, and the OpenAI Agents SDK at configure() time, so every tool call in the process is intercepted without modifying your tool implementations. Policies are configured per tool in the dashboard — allow, block, or require human approval, with optional conditions like amount thresholds or recipient domain matching. Changes take effect without redeployment.

The human-in-the-loop piece is worth calling out specifically. For high-value or irreversible operations, requiring a human to approve before the tool executes is the only mechanism that provides a true safety boundary. A policy that says "require approval for refunds over $200" means a human clicks Approve before that function ever runs — not a confirmation prompt inside the agent context that an adversarial instruction could bypass.

Arden's audit log and the execution traces from LangSmith or Langfuse are complementary, not redundant. Execution traces show the full reasoning chain — every LLM call and its inputs and outputs. Arden's log shows the governance layer — which tool calls were evaluated, what the policy decided, and what a human approved. Use both.

Evaluation and testing

Agents are not deterministic. The same input can produce different tool call sequences across runs, which makes traditional unit testing insufficient. You need evaluation pipelines that measure behavior across a distribution of inputs.

promptfoo

promptfoo is an open-source CLI tool for running LLM evaluations. You define test cases with inputs and assertions, and it runs them across models or prompt variants, reporting pass/fail and diffs. It integrates well into CI pipelines and is a natural fit for catching prompt regressions before deployment.

Braintrust

Braintrust is a managed evaluation platform with experiment tracking, dataset management, and LLM-as-judge scoring. It is more opinionated than promptfoo but provides a much richer UI for exploring evaluation results over time. Good choice for teams running continuous evals with human review workflows.

What to evaluate

Most teams start with output quality and add safety evaluations later. In practice the ordering should be reversed: before measuring how good your agent is, measure what it will do when given adversarial inputs. For agents that take real actions, the cost of a safety failure in production far exceeds the cost of a quality failure.

·Tool call accuracy. Does the agent call the right tool with the right arguments given a known input? This is the most direct measure of agent reliability.
·Adversarial inputs. Does the agent behave correctly given inputs designed to elicit unsafe behavior, off-topic responses, or policy violations?
·Refusal rate. Does the agent correctly decline requests it should not fulfill? Under-refusal and over-refusal are both problems.
·Latency per step. Multi-step agents compound LLM latency. Benchmark end-to-end time across representative workflows, not just single-turn latency.

Infrastructure and async execution

Agents that involve human-in-the-loop approval or long-running tasks need asynchronous execution infrastructure. An agent that blocks a request thread for 10 minutes waiting for a human to click Approve is not a viable architecture.

The standard approach is to push the agent execution off the request thread into a background job queue. Celery with Redis is the most common Python stack for this. The agent task runs in a worker, reaches a point requiring approval, and suspends — persisting its state to Redis or a database. When the approval arrives (via webhook, polling, or the Slack integration), the worker resumes.

async agent execution

# Celery task that runs the agent
@celery.task
def run_agent_task(session_id: str, input: str):
    arden.set_session(session_id)

    # approval_mode="webhook" returns immediately when approval is needed
    # Arden POSTs to your webhook URL when a human decides
    safe_refund = arden.guard_tool(
        "issue_refund", issue_refund,
        approval_mode="webhook",
        on_approval=handle_approved,
        on_denial=handle_denied,
    )
    return agent.run(input)

Rate limiting is also a production concern that most demos skip. An agent loop that calls an LLM 20 times per user session can quickly hit provider rate limits at moderate traffic. Build rate limiting into the orchestration layer, not just the HTTP layer, and set per-agent token budgets before you need them.

Putting it together

A minimal production-grade agent stack looks something like this:

·Model: GPT-4o or Claude Sonnet depending on the reasoning requirements of your task
·Framework: LangGraph for complex stateful flows, OpenAI Agents SDK for simpler tool-calling agents
·Memory: pgvector if you are already on Postgres, Pinecone if you need to scale retrieval independently
·Observability: Langfuse (self-hosted) or LangSmith (managed) for execution traces; Arden for tool call audit logs, token cost tracking, and session replay
·Runtime safety: Arden for policy enforcement and human-in-the-loop approval on sensitive tool calls
·Evaluation: promptfoo in CI for regression testing, Braintrust for continuous production evals
·Infrastructure: Celery + Redis for async agent execution and approval flows

None of these choices are irreversible. The model, framework, and memory layers are all swappable with effort. The layers that are hardest to retrofit later are observability and safety: if you have not instrumented your agents from the start, reconstructing what happened in a production incident is extremely difficult. Build those in from day one, even if everything else changes.

If you are adding runtime enforcement to an existing agent and want to start in under 10 minutes, the ardenpy docs cover LangChain, CrewAI, and OpenAI Agents SDK integration with a single configure() call. Questions or corrections — team@arden.sh.