Testing AI Agents: Why Unit Tests Aren’t Enough

Patryk Szczygło

Updated Jul 17, 2025 • 26 min read

Testing AI agents is where traditional QA breaks down and a new playbook begins.

When we built Omega—Netguru’s internal AI agent for the sales team—we weren’t just chasing automation. We wanted a tool that could work within our team’s natural flow, help reps focus on the right things, and turn scattered information into real, actionable support. And to a large extent, Omega delivered. It now assists with everything from summarizing expert calls to generating proposals—all from within Slack.

But as the system matured, a new challenge emerged: how do you test an AI agent that thinks, adapts, and behaves differently each time?

Traditional unit tests told us when a function broke or an endpoint failed. But they couldn’t tell us when Omega misunderstood a prompt, hallucinated an answer, or failed to provide relevant context. These weren’t bugs in the usual sense—they were failures in reasoning, communication, or integration. And they often didn’t show up until the agent was live and in use.

Testing AI agents is fundamentally different from testing conventional software. You’re no longer verifying deterministic code—you’re evaluating probabilistic systems that interact with users, tools, and unstructured data. Behavior isn’t just a product of code; it’s shaped by models, prompts, and dynamic context.

In this article, we’ll share what we’ve learned from taking Omega from prototype to production, and why unit tests alone aren’t enough. We’ll explore:

  • What makes AI agents hard to test (think tool use, hallucinations, and multi-agent coordination)
  • Why prompt-level evaluations and scalable test frameworks like Promptfoo matter
  • How observability tools like Langfuse help us debug reasoning failures and monitor performance
  • And why prompt versioning, model comparisons, and system-level A/B testing are now essential to shipping reliable AI features

If you're building or scaling AI agents, this is the testing playbook we wish we had from the start.

1. What Makes AI Agents Hard to Test

Testing AI agents brings a completely different set of challenges. Unlike traditional software, which relies on deterministic logic and isolated functions, AI agents operate in complex, dynamic environments. They juggle tool usage, memory, multi-step reasoning, and long workflows. The challenge isn’t just what the agent outputs—but how it gets there.

Because of this inherent non-determinism, standard testing approaches fall short. Here’s why:

The Role of Non-Determinism in Testing AI

AI responses are influenced by various factors: temperature settings, phrasing, prior context, and even recent interactions. Minor tweaks—like rewording a prompt—can lead to drastically different results. This variability is especially challenging in open-ended tasks, where there’s no single “correct” output to assert against.

Understanding Emergent Behavior

As LLMs and agents evolve, they can develop unexpected behavior—even when no code has changed. Shifts in model weights, system updates, or prompt changes can trigger new, hard-to-predict patterns. This kind of emergent behavior is rarely seen in traditional software and is difficult to anticipate through static test cases.

Tool, API, and Memory Dependencies

AI agents don’t operate in isolation. They rely on external APIs, search tools, and databases—any of which can introduce variability. A slow or inconsistent API response might derail an agent’s flow entirely. Similarly, agents often use memory (short- or long-term) to reference past interactions. But tracking how that memory influences behavior—or when it misfires—is notoriously difficult.

Complex Multi-Step Reasoning

Unlike simple functions, agents often execute multi-step reasoning chains. Testing them means validating not just the output, but the entire process: how the agent reasons, what tools it chooses, what data it retrieves, and how it decides what to do next. This transforms testing AI agents into a holistic evaluation of reasoning pathways rather than isolated logic checks.

Multi-Agent Orchestration

In more advanced systems, agents collaborate or delegate tasks among themselves. This introduces asynchronous workflows and dependencies. If one agent misinterprets context, hallucinates, or misuses a tool, it can trigger a cascade of errors across the system—complicating root cause analysis.

Handling Hallucinations and Model Instability

LLMs are known to hallucinate—generating confident, yet false, outputs. These can range from minor inaccuracies to critical errors, especially in domains like customer service or healthcare. Worse, hallucinations are inconsistent. What passes a test today might fail tomorrow, even under the same conditions.

Why Traditional AI Tests Fall Short

Traditional unit tests focus on fixed expectations and fail/pass assertions. But testing AI agents requires a broader, more flexible approach—incorporating prompt-level evaluation, observability tools, human-in-the-loop review, and continuous feedback loops. Instead of just checking if a function works, we need to monitor if the agent behaves reasonably and reliably in real-world scenarios.

2. Testing AI Agents at the Prompt Level: Why It Matters

Once you move beyond deterministic code and start working with prompts and reasoning paths, unit tests lose much of their value. In AI agent development, the “unit” is no longer just a function—it’s often a prompt, a reasoning chain, or a sequence of tool calls. What’s needed is a scalable, repeatable way to evaluate how these prompts perform across real-world inputs, model updates, and changing contexts.

Why Prompt-Level Evaluation Matters

Prompt failures don’t raise exceptions—they quietly return misleading, incomplete, or fabricated results. Without structured evaluation tools, these failures can go unnoticed until users start losing trust. Worse, a prompt that works today may fail after a model update or even a minor API change.

That unpredictability makes structured evaluation essential. Think of it as QA for reasoning: ensuring prompts consistently deliver accurate, helpful, and safe responses—under varied conditions and at scale.

Structured Testing with Promptfoo

Promptfoo is a powerful tool built specifically for this type of evaluation. It treats prompt testing like software testing, letting teams define test cases in config files with input variables, expected outcomes, and evaluation criteria.

With Promptfoo, you can:

  • Compare prompt behavior across models (e.g., GPT-4o, Claude 3.5, o1)
  • Track performance trends over time
  • Automate red-teaming and edge-case discovery
  • Score outputs using custom metrics like helpfulness, factuality, and safety

This shifts evaluation from informal spot checks to a structured, repeatable QA process.
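
To make that concrete, here is a minimal Python sketch of the same idea: test cases with input variables and required keywords, run against more than one model. Promptfoo itself is configured through YAML files and a CLI rather than code like this, and `call_model` is a hypothetical wrapper for whichever provider SDK you use.

```python
from dataclasses import dataclass, field

@dataclass
class PromptTest:
    vars: dict                                              # variables substituted into the prompt
    must_contain: list[str] = field(default_factory=list)   # keywords the output must include

PROMPT = "Summarize this expert call for a sales rep:\n\n{transcript}"

def call_model(model: str, prompt: str) -> str:
    """Hypothetical wrapper around your LLM provider's SDK."""
    raise NotImplementedError

def run_suite(models: list[str], tests: list[PromptTest]) -> None:
    for model in models:
        for case in tests:
            output = call_model(model, PROMPT.format(**case.vars))
            missing = [kw for kw in case.must_contain if kw.lower() not in output.lower()]
            status = "PASS" if not missing else f"FAIL (missing: {missing})"
            print(f"{model}: {status}")

tests = [
    PromptTest(
        vars={"transcript": "Client asked about the delivery timeline and budget caps..."},
        must_contain=["timeline", "budget"],
    ),
]
# run_suite(["gpt-4o", "claude-3-5-sonnet"], tests)
```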

Factuality Testing and Red Teaming in AI QA

In high-stakes areas like healthcare, legal, or finance, factual accuracy is non-negotiable. Promptfoo and similar tools support factuality scoring, comparison against ground truth, and detection of hallucinated outputs. You can also integrate human reviewers to assess edge cases and fine-tune prompts with more confidence.

Red teaming—intentionally pushing prompts to fail—is another layer of defense. By stress-testing your agent, you can surface vulnerabilities early, before they reach users.

Example: How to Test RAG-Based AI Agents

Retrieval-augmented generation (RAG) systems bring added complexity. It’s not just about the quality of the final response—you also need to verify that the right documents were retrieved.

Prompt-level evaluation for RAG includes:

  • Checking if retrieved documents are relevant to the input
  • Identifying hallucinations caused by poor retrieval
  • Comparing retrieval strategies (e.g., vector vs. hybrid search)
  • Scoring how well responses are grounded in the retrieved content

Tools like Langfuse and Promptfoo support this depth of evaluation, making it easier to trace issues to the right step in the pipeline.
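
As an illustration, the sketch below wires those checks together with nothing more than token overlap. Production evaluations would typically use embeddings or an LLM judge instead; the thresholds and scoring here are placeholders, not recommendations.

```python
def token_overlap(a: str, b: str) -> float:
    """Crude lexical overlap between two texts, in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta), 1)

def evaluate_rag(question: str, retrieved_docs: list[str], answer: str) -> dict:
    # 1) Retrieval relevance: does any retrieved document overlap with the question?
    retrieval_score = max(token_overlap(question, d) for d in retrieved_docs)
    # 2) Groundedness: how much of the answer is supported by the retrieved content?
    grounding_score = max(token_overlap(answer, d) for d in retrieved_docs)
    return {
        "retrieval_relevance": round(retrieval_score, 2),
        "groundedness": round(grounding_score, 2),
        "likely_hallucination": grounding_score < 0.2,  # threshold is a placeholder
    }
```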

Healthcare Use Case: Testing AI for Medical Transcriptions

Imagine a doctor using an AI assistant during a patient consultation. The conversation is transcribed in real time, and the AI is responsible for taking notes—capturing symptoms, medications, and relevant history.

But how can you ensure that nothing important is missed or incorrectly captured?

This is where prompt-level evaluation tools like Promptfoo become critical. You can:

  • Feed in hundreds of example consultations with ground truth annotations.
  • Define what a high-quality note looks like based on symptoms, dosage, duration, and context.
  • Run regression tests to spot failures whenever prompts are updated or when switching to a new model.
  • Validate prompt performance on real-world data to find the most reliable configuration.

As the system scales, you keep adding test cases—ensuring that updates or new APIs don’t silently break essential functionality. It’s no longer about testing a single outcome, but about building trust in the overall behavior of the AI.
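
A regression suite for this scenario might look like the hedged sketch below: each case pairs a transcript with ground-truth annotations, and the test asserts that every annotated item shows up in the generated note. `generate_note` stands in for whichever agent or prompt configuration is under test.

```python
CASES = [
    {
        "transcript": "Patient reports a persistent cough for two weeks and takes 400 mg ibuprofen daily...",
        "ground_truth": {
            "symptoms": ["persistent cough"],
            "medications": ["ibuprofen"],
        },
    },
]

def generate_note(transcript: str) -> str:
    raise NotImplementedError  # call the agent / prompt configuration under test here

def test_notes_capture_ground_truth():
    for case in CASES:
        note = generate_note(case["transcript"]).lower()
        for category, items in case["ground_truth"].items():
            for item in items:
                assert item.lower() in note, f"{category} missing from note: {item!r}"
```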

3. Observability in Practice: Tracing Complex LLM Systems

As LLM systems evolve from simple prompts into multi-step agents with memory, tool usage, and external API calls, observability becomes non-negotiable. Traditional logs or unit test outputs no longer give enough context to understand failures—or why results vary between runs with identical inputs.

Why Tracing Matters for Testing AI Agents

When an agent chains reasoning steps, invokes external tools, and generates outputs from retrieved documents, every step matters. Without visibility into those steps, debugging turns into guesswork. With proper tracing, debugging becomes data-driven and actionable.

Langfuse is one of the most mature tools for this purpose. It’s built for tracing complex LLM workflows and provides visibility into:

  • Traces: The full sequence of model calls, tool executions, and nested steps within the agent’s reasoning chain
  • Metadata: Contextual info like user IDs, session types, tags, and environment variables that enrich every trace
  • Latency and cost: Token usage and response times across calls, helping teams find performance bottlenecks
  • Inputs and outputs: Full visibility into prompts, retrieved content, and generated responses at each stage

This level of trace data enables teams to reproduce, diagnose, and improve agent behavior across development and live environments.
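
In practice, instrumenting an agent can be as light as decorating its steps. The sketch below assumes the Langfuse Python SDK's `observe` decorator (the import path differs slightly between SDK versions) and uses placeholder retrieval and generation steps.

```python
from langfuse import observe  # older SDKs: from langfuse.decorators import observe

@observe()
def retrieve(question: str) -> list[str]:
    # ... vector or hybrid search against your document store ...
    return ["doc snippet 1", "doc snippet 2"]

@observe(as_type="generation")
def generate_answer(question: str, docs: list[str]) -> str:
    # ... call the LLM with the retrieved context ...
    return "drafted answer"

@observe()
def answer_question(question: str) -> str:
    docs = retrieve(question)               # shows up as a nested span in the trace
    return generate_answer(question, docs)  # shows up as a generation with inputs/outputs
```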

Testing Multi-Step AI Workflows: A RAG Example

Take Retrieval-Augmented Generation (RAG) as an example. A user submits a question, documents are fetched, and an answer is generated. But when something goes wrong—maybe the answer is inaccurate or the sources aren’t relevant—pinpointing the issue isn’t straightforward.

Langfuse can help identify whether:

  • The retriever surfaced the right documents
  • The prompt used the retrieved content effectively
  • The model introduced errors despite receiving accurate inputs

For production systems, especially those vulnerable to hallucinations, API instability, or content drift, tracing is crucial. Without it, diagnosing issues is nearly impossible.

Debugging AI Agent Failures With Langfuse

Imagine an AI assistant that fails to return correct metadata for an image uploaded via a Slack command. You suspect a tool integration issue—but it could just as easily be a broken prompt or outdated memory.

With Langfuse tracing enabled:

  • You can inspect the entire interaction—from user input to final model response
  • You discover the tool executed correctly, but the prompt didn’t format its output properly
  • You fix the prompt, and the issue is resolved—without hours of guesswork or digging through logs

For AI agents, this kind of observability is the foundation of production-grade reliability.

4. Prompt Management and Versioning at Scale

As teams explore how to create an AI agent that scales beyond prototypes, they quickly discover that managing prompts manually—or directly in code—doesn’t work. Hardcoded prompts buried inside application logic are difficult to track, test, or update. Once you deploy agents in production and serve multiple user types or model backends, you need a better system.

Why “Prompt-as-Code” Doesn’t Scale

Hardcoding prompts as constants may feel simple early on. But what happens when:

  • You need to A/B test different prompt versions?
  • You spot a typo post-release or need to tweak instructions on the fly?
  • You’re supporting multiple models—like GPT-4o, Claude 3.5, or o1—each with unique behavior?

Shipping new releases for every minor update slows teams down and increases the risk of silent failures. Prompt management needs to become a core part of your infrastructure—on par with code versioning or API observability.

Langfuse for Versioning and Testing AI Prompts

Langfuse offers a built-in CMS-like system for managing prompts. Instead of scattering them across files, notebooks, or buried code, you can:

  • Track versions with timestamps and authorship
  • Link prompts to specific use cases, users, or environments
  • Roll back or fork previous versions instantly
  • Tag prompts for experiments, staging, or production deployment

This separation of concerns lets developers focus on logic while prompt engineers iterate independently—without blocking releases or risking regressions.
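
A minimal sketch of what that looks like with the Langfuse Python SDK: the application fetches whichever version is currently labeled for production and compiles it with runtime variables. The prompt name, label, and variable are illustrative, and the SDK expects LANGFUSE_* credentials in the environment.

```python
from langfuse import Langfuse

langfuse = Langfuse()  # reads LANGFUSE_* credentials from the environment

# Fetch whichever version is currently labeled "production"
prompt = langfuse.get_prompt("sales-call-summary", label="production")

# Substitute the template's variables at runtime
compiled = prompt.compile(transcript="...call transcript here...")

print(prompt.version)  # handy for linking traces and evaluations to a prompt version
# send `compiled` to the model of your choice (GPT-4o, Claude 3.5, o1, ...)
```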

Adapting to Model Differences: GPT-4o, Claude, o1

Each model comes with its own quirks. A prompt that works well in GPT-4o might fall flat in Claude or return erratic outputs in o1. Centralized prompt management helps you:

  • Benchmark the same prompt across different models
  • Detect model-specific issues, hallucinations, or edge cases
  • Seamlessly test and adopt new vendors or lower-cost alternatives

A/B Testing With Traces

By linking prompts with trace data in Langfuse, you can run real-time A/B tests and analyze how prompt changes affect users:

  • Does version A reduce user rewrites compared to B?
  • Which prompt generates more accurate RAG outputs?
  • How does tone or phrasing influence user engagement?

These insights go far beyond what unit tests or offline evaluations can tell you—they reflect real-world performance, under real conditions, with real users.
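
A hedged sketch of how such a split might be wired up: each request is randomly assigned a prompt variant fetched by label, and the variant plus prompt version is attached to the trace as tags so outcomes can be grouped later. The labels, prompt name, and `call_model` helper are assumptions, and the exact method for annotating the current trace varies across Langfuse SDK versions.

```python
import random

from langfuse import get_client, observe

langfuse = get_client()  # reads LANGFUSE_* credentials from the environment

def call_model(prompt_text: str) -> str:
    raise NotImplementedError  # hypothetical wrapper around your LLM provider

@observe()
def summarize_call(transcript: str) -> str:
    # Randomly assign this request to one of two prompt variants
    variant = random.choice(["A", "B"])
    prompt = langfuse.get_prompt(
        "sales-call-summary", label="prod-a" if variant == "A" else "prod-b"
    )
    # Tag the trace so outcomes can be grouped by variant and prompt version
    langfuse.update_current_trace(tags=[f"variant-{variant}", f"prompt-v{prompt.version}"])
    return call_model(prompt.compile(transcript=transcript))
```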

5. How to Test and Create AI Agents Using OpenAI o1

OpenAI’s o1 model marks a shift in how we think about prompt engineering. Unlike traditional chat-based LLMs like GPT-4o or Claude 3.5, o1 introduces a new paradigm—focused on structured reasoning, deep context, and system-driven planning. Teams used to casual prompting quickly learn that o1 isn’t a chatbot—and that’s exactly the point.

OpenAI’s o1 Is Not a Chatbot

Treating o1 like a conversational assistant often leads to poor results. It doesn’t ask clarifying questions or request missing details. Instead, it takes your input at face value—sometimes producing overengineered, inconsistent, or verbose outputs unless clearly directed.

Think of o1 less like a chat partner and more like a report generator. It’s designed to take comprehensive instructions and return structured, multi-step reasoning in a single pass.

Briefs, Not Prompts

Where chat models rely on iterative back-and-forth, o1 works best when given everything up front: project context, previous attempts, constraints, and the desired format. As one early user put it: “Whatever you think is enough context—10x that.”

This shift means teams need to rethink how they prompt:

  • Specify known failure modes and attempted fixes
  • Include relevant system architecture or dataset schemas
  • Clearly define what success looks like—not just what to do

In short, treat o1 like a new teammate who needs a thorough onboarding to deliver real value.
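
To make the difference tangible, here is an illustrative sketch of a brief assembled in code rather than typed conversationally. The sections and their contents are examples only; the point is that everything o1 needs arrives in a single, structured message.

```python
BRIEF_TEMPLATE = """\
## Goal
{goal}

## System context
{architecture}

## What we already tried (and how it failed)
{prior_attempts}

## Constraints
{constraints}

## Expected output
{output_format}
"""

brief = BRIEF_TEMPLATE.format(
    goal="Design a regression-test plan for our Slack sales agent.",
    architecture="Slack bot -> orchestrator -> RAG over sales docs -> GPT-4o for drafting.",
    prior_attempts="Keyword-only assertions missed paraphrased answers; manual spot checks didn't scale.",
    constraints="Must run in CI in under 10 minutes; no customer data in fixtures.",
    output_format="A numbered plan with test categories, tooling, and owners.",
)
# send `brief` as the single user message to o1
```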

o1’s One-Shot Reasoning and Structured Testing

Unlike models that rely on conversational correction, o1 excels at one-shot reasoning. Given a clear, detailed brief, it often produces multi-part responses—with structured sections, deep analysis, or tradeoff considerations—right out of the gate.

This opens up new design possibilities:

  • Analyzing, organizing, and refining complex texts
  • Handling intricate relationships between concepts or systems
  • Breaking down difficult ideas into clear, digestible explanations

But it also demands more clarity in your prompt. Be explicit about the output type—whether you need a strategic summary, a structured plan, or a set of insights. Without that, o1 may default to lengthy, academic-style outputs.

Using LLM-as-Judge for Automated Prompt Evaluation

One of o1’s most promising capabilities is self-evaluation. It can apply judgment criteria to its own output—within the same prompt.

To enable this, you can:

  • Define what good vs. bad output looks like
  • Add scoring rubrics or QA checklists directly into the prompt
  • Ask o1 to review its response against those criteria

This sets the stage for LLM-as-Judge workflows, where the model not only generates content but also validates it. In the long term, this approach could support more automated QA pipelines, with generation and validation handled in a single pass.
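
A minimal sketch of that pattern: the rubric is embedded in the prompt, and the model (or a second model acting as judge) scores a candidate answer against it. `call_model` is a hypothetical provider wrapper, and the rubric and JSON shape are illustrative.

```python
import json

JUDGE_PROMPT = """\
You are reviewing an AI-generated answer against this rubric:
1. Factually consistent with the provided context.
2. Addresses every explicit question the user asked.
3. Contains no unsupported claims or invented details.

Context:
{context}

Answer to review:
{answer}

Return JSON: {{"scores": [1-5, 1-5, 1-5], "verdict": "pass" or "fail", "reason": "..."}}
"""

def call_model(prompt: str) -> str:
    raise NotImplementedError  # plug in your provider SDK here

def judge(context: str, answer: str) -> dict:
    raw = call_model(JUDGE_PROMPT.format(context=context, answer=answer))
    return json.loads(raw)  # fails loudly if the judge doesn't return valid JSON
```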

6. Best Practices for Testing AI Agents and Ensuring Reliability

As LLM agents transition from experimental tools to production systems, ensuring reliability is no longer optional—it’s mission-critical. Building trust with users, and preventing costly or harmful failures, requires more than solid engineering. It calls for a layered testing and monitoring strategy that integrates observability, structured evaluations, and real human feedback throughout the development lifecycle.

Move Beyond Unit Tests: Use Multiple QA Layers

While unit tests are still valuable for catching basic failures, they’re not enough for AI agents. These systems involve complex behaviors—like reasoning paths, tool interactions, and memory updates—that go beyond static outputs. High-performing teams use a layered QA stack:

  • Tracing: Tools like Langfuse help capture end-to-end workflows—prompts, outputs, latency, tool calls, and more
  • Prompt evaluations: Tools like Promptfoo and OpenAI Evals automate checks for factuality, consistency, and regressions
  • Human-in-the-loop feedback: Qualitative review from users and internal teams adds depth to automated testing

This integrated approach helps identify issues in logic, tone, user experience, and alignment with intent.

Monitor for Regression With Every Prompt or Model Change

Small prompt changes—or switching models—can break things in subtle ways. A single sentence tweak, or moving from GPT-4 to Claude, can throw off workflows that were working fine.

To avoid this:

  • Track prompt versions and associate them with releases using management tools like Langfuse
  • Run regression tests on key flows across model and prompt variations
  • Use side-by-side comparisons to evaluate output quality over time with tools like Promptfoo

Even your best prompts can degrade after a model update. Regression monitoring keeps your systems stable and predictable.

Handle Drift, Failures, and Unexpected Outputs

LLMs are dynamic. Their outputs can drift based on model updates, phrasing shifts, or changes in third-party tools and APIs. You need strategies to detect and respond to this unpredictability:

  • Use traces and logs to spot and analyze failures
  • Build custom evals to uncover logic errors (e.g., "Did the correct data get retrieved?")
  • Add fallback logic to protect users when the model misfires—like returning a safe default or escalating to human review

Designing for drift means accepting uncertainty and embedding systems that adapt.
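
One concrete way to embed that adaptability is a fallback wrapper around the agent call, sketched below. The validation rule, retry count, and escalation helper are illustrative; the pattern is what matters: validate, retry once, then degrade safely.

```python
SAFE_DEFAULT = "I couldn't generate a reliable answer, so I'm routing this to a teammate."

def call_agent(question: str) -> str:
    raise NotImplementedError  # your agent invocation

def looks_valid(answer: str) -> bool:
    # e.g. non-empty, cites a retrieved source, passes an LLM-judge check...
    return bool(answer.strip())

def escalate_to_human(question: str) -> None:
    ...  # open a ticket, ping a Slack channel, etc.

def answer_with_fallback(question: str, max_retries: int = 1) -> str:
    for _ in range(max_retries + 1):
        try:
            answer = call_agent(question)
            if looks_valid(answer):
                return answer
        except Exception:
            pass  # log / trace the failure here instead of swallowing it silently
    escalate_to_human(question)
    return SAFE_DEFAULT
```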

Aligning AI Testing With Business and UX Outcomes

Accuracy is important, but it’s not the only metric that counts. The most successful agents are measured by their real-world impact:

  • Do they save users time?
  • Do they align with your brand’s voice and UX expectations?
  • Do users feel more confident using them over time?

Sometimes, a quick and clear answer that’s slightly imperfect beats a verbose “perfect” one. Test with your business goals in mind—not just technical precision.

7. Conclusion: Building a Future-Ready Stack for Testing AI Agents

Testing AI agents isn’t just a developer concern anymore—it’s a product risk, a UX challenge, and a strategic business priority. While unit tests still have value for catching low-level issues, they fall short when it comes to probabilistic outputs, emergent behavior, and complex, multi-step workflows.

To build reliable, production-grade AI agents, we need a new testing stack—one built around how these systems actually operate.

AI agents are software—but they’re also more than that. They’re reasoning systems that generate behavior dynamically, based on memory, context, and ever-changing models. That means we need to apply the same engineering rigor as traditional software—plus new layers of evaluation and observability:

  • Use tools like Promptfoo to evaluate prompts at scale and catch regressions
  • Implement trace-based observability with Langfuse to analyze agent behavior over time
  • Manage prompt versions, run A/B tests, and compare model performance across deployments
  • Combine automated metrics with human feedback to catch subtle failures that dashboards miss

Whether you're building internal copilots or customer-facing agents, long-term success depends on your ability to validate, monitor, and adapt these systems in production.

Bonus FAQ: Testing AI Agents and Systems

What makes testing AI agents different from traditional software testing?

Unlike traditional software testing, evaluating AI agents involves dynamic, non-deterministic behavior. While conventional apps follow fixed logic, AI agents generate responses based on input data, training data, and evolving context.

That’s why teams must go beyond traditional testing methods and embrace techniques like exploratory testing, prompt-level testing, and observability to monitor model behavior in real-world usage.

In traditional unit tests, we rely on assertions—expecting one exact value to match another. In AI testing, however, we often compare the similarity score between the generated response and an ideal output, or check for the presence of specific keywords. A more modern approach involves using another LLM as a judge, allowing multiple models to evaluate and validate each other’s outputs.
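
For illustration, the three assertion styles might look like the sketch below, where `embed` and `ask_judge_model` are hypothetical hooks into your own embedding model and judge LLM.

```python
def embed(text: str) -> list[float]:
    raise NotImplementedError  # your embedding model of choice

def ask_judge_model(output: str, rubric: str) -> str:
    raise NotImplementedError  # a second LLM that returns "pass" or "fail"

def keyword_check(output: str, required: list[str]) -> bool:
    return all(k.lower() in output.lower() for k in required)

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm if norm else 0.0

def similarity_check(output: str, ideal: str, threshold: float = 0.85) -> bool:
    return cosine(embed(output), embed(ideal)) >= threshold

def judge_check(output: str, rubric: str) -> bool:
    return ask_judge_model(output, rubric) == "pass"
```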

How do you perform AI model testing in agent-based systems?

AI model testing requires validating outputs under varied test scenarios, using both automated and human-in-the-loop techniques. Since machine learning models adapt as new data flows in, you need structured test data, robust model validation, and tools for regression testing to catch unexpected changes.

In agent workflows, this goes beyond unit tests to include performance testing, integration testing, and even fairness testing—especially when handling labeled data across diverse use cases.

How does continuous testing support reliable AI applications?

Continuous testing enables teams to track and maintain AI system testing over time, ensuring consistency as your AI model evolves or new APIs are introduced. This is especially important for AI and ML systems that retrain on historical data.

This approach covers test case development, test execution, and test automation—helping teams catch regressions quickly and maintain high data quality.

It also supports production-grade AI applications running across platforms, including mobile devices or cloud environments.

Why are tracing and observability essential for testing AI systems?

Tracing and observability provide deep visibility into how AI systems operate, making them essential for debugging and continuous improvement. They help teams:

  • Detect failures early, even those that don’t trigger traditional errors
  • Analyze model behavior over time to improve accuracy and reliability
  • Identify and reduce hallucinations, with early warnings when reasoning goes off track
  • Ensure cost efficiency by tracking token usage and system performance
  • Understand AI system usage through analytics, enabling better optimization decisions

By implementing observability tools like Langfuse, teams can monitor agent workflows end-to-end—ensuring not just correctness, but also stability, usability, and business value.

Why is prompt management a key aspect of testing AI systems?

In modern AI workflows, prompts act like dynamic code. Managing them at scale is a key aspect of testing robust machine learning systems.

With every prompt update, teams need systems for test creation, version tracking, and automation to validate outputs across multiple ML models. By collecting logs and analyzing data using advanced testing tools, teams can diagnose issues early and continuously improve performance.

Effective data collection ensures your testing datasets reflect real-world scenarios—essential for prompt stability, behavior tracking, and adaptation.
