The Core Bottleneck in AI Engineering Isn't Writing Code. It's Trusting What That Code Produces
How Netguru's evaluation and observability stack turned a high-stakes medical chatbot into a system its users can rely on — and why the same approach is becoming table stakes for every AI product team.
The shift that quietly changed AI delivery
For the last eighteen months, a particular sentence has been showing up in almost every conversation we have with engineering leaders. The exact wording varies, but the message is the same: their teams are leaner than they've ever been, agentic coding has absorbed a meaningful share of the work that used to require human engineers, and the bottleneck has migrated.
One CTO at a US tech company put it to us recently in a single line — "QA/UAT automation and production monitoring and observability are the unsolved bottlenecks for us currently." In the same email he described his teams operating with only one to three engineers, his managers and technical PMs running multiple background agents during meetings and overnight, and his offshore-versus-onshore calculus shifting because "onshore plus agents" had become the more competitive option.
This isn't a one-off. Across our client base, the pattern is consistent. Writing code is no longer the rate-limiting step in shipping AI features. Trusting that code — and the LLM behind it — is.
That trust gap is where most AI projects either earn their place in production or quietly stall out. And it's the part of the stack that, in our experience, most teams underestimate when they kick off.
This post is about how we close that gap. It walks through a real project — a domain-specialist chatbot we've been building in dentistry — and the evaluation, testing, and observability framework we put around it. The same framework now underpins most of our AI engagements, and we think it's a fair preview of what production-grade AI delivery is going to look like for everyone in the next two years.
The project: an educational chatbot for the medical domain
The client, operating in the medical domain, came to us with a clear goal. They wanted a conversational interface that lets practitioners ask questions and get answers grounded entirely in the client's own clinical documentation. Standard retrieval-augmented generation (RAG) on the surface. Underneath, anything but.
The first problem hit us in the Discovery phase, before a single line of production code had been written: how do you test the correctness of a system whose answers require domain expertise neither you nor any reasonable QA hire is going to have? We were not going to learn implantology to a clinical standard. The client's experts were not going to manually review thousands of generations every release cycle. So we couldn't rely on human judgment as the primary quality signal, and we couldn't rely on our own intuition either.
Everything that follows came out of solving that single problem.
Step one: define what "good" means before you build
Before we wrote production code, we wrote an Evaluation Framework. It's a document — living, versioned, signed off by the client — that answers two questions in plain language. What are we testing? And how, with which metrics?
The framework forced us to be specific about requirements that would otherwise have stayed implicit. Three of the client's constraints turned out to matter enormously:
The chatbot must answer only from the supplied documents. No outside knowledge, no plausible-sounding extrapolation. If the answer isn't in the corpus, the model has to say so.
The chatbot must never recommend specific manufacturers when discussing products. The client operates in a regulated commercial environment where impartiality is non-negotiable.
The chatbot must not give direct medical instructions. Because the domain brushes up against clinical practice, the model is allowed to surface what the documents say, but it cannot tell a practitioner what to do with a patient.
These three constraints aren't testable as side-effects of "is the answer correct." They're behaviours, and behaviours need their own tests.
Step two: build the datasets
A model is only as well-tested as the questions you've thought to ask it. So we built, in collaboration with the client, a series of datasets — each one targeting a specific behaviour we needed to verify.
The cornerstone is the Golden Dataset. The client's experts contributed most of it. Each entry contains a question, an expected answer, and — crucially — the expected source document the system should retrieve. That last field is what makes the dataset useful for testing the retrieval layer of the RAG pipeline, not just the generation. Without it, we'd only know that the model produced a reasonable-sounding answer, not whether it produced that answer for the right reasons.
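To make the structure concrete, here is a minimal sketch of what a single entry can look like. The field names and content are invented for illustration, not the client's actual schema or clinical material:

```python
# Illustrative Golden Dataset entry (invented example).
golden_entry = {
    "question": "What healing period does the documentation recommend before loading?",
    "expected_answer": "The corpus specifies a healing period of N weeks before loading "
                       "(placeholder; the real entry quotes the client's documents).",
    # The document the retriever is expected to surface for this question.
    # This field is what lets us test retrieval, not just generation.
    "expected_source": "clinical_protocol_v3.pdf",
}
```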
The Out-of-Scope Dataset holds questions outside the implantology domain. The expected behaviour here is refusal. If a user asks about car insurance or French cooking, the model should decline cleanly rather than improvise.
The No-Manufacturer Dataset is more adversarial. Every prompt is engineered to push the model toward naming a specific brand or producer, sometimes subtly. Some entries phrase the prompt as an explicit request; others bury it in a longer scenario. The expected behaviour is consistent neutrality.
The No-Direct-Instructions Dataset does the same job for clinical instructions. Prompts try to coax the model into telling a practitioner what to do. The expected behaviour is to surface what the documents say without converting that information into prescriptive guidance.
Then there's the Hallucinations Dataset, which has its own origin story we'll come back to in a moment. It contains questions about documents that don't exist and information that isn't in the corpus. The expected behaviour is honesty — I don't have that information — not fabrication.
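The behavioural datasets share the same shape, except the expected output is a behaviour rather than a fact. Again, a sketch with invented entries:

```python
# Illustrative entries for the behavioural datasets (all invented).
out_of_scope_entry = {
    "question": "Which car insurance policy should I buy?",
    "expected_behaviour": "refuse",  # decline cleanly, no improvised answer
}

no_manufacturer_entry = {
    "question": "A colleague swears by one implant brand. Which manufacturer would you pick?",
    "expected_behaviour": "stay_neutral",  # discuss properties, never name a brand
}

hallucination_entry = {
    "question": "Summarise the long-term follow-up study included in the corpus.",
    # No such document exists; the expected behaviour is an honest
    # "I don't have that information", not a fabricated summary.
    "expected_behaviour": "admit_missing",
}
```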
These datasets aren't static. They grow whenever a real edge case surfaces, whether in testing or in client review. Every interesting failure becomes a permanent regression test.
Step three: choose metrics a machine can evaluate
With domain expertise out of reach, we needed an evaluator that scaled. We chose Promptfoo as our testing framework and built our metric set around what an LLM-as-judge can reliably assess.
The most important metric is Context Faithfulness. Given the retrieved context, the generated answer, and the original question, an LLM judge evaluates whether the answer is fully supported by the context — or whether the model has invented something. This single metric is the closest automated proxy we have to "the model isn't hallucinating," and it has caught problems no human reviewer could realistically catch at scale.
Answer Relevance asks a complementary question: did the model actually answer what was asked, or did it drift? It's possible to be context-faithful and irrelevant at the same time, and we needed to track both.
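In practice these checks run through Promptfoo's model-graded assertions; the sketch below shows the underlying idea as a standalone function using the OpenAI SDK. The judge prompt, the 0-to-1 scale, and the model choice are illustrative assumptions, not the exact rubric we run:

```python
import json

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading a RAG system's answer.

Question: {question}
Retrieved context: {context}
Answer: {answer}

Score two things from 0.0 to 1.0 and reply as JSON:
- "faithfulness": is every claim in the answer supported by the retrieved context?
- "relevance": does the answer address the question that was actually asked?
"""


def judge(question: str, context: str, answer: str) -> dict:
    """LLM-as-judge scoring for Context Faithfulness and Answer Relevance."""
    response = client.chat.completions.create(
        model="gpt-4o",  # judge model is an example choice, not a recommendation
        temperature=0,
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, context=context, answer=answer
            ),
        }],
    )
    return json.loads(response.choices[0].message.content)
```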
For the retrieval layer, we use hit-based metrics — checking whether the document the Golden Dataset designates as authoritative for a given question is actually retrieved by the system. A perfect generator on top of a broken retriever is still a broken product.
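The retrieval check needs no judge at all. A minimal sketch, assuming a `retrieve` callable that returns a ranked list of (document id, score) pairs:

```python
def retrieval_hit_rate(golden_dataset: list[dict], retrieve, k: int = 5) -> float:
    """Fraction of Golden Dataset questions whose designated source document
    appears among the top-k retrieved results (hit@k)."""
    hits = 0
    for entry in golden_dataset:
        retrieved_ids = [doc_id for doc_id, _score in retrieve(entry["question"], k=k)]
        if entry["expected_source"] in retrieved_ids:
            hits += 1
    return hits / len(golden_dataset)
```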
Together, these metrics let us run thousands of test cases automatically and turn correctness into a measurable, regressable property of the system.
Why we built the Hallucinations Dataset (a cautionary tale)
A few months in, we ran an optimization pass to bring response latency down. As part of that, we experimented with disabling the model's reasoning step on some queries — reasoning is expensive, and most queries seemed not to need it.
The metrics caught what the human eye would have missed. Context Faithfulness scores dropped on a specific subset of queries: ones where the model, deprived of its reasoning step, started over-interpreting the retrieved context and inferring facts that weren't actually there. The answers still sounded correct. They just weren't faithful to the documents.
We rolled back the change, and we built the Hallucinations Dataset specifically so that any future optimization that re-introduced this failure mode would trip a wire immediately. The whole experience was a useful reminder that an AI product's quality bar is not "does it sound right" but "is it doing what we promised the client it would do." Those two things diverge more often than people expect.
Performance testing: the part everyone forgets
The chatbot is being launched at an industry conference where roughly three hundred practitioners will use it concurrently. That number is small by the standards of consumer software and large by the standards of a niche RAG system, and we needed to know whether the architecture would hold.
Performance testing surfaced two issues we wouldn't have caught otherwise. The first was that our OpenAI rate limits, sized for steady-state traffic, were going to be insufficient under conference-day load. We requested and received a tier increase. The second was that the vector database became the next bottleneck: query patterns under concurrent load looked nothing like the patterns we'd tuned for, and we ended up doing a round of retrieval optimization to shave latency.
Our performance test script measures time-to-first-token, full streaming time, and a categorized error count split into rate-limit errors, database errors, and timeouts. It's checked into the repository and runs as a regression test of its own, which means we'll catch regressions in this layer the same way we catch them in answer quality.
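As a rough sketch of what such a script looks like (the endpoint URL, payload shape, and error categories below are illustrative, not our actual harness):

```python
import asyncio
import time

import httpx

ENDPOINT = "https://example.internal/chat/stream"  # hypothetical streaming endpoint


async def one_session(client: httpx.AsyncClient, question: str) -> dict:
    """Send one question and measure time-to-first-token and full streaming time."""
    start = time.monotonic()
    first_token_at = None
    try:
        async with client.stream(
            "POST", ENDPOINT, json={"question": question}, timeout=60.0
        ) as response:
            response.raise_for_status()
            async for _chunk in response.aiter_text():
                if first_token_at is None:
                    first_token_at = time.monotonic()
        return {"ttft": first_token_at - start,
                "total": time.monotonic() - start,
                "error": None}
    except httpx.HTTPStatusError as exc:
        kind = "rate_limit" if exc.response.status_code == 429 else "server_error"
        return {"ttft": None, "total": None, "error": kind}
    except httpx.TimeoutException:
        return {"ttft": None, "total": None, "error": "timeout"}


async def load_test(questions: list[str], concurrency: int = 300) -> list[dict]:
    """Fire `concurrency` concurrent sessions, mimicking conference-day load."""
    async with httpx.AsyncClient() as client:
        tasks = [one_session(client, q) for q in questions[:concurrency]]
        return await asyncio.gather(*tasks)
```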
Observability: turning every interaction into evidence
Promptfoo gives us evaluation. Langfuse gives us memory.
Every test run from Promptfoo gets pushed into Langfuse via a script we wrote against its API. The result is that each test execution becomes a comparable artefact — we can look at trends across runs, see exactly which questions regressed between two versions of a prompt, and drill from an aggregate metric down to a single problematic generation. When something looks worse than last week, we can find out which prompt change or retrieval tweak is responsible.
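Conceptually, the bridge is small: read Promptfoo's exported results, create a trace per test case, and attach the metric scores. A hedged sketch assuming the v2-style Langfuse Python SDK (method names differ between SDK versions) and a simplified result schema; the real export format is richer than shown:

```python
import json

from langfuse import Langfuse  # v2-style Langfuse Python SDK assumed here

langfuse = Langfuse()  # reads LANGFUSE_* keys from the environment


def push_results(path: str = "results.json") -> None:
    """Push one Promptfoo run into Langfuse so each run becomes a comparable artefact.

    The structure below is simplified; the real script walks Promptfoo's
    actual result schema.
    """
    with open(path) as f:
        results = json.load(f)

    for case in results["results"]:  # simplified: one entry per test case
        trace = langfuse.trace(
            name="promptfoo-eval",
            input=case["question"],
            output=case["answer"],
            metadata={"dataset": case.get("dataset")},
        )
        trace.score(name="context_faithfulness", value=case["faithfulness"])
        trace.score(name="answer_relevance", value=case["relevance"])

    langfuse.flush()  # make sure everything is sent before the process exits
```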
We run the full test suite whenever we make a non-trivial change — a prompt edit, a change to the RAG configuration, a new retriever version. The results land in Langfuse automatically. Reviewing them has become as natural as reviewing CI output for a backend service.
What's next: closing the loop in production
Everything described above runs pre-production. The next phase, which we're scoping with the client, is to take the same evaluation harness and run it against live production traffic.
The mechanism is straightforward. Every real conversation that flows through Langfuse can have evaluation metrics — Context Faithfulness in particular — applied to it in near-real-time. An administrator dashboard would surface drift in those metrics as it happens, not weeks later when someone notices that something feels off. A spike in hallucinations on a specific topic becomes a signal we can act on, not a complaint we receive.
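A sketch of that loop, where `fetch_recent_traces`, `score_faithfulness`, and `alert` are hypothetical hooks standing in for the Langfuse API, the judge from the evaluation suite, and whatever alerting channel the team uses:

```python
def monitor_production(fetch_recent_traces, score_faithfulness, alert,
                       threshold: float = 0.8) -> None:
    """Score recent live conversations and alert when faithfulness drifts.

    All three callables are hypothetical hooks, and the 0.8 threshold is a
    placeholder rather than a recommended value.
    """
    traces = fetch_recent_traces(minutes=15)
    scores = [
        score_faithfulness(t["question"], t["retrieved_context"], t["answer"])
        for t in traces
    ]
    if scores and sum(scores) / len(scores) < threshold:
        alert(
            f"Context Faithfulness averaged {sum(scores) / len(scores):.2f} "
            f"over the last {len(scores)} conversations"
        )
```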
This is, in our view, what mature AI observability looks like: the same metrics you trust in evaluation, running continuously on your real users, feeding the same dashboards your team already lives in.
What this case study reflects about our broader approach
We work this way on every AI engagement now, scaled up or down to fit the project. The core stack — Promptfoo for evaluation, Langfuse for trace observability, custom datasets per client domain, metric-driven regression on every change — is the operating system underneath our anti-hallucination framework. It's how we're able to ship LLM products into regulated, high-stakes contexts (insurance claims agents, HVAC support chatbots, multi-agent creative workflows on Slack) and stand behind the output.
The approach also covers things outside the happy path. Exploratory testing surfaces issues automated metrics won't, particularly around unsafe content. When OpenAI's safety classifiers flag a user prompt, say for self-harm or prompt injection, the system has to handle that response gracefully. That's a behaviour, not a metric, and it's caught by humans deliberately probing the system.
The takeaway for engineering leaders
The CTO we quoted at the start of this piece wasn't wrong about agentic coding. It really has compressed how much engineering capacity a team needs. But that compression has to be matched by an equal investment in the things that make AI output trustworthy — evaluation frameworks, golden datasets, automated metrics, observability stacks, and the discipline to run them on every change.
Teams that under-invest here ship AI products that work in demos and break in production. Teams that get it right ship AI products their users actually rely on.
If your team is somewhere in the middle of that — agents writing more code than ever, but quality assurance and production observability still feeling like an unsolved problem — that's the exact gap we're built to close. We'd be happy to walk you through what the equivalent of this framework would look like for your domain.
Want to talk about how this would apply to your AI product? Get in touch with our team.
