Agentic Development, Production-Grade: How We Approach Quality, Architecture, and Testing
There is a version of "building software with AI" that lives on LinkedIn: somebody types a prompt, a feature appears, the screenshot goes viral. There is another version that happens inside real teams shipping real software to real users - where the question is not whether an agent can produce code, but whether what it produces will still be standing in production a year from now.
We've spent the last several months working in that second version. Our internal back-office product is built through full agentic development: no production application code is written by hand. Every commit, every migration, every test ships from an agent — primarily Claude Code, with Cursor as a secondary environment for engineers who prefer it, working against a structured workflow we've evolved week by week. The things we still author directly — the rules, the workflow definitions, the CI scripts — we treat as a separate codebase with its own review discipline, because that's where the methodology actually lives.
What follows is the methodology we've landed on so far - the conventions, the gates, the test strategy, and the mindset shift required to make full agentic development feel less like driving an autonomous car with your eyes closed and more like running a high-performing cross-functional team. It is, candidly, still evolving. But we've made enough mistakes to know which ones cost the most.
Why convention is not a style preference anymore
The first thing you learn in agentic development is that "every developer writes differently" is a luxury you can no longer afford. When humans wrote code, an inconsistent codebase was a maintenance tax. When an agent writes the code, an inconsistent codebase is a guarantee that the next feature will drift further from the last, because the agent will mimic whatever it finds. Convention stops being a style preference and becomes a structural input to quality.
The output is only ever as good as the rules you give the model. So we treat the rules themselves as a product - versioned, layered, and updated whenever the agent gets something wrong in a way we don't want repeated.
There was a moment early on that taught us this in the most expensive way possible. A deployment to production broke the database. The migration script wiped state. The root cause was straightforward in retrospect: nobody had told the agent that migrations must be additive and forward-only, so each new feature was generating a fresh migration that overwrote the previous one. We had to nuke production and rebuild. We were lucky - it was early enough that there wasn't much to lose. But the lesson was permanent: the rules file is not a one-time setup, it is a living artifact that absorbs every incident.
The three layers that keep increments coherent
Consistency across an agent-driven codebase comes from three coordinated layers. Together they answer the question, "How does the model know what 'good' looks like in this project?"
Layer one: non-negotiable root rules. At the root of the repository we keep AGENTS.md and CLAUDE.md. These are short, declarative, and absolute. No new dependencies without approval. No external API calls without approval. Atomic commits. Migrations must be forward-diff only. Plan before you implement. After every code change, run the full check suite - typecheck, lint, tests, CI checks - before considering the work done. These are the rules that don't bend for any feature, no matter how urgent.
Layer two: area-scoped rules. Inside .claude/rules/ we keep one file per area of the codebase - estimations, AI, frontend, end-to-end testing, and so on. Each file uses a paths: frontmatter declaration so the rules load automatically only when the agent is working inside that part of the system. The rules for writing a React component sit next to the rules for writing a domain service, which sit next to the rules for writing a Playwright test. The agent doesn't have to hold the whole project in its head; the right rules surface at the right moment.
Layer three: a workflow skill chain. On top of the rules we layer a chain of specialized agent capabilities that we step through for any non-trivial change. Each step produces a named artifact that the next step is required to read before it begins — which is what stops the chain from quietly drifting.
Brainstorm produces a feature-brief.md containing the problem statement, the user, acceptance criteria, and explicit out-of-scope items. The agent asks clarifying questions here, and the human signs the brief off before planning starts.
Plan produces an implementation-plan.md keyed to that brief: file-level changes, new modules, schema deltas, test scenarios, and a risk list. No code is written in this phase. The plan is reviewed by a human before any implementation tool is invoked.
Subagent-driven development executes the plan in slices. The parent agent dispatches one subagent per slice with a tightly scoped task and the relevant area rules. Each subagent returns a diff and a short report; the parent reconciles them against the plan and flags any drift in writing rather than silently absorbing it.
Code review runs as a separate skill, not as a free-form prompt — it walks a checklist tied to the area rules and the four mandatory test scenarios, and refuses to clear changes that miss any of them.
Finish handles branch hygiene: rebasing, conventional-commit messages, doc updates, and the final CI run.
Each step has its own rules; each step refuses to run without the previous step's artifact. In Claude Code terms: brainstorm, plan, review, and finish are slash commands under .claude/commands/; subagent-driven development uses the Task tool to dispatch per-slice work; the test-driven-development discipline is a skill (SKILL.md) that the implementation step loads. The whole chain is documented in docs/development/ai-workflow.md, so a new team member can pick up the workflow from the repo rather than from folklore.
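In Claude Code these gates live in the commands themselves, but the idea is easy to make concrete. Below is a minimal sketch of the kind of artifact guard a step can run before doing anything else; the step names and file names mirror the chain above, while the helper itself is illustrative rather than our actual tooling.

```typescript
// Illustrative guard: a workflow step verifies that the previous step's
// artifact exists and is non-empty before it runs. Step and file names
// mirror the chain above; the helper itself is a sketch, not our tooling.
import { existsSync, readFileSync } from "node:fs";
import { join } from "node:path";

const REQUIRED_ARTIFACTS: Record<string, string[]> = {
  plan: ["feature-brief.md"],
  implement: ["feature-brief.md", "implementation-plan.md"],
  review: ["implementation-plan.md"],
  finish: ["implementation-plan.md"],
};

export function assertArtifacts(step: string, workDir: string): void {
  for (const name of REQUIRED_ARTIFACTS[step] ?? []) {
    const file = join(workDir, name);
    if (!existsSync(file) || readFileSync(file, "utf8").trim() === "") {
      // Halting is the point: the step refuses to run without its input.
      throw new Error(`Step "${step}" blocked: missing or empty ${name}`);
    }
  }
}
```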
The combination is what creates coherence. The root rules set the floor. The area rules localize expertise. The workflow chain enforces discipline at every step. None of the three on its own is enough. We're still tightening the consistency of how the team applies all three - that piece is honest work-in-progress - but the architecture is in place.
Treating the rules like a codebase. Three layers only stay coherent if you manage them as a real artifact, not a folder of READMEs. We learned this the slow way — the rules drifted faster than the code did until we put the same discipline around them.
Conflicts have a precedence order: root rules in AGENTS.md and CLAUDE.md always win, area rules win over workflow rules, and any disagreement that survives that hierarchy is treated as a bug in the rules — the agent is instructed to halt and ask, not pick a side. We've found that most "conflicts" are actually missing rules, and forcing them into the open is how we find the gap.
Versioning sits in git like anything else, but every rule file carries a short header — date added, the incident or decision it came from, and the owner. When we change a rule we update the header; when we retire one we move it to .claude/rules/archive/ with a note explaining why it stopped earning its place. The archive is as useful as the active set, because it's the institutional memory of what we tried and stopped doing.
Deprecation is driven by evidence, not intuition. Once a quarter we sample a batch of recent agent diffs and check which rules actually shaped them; rules that produced no observable behaviour for two consecutive samples are flagged for retirement. The model gets better; some rules age out. The point is to notice rather than accumulate.
Defense in depth: our testing strategy
Test coverage in an agent-driven codebase is a different problem than in a hand-written one. Agents write more code than humans do - substantially more, between defensive checks, generated comments, and split-out helper functions. That means coverage percentages can look healthy while leaving meaningful gaps, and it means the surface area you have to defend grows faster than your test suite naturally would.
We approach this with four mandatory layers, all enforced as a merge gate. We call it defense in depth.
Layer one: static checks. Seven CI scripts run on every change - translations, RBAC consistency, migration validity, hardcoded string detection, accessibility checks, API contract verification, and documentation freshness. These are the fastest, cheapest signals, and they catch the kinds of mistakes an agent is statistically most likely to make: a missing translation key, a stale doc, an accidentally hardcoded string that should have been a constant.
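To give a sense of how small these checks are, here is a sketch in the spirit of the hardcoded-string detector. The globs, the allowlist, and the heuristic are all hypothetical and far cruder than the real script, but the shape is the same: scan, report, exit non-zero so CI fails.

```typescript
// Sketch of a hardcoded-string check: flag literal JSX text that should go
// through the translation layer. Globs, allowlist, and heuristic are hypothetical.
import { readFileSync } from "node:fs";
import { globSync } from "glob";

const ALLOWLIST = new Set(["OK", "ID"]);
const offenders: string[] = [];

for (const file of globSync("src/**/*.tsx")) {
  const source = readFileSync(file, "utf8");
  // Rough heuristic: literal text between JSX tags, e.g. <button>Save</button>
  for (const match of source.matchAll(/>([A-Za-z][A-Za-z ]{2,})</g)) {
    const text = match[1].trim();
    if (!ALLOWLIST.has(text)) offenders.push(`${file}: "${text}"`);
  }
}

if (offenders.length > 0) {
  console.error(`Hardcoded UI strings found:\n${offenders.join("\n")}`);
  process.exit(1); // non-zero exit fails the CI job
}
```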
Layer two: unit and integration tests. Vitest plus React Testing Library, roughly a hundred files at the time of writing, co-located next to the source files they cover. The suite is now well past a thousand tests and growing. We don't yet enforce a hard coverage threshold — that's a deliberate, tracked decision — because percentage coverage in an agent-generated codebase rewards padding more than it rewards meaningful tests. Instead, the four-scenario floor enforces minimum depth per change, the code-review skill flags any new source file without a co-located test, and we sample critical paths quarterly to confirm the tests there actually fail when the implementation is mutated. The suite gates merges as a whole: a single failing test blocks the entire run.
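A co-located test in this setup looks roughly like the sketch below. The component and its labels are hypothetical; the convention is the point: the test file sits beside the source file it covers and asserts behaviour through React Testing Library rather than markup snapshots.

```tsx
// src/features/leads/LeadBadge.test.tsx - hypothetical component, real convention:
// the test sits next to LeadBadge.tsx and asserts behaviour, not snapshots.
// Assumes the jest-dom matchers are registered in the Vitest setup file.
import { render, screen } from "@testing-library/react";
import { describe, expect, it } from "vitest";
import { LeadBadge } from "./LeadBadge";

describe("LeadBadge", () => {
  it("renders the status label", () => {
    render(<LeadBadge status="qualified" />);
    expect(screen.getByText(/qualified/i)).toBeInTheDocument();
  });

  it("falls back to a neutral label for unknown statuses", () => {
    render(<LeadBadge status={"mystery" as never} />);
    expect(screen.getByText(/unknown/i)).toBeInTheDocument();
  });
});
```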
Layer three: end-to-end API tests. A dedicated Playwright project that runs against a real database, with real RBAC and real Drizzle, no mocks. The contract is tested against the running system. We've moved these into a separate e2e/api/ project with dedicated client wrappers under e2e/api-clients/<area>/, which means a change to a route's shape is a one-file fix instead of a hunt through scattered specs.
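One of those specs, reduced to its shape: the route and the client wrapper below are hypothetical stand-ins for the per-area clients, but the pattern is the one described, a thin wrapper over Playwright's request context, no mocks, assertions against the running system.

```typescript
// e2e/api/leads.spec.ts - LeadsClient and the route are hypothetical stand-ins
// for the per-area wrappers under e2e/api-clients/; the pattern is the real one.
import { test, expect } from "@playwright/test";
import { LeadsClient } from "../api-clients/leads/client";

test("creating a lead persists it and returns the canonical shape", async ({ request }) => {
  const leads = new LeadsClient(request); // thin wrapper over Playwright's APIRequestContext

  const created = await leads.create({ name: "Acme Corp", source: "referral" });
  expect(created.status()).toBe(201);

  const body = await created.json();
  expect(body).toMatchObject({ name: "Acme Corp", source: "referral" });

  // Read it back through the API: real database, real RBAC, no mocks.
  const fetched = await leads.getById(body.id);
  expect(fetched.status()).toBe(200);
});
```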
Layer four: end-to-end UI smoke tests. Chromium-based, bilingual locators, structured into e2e/ui/smoke/ for the must-pass set and e2e/ui/regression/ for the broader suite. The smoke set is small, fast, and protective - the regression set is where we expand coverage over time.
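A smoke spec in the same spirit, with a hypothetical t() helper standing in for whatever resolves labels through the app's translations; that indirection is what makes the locators bilingual, since the same spec passes in either language.

```typescript
// e2e/ui/smoke/login.spec.ts - illustrative only. The t() helper is a hypothetical
// stand-in for resolving labels through the app's translations.
import { test, expect } from "@playwright/test";
import { t } from "../helpers/i18n";

test("user can sign in and reach the dashboard", async ({ page }) => {
  await page.goto("/login");

  await page.getByLabel(t("login.email")).fill("smoke-user@example.com");
  await page.getByLabel(t("login.password")).fill(process.env.SMOKE_PASSWORD ?? "");
  await page.getByRole("button", { name: t("login.submit") }).click();

  // Smoke-level assertion: the shell renders, nothing more.
  await expect(page.getByRole("heading", { name: t("dashboard.title") })).toBeVisible();
});
```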
On top of all four, branch protection enforces one human approval plus all CI jobs green before merge. Every layer has a clear purpose, and the entire strategy now lives in a single source-of-truth document - docs/development/qa-test-strategy.md - with an explicit coverage gaps table tied to a tracked epic. When somebody asks "are we testing this?", there's one place to look.
The post-deploy story matters too. A GitHub Actions workflow fires the API tests automatically after every successful staging deploy and alerts the team on Slack if anything fails. The point is to catch the class of bugs that only appear once the full stack is up - and to catch them inside our team, not in front of users.
TDD with agents: why it matters more, not less
Test-driven development feels almost old-fashioned in 2026, but agentic development has made it more important, not less. We enforce it through a workflow skill called test-driven-development, and the rule is uncompromising: code written before its test is deleted. Red, green, refactor. No exceptions.
The reason is straightforward. Agents are very good at producing plausible-looking code. They are less reliably good at producing code that does exactly what you specified, particularly on the edges. The discipline of writing the test first - with given/when/then assertions for the happy path, the validation failure, the auth-denied case, and the not-found case - forces the specification to exist as executable text before any implementation appears. The test becomes the contract. The implementation either satisfies it or doesn't.
Beyond happy path, the rule is that every non-trivial change must cover at least four scenarios: success, validation failure, auth denied, and not-found. That floor is non-negotiable because those four classes account for the overwhelming majority of bugs we've actually seen in agent-written code.
Mocking strategy follows the layer under test. Domain and pure logic - zero mocks. Database services - vi.mock("@ngos/db"). End-to-end API tests - no mocks at all, real stack. End-to-end UI - bilingual locators, mocked data only where it makes the test deterministic.
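Put together, the four-scenario floor and the database-layer mocking convention look roughly like this. The service, its data shape, and the mocked module are all hypothetical; what the skill actually enforces is the structure: one describe block, four named scenarios, the db package mocked at this layer.

```typescript
// Hypothetical service-level test: the four-scenario floor plus the db-layer
// mocking convention. Module names and the service itself are illustrative.
import { describe, expect, it, vi } from "vitest";

vi.mock("@ngos/db", () => ({
  db: { leads: { findById: vi.fn(), update: vi.fn() } },
}));

import { db } from "@ngos/db";
import { updateLeadStatus } from "./updateLeadStatus";

describe("updateLeadStatus", () => {
  it("updates the status and returns the lead (success)", async () => {
    vi.mocked(db.leads.findById).mockResolvedValue({ id: "l1", status: "new" });
    vi.mocked(db.leads.update).mockResolvedValue({ id: "l1", status: "qualified" });
    await expect(updateLeadStatus({ actorRole: "manager", id: "l1", status: "qualified" }))
      .resolves.toMatchObject({ status: "qualified" });
  });

  it("rejects an unknown status (validation failure)", async () => {
    await expect(updateLeadStatus({ actorRole: "manager", id: "l1", status: "??" }))
      .rejects.toThrow(/invalid status/i);
  });

  it("refuses callers without permission (auth denied)", async () => {
    await expect(updateLeadStatus({ actorRole: "viewer", id: "l1", status: "qualified" }))
      .rejects.toThrow(/forbidden/i);
  });

  it("surfaces a missing lead (not-found)", async () => {
    vi.mocked(db.leads.findById).mockResolvedValue(null);
    await expect(updateLeadStatus({ actorRole: "manager", id: "missing", status: "qualified" }))
      .rejects.toThrow(/not found/i);
  });
});
```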
Documentation that stays alive
A common failure mode of fast-moving AI-built projects is that documentation rots within weeks. We treat documentation as a first-class output of every change, enforced two ways. First, there's a rule in AGENTS.md: after any code change, update the corresponding doc if it describes the area being touched. Second, a pnpm ci:docs check verifies that documentation matches the current state of the code.
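The freshness check can be as cheap or as elaborate as the team wants. A minimal sketch of the cheap version, with a hypothetical mapping from source areas to module documents, is enough to show the idea: if code in an area changed on the branch but its document did not, the job fails.

```typescript
// Minimal sketch of a docs-freshness check. The mapping from source areas to
// module documents is hypothetical; the real check can be smarter. The idea:
// if code in an area changed on this branch but its doc did not, fail the job.
import { execSync } from "node:child_process";

const DOC_FOR_AREA: Record<string, string> = {
  "src/features/leads/": "docs/modules/leads.md",
  "src/features/estimation/": "docs/modules/estimation.md",
};

const changed = execSync("git diff --name-only origin/main...HEAD", { encoding: "utf8" })
  .split("\n")
  .filter(Boolean);

const stale = Object.entries(DOC_FOR_AREA)
  .filter(([area, doc]) => changed.some((f) => f.startsWith(area)) && !changed.includes(doc))
  .map(([, doc]) => doc);

if (stale.length > 0) {
  console.error(`Code changed but these module docs did not: ${stale.join(", ")}`);
  process.exit(1);
}
```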
The structure is one file per module, under docs/modules/. Estimation, staffing, leads, financials, AI infrastructure, UX standards, role and permission definitions, the QA test strategy, routing, UI components - each lives in its own document. When the agent starts a new task and reads the root CLAUDE.md, that file points it at the right module document for whatever area it's about to touch. The agent loads the relevant doc into context before writing a line of code.
This is why module documentation has to stay accurate. A stale doc is worse than no doc in agentic development - it actively misleads the model. The freshness check exists for the same reason that the migration rule exists: an incident taught us we couldn't trust ourselves to remember.
The mindset shift nobody tells you about
The technical methodology is the easier half. The harder half is what happens to the team.
When you watch full agentic development for the first time, the reaction is almost always the same. You see somebody type a sentence into a console and a working feature appears on staging fifteen minutes later, and your first instinct is to step back from the steering wheel. The feeling is exactly that - like letting go of the wheel on a highway. The car keeps going. It seems to be going where you want. But you don't trust it, because you're not the one holding it.
People work through that discomfort at different speeds, but they all have to work through it. Engineers who've spent years writing code line by line have to learn to spend their time on architecture, conversation, and validation instead. QA, in many ways, has the easier transition - the skill of being skeptical, designing scenarios, and looking for what could break translates directly. Business analysts and designers struggle most when they don't have agent access themselves, because they end up several steps behind the team that does. We've seen this dynamic clearly in conversations with peers: an analyst preparing a week's worth of scope while the engineers ship it in half a day.
The role that grows the most is the one that bridges the agent and the work. Somebody has to validate the plan before implementation starts. Somebody has to own the final review. The agent accelerates execution; judgment remains human. In our workflow we call this the human layer, and it sits between every agent handoff. It isn't a dedicated seat on the team and it isn't a rotation — it's a responsibility every team member carries whenever they're driving a piece of work. Whoever started the conversation with the agent owns a brief sign-off, plan review, the final merge, and the decision to halt the chain if something looks wrong. The discipline is collective; the accountability for any given change is individual.
A practical consequence is that the conceptual phase of a project now carries more weight than it used to. Architecture and feature definition determine the quality of everything downstream. Garbage in, garbage out applies with unusual force here - the output is exactly as good as the input. When we write a story for an agent, we write it the way we'd write it for a junior engineer who happens to be brilliant but has never seen the product before. Every implicit assumption a human developer would catch - "obviously the delete button shouldn't show for users without permission" - has to be made explicit.
The teams we've watched succeed with this methodology share one trait: they stopped treating the conversation with the agent as overhead and started treating it as the work.
When the stakes are high
Our methodology was built for internal tooling, where a bad merge costs us a day of cleanup. The interesting question is what changes when failure costs more than that — finance, healthcare, industrial control.
The architecture stays; the calibration shifts. Concretely: the four mandatory test scenarios stop being a floor and become a starting point, expanding to include adversarial inputs, partial failures, and timing-sensitive race conditions for each critical path. The defense-in-depth model gains a fifth layer of mutation testing on those paths, so a passing test that doesn't actually catch the mutation is treated as a missing test. Every agent action that touches a sensitive state writes to an immutable audit log, separate from git history, so post-incident review can reconstruct exactly which prompt produced which diff. Merges to critical paths require two human approvers from different functions — typically engineering plus domain expert — instead of one. And the workflow chain gains a parallel-validation step: a second subagent, running on a separate model where available, reviews the implementation independently against the original brief, and disagreement between the two halts the chain.
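To make the audit-log idea concrete, the record we would want per agent action looks roughly like this; it is an illustrative shape, not a schema we ship today.

```typescript
// Illustrative shape of an immutable audit record for a high-stakes setup.
// Not a schema we ship today; these are the fields a post-incident review needs
// to reconstruct which prompt produced which diff, and who approved it.
interface AgentAuditRecord {
  readonly id: string;             // append-only store; records are never updated
  readonly timestamp: string;      // ISO 8601
  readonly actor: string;          // human who drove the agent session
  readonly model: string;          // model and version that produced the change
  readonly promptDigest: string;   // hash of the brief/prompt, resolvable in the archive
  readonly diffDigest: string;     // hash of the diff the agent produced
  readonly artifacts: string[];    // feature brief, implementation plan, review report
  readonly approvals: { reviewer: string; role: "engineering" | "domain-expert" }[];
  readonly outcome: "merged" | "halted" | "rejected";
}
```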
None of this is hypothetical for us — it's the shape of the conversation we'd want to have before taking on work in a regulated domain. The point is that agentic development scales up, but the rules, gates, and layers have to be calibrated to the stakes, and that calibration is real engineering work, not a marketing claim.
What we've learned, so far
A few things have become clear enough to say out loud.
Context is everything. Agents work by pattern-matching against what they've seen, and the quality of the pattern they match determines the quality of the output. A mature codebase with strong conventions produces strong agent output. A greenfield project produces whatever you feed it as reference. Either way, the team's job is to curate the context - examples, rules, documentation, prior work - so the agent has something coherent to extend.
Conventions must be one, not many. The temptation in an agent-driven team is to let each contributor configure things their own way. Resist it. The whole point of the rules system is that the agent is consistent. If three engineers run three different rule sets, the codebase will drift in three different directions and no amount of testing will paper over the inconsistency.
The methodology evolves at the speed of the models. The model we're using today writes meaningfully better code than the one we were using six months ago. Practices that were necessary then are sometimes redundant now. We don't optimize prematurely - we add rules in response to real incidents, retire them when they stop earning their place, and stay close to the work so we know which is which.
Quality is now an upstream problem. The traditional QA position at the end of the pipeline doesn't work in this model. By the time something reaches a tester, the agent has already written a thousand lines and merged the PR. QA has to be embedded in the planning, in the rules, in the workflow itself. The best QA work in an agent-driven team happens before any code exists.
And finally: we are early. We are honest about that. We're building the methodology while shipping the product, and we know that some of what we believe today will be obviously wrong six months from now. But the direction is clear, the gains are real, and the teams that work this way have an unusual amount of room to think about the things that used to get squeezed out - architecture, strategy, edge cases, quality. The discussions we wished we could have when delivery pressure dominated every sprint are now happening every week. That alone tells us we're moving in the right direction.
Full agentic development isn't about replacing the team. It's about elevating what the team spends its time on. The methodology is how we make sure what comes out the other end is something we'd put our name on.
