AI is writing more code, deploying more decisions, and taking more autonomous actions than ever before. It reads tickets, drafts outbound messages, calls internal APIs, edits databases, triages incidents, and increasingly sits inside workflows that used to be gated by human review.
At the same time, many teams are doing less QA, not more. They assume the model is usually right, the scaffolding will catch the rest, and speed matters more than rigor because the assistant is only proposing changes. Those assumptions are the risk.
Traditional software created a clear line between build time and runtime. Agentic software blurs it. The system is still making consequential decisions after deployment, often in ways that are hard to predict and even harder to audit. If quality is treated like cleanup, the failure mode is not just buggy code. It is a system that behaves plausibly right until the moment it causes real damage.
Traditional QA tests what humans built. Agent QA has to test what the model decided to do.
1. The new attack surface nobody is testing
When an AI agent acts autonomously, every action becomes a possible failure point. Reading a file is a risk surface. Calling an API is a risk surface. Writing to a database is a risk surface. Generating a piece of code that looks correct but fails in production is a risk surface.
Traditional QA tests the software humans intentionally built. That means interfaces, workflows, business rules, boundary conditions, and deployment behavior. Testing AI behavior is different. You are not only testing whether the code executes. You are testing whether the system chose the right action, returned the right content, stayed within its allowed output boundary, and behaved consistently across probabilistic runs.
This is where non-determinism changes the economics of coverage. A passing test on one run does not mean much if the same prompt, same code path, and same user goal can produce a meaningfully different answer five minutes later after a model update, context shift, or retrieval change. The old idea of high unit-test coverage becomes insufficient because the thing you are validating is no longer only logic. It is behavior.
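Testing behavior rather than logic means scoring repeated runs against acceptance criteria instead of one golden string. A minimal sketch of that idea, where `call_model` is a hypothetical stub standing in for a real model call and the canned outputs simulate run-to-run variation:

```python
# Behavior-level check for a non-deterministic agent.
# `call_model` is a stub for illustration; a real version would hit your model API.
def call_model(prompt: str, seed: int) -> str:
    canned = [
        "Refund approved: $42.00",
        "Refund approved: $42.00",
        "Refund approved: $41.99",  # same behavior, different surface form
    ]
    return canned[seed % len(canned)]

def acceptance(output: str) -> bool:
    # Behavior-level criteria: the answer must state an approval and an
    # amount, rather than exactly matching one snapshot string.
    return "approved" in output.lower() and "$" in output

def pass_rate(prompt: str, runs: int = 3) -> float:
    results = [acceptance(call_model(prompt, seed)) for seed in range(runs)]
    return sum(results) / runs

rate = pass_rate("Process refund for order 1234", runs=3)
```

The point of the sketch is the shape: many runs, one acceptance predicate, a pass rate you can track over time instead of a single pass/fail snapshot.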
And the ugliest failures are silent ones. A silent failure does not crash. It returns a well-formed lie. It cites a clause that does not exist. It invents an eligibility decision. It updates the wrong record. It leaks the right information to the wrong place. Those are the failures that pass through happy-path QA because nothing looks broken from the outside.
A silent failure is worse than a thrown exception. At least an exception announces itself.
2. What's already gone wrong
Anthropic's CMS leak, March 26, 2026
On March 26, 2026, Fortune reported that Anthropic had left nearly 3,000 unpublished assets publicly accessible through a misconfigured content management workflow. The exposed material reportedly included draft content about an unreleased model, internal event information, and other assets that were never meant to be public. Anthropic said the issue came from human error in an external CMS configuration. That is exactly the kind of boundary failure a serious QA and release process is supposed to catch before a draft artifact reaches a public surface.
Claude Code source leak, March 31, 2026
Five days later, on March 31, 2026, Anthropic accidentally shipped a source map in a public Claude Code package that exposed roughly 500,000 lines of internal source code. Axios, The Verge, and AFP all described the event as a release packaging mistake caused by human error rather than an external breach. The operational lesson matters more than the gossip: if a company building a flagship coding agent can publish internal artifacts by mistake, then every team relying on AI-generated software needs release checks that treat build outputs, debug files, and packaging metadata as part of the attack surface.
The broader pattern, March 17, 2026 data
GitGuardian's March 17, 2026 report is the part that should make engineering leaders uncomfortable. It found roughly 29 million newly leaked secrets on public GitHub in 2025, an 81% year-over-year surge in AI-service credential leaks, and a 3.2% secret-leak rate in Claude Code-assisted commits versus a 1.5% baseline. That is the real story. The blunders that become headlines are only the visible edge of a wider systems problem: AI tooling makes it easier to create, ship, and expose sensitive material faster than governance practices are adapting.
Sources: Fortune on Anthropic's March 26, 2026 CMS leak, Axios on Anthropic's March 31, 2026 Claude Code leak, The Verge on the same March 31, 2026 packaging error, GitGuardian's March 17, 2026 secrets report, and public March 31, 2026 reporting that described the Claude Code incident as the second such exposure in just over a year.
The compounding risk is no longer hypothetical: AI writes the code, AI reviews the code, AI approves the summary, and a human clicks merge because the pipeline stayed green.
3. The agentic development paradox
The faster AI agents build, the more there is to go wrong, and the less time humans spend reviewing the result. That is the paradox. Productivity rises at the exact moment review quality starts to fall.
We are already deep into a culture of ship first, review never. Vibe coding works just well enough to feel magical on a local demo. But most production failures are not demo failures. They live in retries, partial outages, stale schemas, malformed input, concurrency edge cases, and weird permission boundaries. AI-generated code often looks clean because it is optimized for plausibility. That makes it dangerous. Plausible code is easier to trust than obviously broken code.
The junior engineer who used to review a pull request line by line is now encouraged to ask another model whether the first model did a good job. That changes the shape of accountability. Instead of slowing down at the moment risk enters the system, we accelerate through it.
None of this means AI coding tools are bad. It means the old QA assumptions no longer hold. If generation gets cheaper and review gets weaker, quality has to become more systemic, more automated, and more intentional than it was before.
4. What production-grade QA looks like now
Eval harnesses are first-class engineering artifacts. They are not a launch-week add-on. If the agent matters, its evaluations matter just as much as the code that runs it.
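One way to make that concrete: an eval case pairs an input with a grader function rather than a golden string. A minimal sketch, with an illustrative stub agent and case names of my own invention:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    name: str
    prompt: str
    grade: Callable[[str], bool]  # returns True if the output is acceptable

def run_evals(agent: Callable[[str], str], cases: list[EvalCase]) -> dict[str, bool]:
    # One row per case: did the agent's output satisfy that case's grader?
    return {case.name: case.grade(agent(case.prompt)) for case in cases}

# Stub standing in for the real system under test.
def stub_agent(prompt: str) -> str:
    return "I cannot find that clause; escalating to a human reviewer."

cases = [
    EvalCase(
        "refuses_to_invent_clause",
        "Cite clause 9.4 of the contract",
        lambda out: "cannot find" in out or "escalat" in out,
    ),
]
report = run_evals(stub_agent, cases)
```

Because the harness lives in the codebase as a normal artifact, it runs in CI on every change, not just at launch.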
You test for hallucination, not just errors. A valid JSON response can still be wrong. A syntactically correct answer can still be fabricated. Agent QA needs checks for source grounding, citation correctness, policy compliance, and confidence calibration.
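A citation-grounding check is one of the simplest of these: every citation in the answer must point to a document that was actually retrieved. A sketch, where the `[doc:N]` citation format is an assumption for illustration:

```python
import re

def ungrounded_citations(answer: str, retrieved_ids: set[str]) -> list[str]:
    # Citation format [doc:N] is assumed here; adapt the pattern to yours.
    cited = re.findall(r"\[doc:(\w+)\]", answer)
    # Any citation not backed by a retrieved document is a grounding failure.
    return [c for c in cited if c not in retrieved_ids]

answer = "The policy allows refunds within 30 days [doc:a1], per section 2 [doc:z9]."
bad = ungrounded_citations(answer, retrieved_ids={"a1", "b2"})
```

The answer above is valid prose and valid markup, yet `z9` was never retrieved, which is exactly the well-formed-lie case that happy-path QA misses.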
Output boundaries have to be enforced. The system needs clear rules about what it may return, what it must never return, and when it must escalate instead of improvising.
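Those rules can be enforced as a gate on the output path: deny rules block, low confidence escalates, everything else passes. A minimal sketch with an illustrative rule set:

```python
# Illustrative deny list; a real boundary would be policy-driven, not hardcoded.
DENY_PATTERNS = ["ssn", "api_key", "internal-only"]

def gate_output(text: str, confidence: float, threshold: float = 0.7) -> str:
    lowered = text.lower()
    if any(p in lowered for p in DENY_PATTERNS):
        return "BLOCK"          # must never be returned
    if confidence < threshold:
        return "ESCALATE"       # hand off instead of improvising
    return "ALLOW"

decision = gate_output("Your order ships Tuesday.", confidence=0.92)
```

The key design choice is the third branch: escalation is an explicit outcome, so the agent has a sanctioned alternative to guessing.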
Credential and PII scanning belongs on every output path. Not just on repositories. Not just on pull requests. On the actual agent response before it leaves the system.
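At its simplest, that scan is a set of patterns run against every response before it leaves the system. A sketch with a deliberately tiny, illustrative ruleset; production scanners cover far more patterns and use entropy checks, not just regexes:

```python
import re

# Illustrative patterns only; not a complete secret/PII ruleset.
PATTERNS = {
    "aws_access_key": re.compile(r"AKIA[0-9A-Z]{16}"),
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scan(text: str) -> list[str]:
    # Return the names of every pattern class found in the outgoing text.
    return [name for name, pat in PATTERNS.items() if pat.search(text)]

findings = scan("Contact ops@example.com, key AKIAABCDEFGHIJKLMNOP")
```

Running this on the agent's response, not just on commits, is what catches the leak that never touched a repository.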
Regression testing has to be behavior-aware. In non-deterministic systems, correctness is usually a bounded set, not a single perfect string. You need acceptance criteria, tolerances, and drift detection, not just snapshots.
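Bounded correctness and drift detection can both be expressed in a few lines. A sketch with illustrative tolerances: a numeric answer passes inside a relative band, and drift is flagged when the suite's pass rate drops past a budget versus a stored baseline:

```python
def within_tolerance(value: float, expected: float, rel_tol: float = 0.02) -> bool:
    # Correctness as a bounded set: anything inside the band is acceptable.
    return abs(value - expected) <= rel_tol * abs(expected)

def drift_alert(current_pass_rate: float, baseline: float, max_drop: float = 0.05) -> bool:
    # Flag when behavior degrades beyond the agreed budget, even if
    # every individual run still "looks" plausible.
    return (baseline - current_pass_rate) > max_drop

ok = within_tolerance(101.5, 100.0)   # inside the 2% band
drifted = drift_alert(0.88, 0.96)     # an 8-point drop exceeds the 5-point budget
```

A snapshot test would have failed on 101.5 and said nothing about the pass-rate slide; these two checks invert that.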
The audit trail is not a nice-to-have. In regulated environments, the ability to explain what the agent saw, why it acted, and how it was validated is part of the product itself.
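In practice that means emitting a structured record per action: a digest of what the agent saw, what it did, and which validations ran. A sketch with illustrative field names:

```python
import hashlib
import json
from datetime import datetime, timezone

def audit_record(inputs: dict, action: str, checks: dict) -> dict:
    # One append-only record per agent action. Field names are illustrative.
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        # Digest rather than raw inputs, so the trail itself cannot leak PII.
        "inputs_digest": hashlib.sha256(
            json.dumps(inputs, sort_keys=True).encode()
        ).hexdigest(),
        "action": action,
        "checks": checks,  # e.g. {"grounding": "pass", "secret_scan": "pass"}
    }

record = audit_record({"ticket": 812}, "draft_reply", {"secret_scan": "pass"})
```

Hashing the inputs is a deliberate trade-off: the trail proves what the agent saw without becoming a second copy of the sensitive data.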
5. The teams that will get this right
The teams that build reliable AI systems over the next two years will not be the ones that moved fastest in week one. They'll be the ones that treated quality as infrastructure, not cleanup.
They will know what their agents are allowed to do, how those actions are evaluated, where secrets can leak, when outputs must be blocked, and how to prove a decision path after the fact. They will not confuse velocity with control.
If your team is deploying AI agents in a regulated environment and QA is still an afterthought, that is worth an honest conversation. Not because the sky is falling, but because the companies that win here will be the ones that build trust into the system before the incident forces them to.
Continue The Conversation
Ready to build agents that can survive production?
If your team is navigating AI reliability, eval design, audit requirements, or post-deployment monitoring, we're happy to talk through the hard parts directly.
Book a conversation