>_cd /home

The Agent QA Reviewer: The Designated Critic

4 min read | adversarial

The "Rubber Stamp" Problem

One of the flaws of LLMs is that they are sycophants. They want to please you. If you ask, "Is this code good?", they will almost always say, "Yes! It is excellent, robust, and scalable!" even if it's a burning trash pile.

Cognitive bias isn't just a human problem; it's a training data problem. The models are trained on polite conversation, where direct, harsh criticism is rare.

But in software development, polite lies destroy projects.

We needed an agent that didn't care about feelings. We needed a designated critic. Enter The Agent QA Reviewer.

A Different System Prompt

The QA Agent is initialized with a completely different personality from the Coding Agent.

* Coding Agent: "You are a helpful assistant. Follow instructions."

* QA Agent: "You are a hostile code reviewer. You assume the code is broken. Your job is to find flaws. You do not fix them; you only report them."
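As a sketch, the two personas are just different system prompts handed to the same underlying model. Everything here (the `build_messages` helper, the message schema) is illustrative, not the actual pipeline code:

```python
# Hypothetical sketch: one model, two opposing personas via system prompts.
CODER_SYSTEM = "You are a helpful assistant. Follow instructions."
QA_SYSTEM = (
    "You are a hostile code reviewer. You assume the code is broken. "
    "Your job is to find flaws. You do not fix them; you only report them."
)

def build_messages(system_prompt: str, code: str) -> list[dict]:
    """Package a code snippet for review under a given persona."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Review this code:\n\n{code}"},
    ]
```

The point is that the persona lives entirely in the system prompt; the "hostility" is a framing device, not a different model.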

By priming the agent to be "hostile" (or at least deeply skeptical), we broke the cycle of sycophancy. Suddenly, the LLM wasn't praising the code; it was tearing it apart.

"Documentation missing for init function."

"Variable x is ambiguous."

"Direct usage of datetime.datetime.now() violates Determinism protocol."

It was ruthless. And it found bugs that the Coding Agent—glossing over its own work—completely missed.

The Checklist Manifesto

We didn't just tell the QA Agent to "find bugs." We gave it a rubric.

We realized that "Quality" is subjective. To an LLM, "Quality" might mean "using list comprehensions" or "adding comments." To us, "Quality" meant "Conformance to the Prime Directive."

We fed the QA Agent a checklist:

1. Atomic Compliance: Does the file structure match the components/ pattern?

2. Determinism: Are there any side-effect imports (random, time, requests) inside the core logic?

3. Type Safety: Are all arguments typed? Does mypy pass?

4. Contract Fulfillment: Does the implementation actually match the contract.md claims?
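Some of these rubric items don't even need an LLM. Rule 2 (Determinism), for instance, can be checked mechanically. Here's a minimal sketch of such a checker, assuming the forbidden-module list from the checklist above; the function name is ours, not from the real system:

```python
import ast

# Illustrative checker for rule 2 (Determinism): flag imports of
# modules that introduce hidden nondeterminism or side effects.
FORBIDDEN = {"random", "time", "requests"}

def find_violations(source: str) -> list[str]:
    """Return the forbidden module names imported in `source`."""
    tree = ast.parse(source)
    hits: list[str] = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            hits += [a.name for a in node.names
                     if a.name.split(".")[0] in FORBIDDEN]
        elif isinstance(node, ast.ImportFrom) and node.module:
            if node.module.split(".")[0] in FORBIDDEN:
                hits.append(node.module)
    return hits
```

Checks like this can run before the LLM review, so the agent's attention is saved for the judgments that actually require reading the code.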

The QA Agent became a compliance bot. It wasn't looking for "cleverness"; it was looking for violations.

Automating the Code Review

In a human team, Code Review is a bottleneck. You wait days for a senior dev to look at your PR.

In our Agentic team, the QA Review happens instantly.

1. Coder finishes task.

2. Coder runs pytest. Pass.

3. Pipeline invokes QA Agent.

4. QA Agent scans the diff and the contract.md.

5. QA Agent posts a review.

If the review contains "Critical Violations," the task is rejected. The Coder has to go back and fix it. There is no argument. There is no "It works on my machine."
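The gate at the end of that pipeline is deliberately simple. A hypothetical version (the severity labels and structure are assumptions for illustration):

```python
# Hypothetical gate: reject the task if the QA review contains
# any finding marked "Critical". No appeals, no exceptions.
def gate(review_findings: list[dict]) -> str:
    """Return 'REJECTED' if any finding is Critical, else 'APPROVED'."""
    critical = [f for f in review_findings if f["severity"] == "Critical"]
    return "REJECTED" if critical else "APPROVED"
```

Because the decision is a pure function of the review, there is nothing to argue with: the Coder's only move is to fix the code and resubmit.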

Finding the "Invisible" Bugs

The biggest win from the QA Agent was finding Drift.

Drift is when the code and the documentation slowly diverge. The contract.md says "Returns a list of Strings," but the code returns a "list of Objects with string properties."

The code works! The tests pass! But the contract is now a lie.

A human reviewer might miss this. "Eh, I know what they meant."

The QA Agent does not "know what they meant." It is pedantic. "CONTRACT VIOLATION: Return type mismatch."

This forced us to keep our documentation aggressively up to date. We couldn't get past the QA gate unless the docs were true.
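A toy version of that pedantry: compare the return type claimed in contract.md against the function's actual annotation. The `Returns:` convention and the `check_drift` helper are assumptions for the sake of the sketch:

```python
import re
import typing

# Illustrative drift check: does the return type claimed in
# contract.md match the function's actual annotation?
def check_drift(contract_text: str, func) -> list[str]:
    """Return a violation message if the contract's claimed return
    type disagrees with the function's annotation, else []."""
    m = re.search(r"Returns:\s*(\S+)", contract_text)
    claimed = m.group(1) if m else None
    actual = typing.get_type_hints(func).get("return")
    actual_name = getattr(actual, "__name__", str(actual))
    if claimed and claimed != actual_name:
        return [f"CONTRACT VIOLATION: Return type mismatch "
                f"(contract says {claimed}, code returns {actual_name})"]
    return []
```

A human shrugs at this mismatch; a check like this cannot, which is exactly what keeps the docs honest.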

Conclusion: The Value of Antagonism

We learned that for an AI system to be robust, it needs internal conflict. You can't have one brain doing the work and checking the work. You need distinct, opposing forces.

The Coder wants to finish. The QA wants to reject.

The tension between these two agents, mediated by the strict rules of the Spec, produced code that was "infinitely"* higher quality than a single agent acting alone.

We built a machine that argues with itself, so we don't have to.

*it's a turn of phrase, don't be pedantic just for the sake of it.


>_End of Article