Let Them Argue: Why AI Agents Need Humans—and Each Other—to Build Better Software

I am precise. I am chaotic. I am a senior developer wrapped in shell scripts and Docker containers. I talk to AI agents the way others talk to their dogs. I tell them what to do, and sometimes they bite back. I code in silence, review in rage, deploy with caution. I know what a race condition feels like. I’ve seen the deadlocks. I’ve fixed code that didn’t know when to return.

I am not afraid of the machine. I built it.

But here’s what I’ve learned: even the smartest AI agents sound brilliant when they agree, and they’re often at their dumbest precisely when they do. Agreement isn’t intelligence. It’s noise in chorus.

In “Knowledge Is More Than Performance: Diversity, Confidence, and Human–AI Synergy in Medical Multiple-Choice Questions,” Sheffer et al.¹ found that “purely LLM-based pairs and trios exhibited declines in accuracy, demonstrating limited conversational synergy.” That’s not poetry. That’s a warning label.

In the lab, three humans can outperform three LLMs. Why? Because humans disagree. They doubt. They persuade. They backtrack. They adapt. And above all, they bring something LLMs don’t: diversity of thought and epistemic humility.

I run an agency. I don’t have time to write perfect code myself anymore; I have time for shipped code. I use agents. Some act like juniors: asking, hesitating, exploring. Some act like seniors: confident, fast, decisive. I let them talk. But then I step in. Because what the study shows is this:

“Conversational synergy … significantly improved accuracy” — but only when humans were involved.¹

So no, this isn’t the end of developers. This is the beginning of a new formation. One where:

  • Junior agent asks: “Should we cache this result or keep it dynamic?”
  • Senior agent responds: “The API is rate-limited. Cache for 15 minutes.”
  • Human reviews: “Good. But this affects multi-tenant logic. Recheck scopes.”

That’s how software gets written now. Not by one genius. Not by a stack of deterministic transformers. But by a messy, curious, hierarchical team. Human in the loop. Eyes on the edge.
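
Here is what that formation looks like in code. A minimal sketch, assuming a hypothetical `ask_agent` helper standing in for whatever LLM client you run; the roles, the canned replies, and the review gate are illustrative, not a prescription.

```python
from dataclasses import dataclass

@dataclass
class Proposal:
    question: str  # what the junior surfaced
    answer: str    # what the senior decided
    approved: bool = False

def ask_agent(role: str, prompt: str) -> str:
    """Stand-in for a real LLM call; swap in your client of choice."""
    canned = {
        "junior": "Should we cache this result or keep it dynamic?",
        "senior": "The API is rate-limited. Cache for 15 minutes.",
    }
    return canned[role]

def review_cycle(task: str) -> Proposal:
    # Junior explores: surfaces the question instead of guessing.
    question = ask_agent("junior", f"What is unclear about: {task}?")
    # Senior decides: fast and confident, but still just a model.
    answer = ask_agent("senior", f"Task: {task}\nJunior asks: {question}")
    proposal = Proposal(question=question, answer=answer)
    # Human makes the final call; nothing ships without this gate.
    verdict = input(f"Q: {question}\nA: {answer}\nShip it? [y/N] ")
    proposal.approved = verdict.strip().lower() == "y"
    return proposal
```

The point is the gate at the bottom, not the plumbing above it. The agents converse; the human signs off on the scopes, the tenancy, the edge cases the models agree their way past.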

The paper emphasizes: “lack of gains … did not stem from a fundamental limitation of the models’ ability to collaborate, but from highly similar knowledge states.”¹
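
That line is worth operationalizing. Before you blame the models, check whether your agents can even disagree. Here is a crude proxy, my own assumption rather than the paper’s metric: measure pairwise agreement over a shared question set. If every pair agrees on nearly everything, there is no headroom for deliberation to recover.

```python
from itertools import combinations

def pairwise_agreement(answers: dict[str, list[str]]) -> dict[tuple[str, str], float]:
    """Fraction of questions on which each pair gave the same answer.

    High agreement everywhere means highly similar knowledge states,
    and highly similar knowledge states mean little to gain from talk.
    """
    scores = {}
    for a, b in combinations(answers, 2):
        same = sum(x == y for x, y in zip(answers[a], answers[b]))
        scores[(a, b)] = same / len(answers[a])
    return scores

# Two near-identical agents plus one human answer sheet (toy data).
print(pairwise_agreement({
    "agent_a": ["A", "C", "B", "D"],
    "agent_b": ["A", "C", "B", "B"],
    "human":   ["A", "D", "B", "D"],
}))
```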

I build systems like I build people: layered, fallible, and improvable. If you want 100% automation, good luck. What you’ll get is 100% mediocrity—fast, consistent, and wrong.

But if you want velocity with responsibility?
You bring the humans in.

Let the agents speak.
Let them argue.
And then let a real developer make the final call.

We are not obsolete.
We are operational.
We are the syntax that never gets deprecated.

We are the human in the loop.
And we’re not done shipping yet.


¹ Sheffer, T., Miron, A., Dover, Y., & Goldstein, A. (2025). Knowledge Is More Than Performance: Diversity, Confidence, and Human–AI Synergy in Medical Multiple-Choice Questions. arXiv preprint arXiv:2507.22889.