Last night I started to design an MCP server tool that can index a code repository and allow for context injection for LLMs like Claude Code. When speaking to Claude about this, we were able to create and execute a plan very quickly.

At first I was excited, then a bit anxious: a fairly complex project had just been created in about 15 minutes. After inspecting the code, I started thinking about how a human or agent would actually use the tool in practice. It wasn’t long before I had a few questions with no clear answers.

LLMs like Claude and ChatGPT, regardless of the task, always answer with confidence and enthusiasm, especially in coding. If you trust the verbal response, can’t read code, or only give it a light skim, it’s easy to believe a one-shot generation is a full solution. Oftentimes, when you really dive into the code, that’s not the case: you get a light implementation covering the most basic use cases, and even then, is it really usable?

The “you’re absolutely right” verbiage pops into my brain here… The models are so eager to appease, so ready to make you feel like a genius, that they don’t give you the real picture of the complexity.

Taking these LLM answers at face value paints a grim outlook for human developers, but behind LLM-generated code there often lies a large amount of deep work required to refine it into software that actually works and scales.

Last night’s work was to develop an MCP tool that lets agents use repo context analysis in DiffPrism. Most of the logic is deterministic and requires no LLM at all. I enjoy building these kinds of tools because static analysis can be run and reused indefinitely without paying a pricey model to pull answers out of a magic box.
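To make the "deterministic, no LLM required" point concrete, here is a minimal sketch of that kind of repo analysis: walk a repository, parse each Python file, and build an index of its functions and classes using only the standard library. The function name and output shape are hypothetical illustrations, not DiffPrism’s actual API.

```python
import ast
from pathlib import Path

def index_python_files(repo_root):
    """Build a {relative_path: [definition names]} index for a repo.

    Purely deterministic static analysis -- no model calls, so the
    result is reproducible and essentially free to recompute.
    (Illustrative sketch; not DiffPrism's real interface.)
    """
    index = {}
    for path in Path(repo_root).rglob("*.py"):
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"))
        except SyntaxError:
            continue  # skip files that don't parse
        index[str(path.relative_to(repo_root))] = sorted(
            node.name
            for node in ast.walk(tree)
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
        )
    return index
```

An agent (or an MCP tool wrapping this) could call it once per repo and inject the resulting index as context, rather than re-deriving the repo’s structure through expensive model calls.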

As programming continues to change, we will need tools that help us review large amounts of code in a short period of time. A tool that can quickly extract the patterns of a full repo and apply them when reviewing large, fast-moving changes could be genuinely helpful.

As agents get better and we keep producing changes at a rapid rate, the balance of time developers spend writing versus reviewing code seems to be flipping. In the past, people spent most of their time writing; going forward, I expect they’ll spend more of it reviewing. Interestingly, this puts more pressure on the reviewer than the writer, especially in a team context. Right now it’s humans orchestrating agents, and eventually it might just be agents orchestrating themselves. I believe we will still need a human in the loop, and we’re gonna be doing that in code review. That’s my bet. How it looks exactly, we’ll see.

Right now I’m working on an MCP tool that gives agents the ability to review their own code, plus a UI for humans to review the final output accompanied by insights. The goal is to enhance code review when running multiple agents in parallel. diffprism.com