
Vibe Coding experiment

A bit more interesting data on vibe-coding vs structured assistant-coding.

Yesterday I ran an experiment. I had a task that would never touch production (a local utility), and I'd been reading yet another piece about an engineer cranking out 10 MRs per day with AI. That was reason enough to actually try the dump-everything-at-once approach.

So I described the full scope to Claude upfront and let it run. Two things stood out. It took longer and produced more frustrating dead ends than my usual flow (detailed design first, then feature-by-feature implementation[1]). And it burned roughly twice the tokens compared to the structured approach. March 24 alone hit $35 after a session involving all three models, versus $10 on a more focused day. If you want to track your own numbers, npx ccusage@latest does the job.

I've seen the "feed it the whole task at once" question pop up a lot lately. Based on this, my answer is: don't. At least not yet. The model doesn't have the context to make the right trade-offs upfront, and you end up steering it through corrections that a proper design phase would have avoided entirely. And if you think vibe-coded output is going into a proper review pipeline, that's a separate problem[2].

Vibe-coding makes for great demos. In practice, it's just paying more for worse results.

Making GenAI code review actually useful

Alamedin Gorge is one of my favorite places for a short weekend walk.

I ran a single Claude reviewer agent against an approximately 1000-line C++ change last month (yes, it's too big for a normal workflow, but we are in the GenAI era and norms are different)—some pretty important video pipeline updates, but nothing exotic. The review came back with 14 findings. Three were real. The rest included a phantom race condition in code that runs on a single thread, two style nitpicks elevated to High severity, and a complaint about missing error handling on a function that already returns std::expected. The signal-to-noise ratio was bad enough that I almost closed the tab.

This is the dirty secret of GenAI-assisted code review. The models are good enough to spot real issues, sometimes ones a tired human would miss. But they also hallucinate problems with enough confidence to waste your time, and the false positives are not random. They cluster around the same blind spots every run because they come from the same weights, the same training distribution, the same biases baked into one model family.

I wrote about the false positive problem briefly in my earlier piece on development processes in the GenAI era[1], but I didn't have a concrete solution at the time. Now I do, and it's been running in my workflow for a few weeks. The short version: stop asking one agent. Ask three, and keep only what at least two of them agree on.
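The voting step itself is simple. Here is a minimal Python sketch of 2-of-3 consensus filtering; the `Finding` schema and its fields are my illustrative assumptions, not a real agent output format, and in practice you'd need fuzzy matching (findings from different models rarely land on identical line numbers or wording):

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical normalized finding. Real agent output is free-form text,
# so you'd first map each finding onto a comparable key like this one.
@dataclass(frozen=True)
class Finding:
    file: str
    line: int
    category: str  # e.g. "race-condition", "error-handling", "style"

def consensus(findings_per_agent, quorum=2):
    """Keep only findings reported by at least `quorum` independent agents."""
    votes = Counter()
    for agent_findings in findings_per_agent:
        # Deduplicate per agent so one agent cannot vote twice
        # for the same finding.
        for f in set(agent_findings):
            votes[f] += 1
    return [f for f, n in votes.items() if n >= quorum]

# Example: three reviewer agents, one overlapping finding.
agent_a = [Finding("pipeline.cpp", 120, "race-condition"),
           Finding("pipeline.cpp", 300, "error-handling")]
agent_b = [Finding("pipeline.cpp", 120, "race-condition")]
agent_c = [Finding("pipeline.cpp", 88, "style")]

kept = consensus([agent_a, agent_b, agent_c])
# Only the finding two agents agreed on survives.
```

The frozen dataclass gives hashable, comparable findings for free, which is what lets `Counter` do the vote tallying. The real leverage comes from running the three agents on different model families, so their false positives don't share the same blind spots.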

Development processes in the GenAI era

The current debate around GenAI and C++ is a good illustration of the real problem. Many engineers report that models are worse than juniors. Others report dramatic speedups on the same language and problem space. Both observations are correct.

The difference is not the model. It is the presence or absence of state.

Most GenAI usage today is stateless. A model is dropped into an editor with a partial view of the codebase, no durable memory, no record of prior decisions, no history of failed attempts, and no awareness of long-running context. In that mode, the model behaves exactly like an amnesiac junior engineer. It repeats mistakes, ignores constraints, and proposes changes without understanding downstream consequences.

When engineers conclude that “AI is not there yet for C++”, they are often reacting to this stateless setup.

At the same time, GenAI does not elevate engineering skill. It does not turn a junior into a senior. What it does is amplify the level at which an engineer already operates. A senior engineer using GenAI effectively becomes a faster senior, and a junior becomes a faster junior. Judgment is not transferred, and the gap does not close automatically.

These two facts are tightly coupled. In stateless, unstructured usage, GenAI amplifies noise. In a stateful, constrained workflow with explicit ownership and review, it amplifies competence.

This is why reported productivity gains vary so widely. Claims of 200–300% speedup are achievable, but only locally and only within the bounds of the user’s existing competence. Drafting, exploration, task decomposition, and mechanical transformation accelerate sharply. End-to-end throughput increases are lower because planning, integration, validation, and responsibility remain human-bound.

The question, then, is not whether GenAI is “good enough”. The question is what kind of system you embed it into.

Note

Everything I explain below applies only to the stateful GenAI setup.