AI & Automation
I compared 5 AI coding assistants for 30 days. The winner was a surprise.
Real client work, not toy benchmarks. Thirty days with Claude Code, Cursor, Copilot, Windsurf, and Gemini. The ranking, the receipts, and the metric I wish I'd tracked from day one.
By Mr. Gill
Thirty days ago I decided to settle this for myself. Too many confident takes on Twitter. Too many benchmarks measuring things I don't actually care about. I wanted to know which AI coding assistant makes me faster on real client work, not which one solves a LeetCode puzzle in the fewest tokens.
So I ran five of them, side by side, on the same codebase, for a month. Real work. Shopify app features. Firebase Functions. React component refactors. The sort of thing I'd actually bill a client for.
The winner was a surprise. Let me show you the receipts.
The setup
I picked the five tools I hear about most often: Claude Code, Cursor, GitHub Copilot, Windsurf, and Gemini Code Assist. For each one I ran the same rotation of tasks over six days, then compared results.
The tasks weren't toy problems. They came straight from client backlogs: adding a new Shopify webhook handler with retry logic, refactoring a React form to use controlled inputs with validation, writing Firebase security rules for a new collection, migrating a TypeScript file from one state management library to another, debugging a flaky integration test.
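To make the first task concrete, this is roughly the shape of retry logic I was asking each assistant to produce: a generic wrapper that retries a failing handler with exponential backoff. A minimal sketch for illustration only; `withRetry` and its parameters are my names, not any tool's output, and a real Shopify handler would also need signature verification and idempotency checks.

```typescript
// Retry an async operation with exponential backoff.
// Illustrative sketch: attempts and baseDelayMs are hypothetical defaults.
async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 100,
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Back off exponentially: baseDelayMs, 2x, 4x, ...
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** i));
    }
  }
  // All attempts exhausted; surface the last failure.
  throw lastError;
}
```

The interesting part of the task wasn't this helper itself but whether the assistant wired it into the existing webhook code without breaking the callers.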
The ranking, with the reasoning
Last place went to Gemini Code Assist, which kept losing context on anything longer than one file. It would confidently edit a function in a way that broke its callers. I spent more time fixing its edits than writing code, and dropped it after ten days.
The winner was Claude Code. Not because it writes the prettiest code (it doesn't always), but because it actually reads the whole repo when it needs to. When I ask it to refactor a component, it looks at the callers. When I ask it to add a test, it finds the existing test patterns and matches them.
Cursor came in a close second. It's the best interactive editing experience I've ever used. The in-line chat, the cmd-K edit, the multi-file rename: all feel like what an IDE should feel like in 2026. Where it falls short for me is the longer horizon. If the task takes more than 20 minutes of agent time, Cursor tends to lose the thread.
Windsurf, in third, was a genuine surprise. I went in skeptical and came out recommending it to two developer friends. Its agent mode handles multi-file changes more gracefully than Cursor's on average, though the polish around single-file editing is a touch behind.
GitHub Copilot, in fourth, is a specific tool for a specific job. It's still the best inline completion tool for typing-speed augmentation. I wouldn't ask it to refactor anything, but I wouldn't give up the suggestions while writing fresh code either.
What changed my mind about all of this
The ranking flipped a couple of times during the month. The first week, Cursor felt like the obvious winner because the feedback loop is so tight. By the end of the second week, I started taking on bigger refactors and Claude Code's ability to actually understand the repo's structure mattered more than Cursor's speed.
By week three I was using both. Cursor for the tight-loop work — adding a feature to a single component, fixing a specific bug, writing a one-off utility. Claude Code for anything that crossed three or more files, anything involving test suites, anything I'd normally have procrastinated.
The real question is: which one wastes less of my attention? That's the metric that matters when you're billing clients.
The costs are easy to underestimate
I came in expecting the subscription cost to be the main variable. It's not. The tokens I burn on Claude Code's heavy-lifting refactors dwarf the monthly subscription. On a particularly big migration week, I spent close to a full-time junior developer's daily pay in API credits on a single project.
Was it worth it? Yes, without question. That same migration would've been about four days of my own time. But the cost structure is different from "pay $20/month for Copilot" in a way nobody prepares you for. If you're solo, budget for it. If you're a studio billing clients, build it into the price.
What I'd tell you if you asked in person
Pay for two tools. Not one, not five. Two. One inline completion tool (Cursor or Copilot; I prefer Cursor). One agentic worker for bigger jobs (Claude Code). The other three aren't bad; you just don't need them if the two you have cover the full spectrum.
Don't pick based on benchmarks. Pick based on a one-week trial on your actual work. Benchmarks measure what's measurable; your workflow is what matters.
And finally: all of these will be completely different in six months. What I said a year ago about Copilot is no longer true. What I'm saying now will age the same way. Pick the tool that works today, and plan on re-evaluating every quarter.
The takeaways
- Claude Code won on complex, multi-file work. Cursor won on tight-loop single-file editing. Run both, not one.
- The real metric isn't code quality — it's how little supervision the tool needs. Track your supervision time, not your lines-per-hour.
- Token costs on agentic tools are higher than most expect. Budget for them as a line item, not an afterthought.
- Re-evaluate every quarter. This space is moving fast enough that last year's winner is often this year's also-ran.