

An AI agent triaged our inbox for two weeks. Three things broke.

I wired Claude into our support inbox to classify and draft replies, then watched what happened over 84 emails. Here are the three failures that taught me where AI agents shouldn't touch customer comms yet.

By Mr. Gill

I spent the first week of April convinced AI inbox triage was a solved problem.

Eight triage prompts on the homepage of every AI-tools newsletter. Demo videos showing Claude or GPT-4 sorting and replying to support emails in real time. So I did the obvious thing and wired our support inbox to a Claude API workflow I'd built in an afternoon. Read the new email, classify it, draft a reply, send it to me for one-click approval. Easy.

What actually happened over the next two weeks taught me where these systems aren't ready, and where they are.

The setup, in case you want to copy it

The whole thing was about 200 lines. IMAP polling on a Cloud Function every five minutes. New emails went into a queue. Claude got the body, the sender, and the last three threads we'd had with that sender, if any. I asked it to classify into one of seven buckets (sales lead, support issue, scope question, refund, spam, vendor, other) and draft a reply if appropriate.

const prompt = `Classify this email into one of: sales, support, scope, refund, spam, vendor, other.
Then draft a reply matching PIXIPACE voice (direct, honest, no sales hype).
Return JSON: { bucket, confidence, reply, flag_for_human }.`;

// Ask Claude, then parse the JSON it returns.
const msg = await anthropic.messages.create({
  model: 'claude-3-5-sonnet-latest',
  max_tokens: 1024,
  messages: [{ role: 'user', content: `${prompt}\n\n${emailBody}` }],
});
const result = JSON.parse(msg.content[0].text);

// Confidence below 0.85 → skip auto-draft, flag for human only.
if (result.confidence < 0.85) {
  await flagForHuman(email, result);
  return;
}

The drafts went into Gmail's Drafts folder under each sender. I'd open Gmail in the morning, scan the queue, click Send if it looked good, edit if not. Cost me about $14 in Claude API calls for the two weeks.
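For reference, the Gmail API expects a draft as a raw RFC 2822 message, base64url-encoded. A minimal sketch of the encoding step (the helper name and headers are illustrative, not lifted from my actual code):

```javascript
// Build the base64url-encoded RFC 2822 payload that the Gmail API's
// users.drafts.create endpoint expects in message.raw.
function buildDraftPayload(to, subject, body) {
  const raw = [
    `To: ${to}`,
    `Subject: ${subject}`,
    'Content-Type: text/plain; charset=utf-8',
    '',
    body,
  ].join('\r\n');
  // Node >= 15 supports 'base64url' natively.
  return Buffer.from(raw, 'utf-8').toString('base64url');
}
```

The payload goes into `gmail.users.drafts.create({ userId: 'me', requestBody: { message: { raw } } })` from the official googleapis client, which is what files the draft under the right thread.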

Failure 1: It replied to a refund request before I'd seen it

A client emailed asking for a partial refund on a project we'd completed three weeks earlier. The complaint was fair but nuanced — the deliverable matched the spec, but the spec hadn't anticipated a workflow change on her end.

Claude classified it as "support: scope clarification" with 0.91 confidence and drafted a reply that politely asked her to clarify what was missing from the deliverable. Confident, professional, exactly the wrong tone.

The right reply was, "I'm sorry, let's hop on a 15-minute call this week and figure out a fair adjustment." Empathy first. The AI's reply read like a help-desk ticket from a SaaS company.

The agent had auto-drafted with confidence above my threshold. The draft sat in Gmail's outgoing folder for an hour because of a delivery delay, and I caught it before it sent. The next refund-shaped email might not be so lucky.
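Part of the problem was that my gate only looked at the confidence number, never the bucket. A gate can't fix a wrong bucket (this email was misfiled as support), but buckets where a mistake is expensive should never auto-draft at all. A sketch of the guard I should have had from day one (the always-human set is my judgment call, not a universal rule):

```javascript
// Buckets where a wrong reply damages a relationship.
// These always go to a human, regardless of model confidence.
const ALWAYS_HUMAN = new Set(['refund', 'sales', 'scope']);

function shouldAutoDraft(bucket, confidence, threshold = 0.85) {
  if (ALWAYS_HUMAN.has(bucket)) return false;
  return confidence >= threshold;
}
```

With this in place, a refund classified at 0.99 still lands in front of a human; only low-stakes buckets like spam and vendor ever clear the gate.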

Failure 2: It answered a question I'd been deliberately avoiding

A sales lead emailed: "What's your typical price range for an e-commerce build?" Claude pulled the answer from our pricing page and replied with the range.

The problem is, I don't quote prices to leads in cold email. Not because I'm hiding them (they're literally on our public pricing page) but because the conversation we have on a discovery call generates 3-4× the close rate of a quote-by-email. The friction of the call qualifies the lead.

The AI didn't know my sales process. It just answered the question. Honestly, transparently, exactly per our public pricing. And every lead that got auto-replied skipped the call entirely.

I lost three leads this way before I noticed the pattern: my Calendly booking rate was dropping.
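The lesson: an agent needs an explicit deny-list of topics to deflect, because it can't infer a sales process from the inbox alone. A sketch of the rule I'd encode now (the keywords and canned reply are illustrative):

```javascript
// Questions the agent must deflect to a call instead of answering,
// even when the answer is technically public.
const DEFLECT_TO_CALL = [/price|pricing|cost|budget|quote/i];

function routeSalesReply(emailBody) {
  if (DEFLECT_TO_CALL.some((re) => re.test(emailBody))) {
    return {
      action: 'deflect',
      reply: 'Good question. Pricing depends on scope, so the fastest way to a real number is a 15-minute call.',
    };
  }
  return { action: 'draft' };
}
```

The point isn't hiding information; it's that "answer truthfully" and "follow the sales process" are different objectives, and only one of them is in the prompt by default.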

Failure 3: The vendor spam problem (which I'd predicted, badly)

  • 84 inbound emails
  • 61 auto-handled correctly
  • 23 required intervention
  • $14 API cost over two weeks

Vendor spam (agencies pitching me white-label SEO, design shops wanting partnerships, "I noticed your website had a minor issue" cold pitches) went into the spam bucket about 80% of the time. Fine.

The 20% that got through were the ones that mentioned specific clients of ours by name. The agent assumed those were real warm intros, classified them as "vendor" or "other," and drafted polite responses. One sender had scraped our case studies and was pitching themselves as a partner who'd "worked with" the same brands.

This is not Claude's fault. It's a human social-engineering hack that the model has no way to detect from a single email. But it means a non-zero fraction of cold pitches got semi-warm responses from us, which made our follow-up volume worse, not better.
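The only defense I can see is cross-referencing the claim against data the model can't get from one email: the set of domains we've actually corresponded with. A sketch of that check (the helper and the demotion rule are my own, not something Claude provides):

```javascript
// Demote "warm intro" claims from senders we've never corresponded
// with. knownDomains would come from the CRM or sent-mail history.
function verifyWarmIntro(senderEmail, mentionsClientNames, knownDomains) {
  const domain = senderEmail.split('@')[1]?.toLowerCase();
  if (mentionsClientNames && !knownDomains.has(domain)) {
    // Name-drops a client, but we've never talked to this domain:
    // treat it as a cold pitch, not a warm intro.
    return 'spam';
  }
  return null; // keep the model's original bucket
}
```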

What I rolled back, and what I kept

Before rollback

Auto-classify + auto-draft + auto-flag for human review. Drafts sat in Gmail Drafts under each sender, ready to send with one click.

After rollback

Auto-classify + a single 8am summary email with the day's queue. Drafts only for spam and vendor (low-risk buckets). Everything else, I write the reply.

The rollback wasn't dramatic. I just narrowed the scope. The agent now sends me one morning email at 8am with the inbox sorted into buckets, two-line summaries of each, and Claude's recommended next action. I write the actual replies. Spam and vendor pitches get handled automatically because the worst case is "we ignored a vendor."
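The digest itself is nothing clever, just a group-by over the classified queue. A minimal sketch of the formatter (the shape of the item objects is my assumption):

```javascript
// Group classified emails by bucket and render the 8am digest.
function buildDigest(items) {
  const byBucket = new Map();
  for (const item of items) {
    if (!byBucket.has(item.bucket)) byBucket.set(item.bucket, []);
    byBucket.get(item.bucket).push(item);
  }
  const lines = [];
  for (const [bucket, group] of byBucket) {
    lines.push(`${bucket.toUpperCase()} (${group.length})`);
    for (const { from, summary, nextAction } of group) {
      lines.push(`  - ${from}: ${summary} -> ${nextAction}`);
    }
  }
  return lines.join('\n');
}
```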

Triage time per morning dropped from about 25 minutes to 8. Replies still come from me, in my voice, with my judgment about which questions to deflect into a call.

The agent's job isn't to write the reply. It's to clear my eyes so I can.


The pattern I think generalizes

Every AI agent failure I've shipped (this is the third) had the same root cause. The model's confidence about a classification was high. My confidence that the classification mapped to the right action was way lower. The gap is what bit me.

The fix is not a better model. It's not even a better prompt. It's tighter scope. Use the model where being wrong is cheap. Don't use it where being wrong damages a relationship.

Inbox triage where the worst case is "ignored a cold pitch" is the right scope. Inbox auto-reply where the worst case is "lost a $20K client" is the wrong scope. The agent doesn't know the difference. You have to draw the line.

If you're thinking about wiring AI into client comms, read our notes on Claude API integration failures first. Same pattern, different surface area.

Anyway, that's the experiment. I'd rather pay myself eight minutes a morning than risk a relationship for an automation that saves me twenty.

Key takeaways
  • Classification accuracy was high (about 85% across 84 emails) but tone-matching wasn't, especially on sensitive cases like refunds.
  • The agent will answer questions you've deliberately avoided. It doesn't know your sales process.
  • Cold-pitch agencies that name-drop your real clients will fool a triage agent looking at one email at a time.
  • Narrow the scope to "summarize and sort." Reserve "draft and reply" for low-risk buckets only.
  • Total Claude API cost for two weeks was $14. The cost of one wrong refund auto-reply could be five figures.