A practical guide to AI & machine learning.
The gap between AI marketing and AI that works in production is wider than it looks. This is the guide we wish every founder read before signing an AI vendor contract: what ships reliably today, what's still a demo, and where we'd actually spend the budget.
What AI can actually do for a small business today
There's a fog of AI marketing right now that makes it genuinely hard to know what's production-ready versus what's a demo. As of 2026, a handful of AI workflows are reliable enough that a non-technical team can deploy them and trust the output. Many more are still research projects that happen to have pretty landing pages.
The ones that work well in production today:
- Classification and tagging. Sort customer emails into categories, tag support tickets by topic, route leads by intent. Works with > 95% accuracy on well-defined categories using a Claude or GPT API call.
- Text summarization. Summarize a long meeting transcript, a 40-page PDF, or a thread of customer complaints into a 3-bullet executive summary. Reliable, cheap.
- Structured extraction. Pull specific fields (amounts, dates, names, SKUs) from unstructured documents. More reliable than OCR + regex for messy data.
- Retrieval-augmented search (RAG). Search your company's documents, knowledge base, or product catalog in plain English. A well-built RAG system beats traditional keyword search for most business use cases.
- Content drafting. First drafts of emails, reports, social posts, product descriptions. A human always edits; the AI saves 70% of the typing.
- Personalized recommendations. Given a user's behavior, recommend the next product, next article, next action. Mature technology, now cheap enough that small stores can do it.
The ones that are not production-ready, despite the marketing:
- Unsupervised autonomous agents that take real-world actions (booking, purchasing, deploying code) without a human in the loop. Failure modes include irreversible mistakes, prompt injection, and infinite loops. They're useful with strong guardrails and human approval steps, not as full replacements.
- AI that reliably produces code for a production codebase without careful review. They're excellent at scaffolding and one-off scripts; much less reliable at architectural decisions or safety-critical logic.
- "AI will do your customer service" promises with no fallback to a human. LLMs hallucinate policies, prices, and return windows. The failure mode (confidently telling a customer something false) is worse than a slow response.
We're not anti-AI. We ship AI tools for clients and use them daily ourselves. We're against the pattern where a well-marketed tool burns six months and $80,000 on something that was never going to work reliably.
Build vs buy vs wrap: the three paths
Almost every AI project we scope lands in one of three buckets. Knowing which bucket you're in saves months of misdirection.
Buy
Use an existing SaaS tool: Intercom Fin for support, Jasper for marketing copy, Otter for meeting notes. This is correct when the problem is a commodity (everyone has it, nobody has a competitive moat from solving it themselves) and a tool already exists that does 80% of what you need. Usually the fastest path to value.
Wrap
Call a foundation model (Claude, OpenAI, Gemini) directly via API with your own prompts and business logic. This is correct when you need customization — specific tone, proprietary data, tight integration with your app's workflow — but you don't need to train a model. We do this for ~80% of the AI work we ship. Budget: $3–$25K for a focused tool.
Build
Fine-tune a model or train one from scratch on your data. This is correct when you have domain-specific data an off-the-shelf model doesn't know, and you have enough of it (tens of thousands of examples minimum) to actually move the needle. Rare. Usually the wrong answer for small businesses. Budget: $30K–$200K+, plus ongoing MLOps cost.
The "wrap" bucket has expanded dramatically in the last two years. Claude 4 and GPT-4 class models are now smart enough that a careful prompt often outperforms a fine-tuned version of a smaller model, and with less engineering overhead. Before you commit to training a model, ask: can a strong prompt on a frontier model do this?
Data readiness: the unglamorous step that determines success
"Garbage in, garbage out" is a cliché because it's true. The #1 reason AI projects fail in small businesses isn't the model — it's the data.
Before starting any AI build, we audit the client's data sources against four questions:
- Is it accessible? Is the data in a queryable system, or is it locked in PDFs, email threads, a partner's SaaS, or "that spreadsheet Karen maintains"?
- Is it accurate? How old is it? Is it deduplicated? Do fields mean what they claim to mean?
- Is there enough of it? For a RAG system, 500 well-structured documents is a good start. For a classifier, 1,000 labeled examples per category is the rough minimum. For training, tens of thousands.
- Is there a feedback loop? Can we capture whether the AI's output was correct so we can improve it? Without a feedback mechanism, you're shipping blind.
If the answer is "no" or "not yet" to any of these, the first sprint is data work, not AI work. This is where most of the unglamorous value lives. An AI project that starts with a week of data cleaning often ships on time. One that skips that step ships late or gets scrapped.
Evaluation: the step most AI projects skip
If you can't measure whether your AI is getting better, you're not building a system — you're hoping. Evaluation is the step that separates working AI from vibes-based AI.
The minimum evaluation setup we implement on every project:
- A golden dataset. 50–200 real inputs, paired with the ideal outputs. Hand-labeled by a human who understands the domain.
- Automated runs. A script that runs the AI against every input in the golden set and compares output against the expected answer (exact match, semantic similarity, or LLM-as-judge for subjective cases).
- A regression signal. Every time we change the prompt, switch models, or update the retrieval step, we re-run the eval and compare pass rates. A 5% drop is a blocker. A 3% improvement is a ship signal.
- Production monitoring. Log every real AI call, sample 1–5% for manual review, and capture user-reported failures to expand the golden set over time.
This work is boring. It's also what separates an AI feature that keeps working six months later from one that mysteriously degrades and nobody knows why.
Cost: the tradeoff everyone underestimates
AI API costs have come down 10–50× in the last two years, but they can still surprise you if you're not careful. Rough numbers as of 2026:
- A Claude 4 Sonnet call with ~2K input tokens and ~500 output tokens costs roughly $0.015. Seems tiny.
- At 1,000 calls per day (a modest internal tool for a team of 30), that's $15/day = $5,500/year.
- At 100,000 calls per day (a customer-facing feature on a popular site), that's $1,500/day = $550K/year.
- Switching to the cheaper Haiku tier for simple tasks cuts that 10×. Using prompt caching cuts repeated-context calls another 90%.
Two cost patterns matter in production:
- Right-sized model per task. Not every step needs the most expensive model. Use Haiku / GPT-4-mini for classification and extraction; reserve Sonnet / GPT-4 for reasoning-heavy tasks.
- Prompt caching. If you're calling the model with the same 4K system prompt on every request, cache it. Anthropic and OpenAI both offer prompt caching that cuts repeated tokens to < 10% of normal cost.
And obvious but often missed: rate-limit per-user on public-facing AI features. One person running a script against your free trial endpoint has generated $400/hour bills for more than one client we've talked to.
Failure modes: what actually breaks in production
Six failure patterns we've seen across AI deployments. Worth designing defensively for each:
Hallucination
The model confidently states something false. For RAG systems, the fix is usually grounding: require the model to cite source passages, and reject outputs that can't be verified. For classification, it's validation: if the model returns a category that doesn't exist in your taxonomy, fall back to "uncategorized" rather than accepting a made-up class.
Prompt injection
A user-supplied input tells the model to ignore its system prompt. Never concatenate untrusted input directly into a prompt as if it were an instruction. Use clear separators, consider a dedicated "safety" model pass, and never give an AI tool permissions it doesn't strictly need.
Cost runaway
A loop that calls the API until it gets "a good response" burns through your month's budget in an hour. Always set max-token limits, max-retry counts, and hard per-user rate limits.
Latency spikes
Foundation model APIs occasionally spike to 10–30s response times. If your UX depends on a 2-second response, you need streaming, timeouts with fallbacks, and possibly a cheaper model as a degraded-mode backup.
Silent quality drift
A provider quietly updates the model and your output quality changes. Pin to specific model versions (e.g., claude-opus-4-7-20260101, not claude-opus-latest) and re-run your eval suite before adopting a new version.
Data leakage
Sensitive customer data flows into a third-party API. Read the provider's data retention and training policy, use zero-retention tiers if available, and redact obvious PII before sending. For regulated industries, prefer providers with signed BAAs (HIPAA) or EU data residency.
How we approach AI work at PIXIPACE
Our typical AI engagement is a 2–6 week sprint that looks like this:
- Week 1 — Discovery & data audit. We scope the exact problem (not "let's add AI" — something specific like "classify inbound support emails into 6 categories with 95% accuracy"), audit your data, and decide buy vs wrap vs build.
- Week 2 — Prompt design & golden set. We write the first prompt, you help us label 50–100 golden examples, and we establish a baseline.
- Weeks 3–4 — Build & iterate. We wire the AI call into your workflow (Slack, email, internal dashboard, your app), add guardrails and retries, build the admin view for monitoring, and iterate on the prompt against the eval set.
- Weeks 5–6 — Production & handover. Deploy with rate limiting, cost alerts, and observability. Train your team on how to review outputs, extend the golden set, and rotate API keys. Document the failure modes.
Our stack is opinionated: Anthropic Claude or OpenAI API for the model layer; simple Node or Python wrappers for the API; n8n for workflow automation when it fits; custom code when it doesn't; PostgreSQL or a vector database (pgvector, Pinecone) for RAG; Vercel or Firebase for hosting. No ML frameworks we don't need. No training from scratch when a prompt works.
We're honest about what's ready and what isn't. If you come to us with an AI idea that we think will fail, we'll tell you, and usually suggest a simpler non-AI alternative. Saying "no, this isn't the right tool for this problem" has saved more of our clients money than any other single piece of advice we give.
§ Thinking about building one of these?
Tell us what you're working on — we'll reply within 24 hours.
30-minute intro call, written proposal within 72 hours. No sales theatre.