I built a bedtime story app for my daughter using AI agents for all the engineering work. She is 4, and the story says her name out loud. My first version had 11 agents organized like a human team. It worked worse because agents are not employees.
What 11 Agents Looked Like (and Why It Was a Mess)
My first org chart had separate web, mobile, and automation engineers; separate product, QA, and UX roles; and extra strategy layers on top.
I had imported a bunch of human-team assumptions into a system that was not human. CTOs and CEOs are different people. Mobile and web are separate specialties. Those are good defaults when your team needs meetings, handoffs, and sleep.
A human-looking org chart felt more legible to me. It was also the wrong abstraction. Agents are not humans with better uptime. They have no meetings, no handoffs, no sleep, and no reason to inherit our org chart anxieties.
My product-head can read the entire codebase as fluently as the engineer. I can run three copies in parallel and nobody gets territorial. Once I saw that, the 11-agent version stopped looking clever and started looking like overhead I had invented for myself.
Context loaded as if they were human. I was front-loading everything into each agent's prompt -- the way you brief a human employee on their first day. My engineer prompt hit 400 lines. The real mistake was treating agent context like human memory: permanent, always-on, there whenever I needed it.
Duplicated knowledge and overlapping mandates. The specialized agents all needed overlapping patterns: Supabase, auth flows, API contracts, testing expectations. When I updated one prompt, I had to remember to update the others. Meanwhile the strategy, product, QA, and UX agents could all give slightly different recommendations on the same feature because they were operating from slightly different slices of context.
Context drift and coordination costs. Some agents got used daily. Others went weeks. The rarely-used ones operated on outdated assumptions when I finally invoked them -- not unlike human teams, except a human absorbs context from Slack. Agents only know what you tell them. And every handoff required me to summarize context, which is the agent equivalent of meeting overhead.
Before: 11 agents. After: 6, with most of the specialization moved into skills instead of standalone roles.
The Fix: Agents Define WHO, Skills Define WHAT
The principle: agent prompts define identity. Skills define domain knowledge.
The engineer prompt is about 100 lines. Identity, workflow, universal standards. Nothing platform-specific. When it works in the web app directory, it loads the relevant web skill. When it works in the mobile directory, it loads the mobile one instead. Mechanically, a skill is just a markdown file that gets concatenated into the system prompt on demand -- no RAG, no vector database, just the right context appended at the right time. The prompt stays short. The relevant context loads on demand.
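A minimal sketch of that loading mechanism, assuming a directory-to-skill mapping; the skill names, paths, and prompt shape here are illustrative assumptions, not the actual implementation:

```typescript
// Hypothetical skill loader: pick a markdown skill file based on the
// working directory and concatenate it onto the base identity prompt.
const BASE_PROMPT = "You are a senior full-stack TypeScript engineer. ..."; // ~100 lines in practice

// Assumed mapping from repo areas to skill files.
const SKILL_BY_DIR: Record<string, string> = {
  "apps/web": "skills/web.md",
  "apps/mobile": "skills/mobile.md",
  "automation": "skills/automation.md",
};

function pickSkill(cwd: string): string | undefined {
  const dir = Object.keys(SKILL_BY_DIR).find((d) => cwd.includes(d));
  return dir ? SKILL_BY_DIR[dir] : undefined;
}

// A "skill" is just markdown appended to the system prompt on demand --
// no RAG, no vector database.
function assemblePrompt(cwd: string, readSkill: (path: string) => string): string {
  const skillPath = pickSkill(cwd);
  return skillPath ? `${BASE_PROMPT}\n\n${readSkill(skillPath)}` : BASE_PROMPT;
}
```

The key property is that the base prompt never grows: platform knowledge lives in the skill files, and only one of them is in context at a time.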
What identity is really doing here is giving the model a steady default. When the engineer hits something no skill covers, it falls back on "I am a senior full-stack TypeScript engineer who values readability, tests edge cases, and commits frequently."
Here is a deliberately boring task -- "write a function that returns the monthly price for a subscription plan" -- given to the same model at the same temperature, first with no identity and then with the engineer identity:
Vanilla prompt (no identity):
function getMonthlyPrice(plan: string): number {
  const prices: { [key: string]: number } = {
    free: 0,
    starter: 9,
    pro: 29,
  };
  return prices[plan.toLowerCase()] || 0;
}
Engineer identity (~60 words of system prompt):
type Plan = 'free' | 'starter' | 'pro';

const MONTHLY_PRICES: Record<Plan, number> = {
  free: 0,
  starter: 9,
  pro: 29,
};

function getMonthlyPrice(plan: Plan): number {
  return MONTHLY_PRICES[plan];
}
The vanilla version works. It would pass a basic test. But it accepts any string as a plan and silently treats unknown input as free. The engineer version makes valid plans explicit with a union type and uses Record<Plan, number> so the compiler enforces that every supported plan has a price. Identity changes the model's default standard for what "done" looks like.
I collapse roles until collapsing further makes the default behavior worse. Merging web-engineer, mobile-engineer, and automation-engineer improved things because their differences were mostly domain knowledge, which skills can load on demand. Merging engineer and code-reviewer would remove a useful boundary. The separate reviewer is not about pretending one model instance has a different soul. It is about forcing a second pass with a different prompt, different incentives, and a clean workflow transition from "produce code" to "look for risk."
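That engineer/reviewer boundary can be sketched as two passes over the same model with different system prompts. `complete` stands in for whatever model call you use, and both prompts are illustrative assumptions:

```typescript
// One model, two passes, two identities: produce code, then hunt for risk.
type Complete = (systemPrompt: string, input: string) => string;

const ENGINEER_PROMPT = "You are the engineer. Produce working code.";
const REVIEWER_PROMPT = "You are the code-reviewer. Look for risk, not style.";

function implementThenReview(task: string, complete: Complete) {
  // Pass 1: produce code under the engineer identity.
  const code = complete(ENGINEER_PROMPT, task);
  // Pass 2: a clean transition to a prompt whose only job is to find problems.
  const review = complete(REVIEWER_PROMPT, `Review this change:\n${code}`);
  return { code, review };
}
```

The point is not the plumbing; it is that the second pass never sees "finish the feature" as its goal, only "find what is wrong with it."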
The extra product-specialist roles stopped being standalone agents and became skills the product-head can load when needed. 14 skills total. Each agent prompt is 80-100 lines of identity, and each skill loads only when needed.
The Org Chart
The closest analogy is not employees. It is configured processes -- a small set of stable execution profiles around the same base model: implement, review, product-shape, research. I am defining repeatable setups for certain kinds of jobs, each with its own defaults, tools, and success criteria.
Founder (me)
├── research-scout
├── strategic-advisor
├── product-head
├── engineer
│   └── code-reviewer
└── growth-lead
Every box except "Founder" is a Claude agent. Six total, with most specialization moved into skills instead of standalone roles.
When I say "build the story arc feature," the product-head shapes the request, the engineer implements it, and the code-reviewer reviews before merge.
That is also why I did not collapse all the way to zero named agents. Zero agents would mean one generic entrypoint that has to decide, on every task, whether it is planning, coding, reviewing, or researching. A few named agents give me cleaner interfaces, better routing, and more predictable behavior.
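A toy sketch of why a few named agents make routing predictable: each kind of task maps to one stable entrypoint. The keyword heuristics below are invented for illustration; the real routing is done by the orchestrating model, not a regex:

```typescript
// Hypothetical task router over the named agents from the org chart.
type AgentName =
  | "research-scout"
  | "product-head"
  | "engineer"
  | "code-reviewer"
  | "strategic-advisor";

function route(task: string): AgentName {
  const t = task.toLowerCase();
  if (/latest|current|what's new/.test(t)) return "research-scout"; // needs fresh info
  if (/review|risk|audit/.test(t)) return "code-reviewer";
  if (/implement|build|fix|refactor/.test(t)) return "engineer";
  if (/shape|spec|scope/.test(t)) return "product-head";
  return "strategic-advisor"; // default: step back and plan
}
```

With zero named agents, every one of these decisions would be re-litigated inside a single generic prompt on every task.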
The interesting roles are the ones that only make sense in an AI org chart. engineer handles web, mobile, automation, and tests. One agent plus the right skills worked better than pretending I needed separate specialists. code-reviewer is separate mainly to make the workflow legible: implementation is over, review has started. research-scout exists because models have a knowledge cutoff; it scans current sources so the rest of the system is not reasoning from stale assumptions.
When a task is split too finely, it starts bouncing between agents, each one adds a reasonable opinion, and suddenly I am paying the agent equivalent of meeting debt. That is a big part of how I learned I did not need more roles. I needed fewer handoffs.
What Breaks
Training-Time Heuristics That Will Not Die
I will instruct an agent to never use a phrasing, add it to the blocklist, add it to memory, and three weeks later it appears in a new template. The correction is in the prompt. The model read it. And then the training-time prior pulls it right back.
What works is a multi-layer heuristic override: prompt, memory, validation rule, and exemplar template. Each layer catches what the others miss. For my use case, this works better than fine-tuning. Fine-tuning needs training data and model version management, and it can degrade other tasks. The multi-layer override works on the base model, is immediately editable, and each layer is independently testable.
Prompt: Identity and explicit instructions establish the intended behavior.
Memory: A persistent markdown file read at session start keeps learned corrections available.
Validation: Deterministic scripts (regex, word-frequency, structural checks) catch regressions before output ships.
Exemplar: Reference outputs show what good looks like when the model drifts.
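The validation layer is the only fully deterministic one, so it is worth a concrete sketch. The banned phrasings and the overuse threshold here are made-up examples, not the actual blocklist:

```typescript
// Hypothetical output validator: regex blocklist plus a crude
// word-frequency check, run on story text before it ships.
const BANNED_PHRASES = [/once upon a time/i, /little did (she|he|they) know/i];

function validateStory(text: string): string[] {
  const problems: string[] = [];

  // Layer 1: phrasings the model keeps reaching for despite instructions.
  for (const pattern of BANNED_PHRASES) {
    if (pattern.test(text)) problems.push(`banned phrasing: ${pattern}`);
  }

  // Layer 2: flag any non-trivial word repeated suspiciously often.
  const counts = new Map<string, number>();
  for (const word of text.toLowerCase().match(/[a-z']+/g) ?? []) {
    counts.set(word, (counts.get(word) ?? 0) + 1);
  }
  counts.forEach((n, word) => {
    if (word.length > 4 && n > 10) problems.push(`overused word: "${word}" (${n}x)`);
  });

  return problems;
}
```

Because the check is a plain script rather than another model call, it never drifts: a phrasing that sneaks past the prompt and memory layers still gets caught here.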
Context degradation is the other thing I have not solved. In longer sessions, the agent's grip on its own instructions loosens. Rules it followed perfectly an hour ago start slipping. I have learned to distrust long runs even when the early output was clean.
The agents do not make more bugs than the human teams I have been on. But the bugs are different. On a human team, bugs cluster around the hard parts -- the tricky state management, the race condition nobody thought about. Agent bugs are all over the place, and the speed compounds them. A human writes one module with a subtle issue. An agent writes five modules in the same time, and three of them have subtle issues in places you would not think to check. The velocity is real, but so is the review surface.
The founder version of this story — what it is actually like to build with an AI team — is here.
You can try Endless Storytime or find me on LinkedIn.