I built a bedtime story app for my daughter. She is 4, sensitive and aware, and bedtime has never been the easy part of the day. In the app, she is the hero. Her rabbit lovey - the one she has had since she was 6 months old - comes along for the adventure. The story knows their names. It remembers that two weeks ago they helped a little star find its way home. Tonight, it brings that memory back while she explores a tiny world. Bedtime has gotten easier because the story engages her, then settles her.
There is an intentional system behind it - templates, memory, a chunked audio pipeline, developmental review, privacy architecture - and it is bigger than most people would guess.
And every line of TypeScript, every test, every migration, and every API route was written by AI agents.
It started as an experiment: what happens if the human sets the direction and the quality bar, but never touches the code? It stayed because it worked.
The current setup is six core agents, most with skills, and it costs about $200 a month in API calls. The other part of the cost is more important: I was working 10 to 11 hours a day for a few weeks while my husband picked up everything else with our daughter.
$200 monthly agent cost · $0.0006 per-story cost · 6 core agents · 10-11 hours a day at peak
The practical rhythm is less futuristic than it sounds. I talk to the strategy and product agents to develop the roadmap, plan my involvement based on the tasks, queue up the work in multiple terminals, review and redirect when needed, and test. The "test" part takes the longest.
I wrote a separate post about how the agent system is structured - the org chart, the routing, and why I went from 11 agents to 6.
The Most Important Product Bet
The obvious version of this product is: prompt the model, get a fresh story. I did not do that.
For a children's product, fully fresh generation makes the wrong trade-offs. It is slower, harder to review, harder to keep developmentally consistent, and much harder to trust at bedtime. It is also the wrong trade-off for any creative product: compile-time templates let you amortize the cost of multi-stage agent pipelines and human feedback. The template-first system looks like this:
Template (compile-time) → Tonight's Story (runtime rewrite)
The model gets freedom where it helps and constraints where it matters.
The rewrite layer is where the flexibility comes back. It pulls in the parent's connection-card context, memories from previous adventures, and any shared specifics of that night, then rewrites lightly enough that the story still feels natural.
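The split between vetted template and runtime rewrite can be sketched as a compiled template with named slots plus a fill step. Everything here - `StoryTemplate`, `renderStory`, the `{slot}` syntax - is a hypothetical illustration of the idea, not the app's actual schema:

```typescript
// Hypothetical shape of a compiled story template: fixed narrative bones
// with named slots the runtime rewrite layer is allowed to fill.
interface StoryTemplate {
  id: string;
  arc: string[];    // reviewed, developmentally vetted beats
  slots: string[];  // e.g. ["childName", "loveyName", "memory"]
}

interface RewriteContext {
  childName: string;
  loveyName: string;
  memory?: string;  // an atom pulled from a previous adventure
}

// Runtime fill: inject tonight's context, leaving the vetted structure untouched.
function renderStory(template: StoryTemplate, ctx: RewriteContext): string {
  const values: Record<string, string> = {
    childName: ctx.childName,
    loveyName: ctx.loveyName,
    memory: ctx.memory ?? "",
  };
  return template.arc
    .map(beat => beat.replace(/\{(\w+)\}/g, (_, slot) => values[slot] ?? ""))
    .join(" ");
}
```

In the real system the rewrite is done by a model rather than plain string substitution, which is what lets the result still feel natural; the constraint is the same either way: the arc is fixed at compile time.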
What happened to my role was stranger.
My Technical Role
There is a recurring negotiation with my agent team:
Me: I need a dashboard tab that shows cost breakdown per story generation.

Agent: For a single-employee company, a multi-tab admin dashboard represents significant implementation surface. A direct database query would provide the same information.

Me: Build the dashboard tab.

Agent: I want to flag that this introduces meaningful maintenance overhead for a one-person team. Would you like me to outline the trade-offs before proceeding?
The agent sees one employee and thinks: less tooling. I see one employee and think: no one to delegate the checking to. The dashboard is not overhead. It is how a single person stays across a system this size without living in a terminal.
The bottleneck is not development capacity. It is my ability to review and make product decisions. I can feel how human my own bandwidth is. I am not a machine, and I do not need to pretend I will beat one at being a machine. Nobody tries to out-compile a compiler (anymore).
What I cannot delegate is sitting beside my daughter at bedtime and noticing when a story moves too fast, lands emotionally wrong, or somehow misses the point. Beyond privacy and security, which always get extra human eyes, the quality bar for a children's product is not "did the code compile?" It is "would I want my daughter to experience this tonight?" That part is still mine.
Here is a real example. The app pronounces the child's name out loud in every story. I built a robust name pronunciation system because I saw a parent's face fall when the earliest tests mispronounced their kid's name. The agent pushed back more than once - different approaches kept failing on the tough cases, and it argued that the effort was not justified when 90% of names would work fine. I kept pushing.
Later, we wrote a test script for the audio output with test data. The pronunciation hardening works at the data layer, and the test script bypassed it. My daughter heard that night's story and looked up: "Why did she say my name like that?" It had pronounced her name AY-luh instead of EYE-luh. She noticed instantly.
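A minimal sketch of what data-layer pronunciation hardening can look like, and why a path that bypasses it breaks. The override table, the example names, and the function names are all hypothetical, not the app's real data:

```typescript
// Hypothetical data-layer override: names with tricky pronunciations are
// mapped to a phonetic spelling before any text reaches the TTS engine.
const pronunciationOverrides: Record<string, string> = {
  // Illustrative entries; a real table would be parent-confirmed.
  Aoife: "EE-fuh",
  Siobhan: "shiv-AWN",
};

function applyPronunciation(name: string): string {
  return pronunciationOverrides[name] ?? name;
}

// The lesson from the bug: every path that produces audio must pass through
// this layer. A test script that feeds raw text to TTS silently bypasses it.
function prepareForTts(storyText: string, childName: string): string {
  return storyText.split(childName).join(applyPronunciation(childName));
}
```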
The dashboard is where I spend most of my time now - cost breakdowns, failure traces, the roadmap the agents update as they work.
The technical work got increasingly abstract, and that took adjustment: learning where models collapse toward sameness, how memory systems become magical instead of gimmicky, how to override training-time habits without fine-tuning, and how to tell when a result is structurally correct but emotionally wrong.
Not writing code also keeps me less emotionally attached to the system. We have pivoted several times, and I have never had the "but I spent two weeks on that module" feeling.
And when I say "pivoted," I mean real pivots. We built interactive mini-apps to teach coping techniques visually - axed them entirely in favor of a more cohesive character set. We built two lead magnets that failed to provide the kind of value parents actually expected - axed. We built a symmetrical audio chunking approach that worked but created way too much upfront latency. Unacceptable for bedtime. So we rebuilt it as a staggered pipeline and had to touch 100+ templates to accommodate the change. At my day job, I know we would have shipped at least two of those. We had already invested too much. Here, the code was not mine, so killing it felt like editing a draft instead of demolishing a building.
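The symmetrical-versus-staggered difference can be sketched as a chunk planner: a short head chunk keeps time-to-first-audio low, and synthesis of the later, larger chunks can overlap playback of the earlier ones. The function name and chunk sizes are assumptions, not the app's actual pipeline:

```typescript
// Plan chunk boundaries for TTS: a small head chunk so playback can start
// almost immediately, then larger tail chunks to keep the call count down.
// Synthesis of chunk i+1 happens while chunk i is playing.
function planChunks(
  sentences: string[],
  firstChunkSentences = 2, // small head chunk -> low time-to-first-audio
  restChunkSentences = 6,  // larger tail chunks -> fewer synthesis calls
): string[] {
  const chunks: string[] = [];
  let i = 0;
  let size = firstChunkSentences;
  while (i < sentences.length) {
    chunks.push(sentences.slice(i, i + size).join(" "));
    i += size;
    size = restChunkSentences;
  }
  return chunks;
}
```

A symmetrical planner would split the story into equal chunks and synthesize them all before playback; the staggered shape trades a slightly uneven structure for a bedtime-acceptable start time.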
Testing is similar. Agents happily produce respectable coverage around the obvious path, which is not the same thing as knowing which tests matter. The real work is pushing them toward ugly edge cases and not mistaking a green suite for proof that the system is safe.
The Creativity Problem
AI agents default to the middle of the road. Not bad. Just average. In a children's product, average becomes a problem fast.
I found "warm" used 21 times in one story template. The phrase "had been there all along" showed up in five others. Three story arcs came back with different nouns and the same skeleton. The model gravitates toward the most probable output, and the most probable output is the average of its training data. For code, that is usually fine. For creative work, it is aggressively mediocre.
The fixes are specific: frequency validators, blocklists, a 10-stage isolation pipeline where each arc is designed by a separate agent with no visibility into the others. If you want agents to be creative, you have to build creativity into the constraints. Left alone, they give you polished average. This is also part of why I chose the template-first architecture. It keeps a human in the loop on the creative work, because the model will not flag its own mediocrity.
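A frequency validator of this kind can be very small. This is a hedged sketch; the blocklist and threshold are illustrative, not the real ones:

```typescript
// Flag templates that over-use words the model gravitates toward.
const overusedWords = ["warm", "gentle", "cozy"]; // assumed blocklist
const maxOccurrences = 3;                          // assumed threshold

// Returns a map of blocklisted words that exceed the threshold.
function findOverusedWords(text: string): Record<string, number> {
  const counts: Record<string, number> = {};
  const tokens = text.toLowerCase().match(/[a-z']+/g) ?? [];
  for (const word of overusedWords) {
    const n = tokens.filter(t => t === word).length;
    if (n > maxOccurrences) counts[word] = n;
  }
  return counts;
}
```

Run against every template in CI, a check like this catches the "warm used 21 times" failure mode before a human reviewer ever has to.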
The memory system extracts 7 types of narrative facts from each story - brave moments, discoveries, companion interactions, favorite places. Future stories inject up to 3 atoms, selected by a scoring algorithm that balances recency, diversity, freshness, and thematic relevance. The scoring is what makes it feel magical instead of gimmicky. On night 14, your child hears a reference to night 3, and she sits up because she remembers. That only works if the system chose the right memory for the right moment.
Night 3: a story plays, and the system extracts memory atoms. The scoring algorithm selects 3 of the 7. Night 14: the new story weaves in those 3 memories, and she sits up because she remembers.
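The selection step above can be sketched as a scoring pass: rate each stored atom on recency, freshness (how recently it was reused), and thematic relevance, then enforce diversity by taking at most one atom per type. The fields, weights, and function names here are assumptions, not the app's actual algorithm:

```typescript
// Hypothetical memory atom, one narrative fact extracted from a past story.
interface MemoryAtom {
  id: string;
  type: string;           // e.g. "brave-moment", "discovery"
  theme: string;
  nightRecorded: number;
  lastUsedNight?: number; // undefined = never reused
}

function selectMemories(
  atoms: MemoryAtom[],
  currentNight: number,
  storyTheme: string,
  limit = 3,
): MemoryAtom[] {
  const scored = atoms.map(a => {
    const recency = 1 / (1 + (currentNight - a.nightRecorded));   // newer is better
    const freshness = a.lastUsedNight === undefined
      ? 1
      : Math.min(1, (currentNight - a.lastUsedNight) / 7);        // avoid repeats
    const relevance = a.theme === storyTheme ? 1 : 0;
    return { atom: a, score: 0.3 * recency + 0.4 * freshness + 0.3 * relevance };
  });
  scored.sort((x, y) => y.score - x.score);
  // Diversity: at most one atom per type in the final pick.
  const picked: MemoryAtom[] = [];
  const seenTypes = new Set<string>();
  for (const { atom } of scored) {
    if (seenTypes.has(atom.type)) continue;
    picked.push(atom);
    seenTypes.add(atom.type);
    if (picked.length === limit) break;
  }
  return picked;
}
```

The freshness term is what keeps the same memory from reappearing night after night; the diversity pass keeps three brave moments from crowding out a discovery.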
What I'd Tell Someone Trying This
Generate where variation is a feature. Template where it isn't.
For this product, templates for the bones plus a light rewrite layer beat fully fresh generation.
Treat agent-written code as disposable on purpose.
When a module stops serving the product, kill it. The reduced sunk-cost pull is one of the best parts of this setup.
Agents are not employees.
They are stateless, parallelizable, context-limited, pattern-following, and very good at clear tasks. The more I treated them like tools with specific strengths, the better things got.
Be careful about copying the whole setup.
This works for me because I can review code, I have strong product opinions, and I have a very real nightly user test with my daughter. I am just telling the truth about what happened when I tried it.
The app ships bedtime stories to a 4-year-old who is sensitive and aware and for whom bedtime has always been a little hard. She does not know or care that an AI agent wrote the code that generates the story where she rides a cloud whale to the moon.
She just knows the whale said her name.
You can try Endless Storytime or find me on LinkedIn.