Build an AI Agent That Learns From Its Mistakes

By Pritika Mehta | YC S20, Applied AI @ Anthropic

Every AI startup I've worked with builds an agent that nails the demo. Then they ship it and it keeps screwing up the same way. Calling cold leads hot. Missing obvious red flags. Giving advice so generic it could apply to anyone. The founder tweaks the prompt, fixes one thing, but breaks another. No idea if the agent is actually getting better. Sound familiar?

I built a system that fixes this. It runs on Claude's API, takes about 5 minutes, and costs about 50 cents.

I am using customer discovery as the example because all YC founders are doing it right now, but the pattern works for whatever you're building.

This works for your agent, not just mine

Before I walk you through the steps, here's why it matters:

If you're building...	Your "messy input" is...	Your eval rubric grades...
Support agent	Support tickets	Resolution accuracy, tone, escalation decisions
Sales research agent	Prospect data	Personalization quality, research depth
Code review agent	Pull requests	Bug detection accuracy, false positive rate
Compliance agent	Legal documents	Extraction accuracy, risk identification
Onboarding agent	User profiles	Personalization quality, flow completeness
Content agent	Briefs and guidelines	Brand voice accuracy, factual correctness

The feedback loop is identical across all of these use cases. Swap in your data, define your rubric, and run the loop.

The pattern

Run agent → Grade outputs → Find failure patterns → Improve prompt → Rerun → Measure

Six steps. Let me walk through each one.
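
Here is a minimal sketch of one pass through that loop in Python. The callables `run_agent`, `grade`, and `improve_prompt`, and the grade shape (a dict with a numeric `score` and a list of `failures` labels), are placeholders assumed for illustration. They are not a fixed API; plug in whatever your agent and grader actually look like.

```python
from collections import Counter
from statistics import mean


def run_feedback_loop(notes, prompt, run_agent, grade, improve_prompt):
    """Run one pass of the loop and return (new_prompt, before, after, failures)."""

    def evaluate(current_prompt):
        # Run the agent on every input, then grade each output.
        outputs = [run_agent(current_prompt, note) for note in notes]
        return [grade(note, out) for note, out in zip(notes, outputs)]

    # Steps 1-2: run and grade with the current prompt.
    grades = evaluate(prompt)
    before = mean(g["score"] for g in grades)

    # Step 3: count how often each failure label appears.
    failures = Counter(label for g in grades for label in g["failures"])

    # Step 4: revise the prompt using the most common failures.
    new_prompt = improve_prompt(prompt, failures.most_common(3))

    # Steps 5-6: rerun with the new prompt and measure the change.
    after = mean(g["score"] for g in evaluate(new_prompt))
    return new_prompt, before, after, failures
```

Comparing `before` and `after` on the same set of inputs is what tells you whether a prompt change helped overall or just fixed one case while breaking another.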

Step 1: Run the agent

The agent takes messy customer notes, the kind you actually have in your Notion right now:

"call with mike from healthtech. SUPER enthusiastic. 'this is exactly
what we need!' BUT said 'let me socialize this internally.' They just
signed a 2-year contract with a competitor. Budget not available until
January..."

One Claude API call turns this mess into something you can actually act on: lead quality, pain points, what to build, and what NOT to build.

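import anthropic

# Reads the API key from the ANTHROPIC_API_KEY environment variable.
client = anthropic.Anthropic()

# AGENT_SYSTEM_PROMPT and note are defined elsewhere: the agent's instructions
# and one raw customer note.
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1500,
    temperature=0,  # keep output as consistent as possible across reruns
    system=AGENT_SYSTEM_PROMPT,
    messages=[
        {"role": "user", "content": f"Analyze this note:\n\n{note}"}
    ]
)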
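
To grade outputs later, it helps to turn the reply into structured data. The sketch below assumes the system prompt asks for a single JSON object with the keys `lead_quality`, `pain_points`, `build`, and `do_not_build`. Those key names and the `NoteAnalysis` class are hypothetical, chosen to mirror the four outputs listed above. The reply text itself comes from `response.content[0].text`.

```python
import json
from dataclasses import dataclass, field


@dataclass
class NoteAnalysis:
    lead_quality: str  # e.g. "hot", "warm", "cold" (assumed labels)
    pain_points: list = field(default_factory=list)
    build: list = field(default_factory=list)
    do_not_build: list = field(default_factory=list)


def parse_analysis(text):
    """Parse the agent's reply, assuming it contains a single JSON object."""
    # Models sometimes wrap JSON in prose, so keep only the outermost braces.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in agent reply")
    data = json.loads(text[start:end + 1])
    return NoteAnalysis(
        lead_quality=data["lead_quality"],
        pain_points=data.get("pain_points", []),
        build=data.get("build", []),
        do_not_build=data.get("do_not_build", []),
    )
```

Usage would look like `analysis = parse_analysis(response.content[0].text)`. A reply with no JSON object raises `ValueError`, and one missing `lead_quality` raises `KeyError`, so a malformed output shows up as a failure instead of slipping through silently.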