By Pritika Mehta | YC S20, Applied AI @ Anthropic


Every AI startup I've worked with builds an agent that nails the demo. Then they ship it and it keeps screwing up the same way. Calling cold leads hot. Missing obvious red flags. Giving advice so generic it could apply to anyone. The founder tweaks the prompt, fixes one thing, but breaks another. No idea if the agent is actually getting better. Sound familiar?

I built a system that fixes this. It runs on Claude's API, takes about 5 minutes, and costs about 50 cents.

I am using customer discovery as the example because all YC founders are doing it right now, but the pattern works for whatever you're building.


This works for your agent, not just mine

Before I walk you through the steps, here's why it matters:

If you're building...  | Your "messy input" is... | Your eval rubric grades...
Support agent          | Support tickets          | Resolution accuracy, tone, escalation decisions
Sales research agent   | Prospect data            | Personalization quality, research depth
Code review agent      | Pull requests            | Bug detection accuracy, false positive rate
Compliance agent       | Legal documents          | Extraction accuracy, risk identification
Onboarding agent       | User profiles            | Personalization quality, flow completeness
Content agent          | Briefs and guidelines    | Brand voice accuracy, factual correctness

The feedback loop is identical for every one of these. Swap in your data, define your rubric, run the loop.
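To make "define your rubric" concrete, here is a minimal sketch of a rubric written as plain data. It assumes each criterion gets a 1-5 score and that you want a single weighted number per output. The names Criterion, SUPPORT_RUBRIC, and weighted_score, and the weights themselves, are made up for illustration and are not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    description: str
    weight: float  # relative importance (hypothetical values)


# Hypothetical rubric for the support-agent row above.
SUPPORT_RUBRIC = [
    Criterion("resolution_accuracy", "Did the reply actually solve the ticket?", 0.5),
    Criterion("tone", "Is the reply polite and on-brand?", 0.2),
    Criterion("escalation", "Was the escalate / don't-escalate call right?", 0.3),
]


def weighted_score(rubric, scores):
    """Combine per-criterion scores (1-5) into one weighted average."""
    missing = [c.name for c in rubric if c.name not in scores]
    if missing:
        raise ValueError(f"missing scores for: {missing}")
    total_weight = sum(c.weight for c in rubric)
    return sum(c.weight * scores[c.name] for c in rubric) / total_weight
```

Swapping use cases then means swapping the list of criteria. The scoring function stays the same.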


The pattern

Run agent → Grade outputs → Find failure patterns → Improve prompt → Rerun → Measure

Six steps. Let me walk through each one.
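As a rough preview, the whole loop fits in one function. This is a sketch under assumptions, not the exact implementation: run_agent, grade, and improve_prompt are hypothetical callables you would supply, and each grade is assumed to be a dict with a numeric "score" and a list of "failures" tags.

```python
from collections import Counter


def run_loop(notes, prompt, run_agent, grade, improve_prompt, rounds=2):
    """Run -> grade -> find failure patterns -> improve prompt -> rerun -> measure.

    Returns a list of (prompt, average_score) pairs, one per round.
    """
    if not notes:
        raise ValueError("need at least one note to evaluate")

    history = []
    for _ in range(rounds):
        # 1. Run the agent on every input with the current prompt.
        outputs = [run_agent(prompt, note) for note in notes]
        # 2. Grade each output (assumed shape: {"score": ..., "failures": [...]}).
        grades = [grade(note, out) for note, out in zip(notes, outputs)]
        # 3. Count failure tags to see which mistakes repeat.
        failures = Counter(tag for g in grades for tag in g["failures"])
        # 6. Measure: record the average score for this prompt version.
        average = sum(g["score"] for g in grades) / len(grades)
        history.append((prompt, average))
        if not failures:
            break
        # 4. Improve the prompt using the most common failures.
        # 5. The next loop iteration is the rerun.
        prompt = improve_prompt(prompt, failures.most_common(3))
    return history
```

Passing the three callables in keeps the loop independent of any one agent, so the same function works for whatever you're building.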


Step 1: Run the agent

The agent takes messy customer notes, the kind you actually have in your Notion right now:

"call with mike from healthtech. SUPER enthusiastic. 'this is exactly
what we need!' BUT said 'let me socialize this internally.' They just
signed a 2-year contract with a competitor. Budget not available until
January..."

One Claude API call turns this mess into something you can actually act on: lead quality, pain points, what to build, and what NOT to build.

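import anthropic

# Reads ANTHROPIC_API_KEY from the environment.
client = anthropic.Anthropic()

# AGENT_SYSTEM_PROMPT and note (one raw customer note) are assumed to be defined earlier.
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1500,
    temperature=0,  # keep runs as repeatable as possible so reruns are comparable
    system=AGENT_SYSTEM_PROMPT,
    messages=[
        {"role": "user", "content": f"Analyze this note:\n\n{note}"}
    ]
)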
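The call returns a response object, not the analysis itself. Here is a minimal sketch of pulling the text out and turning it into something you can grade. It assumes the system prompt asks Claude to reply with a JSON object. The key names (lead_quality, pain_points, what_to_build, what_not_to_build) are hypothetical and simply mirror the four outputs described above.

```python
import json

# Hypothetical keys; they only work if your system prompt asks for them.
REQUIRED_KEYS = ("lead_quality", "pain_points", "what_to_build", "what_not_to_build")


def extract_text(response):
    """Join the text blocks of a Messages API response into one string."""
    return "".join(block.text for block in response.content if block.type == "text")


def parse_analysis(text):
    """Parse the agent's reply, tolerating extra prose around the JSON object."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in agent output")
    analysis = json.loads(text[start:end + 1])
    missing = [key for key in REQUIRED_KEYS if key not in analysis]
    if missing:
        raise ValueError(f"agent output is missing keys: {missing}")
    return analysis
```

With these in place, parse_analysis(extract_text(response)) gives you a dict for each note. Failing loudly on a missing key makes a malformed output show up as an error instead of a silently bad grade.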