Did that prompt change improve your agent or break it?

Right now you ship LLM changes by eye: you try a few inputs by hand, merge, and find out it regressed when a user complains. EvalForge is a git-native, zero-infra eval suite: your test cases live in your repo and run in CI, and the quality diff lands as a PR or comment.

Join the early-access list

Reserve a spot · $19/mo

Founder pilot · limited spots · no charge today

Sound familiar?

You're evaluating your LLM app by eye, and you know it

You change a prompt and just hope

You tweak a prompt, try it on a couple of inputs by hand, see it 'seems to work', and merge. Silent regressions slip through and you only find them when a paying user churns over a bad output.

Every new model forces a re-eval you can't do

4.6 → 4.7 → next. Each release makes you ask 'can I migrate without breaking anything?' — and you have no fast, cheap way to answer it across all your prompts, so you either skip the upgrade or gamble on it.

Real eval tools are built for teams, not you

Langfuse, Braintrust and friends want you to stand up a server, learn an SDK, and ship your data to their platform. The friction outweighs the pain of any single change — until production breaks. So you stay on a couple of asserts in pytest.

How it works

Your eval suite, in your repo, in your CI, in your PR

Cases live in your repo, the agent grows them

You define eval cases as YAML / markdown in your own repo — versioned, in code review. The agent generates more cases from real failures: inputs that broke in prod or edge cases it expands automatically, so your coverage grows instead of going stale.

Cross-model, judged by calibrated LLM-as-judge

Run the same suite across N models and prompts in parallel (4.6 vs 4.7 vs an open-weight model) and score the results with task-calibrated judges (extraction, tone, factual fidelity) instead of a generic judge that hallucinates its own verdict. The judges are the moat.
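A rough sketch of what "same suite, many models, in parallel" can look like. Everything here is an assumption for illustration, not the EvalForge API: `model_a` and `model_b` are stub functions standing in for real provider calls, and a deterministic exact-match check stands in for a calibrated extraction judge.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-ins for real model calls: each maps a prompt to an output.
def model_a(prompt: str) -> str:
    return prompt.split(":")[-1].strip()

def model_b(prompt: str) -> str:
    return prompt.split(":")[-1].strip().upper()

# Each case pairs an input with the expected extraction (made-up examples).
CASES: List[Tuple[str, str]] = [
    ("Extract the city: Paris", "Paris"),
    ("Extract the city: Lima", "Lima"),
]

def extraction_judge(output: str, expected: str) -> bool:
    """Task-specific check for extraction: exact match after trimming."""
    return output.strip() == expected.strip()

def run_suite(model: Callable[[str], str]) -> float:
    """Return the fraction of cases the given model passes."""
    passed = sum(extraction_judge(model(inp), exp) for inp, exp in CASES)
    return passed / len(CASES)

def run_cross_model(models: Dict[str, Callable[[str], str]]) -> Dict[str, float]:
    """Run the same suite against every model, one worker thread per submission."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(run_suite, fn) for name, fn in models.items()}
        return {name: fut.result() for name, fut in futures.items()}

if __name__ == "__main__":
    print(run_cross_model({"model-a": model_a, "model-b": model_b}))
```

Threads fit here because real model calls are network-bound; the suite itself stays a plain function, so the same cases run unchanged against any model you add to the dictionary.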

A quality diff in your PR, not another dashboard

EvalForge opens a PR / comment with a readable diff: 'prompt B: +12% on extraction, −4% on tone, 2 new regressions in date edge-cases.' You approve or reject the merge. The decision stays yours; generate-run-judge is automated.
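To make the quality diff concrete, here is a minimal sketch of how per-category deltas and new regressions could be computed from two runs. The result shape, the function names, and the choice to report deltas in percentage points are all assumptions for illustration; the source does not specify how EvalForge computes its numbers.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical result shape: case id -> (category, passed).
Results = Dict[str, Tuple[str, bool]]

def pass_rates(results: Results) -> Dict[str, float]:
    """Pass rate per category, as a fraction between 0 and 1."""
    totals: Dict[str, int] = defaultdict(int)
    passes: Dict[str, int] = defaultdict(int)
    for category, passed in results.values():
        totals[category] += 1
        passes[category] += int(passed)
    return {c: passes[c] / totals[c] for c in totals}

def quality_diff(baseline: Results, candidate: Results) -> Tuple[Dict[str, float], List[str]]:
    """Return (delta per category in percentage points, ids of new regressions)."""
    base, cand = pass_rates(baseline), pass_rates(candidate)
    deltas = {c: round((cand[c] - base[c]) * 100, 1) for c in base if c in cand}
    # A new regression is a case that passed on the baseline and fails now.
    regressions = sorted(
        cid for cid, (_, ok) in baseline.items()
        if ok and cid in candidate and not candidate[cid][1]
    )
    return deltas, regressions

def format_diff(deltas: Dict[str, float], regressions: List[str]) -> str:
    """Render a one-line summary suitable for a PR comment."""
    parts = [f"{c}: {d:+.1f} pts" for c, d in sorted(deltas.items())]
    return ", ".join(parts) + f", {len(regressions)} new regression(s)"

if __name__ == "__main__":
    baseline = {"e1": ("extraction", True), "e2": ("extraction", False),
                "t1": ("tone", True), "t2": ("tone", True)}
    candidate = {"e1": ("extraction", True), "e2": ("extraction", True),
                 "t1": ("tone", True), "t2": ("tone", False)}
    print(format_diff(*quality_diff(baseline, candidate)))
```

Listing regressions by case id, and not only as a count, is what lets a reviewer open the failing case in the repo before approving or rejecting the merge.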

How it works

From by-eye to in-CI in four steps

Drop a few eval cases as YAML/markdown into your repo, or let the agent seed them from real failures.

EvalForge runs the suite across the models and prompts you list, in parallel, on every CI run.

Calibrated judges score each case and compute the quality diff vs your baseline.

It opens a PR/comment with the diff: you read '+12% extraction, −4% tone' and decide whether to merge.

Pricing

Indie

$15/mo

For the solo dev shipping one LLM app. Hosted calibrated judges, evals in hosted CI, quality history, and a sensible monthly run cap. The OSS core CLI stays free forever.

Most popular

Pro

$29/mo

For the indie running quality as the product. More judge volume, agentic case generation from real failures, and unlimited cross-model runs across providers.

Team

$99/mo

For the 3–10 person micro-team. Shared baselines, mandatory PR gating, and seats — for when LLM quality goes from 'my discipline' to team policy.

Open-core: the runner CLI is free and self-hostable. You pay for the calibrated judges, hosted CI, and quality history — never for what the free tier already gives you. Usage-based overage on judge tokens above your tier.

I shipped a prompt tweak that quietly tanked extraction accuracy and didn't notice for two weeks, until a customer churned. The thing I actually wanted was this: cases in my repo, the diff in my PR, and a judge I can read and trust. When 4.7 dropped I ran the suite and knew in five minutes I could migrate.

Name, indie hacker · Founder pilot

Pilot testimonial, pending validation.

Questions

Frequently asked questions

How do the calibrated judges work, and are they reliable?

Instead of one generic LLM-as-judge that hallucinates its verdict, EvalForge ships judges calibrated per task — extraction, tone, factual fidelity — tuned and checked against public ground-truth. And because everything is git-native, the judge isn't a black box: you can read it, edit it, and version it in your repo. If you don't trust a verdict, you can see exactly why it was made.
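One common way to check a judge against ground truth is to measure how often its verdicts agree with labelled examples. The sketch below assumes that approach; the labelled set, both judge functions, and the `agreement` helper are hypothetical, and the source does not describe EvalForge's actual calibration procedure.

```python
from typing import Callable, List, Tuple

# Hypothetical labelled set: (model output, reference answer, ground-truth verdict).
LABELLED: List[Tuple[str, str, bool]] = [
    ("2024-01-05", "2024-01-05", True),
    ("Jan 5, 2024", "2024-01-05", False),
    (" 2024-01-05 ", "2024-01-05", True),
]

def strict_judge(output: str, reference: str) -> bool:
    """Passes only on a byte-for-byte match."""
    return output == reference

def trimmed_judge(output: str, reference: str) -> bool:
    """Passes when the output matches after stripping surrounding whitespace."""
    return output.strip() == reference

def agreement(judge: Callable[[str, str], bool],
              labelled: List[Tuple[str, str, bool]]) -> float:
    """Fraction of labelled examples where the judge matches the ground truth."""
    hits = sum(judge(out, ref) == verdict for out, ref, verdict in labelled)
    return hits / len(labelled)

if __name__ == "__main__":
    for name, judge in [("strict", strict_judge), ("trimmed", trimmed_judge)]:
        print(name, round(agreement(judge, LABELLED), 2))
```

Because the judge is just a readable function in the repo, a disagreement with the labelled set points at a specific line you can edit and re-run.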

Does it really run in CI, git-native, with no infra?

Yes. Your eval cases live as YAML/markdown inside your repo, versioned and in code review. The suite runs in your existing CI (GitHub Actions) — no server to stand up, no SDK to learn, no uploading your data to a platform. The result lands where you already work: a PR or comment with the quality diff.

What models does it support? Is it really cross-model?

Cross-model is the whole point. Run the same suite against 4.6 vs 4.7, against an open-weight model, against another provider, so you can decide whether to migrate, not just whether one prompt works on one model. A lab's native eval will only ever help you stay on their model; EvalForge is provider-agnostic by design and helps you leave one if you should.

Can I try it before paying?

Yes. The OSS core CLI is free and self-hostable: run evals locally without the premium judges and without an account. Reserve a spot to get early access to the hosted calibrated judges and CI, with no charge today. You see it working on your own repo before anything is billed.

Reserve your spot

Reserve a spot in the founder pilot

I'm opening a small founder pilot for indie hackers shipping LLM apps in production. Reserve a spot to get early access to the calibrated judges and hosted CI — and to shape what gets built.

Early access

Reserve a spot

✓ Done! We'll let you know as soon as the pilot opens. Thanks for your trust.