Benchmarks need independent ground truth

A 95% benchmark score against ground truth that was hand-typed from LLM output measures the LLM's consistency, not its accuracy. Source your ground truth somewhere else.

3 min read

Benchmarks are the right tool for measuring extraction quality. They are also the easiest tool to fool yourself with. The trap is subtle: someone runs the pipeline on a sample, exports the output to a spreadsheet, the team marks each row "this looks right", and the spreadsheet becomes the benchmark's ground truth. The next time you run the pipeline, it scores 95% against ground truth derived from its own previous output.

What that score actually means
It means the LLM is consistent with itself, which it almost always is at temperature 0.1. It does not mean the extraction is correct.
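
To make the circularity concrete, here is a minimal sketch. Everything in it is invented: `extract` is a deliberately naive stand-in for a real pipeline, deterministic in the way a low-temperature LLM is.

```python
# Sketch of the circular benchmark. `extract` is a hypothetical stand-in
# for the real pipeline: deterministic, and wrong in a way a benchmark
# built from its own output can never detect.

def extract(document: str) -> dict:
    # Always grabs the last token as the total -- consistent, not correct.
    return {"total": document.split()[-1]}

docs = ["Invoice 7: total 120.00 USD", "Invoice 8: total 98.50 USD"]

# The original sin: ground truth copied from the pipeline's own output.
ground_truth = [extract(d) for d in docs]

# The later "benchmark" run scores the pipeline against itself.
predictions = [extract(d) for d in docs]
score = sum(p == g for p, g in zip(predictions, ground_truth)) / len(docs)
print(f"benchmark score: {score:.0%}")  # 100%, yet every extracted total is "USD"
```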

Where to source ground truth

  • Original system of record. If the data already exists in an ERP, contract management system, or CRM, use those values, not the LLM's.
  • Manual transcription, blind to LLM output. Have someone type the values from the document without seeing the extraction.
  • A second extraction with a different prompt and model. Where the two agree, treat the value as confirmed; where they disagree, adjudicate manually (see the sketch after this list).
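
A minimal sketch of that third option, assuming both extractions come back as flat field dictionaries; the field names and values here are invented:

```python
# Sketch: cross-check two independent extractions field by field.
# Agreements become candidate ground truth; disagreements go to a human.

def reconcile(a: dict, b: dict) -> tuple[dict, dict]:
    confirmed, disputed = {}, {}
    for field in a.keys() | b.keys():
        if a.get(field) == b.get(field):
            confirmed[field] = a.get(field)
        else:
            disputed[field] = {"extraction_a": a.get(field),
                               "extraction_b": b.get(field)}
    return confirmed, disputed

# Invented outputs from two different prompt-and-model pairs.
a = {"vendor": "Acme GmbH", "total": "120.00", "currency": "EUR"}
b = {"vendor": "Acme GmbH", "total": "210.00", "currency": "EUR"}

confirmed, disputed = reconcile(a, b)
print(confirmed)  # vendor and currency agree: treat as ground truth
print(disputed)   # total: 120.00 vs 210.00 -> manual adjudication
```

Agreement between two independent extractions is not proof of correctness, but disagreement is a reliable flag for where manual attention pays off.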

Benchmark on a stable, small set

You do not need a thousand-document benchmark. Twenty to fifty documents for which you have independently sourced ground truth, kept stable across prompt and ontology changes, will tell you everything you need. Re-running the same fifty after every change is also vastly cheaper than re-running a thousand.
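
The discipline, in code, is that the truth data is only ever loaded, never written. A sketch, assuming ground truth is frozen as JSON keyed by document ID; the schema, file name, and the `run_pipeline` stub are all assumptions:

```python
import json

# Frozen ground truth; in practice this lives in a versioned file
# (e.g. benchmark_v1.json) that is never regenerated from pipeline output.
TRUTH_JSON = """
{
  "doc-001": {"vendor": "Acme GmbH", "total": "120.00"},
  "doc-002": {"vendor": "Globex Corp", "total": "98.50"}
}
"""

def run_pipeline(doc_id: str) -> dict:
    """Hypothetical stand-in for the real extraction pipeline."""
    return {"vendor": "Acme GmbH", "total": "120.00"}

def field_accuracy(predictions: dict, truth: dict) -> float:
    correct = total = 0
    for doc_id, true_fields in truth.items():
        predicted = predictions.get(doc_id, {})
        for field, true_value in true_fields.items():
            total += 1
            correct += predicted.get(field) == true_value
    return correct / total

truth = json.loads(TRUTH_JSON)
predictions = {doc_id: run_pipeline(doc_id) for doc_id in truth}
print(f"field accuracy: {field_accuracy(predictions, truth):.1%}")  # 50.0%
```

Because the set is small and frozen, a score change after a prompt edit is attributable to the edit itself, not to shifting ground truth.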

Still need help?
If this article does not solve your issue, the bizSupply team is one ticket away.
Submit a ticket