Create a Benchmark
Build a benchmark plugin that scores extraction results against ground truth, computes aggregate metrics, and compares plugin versions or configurations.
Benchmarks are specialized plugins that measure the quality of extraction results. They score individual documents against ground truth data, compute aggregate metrics, and compare different plugin versions or configurations. Unlike regular plugins, benchmarks are synchronous and do not call prompt_llm.
What Benchmarks Do
Benchmarks serve three purposes in the bizSupply platform:
- Score — Evaluate each extracted document against ground truth fields and assign a per-document score (0.0-1.0).
- Compute — Aggregate individual document scores into summary metrics (accuracy, precision, F1, etc.).
- Compare — Evaluate two sets of extraction results (e.g., plugin v1 vs. v2) and report the difference.
Benchmarks are distinct from regular plugins. They do not run inside pipelines and they do not call prompt_llm or any LLM service. They are purely computational — scoring, aggregating, and comparing data that has already been extracted.
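The score → compute → compare flow can be sketched end to end with plain Python stand-ins. Here `Scored` stands in for the SDK's ScoredDocument and `TinyBenchmark` for a BenchmarkPlugin subclass; both are illustrative only, not SDK code.

```python
from dataclasses import dataclass, field

@dataclass
class Scored:
    """Stand-in for the SDK's ScoredDocument (illustrative)."""
    score: float
    details: dict = field(default_factory=dict)

class TinyBenchmark:
    """Stand-in benchmark showing the three-method shape."""

    def score(self, doc: dict) -> Scored:
        # Exact-match scoring of extracted fields against ground truth
        truth = doc["ground_truth"]
        extracted = doc["extracted_fields"]
        hits = sum(1 for k, v in truth.items() if extracted.get(k) == v)
        return Scored(score=hits / len(truth) if truth else 0.0)

    def compute(self, scored: list) -> dict:
        # Aggregate per-document scores into a summary metric
        scores = [s.score for s in scored]
        return {"accuracy": sum(scores) / len(scores) if scores else 0.0}

    def compare(self, baseline: dict, candidate: dict) -> dict:
        # Report the delta between two metric sets
        delta = candidate["accuracy"] - baseline["accuracy"]
        return {"delta": delta, "improved": delta > 0}

bench = TinyBenchmark()
doc = {"ground_truth": {"total": 100}, "extracted_fields": {"total": 100}}
metrics = bench.compute([bench.score(doc)])
# metrics["accuracy"] == 1.0
```

Note that even in this sketch, score() returns a ScoredDocument-like object rather than a bare float; the platform needs the per-document wrapper for aggregation and grouping.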
Scaffold with the CLI
Use the SDK CLI to generate a benchmark project:
bizsupply scaffold benchmark energy-contract-price
cd energy-contract-price/

This creates the following structure:
energy-contract-price/
├── benchmark.py # Your benchmark implementation
├── benchmark.yaml # Benchmark metadata
├── ground_truth/ # Sample ground truth data
│ └── sample.json
├── requirements.txt
└── tests/
    └── test_benchmark.py

Identity Properties
Every benchmark must define these class-level identity properties:
| Property | Type | Required | Description |
|---|---|---|---|
| name | str | Yes | Unique benchmark identifier (kebab-case). |
| version | str | Yes | Semantic version (e.g., "1.0.0"). |
| description | str | Yes | Human-readable description of what this benchmark measures. |
| target_labels | list[str] | Yes | Document classification labels this benchmark applies to (e.g., ["invoice", "credit_note"]). |
| metric_unit | str | Yes | The unit of measurement for scores (e.g., "accuracy", "f1", "precision", "mae"). |
| group_by | str \| None | No | Optional field name to group scores by (e.g., "vendor_name" to see per-vendor accuracy). |
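Declared in code, the table above looks like the following. This is a hedged sketch: the concrete values are placeholders, and the BenchmarkPlugin base class is omitted so the snippet stands alone.

```python
class InvoiceAccuracyBenchmark:  # in a real benchmark: (BenchmarkPlugin)
    # Identity properties — all required except group_by
    name = "invoice-accuracy"                   # unique, kebab-case
    version = "1.0.0"                           # semantic version
    description = "Scores invoice field extraction against ground truth."
    target_labels = ["invoice", "credit_note"]  # classification labels this applies to
    metric_unit = "accuracy"                    # unit reported by compute()
    group_by = "vendor_name"                    # optional; None disables grouping
```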
Aggregation Rules
Benchmarks can define MATCH_RULES to create named subsets of documents for focused scoring. Each MatchRule contains one or more MatchConditions that filter documents.
from bizsupply_sdk import MatchRule, MatchCondition
MATCH_RULES = [
MatchRule(
name="high-value",
conditions=[
MatchCondition(field="total_amount", operator="gte", value=10000),
],
aggregation="mean",
),
MatchRule(
name="usd-invoices",
conditions=[
MatchCondition(field="currency", operator="eq", value="USD"),
MatchCondition(field="document_type", operator="eq", value="invoice"),
],
aggregation="median",
),
MatchRule(
name="multi-page",
conditions=[
MatchCondition(field="page_count", operator="gt", value=3),
],
aggregation="mean",
),
]

MatchCondition Operators
| Operator | Description | Example |
|---|---|---|
| eq | Equals | MatchCondition(field="currency", operator="eq", value="USD") |
| neq | Not equals | MatchCondition(field="status", operator="neq", value="draft") |
| gt | Greater than | MatchCondition(field="total_amount", operator="gt", value=5000) |
| gte | Greater than or equal | MatchCondition(field="page_count", operator="gte", value=2) |
| lt | Less than | MatchCondition(field="total_amount", operator="lt", value=100) |
| lte | Less than or equal | MatchCondition(field="confidence", operator="lte", value=0.5) |
| contains | String contains | MatchCondition(field="vendor_name", operator="contains", value="Corp") |
| not_contains | String does not contain | MatchCondition(field="filename", operator="not_contains", value="draft") |
| regex | Regex match | MatchCondition(field="invoice_number", operator="regex", value="^INV-202[56]") |
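To make the operator semantics concrete, here is a sketch of how a MatchCondition could be evaluated against a document's fields. The `MatchCondition` dataclass and the `evaluate_condition`/`matches_all` helpers are stand-ins, not SDK code, and the AND semantics across a rule's conditions (plus the choice that a missing field never matches) are assumptions based on the table above.

```python
import re
from dataclasses import dataclass

@dataclass
class MatchCondition:
    """Stand-in for the SDK's MatchCondition (illustrative)."""
    field: str
    operator: str
    value: object

# One predicate per operator from the table above
_OPERATORS = {
    "eq": lambda a, b: a == b,
    "neq": lambda a, b: a != b,
    "gt": lambda a, b: a > b,
    "gte": lambda a, b: a >= b,
    "lt": lambda a, b: a < b,
    "lte": lambda a, b: a <= b,
    "contains": lambda a, b: b in str(a),
    "not_contains": lambda a, b: b not in str(a),
    "regex": lambda a, b: re.search(b, str(a)) is not None,
}

def evaluate_condition(cond: MatchCondition, fields: dict) -> bool:
    actual = fields.get(cond.field)
    if actual is None:
        return False  # assumption: missing fields never match
    return _OPERATORS[cond.operator](actual, cond.value)

def matches_all(conditions: list, fields: dict) -> bool:
    # All conditions in a rule must hold (AND semantics, assumed)
    return all(evaluate_condition(c, fields) for c in conditions)
```

Note that `regex` uses `re.search`, so the pattern `^INV-202[56]` from the table only anchors because of the explicit `^`.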
Implement score()
The score() method evaluates a single document against its ground truth. It receives an ExtendedDocument (which includes both extracted_fields and ground_truth) and returns a ScoredDocument whose overall score is a float between 0.0 and 1.0.
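The per-field comparison at the heart of score() can be factored into a small standalone helper. This hypothetical `compare_field` function (not part of the SDK) applies the same rules as the method shown next: 1% relative tolerance for numbers, case- and whitespace-insensitive matching for strings, strict equality otherwise. It adds one extra guard not in the original: `bool` is excluded from the numeric branch, since bool is an int subclass in Python.

```python
def compare_field(expected, actual) -> float:
    """Return 1.0 on a match, 0.0 otherwise (illustrative helper)."""
    if actual is None:
        return 0.0  # field missing from extraction
    if isinstance(expected, (int, float)) and not isinstance(expected, bool):
        try:
            # Numeric comparison with 1% relative tolerance
            return 1.0 if abs(float(actual) - expected) <= abs(expected) * 0.01 else 0.0
        except (ValueError, TypeError):
            return 0.0  # non-numeric extraction for a numeric field
    if isinstance(expected, str):
        # Case-insensitive, whitespace-trimmed string comparison
        return 1.0 if str(actual).strip().lower() == expected.strip().lower() else 0.0
    return 1.0 if actual == expected else 0.0
```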
def score(self, document: ExtendedDocument) -> ScoredDocument:
"""
Score a single document against ground truth.
Args:
document: ExtendedDocument with:
- document.extracted_fields: dict — values from the extraction plugin
- document.ground_truth: dict — expected correct values
- document.content, .filename, .metadata, etc.
Returns:
ScoredDocument with a score (0.0-1.0) and per-field details.
"""
extracted = document.extracted_fields
truth = document.ground_truth
field_scores = {}
total = 0
matched = 0
for field_name, expected_value in truth.items():
actual_value = extracted.get(field_name)
total += 1
if actual_value is None:
field_scores[field_name] = {"score": 0.0, "reason": "missing"}
continue
if isinstance(expected_value, (int, float)):
# Numeric comparison with tolerance
tolerance = abs(expected_value) * 0.01 # 1% tolerance
is_match = abs(float(actual_value) - expected_value) <= tolerance
elif isinstance(expected_value, str):
# Case-insensitive string comparison
is_match = str(actual_value).strip().lower() == expected_value.strip().lower()
else:
is_match = actual_value == expected_value
score_val = 1.0 if is_match else 0.0
matched += score_val
field_scores[field_name] = {
"score": score_val,
"expected": expected_value,
"actual": actual_value,
}
overall_score = matched / total if total > 0 else 0.0
return ScoredDocument(
document=document,
score=overall_score,
details=field_scores,
)

Implement compute()
The compute() method aggregates scores across multiple scored documents to produce summary metrics.
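Before reading the full method, the core aggregation can be sketched over plain data. The dicts below stand in for ScoredDocument objects (each carrying a `score` plus per-field `details`, as produced by score()); `aggregate` is a hypothetical helper, not SDK code.

```python
def aggregate(scored: list[dict]) -> dict:
    """Aggregate per-document scores into summary metrics (illustrative)."""
    if not scored:
        return {"accuracy": 0.0, "total_documents": 0}
    scores = [sd["score"] for sd in scored]
    # Collect each field's scores across all documents
    per_field: dict[str, list[float]] = {}
    for sd in scored:
        for name, detail in sd["details"].items():
            per_field.setdefault(name, []).append(detail["score"])
    return {
        "accuracy": sum(scores) / len(scores),
        "total_documents": len(scored),
        "per_field_accuracy": {n: sum(v) / len(v) for n, v in per_field.items()},
    }

scored = [
    {"score": 1.0, "details": {"total": {"score": 1.0}, "vendor": {"score": 1.0}}},
    {"score": 0.5, "details": {"total": {"score": 0.0}, "vendor": {"score": 1.0}}},
]
# aggregate(scored)["accuracy"] == 0.75
```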
def compute(self, scored_documents: list[ScoredDocument]) -> dict:
"""
Aggregate individual document scores into summary metrics.
Args:
scored_documents: List of ScoredDocument from score().
Returns:
dict — summary metrics (e.g., accuracy, precision, etc.).
"""
if not scored_documents:
return {"accuracy": 0.0, "total_documents": 0}
scores = [sd.score for sd in scored_documents]
# Overall metrics
metrics = {
"accuracy": sum(scores) / len(scores),
"min_score": min(scores),
"max_score": max(scores),
"median_score": sorted(scores)[len(scores) // 2],
"total_documents": len(scored_documents),
"perfect_scores": sum(1 for s in scores if s == 1.0),
"zero_scores": sum(1 for s in scores if s == 0.0),
}
# Per-field accuracy
field_totals: dict[str, list[float]] = {}
for sd in scored_documents:
for field_name, detail in sd.details.items():
field_totals.setdefault(field_name, []).append(detail["score"])
metrics["per_field_accuracy"] = {
name: sum(values) / len(values)
for name, values in field_totals.items()
}
# Apply match rules
for rule in self.MATCH_RULES:
matching = [sd for sd in scored_documents if self._matches_rule(sd, rule)]
if matching:
rule_scores = [sd.score for sd in matching]
metrics[f"rule_{rule.name}"] = {
"count": len(matching),
"accuracy": sum(rule_scores) / len(rule_scores),
}
return metrics

Implement compare()
The compare() method evaluates two sets of metrics (typically from different plugin versions or configurations) and returns a comparison report.
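As a concrete illustration of the inputs and output, here is the core delta computation as a standalone function over plain metric dicts. `compare_metrics` is hypothetical; the thresholds (more than a 1% gain to call an improvement, more than a 2% drop to call a regression) mirror the method shown in this section.

```python
def compare_metrics(baseline: dict, candidate: dict) -> dict:
    """Compare two metric dicts and return deltas plus a verdict (illustrative)."""
    delta = candidate.get("accuracy", 0.0) - baseline.get("accuracy", 0.0)
    return {
        "delta": round(delta, 4),      # rounded to avoid float noise in reports
        "improved": delta > 0,
        "regression": delta < -0.02,   # >2% drop counts as a regression
        "verdict": "improved" if delta > 0.01
                   else "regressed" if delta < -0.02
                   else "neutral",
    }

report = compare_metrics({"accuracy": 0.90}, {"accuracy": 0.93})
# report["verdict"] == "improved", report["delta"] == 0.03
```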
def compare(self, baseline: dict, candidate: dict) -> dict:
"""
Compare two sets of metrics.
Args:
baseline: Metrics from the baseline run (e.g., plugin v1).
candidate: Metrics from the candidate run (e.g., plugin v2).
Returns:
dict — comparison report with deltas and verdict.
"""
baseline_acc = baseline.get("accuracy", 0.0)
candidate_acc = candidate.get("accuracy", 0.0)
delta = candidate_acc - baseline_acc
# Per-field comparison
field_deltas = {}
baseline_fields = baseline.get("per_field_accuracy", {})
candidate_fields = candidate.get("per_field_accuracy", {})
all_fields = set(baseline_fields.keys()) | set(candidate_fields.keys())
for field in all_fields:
b = baseline_fields.get(field, 0.0)
c = candidate_fields.get(field, 0.0)
field_deltas[field] = {
"baseline": b,
"candidate": c,
"delta": c - b,
"improved": c > b,
}
return {
"baseline_accuracy": baseline_acc,
"candidate_accuracy": candidate_acc,
"delta": delta,
"improved": delta > 0,
"regression": delta < -0.02, # >2% drop is a regression
"per_field_deltas": field_deltas,
"verdict": "improved" if delta > 0.01 else "regressed" if delta < -0.02 else "neutral",
}Validate and Register
="color:#5c6370;font-style:italic"># Validate
bizsupply validate ./benchmark.py
="color:#5c6370;font-style:italic"># ✓ Benchmark class found: EnergyContractPriceBenchmark
="color:#5c6370;font-style:italic"># ✓ Base class: BenchmarkPlugin
="color:#5c6370;font-style:italic"># ✓ Required methods: score, compute, compare
="color:#5c6370;font-style:italic"># ✓ target_labels: ["energy_contract"]
="color:#5c6370;font-style:italic"># ✓ metric_unit: "accuracy"
="color:#5c6370;font-style:italic"># All checks passed.
="color:#5c6370;font-style:italic"># Run against ground truth data
bizsupply benchmark run ./benchmark.py \
--ground-truth ./ground_truth/ \
--plugin invoice-extractor
="color:#5c6370;font-style:italic"># ✓ Scored 150 documents
="color:#5c6370;font-style:italic"># ✓ Overall accuracy: 0.923
="color:#5c6370;font-style:italic"># ✓ Per-field: vendor_name=0.98, total_amount=0.95, contract_term=0.85
="color:#5c6370;font-style:italic"># Register
curl -X POST https://api.bizsupply.com/v1/benchmarks \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"name": "energy-contract-price",
"version": "1.0.0",
"description": "Scores extraction accuracy for energy contract pricing fields.",
"module_path": "energy_contract_price.benchmark.EnergyContractPriceBenchmark",
"target_labels": ["energy_contract"],
"metric_unit": "accuracy"
}'

Complete Example
Here is a complete benchmark implementation for energy contract price extraction:
from bizsupply_sdk import (
BenchmarkPlugin, ExtendedDocument, ScoredDocument,
MatchRule, MatchCondition,
)
class EnergyContractPriceBenchmark(BenchmarkPlugin):
"""Scores extraction accuracy for energy contract pricing fields."""
name = "energy-contract-price"
version = "1.0.0"
description = "Scores extraction accuracy for energy contract pricing fields."
# This benchmark applies to documents classified as "energy_contract"
target_labels = ["energy_contract"]
metric_unit = "accuracy"
group_by = "supplier_name"
# Aggregation rules for focused scoring
MATCH_RULES = [
MatchRule(
name="fixed-rate",
conditions=[
MatchCondition(field="rate_type", operator="eq", value="fixed"),
],
aggregation="mean",
),
MatchRule(
name="high-value",
conditions=[
MatchCondition(field="annual_value", operator="gte", value=50000),
],
aggregation="mean",
),
]
# Fields with numeric tolerance
NUMERIC_FIELDS = {"unit_price", "annual_value", "contract_value", "tax_rate"}
NUMERIC_TOLERANCE = 0.01 # 1% tolerance for numeric comparisons
def score(self, document: ExtendedDocument) -> ScoredDocument:
extracted = document.extracted_fields
truth = document.ground_truth
field_scores = {}
total = 0
matched = 0.0
for field_name, expected in truth.items():
actual = extracted.get(field_name)
total += 1
if actual is None:
field_scores[field_name] = {"score": 0.0, "reason": "missing"}
continue
if field_name in self.NUMERIC_FIELDS:
try:
exp_f = float(expected)
act_f = float(actual)
tolerance = abs(exp_f) * self.NUMERIC_TOLERANCE
is_match = abs(act_f - exp_f) <= max(tolerance, 0.01)
except (ValueError, TypeError):
is_match = False
else:
is_match = str(actual).strip().lower() == str(expected).strip().lower()
s = 1.0 if is_match else 0.0
matched += s
field_scores[field_name] = {
"score": s,
"expected": expected,
"actual": actual,
}
return ScoredDocument(
document=document,
score=matched / total if total > 0 else 0.0,
details=field_scores,
)
def compute(self, scored_documents: list[ScoredDocument]) -> dict:
if not scored_documents:
return {"accuracy": 0.0, "total_documents": 0}
scores = [sd.score for sd in scored_documents]
metrics = {
"accuracy": sum(scores) / len(scores),
"total_documents": len(scored_documents),
"perfect_scores": sum(1 for s in scores if s == 1.0),
}
# Per-field accuracy
field_totals: dict[str, list[float]] = {}
for sd in scored_documents:
for fn, detail in sd.details.items():
field_totals.setdefault(fn, []).append(detail["score"])
metrics["per_field_accuracy"] = {
n: sum(v) / len(v) for n, v in field_totals.items()
}
# Group by supplier
if self.group_by:
groups: dict[str, list[float]] = {}
for sd in scored_documents:
key = sd.document.extracted_fields.get(self.group_by, "unknown")
groups.setdefault(str(key), []).append(sd.score)
metrics["groups"] = {
k: {"count": len(v), "accuracy": sum(v) / len(v)}
for k, v in groups.items()
}
return metrics
def compare(self, baseline: dict, candidate: dict) -> dict:
b_acc = baseline.get("accuracy", 0.0)
c_acc = candidate.get("accuracy", 0.0)
delta = c_acc - b_acc
return {
"baseline_accuracy": b_acc,
"candidate_accuracy": c_acc,
"delta": delta,
"improved": delta > 0,
"regression": delta < -0.02,
"verdict": "improved" if delta > 0.01 else "regressed" if delta < -0.02 else "neutral",
}

Common Mistakes
1. Calling prompt_llm() in a benchmark
# WRONG — benchmarks are synchronous and cannot call LLM services
def score(self, document):
result = self.prompt_llm("Score this document...") # Error!
# CORRECT — benchmarks only do computation, no LLM calls
def score(self, document):
    return self._compare_fields(document.extracted_fields, document.ground_truth)

2. Returning a raw float from score()
# WRONG — score() must return ScoredDocument, not a float
def score(self, document) -> ScoredDocument:
return 0.85
# CORRECT — return ScoredDocument
def score(self, document) -> ScoredDocument:
    return ScoredDocument(document=document, score=0.85, details={})

3. Not handling missing ground truth fields
# WRONG — crashes if ground_truth is missing a field
for field in ALL_FIELDS:
expected = document.ground_truth[field] # KeyError!
# CORRECT — use .get() with fallback
for field in ALL_FIELDS:
expected = document.ground_truth.get(field)
if expected is None:
        continue  # Skip fields not in ground truth

4. Integer division in score calculation
# WRONG — floor division truncates, yielding 0 whenever matched < total
score = matched // total
# CORRECT — float division
score = matched / total if total > 0 else 0.0

Next Steps
- Create an Extraction Plugin to generate the extraction results your benchmark will score.
- Create an Ontology to define the fields your benchmark validates against.
- Read the Plugin Service API Reference for data model details (ExtendedDocument, ScoredDocument, etc.).
- Use the CLI to run benchmarks against different plugin versions and compare results.