Create a Benchmark

Build a benchmark plugin that scores, computes, and compares extraction quality against ground truth data.

Last updated: 2026-04-06

Benchmarks are specialized plugins that measure the quality of extraction results. They score individual documents against ground truth data, compute aggregate metrics, and compare different plugin versions or configurations. Unlike regular plugins, benchmarks are synchronous and do not call prompt_llm.


What Benchmarks Do

Benchmarks serve three purposes in the bizSupply platform:

  1. Score — Evaluate each extracted document against ground truth fields and assign a per-document score (0.0-1.0).
  2. Compute — Aggregate individual document scores into summary metrics (accuracy, precision, F1, etc.).
  3. Compare — Evaluate two sets of extraction results (e.g., plugin v1 vs. v2) and report the difference.

ℹ️ Note

Benchmarks are distinct from regular plugins. They do not run inside pipelines and they do not call prompt_llm or any LLM service. They are purely computational — scoring, aggregating, and comparing data that has already been extracted.
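The three-method shape described above can be sketched as an outline. This is illustrative only — the real base class (BenchmarkPlugin) and the ScoredDocument type come from the bizsupply_sdk package and are shown in full later in this guide.

```python
from dataclasses import dataclass, field

# Illustrative stand-in for bizsupply_sdk.ScoredDocument
@dataclass
class ScoredDocument:
    document: object
    score: float                             # 0.0-1.0
    details: dict = field(default_factory=dict)

# Outline of the three required methods; real benchmarks subclass BenchmarkPlugin
class BenchmarkOutline:
    def score(self, document) -> ScoredDocument:
        """Evaluate one document against its ground truth."""
        raise NotImplementedError

    def compute(self, scored_documents: list) -> dict:
        """Aggregate per-document scores into summary metrics."""
        raise NotImplementedError

    def compare(self, baseline: dict, candidate: dict) -> dict:
        """Report deltas between two metric sets."""
        raise NotImplementedError
```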


Scaffold with the CLI

Use the SDK CLI to generate a benchmark project:

bash
bizsupply scaffold benchmark energy-contract-price
cd energy-contract-price/

This creates the following structure:

text
energy-contract-price/
├── benchmark.py         # Your benchmark implementation
├── benchmark.yaml       # Benchmark metadata
├── ground_truth/        # Sample ground truth data
│   └── sample.json
├── requirements.txt
└── tests/
    └── test_benchmark.py

Identity Properties

Every benchmark must define these class-level identity properties:

| Property | Type | Required | Description |
|---|---|---|---|
| name | str | Yes | Unique benchmark identifier (kebab-case). |
| version | str | Yes | Semantic version (e.g., "1.0.0"). |
| description | str | Yes | Human-readable description of what this benchmark measures. |
| target_labels | list[str] | Yes | Document classification labels this benchmark applies to (e.g., ["invoice", "credit_note"]). |
| metric_unit | str | Yes | The unit of measurement for scores (e.g., "accuracy", "f1", "precision", "mae"). |
| group_by | str \| None | No | Optional field name to group scores by (e.g., "vendor_name" to see per-vendor accuracy). |
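Put together, the identity block of a benchmark class might look like the following. The class name and values are hypothetical, and a real benchmark also subclasses BenchmarkPlugin:

```python
# Hypothetical identity properties; a real benchmark subclasses BenchmarkPlugin.
class InvoiceTotalsBenchmark:
    name = "invoice-totals"                     # kebab-case identifier
    version = "1.0.0"                           # semantic version
    description = "Scores invoice total extraction against ground truth."
    target_labels = ["invoice", "credit_note"]  # labels this benchmark applies to
    metric_unit = "accuracy"                    # unit reported by compute()
    group_by = "vendor_name"                    # optional; None to disable grouping
```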

Aggregation Rules

Benchmarks can define MATCH_RULES to create named subsets of documents for focused scoring. Each MatchRule contains one or more MatchConditions that filter documents.

python
from bizsupply_sdk import MatchRule, MatchCondition

MATCH_RULES = [
    MatchRule(
        name="high-value",
        conditions=[
            MatchCondition(field="total_amount", operator="gte", value=10000),
        ],
        aggregation="mean",
    ),
    MatchRule(
        name="usd-invoices",
        conditions=[
            MatchCondition(field="currency", operator="eq", value="USD"),
            MatchCondition(field="document_type", operator="eq", value="invoice"),
        ],
        aggregation="median",
    ),
    MatchRule(
        name="multi-page",
        conditions=[
            MatchCondition(field="page_count", operator="gt", value=3),
        ],
        aggregation="mean",
    ),
]

MatchCondition Operators

| Operator | Description | Example |
|---|---|---|
| eq | Equals | MatchCondition(field="currency", operator="eq", value="USD") |
| neq | Not equals | MatchCondition(field="status", operator="neq", value="draft") |
| gt | Greater than | MatchCondition(field="total_amount", operator="gt", value=5000) |
| gte | Greater than or equal | MatchCondition(field="page_count", operator="gte", value=2) |
| lt | Less than | MatchCondition(field="total_amount", operator="lt", value=100) |
| lte | Less than or equal | MatchCondition(field="confidence", operator="lte", value=0.5) |
| contains | String contains | MatchCondition(field="vendor_name", operator="contains", value="Corp") |
| not_contains | String does not contain | MatchCondition(field="filename", operator="not_contains", value="draft") |
| regex | Regex match | MatchCondition(field="invoice_number", operator="regex", value="^INV-202[56]") |
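To make the operator semantics concrete, here is a rough standalone evaluator for a single condition. This is a sketch of the behavior implied by the table above; the SDK's own MatchCondition logic may differ in details such as type coercion or missing-field handling.

```python
import re

# Sketch: evaluate one condition against a document's fields.
# A missing field is treated as a non-match.
def evaluate(fields: dict, field: str, operator: str, value) -> bool:
    actual = fields.get(field)
    if actual is None:
        return False
    ops = {
        "eq":  lambda a, v: a == v,
        "neq": lambda a, v: a != v,
        "gt":  lambda a, v: a > v,
        "gte": lambda a, v: a >= v,
        "lt":  lambda a, v: a < v,
        "lte": lambda a, v: a <= v,
        "contains": lambda a, v: v in str(a),
        "not_contains": lambda a, v: v not in str(a),
        "regex": lambda a, v: re.search(v, str(a)) is not None,
    }
    return ops[operator](actual, value)
```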

Implement score()

The score() method evaluates a single document against its ground truth. It receives an ExtendedDocument (which includes both extracted_fields and ground_truth) and returns a ScoredDocument whose score is a float between 0.0 and 1.0.

python
def score(self, document: ExtendedDocument) -> ScoredDocument:
    """
    Score a single document against ground truth.

    Args:
        document: ExtendedDocument with:
            - document.extracted_fields: dict — values from the extraction plugin
            - document.ground_truth: dict — expected correct values
            - document.content, .filename, .metadata, etc.

    Returns:
        ScoredDocument with a score (0.0-1.0) and per-field details.
    """
    extracted = document.extracted_fields
    truth = document.ground_truth
    field_scores = {}
    total = 0
    matched = 0.0

    for field_name, expected_value in truth.items():
        actual_value = extracted.get(field_name)
        total += 1

        if actual_value is None:
            field_scores[field_name] = {"score": 0.0, "reason": "missing"}
            continue

        if isinstance(expected_value, (int, float)):
            # Numeric comparison with 1% tolerance; guard against
            # non-numeric extracted values
            try:
                tolerance = abs(expected_value) * 0.01
                is_match = abs(float(actual_value) - expected_value) <= tolerance
            except (ValueError, TypeError):
                is_match = False
        elif isinstance(expected_value, str):
            # Case-insensitive string comparison
            is_match = str(actual_value).strip().lower() == expected_value.strip().lower()
        else:
            is_match = actual_value == expected_value

        score_val = 1.0 if is_match else 0.0
        matched += score_val
        field_scores[field_name] = {
            "score": score_val,
            "expected": expected_value,
            "actual": actual_value,
        }

    overall_score = matched / total if total > 0 else 0.0

    return ScoredDocument(
        document=document,
        score=overall_score,
        details=field_scores,
    )
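The 1% numeric tolerance used above is worth isolating. A standalone version of the comparison, for quick experimentation:

```python
# Standalone version of the relative-tolerance check used in score() above.
# The tolerance scales with the magnitude of the expected value.
def within_tolerance(actual: float, expected: float, pct: float = 0.01) -> bool:
    return abs(actual - expected) <= abs(expected) * pct

within_tolerance(10050.0, 10000.0)  # True: off by 0.5%, within 1%
within_tolerance(10150.0, 10000.0)  # False: off by 1.5%
```

Note that abs(expected) makes the check work for negative expected values too (e.g., credit amounts).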

Implement compute()

The compute() method aggregates scores across multiple scored documents to produce summary metrics.

python
def compute(self, scored_documents: list[ScoredDocument]) -> dict:
    """
    Aggregate individual document scores into summary metrics.

    Args:
        scored_documents: List of ScoredDocument from score().

    Returns:
        dict — summary metrics (e.g., accuracy, precision, etc.).
    """
    if not scored_documents:
        return {"accuracy": 0.0, "total_documents": 0}

    scores = [sd.score for sd in scored_documents]

    # Overall metrics
    metrics = {
        "accuracy": sum(scores) / len(scores),
        "min_score": min(scores),
        "max_score": max(scores),
        "median_score": sorted(scores)[len(scores) // 2],
        "total_documents": len(scored_documents),
        "perfect_scores": sum(1 for s in scores if s == 1.0),
        "zero_scores": sum(1 for s in scores if s == 0.0),
    }

    # Per-field accuracy
    field_totals: dict[str, list[float]] = {}
    for sd in scored_documents:
        for field_name, detail in sd.details.items():
            field_totals.setdefault(field_name, []).append(detail["score"])

    metrics["per_field_accuracy"] = {
        name: sum(values) / len(values)
        for name, values in field_totals.items()
    }

    # Apply match rules
    for rule in self.MATCH_RULES:
        matching = [sd for sd in scored_documents if self._matches_rule(sd, rule)]
        if matching:
            rule_scores = [sd.score for sd in matching]
            metrics[f"rule_{rule.name}"] = {
                "count": len(matching),
                "accuracy": sum(rule_scores) / len(rule_scores),
            }

    return metrics
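The per-field aggregation pattern in compute() can be tried in isolation. This standalone sketch uses plain dicts in place of ScoredDocument to show how field-level scores roll up:

```python
# Two documents' per-field details (plain dicts standing in for ScoredDocument.details)
details_per_doc = [
    {"total_amount": {"score": 1.0}, "currency": {"score": 1.0}},
    {"total_amount": {"score": 0.0}, "currency": {"score": 1.0}},
]

# Collect each field's scores across documents, then average
field_totals: dict[str, list[float]] = {}
for details in details_per_doc:
    for name, detail in details.items():
        field_totals.setdefault(name, []).append(detail["score"])

per_field = {n: sum(v) / len(v) for n, v in field_totals.items()}
# per_field == {"total_amount": 0.5, "currency": 1.0}
```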

Implement compare()

The compare() method evaluates two sets of metrics (typically from different plugin versions or configurations) and returns a comparison report.

python
def compare(self, baseline: dict, candidate: dict) -> dict:
    """
    Compare two sets of metrics.

    Args:
        baseline: Metrics from the baseline run (e.g., plugin v1).
        candidate: Metrics from the candidate run (e.g., plugin v2).

    Returns:
        dict — comparison report with deltas and verdict.
    """
    baseline_acc = baseline.get("accuracy", 0.0)
    candidate_acc = candidate.get("accuracy", 0.0)
    delta = candidate_acc - baseline_acc

    # Per-field comparison
    field_deltas = {}
    baseline_fields = baseline.get("per_field_accuracy", {})
    candidate_fields = candidate.get("per_field_accuracy", {})
    all_fields = set(baseline_fields.keys()) | set(candidate_fields.keys())

    for field in all_fields:
        b = baseline_fields.get(field, 0.0)
        c = candidate_fields.get(field, 0.0)
        field_deltas[field] = {
            "baseline": b,
            "candidate": c,
            "delta": c - b,
            "improved": c > b,
        }

    return {
        "baseline_accuracy": baseline_acc,
        "candidate_accuracy": candidate_acc,
        "delta": delta,
        "improved": delta > 0,
        "regression": delta < -0.02,  # >2% drop is a regression
        "per_field_deltas": field_deltas,
        "verdict": "improved" if delta > 0.01 else "regressed" if delta < -0.02 else "neutral",
    }
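The verdict thresholds above are asymmetric: a candidate must gain more than 1% to count as improved, but must drop more than 2% to count as regressed. A standalone check of that logic:

```python
# Standalone copy of the verdict expression used in compare() above.
def verdict(delta: float) -> str:
    return "improved" if delta > 0.01 else "regressed" if delta < -0.02 else "neutral"

verdict(0.05)    # "improved"
verdict(0.005)   # "neutral" — gain too small to call improved
verdict(-0.005)  # "neutral" — drop within the 2% noise band
verdict(-0.03)   # "regressed"
```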

Validate and Register

bash
="color:#5c6370;font-style:italic"># Validate
bizsupply validate ./benchmark.py
="color:#5c6370;font-style:italic"># ✓ Benchmark class found: EnergyContractPriceBenchmark
="color:#5c6370;font-style:italic"># ✓ Base class: BenchmarkPlugin
="color:#5c6370;font-style:italic"># ✓ Required methods: score, compute, compare
="color:#5c6370;font-style:italic"># ✓ target_labels: ["energy_contract"]
="color:#5c6370;font-style:italic"># ✓ metric_unit: "accuracy"
="color:#5c6370;font-style:italic"># All checks passed.

="color:#5c6370;font-style:italic"># Run against ground truth data
bizsupply benchmark run ./benchmark.py \
  --ground-truth ./ground_truth/ \
  --plugin invoice-extractor
="color:#5c6370;font-style:italic"># ✓ Scored 150 documents
="color:#5c6370;font-style:italic"># ✓ Overall accuracy: 0.923
="color:#5c6370;font-style:italic"># ✓ Per-field: vendor_name=0.98, total_amount=0.95, contract_term=0.85

="color:#5c6370;font-style:italic"># Register
curl -X POST https://api.bizsupply.com/v1/benchmarks \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "energy-contract-price",
    "version": "1.0.0",
    "description": "Scores extraction accuracy for energy contract pricing fields.",
    "module_path": "energy_contract_price.benchmark.EnergyContractPriceBenchmark",
    "target_labels": ["energy_contract"],
    "metric_unit": "accuracy"
  }'

Complete Example

Here is a complete benchmark implementation for energy contract price extraction:

energy_contract_price/benchmark.py

python
from bizsupply_sdk import (
    BenchmarkPlugin, ExtendedDocument, ScoredDocument,
    MatchRule, MatchCondition,
)


class EnergyContractPriceBenchmark(BenchmarkPlugin):
    """Scores extraction accuracy for energy contract pricing fields."""

    name = "energy-contract-price"
    version = "1.0.0"
    description = "Scores extraction accuracy for energy contract pricing fields."

    # This benchmark applies to documents classified as "energy_contract"
    target_labels = ["energy_contract"]
    metric_unit = "accuracy"
    group_by = "supplier_name"

    # Aggregation rules for focused scoring
    MATCH_RULES = [
        MatchRule(
            name="fixed-rate",
            conditions=[
                MatchCondition(field="rate_type", operator="eq", value="fixed"),
            ],
            aggregation="mean",
        ),
        MatchRule(
            name="high-value",
            conditions=[
                MatchCondition(field="annual_value", operator="gte", value=50000),
            ],
            aggregation="mean",
        ),
    ]

    # Fields with numeric tolerance
    NUMERIC_FIELDS = {"unit_price", "annual_value", "contract_value", "tax_rate"}
    NUMERIC_TOLERANCE = 0.01  # 1% tolerance for numeric comparisons

    def score(self, document: ExtendedDocument) -> ScoredDocument:
        extracted = document.extracted_fields
        truth = document.ground_truth
        field_scores = {}
        total = 0
        matched = 0.0

        for field_name, expected in truth.items():
            actual = extracted.get(field_name)
            total += 1

            if actual is None:
                field_scores[field_name] = {"score": 0.0, "reason": "missing"}
                continue

            if field_name in self.NUMERIC_FIELDS:
                try:
                    exp_f = float(expected)
                    act_f = float(actual)
                    tolerance = abs(exp_f) * self.NUMERIC_TOLERANCE
                    is_match = abs(act_f - exp_f) <= max(tolerance, 0.01)
                except (ValueError, TypeError):
                    is_match = False
            else:
                is_match = str(actual).strip().lower() == str(expected).strip().lower()

            s = 1.0 if is_match else 0.0
            matched += s
            field_scores[field_name] = {
                "score": s,
                "expected": expected,
                "actual": actual,
            }

        return ScoredDocument(
            document=document,
            score=matched / total if total > 0 else 0.0,
            details=field_scores,
        )

    def compute(self, scored_documents: list[ScoredDocument]) -> dict:
        if not scored_documents:
            return {"accuracy": 0.0, "total_documents": 0}

        scores = [sd.score for sd in scored_documents]
        metrics = {
            "accuracy": sum(scores) / len(scores),
            "total_documents": len(scored_documents),
            "perfect_scores": sum(1 for s in scores if s == 1.0),
        }

        # Per-field accuracy
        field_totals: dict[str, list[float]] = {}
        for sd in scored_documents:
            for fn, detail in sd.details.items():
                field_totals.setdefault(fn, []).append(detail["score"])

        metrics["per_field_accuracy"] = {
            n: sum(v) / len(v) for n, v in field_totals.items()
        }

        # Group by supplier
        if self.group_by:
            groups: dict[str, list[float]] = {}
            for sd in scored_documents:
                key = sd.document.extracted_fields.get(self.group_by, "unknown")
                groups.setdefault(str(key), []).append(sd.score)
            metrics["groups"] = {
                k: {"count": len(v), "accuracy": sum(v) / len(v)}
                for k, v in groups.items()
            }

        return metrics

    def compare(self, baseline: dict, candidate: dict) -> dict:
        b_acc = baseline.get("accuracy", 0.0)
        c_acc = candidate.get("accuracy", 0.0)
        delta = c_acc - b_acc

        return {
            "baseline_accuracy": b_acc,
            "candidate_accuracy": c_acc,
            "delta": delta,
            "improved": delta > 0,
            "regression": delta < -0.02,
            "verdict": "improved" if delta > 0.01 else "regressed" if delta < -0.02 else "neutral",
        }

Common Mistakes

1. Calling prompt_llm() in a benchmark

python
# WRONG — benchmarks are synchronous and cannot call LLM services
def score(self, document):
    result = self.prompt_llm("Score this document...")  # Error!

# CORRECT — benchmarks only do computation, no LLM calls
def score(self, document):
    return self._compare_fields(document.extracted_fields, document.ground_truth)

2. Returning a raw float from score()

python
# WRONG — score() must return ScoredDocument, not a float
def score(self, document) -> ScoredDocument:
    return 0.85

# CORRECT — return ScoredDocument
def score(self, document) -> ScoredDocument:
    return ScoredDocument(document=document, score=0.85, details={})

3. Not handling missing ground truth fields

python
# WRONG — crashes if ground_truth is missing a field
for field in ALL_FIELDS:
    expected = document.ground_truth[field]  # KeyError!

# CORRECT — use .get() with fallback
for field in ALL_FIELDS:
    expected = document.ground_truth.get(field)
    if expected is None:
        continue  # Skip fields not in ground truth

4. Integer division in score calculation

python
# WRONG — floor division truncates fractional scores to 0 (e.g., 3 // 4 == 0)
score = matched // total

# CORRECT — float division
score = matched / total if total > 0 else 0.0

Next Steps

  • Create an Extraction Plugin to generate the extraction results your benchmark will score.
  • Create an Ontology to define the fields your benchmark validates against.
  • Read the Plugin Service API Reference for data model details (ExtendedDocument, ScoredDocument, etc.).
  • Use the CLI to run benchmarks against different plugin versions and compare results.