Create an Extraction Plugin

Build an extraction plugin that pulls structured data fields from documents using ontology definitions and LLM-powered analysis.

Last updated: 2026-04-06

Extraction plugins are the core of document processing in bizSupply. They receive a classified document and an ontology field list, then extract structured data from the document content. The extracted fields are validated against the ontology and stored in the platform.


What Extraction Plugins Do

An extraction plugin receives a Document (with its text content) and a list of ontology fields that describe what data to extract. It returns an ExtractionResult: an object that maps each field name to its extracted value and, optionally, a confidence score. It is not a plain dict (see Common Mistakes).

  • Extract structured key-value data from unstructured document text
  • Use LLM analysis guided by ontology field definitions
  • Support multiple field types: string, number, date, boolean, array, and object
  • Validate extracted data against field type constraints
  • Return confidence scores per field when available

How Extraction Works

The platform executes extraction in a six-step flow:

  1. Classification — The document is classified (e.g., "invoice") by a classification plugin.
  2. Ontology Lookup — The Engine selects the ontology that matches the classification label.
  3. Field Resolution — The Engine resolves the list of fields defined in the ontology.
  4. Plugin Dispatch — The Engine calls extract() on your plugin, passing the document and fields.
  5. Validation — The Engine validates extracted values against field type constraints.
  6. Storage — Valid extracted data is persisted and available via the API.
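The six steps above can be sketched in plain Python. This is a toy simulation, not the Engine's implementation: `ONTOLOGIES`, `classify`, `extract`, `validate`, and `run_flow` are hypothetical stand-ins, and the keyword-based classifier and line-based extractor replace what real plugins would do with an LLM.

```python
# Toy simulation of the six-step flow. Every name here is a hypothetical
# stand-in; the real Engine and plugins are not shown in this guide.

ONTOLOGIES = {
    "invoice": [
        {"name": "vendor_name", "type": "string", "required": True},
        {"name": "total_amount", "type": "number", "required": True},
    ],
}

PYTHON_TYPES = {"string": str, "number": (int, float)}


def classify(text):
    """Step 1: stand-in for a classification plugin."""
    return "invoice" if "invoice" in text.lower() else None


def extract(text, fields):
    """Step 4: stand-in for a plugin's extract(); parses 'key: value' lines."""
    wanted = {f["name"]: f["type"] for f in fields}
    found = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip().lower().replace(" ", "_")
        if sep and key in wanted:
            value = value.strip()
            found[key] = float(value) if wanted[key] == "number" else value
    return found


def validate(values, fields):
    """Step 5: keep only values whose Python type matches the field type."""
    valid = {}
    for f in fields:
        value = values.get(f["name"])
        if isinstance(value, PYTHON_TYPES[f["type"]]) and not isinstance(value, bool):
            valid[f["name"]] = value
    return valid


def run_flow(text, store):
    label = classify(text)                 # 1. Classification
    if label not in ONTOLOGIES:
        return None                        # no matching ontology, nothing to extract
    fields = ONTOLOGIES[label]             # 2-3. Ontology lookup, field resolution
    values = extract(text, fields)         # 4. Plugin dispatch
    valid = validate(values, fields)       # 5. Validation
    store.append(valid)                    # 6. Storage
    return valid


store = []
doc = "Invoice\nVendor name: Acme Corp\nTotal amount: 1500.00"
result = run_flow(doc, store)
print(result)
```

The point of the sketch is the ordering: the plugin only ever sees fields that were resolved from the ontology before it was called, and validation happens after it returns.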

Step 1 — Write the Plugin Code

Create your extraction plugin by extending ExtractionPlugin. The extract() method receives a document and a list of field definitions, and must return an ExtractionResult.

invoice_extractor/plugin.py
python
from bizsupply_sdk import ExtractionPlugin, ExtractionResult, PluginError
import json


class InvoiceExtractorPlugin(ExtractionPlugin):
    """Extracts structured fields from invoice documents."""

    name = "invoice-extractor"
    version = "1.0.0"
    description = "Extracts invoice fields using LLM analysis."

    # Configurable parameters
    max_content_length: int = 8000
    include_line_items: bool = True

    def extract(self, document, fields: list[dict]) -> ExtractionResult:
        """
        Extract structured data from a document.

        Args:
            document: Document object with .content, .filename,
                      .mime_type, .metadata, .document_type.
            fields: List of ontology field definitions. Each dict has:
                - name: str (e.g., "vendor_name")
                - type: str (string, number, date, boolean, array, object)
                - description: str
                - required: bool

        Returns:
            ExtractionResult with field_name -> value mapping.
        """
        if not document.content or not document.content.strip():
            raise PluginError(
                "Document has no extractable text content.",
                retryable=False,
            )

        # Format the ontology fields into a prompt-friendly string
        fields_text = self.format_fields_for_prompt(fields)

        # Load the extraction prompt template
        prompt_template = self.get_prompt("invoice-extractor")

        # Build the final prompt
        prompt = prompt_template.replace(
            "{{FIELDS}}", fields_text
        ).replace(
            "{{DOCUMENT_CONTENT}}",
            document.content[:self.max_content_length],
        ).replace(
            "{{DOCUMENT_TYPE}}",
            document.document_type or "unknown",
        )

        # Call the LLM
        result = self.prompt_llm(prompt, temperature=0.1, max_tokens=2000)

        # Parse the JSON response
        try:
            extracted = json.loads(result)
        except json.JSONDecodeError:
            # Fall back to JSON wrapped in a markdown code block
            if "```json" not in result:
                raise PluginError(
                    f"LLM returned invalid JSON: {result[:200]}",
                    retryable=True,
                )
            json_str = result.split("```json")[1].split("```")[0].strip()
            try:
                extracted = json.loads(json_str)
            except json.JSONDecodeError:
                # The fenced block was not valid JSON either
                raise PluginError(
                    f"LLM returned invalid JSON: {result[:200]}",
                    retryable=True,
                )

        # Build the ExtractionResult
        extraction = ExtractionResult()
        for field in fields:
            field_name = field["name"]
            if field_name in extracted:
                extraction.set_field(
                    name=field_name,
                    value=extracted[field_name],
                    confidence=extracted.get(f"{field_name}_confidence", None),
                )

        self.log("info", f"Extracted {len(extraction.fields)} fields from '{document.filename}'.")
        return extraction
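The fallback in the plugin above only recognizes a fence tagged json. LLM responses can also arrive in an untagged fence or surrounded by prose, so a more tolerant parser may be worth having. The following is a minimal sketch using only the standard library; `parse_llm_json` is a hypothetical helper, not part of bizsupply_sdk. It raises ValueError, which a plugin would presumably translate into a retryable PluginError.

```python
import json
import re

TICKS = "`" * 3  # a markdown fence marker, built indirectly for readability

# Matches a fenced block, with or without a "json" language tag.
_FENCE = re.compile(TICKS + r"(?:json)?\s*(.*?)" + TICKS, re.DOTALL)


def parse_llm_json(text: str) -> dict:
    """Return the first JSON object found in an LLM response.

    Tries, in order: the whole response, any fenced code block, and the
    span from the first "{" to the last "}". Raises ValueError if none
    of these parses to a JSON object.
    """
    candidates = [text]
    candidates += _FENCE.findall(text)
    start, end = text.find("{"), text.rfind("}")
    if 0 <= start < end:
        candidates.append(text[start:end + 1])

    for candidate in candidates:
        try:
            parsed = json.loads(candidate.strip())
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    raise ValueError(f"No JSON object found in response: {text[:200]!r}")


fenced = "Here you go:\n" + TICKS + 'json\n{"vendor_name": "Acme Corp"}\n' + TICKS
print(parse_llm_json(fenced))
```

Rejecting non-object JSON (a bare list or string) up front keeps the later `field_name in extracted` lookup from silently matching against the wrong container type.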

Step 2 — Create an Ontology with Fields

Define the fields your extraction plugin will populate. The ontology is registered separately and linked to a document type (classification label).

invoice-ontology.json
json
{
  "name": "invoice-ontology",
  "document_type": "invoice",
  "fields": [
    {
      "name": "vendor_name",
      "type": "string",
      "description": "The name of the vendor or supplier.",
      "required": true
    },
    {
      "name": "invoice_number",
      "type": "string",
      "description": "The unique invoice identifier.",
      "required": true
    },
    {
      "name": "invoice_date",
      "type": "date",
      "description": "The date the invoice was issued (ISO 8601).",
      "required": true
    },
    {
      "name": "due_date",
      "type": "date",
      "description": "The payment due date (ISO 8601).",
      "required": false
    },
    {
      "name": "subtotal",
      "type": "number",
      "description": "The subtotal before tax.",
      "required": false
    },
    {
      "name": "tax_amount",
      "type": "number",
      "description": "Total tax amount.",
      "required": false
    },
    {
      "name": "total_amount",
      "type": "number",
      "description": "The total amount due, including taxes.",
      "required": true
    },
    {
      "name": "currency",
      "type": "string",
      "description": "The currency code (e.g., USD, EUR).",
      "required": true
    },
    {
      "name": "line_items",
      "type": "array",
      "description": "List of line items with description, quantity, unit_price, and amount.",
      "required": false
    }
  ]
}
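Before registering an ontology, a quick local check can catch typos in field definitions. The sketch below is an assumption-laden convenience, not the platform's validator: `check_ontology` is a hypothetical helper, the allowed type names are taken from the Field Types section of this guide, and the platform may apply rules that are stricter or different.

```python
import json

# Types listed in this guide's "Field Types" section.
ALLOWED_TYPES = {"string", "number", "date", "boolean", "array", "object"}
# Keys that every field in the example ontology carries.
REQUIRED_KEYS = {"name", "type", "description", "required"}


def check_ontology(ontology: dict) -> list[str]:
    """Return a list of human-readable problems; empty means none found."""
    problems = []
    seen = set()
    for index, field in enumerate(ontology.get("fields", [])):
        label = field.get("name", f"fields[{index}]")
        missing = REQUIRED_KEYS - set(field)
        if missing:
            problems.append(f"{label}: missing keys {sorted(missing)}")
        if "type" in field and field["type"] not in ALLOWED_TYPES:
            problems.append(f"{label}: unknown type {field['type']!r}")
        if label in seen:
            problems.append(f"{label}: duplicate field name")
        seen.add(label)
    return problems


# A shortened ontology with a deliberate typo in the second field's type.
ontology = json.loads("""
{
  "name": "invoice-ontology",
  "document_type": "invoice",
  "fields": [
    {"name": "vendor_name", "type": "string",
     "description": "The name of the vendor or supplier.", "required": true},
    {"name": "total_amount", "type": "numbr",
     "description": "The total amount due, including taxes.", "required": true}
  ]
}
""")
print(check_ontology(ontology))
```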

Step 3 — Create an Extraction Prompt

Register a reusable prompt template for your extractor:

bash
curl -X POST https://api.bizsupply.com/v1/prompts \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "invoice-extractor",
    "scope": "tenant",
    "template": "You are a document extraction expert.\n\nExtract the following fields from this {{DOCUMENT_TYPE}} document.\nReturn the result as a JSON object with field names as keys.\nFor each field, also include a confidence score as field_name_confidence (0.0 to 1.0).\nIf a field cannot be found, set its value to null.\n\nFields to extract:\n{{FIELDS}}\n\nDocument content:\n{{DOCUMENT_CONTENT}}\n\nRespond with ONLY the JSON object. No explanation."
  }'
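To preview what the LLM will receive, the template can be rendered locally. In this sketch `format_fields` is a hypothetical stand-in for the SDK's format_fields_for_prompt(), whose exact output format is not specified in this guide beyond including name, type, required, and description. `render_prompt` is likewise hypothetical and mirrors the chained replace() calls from Step 1.

```python
def format_fields(fields: list[dict]) -> str:
    """Hypothetical stand-in for the SDK's format_fields_for_prompt()."""
    lines = []
    for field in fields:
        flag = "required" if field.get("required") else "optional"
        lines.append(
            f"- {field['name']} ({field['type']}, {flag}): {field['description']}"
        )
    return "\n".join(lines)


def render_prompt(template: str, values: dict[str, str]) -> str:
    """Replace each {{KEY}} placeholder with its value, in dict order."""
    for key, value in values.items():
        template = template.replace("{{" + key + "}}", value)
    return template


template = (
    "Extract the following fields from this {{DOCUMENT_TYPE}} document.\n"
    "Fields to extract:\n{{FIELDS}}\n\n"
    "Document content:\n{{DOCUMENT_CONTENT}}"
)
fields = [
    {"name": "vendor_name", "type": "string",
     "description": "The name of the vendor or supplier.", "required": True},
    {"name": "due_date", "type": "date",
     "description": "The payment due date (ISO 8601).", "required": False},
]
prompt = render_prompt(template, {
    "DOCUMENT_TYPE": "invoice",
    "FIELDS": format_fields(fields),
    "DOCUMENT_CONTENT": "Invoice INV-2026-0042 from Acme Corp",  # substituted last
})
print(prompt)
```

Substituting the document content last is a deliberate choice in this sketch: if the document text happens to contain a literal placeholder such as {{FIELDS}}, it is left untouched instead of being expanded by a later replace() call.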

Step 4 — Validate and Register

bash
# Validate
bizsupply validate ./plugin.py
# ✓ Plugin class found: InvoiceExtractorPlugin
# ✓ Base class: ExtractionPlugin
# ✓ Required method implemented: extract
# ✓ Return type: ExtractionResult
# All checks passed.

# Test with a sample document
bizsupply test ./plugin.py \
  --document sample-invoice.pdf \
  --ontology invoice-ontology.json
# ✓ extract() returned ExtractionResult with 7 fields
# ✓ vendor_name: "Acme Corp" (confidence: 0.95)
# ✓ invoice_number: "INV-2026-0042" (confidence: 0.99)
# ✓ total_amount: 1500.00 (confidence: 0.92)

# Register
curl -X POST https://api.bizsupply.com/v1/plugins \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "invoice-extractor",
    "type": "extraction",
    "version": "1.0.0",
    "description": "Extracts invoice fields using LLM analysis.",
    "module_path": "invoice_extractor.plugin.InvoiceExtractorPlugin"
  }'
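The same registration call can be prepared from Python with only the standard library. This is a sketch: the URL, headers, and payload are copied from the curl example above, `build_register_request` is a hypothetical helper, and because the response format is not documented in this guide the example builds the request without sending it.

```python
import json
import urllib.request

API_URL = "https://api.bizsupply.com/v1/plugins"  # from the curl example above


def build_register_request(api_key: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request that registers a plugin."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_register_request("YOUR_API_KEY", {
    "name": "invoice-extractor",
    "type": "extraction",
    "version": "1.0.0",
    "description": "Extracts invoice fields using LLM analysis.",
    "module_path": "invoice_extractor.plugin.InvoiceExtractorPlugin",
})
print(request.get_method(), request.full_url)
# To send it: urllib.request.urlopen(request)
```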

Key Methods

| Method | Signature | Description |
| --- | --- | --- |
| extract | extract(self, document, fields) -> ExtractionResult | Required. Extracts structured data from the document using the provided field definitions. |
| prompt_llm | self.prompt_llm(prompt, model?, temperature?, max_tokens?) | Sends a prompt to the LLM and returns the text response. |
| get_prompt | self.get_prompt(name) -> str | Loads a registered prompt template by name. |
| format_fields_for_prompt | self.format_fields_for_prompt(fields) -> str | Formats ontology fields into a prompt-friendly string (name, type, required, description). |
| log | self.log(level, message) | Writes to the job execution log. |
| get_config | self.get_config(key, default?) | Retrieves a pipeline-level configuration value. |

Field Types

Ontology fields support the following types. The platform validates extracted values against these types after extraction.

| Type | Python Type | Example Value | Validation |
| --- | --- | --- | --- |
| string | str | "Acme Corp" | Must be a non-empty string. |
| number | int or float | 1500.00 | Must be numeric. Strings like "$1,500" are auto-parsed. |
| date | str (ISO 8601) | "2026-01-15" | Must be a valid ISO 8601 date string. |
| boolean | bool | true | Must be true or false. |
| array | list[dict] | [{"desc": "Widget", "qty": 10}] | Must be a list. Each item is validated recursively if sub-fields are defined. |
| object | dict | {"street": "123 Main", "city": "NYC"} | Must be a dict. Nested fields are validated against sub-field definitions. |
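The validation rules in the table can be approximated locally to catch bad values before they reach the platform. The sketch below is not the platform's validator: `coerce_value` is a hypothetical helper, the rule used for strings like "$1,500" (drop everything except digits, the decimal point, and the minus sign) is an assumption about what "auto-parsed" means, and recursive sub-field validation is left out.

```python
import re
from datetime import date


def coerce_value(field_type: str, value):
    """Return a validated value for field_type, or raise ValueError."""
    if field_type == "string":
        if isinstance(value, str) and value.strip():
            return value
    elif field_type == "number":
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            return value
        if isinstance(value, str):
            # Assumed parsing rule: drop everything except digits, ".", "-".
            cleaned = re.sub(r"[^0-9.\-]", "", value)
            try:
                return float(cleaned)
            except ValueError:
                pass
    elif field_type == "date":
        if isinstance(value, str):
            try:
                date.fromisoformat(value)  # accepts YYYY-MM-DD
                return value
            except ValueError:
                pass
    elif field_type == "boolean":
        if isinstance(value, bool):
            return value
    elif field_type == "array":
        if isinstance(value, list):
            return value
    elif field_type == "object":
        if isinstance(value, dict):
            return value
    raise ValueError(f"{value!r} is not a valid {field_type}")


print(coerce_value("number", "$1,500"))
print(coerce_value("date", "2026-01-15"))
```

One consequence worth noting: the prompt in Step 3 asks the LLM to return null for fields it cannot find, and a null would fail the non-empty string rule here, so a plugin may want to skip null values instead of passing them on.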

Common Mistakes

1. Returning a dict instead of ExtractionResult

python
# WRONG — extract() must return ExtractionResult, not a plain dict
def extract(self, document, fields) -> ExtractionResult:
    return {"vendor_name": "Acme Corp", "total_amount": 1500.00}

# CORRECT — use ExtractionResult
def extract(self, document, fields) -> ExtractionResult:
    result = ExtractionResult()
    result.set_field("vendor_name", "Acme Corp", confidence=0.95)
    result.set_field("total_amount", 1500.00, confidence=0.92)
    return result

2. Using the old execute() method

python
# WRONG — execute() was removed in SDK 1.0
def execute(self, document, fields):
    ...

# CORRECT — use extract()
def extract(self, document, fields) -> ExtractionResult:
    ...

3. Missing bizsupply_sdk import

python
# WRONG — ExtractionResult not imported
from bizsupply_sdk import ExtractionPlugin

def extract(self, document, fields) -> ExtractionResult:  # NameError!
    result = ExtractionResult()

# CORRECT — import ExtractionResult
from bizsupply_sdk import ExtractionPlugin, ExtractionResult

Pipeline Order

Note

Extraction always runs AFTER classification in a pipeline. The classification label determines which ontology (and therefore which fields) are passed to your extract() method. If a document is not classified, it skips the extraction stage entirely.

A typical pipeline order is: Source -> Classification -> Extraction -> Aggregation. You can have multiple extraction plugins in a pipeline, each handling different document types with different ontologies.
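The routing described above, with several extraction plugins each handling its own document type, can be pictured with a small toy model. Everything here is hypothetical: `EXTRACTORS`, `run_extraction_stage`, and the two extractor functions are stand-ins, and a real pipeline is configured on the platform, not written as Python code like this.

```python
# Toy model of the extraction stage: one extractor per document type.
# The extractor functions are hypothetical stand-ins for extraction plugins.

def extract_invoice(text):
    return {"document_kind": "invoice", "length": len(text)}


def extract_contract(text):
    return {"document_kind": "contract", "length": len(text)}


EXTRACTORS = {"invoice": extract_invoice, "contract": extract_contract}


def run_extraction_stage(documents):
    """Run the matching extractor for each classified document."""
    results = []
    for doc in documents:
        label = doc.get("document_type")    # set earlier by classification
        extractor = EXTRACTORS.get(label)
        if extractor is None:
            continue                        # unclassified or unhandled: skipped
        results.append((doc["filename"], extractor(doc["content"])))
    return results


docs = [
    {"filename": "a.pdf", "document_type": "invoice", "content": "Invoice 42"},
    {"filename": "b.pdf", "document_type": None, "content": "???"},
    {"filename": "c.pdf", "document_type": "contract", "content": "Agreement"},
]
for filename, fields in run_extraction_stage(docs):
    print(filename, fields)
```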


Next Steps

  • Create an Ontology with the field definitions your extractor will populate.
  • Create a Classification Plugin to route documents to the correct ontology.
  • Create a Pipeline to wire classification and extraction together.
  • Create a Benchmark to measure extraction accuracy against ground truth data.