Create an Ontology

Design, validate, and register an ontology to define exactly what structured data to extract from your documents.

Last updated: 2026-04-01

An ontology defines the structured data you want to extract from a specific document type. This guide walks you through designing, creating, registering, and testing an ontology from scratch.


1

Design Your Schema

Start by identifying the document type you are targeting and listing every field you need to extract. For each field, determine its type, whether it is required, and any validation constraints.

Consider these questions:

  • What document type does this ontology apply to? (This becomes the taxonomy value.)
  • What fields are always present on this document type? (These are required fields.)
  • What fields are sometimes present? (These are optional fields.)
  • Are there fields with a fixed set of allowed values? (Use allowed_values.)
  • Are there nested or repeated structures? (Use the array type with item definitions.)
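
One way to capture the answers to these questions before writing any YAML is a small Python draft of the field list. The sketch below is illustrative only: `FieldSpec` and `to_ontology_dict` are hypothetical helpers, not part of the bizsupply SDK. The keys they emit (`name`, `type`, `description`, `required`, `allowed_values`, `items`) mirror the ontology format used in this guide.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FieldSpec:
    """Hypothetical planning helper for one ontology field (not an SDK class)."""

    name: str
    type: str
    description: str
    required: bool = False
    allowed_values: Optional[list] = None
    items: list = field(default_factory=list)  # sub-fields, used when type == "array"

    def to_dict(self, nested: bool = False) -> dict:
        """Render the field using the keys shown in this guide's ontology format."""
        out = {"name": self.name, "type": self.type, "description": self.description}
        if not nested:
            # Sub-fields in this guide's examples carry no "required" flag.
            out["required"] = self.required
        if self.allowed_values:
            out["allowed_values"] = list(self.allowed_values)
        if self.items:
            out["items"] = [sub.to_dict(nested=True) for sub in self.items]
        return out


def to_ontology_dict(taxonomy: str, description: str, version: str, fields: list) -> dict:
    """Assemble the top-level ontology structure from the drafted fields."""
    return {
        "taxonomy": taxonomy,
        "description": description,
        "version": version,
        "fields": [f.to_dict() for f in fields],
    }


draft = to_ontology_dict(
    "purchase_order",
    "Fields to extract from purchase order documents.",
    "1.0.0",
    [
        FieldSpec("po_number", "string", "The purchase order number.", required=True),
        FieldSpec("currency", "string", "ISO 4217 currency code.", required=True,
                  allowed_values=["USD", "EUR", "GBP"]),
        FieldSpec("line_items", "array", "Individual items on the purchase order.",
                  required=True,
                  items=[FieldSpec("quantity", "number", "Ordered quantity.")]),
    ],
)
```

The resulting `draft` dict has the same shape as the YAML file you write in the next step, so it can serve as a checklist while you author the file by hand.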

2

Write the Ontology YAML

Ontologies are defined in YAML format. Here is a complete example for a purchase order:

purchase_order.yaml (yaml)
taxonomy: purchase_order
description: "Fields to extract from purchase order documents."
version: "1.0.0"

fields:
  - name: po_number
    type: string
    description: "The purchase order number."
    required: true

  - name: vendor_name
    type: string
    description: "Name of the supplier or vendor."
    required: true

  - name: order_date
    type: date
    description: "Date the purchase order was issued (YYYY-MM-DD)."
    required: true

  - name: delivery_date
    type: date
    description: "Expected delivery date (YYYY-MM-DD)."
    required: false

  - name: shipping_address
    type: string
    description: "Full shipping/delivery address."
    required: false

  - name: payment_terms
    type: string
    description: "Payment terms (e.g., Net 30, Net 60, Due on Receipt)."
    required: false
    allowed_values:
      - "Net 15"
      - "Net 30"
      - "Net 60"
      - "Net 90"
      - "Due on Receipt"

  - name: currency
    type: string
    description: "ISO 4217 currency code."
    required: true
    allowed_values: [USD, EUR, GBP, CHF, JPY, CAD, AUD]

  - name: subtotal
    type: number
    description: "Order subtotal before tax and shipping."
    required: true

  - name: tax_amount
    type: number
    description: "Total tax amount."
    required: false

  - name: total_amount
    type: number
    description: "Grand total including tax and shipping."
    required: true

  - name: line_items
    type: array
    description: "Individual items on the purchase order."
    required: true
    items:
      - name: item_number
        type: string
        description: "Item or SKU number."
      - name: description
        type: string
        description: "Item description."
      - name: quantity
        type: number
        description: "Ordered quantity."
      - name: unit_price
        type: number
        description: "Price per unit."
      - name: amount
        type: number
        description: "Line total (quantity x unit_price)."

3

Validate with the SDK

Use the SDK to validate your ontology before registering it:

bash
bizsupply ontology validate ./purchase_order.yaml
# ✓ Taxonomy: purchase_order
# ✓ Fields: 11 total (7 required, 4 optional)
# ✓ Array field "line_items" has 5 sub-fields
# ✓ Allowed values validated for: payment_terms, currency
# ✓ No circular references detected
# All checks passed.
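
Similar structural checks can be approximated locally, for example in a unit test, before the file reaches the command above. The sketch below is a stand-in written under assumptions, not the SDK's validator: `check_ontology` and `count_required` are hypothetical functions, they operate on an ontology that has already been loaded into a Python dict (for instance with a YAML parser), and the set of accepted types is taken from the Field Type Reference in this guide.

```python
VALID_TYPES = {"string", "number", "date", "boolean", "array"}


def check_ontology(ontology: dict) -> list[str]:
    """Return a list of problems; an empty list means the structure looks sane."""
    problems = []
    for key in ("taxonomy", "version", "fields"):
        if not ontology.get(key):
            problems.append(f"missing top-level key: {key}")

    seen = set()
    for f in ontology.get("fields", []):
        name = f.get("name", "<unnamed>")
        if name in seen:
            problems.append(f"duplicate field name: {name}")
        seen.add(name)
        if f.get("type") not in VALID_TYPES:
            problems.append(f"{name}: unknown type {f.get('type')!r}")
        if f.get("type") == "array" and not f.get("items"):
            problems.append(f"{name}: array field needs sub-fields under 'items'")
        if "allowed_values" in f and not f["allowed_values"]:
            problems.append(f"{name}: allowed_values is empty")
    return problems


def count_required(ontology: dict) -> tuple[int, int]:
    """Return (required, optional) counts for the top-level fields."""
    fields = ontology.get("fields", [])
    required = sum(1 for f in fields if f.get("required"))
    return required, len(fields) - required


good = {
    "taxonomy": "purchase_order",
    "version": "1.0.0",
    "fields": [
        {"name": "po_number", "type": "string", "required": True},
        {"name": "delivery_date", "type": "date", "required": False},
        {"name": "line_items", "type": "array", "required": True,
         "items": [{"name": "quantity", "type": "number"}]},
    ],
}

bad = {
    "taxonomy": "purchase_order",
    "version": "1.0.0",
    "fields": [
        {"name": "po_number", "type": "string"},
        {"name": "po_number", "type": "string"},   # duplicate name
        {"name": "total", "type": "money"},        # type not in VALID_TYPES
        {"name": "line_items", "type": "array"},   # no sub-fields
    ],
}

print(check_ontology(good))
print(check_ontology(bad))
print(count_required(good))
```

A local check like this does not replace the CLI validation; it only catches obvious structural mistakes earlier.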

4

Register the Ontology

Register the ontology with the platform via the API:

bash
curl -X POST "https://api.bizsupply.com/v1/ontologies" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "taxonomy": "purchase_order",
    "description": "Fields to extract from purchase order documents.",
    "version": "1.0.0",
    "fields": [
      {
        "name": "po_number",
        "type": "string",
        "description": "The purchase order number.",
        "required": true
      },
      {
        "name": "vendor_name",
        "type": "string",
        "description": "Name of the supplier or vendor.",
        "required": true
      },
      {
        "name": "total_amount",
        "type": "number",
        "description": "Grand total including tax and shipping.",
        "required": true
      }
    ]
  }'

The example above is abbreviated; include all fields from your YAML definition. A successful request returns a response like the following:

json
{
  "id": "ont_po_v1",
  "taxonomy": "purchase_order",
  "version": "1.0.0",
  "field_count": 11,
  "status": "active",
  "created_at": "2026-01-18T09:00:00Z"
}
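
Because the request body mirrors the YAML structure, one way to avoid a hand-copied or accidentally abbreviated payload is to generate it from the ontology definition. The following sketch uses only the Python standard library to build (not send) the POST request from the curl example. `build_register_request` is a hypothetical helper, and the ontology is shown as a plain dict; in practice you would load it from your YAML file with a YAML parser.

```python
import json
import urllib.request

API_URL = "https://api.bizsupply.com/v1/ontologies"  # endpoint from the curl example


def build_register_request(ontology: dict, api_key: str) -> urllib.request.Request:
    """Build the registration request; the caller decides when to send it."""
    body = json.dumps(ontology).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


# Abbreviated here for readability; a real payload carries every field.
ontology = {
    "taxonomy": "purchase_order",
    "description": "Fields to extract from purchase order documents.",
    "version": "1.0.0",
    "fields": [
        {
            "name": "po_number",
            "type": "string",
            "description": "The purchase order number.",
            "required": True,
        },
    ],
}

request = build_register_request(ontology, "YOUR_API_KEY")
print(request.get_method(), request.full_url)

# To actually send it (requires a valid API key):
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

Keeping request construction separate from sending makes it possible to inspect or log the payload before anything reaches the API.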

5

Create a Matching Extraction Plugin

Your extraction plugin receives the ontology fields and uses them to build the LLM prompt. Here is a complete example:

po_extractor.py (python)
import json

from bizsupply_sdk import ExtractionPlugin, PluginError


class PurchaseOrderExtractor(ExtractionPlugin):
    """Extracts structured data from purchase orders using an LLM."""

    name = "po-extractor"
    version = "1.0.0"

    # Maximum number of document characters included in the prompt.
    max_content_length: int = 8000

    def extract(self, document, fields: list[dict]) -> dict:
        if not document.content:
            raise PluginError("Document has no text content.", retryable=False)

        # Format the ontology fields for the prompt
        fields_text = self.format_fields_for_prompt(fields)

        prompt = f"""You are a document extraction assistant. Extract the following fields
from this purchase order document. Return a valid JSON object with the field values.
If a field is not found in the document, set its value to null.

Fields to extract:
{fields_text}

Document content:
{document.content[:self.max_content_length]}

Return ONLY a valid JSON object containing only the fields listed above.
Do not include markdown formatting or explanation."""

        result = self.prompt_llm(prompt, temperature=0.1)

        # Parse the LLM response, tolerating a Markdown code fence around the JSON
        try:
            clean = result.strip()
            if clean.startswith("```"):
                # Drop the opening fence line (e.g. ```json) and the closing fence
                clean = clean.partition("\n")[2]
                clean = clean.rsplit("```", 1)[0]
            extracted = json.loads(clean)
        except json.JSONDecodeError as exc:
            raise PluginError(
                f"LLM returned invalid JSON: {result[:200]}",
                retryable=True,
            ) from exc

        if not isinstance(extracted, dict):
            raise PluginError(
                f"LLM returned JSON that is not an object: {result[:200]}",
                retryable=True,
            )

        # Warn about required fields that are absent or null. Compare against None
        # so that legitimate falsy values (0, False) are not reported as missing.
        required_fields = [f["name"] for f in fields if f.get("required")]
        missing = [name for name in required_fields if extracted.get(name) is None]
        if missing:
            self.log("warning", f"Missing required fields: {missing}")

        return extracted
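
The fence-stripping and JSON-parsing step is the part of the plugin that is easiest to test in isolation. A minimal sketch, assuming you factor it into a standalone function: `parse_llm_json` is a hypothetical name, and it raises `ValueError` rather than the SDK's `PluginError` so that it runs without the SDK installed.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker that some LLMs wrap around JSON


def parse_llm_json(raw: str) -> dict:
    """Parse an LLM reply into a dict, tolerating a surrounding code fence."""
    clean = raw.strip()
    if clean.startswith(FENCE):
        # Drop the opening fence line (which may carry a language tag)
        # and everything from the closing fence onward.
        clean = clean.partition("\n")[2]
        clean = clean.rsplit(FENCE, 1)[0]
    try:
        data = json.loads(clean)
    except json.JSONDecodeError as exc:
        raise ValueError(f"invalid JSON from LLM: {raw[:200]}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    return data


plain = parse_llm_json('{"po_number": "PO-2026-00042"}')
fenced = parse_llm_json(FENCE + 'json\n{"total_amount": 0}\n' + FENCE)
print(plain, fenced)
```

Inside the plugin, the `ValueError` would be translated into a retryable `PluginError`, as in the example above.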

6

Wire It Into a Pipeline

Reference the ontology ID and extraction plugin when creating your pipeline. See Create a Pipeline for the full pipeline creation workflow.

7

Execute and Verify

Run the pipeline on a small test batch and inspect the extracted fields:

bash
# Execute with a small batch
curl -X POST "https://api.bizsupply.com/v1/pipelines/pip_po_pipeline/execute" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"options": {"max_documents": 3}}'

# Check job results
curl -X GET "https://api.bizsupply.com/v1/jobs/JOB_ID/documents" \
  -H "Authorization: Bearer YOUR_API_KEY"

8

Iterate on Field Descriptions

The quality of extracted data depends heavily on field descriptions. If the LLM is extracting incorrect values, improve the description text to be more specific about what to look for and how to format the value.

ℹ️ Note

Field descriptions are directly included in LLM prompts via format_fields_for_prompt(). A clear, specific description like "The purchase order number, typically formatted as PO-YYYY-NNNNN" produces better results than a vague one like "PO number".
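
To see why the description text matters, it helps to picture how a field list might be rendered into prompt text. The function below is a hypothetical stand-in written for illustration; it is not the SDK's `format_fields_for_prompt()` implementation, whose actual output format is not documented in this guide.

```python
def render_fields_for_prompt(fields: list[dict], indent: int = 0) -> str:
    """Render ontology fields as a bulleted list suitable for an LLM prompt."""
    pad = "  " * indent
    lines = []
    for f in fields:
        parts = [f["type"]]
        if "required" in f:
            parts.append("required" if f["required"] else "optional")
        line = f"{pad}- {f['name']} ({', '.join(parts)}): {f.get('description', '')}"
        if f.get("allowed_values"):
            line += " Allowed values: " + ", ".join(str(v) for v in f["allowed_values"]) + "."
        lines.append(line)
        if f.get("items"):
            # Sub-fields of an array are listed one level deeper.
            lines.append(render_fields_for_prompt(f["items"], indent + 1))
    return "\n".join(lines)


fields = [
    {
        "name": "po_number",
        "type": "string",
        "required": True,
        "description": "The purchase order number, typically formatted as PO-YYYY-NNNNN.",
    },
    {
        "name": "line_items",
        "type": "array",
        "required": True,
        "description": "Individual items on the purchase order.",
        "items": [
            {"name": "quantity", "type": "number", "description": "Ordered quantity."},
        ],
    },
]

print(render_fields_for_prompt(fields))
```

Whatever the real rendering looks like, the description is the only free-text guidance the model receives for each field, which is why vague descriptions lead to vague extractions.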

9

Version and Update

When you need to add or modify fields, create a new ontology version rather than modifying the existing one. This ensures previously extracted documents remain consistent.

bash
curl -X POST "https://api.bizsupply.com/v1/ontologies" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "taxonomy": "purchase_order",
    "version": "1.1.0",
    "description": "PO extraction — added approval_status field.",
    "fields": [ ... ]
  }'

Then update your pipeline to reference the new ontology version.
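
Before registering a new version, it can help to list exactly what changed relative to the previous one. A minimal sketch, assuming both versions are available as Python dicts in the ontology format used in this guide. `bump_minor` and `diff_fields` are hypothetical helpers, not SDK functions, and the type and description given to `approval_status` below are illustrative assumptions.

```python
def bump_minor(version: str) -> str:
    """Return the next minor version, e.g. for a backward-compatible field addition."""
    major, minor, _patch = (int(part) for part in version.split("."))
    return f"{major}.{minor + 1}.0"


def diff_fields(old: dict, new: dict) -> dict:
    """Compare two ontology versions by top-level field name and definition."""
    old_fields = {f["name"]: f for f in old["fields"]}
    new_fields = {f["name"]: f for f in new["fields"]}
    return {
        "added": sorted(new_fields.keys() - old_fields.keys()),
        "removed": sorted(old_fields.keys() - new_fields.keys()),
        "changed": sorted(
            name
            for name in old_fields.keys() & new_fields.keys()
            if old_fields[name] != new_fields[name]
        ),
    }


v1 = {
    "version": "1.0.0",
    "fields": [
        {"name": "po_number", "type": "string", "required": True},
        {"name": "tax_amount", "type": "number", "required": False},
    ],
}

v2 = {
    "version": bump_minor(v1["version"]),
    "fields": [
        {"name": "po_number", "type": "string", "required": True},
        {"name": "tax_amount", "type": "number", "required": True},  # now required
        # Assumed definition, for illustration only:
        {"name": "approval_status", "type": "string", "required": False},
    ],
}

print(v2["version"], diff_fields(v1, v2))
```

A non-empty `removed` or `changed` list is a signal to check that downstream consumers of previously extracted documents can handle the new version.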


Field Type Reference

| Type    | Description                         | Example Value        | Notes                                                            |
|---------|-------------------------------------|----------------------|------------------------------------------------------------------|
| string  | Text value.                         | "Acme Corp"          | Max length 10,000 characters by default.                         |
| number  | Numeric value (integer or decimal). | 2450.00              | Stored as a floating-point number.                               |
| date    | Date value.                         | "2026-01-15"         | Expected format: YYYY-MM-DD. The LLM normalizes various formats. |
| boolean | True or false.                      | true                 | Accepts true/false, yes/no, 1/0 from LLM output.                 |
| array   | List of objects.                    | [{"desc": "Item 1"}] | Must define sub-fields via the items property.                   |
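
The Notes column implies some normalization of raw LLM output into typed values. The sketch below shows how such normalization might look if you apply it yourself in a plugin; `coerce_value` is a hypothetical helper, and how the platform performs this conversion internally is not specified in this guide.

```python
from datetime import date

TRUE_WORDS = {"true", "yes", "1"}
FALSE_WORDS = {"false", "no", "0"}


def coerce_value(value, field_type: str):
    """Convert a raw extracted value to the ontology type, or raise ValueError."""
    if value is None:
        return None
    if field_type == "string":
        return str(value)
    if field_type == "number":
        return float(value)
    if field_type == "date":
        # Validates the YYYY-MM-DD shape and returns the normalized string.
        return date.fromisoformat(str(value)).isoformat()
    if field_type == "boolean":
        if isinstance(value, bool):
            return value
        text = str(value).strip().lower()
        if text in TRUE_WORDS:
            return True
        if text in FALSE_WORDS:
            return False
        raise ValueError(f"cannot interpret {value!r} as a boolean")
    if field_type == "array":
        if not isinstance(value, list):
            raise ValueError(f"expected a list, got {type(value).__name__}")
        return value
    raise ValueError(f"unknown field type: {field_type}")


print(coerce_value("2450.00", "number"))
print(coerce_value("Yes", "boolean"))
print(coerce_value("2026-01-15", "date"))
```

Raising on values that cannot be converted, rather than silently passing them through, makes extraction problems visible during the test runs described earlier.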

Common Issues

| Issue | Cause | Resolution |
|-------|-------|------------|
| LLM returns fields not in the ontology | Prompt does not restrict output to ontology fields. | Use format_fields_for_prompt() and instruct the LLM to return ONLY the listed fields. |
| Date fields have inconsistent formats | Source documents use varied date formats. | Add format guidance in the field description: "Date in YYYY-MM-DD format." |
| Array fields return a flat string | LLM did not parse the table or list structure correctly. | Increase max_content_length to include the full table. Add explicit formatting instructions. |
| Required field returns null | Field is not present in the document, or the LLM missed it. | Improve field description specificity. Consider making the field optional if it is legitimately absent in some documents. |
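
Several of these issues can be detected automatically once a record has been extracted. A minimal sketch, assuming the extracted record and the ontology fields are plain Python dicts shaped like the examples in this guide; `audit_record` is a hypothetical helper rather than an SDK function.

```python
import re

DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def audit_record(record: dict, fields: list[dict]) -> list[str]:
    """Return human-readable findings for one extracted record."""
    issues = []

    # Fields the LLM returned that the ontology does not define
    known = {f["name"] for f in fields}
    for name in sorted(record.keys() - known):
        issues.append(f"{name}: not defined in the ontology")

    for f in fields:
        name = f["name"]
        value = record.get(name)
        if value is None:
            if f.get("required"):
                issues.append(f"{name}: required field is null or missing")
            continue
        if f["type"] == "date" and not DATE_RE.match(str(value)):
            issues.append(f"{name}: date is not in YYYY-MM-DD format")
        if f["type"] == "array" and not isinstance(value, list):
            issues.append(f"{name}: expected a list, got {type(value).__name__}")
        if f.get("allowed_values") and value not in f["allowed_values"]:
            issues.append(f"{name}: value {value!r} is not in allowed_values")
    return issues


fields = [
    {"name": "po_number", "type": "string", "required": True},
    {"name": "order_date", "type": "date", "required": True},
    {"name": "line_items", "type": "array", "required": True},
    {"name": "currency", "type": "string", "required": True,
     "allowed_values": ["USD", "EUR", "GBP"]},
]

record = {
    "po_number": None,
    "order_date": "15/01/2026",
    "line_items": "Widget x 2",
    "currency": "USD",
    "notes": "rush order",
}

for issue in audit_record(record, fields):
    print(issue)
```

Running a check like this over a test batch shows which of the issues above occur most often, and therefore which field descriptions to revise first.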

Best Practices

  • Write precise field descriptions — the description is the single most important factor in extraction quality. Be specific about location, format, and common variations.
  • Start with required fields only — add optional fields after you confirm the required ones extract correctly.
  • Use allowed_values sparingly — only for fields with a genuinely fixed set of valid values. Over-constraining can cause extraction failures.
  • Version your ontologies — never modify a production ontology in place. Create a new version and update your pipeline reference.
  • Test with real documents — synthetic test documents often do not capture the variability of real-world documents. Test with at least 10 real samples.
  • Include units in descriptions — for numeric fields, specify the expected unit (e.g., "Total amount in the document currency, as a decimal number").