Create an Ontology
Design, validate, and register an ontology to define exactly what structured data to extract from your documents.
An ontology defines the structured data you want to extract from a specific document type. This guide walks you through designing, creating, registering, and testing an ontology from scratch.
Design Your Schema
Start by identifying the document type you are targeting and listing every field you need to extract. For each field, determine its type, whether it is required, and any validation constraints.
Consider these questions:
- What document type does this ontology apply to? (This becomes the taxonomy value.)
- What fields are always present on this document type? (These are required fields.)
- What fields are sometimes present? (These are optional fields.)
- Are there fields with a fixed set of allowed values? (Use allowed_values.)
- Are there nested or repeated structures? (Use the array type with item definitions.)
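The answers to these questions map onto the keys used in the ontology file. As a rough sketch, the design can be captured in plain Python before any YAML is written. The `FieldSpec` dataclass and `to_ontology` function below are hypothetical illustrations, not part of the bizsupply SDK.

```python
from dataclasses import dataclass, field


@dataclass
class FieldSpec:
    """Hypothetical helper: one planned ontology field."""
    name: str
    type: str                      # string, number, date, boolean, or array
    description: str
    required: bool = False         # is the field always present on the document?
    allowed_values: list[str] = field(default_factory=list)
    items: list["FieldSpec"] = field(default_factory=list)  # sub-fields for arrays

    def to_dict(self, include_required: bool = True) -> dict:
        """Render the field using the same keys as the ontology YAML."""
        out = {"name": self.name, "type": self.type, "description": self.description}
        if include_required:
            out["required"] = self.required
        if self.allowed_values:
            out["allowed_values"] = list(self.allowed_values)
        if self.items:
            # Sub-fields in the purchase order example carry no "required" key.
            out["items"] = [sub.to_dict(include_required=False) for sub in self.items]
        return out


def to_ontology(taxonomy: str, description: str, version: str,
                fields: list[FieldSpec]) -> dict:
    """Assemble the top-level ontology structure from the planned fields."""
    return {
        "taxonomy": taxonomy,
        "description": description,
        "version": version,
        "fields": [f.to_dict() for f in fields],
    }


draft = to_ontology(
    "purchase_order",
    "Fields to extract from purchase order documents.",
    "1.0.0",
    [
        FieldSpec("po_number", "string", "The purchase order number.", required=True),
        FieldSpec("currency", "string", "ISO 4217 currency code.", required=True,
                  allowed_values=["USD", "EUR", "GBP"]),
        FieldSpec("line_items", "array", "Individual items on the purchase order.",
                  required=True,
                  items=[FieldSpec("quantity", "number", "Ordered quantity.")]),
    ],
)
```

The resulting `draft` dict has the same shape as the YAML file, so it can be serialized with whatever YAML or JSON tooling you already use.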
Write the Ontology YAML
Ontologies are defined in YAML format. Here is a complete example for a purchase order:
taxonomy: purchase_order
description: "Fields to extract from purchase order documents."
version: "1.0.0"
fields:
  - name: po_number
    type: string
    description: "The purchase order number."
    required: true
  - name: vendor_name
    type: string
    description: "Name of the supplier or vendor."
    required: true
  - name: order_date
    type: date
    description: "Date the purchase order was issued (YYYY-MM-DD)."
    required: true
  - name: delivery_date
    type: date
    description: "Expected delivery date (YYYY-MM-DD)."
    required: false
  - name: shipping_address
    type: string
    description: "Full shipping/delivery address."
    required: false
  - name: payment_terms
    type: string
    description: "Payment terms (e.g., Net 30, Net 60, Due on Receipt)."
    required: false
    allowed_values:
      - "Net 15"
      - "Net 30"
      - "Net 60"
      - "Net 90"
      - "Due on Receipt"
  - name: currency
    type: string
    description: "ISO 4217 currency code."
    required: true
    allowed_values: [USD, EUR, GBP, CHF, JPY, CAD, AUD]
  - name: subtotal
    type: number
    description: "Order subtotal before tax and shipping."
    required: true
  - name: tax_amount
    type: number
    description: "Total tax amount."
    required: false
  - name: total_amount
    type: number
    description: "Grand total including tax and shipping."
    required: true
  - name: line_items
    type: array
    description: "Individual items on the purchase order."
    required: true
    items:
      - name: item_number
        type: string
        description: "Item or SKU number."
      - name: description
        type: string
        description: "Item description."
      - name: quantity
        type: number
        description: "Ordered quantity."
      - name: unit_price
        type: number
        description: "Price per unit."
      - name: amount
        type: number
        description: "Line total (quantity x unit_price)."
Validate with the SDK
Use the SDK to validate your ontology before registering it:
bizsupply ontology validate ./purchase_order.yaml
# ✓ Taxonomy: purchase_order
# ✓ Fields: 11 total (7 required, 4 optional)
# ✓ Array field "line_items" has 5 sub-fields
# ✓ Allowed values validated for: payment_terms, currency
# ✓ No circular references detected
# All checks passed.
Register the Ontology
Register the ontology with the platform via the API:
curl -X POST "https://api.bizsupply.com/v1/ontologies" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "taxonomy": "purchase_order",
    "description": "Fields to extract from purchase order documents.",
    "version": "1.0.0",
    "fields": [
      {
        "name": "po_number",
        "type": "string",
        "description": "The purchase order number.",
        "required": true
      },
      {
        "name": "vendor_name",
        "type": "string",
        "description": "Name of the supplier or vendor.",
        "required": true
      },
      {
        "name": "total_amount",
        "type": "number",
        "description": "Grand total including tax and shipping.",
        "required": true
      }
    ]
  }'
(The example above is abbreviated; include all fields from your YAML definition.)
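Because the request body mirrors the YAML definition, it can be generated rather than copied by hand. The sketch below assumes the ontology has already been loaded into a Python dict (for example with a YAML parser) and only builds the request object without sending it. `build_registration_request` is a hypothetical helper; the endpoint and headers are taken from the curl example above.

```python
import json
import urllib.request

ONTOLOGIES_URL = "https://api.bizsupply.com/v1/ontologies"  # from the curl example


def build_registration_request(ontology: dict, api_key: str) -> urllib.request.Request:
    """Hypothetical helper: build (but do not send) the registration request."""
    body = json.dumps(ontology).encode("utf-8")
    return urllib.request.Request(
        ONTOLOGIES_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


# Abbreviated here; in practice this dict would hold every field from the YAML file.
ontology = {
    "taxonomy": "purchase_order",
    "description": "Fields to extract from purchase order documents.",
    "version": "1.0.0",
    "fields": [
        {
            "name": "po_number",
            "type": "string",
            "description": "The purchase order number.",
            "required": True,
        },
    ],
}

request = build_registration_request(ontology, "YOUR_API_KEY")
# Sending is a separate, deliberate step: urllib.request.urlopen(request)
```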
Example response:
{
  "id": "ont_po_v1",
  "taxonomy": "purchase_order",
  "version": "1.0.0",
  "field_count": 11,
  "status": "active",
  "created_at": "2026-01-18T09:00:00Z"
}
Create a Matching Extraction Plugin
Your Extraction plugin receives the ontology fields and uses them to construct LLM prompts. Here is a complete example:
import json

from bizsupply_sdk import ExtractionPlugin, PluginError


class PurchaseOrderExtractor(ExtractionPlugin):
    """Extracts structured data from purchase orders using an LLM."""

    name = "po-extractor"
    version = "1.0.0"
    max_content_length: int = 8000

    def extract(self, document, fields: list[dict]) -> dict:
        if not document.content:
            raise PluginError("Document has no text content.", retryable=False)

        # Format fields for the prompt
        fields_text = self.format_fields_for_prompt(fields)
        prompt = f"""You are a document extraction assistant. Extract the following fields
from this purchase order document. Return a valid JSON object with the field values.
If a field is not found in the document, use null for optional fields.
Fields to extract:
{fields_text}
Document content:
{document.content[:self.max_content_length]}
Return ONLY a valid JSON object. Do not include markdown formatting or explanation."""
        result = self.prompt_llm(prompt, temperature=0.1)

        # Parse the LLM response, stripping a Markdown code fence if the model added one
        try:
            clean = result.strip()
            if clean.startswith("```"):
                # [-1] avoids an IndexError when the fence has no trailing newline
                clean = clean.split("\n", 1)[-1]
                clean = clean.rsplit("```", 1)[0]
            extracted = json.loads(clean)
        except json.JSONDecodeError as exc:
            raise PluginError(
                f"LLM returned invalid JSON: {result[:200]}",
                retryable=True,
            ) from exc
        if not isinstance(extracted, dict):
            raise PluginError("LLM returned JSON that is not an object.", retryable=True)

        # Validate required fields. Compare against None so that legitimate
        # falsy values such as 0 or false are not reported as missing.
        required_fields = [f["name"] for f in fields if f.get("required")]
        missing = [f for f in required_fields if extracted.get(f) is None]
        if missing:
            self.log("warning", f"Missing required fields: {missing}")
        return extracted
Wire It Into a Pipeline
Reference the ontology ID and extraction plugin when creating your pipeline. See Create a Pipeline for the full pipeline creation workflow.
Execute and Verify
Run the pipeline on a small test batch and inspect the extracted fields:
# Execute with a small batch
curl -X POST "https://api.bizsupply.com/v1/pipelines/pip_po_pipeline/execute" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"options": {"max_documents": 3}}'
# Check job results
curl -X GET "https://api.bizsupply.com/v1/jobs/JOB_ID/documents" \
  -H "Authorization: Bearer YOUR_API_KEY"
Iterate on Field Descriptions
Extraction quality depends heavily on field descriptions. If the LLM extracts incorrect values, make the description more specific about what to look for and how to format the value.
Field descriptions are directly included in LLM prompts via format_fields_for_prompt(). A clear, specific description like "The purchase order number, typically formatted as PO-YYYY-NNNNN" produces better results than a vague one like "PO number".
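To see why wording matters, it helps to picture how a description might appear inside the prompt. This guide does not show the actual output of `format_fields_for_prompt()`, so the `render_fields` function below is a hypothetical stand-in that only illustrates the idea.

```python
def render_fields(fields: list[dict]) -> str:
    """Hypothetical stand-in for the SDK formatter: one prompt line per field."""
    lines = []
    for f in fields:
        flag = "required" if f.get("required") else "optional"
        line = f"- {f['name']} ({f['type']}, {flag}): {f['description']}"
        if f.get("allowed_values"):
            line += " Allowed values: " + ", ".join(f["allowed_values"]) + "."
        lines.append(line)
    return "\n".join(lines)


vague = {
    "name": "po_number",
    "type": "string",
    "required": True,
    "description": "PO number",
}
specific = {
    **vague,
    "description": "The purchase order number, typically formatted as PO-YYYY-NNNNN",
}

print(render_fields([vague]))
print(render_fields([specific]))
```

With the vague description the model sees little more than the field name repeated, while the specific one gives it a pattern to match against.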
Version and Update
When you need to add or modify fields, create a new ontology version rather than modifying the existing one. This keeps previously extracted documents consistent with the version that produced them.
curl -X POST "https://api.bizsupply.com/v1/ontologies" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "taxonomy": "purchase_order",
    "version": "1.1.0",
    "description": "PO extraction: added approval_status field.",
    "fields": [ ... ]
  }'
Then update your pipeline to reference the new ontology version.
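Before publishing a new version, it can help to list exactly what changed relative to the previous one. The `diff_fields` function below is a hypothetical sketch that compares two ontology dicts by field name; it is not an SDK or API feature.

```python
def diff_fields(old: dict, new: dict) -> dict:
    """Hypothetical helper: report added, removed, and changed fields by name."""
    old_fields = {f["name"]: f for f in old["fields"]}
    new_fields = {f["name"]: f for f in new["fields"]}
    return {
        "added": sorted(new_fields.keys() - old_fields.keys()),
        "removed": sorted(old_fields.keys() - new_fields.keys()),
        "changed": sorted(
            name
            for name in old_fields.keys() & new_fields.keys()
            if old_fields[name] != new_fields[name]
        ),
    }


v1_0 = {
    "version": "1.0.0",
    "fields": [
        {"name": "po_number", "type": "string", "required": True},
        {"name": "tax_amount", "type": "number", "required": False},
    ],
}
v1_1 = {
    "version": "1.1.0",
    "fields": [
        {"name": "po_number", "type": "string", "required": True},
        {"name": "tax_amount", "type": "number", "required": True},
        {"name": "approval_status", "type": "string", "required": False},
    ],
}

changes = diff_fields(v1_0, v1_1)
print(changes)
```

A non-empty `removed` or `changed` list is a hint that documents extracted under the old version will not line up field-for-field with the new one.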
Field Type Reference
| Type | Description | Example Value | Notes |
|---|---|---|---|
| string | Text value. | "Acme Corp" | Max length 10,000 characters by default. |
| number | Numeric value (integer or decimal). | 2450.00 | Stored as a floating-point number. |
| date | Date value. | "2026-01-15" | Expected format: YYYY-MM-DD. The LLM normalizes various formats. |
| boolean | True or false. | true | Accepts true/false, yes/no, 1/0 from LLM output. |
| array | List of objects. | [{"desc": "Item 1"}] | Must define sub-fields via the items property. |
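The table can also be read as a set of normalization rules. As an illustration only, the `coerce_value` function below applies them to raw LLM output on the client side. It is a hypothetical sketch, and the platform's own normalization may behave differently.

```python
from datetime import date


def coerce_value(value, field_type: str):
    """Hypothetical helper: normalize a raw value to the given ontology type."""
    if value is None:
        return None
    if field_type == "string":
        return str(value)
    if field_type == "number":
        if isinstance(value, bool):
            raise ValueError("boolean is not a number")
        return float(value)
    if field_type == "date":
        # Raises ValueError if the value is not a valid ISO 8601 date
        # such as 2026-01-15.
        return date.fromisoformat(str(value)).isoformat()
    if field_type == "boolean":
        text = str(value).strip().lower()
        if text in {"true", "yes", "1"}:
            return True
        if text in {"false", "no", "0"}:
            return False
        raise ValueError(f"not a boolean: {value!r}")
    if field_type == "array":
        if not isinstance(value, list):
            raise ValueError("array fields must be a list of objects")
        return value
    raise ValueError(f"unknown field type: {field_type}")


print(coerce_value("2450.00", "number"))
print(coerce_value("Yes", "boolean"))
print(coerce_value("2026-01-15", "date"))
```

Raising on values that do not fit, rather than guessing, keeps bad extractions visible during testing.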
Common Issues
| Issue | Cause | Resolution |
|---|---|---|
| LLM returns fields not in the ontology | Prompt does not restrict output to ontology fields. | Use format_fields_for_prompt() and instruct the LLM to return ONLY the listed fields. |
| Date fields have inconsistent formats | Source documents use varied date formats. | Add format guidance in the field description: "Date in YYYY-MM-DD format." |
| Array fields return a flat string | LLM did not parse the table or list structure correctly. | Increase max_content_length to include the full table. Add explicit formatting instructions. |
| Required field returns null | Field is not present in the document, or the LLM missed it. | Improve field description specificity. Consider making the field optional if it is legitimately absent in some documents. |
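Several of these issues can be caught with a small post-processing step before results are stored. The `check_extraction` function below is a hypothetical sketch, not an SDK feature. It drops keys that are not in the ontology, then reports required fields that came back null and values outside `allowed_values`.

```python
def check_extraction(extracted: dict, fields: list[dict]) -> tuple[dict, list[str]]:
    """Hypothetical helper: return (cleaned result, list of problems found)."""
    known = {f["name"]: f for f in fields}
    problems = [f"unexpected field dropped: {k}" for k in extracted if k not in known]
    cleaned = {k: v for k, v in extracted.items() if k in known}

    for name, spec in known.items():
        value = cleaned.get(name)
        if value is None:
            if spec.get("required"):
                problems.append(f"required field is null: {name}")
            continue
        allowed = spec.get("allowed_values")
        if allowed and value not in allowed:
            problems.append(f"value not in allowed_values: {name}={value!r}")
    return cleaned, problems


fields = [
    {"name": "po_number", "type": "string", "required": True},
    {"name": "currency", "type": "string", "required": True,
     "allowed_values": ["USD", "EUR", "GBP"]},
]
cleaned, problems = check_extraction(
    {"po_number": None, "currency": "US Dollars", "notes": "rush order"},
    fields,
)
for problem in problems:
    print(problem)
```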
Best Practices
- Write precise field descriptions — the description is the single most important factor in extraction quality. Be specific about location, format, and common variations.
- Start with required fields only — add optional fields after you confirm the required ones extract correctly.
- Use allowed_values sparingly — only for fields with a genuinely fixed set of valid values. Over-constraining can cause extraction failures.
- Version your ontologies — never modify a production ontology in place. Create a new version and update your pipeline reference.
- Test with real documents — synthetic test documents often do not capture the variability of real-world documents. Test with at least 10 real samples.
- Include units in descriptions — for numeric fields, specify the expected unit (e.g., "Total amount in the document currency, as a decimal number").
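When testing against a batch of real documents, a per-field summary makes weak descriptions easier to spot. The sketch below assumes each extraction result is available as a plain dict. `fill_rates` is a hypothetical helper and does not call the platform.

```python
def fill_rates(results: list[dict], fields: list[dict]) -> dict[str, float]:
    """Fraction of documents in which each field came back non-null."""
    if not results:
        return {f["name"]: 0.0 for f in fields}
    return {
        f["name"]: sum(r.get(f["name"]) is not None for r in results) / len(results)
        for f in fields
    }


fields = [{"name": "po_number"}, {"name": "delivery_date"}]
results = [
    {"po_number": "PO-2026-00001", "delivery_date": None},
    {"po_number": "PO-2026-00002", "delivery_date": "2026-02-01"},
    {"po_number": None, "delivery_date": None},
    {"po_number": "PO-2026-00004"},
]

rates = fill_rates(results, fields)
print(rates)
```

A required field with a low rate is a candidate for a sharper description, or for being made optional if it is legitimately absent from some documents.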