The 5-document rule for new pipelines
Always run a new pipeline with max_documents: 5 before unleashing it on a full source. The cost of waiting an extra five minutes is much lower than the cost of a misconfigured run.
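Concretely, a smoke-test run might look like the fragment below. This is a sketch: the two keys are the ones named in this post, but the surrounding file layout is an assumption.

```yaml
# Smoke-test configuration: cap the trial run at five documents
# before committing credits to the full source.
pipeline:                  # assumed top-level key; layout is illustrative
  max_documents: 5         # hard cap for the trial run
  skip_duplicates: true    # don't waste the five slots on repeats
```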
Every new pipeline has at least one wrong assumption baked into it: the classifier prompt is a little too permissive, the ontology has an extra required field, the extraction prompt assumes the documents are in English. None of those assumptions is visible until you see real output, and by then you have either spent five minutes on a trial run or burned credits on a full one.
Set `max_documents: 5` and `skip_duplicates: true`. Read every output. Then scale.

What you are checking
Five documents is enough to spot the four most common configuration problems:
- Classification overreach: non-target documents being labelled as in-scope.
- Schema mismatches: required fields the LLM cannot find, returning nulls or invalid JSON.
- Truncation losses: `max_content_length` cutting off the page where the total appears.
- Prompt drift: extraction returning summaries instead of structured fields.
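Truncation losses are the easiest of these to miss, because the output looks partially correct. A hedged config sketch, assuming `max_content_length` lives alongside the other keys (the value shown is illustrative, not a recommendation):

```yaml
# Truncation check: if totals or signatures sit at the end of the
# document, make sure the cap doesn't cut them off.
max_content_length: 50000   # illustrative value; raise it if key fields
                            # appear near the end of long documents
```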
What to do with the output
If all five outputs look right, increase `max_documents` to 50 and re-check. If even one looks wrong, fix the configuration and re-run the same five before going wider. The goal is to catch a misconfiguration on a sample of five, not on a sample of five thousand.
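The scale-up itself is a one-line change. A sketch, assuming `max_documents` sits wherever you set it for the five-document pass:

```yaml
max_documents: 50   # second checkpoint; only after the 5-document pass is clean
```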