The 5-document rule for new pipelines
Always run a new pipeline with max_documents: 5 before unleashing it on a full source. The cost of waiting an extra five minutes is much lower than the cost of a misconfigured run.
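Concretely, a smoke-test run might look like the fragment below. This is a sketch: the two keys are the ones named in this post, but the surrounding file layout is an assumption.

```yaml
# Smoke-test configuration: cap the trial run at five documents
# before committing credits to the full source.
pipeline:                  # assumed top-level key; layout is illustrative
  max_documents: 5         # hard cap for the trial run
  skip_duplicates: true    # don't waste the five slots on repeats
```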
Every new pipeline has at least one wrong assumption baked into it: the classifier prompt is a little too permissive, the ontology has an extra required field, the extraction prompt assumes the documents are in English. None of those assumptions is visible until you see real output, and by then you have either spent five minutes on a trial run or burned credits on a full one.
Set `max_documents: 5` and `skip_duplicates: true`. Read every output. Then scale.

What you are checking
Five documents is enough to spot the four most common configuration problems:
- Classification overreach: non-target documents being labelled as in-scope.
- Schema mismatches: required fields the LLM cannot find, returning nulls or invalid JSON.
- Truncation losses: `max_content_length` cutting off the page where the total appears.
- Prompt drift: extraction returning summaries instead of structured fields.
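Truncation losses are the easiest of these to miss, because the output looks partially correct. A hedged config sketch, assuming `max_content_length` lives alongside the other keys (the value shown is illustrative, not a recommendation):

```yaml
# Truncation check: if totals or signatures sit at the end of the
# document, make sure the cap doesn't cut them off.
max_content_length: 50000   # illustrative value; raise it if key fields
                            # appear near the end of long documents
```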
What to do with the output
If all five outputs look right, increase `max_documents` to 50 and re-check. If even one looks wrong, fix the configuration and re-run the same five before going wider. The goal is to catch a misconfiguration on a sample of five, not on a sample of five thousand.
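The scale-up itself is a one-line change. A sketch, assuming `max_documents` sits wherever you set it for the five-document pass:

```yaml
max_documents: 50   # second checkpoint; only after the 5-document pass is clean
```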