Validate your data source before connecting it

A 50,000-email mailbox that is 90% spam is 45,000 wasted extractions. Spot-check the source before you let bizSupply loose on it.

bizSupply is happy to process whatever you connect. That is a feature when the source is curated and a problem when it is not. Pointing a pipeline at a generic inbox or a shared drive without checking what it contains is the single most common way to burn through your monthly credit allocation without seeing any value.

The pitfall in one line

If most of the documents in your source are not the type you are extracting, you will pay to extract noise and discover that almost nothing useful came out the other end.

Why it happens

bizSupply meters every successful AI call. Classification runs on every document the source returns; extraction runs on every document classification accepts. A "junk" PDF that looks contract-shaped at a quick glance still gets a full extraction call, and you pay for that call even when the result is empty or wrong.

From the docs: classification consumes 1–3 credits per document, a standard extraction 3–8 credits, and complex ontologies with nested arrays can hit 10–25 credits per document. Multiply that by tens of thousands of documents in an unfiltered source and the math gets ugly fast.
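A rough way to see how those ranges add up is the sketch below. It is only an estimate under stated assumptions: every document is classified, and a fraction you supply (`accepted_fraction`, an illustrative input, not a bizSupply setting) goes on to extraction. The function and constant names are made up for this example.

```python
# Per-document credit ranges quoted above, as (low, high).
CLASSIFICATION = (1, 3)
STANDARD_EXTRACTION = (3, 8)
COMPLEX_EXTRACTION = (10, 25)


def estimate_credits(total_docs, accepted_fraction, extraction=STANDARD_EXTRACTION):
    """Return a (low, high) credit estimate for one pass over a source.

    Assumption: every document is classified, and `accepted_fraction`
    (0.0 to 1.0) of the documents is passed on to extraction.
    """
    if not 0.0 <= accepted_fraction <= 1.0:
        raise ValueError("accepted_fraction must be between 0 and 1")
    extracted_docs = round(total_docs * accepted_fraction)
    low = total_docs * CLASSIFICATION[0] + extracted_docs * extraction[0]
    high = total_docs * CLASSIFICATION[1] + extracted_docs * extraction[1]
    return low, high


if __name__ == "__main__":
    # 10,000 documents, half of them accepted for a standard extraction.
    print(estimate_credits(10_000, 0.5))
```

Passing `COMPLEX_EXTRACTION` as the third argument gives the same estimate for an ontology with nested arrays.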

What to do instead

Treat connecting a source as a two-step process: verify, then ingest. Spend ten minutes confirming the source is worth it before you spend any credits.

·Open the source manually and skim 20–30 documents. Estimate the share that match what you want to extract.
·For mailboxes, restrict the connection to the IMAP folder or label already used for invoices, contracts, or receipts.
·For drives, point the pipeline at the specific subfolder, not the root.
·Use pre-conditions on file size, MIME type, and content length to drop obvious non-matches before extraction.
·Run with max_documents: 5 first and review the output before scaling up.
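The pre-condition idea in the list above can be tried out on a hand-picked sample before you configure anything. The sketch below is plain Python, not bizSupply's configuration syntax; the `Candidate` type, the thresholds, and both function names are hypothetical and exist only to illustrate the checks.

```python
from dataclasses import dataclass

# Hypothetical thresholds; tune them to your own documents.
ALLOWED_MIME_TYPES = {"application/pdf"}
MIN_SIZE_BYTES = 10_000       # skip tiny files such as signature images
MAX_SIZE_BYTES = 20_000_000   # skip very large attachments
MIN_TEXT_CHARS = 500          # skip documents with almost no text


@dataclass
class Candidate:
    name: str
    mime_type: str
    size_bytes: int
    text_chars: int


def passes_preconditions(doc: Candidate) -> bool:
    """Return True only if the document clears every cheap check."""
    return (
        doc.mime_type in ALLOWED_MIME_TYPES
        and MIN_SIZE_BYTES <= doc.size_bytes <= MAX_SIZE_BYTES
        and doc.text_chars >= MIN_TEXT_CHARS
    )


def pass_rate(sample: list[Candidate]) -> float:
    """Share of a sample that clears the checks (0.0 for an empty sample)."""
    if not sample:
        return 0.0
    return sum(passes_preconditions(d) for d in sample) / len(sample)
```

If only a small share of your 20–30 sampled documents clears checks like these, that is a sign the source needs narrowing before it is connected.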

Don't

·Connect a personal or shared mailbox at the root level "to be safe".
·Trust the classifier alone to filter out unrelated documents — it costs credits to run.
·Re-run the same broad source on a tight schedule before checking what came out the first time.
·Skip pre-conditions because "the LLM will figure it out".

A concrete example

Say you want to track supplier contracts and you connect a 15-year-old company mailbox with 50,000 messages. A quick sample says roughly 8% of those messages are contract-related. The other 92% (meeting invites, internal threads, newsletters) still get classified, and many of them get extracted before being marked low-confidence. If every one of those 46,000 non-target messages reached extraction at ~5 credits per attempt, that would be 230,000 credits gone before you have a single useful row of data, with classification billed on top of that.
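The arithmetic behind that example, written out as a small sketch. It takes the upper-bound case in which every non-target message reaches extraction, and uses the 1–3 credit classification range quoted earlier; the variable names are illustrative.

```python
total_messages = 50_000
contract_share = 0.08        # from the quick sample
credits_per_extraction = 5   # the approximate figure used in the example

target = round(total_messages * contract_share)
non_target = total_messages - target

# Upper bound: every non-target message gets one extraction attempt.
wasted_extraction = non_target * credits_per_extraction

# Classification runs on every message, at 1 to 3 credits each.
classification_low = total_messages * 1
classification_high = total_messages * 3

print(non_target)          # 46000
print(wasted_extraction)   # 230000
print(classification_low, classification_high)  # 50000 150000
```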

The fix is upstream: ask IT for a label or shared folder where finance archives signed contracts, point bizSupply there, and leave the noisy inbox alone. The same pattern applies to drives, S3 buckets, and any other connector — the cheapest filter is the one applied before the source plugin hands the document to bizSupply.

Rule of thumb

If you cannot describe the contents of a source in one sentence ("the supplier-contracts shared folder", "the invoices@ inbox from the last 12 months"), it is too broad to connect as-is. Narrow it first.
