Two Paths to Extracting Data from B2B PDFs: Rule-Based vs. LLM-Powered Extraction
Introduction
In the world of B2B document processing, extracting structured data from PDF orders remains a critical yet challenging task. A developer recently tackled this problem twice: first with a traditional rule-based method built on pytesseract, then with a modern large language model (LLM) approach using Ollama and LLaMA 3. Both approaches aimed to parse the same realistic B2B order document, but the results, and the effort required, were strikingly different. This article compares the two methodologies, highlighting their strengths, weaknesses, and practical trade-offs.

The Test Document: A Realistic B2B Order
The test case was a standard PDF containing a purchase order for industrial components. The document included fields such as order number, ship-to address, line items (SKU, quantity, unit price), and total amount. The goal was to extract each field accurately and efficiently, simulating a real-world automation pipeline for an e-procurement system.
Rule-Based Extraction with Pytesseract
How It Works
The rule-based approach relied on pytesseract, a Python wrapper for Google’s Tesseract OCR engine. The workflow: convert the PDF to high-resolution images, apply OCR to extract text, then use regular expressions and positional heuristics (e.g., “look for ‘Order No.’ followed by digits”) to locate and extract the required fields.
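A minimal sketch of that workflow might look like the following. It assumes the pdf2image library (which requires a local Poppler install) for rendering pages; the two regex rules shown are illustrative stand-ins, not the developer's actual patterns.

```python
import re

import pytesseract
from pdf2image import convert_from_path  # assumes Poppler is installed locally

def extract_order_fields(pdf_path: str) -> dict:
    """OCR a purchase-order PDF and pull fields out with regex rules."""
    # Render each page at high resolution so Tesseract has clean glyphs.
    pages = convert_from_path(pdf_path, dpi=300)
    text = "\n".join(pytesseract.image_to_string(page) for page in pages)

    fields = {}
    # Rule: "Order No." followed by an alphanumeric identifier.
    m = re.search(r"Order\s*No\.?\s*[:#]?\s*([A-Z0-9-]+)", text, re.IGNORECASE)
    if m:
        fields["order_number"] = m.group(1)
    # Rule: total amount, e.g. "Total: $12,345.67".
    m = re.search(r"Total\s*:?\s*\$?([\d,]+\.\d{2})", text, re.IGNORECASE)
    if m:
        fields["total"] = m.group(1).replace(",", "")
    return fields
```

Every field needs its own rule of this kind, which is where the hours of iteration described below come from.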
Pros
- Deterministic and predictable: Once tuned, rules produce the same output for identical inputs.
- Low compute cost: No GPU or heavy models needed; runs on a laptop CPU quickly.
- Full control: Every extraction rule is explicit and auditable.
Cons
- Fragile layout dependence: Minor changes in PDF formatting (e.g., a new line break, different font) break the rules.
- High development effort: Required hours of iterating on regex patterns and coordinate thresholds for each field.
- Poor generalization: A system built for one vendor’s order form fails on another’s without extensive rework.
In this test, the rule-based extractor achieved about 85% field accuracy on the sample, but failed completely on a slightly different version of the same document.
LLM-Powered Extraction with Ollama and LLaMA 3
How It Works
The LLM approach used Ollama to run LLaMA 3 locally. The PDF was first OCR’d with pytesseract to extract raw text (no layout heuristics), then that unstructured text was passed to the LLM with a carefully engineered prompt that asked: “Given the following purchase order, extract the fields: order number, ship-to address, line items, total. Return JSON.”
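A sketch of that call, assuming the official ollama Python client and the llama3 model tag, might look like this; the prompt mirrors the one quoted above, while the JSON-parsing details are an assumption.

```python
import json

import ollama  # pip install ollama; assumes an Ollama server running locally

PROMPT_TEMPLATE = (
    "Given the following purchase order, extract the fields: order number, "
    "ship-to address, line items, total. Return JSON.\n\n{text}"
)

def extract_with_llm(ocr_text: str, model: str = "llama3") -> dict:
    """Send raw OCR text to a locally hosted LLaMA 3 and parse the JSON reply."""
    response = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(text=ocr_text)}],
        format="json",  # ask Ollama to constrain the output to valid JSON
    )
    return json.loads(response["message"]["content"])
```

Note that the layout heuristics disappear entirely: the same function works regardless of where a field sits on the page.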
Pros
- Robust to layout variation: The LLM understood the content and could still extract fields even if the text order changed.
- Rapid setup: Development time was under an hour—mostly spent refining the prompt.
- High accuracy: Achieved 98% field accuracy on the original document and maintained it on the variant.
Cons
- Higher latency and cost: Running LLaMA 3 locally required GPU memory; falling back to CPU slowed processing to several seconds per page.
- Non-deterministic: The same prompt may produce slightly different JSON formatting or occasional hallucinations (e.g., inventing a plausible field name).
- Requires prompt engineering skill: The quality hinges on how well the prompt is designed.
Head-to-Head Comparison: Rules vs. LLM
| Dimension | Rule-Based (pytesseract) | LLM (Ollama + LLaMA 3) |
|---|---|---|
| Accuracy on original PDF | 85% | 98% |
| Accuracy on variant PDF | ~30% | 95% |
| Development time | ~8 hours | ~1 hour |
| Runtime per PDF | 0.3 seconds | 4 seconds (GPU) |
| Flexibility | Low (hard-coded rules) | High (text understanding) |
| Maintainability | Poor (rules rot over time) | Good (prompt updates) |
When to Choose Each Approach
Choose Rule-Based When…
- Your PDFs are highly standardized (same template, same layout) and won’t change.
- You need absolute predictability with no tolerance for hallucination.
- You have no GPU and processing speed (sub-second) is critical.
Choose LLM When…
- You face diverse PDF formats from multiple vendors.
- You want quick prototyping and minimal maintenance.
- You accept a small chance of errors (mitigable with validation logic; a sketch follows this list).
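One possible shape for that validation logic is sketched below, with assumed field names (order_number, ship_to_address, line_items, total): reject unexpected keys outright and cross-check the stated total against the line items.

```python
def validate_extraction(data: dict) -> list[str]:
    """Return a list of problems; an empty list means the record looks sane."""
    problems = []
    # Reject hallucinated or missing keys outright.
    expected = {"order_number", "ship_to_address", "line_items", "total"}
    if set(data) != expected:
        problems.append(f"unexpected keys: {sorted(set(data) ^ expected)}")
    # The total should match the sum of the line items (within a cent).
    try:
        computed = sum(
            float(item["quantity"]) * float(item["unit_price"])
            for item in data.get("line_items", [])
        )
        if abs(computed - float(data.get("total", 0))) > 0.01:
            problems.append(f"total {data['total']} != line-item sum {computed:.2f}")
    except (KeyError, TypeError, ValueError) as exc:
        problems.append(f"malformed line items: {exc}")
    return problems
```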
Conclusion: A Hybrid Future?
This practical comparison shows that for many B2B document extraction tasks, LLMs—even local ones like LLaMA 3—offer a compelling advantage in flexibility and accuracy. The rule-based approach still shines in controlled environments, but the LLM’s ability to understand, not just parse, document content makes it the more future-proof choice.

For production systems, a hybrid pipeline may be ideal: use rules to extract critical fields with high certainty, and drop ambiguous sections (like free-form notes) into an LLM for reasoning. The developer who built both extractors concluded that the LLM version took one-tenth the development time and delivered better results, a lesson worth heeding when choosing your next extraction engine.
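Such a router might look like the sketch below, which reuses the extractors and validator from earlier; ocr_pdf is a hypothetical helper wrapping pytesseract, and the rules-first, LLM-fallback ordering is one reasonable design among several.

```python
def hybrid_extract(pdf_path: str) -> dict:
    """Try deterministic rules first; fall back to the LLM for missing fields."""
    # Rule-based pass: cheap, deterministic, covers the standardized fields.
    fields = extract_order_fields(pdf_path)

    required = {"order_number", "ship_to_address", "line_items", "total"}
    missing = required - set(fields)
    if missing:
        # LLM pass only for what the rules could not find.
        ocr_text = ocr_pdf(pdf_path)  # hypothetical helper wrapping pytesseract
        llm_fields = extract_with_llm(ocr_text)
        if not validate_extraction(llm_fields):  # empty list means no problems
            for key in missing:
                fields[key] = llm_fields.get(key)
    return fields
```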
Whether you stick with rules or embrace LLMs, the key is understanding your document landscape. As document formats continue to evolve, the era of write-once-read-many extraction may be giving way to an era of intelligent reading.