1. Best Open‑Source Libraries for Page‑Level Extraction
PDF Extraction
These are the most reliable for clean, page‑segmented text:
1. pdfminer.six
- Very mature, stable, Pythonic.
- Preserves page boundaries naturally.
- Good for text‑heavy PDFs.
- Noise level: low, but layout fidelity is limited.
2. PyMuPDF (fitz)
- Fastest and most robust for mixed‑content PDFs.
- Extracts:
  - text
  - bounding boxes
  - fonts
  - images
- Excellent for downstream diffing because you can get structured blocks.
3. pdfplumber
- Built on pdfminer.six.
- Adds table extraction and a very clean API for page‑wise extraction.
Recommendation:
Use PyMuPDF as your primary extractor. Fall back to pdfminer.six for edge cases.
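A minimal page‑wise extraction sketch with PyMuPDF (the file name is illustrative):

```python
import fitz  # PyMuPDF

def extract_pages(pdf_path: str) -> list[str]:
    """Return the plain text of each page, preserving page boundaries."""
    with fitz.open(pdf_path) as doc:
        # page.get_text("blocks") instead yields structured blocks with bboxes
        return [page.get_text() for page in doc]

pages = extract_pages("contract_v2.pdf")  # illustrative file name
```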
---
DOCX Extraction
1. python-docx
- The de facto standard library for DOCX.
- Extracts paragraphs, runs, metadata.
- Does not preserve page boundaries (because DOCX is flow‑based).
2. docx2python
- Better at preserving structure.
- Still no true page boundaries (because DOCX doesn’t store them).
3. LibreOffice headless conversion → PDF → PyMuPDF
This is the industry-standard workaround when page fidelity matters.
Recommendation:
If page‑level comparison is required, convert DOCX → PDF → extract pages.
Otherwise, use python-docx for semantic diffing.
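A sketch of the LibreOffice workaround, assuming the `soffice` binary is on your PATH:

```python
import subprocess
from pathlib import Path

def docx_to_pdf(docx_path: str, outdir: str = "converted") -> Path:
    """Convert DOCX to PDF headlessly so true page boundaries exist."""
    Path(outdir).mkdir(exist_ok=True)
    subprocess.run(
        ["soffice", "--headless", "--convert-to", "pdf",
         "--outdir", outdir, docx_path],
        check=True,
    )
    return Path(outdir) / (Path(docx_path).stem + ".pdf")
```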
---
2. Best Approaches for Page‑Wise Comparison
You essentially have three families of methods:
---
A. Classical Diff (token/line/paragraph diff)
Pros
- Deterministic.
- Great for legal, policy, and technical documents.
- Easy to highlight exact insertions/deletions.
Cons
- Sensitive to formatting changes.
- Sensitive to OCR noise.
Best use:
When documents are text‑heavy and formatting is stable.
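A minimal classical‑diff sketch using the standard library's difflib:

```python
import difflib

def page_diff(old: str, new: str) -> str:
    """Line-oriented unified diff of two page texts."""
    return "\n".join(difflib.unified_diff(
        old.splitlines(), new.splitlines(),
        fromfile="old", tofile="new", lineterm="",
    ))
```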
---
B. Embedding‑Based Similarity (semantic comparison)
Use sentence‑transformers or similar models to embed each page.
Pros
- Robust to formatting changes.
- Captures semantic shifts (“added clause”, “reworded section”).
Cons
- Cannot show exact diffs.
- Requires threshold tuning.
Best use:
When documents evolve semantically but maintain structure.
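A sketch with sentence-transformers; the model name is a common default, not a requirement:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def page_similarity(old_page: str, new_page: str) -> float:
    """Cosine similarity between the embeddings of two page texts."""
    emb = model.encode([old_page, new_page], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()
```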
---
C. Hybrid Approach (the real-world winner)
This is what most document‑intelligence pipelines use:
1. Embedding similarity to detect whether a page changed.
2. Classical diff to describe how it changed.
Why hybrid works best:
- Embeddings filter out false positives.
- Diff gives human‑readable change summaries.
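A sketch of the hybrid gate, reusing `page_similarity` and `page_diff` from the sketches above (0.93 is the default threshold from section 3):

```python
def hybrid_compare(old: str, new: str, threshold: float = 0.93) -> dict:
    score = page_similarity(old, new)  # cheap semantic gate first
    if score >= threshold:
        return {"modified": False, "similarity": score, "diff": None}
    # detailed, human-readable diff only for pages that actually changed
    return {"modified": True, "similarity": score, "diff": page_diff(old, new)}
```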
---
3. Thresholds for “Modified vs Unchanged”
There is no universal threshold, but these are strong defaults from production systems:
| Method | Threshold | Meaning |
|-------|-----------|---------|
| Embedding cosine similarity | 0.92–0.95 | Above = unchanged, below = modified |
| Token-level Jaccard similarity | 0.85 | Below = modified |
| Levenshtein ratio | 0.90 | Below = modified |
Recommended combined rule:
A page is “modified” if two of three signals fall below threshold:
- cosine similarity < 0.93
- Jaccard < 0.85
- Levenshtein < 0.90
This dramatically reduces noise.
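A sketch of the two‑of‑three vote. Levenshtein similarity here comes from rapidfuzz (an assumption; any normalized edit‑distance works), and Jaccard is over whitespace tokens:

```python
from rapidfuzz.distance import Levenshtein

def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def is_modified(old: str, new: str, cosine: float) -> bool:
    votes = [
        cosine < 0.93,
        jaccard(old, new) < 0.85,
        Levenshtein.normalized_similarity(old, new) < 0.90,
    ]
    return sum(votes) >= 2  # "modified" requires two of the three signals
```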
---
4. Architecture Pattern That Works in Production
A clean, modular pipeline looks like this:
---
Step 1 — Ingestion Layer
- Detect file type.
- Convert DOCX → PDF if page fidelity required.
- Extract text page by page (see the dispatch sketch below).
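A minimal dispatch sketch for this layer, reusing `docx_to_pdf` and `extract_pages` from section 1:

```python
from pathlib import Path

def ingest(path: str) -> list[str]:
    suffix = Path(path).suffix.lower()
    if suffix == ".docx":
        path = str(docx_to_pdf(path))  # preserve page fidelity
    elif suffix != ".pdf":
        raise ValueError(f"unsupported file type: {suffix}")
    return extract_pages(path)
```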
---
Step 2 — Normalization Layer
Normalize text:
- remove headers/footers
- collapse whitespace
- standardize bullet points
- remove page numbers
This step alone improves diff accuracy by 30–50%.
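A normalizer sketch; the page‑number and bullet patterns are illustrative and should be tuned per corpus (repeated header/footer lines are best detected by their frequency across pages):

```python
import re

def normalize(text: str) -> str:
    lines = text.splitlines()
    # drop bare page numbers and "Page X of Y" style footers
    lines = [l for l in lines
             if not re.fullmatch(r"\s*(page\s+)?\d+(\s+of\s+\d+)?\s*", l, re.I)]
    text = "\n".join(lines)
    text = re.sub(r"[•◦▪‣]", "-", text)     # standardize bullet glyphs
    text = re.sub(r"[ \t]+", " ", text)     # collapse horizontal whitespace
    text = re.sub(r"\n{3,}", "\n\n", text)  # collapse runs of blank lines
    return text.strip()
```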
---
Step 3 — Storage Layer
Store:
- raw text per page
- normalized text per page
- embeddings per page
- metadata
- version ID
Use a simple structure like:
```
document_id/
  version_001/
    page_001.json
    page_002.json
    ...
  version_002/
    ...
```
Or store in a vector DB (Weaviate, Chroma, Pinecone) if scaling.
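A sketch of writing one page record into the layout above (field names are illustrative):

```python
import json
from pathlib import Path

def store_page(doc_id: str, version: str, page_no: int,
               raw: str, normalized: str, embedding: list[float]) -> None:
    out = Path(doc_id) / version
    out.mkdir(parents=True, exist_ok=True)
    record = {"page": page_no, "raw_text": raw,
              "normalized_text": normalized, "embedding": embedding}
    (out / f"page_{page_no:03d}.json").write_text(json.dumps(record))
```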
---
Step 4 — Comparison Layer
For each page:
1. Compare embeddings.
2. If similarity < threshold → run classical diff.
3. Produce:
- similarity score
- change ratio
- diff summary
- modified page flag
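A per‑page sketch producing those four outputs, reusing the earlier `page_similarity`, `page_diff`, and `is_modified` helpers; change ratio here is one minus a SequenceMatcher ratio:

```python
import difflib

def compare_page(old: str, new: str) -> dict:
    score = page_similarity(old, new)
    modified = is_modified(old, new, score)
    return {
        "similarity": round(score, 4),
        "change_ratio": round(
            1 - difflib.SequenceMatcher(None, old, new).ratio(), 4),
        "diff_summary": page_diff(old, new) if modified else "",
        "modified": modified,
    }
```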
---
Step 5 — Output Layer (for orchestration)
Produce a structured JSON object:
```json
{
  "document_id": "abc123",
  "version_new": "v3",
  "version_old": "v2",
  "overall_similarity": 0.94,
  "pages_modified": [3, 14, 22],
  "page_summaries": {
    "14": "Added two new clauses regarding payment terms."
  }
}
```
This is orchestration‑friendly and model‑agnostic.
---
5. Lessons Learned from Real Pipelines
1. Page numbers shift — don’t rely on absolute page alignment
Use embedding similarity to match pages before comparing them.
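A greedy alignment sketch (reusing the sentence-transformers `model` from section 2B): for each new page, pick the most similar old page before diffing:

```python
from sentence_transformers import util

def align_pages(old_pages: list[str],
                new_pages: list[str]) -> list[tuple[int, int]]:
    old_emb = model.encode(old_pages, convert_to_tensor=True)
    new_emb = model.encode(new_pages, convert_to_tensor=True)
    sims = util.cos_sim(new_emb, old_emb)  # new_pages x old_pages matrix
    # (new index, best-matching old index) pairs
    return [(i, int(sims[i].argmax())) for i in range(len(new_pages))]
```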
2. Normalization is more important than the diff algorithm
Removing headers/footers eliminates 80% of false positives.
3. Embeddings catch semantic changes that diffs miss
E.g., “shall” → “must” is legally significant but diff‑small.
4. Always store both raw and normalized text
You’ll need raw text for human review and normalized for algorithms.
5. Avoid OCR unless absolutely necessary
OCR noise destroys diff accuracy. If you must OCR, use:
- Tesseract with LSTM mode
- or PaddleOCR (more accurate)
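If you do end up OCRing, a minimal sketch assuming pytesseract and Pillow are installed (`--oem 1` selects Tesseract's LSTM engine):

```python
import pytesseract
from PIL import Image

# illustrative file name
text = pytesseract.image_to_string(Image.open("scan_p1.png"), config="--oem 1")
```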
I hope this helps. Regards, Antony.