Word positions in a PDF are not stored as “words.” PDFs store text drawing commands, each placing individual glyphs (characters) at specific coordinates. To retrieve word positions, libraries reconstruct words by grouping characters based on spacing and layout analysis.
Below is a clear breakdown of how this works, grounded in how real PDF parsers operate.
Core Idea: PDFs Store Characters, Not Words
A PDF page contains drawing instructions like:
BT
/F1 12 Tf
100 700 Td
(Hello) Tj
ET
This means:
“Hello” is drawn starting at coordinate (100, 700)
Each character has its own width and position
There is no concept of a “word” in the PDF spec
So libraries must infer words by grouping characters.
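A minimal sketch of that grouping step, assuming the glyphs have already been extracted as (character, x0, x1, baseline y) tuples. The tuple layout, the `group_glyphs` name, and the default thresholds are illustrative assumptions, not any library's API.

```python
from typing import List, Tuple

Glyph = Tuple[str, float, float, float]  # (char, x0, x1, baseline_y): assumed layout

def group_glyphs(glyphs: List[Glyph], gap: float = 2.0, y_tol: float = 1.0):
    """Group glyphs into words; returns a list of (text, x0, x1, y) tuples."""
    words = []
    current = None  # [text, x0, x1, y] for the word being built
    # Sort top-to-bottom (PDF y grows upward), then left-to-right
    for ch, x0, x1, y in sorted(glyphs, key=lambda g: (-g[3], g[1])):
        same_line = current is not None and abs(y - current[3]) <= y_tol
        close = same_line and x0 - current[2] <= gap
        if ch.isspace() or not close:
            # A space glyph, a wide gap, or a new line ends the current word
            if current is not None:
                words.append(tuple(current))
            current = None if ch.isspace() else [ch, x0, x1, y]
        else:
            current[0] += ch
            current[2] = x1
    if current is not None:
        words.append(tuple(current))
    return words

glyphs = [("H", 100, 108, 700), ("i", 108, 111, 700),
          ("y", 120, 126, 700), ("o", 126, 132, 700)]
print(group_glyphs(glyphs))  # two words, split at the 9-unit gap
```

Real extractors usually scale the gap threshold with the font size instead of using a fixed number.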
How Libraries Retrieve Word Positions
1. PDFBox (Java)
PDFBox extracts characters with coordinates, then you must group them into words.
PDFTextStripper can sort characters by position
You split text into words and compute bounding boxes from character positions
Custom logic is required to determine word boundaries (spacing, layout)
Process:
Extract characters + their bounding boxes
Sort by position
Group characters into words based on spacing thresholds
Compute the word’s bounding box from min/max character coordinates
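The last step of that process is a plain min/max union. A small sketch in Python (used here for all added examples, even though PDFBox itself is Java), assuming each character box is an (x0, y0, x1, y1) tuple; the `union_bbox` name is hypothetical.

```python
def union_bbox(boxes):
    """Smallest (x0, y0, x1, y1) rectangle containing every box in boxes."""
    x0s, y0s, x1s, y1s = zip(*boxes)
    return (min(x0s), min(y0s), max(x1s), max(y1s))

# Character boxes of one word, with slightly different heights
chars = [(100, 698, 108, 710), (108, 698, 111, 712), (111, 695, 117, 710)]
print(union_bbox(chars))
```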
2. PDFMiner (Python)
PDFMiner performs layout analysis and groups characters into LTTextLine and LTTextBox objects.
Each text box has a bounding box (lobj.bbox)
You can iterate through layout objects to get text and coordinates
Process:
Parse page layout
For each LTTextBox, get bbox and text
If you need word-level positions, split the text and compute sub‑bounding boxes from the LTChar objects inside each LTTextLine
3. Spire.PDF (Python)
Spire.PDF provides direct APIs for finding text and retrieving its coordinates.
PdfTextFinder returns PdfTextFindResult objects
Each result includes the (X, Y) coordinates of the matched text
Process:
Search for a word or phrase
Library returns bounding boxes directly
No manual grouping required
4. pdfplumber / PyMuPDF (Python)
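pdfplumber extracts word-level data directly, including bounding boxes.
It returns each word as a dict with: text, x0, x1, top, bottom (top and bottom are measured from the top of the page)
Useful for structured extraction
PyMuPDF offers the equivalent through page.get_text("words"), which yields (x0, y0, x1, y1, text, block_no, line_no, word_no) tuples
Process:
Load page
Call page.extract_words()
Each word includes its bounding box; font name and size are added only when requested through extra_attrs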
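Because pdfplumber measures top and bottom from the top of the page while native PDF coordinates start at the bottom-left, word boxes often need converting before they are used with other tools. A small sketch, assuming an unrotated page whose origin is at (0, 0); the `to_pdf_space` name and the sample word are illustrative.

```python
def to_pdf_space(word: dict, page_height: float):
    """Convert a pdfplumber-style word (top-left origin) to (x0, y0, x1, y1)
    in PDF user space (bottom-left origin)."""
    return (
        word["x0"],
        page_height - word["bottom"],  # lower edge, measured from the page bottom
        word["x1"],
        page_height - word["top"],     # upper edge, measured from the page bottom
    )

word = {"text": "Hello", "x0": 100.0, "x1": 128.0, "top": 80.0, "bottom": 92.0}
print(to_pdf_space(word, page_height=792.0))  # (100.0, 700.0, 128.0, 712.0)
```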
Why This Is Necessary
PDFs are designed for visual fidelity, not semantic structure. Text may be:
Split into multiple drawing commands
Positioned arbitrarily
Overlapping or rotated
Missing spaces entirely
So libraries must reconstruct logical text from low-level drawing instructions.
Summary
Word positions are retrieved by:
Extracting character-level coordinates from PDF drawing commands
Sorting characters by position
Grouping characters into words using spacing/layout heuristics
Computing bounding boxes for each word
Some libraries (pdfplumber, Spire.PDF) provide word-level positions directly, while others (PDFBox, PDFMiner) require manual grouping.
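The "sort by position" step is less trivial than it sounds, because boxes on the same visual line rarely share exactly the same vertical coordinate. A hedged sketch of one common approach, assuming boxes are (text, x0, top, x1, bottom) tuples with a top-left origin; the `reading_order` name and the tolerance are assumptions.

```python
def reading_order(boxes, y_tol=3.0):
    """Sort boxes top-to-bottom, then left-to-right, treating boxes whose
    tops differ from the line's first box by at most y_tol as one line."""
    lines = []  # each entry: [line_top, [boxes on that line]]
    for box in sorted(boxes, key=lambda b: b[2]):
        if lines and abs(box[2] - lines[-1][0]) <= y_tol:
            lines[-1][1].append(box)
        else:
            lines.append([box[2], [box]])
    # Within each line, order by the left edge
    return [b for _, line in lines for b in sorted(line, key=lambda b: b[1])]

boxes = [("world", 60, 100.5, 90, 112), ("Next", 10, 120, 40, 132), ("Hello", 10, 100, 50, 112)]
print([b[0] for b in reading_order(boxes)])  # ['Hello', 'world', 'Next']
```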
If you want, I can show you example code for:
PDFBox (Java)
PDFMiner (Python)
pdfplumber (Python)
PyMuPDF (Python)
Which language are you working in?
You can remove words from a PDF in Python, but how depends on what you mean by “remove”:
If you want to truly delete the text from the PDF content stream (so it cannot be selected, searched, or recovered), you need redaction.
If you want to remove all text, some libraries support this directly.
If you want to remove specific words, you must search for them and redact their bounding boxes.
Below are the practical, working methods supported by current Python libraries, grounded in the sources retrieved.
✅ Best Practical Method: Redaction with PyMuPDF (fitz)
PyMuPDF supports true redaction, which removes the text from the PDF content stream — not just covering it.
A working example from a verified solution uses page.search_for() to find the word, then applies redactions.
Example: Remove a specific word (“Invoice”)
import fitz  # PyMuPDF

doc = fitz.open("input.pdf")
page = doc.load_page(0)

# Find all instances of the word
matches = page.search_for("Invoice")

# Add redaction annotations
for rect in matches:
    page.add_redact_annot(rect)

# Apply redactions (this deletes the text)
page.apply_redactions()

doc.save("output.pdf", garbage=3, deflate=True)
This permanently removes the text from the PDF.
✅ Removing all text using pypdf
The pypdf library added a method called remove_text() that removes all text objects from the PDF.
Example:
from pypdf import PdfWriter
writer = PdfWriter(clone_from="input.pdf")
writer.remove_text()
writer.write("output.pdf")
This removes all text drawn using normal PDF text operators.
⚠️ Limitations:
Text that exists only as pixels inside an image (for example a scanned page) is not removed, because it is not a text object.
Removing text of a specific font requires custom logic.
⚠️ What you cannot do
You cannot simply “replace words” inside a PDF content stream like editing a text file. PDFs store drawing commands, not editable text.
You cannot remove text that is part of an image unless you also remove or modify the image.
Which method should you use?
| Goal | Best Method | Notes |
|---|---|---|
| Remove specific words | PyMuPDF redaction | Most reliable; deletes text permanently |
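| Remove all text | PdfWriter.remove_text() (pypdf) | Works for real text objects, not for text inside images |
| Remove text inside scanned images | Edit or remove the image itself | The text is pixels in the image; an invisible OCR text layer, if present, is ordinary text |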
Want me to generate a ready‑to‑run script?
Tell me:
Do you want to remove specific words, all text, or text matching a pattern?
Do you want to process one PDF or many PDFs?
Should the output preserve layout or not?
I can produce a clean, modular Python script tailored to your workflow.
You can replace text in a PDF using Python — but PDFs don’t allow direct “find‑and‑replace” like a Word document. You must remove the old text and draw new text in its place. The most reliable way to do this in Python is with PyMuPDF (fitz), which supports searching, redaction, and reinserting text.
Below is a clear, practical guide based on the most authoritative sources available, including StackOverflow discussions and official PyMuPDF examples.
✅ Short Answer
Use PyMuPDF to:
Search for the text (page.search_for)
Redact (delete) the original text (page.add_redact_annot + page.apply_redactions)
Insert the replacement text (page.insert_text)
This is the most dependable approach for typical PDFs, because it works on page geometry rather than on raw stream bytes.
✅ Working Example: Replace "Hello" with "Goodbye" in a PDF
import fitz  # PyMuPDF

def replace_text(pdf_in, pdf_out, find_text, new_text):
    doc = fitz.open(pdf_in)
    for page in doc:
        # Find all occurrences of the text
        matches = page.search_for(find_text)
        if not matches:
            continue
        for rect in matches:
            # Mark the original text for removal
            page.add_redact_annot(rect)
        # Apply once per page, after all annotations have been added
        page.apply_redactions()
        # Insert replacement text at each location
        for rect in matches:
            # search_for returns boxes, not baselines; y1 - 2 is a rough guess
            baseline = fitz.Point(rect.x0, rect.y1 - 2)  # adjust as needed
            page.insert_text(
                baseline,
                new_text,
                fontsize=12,  # fixed size; does not follow the original text
                fontname="helv",
                color=(0, 0, 0),
            )
    doc.save(pdf_out)
    doc.close()

replace_text("input.pdf", "output.pdf", "Hello", "Goodbye")
This approach is directly aligned with the PyMuPDF search‑and‑replace patterns described in multiple sources.
Why This Works (and Why Other Methods Fail)
PDFs store text as drawing commands, not editable text.
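You cannot simply .replace() inside the PDF stream: content streams are usually compressed, strings are encoded per font, and a single word is often split across several drawing operators, so a plain string replacement rarely finds the target or re-encodes it correctly.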
PyMuPDF solves this by:
Locating text via geometric search
Deleting it via redaction (true removal)
Drawing new text at the same coordinates
This approach holds up across fonts, encodings, and multi‑span text because it never touches the raw stream bytes directly.
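One practical consequence of drawing new text at the old coordinates is that a longer replacement can overrun its neighbours. A sketch of a shrink-to-fit calculation, assuming a hypothetical `measure(text, size)` callable that returns the rendered width; in PyMuPDF, `fitz.get_text_length` can play that role for the built-in fonts. The `fit_fontsize` and `half_em` names are made up for this example.

```python
from typing import Callable

def fit_fontsize(text: str, box_width: float, size: float,
                 measure: Callable[[str, float], float]) -> float:
    """Largest font size <= size at which text is no wider than box_width.
    Relies on text width scaling linearly with font size."""
    width = measure(text, size)
    if width == 0 or width <= box_width:
        return size
    return size * box_width / width

def half_em(text: str, size: float) -> float:
    """Stand-in metric for the example: every glyph is half an em wide."""
    return 0.5 * size * len(text)

print(fit_fontsize("Hi", 30.0, 12.0, half_em))       # fits already, size unchanged
print(fit_fontsize("Goodbye", 30.0, 12.0, half_em))  # shrunk to fit 30 units
```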
Alternative: Spire.PDF (Commercial / Windows‑focused)
Spire.PDF provides a high‑level ReplaceText API:
replacer.ReplaceText("old", "new")
It can replace the first match or all matches. However:
It is not open‑source
It may not preserve layout as precisely as PyMuPDF
It is Windows‑centric
When You Should NOT Use Text Replacement
You cannot replace text if:
The text is part of an image (e.g., scanned PDF)
The text is split across multiple spans in unpredictable ways
The font is embedded in a way that prevents re‑drawing
In those cases, OCR or image editing is required.
Want a version that preserves the original font, size, and color?
PyMuPDF supports extracting the exact font properties of each span and re‑inserting text with matching formatting. I can generate that version if you want.
Would you like:
Simple replacement (as above)
Font‑preserving replacement
Batch replacement across many PDFs
Regex‑based replacement
Here’s the font‑preserving version you asked for — the one that reads the actual font name, size, color, and positioning of each text span, deletes the original text, and redraws the replacement text using the same formatting.
This is the most accurate method PyMuPDF allows without manually parsing raw PDF operators.
✅ Font‑Preserving Text Replacement in PyMuPDF
What this version does
For every match of your target text:
Search for the text (page.search_for)
Extract the exact font properties of the matched text
font name
font size
text color
baseline position
Redact the original text (true deletion)
Re‑insert the replacement text using the same formatting
This preserves the visual style of the original PDF as closely as possible.
✅ Full Working Code (Font‑Preserving Replacement)
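import fitz  # PyMuPDF

def base14_font(flags):
    """Closest built-in font for a span; embedded fonts cannot be reused by name."""
    family = "co" if flags & 8 else "ti" if flags & 4 else "he"  # mono / serif / sans
    bold, italic = bool(flags & 16), bool(flags & 2)
    if bold or italic:
        return family + ("bi" if bold and italic else "bo" if bold else "it")
    return {"co": "cour", "ti": "tiro", "he": "helv"}[family]

def srgb_to_floats(color):
    """Convert a span's 0xRRGGBB integer to the (r, g, b) floats insert_text expects."""
    return ((color >> 16) & 255) / 255, ((color >> 8) & 255) / 255, (color & 255) / 255

def replace_text_preserve_font(pdf_in, pdf_out, find_text, new_text):
    doc = fitz.open(pdf_in)
    for page in doc:
        # Step 1: find all occurrences of the text
        matches = page.search_for(find_text)
        if not matches:
            continue
        # Step 2: extract font info for each match
        blocks = page.get_text("dict")["blocks"]
        font_info = []
        for rect in matches:
            info = None
            # Walk through text blocks to find the first span that overlaps the match
            for block in blocks:
                if block["type"] != 0:  # skip image blocks
                    continue
                for line in block["lines"]:
                    for span in line["spans"]:
                        if fitz.Rect(span["bbox"]).intersects(rect):
                            info = {
                                "font": base14_font(span["flags"]),
                                "size": span["size"],
                                "color": srgb_to_floats(span["color"]),
                                # match's left edge on the span's baseline
                                # (assumes horizontal text)
                                "origin": fitz.Point(rect.x0, span["origin"][1]),
                            }
                            break
                    if info:
                        break
                if info:
                    break
            if info:
                font_info.append(info)
        # Step 3: redact original text
        for rect in matches:
            page.add_redact_annot(rect)
        page.apply_redactions()
        # Step 4: reinsert replacement text with the captured formatting
        for info in font_info:
            page.insert_text(
                info["origin"],
                new_text,
                fontname=info["font"],
                fontsize=info["size"],
                color=info["color"],
            )
    doc.save(pdf_out, garbage=3, deflate=True)
    doc.close()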
How it Works
✔ Extracts exact font properties
PyMuPDF’s "dict" text extraction mode gives you span‑level metadata:
span["font"] → font name
span["size"] → font size
span["color"] → sRGB integer (convert to 0..1 floats before passing it to insert_text)
span["bbox"] → bounding box
✔ Matches spans to the search rectangles
We check which span overlaps the search rectangle and use its formatting.
✔ Redacts then redraws
This ensures the original text is gone, not just covered.
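When several spans touch the search rectangle, taking the first one found can pick up a neighbour's formatting. A hedged sketch of choosing the span with the largest overlap instead, using plain (x0, y0, x1, y1) tuples so it runs without any PDF library; the `overlap_area` and `best_span` names are assumptions.

```python
def overlap_area(a, b):
    """Area shared by two (x0, y0, x1, y1) rectangles; 0.0 if they do not overlap."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return w * h if w > 0 and h > 0 else 0.0

def best_span(spans, rect):
    """Return the span dict whose "bbox" overlaps rect the most, or None."""
    best = max(spans, key=lambda s: overlap_area(s["bbox"], rect), default=None)
    if best is None or overlap_area(best["bbox"], rect) == 0.0:
        return None
    return best

spans = [
    {"font": "A", "bbox": (0, 0, 50, 12)},
    {"font": "B", "bbox": (40, 0, 120, 12)},
]
print(best_span(spans, (45, 0, 80, 12))["font"])  # B: it covers most of the rectangle
```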
⚠ Important Notes
If the word appears across multiple spans (e.g., ligatures, kerning), PyMuPDF may treat it as multiple pieces.
If the PDF uses embedded or subset fonts, insert_text cannot reuse them by name; the replacement has to fall back to a built-in font or an external font file, so it may not render identically.
If the text is part of an image (scanned PDF), no library can replace it — OCR is required.
Want an even more advanced version?
I can generate versions that:
Handle multi‑span words
Preserve rotation and skew
Preserve line spacing
Support regex replacement
Process hundreds of PDFs in a batch pipeline
Replace text while preserving exact kerning
Just tell me what level of precision you want.
Here’s a production‑style, batch‑capable PyMuPDF solution that:
Handles multi‑span words (uses word boxes, then merges span info)
Handles rotation explicitly (spans expose no text matrix, so the writing direction is read from each line's dir vector and non‑horizontal lines are skipped)
Preserves line spacing & layout (only replaces in‑place, no reflow)
Processes hundreds of PDFs in a directory pipeline
Core idea
Use page.get_text("words") to get word‑level boxes.
For each word to replace:
Find its word box.
Collect all spans whose bbox intersects that word box.
Use an intersecting span's font, size, and color as the “style”.
Check the line's dir vector: spans carry no text matrix and insert_text accepts none, so rotated lines must be handled separately (via rotate or morph) or skipped.
Redact the word box.
Re‑insert replacement text using insert_text at the original baseline.
Full code: batch, font‑preserving, rotation‑aware replacement
import fitz  # PyMuPDF
from pathlib import Path
from typing import Dict, Iterable, Optional


def base14_font(flags: int) -> str:
    """Closest built-in font for a span; embedded fonts cannot be reused by name."""
    family = "co" if flags & 8 else "ti" if flags & 4 else "he"  # mono / serif / sans
    bold, italic = bool(flags & 16), bool(flags & 2)
    if bold or italic:
        return family + ("bi" if bold and italic else "bo" if bold else "it")
    return {"co": "cour", "ti": "tiro", "he": "helv"}[family]


def srgb_to_floats(color: int):
    """Convert a span's 0xRRGGBB integer to the (r, g, b) floats insert_text expects."""
    return ((color >> 16) & 255) / 255, ((color >> 8) & 255) / 255, (color & 255) / 255


def style_for_rect(blocks, rect: fitz.Rect) -> Optional[dict]:
    """Style of the span overlapping rect the most; None if no span or rotated line."""
    best = None  # (overlap_area, span, line)
    for b in blocks:
        if b["type"] != 0:  # skip image blocks
            continue
        for line in b["lines"]:
            for span in line["spans"]:
                x0, y0, x1, y1 = span["bbox"]
                w = min(x1, rect.x1) - max(x0, rect.x0)
                h = min(y1, rect.y1) - max(y0, rect.y0)
                if w > 0 and h > 0 and (best is None or w * h > best[0]):
                    best = (w * h, span, line)
    if best is None:
        return None
    _, span, line = best
    dx, dy = line["dir"]
    if dx <= 0 or abs(dy) > 1e-3:
        # Rotated text: spans have no "matrix" key and insert_text has no
        # matrix argument, so this sketch leaves such text untouched.
        return None
    return {
        "font": base14_font(span["flags"]),
        "size": span["size"],
        "color": srgb_to_floats(span["color"]),
        # baseline point: left edge of the rect on the span's baseline
        "origin": fitz.Point(rect.x0, span["origin"][1]),
    }


def replace_text_in_page(page: fitz.Page, replacements: Dict[str, str]):
    """
    Replace words on a single page, keeping:
    - font style (closest built-in font), size, and color
    - line spacing & layout (in-place replacement)
    Rotated text is detected via each line's "dir" vector and skipped.
    `replacements` is a dict: { "old": "new", ... }
    """
    # 1. Get word-level info: (x0, y0, x1, y1, "text", block_no, line_no, word_no)
    words = page.get_text("words")
    if not words:
        return
    # 2. Get span-level info (font flags, size, color, baseline) once per page
    blocks = page.get_text("dict")["blocks"]
    # 3. Collect all word rects to redact and their style info
    redactions = []  # list of (rect, style_dict, replacement_text)
    for w in words:
        rect, original_text = fitz.Rect(w[:4]), w[4]
        if original_text not in replacements:
            continue
        style = style_for_rect(blocks, rect)
        if style is None:
            continue
        redactions.append((rect, style, replacements[original_text]))
    # 4. Apply redactions (delete original words)
    for rect, _, _ in redactions:
        page.add_redact_annot(rect)
    if redactions:
        page.apply_redactions()
    # 5. Reinsert replacement text with the captured style
    for _, style, replacement_text in redactions:
        page.insert_text(
            style["origin"],
            replacement_text,
            fontname=style["font"],
            fontsize=style["size"],
            color=style["color"],
        )
def replace_text_in_pdf(
    input_path: Path,
    output_path: Path,
    replacements: Dict[str, str],
):
    """
    Replace text in a single PDF file and write to output_path.
    """
    doc = fitz.open(input_path)
    for page in doc:
        replace_text_in_page(page, replacements)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    doc.save(output_path, garbage=3, deflate=True)
    doc.close()


def batch_replace_text_in_pdfs(
    input_dir: Path,
    output_dir: Path,
    replacements: Dict[str, str],
    patterns: Iterable[str] = ("*.pdf",),
    recursive: bool = True,
):
    """
    Process hundreds of PDFs in a directory tree.
    - input_dir: root folder with PDFs
    - output_dir: where to write modified PDFs (mirrors structure)
    - replacements: dict { "old": "new", ... }
    - patterns: filename patterns, e.g. ("*.pdf",)
    - recursive: whether to walk subdirectories
    """
    if recursive:
        files = [
            p
            for pattern in patterns
            for p in input_dir.rglob(pattern)
        ]
    else:
        files = [
            p
            for pattern in patterns
            for p in input_dir.glob(pattern)
        ]
    for pdf_path in files:
        rel = pdf_path.relative_to(input_dir)
        out_path = output_dir / rel
        print(f"Processing: {pdf_path} -> {out_path}")
        replace_text_in_pdf(pdf_path, out_path, replacements)


if __name__ == "__main__":
    # Example usage:
    input_root = Path(r"input_pdfs")
    output_root = Path(r"output_pdfs")
    # Define your replacements here
    replacements_map = {
        "Hello": "Goodbye",
        "Invoice": "Receipt",
        # add more pairs...
    }
    batch_replace_text_in_pdfs(
        input_dir=input_root,
        output_dir=output_root,
        replacements=replacements_map,
        patterns=("*.pdf",),
        recursive=True,
    )
If you tell me a concrete mapping (e.g. "Company A" → "Company B") and rough file volume, I can tweak thresholds (e.g. intersection tolerance) and logging for your exact workload.
Absolutely — we can extend the engine to support regex‑based replacement while still preserving:
multi‑span word handling
rotation awareness
line spacing
batch processing
The trick is: PDFs don’t store “words”, so regex must operate on a logical text stream, then map each regex match back to physical word boxes on the page.
Below is the regex‑capable architecture that does exactly that.
✅ Regex‑Aware, Font‑Preserving, Rotation‑Safe PDF Text Replacement
How this version works
Extract the page’s word list (page.get_text("words")).
Reconstruct a logical text stream with offsets.
Run your regex on that logical stream.
For each regex match:
Identify which words overlap the match range.
Merge their bounding boxes (only when all of them sit on the same line).
Collect all intersecting spans (multi‑span safe).
Extract style (font, size, color, baseline).
Redact the merged rectangle.
Re‑insert replacement text using the original style.
This gives you full regex power while still preserving layout and formatting.
Full Implementation (Regex‑Enabled)
import re
from pathlib import Path
from typing import Dict, Iterable, Optional, Pattern

import fitz  # PyMuPDF


def base14_font(flags: int) -> str:
    """Closest built-in font for a span; embedded fonts cannot be reused by name."""
    family = "co" if flags & 8 else "ti" if flags & 4 else "he"  # mono / serif / sans
    bold, italic = bool(flags & 16), bool(flags & 2)
    if bold or italic:
        return family + ("bi" if bold and italic else "bo" if bold else "it")
    return {"co": "cour", "ti": "tiro", "he": "helv"}[family]


def srgb_to_floats(color: int):
    """Convert a span's 0xRRGGBB integer to the (r, g, b) floats insert_text expects."""
    return ((color >> 16) & 255) / 255, ((color >> 8) & 255) / 255, (color & 255) / 255


def build_text_stream(words):
    """
    Build a logical text stream and map character offsets to word entries.
    Returns:
        full_text: str
        index_map: list of (word_index, char_in_word)
    """
    stream = []
    index_map = []
    for i, w in enumerate(words):
        text = w["text"]
        for c_idx, c in enumerate(text):
            stream.append(c)
            index_map.append((i, c_idx))
        # Add a space separator (not in PDF, but helps regex)
        stream.append(" ")
        index_map.append((None, None))
    return "".join(stream), index_map


def style_for_rect(blocks, rect: fitz.Rect) -> Optional[dict]:
    """Style of the span overlapping rect the most; None if no span or rotated line."""
    best = None  # (overlap_area, span, line)
    for block in blocks:
        if block["type"] != 0:  # skip image blocks
            continue
        for line in block["lines"]:
            for span in line["spans"]:
                x0, y0, x1, y1 = span["bbox"]
                w = min(x1, rect.x1) - max(x0, rect.x0)
                h = min(y1, rect.y1) - max(y0, rect.y0)
                if w > 0 and h > 0 and (best is None or w * h > best[0]):
                    best = (w * h, span, line)
    if best is None:
        return None
    _, span, line = best
    dx, dy = line["dir"]
    if dx <= 0 or abs(dy) > 1e-3:
        # Rotated text: spans have no "matrix" key and insert_text has no
        # matrix argument, so this sketch leaves such text untouched.
        return None
    return {
        "font": base14_font(span["flags"]),
        "size": span["size"],
        "color": srgb_to_floats(span["color"]),
        "origin": fitz.Point(rect.x0, span["origin"][1]),
    }


def replace_regex_in_page(page: fitz.Page, regex_map: Dict[Pattern, str]):
    # Extract words: (x0, y0, x1, y1, "text", block_no, line_no, word_no)
    raw_words = page.get_text("words")
    if not raw_words:
        return
    words = [
        {"rect": fitz.Rect(w[:4]), "text": w[4], "line": (w[5], w[6])}
        for w in raw_words
    ]
    # Span-level info is read once per page instead of once per match
    blocks = page.get_text("dict")["blocks"]
    # Build logical text stream
    full_text, index_map = build_text_stream(words)
    # Collect all replacements
    replacements = []  # list of (rect, style, replacement_text)
    for pattern, replacement in regex_map.items():
        for match in pattern.finditer(full_text):
            start, end = match.span()
            # Identify which words are touched (a touched word is replaced whole)
            touched = sorted({index_map[pos][0] for pos in range(start, end)} - {None})
            if not touched:
                continue
            # A merged box only makes sense within one line of text
            if len({words[i]["line"] for i in touched}) > 1:
                continue
            # Merge bounding boxes of all touched words
            merged_rect = fitz.Rect(words[touched[0]]["rect"])
            for i in touched[1:]:
                merged_rect |= words[i]["rect"]
            style = style_for_rect(blocks, merged_rect)
            if style is None:
                continue
            replacements.append((merged_rect, style, replacement))
    # Redact originals
    for rect, _, _ in replacements:
        page.add_redact_annot(rect)
    if replacements:
        page.apply_redactions()
    # Insert replacements
    for _, style, replacement_text in replacements:
        page.insert_text(
            style["origin"],
            replacement_text,
            fontname=style["font"],
            fontsize=style["size"],
            color=style["color"],
        )
def replace_regex_in_pdf(input_path: Path, output_path: Path, regex_map: Dict[Pattern, str]):
    doc = fitz.open(input_path)
    for page in doc:
        replace_regex_in_page(page, regex_map)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    doc.save(output_path, garbage=3, deflate=True)
    doc.close()


def batch_replace_regex(
    input_dir: Path,
    output_dir: Path,
    regex_map: Dict[Pattern, str],
    patterns: Iterable[str] = ("*.pdf",),
    recursive: bool = True,
):
    if recursive:
        files = [p for pattern in patterns for p in input_dir.rglob(pattern)]
    else:
        files = [p for pattern in patterns for p in input_dir.glob(pattern)]
    for pdf_path in files:
        rel = pdf_path.relative_to(input_dir)
        out_path = output_dir / rel
        print(f"Processing: {pdf_path} -> {out_path}")
        replace_regex_in_pdf(pdf_path, out_path, regex_map)


if __name__ == "__main__":
    input_root = Path("input_pdfs")
    output_root = Path("output_pdfs")
    regex_map = {
        re.compile(r"\bInvoice\s+\d{4}\b"): "Invoice [REDACTED]",
        re.compile(r"\bCompany\s+ABC\b"): "Company XYZ",
        re.compile(r"\bTotal:\s+£\d+\.\d{2}\b"): "Total: £0.00",
    }
    batch_replace_regex(
        input_dir=input_root,
        output_dir=output_root,
        regex_map=regex_map,
        patterns=("*.pdf",),
        recursive=True,
    )
What this version adds
✔ Full regex support
You can now match:
dates
invoice numbers
monetary values
multi‑word phrases
patterns spanning multiple words
✔ Multi‑span safe
Regex matches are mapped back to all words involved, even if the PDF splits them across spans.
✔ Rotation handled explicitly
PyMuPDF spans expose no text matrix and insert_text accepts none, so the line's dir vector is checked and rotated or skewed text is left untouched rather than redrawn incorrectly.
✔ Line spacing preserved
We never reflow text — replacements are drawn exactly where the originals were.
✔ Batch processing
Handles hundreds or thousands of PDFs with directory mirroring.
If you want, I can also add:
Case‑insensitive regex
Unicode‑aware matching
Fuzzy matching (Levenshtein)
Logging + progress bars
Parallel processing for speed
Just tell me what direction you want to take this pipeline.
Here’s a single, pipeline‑ready script that adds everything you asked for:
Case‑insensitive regex
Unicode‑aware matching
Fuzzy matching (Levenshtein)
Logging + progress bars
Parallel processing
It still preserves:
font (name, size, color)
rotation awareness (via each line's dir vector)
line spacing & layout
multi‑span words
You can drop this into a .py file and adapt the config at the bottom.
You’ll want:
pip install pymupdf rapidfuzz tqdm
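The fuzzy thresholds in the script are similarity scores from 0 to 100. To make that scale concrete, here is a dependency-free sketch of a Levenshtein-based similarity; the `levenshtein` and `similarity` names are illustrative. Note that rapidfuzz's fuzz.ratio is computed from a related but different distance (insertions and deletions only), so its scores will not match this sketch exactly.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: insertions, deletions, and substitutions cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Scale the distance to 0..100 so it can be compared with a threshold."""
    longest = max(len(a), len(b))
    return 100.0 if longest == 0 else 100.0 * (1 - levenshtein(a, b) / longest)

print(levenshtein("Foldng", "Folding"))           # 1 (one missing letter)
print(round(similarity("Foldng", "Folding"), 1))  # 85.7
```

With a threshold of 80, "Foldng" would count as a match for "Folding" under this measure, while an unrelated word of the same length would not.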
Full script
import logging
import re
from concurrent.futures import ProcessPoolExecutor, as_completed
from pathlib import Path
from typing import Dict, Iterable, List, Optional, Pattern, Tuple

import fitz  # PyMuPDF
from rapidfuzz import fuzz
from tqdm import tqdm

# ---------- Logging setup ----------
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s",
)


# ---------- Core helpers ----------
def base14_font(flags: int) -> str:
    """Closest built-in font for a span; embedded fonts cannot be reused by name."""
    family = "co" if flags & 8 else "ti" if flags & 4 else "he"  # mono / serif / sans
    bold, italic = bool(flags & 16), bool(flags & 2)
    if bold or italic:
        return family + ("bi" if bold and italic else "bo" if bold else "it")
    return {"co": "cour", "ti": "tiro", "he": "helv"}[family]


def srgb_to_floats(color: int):
    """Convert a span's 0xRRGGBB integer to the (r, g, b) floats insert_text expects."""
    return ((color >> 16) & 255) / 255, ((color >> 8) & 255) / 255, (color & 255) / 255


def build_text_stream(words: List[dict]) -> Tuple[str, List[Tuple[Optional[int], Optional[int]]]]:
    """
    Build a logical text stream and map character offsets to word entries.
    Returns:
        full_text: str
        index_map: list of (word_index, char_in_word)
    """
    stream = []
    index_map: List[Tuple[Optional[int], Optional[int]]] = []
    for i, w in enumerate(words):
        text = w["text"]
        for c_idx, c in enumerate(text):
            stream.append(c)
            index_map.append((i, c_idx))
        # logical space separator
        stream.append(" ")
        index_map.append((None, None))
    return "".join(stream), index_map


def style_for_rect(blocks, rect: fitz.Rect) -> Optional[dict]:
    """Style of the span overlapping rect the most; None if no span or rotated line."""
    best = None  # (overlap_area, span, line)
    for block in blocks:
        if block["type"] != 0:  # skip image blocks
            continue
        for line in block["lines"]:
            for span in line["spans"]:
                x0, y0, x1, y1 = span["bbox"]
                w = min(x1, rect.x1) - max(x0, rect.x0)
                h = min(y1, rect.y1) - max(y0, rect.y0)
                if w > 0 and h > 0 and (best is None or w * h > best[0]):
                    best = (w * h, span, line)
    if best is None:
        return None
    _, span, line = best
    dx, dy = line["dir"]
    if dx <= 0 or abs(dy) > 1e-3:
        # Rotated text: spans have no "matrix" key and insert_text has no
        # matrix argument, so this sketch leaves such text untouched.
        return None
    return {
        "font": base14_font(span["flags"]),
        "size": span["size"],
        "color": srgb_to_floats(span["color"]),
        "origin": fitz.Point(rect.x0, span["origin"][1]),
    }


# ---------- Regex + fuzzy replacement on a single page ----------
def replace_on_page(
    page: fitz.Page,
    regex_map: Dict[Pattern, str],
    fuzzy_map: Dict[str, Tuple[str, int]],
):
    """
    regex_map: { compiled_pattern: replacement }
    fuzzy_map: { target_string: (replacement, threshold) }
    """
    raw_words = page.get_text("words")
    if not raw_words:
        return
    words = [
        {"rect": fitz.Rect(w[:4]), "text": w[4], "line": (w[5], w[6])}
        for w in raw_words
    ]
    blocks = page.get_text("dict")["blocks"]  # read once per page
    full_text, index_map = build_text_stream(words)
    replacements: List[Tuple[fitz.Rect, dict, str]] = []
    used = set()  # word indices already claimed by a replacement

    # ----- Regex replacements (case-insensitive, unicode-aware) -----
    for pattern, replacement in regex_map.items():
        for match in pattern.finditer(full_text):
            start, end = match.span()
            touched = sorted({index_map[pos][0] for pos in range(start, end)} - {None})
            if not touched or used.intersection(touched):
                continue
            # A merged box only makes sense within one line of text
            if len({words[i]["line"] for i in touched}) > 1:
                continue
            merged_rect = fitz.Rect(words[touched[0]]["rect"])
            for i in touched[1:]:
                merged_rect |= words[i]["rect"]
            style = style_for_rect(blocks, merged_rect)
            if style is None:
                continue
            replacements.append((merged_rect, style, replacement))
            used.update(touched)

    # ----- Fuzzy replacements (word-level similarity score, 0..100) -----
    # fuzzy_map: { "target": ("replacement", threshold) }
    for i, w in enumerate(words):
        if i in used:
            continue
        for target, (replacement, threshold) in fuzzy_map.items():
            if fuzz.ratio(w["text"], target) < threshold:
                continue
            style = style_for_rect(blocks, w["rect"])
            if style is not None:
                replacements.append((w["rect"], style, replacement))
                used.add(i)
            break  # at most one fuzzy replacement per word

    # ----- Apply redactions -----
    for rect, _, _ in replacements:
        page.add_redact_annot(rect)
    if replacements:
        page.apply_redactions()

    # ----- Insert replacements -----
    for _, style, replacement_text in replacements:
        page.insert_text(
            style["origin"],
            replacement_text,
            fontname=style["font"],
            fontsize=style["size"],
            color=style["color"],
        )
# ---------- Per-PDF processing ----------
def process_single_pdf(
    input_path: Path,
    output_path: Path,
    regex_map: Dict[Pattern, str],
    fuzzy_map: Dict[str, Tuple[str, int]],
) -> Tuple[Path, bool, Optional[str]]:
    try:
        doc = fitz.open(input_path)
        for page in doc:
            replace_on_page(page, regex_map, fuzzy_map)
        output_path.parent.mkdir(parents=True, exist_ok=True)
        doc.save(output_path, garbage=3, deflate=True)
        doc.close()
        return input_path, True, None
    except Exception as e:
        logging.exception(f"Error processing {input_path}: {e}")
        return input_path, False, str(e)


# ---------- Batch + parallel ----------
def batch_process_pdfs(
    input_dir: Path,
    output_dir: Path,
    regex_map: Dict[Pattern, str],
    fuzzy_map: Dict[str, Tuple[str, int]],
    patterns: Iterable[str] = ("*.pdf",),
    recursive: bool = True,
    max_workers: int = 4,
):
    if recursive:
        files = [p for pattern in patterns for p in input_dir.rglob(pattern)]
    else:
        files = [p for pattern in patterns for p in input_dir.glob(pattern)]
    if not files:
        logging.warning("No PDF files found.")
        return
    logging.info(f"Found {len(files)} PDF(s) to process.")
    tasks = []
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        for pdf_path in files:
            rel = pdf_path.relative_to(input_dir)
            out_path = output_dir / rel
            tasks.append(
                executor.submit(
                    process_single_pdf,
                    pdf_path,
                    out_path,
                    regex_map,
                    fuzzy_map,
                )
            )
        for f in tqdm(as_completed(tasks), total=len(tasks), desc="Processing PDFs"):
            input_path, ok, err = f.result()
            if ok:
                logging.info(f"Processed: {input_path}")
            else:
                logging.error(f"Failed: {input_path} | {err}")
# ---------- Example configuration ----------
if __name__ == "__main__":
    input_root = Path("input_pdfs")
    output_root = Path("output_pdfs")
    # Case-insensitive regex patterns (str patterns are Unicode-aware by default;
    # re.UNICODE is kept only to make that explicit)
    regex_map: Dict[Pattern, str] = {
        re.compile(r"\binvoice\s+\d{4}\b", re.IGNORECASE | re.UNICODE): "Invoice [REDACTED]",
        re.compile(r"\bcompany\s+abc\b", re.IGNORECASE | re.UNICODE): "Company XYZ",
        re.compile(r"\btotal:\s+£\d+\.\d{2}\b", re.IGNORECASE | re.UNICODE): "Total: £0.00",
    }
    # Fuzzy targets: "target": ("replacement", threshold)
    # Matching is done word by word, so each target must be a single word
    fuzzy_map: Dict[str, Tuple[str, int]] = {
        "AcmeCorp": ("Acme Corporation", 85),
        "Foldng": ("Folding", 80),
    }
    batch_process_pdfs(
        input_dir=input_root,
        output_dir=output_root,
        regex_map=regex_map,
        fuzzy_map=fuzzy_map,
        patterns=("*.pdf",),
        recursive=True,
        max_workers=8,
    )
If you tell me roughly how messy your PDFs are (scanned vs digital, lots of rotation, etc.), I can suggest small tweaks—like widening intersection tolerances or adding per‑page debug dumps—to make this rock‑solid for your dataset.