Extracting Specific Text Segments from Multiple PDFs Using Python
Implementation
To isolate specific text bounded by defined markers within multiple PDF documents, we can use the PyPDF2 library. The following script defines a function that scans each page of a given document, locates the start and end markers, and retrieves the content between them.
import PyPDF2
from pathlib import Path


def isolate_document_content(doc_path, marker_start, marker_end):
    """Return the text found between the two markers on each page of a PDF."""
    accumulated_content = []
    with open(doc_path, 'rb') as binary_file:
        reader = PyPDF2.PdfReader(binary_file)
        for sheet in reader.pages:
            # Fall back to an empty string if the page yields no text.
            raw_text = sheet.extract_text() or ""
            pos_start = raw_text.find(marker_start)
            pos_end = raw_text.find(marker_end)
            # Both markers must appear on this page, in order.
            if pos_start != -1 and pos_end != -1 and pos_start < pos_end:
                # Skip past the start marker so it is excluded from the result.
                isolated_segment = raw_text[pos_start + len(marker_start):pos_end]
                accumulated_content.append(isolated_segment)
    return "".join(accumulated_content)


target_docs = ['report1.pdf', 'report2.pdf']
start_marker = "Introduction"
end_marker = "Conclusion"

for doc in target_docs:
    # Only process files that actually exist on disk.
    if Path(doc).exists():
        segment = isolate_document_content(doc, start_marker, end_marker)
        print(segment)
Code Logic Breakdown
The process begins by opening the target PDF in binary read mode ('rb'), which is required for parsing the binary structure of PDF files. The with context manager guarantees that the file handle is released properly after processing.
An instance of PyPDF2.PdfReader is initialized to parse the document. We then iterate through the reader.pages collection. For each sheet, the extract_text() method converts the page layout into a raw string.
String slicing is performed by calculating the index positions of the boundary markers using the find() method. If both markers exist on the same page and the start position precedes the end position, the substring between them is extracted. This substring is adjusted to exclude the start marker itself by offsetting the start index with its own length.
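The find-and-slice step can be tried on a plain string, without any PDF involved. The following is a minimal sketch under stated assumptions: slice_between is a hypothetical helper name, and unlike the script it looks for the end marker only after the start marker, so an earlier stray occurrence of the end marker does not cause the match to be skipped.

```python
def slice_between(text, marker_start, marker_end):
    """Return the text between the start marker and the next end marker.

    Hypothetical helper for illustration; returns None when either
    marker is missing or the end marker does not follow the start.
    """
    pos_start = text.find(marker_start)
    if pos_start == -1:
        return None
    # Offset by the marker's own length so the marker is excluded.
    content_start = pos_start + len(marker_start)
    # Search for the end marker only after the start marker.
    pos_end = text.find(marker_end, content_start)
    if pos_end == -1:
        return None
    return text[content_start:pos_end]


sample = "Title Introduction This is the body. Conclusion Thanks."
print(repr(slice_between(sample, "Introduction", "Conclusion")))
# prints ' This is the body. '
```

The surrounding spaces are kept in the result; calling .strip() on the returned string would remove them if that is preferred.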
All matching segments are aggregated into a list and ultimately joined into a single output string.
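Because the script matches markers page by page, a section that starts on one page and ends on another is not captured. One possible variation, sketched here as an assumption rather than as part of the original script, is to join the page texts first and then search the combined string. The name extract_across_pages is hypothetical, and the function takes a list of already extracted page strings so it can be tried without a PDF.

```python
def extract_across_pages(page_texts, marker_start, marker_end):
    """Collect every start...end segment from the combined page texts.

    Hypothetical helper: page_texts is a list of strings (or None),
    one entry per page, in document order.
    """
    if not marker_start or not marker_end:
        raise ValueError("markers must be non-empty strings")
    # Treat missing page text as empty and join pages with newlines.
    full_text = "\n".join(text or "" for text in page_texts)
    segments = []
    cursor = 0
    while True:
        pos_start = full_text.find(marker_start, cursor)
        if pos_start == -1:
            break
        content_start = pos_start + len(marker_start)
        pos_end = full_text.find(marker_end, content_start)
        if pos_end == -1:
            break
        segments.append(full_text[content_start:pos_end])
        # Continue searching after the end marker just consumed.
        cursor = pos_end + len(marker_end)
    return segments


pages = ["Cover page", "Introduction\nFirst part", "second part\nConclusion"]
print(extract_across_pages(pages, "Introduction", "Conclusion"))
# prints ['\nFirst part\nsecond part\n']
```

With PyPDF2, the input list could be built as [page.extract_text() for page in reader.pages]. Joining with a newline is a design choice; a marker that is itself split across a page break would still be missed.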
Limitations
The extract_text() method relies on internal PDF text coordinates and may produce fragmented or inaccurate results for documents with complex tables, multicolumn layouts, or embedded graphics. Furthermore, scanned documents where text is rendered as images yield little or no text with this approach, and password-protected PDFs are not handled by the script as written, since it never supplies a password to the reader. Optical Character Recognition (OCR) tools must be employed for image-based text extraction.
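A simple way to spot pages that may need OCR is to check which pages produced no text at all. The sketch below is a heuristic under stated assumptions, not a reliable detector: pages_without_text is a hypothetical helper, it works on a list of already extracted page strings, and an empty result can also come from a blank page rather than a scanned one.

```python
def pages_without_text(page_texts):
    """Return 1-based page numbers whose extracted text is empty.

    Hypothetical helper: None and whitespace-only strings both count
    as empty.
    """
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if not (text or "").strip()
    ]


extracted = ["Introduction ...", "", None, "   \n", "Conclusion"]
print(pages_without_text(extracted))
# prints [2, 3, 4]
```

Pages flagged this way could then be passed to an OCR tool, while the remaining pages go through the marker search as before.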