Python PyPDF4 code to bookmark a PDF based on date text

I am trying to create a Python function that will transform a PDF file. Specifically, I want to create bookmarks based on text in the PDF. The PDF is a set of medical treatment notes from various dates. This particular set of medical records gives the treatment date in the page header in the form "Visit date: ##/##/####". So the treatment records for 01/01/2022 should get a bookmark titled 01/01/2022, and so on.

This code runs and creates a new PDF file, "NewPDF2.pdf", that is identical to "NewMedical.pdf". However, there are no bookmarks. I cannot figure out what I am doing wrong.

Code:
import PyPDF4
import re
 
# Open the PDF file for reading
pdf_file = open(r"C:\Users\StanleyDenman\Documents\NewMedical.pdf", 'rb')
pdf_reader = PyPDF4.PdfFileReader(pdf_file)
pdf_writer = PyPDF4.PdfFileWriter()
 
 
# Define the regular expression for finding the bookmark locations
regex = re.compile("Visit date: '\b\d{2}/\d{2}/\d{4}\b'")
 
# Iterate through the pages of the PDF
for i in range(len(pdf_reader.pages)):
    page = pdf_reader.getPage(i)
    text = page.extractText()
    matches = re.findall(regex, text)
    pdf_writer.addPage(page)
    
 
for match in matches:
   #  Create a bookmark for each match
       bookmark = PyPDF4.pdf.Destination()
       bookmark.page = pdf_writer.addpage()
       bookmark.title = match.group(1)
    
 
# Write the new PDF file with bookmarks
output_file = open('NewPDF2.pdf', 'wb')
pdf_writer.write(output_file)
output_file.close()
pdf_file.close()
 
This code has problems in several places. Here are the main ones:

  1. The regular expression is incorrect. The single quotes inside the pattern are matched literally, and because the pattern is not a raw string, \b is read as a backspace character rather than a word boundary, so the pattern never matches. Use a raw string and capture just the date: regex = re.compile(r"Visit date: (\d{2}/\d{2}/\d{4})").
  2. re.findall returns a list of strings, not match objects, so match.group(1) cannot be used. With one capture group in the pattern, each item is already the date string and can be used as the bookmark title directly.
  3. The "for match in matches" loop sits outside the page loop, so it only sees the matches from the last page and no longer knows which page they came from. Because the pattern matched nothing, the loop body never ran, which is why the script finished without an error. Had it run, it would have failed: Destination() needs arguments, and the writer method is spelled addPage and needs a page argument.
  4. Bookmarks are not created by building a PyPDF4.pdf.Destination object yourself. Call pdf_writer.addBookmark(title, pagenum) instead, where pagenum is the zero-based index of the page in the writer.
Here's a corrected version of the code:

Python:
import re

import PyPDF4

# Match the header text and capture only the date, e.g. "01/01/2022"
regex = re.compile(r"Visit date: (\d{2}/\d{2}/\d{4})")

# Open the PDF file for reading; it must stay open until the output is written
with open(r"C:\Users\StanleyDenman\Documents\NewMedical.pdf", 'rb') as pdf_file:
    pdf_reader = PyPDF4.PdfFileReader(pdf_file)
    pdf_writer = PyPDF4.PdfFileWriter()

    # Copy each page and bookmark every visit date found on it
    for i in range(len(pdf_reader.pages)):
        page = pdf_reader.getPage(i)
        pdf_writer.addPage(page)
        text = page.extractText()

        # findall returns the captured date strings
        for date in regex.findall(text):
            # Page i of the reader is also page i of the writer
            pdf_writer.addBookmark(date, i)

    # Write the new PDF file with bookmarks
    with open('NewPDF2.pdf', 'wb') as output_file:
        pdf_writer.write(output_file)
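If you want to check the pattern without opening a PDF, a quick test along these lines may help. It is only a sketch: the sample text is made up, and the real output of extractText() may have different spacing or line breaks, so the pattern might need adjusting for your file.

```python
import re

# The pattern from the question, written out so that it gives the same string
# without escape warnings: "\b" in a normal string is a backspace character,
# and the single quotes are matched literally.
broken = re.compile("Visit date: '\b\\d{2}/\\d{2}/\\d{4}\b'")

# Raw string, no stray quotes, and a capture group around the date only.
fixed = re.compile(r"Visit date: (\d{2}/\d{2}/\d{4})")

# Made-up header text standing in for what extractText() might return.
sample = "Patient notes\nVisit date: 01/01/2022\nSubjective: ..."

broken_matches = broken.findall(sample)
fixed_matches = fixed.findall(sample)

print(broken_matches)  # []
print(fixed_matches)   # ['01/01/2022']
```

If fixed_matches comes back empty on your real page text, print the extracted text and compare it with the pattern character by character.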
 
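One more thing to consider: since the visit date is in the header, a multi-page visit will probably match on every one of its pages, and a loop that bookmarks every match would then add the same title several times. A possible way around that is to keep only the first page for each date. The helper below is a hypothetical sketch (first_page_per_date and the sample data are my own names, not part of PyPDF4) that works on plain lists so it can be tried without a PDF.

```python
def first_page_per_date(dates_by_page):
    """Return (date, page_index) pairs, keeping only the first page per date.

    dates_by_page: a list where item i holds the date strings found on page i.
    """
    seen = set()
    bookmarks = []
    for page_index, dates in enumerate(dates_by_page):
        for date in dates:
            if date not in seen:
                seen.add(date)
                bookmarks.append((date, page_index))
    return bookmarks


# Made-up results for a four-page file: two pages for the first visit,
# one page for the second visit, and a last page with no header match.
pages = [["01/01/2022"], ["01/01/2022"], ["02/15/2022"], []]
result = first_page_per_date(pages)
print(result)  # [('01/01/2022', 0), ('02/15/2022', 2)]
```

In the real script you would collect regex.findall(text) for each page into a list like this, then call pdf_writer.addBookmark(date, page_index) for each pair after all pages have been added.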
