searching pdf files for certain info

rbt · Feb 22, 2005

Not really a Python question... but here goes: Is there a way to read
the content of a PDF file and decode it with Python? I'd like to read
PDFs, decode them, and then search the data for certain strings.

Thanks, rbt

Diez B. Roggisch · Feb 22, 2005

rbt said:
Not really a Python question... but here goes: Is there a way to read
the content of a PDF file and decode it with Python? I'd like to read
PDFs, decode them, and then search the data for certain strings.

There is a commercial tool, pdflib, available that might help. It has a free
evaluation version and Python bindings.

If it's only about text, maybe pdf2text helps.

Andreas Lobinger · Feb 22, 2005

Aloha,

rbt said:
Not really a Python question... but here goes: Is there a way to read
the content of a PDF file and decode it with Python? I'd like to read
PDFs, decode them, and then search the data for certain strings.

First of all,
http://groups.google.de/[email protected]&output=gplain
still applies here.

If you can deal with a very basic implementation of a pdf-lib you
might be interested in
http://sourceforge.net/projects/pdfplayground

In the CVS (or the current snapshot) you can find an example of text
extraction in ppg/Doc/text_extract.txt.

>>> import pdffile
>>> import pages
>>> import pdftool
>>> import zlib
>>> pf = pdffile.pdffile('../pdf-testset1/a.pdf')
>>> pp = pages.pages(pf)
>>> # decompress the content stream of the first page
>>> c = zlib.decompress(pf[pp.pagelist[0]['/Contents']].stream)
>>> op = pdftool.parse_content(c)
>>> # keep only the operands of the text-showing operators ' and Tj
>>> sop = [x[1] for x in op if x[0] in ["'", "Tj"]]
>>> for a in sop:
...     print a[0]
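
For readers without pdfplayground at hand, the idea in the session above (decompress a page's content stream, then keep the operands of the text-showing operators) can be sketched with the standard library alone. This is only a rough approximation under stated assumptions: the stream is Flate-compressed, text is written as literal strings in parentheses, and TJ arrays, hex strings, and font encodings are ignored. The name `extract_shown_text` and the sample stream are made up for illustration and are not part of any library.

```python
import re
import zlib

# A literal string operand followed by the Tj or ' operator, e.g. "(Hello) Tj".
# Backslash escapes inside the string are allowed; nested parentheses are not.
SHOW_TEXT = re.compile(rb"\(((?:\\.|[^\\()])*)\)\s*(?:Tj|')")


def extract_shown_text(compressed_stream):
    """Return the string operands of Tj and ' in a Flate-compressed stream."""
    content = zlib.decompress(compressed_stream)
    return [m.group(1).decode("latin-1") for m in SHOW_TEXT.finditer(content)]


# Hypothetical page content, built here only so the sketch is runnable.
sample = zlib.compress(b"BT /F1 12 Tf 72 720 Td (Hello) Tj (World) ' ET")
print(extract_shown_text(sample))  # ['Hello', 'World']
```

A real PDF needs a proper parser to locate the page objects and their streams first, which is the part pdffile and pages handle in the session above.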

Wishing a happy day
LOBI

rbt · Feb 22, 2005

Thanks guys... what if I convert it to PS via printing it to a file or
something? Would that make it easier to work with?

Andreas Lobinger · Feb 22, 2005

Aloha,

rbt said:
Thanks guys... what if I convert it to PS via printing it to a file or
something? Would that make it easier to work with?

Not really...
The classical PS drivers (e.g. Acroread4-Unix print -> ps) simply
define the PDF graphics and text operators as PS commands and
copy the PDF content directly.

Wishing a happy day
LOBI

rbt · Feb 22, 2005

Andreas said:
Aloha,

Not really...
The classical PS drivers (e.g. Acroread4-Unix print -> ps) simply
define the PDF graphics and text operators as PS commands and
copy the PDF content directly.

Wishing a happy day
LOBI

I downloaded ghostscript for Win32 and added it to my PATH
(C:\gs\gs8.15\lib AND C:\gs\gs8.15\bin). I found that ps2ascii works
well on PDF files and it's entirely free.

Usage:

ps2ascii PDF_file.pdf > ASCII_file.txt

However, bundling a 9+ MB package with a 5K script and convincing users
to install it is another matter altogether.
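
Since the script depends on an external program that users may not have installed, one option is to check for it at startup and fail with a readable message. The following is a minimal sketch, assuming ps2ascii writes to standard output when given only an input file (as in the usage line above); the function names `find_converter` and `pdf_to_ascii` are hypothetical.

```python
import shutil
import subprocess


def find_converter(name="ps2ascii"):
    """Return the full path of the converter if it is on PATH, else None."""
    return shutil.which(name)


def pdf_to_ascii(pdf_path, converter="ps2ascii"):
    """Run the converter on pdf_path and return the text it writes to stdout."""
    exe = find_converter(converter)
    if exe is None:
        raise RuntimeError(
            converter + " not found on PATH; is Ghostscript installed?"
        )
    # check=True raises CalledProcessError if the converter exits with an error
    result = subprocess.run(
        [exe, pdf_path], capture_output=True, text=True, check=True
    )
    return result.stdout
```

This does not remove the need to install Ghostscript, but it turns a missing dependency into a clear error instead of a confusing failure later on.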

Tom Willis · Feb 22, 2005

I tried that for something not python related and I was getting
sporadic spaces everywhere.

I am assuming this is not the case in your experience?

rbt · Feb 22, 2005

Tom said:
I tried that for something not python related and I was getting
sporadic spaces everywhere.

I am assuming this is not the case in your experience?

For my purpose, it works fine. I'm searching for certain strings that
might be in the document... all I need is a readable file. Layout, fonts
and presentation are unimportant to me.

Kartic · Feb 22, 2005

rbt said the following on 2/22/2005 8:53 AM:

Not really a Python question... but here goes: Is there a way to read
the content of a PDF file and decode it with Python? I'd like to read
PDFs, decode them, and then search the data for certain strings.

Thanks, rbt

Hi,

Try pdftotext, which is part of the Xpdf project. pdftotext extracts
textual information from a PDF file to an output text file of your
choice. I have used it in the past (not with Python) to do what you are
attempting. It is a small program, and you can invoke it from Python and
search for the string/pattern you want.

You can download it for your OS from:
http://www.foolabs.com/xpdf/download.html
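
Invoking pdftotext from Python and searching its output could look roughly like this. It is a sketch, assuming pdftotext is on the PATH and relying on its documented behaviour of sending text to standard output when the output file is given as "-"; the helper names `pdftotext_stdout` and `find_matches` are invented for this example.

```python
import re
import subprocess


def pdftotext_stdout(pdf_path):
    """Run pdftotext and return the extracted text ('-' means stdout)."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def find_matches(text, pattern):
    """Return (line_number, line) pairs for lines matching the regex pattern."""
    regex = re.compile(pattern)
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if regex.search(line)
    ]
```

A call such as `find_matches(pdftotext_stdout("report.pdf"), r"invoice")` would then list the matching lines, where `report.pdf` and the pattern are placeholders.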

Thanks,
-Kartic

Tom Willis · Feb 23, 2005

Well, sporadic spaces in strings would cause problems, would they not?

An example:

The string: "Patient Face Sheet"--->pdftotext--->"P a tie n t Face Sheet"

I'm just curious if you see anything like that, since I really have no
clue about PS or PDF etc... but I have a strong desire to replace a
really flaky commercial tool. And if I can do it with free stuff, all
the better; my boss will love me.
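
If the goal is only to find a known phrase in such output, one possible workaround is to search with a pattern that tolerates whitespace between any two characters. This is a sketch of that idea, not a fix for the extraction itself, and `spaced_pattern` is a made-up helper name.

```python
import re


def spaced_pattern(phrase):
    """Compile a regex that matches phrase even if whitespace was inserted
    between its characters; whitespace in the phrase itself is ignored."""
    chars = [re.escape(c) for c in phrase if not c.isspace()]
    return re.compile(r"\s*".join(chars))


pattern = spaced_pattern("Patient Face Sheet")
print(bool(pattern.search("P a tie n t Face Sheet")))  # True
```

The trade-off is that word boundaries are lost: the same pattern also matches "PatientFaceSheet", so it suits searching for distinctive phrases rather than recovering the original layout.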

rbt · Feb 23, 2005

Tom said:
Well, sporadic spaces in strings would cause problems, would they not?

An example:

The string: "Patient Face Sheet"--->pdftotext--->"P a tie n t Face Sheet"

I'm just curious if you see anything like that, since I really have no
clue about PS or PDF etc... but I have a strong desire to replace a
really flaky commercial tool. And if I can do it with free stuff, all
the better; my boss will love me.

No, I do not see that type of behavior. I'm looking for strings that
resemble SS numbers. So my strings look like this: nnn-nn-nnnn.

The ps2ascii util in ghostscript reproduces strings in the format that I
expect. BTW, I'm not using pdftotext. I'm using *ps2ascii*.
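
A search for strings of the form nnn-nn-nnnn in the converted text can be sketched with a regular expression. The name `find_ssn_like` and the sample text are hypothetical, and the pattern checks only the shape of the string, not whether it is a valid number.

```python
import re

# Three digits, two digits, four digits, separated by hyphens.
# The \b anchors keep the pattern from matching inside longer digit runs.
SSN_LIKE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def find_ssn_like(text):
    """Return every substring of text shaped like nnn-nn-nnnn."""
    return SSN_LIKE.findall(text)


sample = "ref 123-45-6789, phone 555-1234, id 9876-54-3210"
print(find_ssn_like(sample))  # ['123-45-6789']
```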

Tom Willis · Feb 23, 2005

Ah, that makes sense. I only see the behavior in pdftotext. ps2ascii
doesn't give me the layout, which, for my purposes, I certainly need.

Thanks for the info. Looks like I'll keep searching for that silver bullet.

Follower · Feb 25, 2005

rbt said:
Not really a Python question... but here goes: Is there a way to read
the content of a PDF file and decode it with Python? I'd like to read
PDFs, decode them, and then search the data for certain strings.

I've had success with both:

<http://www.boddie.org.uk/david/Projects/Python/pdftools/>

<http://www.adaptive-enterprises.com.au/~d/software/pdffile/pdffile.py>

although my preference is for the latter as it transparently handles
decryption. (I've previously posted an enhancement to the `pdftools`
utility that adds decryption handling to it, but now use the `pdffile`
library as it handles it better.)

The ease of text extraction depends a lot on how the PDFs have been
created.

--Phil.
