Searching PDF files for certain info

rbt

Not really a Python question... but here goes: Is there a way to read
the content of a PDF file and decode it with Python? I'd like to read
PDF's, decode them, and then search the data for certain strings.

Thanks, rbt
 
Diez B. Roggisch

rbt said:
Not really a Python question... but here goes: Is there a way to read
the content of a PDF file and decode it with Python? I'd like to read
PDF's, decode them, and then search the data for certain strings.

There is a commercial tool, pdflib, available that might help. It has a free
evaluation version and Python bindings.

If it's only about text, maybe pdf2text helps.
 
Andreas Lobinger

Aloha,
Not really a Python question... but here goes: Is there a way to read
the content of a PDF file and decode it with Python? I'd like to read
PDF's, decode them, and then search the data for certain strings.

First of all,
http://groups.google.de/[email protected]&output=gplain
still applies here.

If you can deal with a very basic implementation of a pdf-lib you
might be interested in
http://sourceforge.net/projects/pdfplayground

In the CVS (or the current snapshot) you can find in
ppg/Doc/text_extract.txt an example for text extraction.
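>>> import pdffile
>>> import pages
>>> import pdftool  # provides parse_content, used below
>>> import zlib
>>> pf = pdffile.pdffile('../pdf-testset1/a.pdf')
>>> pp = pages.pages(pf)
>>> # decompress the content stream of the first page
>>> c = zlib.decompress(pf[pp.pagelist[0]['/Contents']].stream)
>>> op = pdftool.parse_content(c)
>>> # keep only the operands of the text-showing operators ' and Tj
>>> sop = [x[1] for x in op if x[0] in ["'", "Tj"]]
>>> for a in sop:
...     print(a[0])
...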

Wishing a happy day
LOBI
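
For readers without pdfplayground at hand, the idea behind the example above (decompress the page's content stream, then collect the operands of the text-showing operators `Tj` and `'`) can be sketched with the standard library alone. This is only an illustration under stated assumptions: the content stream is made up, and the regular expression handles only plain literal strings, with no escaped or nested parentheses, no `TJ` arrays, and no font encodings.

```python
import re
import zlib

# Hypothetical page content stream, compressed the way a Flate-encoded
# stream would be stored inside a PDF file.
raw_stream = zlib.compress(
    b"BT /F1 12 Tf 72 720 Td (Hello) Tj 0 -14 Td (World) ' ET"
)

# A literal string "(...)" followed by the Tj or ' operator.
# Assumption: no escaped or nested parentheses inside the string.
SHOW_TEXT = re.compile(rb"\(([^()\\]*)\)\s*(?:Tj|')")


def extract_shown_text(compressed):
    """Return the strings shown by Tj and ' in a Flate-compressed stream."""
    content = zlib.decompress(compressed)
    return [m.group(1).decode("latin-1") for m in SHOW_TEXT.finditer(content)]


print(extract_shown_text(raw_stream))
```

Real PDFs often split words across several operators or use custom font encodings, so a proper parser is likely to be needed for anything beyond simple files.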
 
rbt

Thanks guys... what if I convert it to PS via printing it to a file or
something? Would that make it easier to work with?
 
Andreas Lobinger

Aloha,
Thanks guys... what if I convert it to PS via printing it to a file or
something? Would that make it easier to work with?

Not really...
The classic PS drivers (e.g. Acroread 4 on Unix, print to PS) simply
define the PDF graphics and text operators as PS commands and
copy the PDF content directly.

Wishing a happy day
LOBI
 
rbt

I downloaded ghostscript for Win32 and added it to my PATH
(C:\gs\gs8.15\lib AND C:\gs\gs8.15\bin). I found that ps2ascii works
well on PDF files and it's entirely free.

Usage:

ps2ascii PDF_file.pdf > ASCII_file.txt

However, bundling a 9+ MB package with a 5K script and convincing users
to install it is another matter altogether.
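
One way to soften the installation problem is to let the script check at startup whether a converter is already on the user's PATH and give a clear message if none is. A minimal sketch, assuming the two tool names mentioned in this thread; the order of preference and the wording of the message are arbitrary choices.

```python
import shutil

# Candidate converters in order of preference. These are the two tools
# named in this thread; the ordering is an arbitrary assumption.
CANDIDATES = ("ps2ascii", "pdftotext")


def find_converter(candidates=CANDIDATES):
    """Return (name, full_path) of the first tool found on PATH, else None."""
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return name, path
    return None


found = find_converter()
if found is None:
    print("No PDF-to-text converter found; please install Ghostscript or Xpdf.")
else:
    print("Using %s at %s" % found)
```

This does not remove the dependency, but the 5K script then fails with an explanation instead of a traceback.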
 
Tom Willis

I tried that for something not Python-related and I was getting
sporadic spaces everywhere.

I am assuming this is not the case in your experience?
 
rbt

For my purpose, it works fine. I'm searching for certain strings that
might be in the document... all I need is a readable file. Layout, fonts,
and presentation are unimportant to me.
 
Kartic

rbt said the following on 2/22/2005 8:53 AM:
Not really a Python question... but here goes: Is there a way to read
the content of a PDF file and decode it with Python? I'd like to read
PDF's, decode them, and then search the data for certain strings.

Thanks, rbt

Hi,

Try pdftotext, which is part of the XPdf project. pdftotext extracts
textual information from a PDF file to an output text file of your
choice. I have used it in the past (not with Python) to do what you are
attempting. It is a small program, and you can invoke it from Python and
search its output for the string or pattern you want.

You can download for your OS from:
http://www.foolabs.com/xpdf/download.html

Thanks,
-Kartic
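
Invoking pdftotext from Python and searching the result could look roughly like the sketch below. It assumes pdftotext is on the PATH and relies on its convention that `-` as the output file name sends the text to stdout. The helper names and the UTF-8 decoding are assumptions for illustration; the actual output encoding depends on the pdftotext build and its `-enc` setting.

```python
import re
import subprocess


def pdftotext_command(pdf_path):
    """Build the command line; '-' asks pdftotext to write to stdout."""
    return ["pdftotext", pdf_path, "-"]


def grep_text(text, pattern):
    """Return (line_number, line) pairs for lines matching the regex."""
    rx = re.compile(pattern)
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), 1)
        if rx.search(line)
    ]


def search_pdf(pdf_path, pattern):
    """Convert one PDF with pdftotext and search the resulting text."""
    result = subprocess.run(
        pdftotext_command(pdf_path), capture_output=True, check=True
    )
    # Assumed encoding; undecodable bytes are replaced rather than raising.
    return grep_text(result.stdout.decode("utf-8", errors="replace"), pattern)
```

A call such as `search_pdf("report.pdf", r"invoice")` (file name hypothetical) would return the matching lines with their line numbers, or raise `CalledProcessError` if pdftotext fails.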
 
Tom Willis

Well, sporadic spaces in strings would cause problems, would they not?

An example:


The String: "Patient Face Sheet"--->pdftotext--->"P a tie n t Face Sheet"

I'm just curious whether you see anything like that, since I really have no
clue about PS or PDF, but I have a strong desire to replace a
really flaky commercial tool. And if I can do it with free stuff, all
the better: my boss will love me.
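
If the stray spaces cannot be avoided at extraction time, one possible workaround when only searching is to make the search itself tolerant of them. The sketch below builds a regular expression that allows any amount of whitespace between the characters of the phrase; the function name is made up, and the approach can produce false positives because it also ignores the real word boundaries.

```python
import re


def spaced_pattern(phrase):
    """Compile a regex matching phrase with optional whitespace between characters."""
    chars = [re.escape(c) for c in phrase if not c.isspace()]
    return re.compile(r"\s*".join(chars))


rx = spaced_pattern("Patient Face Sheet")
print(bool(rx.search("P a tie n t Face Sheet")))
```

This helps only with finding known strings; it does not repair the extracted text or recover the layout.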


 
rbt

No, I do not see that type of behavior. I'm looking for strings that
resemble SS numbers. So my strings look like this: nnn-nn-nnnn.

The ps2ascii util in ghostscript reproduces strings in the format that I
expect. BTW, I'm not using pdftotext. I'm using *ps2ascii*.
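
A search for strings of the form nnn-nn-nnnn in the converted text might be written as below. This is a sketch that matches the shape only: it does not check whether a match is a valid number, and the sample text is invented.

```python
import re

# nnn-nn-nnnn, not embedded in a longer run of digits.
SSN_LIKE = re.compile(r"(?<!\d)\d{3}-\d{2}-\d{4}(?!\d)")


def find_ssn_like(text):
    """Return every substring of text shaped like nnn-nn-nnnn."""
    return SSN_LIKE.findall(text)


sample = "ID 123-45-6789 on file; phone 555-1234; ref 9123-45-67890."
print(find_ssn_like(sample))
```

The lookarounds keep the pattern from firing inside longer digit runs such as the last item in the sample.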
 
Tom Willis

Ah, that makes sense. I only see the behavior in pdftotext. ps2ascii
doesn't give me the layout, which for my purposes I certainly need.

Thanks for the info. Looks like I'll keep searching for that silver bullet. :(
 
Follower

rbt said:
Not really a Python question... but here goes: Is there a way to read
the content of a PDF file and decode it with Python? I'd like to read
PDF's, decode them, and then search the data for certain strings.

I've had success with both:

<http://www.boddie.org.uk/david/Projects/Python/pdftools/>

<http://www.adaptive-enterprises.com.au/~d/software/pdffile/pdffile.py>

although my preference is for the latter as it transparently handles
decryption. (I've previously posted an enhancement to the `pdftools`
utility that adds decryption handling to it, but now use the `pdffile`
library as it handles it better.)

The ease of text extraction depends a lot on how the PDFs have been
created.

--Phil.
 
