Fw: PDF library for reading PDF files

Peter Galfi · Jan 18, 2004

Hi!

I am looking for a library in Python that would read PDF files and I could extract information from the PDF with it. I have searched with google, but only found libraries that can be used to write PDF files.

Any ideas?

Peter

Harald Massa · Jan 18, 2004

I am looking for a library in Python that would read PDF files and I
could extract information from the PDF with it. I have searched with
google, but only found libraries that can be used to write PDF files.

reportlab has a lib called pagecatcher; it is fully supported with python,
it is not free.

Harald

David Boddie · Jan 18, 2004

Peter Galfi said:
I am looking for a library in Python that would read PDF files and I
could extract information from the PDF with it. I have searched with
google, but only found libraries that can be used to write PDF files.

Any ideas?

I quickly searched back through Google, but I knew exactly what I was
looking for: ;-)

http://groups.google.com/[email protected]

The page referred to is here:

http://www.boddie.org.uk/david/Projects/Python/pdftools/

The module is very much a "work in progress". You can probably get
some text and bitmap images out of a few documents, but that's
probably all you can expect unless you want to improve it (and
submit patches).

Good luck!

David

Cameron Laird · Jan 18, 2004

reportlab has a lib called pagecatcher; it is fully supported with python,
it is not free.

Harald

ReportLab's libraries are great things--but they do not "extract
information from the PDF" in the sense I believe the original
questioner intended. As Andreas suggested, he's probably best
off using existing stand-alone applications as separate processes,
controlled from Python.
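
A minimal sketch of the "separate process, controlled from Python" approach described above. It assumes the `pdftotext` command-line tool (shipped with Xpdf and Poppler) is installed and on the PATH; the thread does not name a specific tool, so that choice and the helper name `pdf_to_text` are illustrative only.

```python
import subprocess


def pdf_to_text(pdf_path, converter=("pdftotext", "-layout")):
    """Run an external PDF-to-text converter and return what it prints.

    `converter` is the command plus its options. The default assumes
    pdftotext, which writes to stdout when the output file is "-".
    """
    cmd = [*converter, pdf_path, "-"]
    # check=True raises CalledProcessError if the tool exits with an error.
    completed = subprocess.run(cmd, capture_output=True, check=True)
    return completed.stdout.decode("utf-8", errors="replace")
```

Calling `pdf_to_text("report.pdf")` would return the extracted text as one string, which can then be post-processed in Python. Keeping the command configurable makes it easy to swap in a different converter.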

Robert Kern · Jan 18, 2004

Cameron said:
reportlab has a lib called pagecatcher; it is fully supported with python,
it is not free.

Harald

ReportLab's libraries are great things--but they do not "extract
information from the PDF" in the sense I believe the original
questioner intended.

No, but ReportLab (the company) has a product separate from reportlab
(the package) called PageCatcher that does exactly what the OP asked
for. It is not open source, however, and costs a chunk of change.

Cameron Laird · Jan 19, 2004

No, but ReportLab (the company) has a product separate from reportlab
(the package) called PageCatcher that does exactly what the OP asked
for. It is not open source, however, and costs a chunk of change.

Let's take this one step farther. Two posts now have
quite clearly recommended ReportLab's PageCatcher <URL:
http://reportlab.com/docs/pagecatcher-ds.pdf >. I
completely understand and agree that ReportLab supports
a mix of open-source, no-fee, and for-fee products, and
that PageCatcher carries a significant license fee. I
entirely agree that PageCatcher "read PDF files ...
and ... extract information from the PDF with it."

HOWEVER, I suspect that what the original questioner
meant by his words was some sort of PDF-to-text "extraction"
(true?) and, unless PageCatcher has changed a lot
since I got my last copy, PDF-to-text is NOT one of its
functions.

Robin Becker · Jan 19, 2004

No, but ReportLab (the company) has a product separate from reportlab
(the package) called PageCatcher that does exactly what the OP asked
for. It is not open source, however, and costs a chunk of change.

Let's take this one step farther. Two posts now have
quite clearly recommended ReportLab's PageCatcher <URL:
http://reportlab.com/docs/pagecatcher-ds.pdf >. I
completely understand and agree that ReportLab supports
a mix of open-source, no-fee, and for-fee products, and
that PageCatcher carries a significant license fee. I
entirely agree that PageCatcher "read PDF files ...
and ... extract information from the PDF with it."

HOWEVER, I suspect that what the original questioner
meant by his words was some sort of PDF-to-text "extraction"
(true?) and, unless PageCatcher has changed a lot
since I got my last copy, PDF-to-text is NOT one of its
functions.

I suspect Cameron is right. ReportLab does have a product called
pageCatcher, but its main function is to grab individual pages for
reuse. I believe it could be extended to go deeper and mess about with
text streams, but it certainly doesn't do that now and would take some
effort to do properly as text can be complicated in PDF (or postscript).

Andreas Lobinger · Jan 19, 2004

Aloha,

Peter Galfi wrote:
I am looking for a library in Python that would read PDF files and I
could extract information from the PDF with it. I have searched with
google, but only found libraries that can be used to write PDF files.
Any ideas?

Use file, split, zlib and a broad knowledge of the PDF-spec...

Accessing certain objects in the .pdf is not that complicated if,
for example, you only want to read the /Info dictionary. Getting text
from actual page content can be very complicated.

Can you explain your 'information' further?

Wishing a happy day
LOBI
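
A rough sketch of the "file, split, zlib" idea, using only the standard library. It assumes the streams of interest use the Flate (zlib) filter and that the file is not encrypted; the helper name `inflate_streams` is made up for illustration. It only recovers raw stream bodies (page content operators and the like), not readable text.

```python
import re
import zlib

# Raw bytes between the "stream" and "endstream" keywords of each object.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def inflate_streams(pdf_bytes):
    """Return the decompressed body of every stream that zlib can inflate."""
    results = []
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj ignores the end-of-line bytes that trail the
            # compressed data before "endstream".
            results.append(zlib.decompressobj().decompress(match.group(1)))
        except zlib.error:
            # Not Flate-compressed (or uses another filter); skip it.
            continue
    return results
```

Reading a file with `open(path, "rb").read()` and passing the bytes to `inflate_streams` gives a list of decompressed streams. Turning the text-showing operators inside them into words and paragraphs is the hard part that the rest of this thread discusses.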

Robert Kern · Jan 19, 2004

Cameron said:
Robert Kern said:

No, but ReportLab (the company) has a product separate from reportlab
(the package) called PageCatcher that does exactly what the OP asked
for. It is not open source, however, and costs a chunk of change.

Let's take this one step farther. Two posts now have
quite clearly recommended ReportLab's PageCatcher <URL:
http://reportlab.com/docs/pagecatcher-ds.pdf >. I
completely understand and agree that ReportLab supports
a mix of open-source, no-fee, and for-fee products, and
that PageCatcher carries a significant license fee. I
entirely agree that PageCatcher "read PDF files ...
and ... extract information from the PDF with it."

HOWEVER, I suspect that what the original questioner
meant by his words was some sort of PDF-to-text "extraction"
(true?) and, unless PageCatcher has changed a lot
since I got my last copy, PDF-to-text is NOT one of its
functions.

Rereading http://www.reportlab.com/PageCatchIntro.html , you're right.
My apologies. I thought you were talking about the open source reportlab
package and not PageCatcher specifically.

Peter Galfi · Jan 20, 2004

Thanks. I am studying the PDF spec, it just does not seem to be that easy
having to implement all the decompressions, etc. The "information" I am
trying to extract from the PDF file is the text, specifically in a way to
keep the original paragraphs of the text. I have seen so far one shareware
standalone tool that extracts the text (and a lot of other formatting
garbage) into an RTF document keeping the paragraphs as well. I would need
only the text.

Any suggestions?

Peter

----- Original Message -----
From: "Andreas Lobinger" <[email protected]>
Newsgroups: comp.lang.python
To: <[email protected]>
Sent: Monday, January 19, 2004 5:02 PM
Subject: Re: Fw: PDF library for reading PDF files

Aloha,

Peter Galfi wrote:
I am looking for a library in Python that would read PDF files and I
could extract information from the PDF with it. I have searched with
google, but only found libraries that can be used to write PDF files.
Any ideas?

Use file, split, zlib and a broad knowledge of the PDF-spec...

Accessing certain objects in the .pdf is not that complicated if
you f.e. try to read the /Info dictionary. Getting text from
actual page content could be very complicated.

Can you explain your 'information' further?

Wishing a happy day
LOBI

Josiah Carlson · Jan 20, 2004

Thanks. I am studying the PDF spec, it just does not seem to be that easy
having to implement all the decompressions, etc. The "information" I am
trying to extract from the PDF file is the text, specifically in a way to
keep the original paragraphs of the text. I have seen so far one shareware
standalone tool that extracts the text (and a lot of other formatting
garbage) into an RTF document keeping the paragraphs as well. I would need
only the text.

Any suggestions?

Peter,

Suggestion: extract the document to RTF using that other tool, then use
any one of the few dozen RTF parsers to convert them into plaintext.

- Josiah

Andreas Lobinger · Jan 20, 2004

Aloha,

Peter said:
Thanks. I am studying the PDF spec, it just does not seem to be that easy
having to implement all the decompressions, etc. The "information" I am
trying to extract from the PDF file is the text, specifically in a way to
keep the original paragraphs of the text. I have seen so far one shareware
standalone tool that extracts the text (and a lot of other formatting
garbage) into an RTF document keeping the paragraphs as well. I would need
only the text.

As others wrote here, the simplest solution is to use an external
pdf-2-text program and postprocess the data. Read comp.text.pdf

There is no simple and consistent way to extract text from a .pdf
because there are many ways to set text. The optical impression
of a paragraph may not be represented by a similar command structure
in the .pdf.

Adobe recognized the difficulties for document reuse and introduced
tagged .pdf in 1.4. With tagged-pdf it is possible to insert
structural information in the .pdf. If you are interested in
using this, contact me.

Wishing a happy day
LOBI
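
As a sketch of the "postprocess the data" step: assuming the external converter separates paragraphs with blank lines (which, as noted above, is not guaranteed for every PDF), the hard-wrapped output can be regrouped into paragraphs like this. The function name `paragraphs` is hypothetical.

```python
import re


def paragraphs(text):
    """Split converter output into paragraphs and undo hard line wraps.

    Assumes blank lines mark paragraph boundaries; layouts that do not
    follow this convention will need a different heuristic.
    """
    blocks = re.split(r"\n\s*\n", text.strip())
    return [
        " ".join(line.strip() for line in block.splitlines())
        for block in blocks
        if block.strip()
    ]
```

For example, `paragraphs("First line\nsecond line.\n\nNext para\ncontinues.\n")` yields two strings, one per paragraph, each with its line breaks replaced by spaces.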

Cameron Laird · Jan 20, 2004

Aloha,

Peter Galfi wrote:
[...]
As others wrote here, the simplest solution is to use an external
pdf-2-text program and postprocess the data. Read comp.text.pdf

There is no simple and consistent way to extract text from a .pdf
because there are many ways to set text. The optical impression

[...]
I want to emphasize that final sentence. If you insist on pursuing
this, though, refer to <URL:
http://phaseit.net/claird/comp.text.pdf/PDF_converters.html#pdf2txt >.

Dennis Lee Bieber · Jan 20, 2004

Any suggestions?

Configure a text-only printer, with "print to file" capability,
and "print" the PDF file to it... Then read the print-out...

Jeff Sandys · Jan 20, 2004

Peter said:
....
The "information" I am trying to extract from the PDF file is the text,
specifically in a way to keep the original paragraphs of the text. ....

Any suggestions?

Ghostscript has an Extract Text capability that I have used
successfully on some pdf files (but not on some others):
http://www.cs.wisc.edu/~ghost/

Thanks,
Jeff Sandys
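
A hedged sketch of driving Ghostscript from Python for text extraction. It assumes a Ghostscript release that provides the `txtwrite` output device (added in releases later than those current when this thread was written) and that the executable is called `gs`; on Windows the console binary usually has a different name. Both helper names are illustrative.

```python
import subprocess


def ghostscript_text_command(pdf_path, gs_binary="gs"):
    """Build a Ghostscript command line that writes extracted text to stdout."""
    return [
        gs_binary,
        "-q",                 # suppress the startup banner
        "-dNOPAUSE",          # do not wait for a keypress between pages
        "-dBATCH",            # exit after the last file
        "-sDEVICE=txtwrite",  # text-extraction output device (assumed available)
        "-sOutputFile=-",     # "-" sends the output to stdout
        pdf_path,
    ]


def ghostscript_extract_text(pdf_path, gs_binary="gs"):
    """Run Ghostscript and return the extracted text."""
    completed = subprocess.run(
        ghostscript_text_command(pdf_path, gs_binary),
        capture_output=True,
        check=True,
    )
    return completed.stdout.decode("utf-8", errors="replace")
```

As with any extractor, expect it to work on some files and not others.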
