Thanks. I am studying the PDF spec, but implementing all the
decompression filters and so on myself does not look easy. The
"information" I am trying to extract from the PDF file is the text,
specifically in a way that keeps the original paragraphs. So far I have
seen one shareware standalone tool that extracts the text (and a lot of
other formatting garbage) into an RTF document, keeping the paragraphs
as well. I need only the text.
Any suggestions?
Peter
----- Original Message -----
From: "Andreas Lobinger" <[email protected]>
Newsgroups: comp.lang.python
To: <[email protected]>
Sent: Monday, January 19, 2004 5:02 PM
Subject: Re: Fw: PDF library for reading PDF files
Aloha,
Peter Galfi wrote:
> I am looking for a library in Python that would read PDF files and I
> could extract information from the PDF with it. I have searched with
> google, but only found libraries that can be used to write PDF files.
> Any ideas?
Use file, split, zlib and a broad knowledge of the PDF spec...
Accessing certain objects in the .pdf is not that complicated if
you, for example, try to read the /Info dictionary. Getting text from
the actual page content could be very complicated.
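
To make the "file, split, zlib" hint concrete, here is a minimal sketch,
not a real PDF parser. It assumes the page content streams are compressed
with FlateDecode only, that text is shown with the plain Tj operator on
literal strings, and that those strings contain no escaped or nested
parentheses. The function names flate_streams and show_text_strings are
made up for this example. It does not recover paragraphs, fonts with
custom encodings, or text positioned with TJ arrays.

```python
import re
import zlib

# Raw bytes between the "stream" and "endstream" keywords (simplified:
# ignores the /Length entry that a real parser should use instead).
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
# A literal string operand followed by the Tj (show text) operator.
# Assumption: no escaped or nested parentheses inside the string.
TJ_RE = re.compile(rb"\((.*?)\)\s*Tj", re.DOTALL)


def flate_streams(pdf_bytes):
    """Yield the inflated body of every stream that zlib can decompress."""
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj tolerates the end-of-line bytes left after the data
            yield zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not FlateDecode, or damaged; skipped in this sketch


def show_text_strings(pdf_bytes):
    """Return the string operands of all Tj operators, in file order."""
    found = []
    for content in flate_streams(pdf_bytes):
        found.extend(s.decode("latin-1") for s in TJ_RE.findall(content))
    return found
```

To try it on a real file, read the file in binary mode and pass the bytes
to show_text_strings. What comes back is a flat list of text fragments in
file order; rebuilding paragraphs would additionally require tracking the
text positioning operators, which is the complicated part mentioned above.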
Can you explain your 'information' further?
Wishing a happy day
LOBI