Analysis of PDF (or EPS?)

Johan Holst Nielsen · Nov 20, 2003

Hi,

Are there any Python packages to analyse or get some information out of
a PDF document?

For example: where the text is placed, what text is placed, fonts, embedded
PDFs/fonts/images, etc.

Please let me know.

Regards,
Johan

Peter Hansen · Nov 20, 2003

Johan said:
Are there any Python packages to analyse or get some information out of
a PDF document?

For example: where the text is placed, what text is placed, fonts, embedded
PDFs/fonts/images, etc.

Please let me know.

I believe the not-for-free version of ReportLab has this sort of capability,
at least in some sense.

-Peter

Johan Holst Nielsen · Nov 20, 2003

Peter said:
I believe the not-for-free version of ReportLab has this sort of capability,
at least in some sense.

Ah, you are thinking of the product "PageCatcher", right?

I haven't seen it yet.

I will contact ReportLab for further details, thanks.

Please let me know if others know of any alternatives (in case I
cannot use ReportLab's version).

Regards,
Johan

Johan Holst Nielsen · Nov 20, 2003

Johan said:
Ah, you are thinking of the product "PageCatcher", right?

Just found the pricing.

I think USD 25,000 is way out of my budget.

I hope someone has some alternatives.

Regards,
Johan

Bengt Richter · Nov 20, 2003

Johan said:
Hi,

Are there any Python packages to analyse or get some information out of
a PDF document?

For example: where the text is placed, what text is placed, fonts, embedded
PDFs/fonts/images, etc.

Please let me know.

IIRC you can get the full specs of PDF and EPS at the Adobe site.
Some stuff is easy to get at, some may be compressed and/or encrypted,
and not so easy.

Conforming docs are supposed to be structured so that it is relatively easy
to grab chunks of a document and do the kinds of things printing-business software does,
like rotating, scaling, and reordering pages.
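
As a rough illustration of the "easy to get at" part: a PDF file begins with a "%PDF-x.y" header and ends with a "startxref" keyword, followed by the byte offset of the cross-reference table and "%%EOF". Below is a minimal sketch that reads both from raw bytes. The function names are made up for this example, and it ignores complications such as junk before the header or damaged files.

```python
import re

def pdf_header_version(data):
    """Return the version string from the '%PDF-x.y' header, or None."""
    match = re.match(rb"%PDF-(\d+\.\d+)", data)
    return match.group(1).decode("ascii") if match else None

def last_startxref(data):
    """Return the byte offset given after the last 'startxref', or None."""
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", data)
    return int(matches[-1]) if matches else None

# A tiny hand-made fragment standing in for a real file (not a valid PDF).
sample = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    b"startxref\n9\n%%EOF\n"
)

print(pdf_header_version(sample))
print(last_startxref(sample))
```

For a real document you would read the file in binary mode and pass its contents in; the offset returned is where a parser would start looking for the cross-reference table.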

There are also whole books on PDF and PostScript, which you could browse at a good
technical bookstore or library.

Regards,
Bengt Richter

Andrew MacIntyre · Nov 20, 2003

Johan said:
Are there any Python packages to analyse or get some information out of
a PDF document?

For example: where the text is placed, what text is placed, fonts, embedded
PDFs/fonts/images, etc.

I believe that Ghostscript can be used to print PDFs on PostScript
printers; you would then need to find tools to analyse PostScript files.
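
A hedged sketch of driving such a conversion from Python. It assumes a Ghostscript executable named "gs" is on the PATH (on Windows it is usually "gswin32c" or "gswin64c") and uses the "ps2write" output device found in current Ghostscript releases. The helper names and file names are invented for this example.

```python
import shutil
import subprocess

def build_gs_command(pdf_path, ps_path, executable="gs"):
    """Build a Ghostscript command line that converts a PDF to PostScript."""
    return [
        executable,
        "-dNOPAUSE",          # do not pause between pages
        "-dBATCH",            # exit once the input has been processed
        "-dSAFER",            # restrict file access from the interpreter
        "-sDEVICE=ps2write",  # PostScript output device
        "-sOutputFile=" + ps_path,
        pdf_path,
    ]

def pdf_to_postscript(pdf_path, ps_path, executable="gs"):
    """Run the conversion; return False if Ghostscript is not installed."""
    if shutil.which(executable) is None:
        return False
    subprocess.run(build_gs_command(pdf_path, ps_path, executable), check=True)
    return True

# Show the command that would be run for a pair of example file names.
print(" ".join(build_gs_command("input.pdf", "output.ps")))
```

The resulting PostScript still has to be analysed by some other tool, so this only covers the first half of the suggestion.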

David Boddie · Nov 21, 2003

It depends on the type of images (bitmap vs. vector).

IIRC you can get the full specs of PDF and EPS at the Adobe site.

The full PDF specification is not exactly short, but it's fairly readable.

Some stuff is easy to get at, some may be compressed and/or encrypted,
and not so easy.

Although the FlateDecode compression format is straightforward to handle with existing
libraries, some of the other compression techniques may be less accessible.
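
For instance, FlateDecode is the zlib/deflate format, so Python's standard zlib module can inflate such a stream once its raw bytes have been isolated. A minimal sketch: the stream here is built by hand rather than taken from a real file, the function name is made up, and it assumes no predictor or other /DecodeParms post-processing is in effect.

```python
import zlib

def inflate_flate_stream(raw):
    """Decompress the bytes found between 'stream' and 'endstream'."""
    return zlib.decompress(raw)

# Stand-in for a page content stream: a few PDF text operators.
content = b"BT /F1 12 Tf 72 720 Td (Hello) Tj ET"
compressed = zlib.compress(content)

print(inflate_flate_stream(compressed))
```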

Conforming docs are supposed to be structured so that it is relatively easy
to grab chunks of a document and do the kinds of things printing-business software does,
like rotating, scaling, and reordering pages.

I have a Python library which is able to identify a lot of the structure in simple
documents, including basic text extraction, but I've become pretty disillusioned
with it because so much work is required to extract more complex information.

Maybe it's time to stick a license on it and upload it somewhere.

David

Grzegorz Makarewicz · Nov 21, 2003

Johan said:
Hi,

Are there any Python packages to analyse or get some information out of
a PDF document?

For example: where the text is placed, what text is placed, fonts, embedded
PDFs/fonts/images, etc.

Please let me know.

Regards,
Johan

http://www.trisoft.com.pl/~mak/wxpdf.zip

My first attempt to decode PDFs with SWIG-wrapped xpdf. It requires the Python
and wxPython sources; binaries for Python 2.2 (Windows) are included.

mak

Johan Holst Nielsen · Nov 21, 2003

Grzegorz said:
http://www.trisoft.com.pl/~mak/wxpdf.zip

My first attempt to decode PDFs with SWIG-wrapped xpdf. It requires the Python
and wxPython sources; binaries for Python 2.2 (Windows) are included.

Hmmm
http://www.trisoft.com.pl/~mak/wxpdf.zip
Not Found
The requested URL /~mak/wxpdf.zip was not found on this server.

Did I get the wrong URL?

Regards,
Johan

Johan Holst Nielsen · Nov 21, 2003

David said:
It depends on the type of images (bitmap vs. vector).

Yes, I know, but the vector-based images should be extracted just as they
are, and bitmaps as self-contained files :=)

The full PDF specification is not exactly short, but it's fairly readable.

Yep, I tried it, but there is no reason to do exactly the same work if
other people have already done it. And time is an issue too.

Although the FlateDecode compression format is straightforward to handle with existing
libraries, some of the other compression techniques may be less accessible.

Well, no problem with the compression/encryption. It is for an internal
application, so people will just HAVE to not encrypt or secure the documents.

I have a Python library which is able to identify a lot of the structure in simple
documents, including basic text extraction, but I've become pretty disillusioned
with it because so much work is required to extract more complex information.

Maybe it's time to stick a license on it and upload it somewhere.

Well, let me know.

Maybe I could get a demo or something? That would
be nice.

Regards,
Johan

Johan Holst Nielsen · Nov 21, 2003

Grzegorz said:
http://www.trisoft.com.pl/~mak/wxpdf.zip

My first attempt to decode PDFs with SWIG-wrapped xpdf. It requires the Python
and wxPython sources; binaries for Python 2.2 (Windows) are included.

Not Found
The requested URL /~mak/wxpdf.zip was not found on this server.

Can you please try to upload it again?

Regards,
Johan

Grzegorz Makarewicz · Nov 21, 2003

Johan Holst Nielsen wrote:
[...]

> Not Found
> The requested URL /~mak/wxpdf.zip was not found on this server.
>
> Can you please try to upload it again?
>
> Johan
>

Sorry for the missing link, this one works:

http://www.trisoft.com.pl/mak/wxpdf.zip

Regards,
Grzegorz Makarewicz

Johan Holst Nielsen · Nov 21, 2003

Grzegorz said:
Johan Holst Nielsen wrote:
[...]

Not Found
The requested URL /~mak/wxpdf.zip was not found on this server.

Can you please try to upload it again?

Johan

Sorry for the missing link, this one works:

http://www.trisoft.com.pl/mak/wxpdf.zip

Thanks Grzegorz, I will look at it next week. If you want a reply
about whether I can use it, please send me a message at tcr480 ( a t )
yahoo.dk

Regards,
Johan

David Boddie · Nov 25, 2003

Johan Holst Nielsen said:
David Boddie wrote:

Yep, I tried it, but there is no reason to do exactly the same work if
other people have already done it. And time is an issue too.

Time is always an issue. How much of it do you have? ;-)

Well, let me know. Maybe I could get a demo or something? That would
be nice.

You may be disappointed, but here it is:

http://www.boddie.org.uk/david/Projects/Python/pdftools/

The core of the library was written in a hurry over two years ago; later refinements
make it only slightly more robust. It was never really intended for anything other
than exploring the structure of PDF files.

Basic use:

# Python 2 syntax, matching the library.
import pdftools

# Use "path" rather than "file" to avoid shadowing the built-in name.
path = "MyFile.pdf"
doc = pdftools.PDFdocument(path)

print "Document uses PDF format version", doc.document_version()

pages = doc.count_pages()
print "Document contains %i pages." % pages

# Only read page 123 if the document has more than 123 pages.
if pages > 123:

    page123 = doc.read_page(123)
    contents123 = page123.read_contents()

    print "The objects found in this page:"
    print
    print contents123.contents

I've not really dealt with the coordinate system very well. Ideally, it would be
trivial to extract all the device-independent positioning information but,
whenever I start to look at this, I get distracted.
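
To make the coordinate issue concrete: in the PDF imaging model, each "cm" operator supplies six numbers a b c d e f that are concatenated onto the current transformation matrix (CTM), and a point maps as x' = a*x + c*y + e, y' = b*x + d*y + f. Below is a minimal sketch of that arithmetic, written in current Python and independent of pdftools; the function names are made up for this example.

```python
def concat(m, ctm):
    """Return m x ctm: the new CTM after a 'cm' operator with operands m."""
    a1, b1, c1, d1, e1, f1 = m
    a2, b2, c2, d2, e2, f2 = ctm
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )

def transform_point(matrix, x, y):
    """Map a point through a six-number PDF matrix (a, b, c, d, e, f)."""
    a, b, c, d, e, f = matrix
    return (a * x + c * y + e, b * x + d * y + f)

IDENTITY = (1, 0, 0, 1, 0, 0)

# Equivalent to the content stream: 1 0 0 1 10 20 cm  2 0 0 2 0 0 cm
ctm = concat((1, 0, 0, 1, 10, 20), IDENTITY)  # translate by (10, 20)
ctm = concat((2, 0, 0, 2, 0, 0), ctm)         # then scale by 2

print(transform_point(ctm, 1, 1))
```

Text positioning adds a further text matrix on top of this, which the sketch leaves out.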

Have fun, and don't expect too much,

David
