OT: extract data from PDFs


Spartanicus

Can anyone recommend a tool to extract the data from PDFs (other than
Acrobat)?
 

Jukka K. Korpela

Marc Nadeau said:
http://pdftohtml.sourceforge.net/

It works in Windows and Linux.

For some values of "work". It generates an attempt at exact imitation
of the visual appearance of the PDF document, hence trying to fight
against the strengths of HTML. It creates _no_ structural markup, just
div, span, nobr, b, etc., combined with the use of CSS positioning in a
manner that relies on browsers' violations of explicit requirements in
CSS specifications. And instead of generating a single HTML document,
it converts each page separately and makes them appear in a frame,
inside a frameset with no noframes element and with frames named
"links" and "rechts", which is _so_ informative e.g. to a blind person,
is it not? And with <title>014-048_Man_282198_50</title>.

It's probably easier to use cut and paste to get the textual content of
a PDF file (and grab the images separately) and add adequate HTML
markup by hand. At least you wouldn't need to remove randomly generated
markup first. But it's not a big difference really, so if you don't
know how to cut and paste from your favorite PDF viewer, you might
almost as well use pdftohtml.
 

Toby A Inkster

Spartanicus said:
Can anyone recommend a tool to extract the data from PDFs (other than
Acrobat)?

Ghostscript contains a tool ps2ascii, which can handle PDF input.
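If memory serves, the invocation is simply (the file names here are just
examples):

  ps2ascii manual.pdf manual.txt

Ghostscript has to be on your PATH; on Windows the wrapper script may be
called ps2ascii.bat.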
 

Marc Nadeau

Jukka K. Korpela wrote:
For some values of "work". It generates an attempt at exact imitation
of the visual appearance of the PDF document, hence trying to fight
against the strengths of HTML. [...]

It's probably easier to use cut and paste to get the textual content of
a PDF file (and grab the images separately) and add adequate HTML
markup by hand. [...]

Agreed. I should have said it works BUT you have to do a *lot* of hand
editing after the conversion.

Some of my customers send me their documents as .doc files, and although
these can easily be exported as HTML files, it is much better (and in
the end less work) to just cut and paste the text and add markup by
hand.
 

Spartanicus

Marc Nadeau said:
Agreed. I should have said it works BUT you have to do a *lot* of hand
editing after the conversion.

I have tried a similar utility called pdf2htm (it also produces dreadful
code btw). Getting rid of the code isn't a major problem, but I want to
extract images in their native format (assuming that images inside PDFs
are stored in a standard image format?) without recompressing them.
Pdf2htm converts all graphics to JPEGs, which is especially unwanted
because the images in question (1-bit line drawings) are unsuitable for
the JPEG format.

So I'm still looking for a utility that extracts the raw data:
unformatted text and the native images.
 

Sid Ismail

: I have tried a similar utility called pdf2htm (it also produces dreadful
: code btw). Getting rid of the code isn't a major problem, but I want to
: extract images in their native format (assuming that images inside PDFs
: are stored in a standard image format?) without recompressing them.

Screen capture to BMP? Then convert it...

Sid

Sid
 

Spartanicus

Sid Ismail said:
Screen capture to BMP? Then convert it...

That would prevent me from reusing the images in their native format,
and there are hundreds of images in the PDFs; all would need manual
cropping etc., so that's not a realistic option.
 

Zak McGregor

Spartanicus said:
So I'm still looking for a utility that extracts the raw data:
unformatted text and the native images.

There is a slew of pdf2* and pdfto* utilities:
pdf2dsc pdf2ps pdffonts pdfimages pdfinfo pdfopt
pdftopbm pdftops pdftotext

(just on my machine - I have made no real effort to get them there
either). Presumably pdfimages will do more or less what you want; it
uses PPM format by default, although if the PDF contains JPEGs you can
specify that they be left as JPEGs.
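If I remember right, the invocation is something like this ("manual.pdf"
and the "img" prefix are just placeholders):

  pdfimages -j manual.pdf img

That writes img-000.ppm, img-001.pbm and so on; with -j, JPEG-compressed
images are left as .jpg files, and 1-bit images come out as PBM, so line
drawings would not be recompressed as JPEGs.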

HTH

Ciao

Zak
 

Spartanicus

Presumably pdfimages will do more or less what you want

Linux only, and extracting text and images separately makes it very
labour intensive to bring the two together again (there are hundreds of
pages).
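
Even on a Linux box, the best I can see is a per-page loop, since both
pdftotext and pdfimages take -f and -l page ranges (an untested sketch;
"manual.pdf" and the page count are placeholders):

  for p in $(seq 1 400); do
      pdftotext -f $p -l $p manual.pdf page-$p.txt
      pdfimages -f $p -l $p manual.pdf page-$p
  done

and that still leaves matching each image to its place in the text by
hand.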
 
