OT: extract data from PDFs


Spartanicus

Can anyone recommend a tool to extract the data from PDFs (other than
Acrobat)?
 

Jukka K. Korpela

Marc Nadeau said:
http://pdftohtml.sourceforge.net/

It works in Windows and Linux.

For some values of "work". It generates an attempt at exact imitation
of the visual appearance of the PDF document, hence trying to fight
against the strengths of HTML. It creates _no_ structural markup, just
div, span, nobr, b, etc., combined with the use of CSS positioning in a
manner that relies on browsers' violations of explicit requirements in
CSS specifications. And instead of generating a single HTML document,
it converts each page separately and makes them appear in a frame,
inside a frameset with no noframes element and with frames named
"links" and "rechts", which is _so_ informative e.g. to a blind person,
is it not? And with <title>014-048_Man_282198_50</title>.

It's probably easier to use cut and paste to get the textual content of
a PDF file (and grab the images separately) and add adequate HTML
markup by hand. At least you wouldn't need to remove randomly generated
markup first. But it's not a big difference really, so if you don't
know how to cut and paste from your favorite PDF viewer, you might
almost as well use pdftohtml.
 

Toby A Inkster

Spartanicus said:
Can anyone recommend a tool to extract the data from PDFs (other than
Acrobat)?

Ghostscript contains a tool ps2ascii, which can handle PDF input.
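If memory serves, the invocation is simply (the file names here are just
examples):

  ps2ascii manual.pdf manual.txt

Ghostscript has to be on your PATH; on Windows the wrapper script may be
called ps2ascii.bat.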
 

Marc Nadeau

Jukka K. Korpela wrote:
For some values of "work". It generates an attempt at exact imitation
of the visual appearance of the PDF document, hence trying to fight
against the strengths of HTML. [...]

It's probably easier to use cut and paste to get the textual content of
a PDF file (and grab the images separately) and add adequate HTML
markup by hand. [...]

Agreed. I should have said it works BUT you have to do a *lot* of hand
editing after the conversion.

Some of my customers send me their documents as .doc files, and although
these can easily be exported as HTML files, it is much better (and in
the end less work) to just cut and paste the text and add markup by
hand.
 

Spartanicus

Marc Nadeau said:
Agreed. I should have said it works BUT you have to do a *lot* of hand
editing after the conversion.

I have tried a similar utility called pdf2htm (it also produces dreadful
code btw). Getting rid of the code isn't a major problem, but I want to
extract images in their native format (assuming that images inside PDFs
are stored in a standard image format?) without recompressing them.
Pdf2htm converts all graphics to JPEGs, which is especially unwanted
because the images in question (1-bit line drawings) are unsuitable for
the JPEG format.

So I'm still looking for a utility that extracts the raw data:
unformatted text and the native images.
 

Sid Ismail

: I have tried a similar utility called pdf2htm (it also produces dreadful
: code btw). Getting rid of the code isn't a major problem, but I want to
: extract images in their native format (assuming that images inside PDFs
: are stored in a standard image format?) without recompressing them.

Screen capture to BMP? Then convert it...

Sid

Sid
 

Spartanicus

Sid Ismail said:
Screen capture to BMP? Then convert it...

That would prevent me from reusing the images in their native format,
and there are hundreds of images in the PDFs; all would need manual
cropping etc., so that's not a realistic option.
 

Zak McGregor

Spartanicus said:
So I'm still looking for a utility that extracts the raw data:
unformatted text and the native images.

There is a slew of pdf2* and pdfto* utilities:
pdf2dsc pdf2ps pdffonts pdfimages pdfinfo pdfopt
pdftopbm pdftops pdftotext

(just on my machine - I have made no real effort to get them there
either). Presumably pdfimages will do more or less what you want; it
uses PPM format by default, although if the PDF contains JPEGs you can
specify that they be left as JPEGs.
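If I remember right, the invocation is something like this ("manual.pdf"
and the "img" prefix are just placeholders):

  pdfimages -j manual.pdf img

That writes img-000.ppm, img-001.pbm and so on; with -j, JPEG-compressed
images are left as .jpg files, and 1-bit images come out as PBM, so line
drawings would not be recompressed as JPEGs.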

HTH

Ciao

Zak
 

Spartanicus

Presumably pdfimages will do more or less what you want

Linux only, and extracting text and images separately makes it very
labour intensive to bring the two together again (there are hundreds of
pages).
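
Even on a Linux box, the best I can see is a per-page loop, since both
pdftotext and pdfimages take -f and -l page ranges (an untested sketch;
"manual.pdf" and the page count are placeholders):

  for p in $(seq 1 400); do
      pdftotext -f $p -l $p manual.pdf page-$p.txt
      pdfimages -f $p -l $p manual.pdf page-$p
  done

and that still leaves matching each image to its place in the text by
hand.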
 
