Spartanicus
Can anyone recommend a tool to extract the data from PDF's (other than
Acrobat)?
Marc Nadeau said:
Spartanicus said: Can anyone recommend a tool to extract the data from PDF's (other than
Acrobat)?
For some values of "work". It generates an attempt at exact imitation
of the visual appearance of the PDF document, hence trying to fight
against the strengths of HTML. It creates _no_ structural markup, just
div, span, nobr, b, etc., combined with the use of CSS positioning in a
manner that relies on browsers' violations of explicit requirements in
CSS specifications. And instead of generating a single HTML document,
it converts each page separately and makes them appear in a frame,
inside a frameset with no noframes element and with frames named
"links" and "rechts", which is _so_ informative e.g. to a blind person,
is it not? And with <title>014-048_Man_282198_50</title>.
It's probably easier to use cut and paste to get the textual content of
a PDF file (and grab the images separately) and add adequate HTML
markup by hand. At least you wouldn't need to remove randomly generated
markup first. But it's not a big difference really, so if you don't
know how to cut and paste from your favorite PDF viewer, you might
almost as well use pdftohtml.
Marc Nadeau said: Agreed. I should have said it works, BUT you have to do a *lot* of hand
editing after the conversion.
Sid Ismail said:
: I have tried a similar utility called pdf2htm (it also produces dreadful
: code btw). Getting rid of the code isn't a major problem, but I want to
: extract images in their native format (assuming that images inside pdf's
: are in a standard image format (?)) without recompressing them.
Screen capture to BMP, then convert it...
I have tried a similar utility called pdf2htm (it also produces dreadful
code btw). Getting rid of the code isn't a major problem, but I want to
extract images in their native format (assuming that images inside pdf's
are in a standard image format (?)) without recompressing them. Pdf2htm
converts all graphics to jpg's, this is especially unwanted because the
images in question (1bit line drawings) are unsuitable for the jpeg
format.
So I'm still looking for a utility that extracts the raw data,
unformatted text and the native images.
Presumably pdfimages will do more or less what you want.
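For anyone wanting to script this, here is a rough sketch of driving pdfimages from Python, together with its companion pdftotext for the unformatted text. It assumes the xpdf or poppler command-line tools are installed and on the PATH. The function names, the "img" file prefix and "text.txt" are made up for illustration. As far as I know, pdfimages writes monochrome images as PBM and everything else as PPM, and with -j it saves images stored in DCT (JPEG) format as .jpg files instead of converting them, so 1-bit line drawings should not end up as JPEGs.

```python
import shutil
import subprocess
from pathlib import Path


def pdfimages_command(pdf_path, image_root, keep_jpeg=True):
    """Build the argument list for a `pdfimages` call (hypothetical helper)."""
    cmd = ["pdfimages"]
    if keep_jpeg:
        # -j: save images stored in DCT (JPEG) format as .jpg files;
        # all other images are still written as PBM/PPM.
        cmd.append("-j")
    cmd += [str(pdf_path), str(image_root)]
    return cmd


def pdftotext_command(pdf_path, text_path):
    """Build the argument list for a `pdftotext` call (hypothetical helper)."""
    # -raw: keep the text in content stream order instead of reading order.
    return ["pdftotext", "-raw", str(pdf_path), str(text_path)]


def extract(pdf_path, out_dir):
    """Extract images and plain text from one PDF into out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Fail early with a readable message if the tools are not installed.
    for tool in ("pdfimages", "pdftotext"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH")

    # Images are written as <out_dir>/img-NNN.<ext>; "img" is an arbitrary prefix.
    subprocess.run(pdfimages_command(pdf_path, out / "img"), check=True)
    subprocess.run(pdftotext_command(pdf_path, out / "text.txt"), check=True)

    # Return the names of the files that were produced.
    return sorted(p.name for p in out.iterdir())
```

Calling extract("manual.pdf", "out") (both names are placeholders) should leave the extracted images and a text.txt in the out directory. The text still needs HTML markup added by hand, as discussed above.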