OT: extract data from PDF's

Discussion in 'HTML' started by Spartanicus, Feb 6, 2004.

  1. Spartanicus

    Spartanicus Guest

    Can anyone recommend a tool to extract the data from PDF's (other than
    Acrobat)?

    --
    Spartanicus
     
    Spartanicus, Feb 6, 2004
    #1

  2. Paul Furman

    Paul Furman Guest

    Automatically with PHP, manually with the select or text tools in
    Acrobat. http://us3.php.net/manual/en/ref.pdf.php

    Spartanicus wrote:

    > Can anyone recommend a tool to extract the data from PDF's (other than
    > Acrobat)?
    >
     
    Paul Furman, Feb 6, 2004
    #2

  3. Marc Nadeau

    Marc Nadeau Guest

    Spartanicus wrote:

    > Can anyone recommend a tool to extract the data from PDF's (other than
    > Acrobat)?
    >


    There is a command line utility called pdftohtml at

    http://pdftohtml.sourceforge.net/

    It works in windows and linux.

    You may have to compile the source.

    I tried it and it works.

    Good luck!

    --
    The reason most women are little moved by friendship is that it is
    insipid once one has felt love. La Rochefoucauld
     
    Marc Nadeau, Feb 7, 2004
    #3
  4. Jukka K. Korpela

    Marc Nadeau wrote:

    > http://pdftohtml.sourceforge.net/
    >
    > It works in windows and linux.


    For some values of "work". It generates an attempt at exact imitation
    of the visual appearance of the PDF document, hence trying to fight
    against the strengths of HTML. It creates _no_ structural markup, just
    div, span, nobr, b, etc., combined with the use of CSS positioning in a
    manner that relies on browsers' violations of explicit requirements in
    CSS specifications. And instead of generating a single HTML document,
    it converts each page separately and makes them appear in a frame,
    inside a frameset with no noframes element and with frames named
    "links" and "rechts", which is _so_ informative e.g. to a blind person,
    is it not? And with <title>014-048_Man_282198_50</title>.

    It's probably easier to use cut and paste to get the textual content of
    a PDF file (and grab the images separately) and add adequate HTML
    markup by hand. At least you wouldn't need to remove randomly generated
    markup first. But it's not a big difference really, so if you don't
    know how to cut and paste from your favorite PDF viewer, you might
    almost as well use pdftohtml.
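
For anyone who has already run a converter and only wants the text back, the "remove the generated markup" step can be sketched with Python's standard html.parser. This is an illustration only, not something pdftohtml provides: the names TextOnly and strip_markup are made up here, and the sample markup in the usage note is an assumption about what such output looks like.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect character data and drop every tag (div, span, nobr, b, ...)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip = 0  # depth inside <style> or <script>

    def handle_starttag(self, tag, attrs):
        if tag in ("style", "script"):
            self._skip += 1
        elif tag in ("br", "div", "p"):
            # Treat block-level tags and <br> as line breaks.
            self.chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in ("style", "script") and self._skip:
            self._skip -= 1
        elif tag in ("div", "p"):
            self.chunks.append("\n")

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def strip_markup(html):
    """Return the text content of html, one non-empty line per block."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    text = "".join(parser.chunks)
    # Collapse runs of whitespace inside each line and drop empty lines.
    lines = [" ".join(line.split()) for line in text.splitlines()]
    return "\n".join(line for line in lines if line)
```

Calling strip_markup('<div><nobr><b>Chapter 1</b></nobr></div><div>Some text</div>') gives two plain lines of text. Structural markup (headings, lists, paragraphs) still has to be added by hand afterwards.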

    --
    Yucca, http://www.cs.tut.fi/~jkorpela/
    Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html
     
    Jukka K. Korpela, Feb 7, 2004
    #4
  5. Toby A Inkster

    Spartanicus wrote:

    > Can anyone recommend a tool to extract the data from PDF's (other than
    > Acrobat)?


    Ghostscript contains a tool ps2ascii, which can handle PDF input.
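
A minimal sketch of driving ps2ascii from a script, assuming it is installed and on the PATH (some newer Ghostscript releases may not ship it) and that it accepts an input file followed by an output file. The function names are invented for this example.

```python
import shutil
import subprocess


def ps2ascii_command(pdf_path, txt_path):
    """Build the argument list: ps2ascii INPUT OUTPUT."""
    return ["ps2ascii", str(pdf_path), str(txt_path)]


def pdf_to_text(pdf_path, txt_path):
    """Run ps2ascii on pdf_path, writing plain text to txt_path."""
    if shutil.which("ps2ascii") is None:
        # Fail with a clear message instead of an obscure OS error.
        raise FileNotFoundError("ps2ascii not found on PATH")
    subprocess.run(ps2ascii_command(pdf_path, txt_path), check=True)
    return txt_path
```

pdf_to_text("manual.pdf", "manual.txt") would then leave the extracted text in manual.txt; the file names are placeholders.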

    --
    Toby A Inkster BSc (Hons) ARCS
    Contact Me - http://www.goddamn.co.uk/tobyink/?page=132
     
    Toby A Inkster, Feb 7, 2004
    #5
  6. Marc Nadeau

    Marc Nadeau Guest

    Jukka K. Korpela wrote:

    > Marc Nadeau wrote:
    >
    >> http://pdftohtml.sourceforge.net/
    >>
    >> It works in windows and linux.

    >
    > For some values of "work". It generates an attempt at exact imitation
    > of the visual appearance of the PDF document, hence trying to fight
    > against the strengths of HTML. It creates _no_ structural markup, just
    > div, span, nobr, b, etc., combined with the use of CSS positioning in a
    > manner that relies on browsers' violations of explicit requirements in
    > CSS specifications. And instead of generating a single HTML document,
    > it converts each page separately and makes them appear in a frame,
    > inside a frameset with no noframes element and with frames named
    > "links" and "rechts", which is _so_ informative e.g. to a blind person,
    > is it not? And with <title>014-048_Man_282198_50</title>.
    >
    > It's probably easier to use cut and paste to get the textual content of
    > a PDF file (and grab the images separately) and add adequate HTML
    > markup by hand. At least you wouldn't need to remove randomly generated
    > markup first. But it's not a big difference really, so if you don't
    > know how to cut and paste from your favorite PDF viewer, you might
    > almost as well use pdftohtml.
    >


    Agreed. I should have said it works BUT you have to do a *lot* of hand
    editing after the conversion.

    Some of my customers send me their documents as .doc files, and although
    these can easily be exported as HTML files, it is much better (and in the
    end less work) to just cut and paste the text and add markup by hand.


    --
    To be taken for an idiot by an imbecile is
    a delight for a true gourmet. Alphonse Allais.
     
    Marc Nadeau, Feb 8, 2004
    #6
  7. Spartanicus

    Spartanicus Guest

    Marc Nadeau wrote:

    >> It's probably easier to use cut and paste to get the textual content of
    >> a PDF file (and grab the images separately) and add adequate HTML
    >> markup by hand. At least you wouldn't need to remove randomly generated
    >> markup first. But it's not a big difference really, so if you don't
    >> know how to cut and paste from your favorite PDF viewer, you might
    >> almost as well use pdftohtml.

    >
    >Agreed. I should have said it works BUT you have to do a *lot* of hand
    >editing after the conversion.


    I have tried a similar utility called pdf2htm (it also produces dreadful
    code, btw). Getting rid of the code isn't a major problem, but I want to
    extract images in their native format, without recompressing them
    (assuming that images inside PDFs are stored in a standard image
    format?). Pdf2htm converts all graphics to JPEGs, which is especially
    unwanted because the images in question (1-bit line drawings) are
    unsuitable for the JPEG format.

    So I'm still looking for a utility that extracts the raw data,
    unformatted text and the native images.

    --
    Spartanicus
     
    Spartanicus, Feb 8, 2004
    #7
  8. Sid Ismail

    Sid Ismail Guest

    On Sun, 08 Feb 2004 08:15:24 +0000, Spartanicus wrote:

    : I have tried a similar utility called pdf2htm (it also produces dreadful
    : code btw). Getting rid of the code isn't a major problem, but I want to
    : extract images in their native format (assuming that images inside pdf's
    : are in a standard image format (?)) without recompressing them.


    Screen capture to BMP, then convert it?

    Sid
     
    Sid Ismail, Feb 8, 2004
    #8
  9. Spartanicus

    Spartanicus Guest

    Sid Ismail wrote:

    >: I have tried a similar utility called pdf2htm (it also produces dreadful
    >: code btw). Getting rid of the code isn't a major problem, but I want to
    >: extract images in their native format (assuming that images inside pdf's
    >: are in a standard image format (?)) without recompressing them.
    >
    >Screen capture to bmp ? then convert it...


    That would prevent me from reusing the images in their native format.
    There are also hundreds of images in the PDFs, and every one would need
    manual cropping etc., so it's not a realistic option.

    --
    Spartanicus
     
    Spartanicus, Feb 8, 2004
    #9
  10. Zak McGregor

    Zak McGregor Guest

    On Sun, 08 Feb 2004 10:15:24 +0200, Spartanicus wrote:

    > Marc Nadeau wrote:
    >
    >>> It's probably easier to use cut and paste to get the textual content
    >>> of a PDF file (and grab the images separately) and add adequate HTML
    >>> markup by hand. At least you wouldn't need to remove randomly
    >>> generated markup first. But it's not a big difference really, so if
    >>> you don't know how to cut and paste from your favorite PDF viewer, you
    >>> might almost as well use pdftohtml.

    >>
    >>Agreed. I should have said it works BUT you have to do a *lot* of hand
    >>editing after the conversion.

    >
    > I have tried a similar utility called pdf2htm (it also produces dreadful
    > code btw). Getting rid of the code isn't a major problem, but I want to
    > extract images in their native format (assuming that images inside pdf's
    > are in a standard image format (?)) without recompressing them. Pdf2htm
    > converts all graphics to jpg's, this is especially unwanted because the
    > images in question (1bit line drawings) are unsuitable for the jpeg
    > format.
    >
    > So I'm still looking for a utility that extracts the raw data,
    > unformatted text and the native images.


    There are a slew of pdf2* and pdfto* utilities:
    pdf2dsc pdf2ps pdffonts pdfimages pdfinfo pdfopt
    pdftopbm pdftops pdftotext

    (just on my machine - have made no real effort to get them there either).
    Presumably pdfimages will do more or less what you want. It writes PPM
    by default, although if the PDF contains JPEGs you can specify that
    they be left as JPEGs.
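
A rough sketch of calling pdfimages from a script, assuming the Xpdf-style command line (pdfimages [options] PDF-file image-root), where -j keeps embedded JPEG data as .jpg files and, as far as the Xpdf documentation describes it, 1-bit images are written as PBM without lossy recompression. The function names and the "img" file prefix are made up for the example.

```python
import shutil
import subprocess
from pathlib import Path


def pdfimages_command(pdf_path, image_root, keep_jpeg=True):
    """Build the pdfimages argument list; -j leaves JPEG images as JPEG."""
    cmd = ["pdfimages"]
    if keep_jpeg:
        cmd.append("-j")
    cmd += [str(pdf_path), str(image_root)]
    return cmd


def extract_images(pdf_path, out_dir, keep_jpeg=True):
    """Extract all images from pdf_path into out_dir and list the files."""
    if shutil.which("pdfimages") is None:
        raise FileNotFoundError("pdfimages not found on PATH")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    root = out_dir / "img"  # files come out as img-000.pbm, img-001.jpg, ...
    subprocess.run(pdfimages_command(pdf_path, root, keep_jpeg), check=True)
    return sorted(out_dir.glob("img-*"))
```

extract_images("manual.pdf", "images") would return the extracted files in order; the PBM and PPM files can then be converted to PNG or GIF with any lossless tool.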

    HTH

    Ciao

    Zak

    --
    ========================================================================
    http://www.carfolio.com/ Searchable database of 10 000+ car specs
    ========================================================================
     
    Zak McGregor, Feb 9, 2004
    #10
  11. Spartanicus

    Spartanicus Guest

    Zak McGregor wrote:

    >> So I'm still looking for a utility that extracts the raw data,
    >> unformatted text and the native images.


    >Presumably pdfimages will do more or less what you want


    Linux only, and extracting text and images separately makes it very
    labour intensive to bring the two together again (there are hundreds of
    pages).
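
One possible way to keep text and images together is to process the document one page at a time, so that each page's text file and image files share a prefix. The sketch below only plans the commands. It assumes the Xpdf-style tools pdfinfo, pdftotext and pdfimages are available on your platform, that pdfinfo prints a "Pages:" line, and that both extractors accept -f and -l for the first and last page. The function names and the "page-pNNNN" naming scheme are invented here.

```python
import re


def parse_page_count(pdfinfo_output):
    """Read the page count from the text printed by pdfinfo."""
    match = re.search(r"^Pages:\s+(\d+)\s*$", pdfinfo_output, re.MULTILINE)
    if match is None:
        raise ValueError("no 'Pages:' line in pdfinfo output")
    return int(match.group(1))


def per_page_commands(pdf_path, page, out_root="page"):
    """Return (text_cmd, image_cmd) restricted to a single page."""
    first_last = ["-f", str(page), "-l", str(page)]
    stem = f"{out_root}-p{page:04d}"  # shared prefix for this page's files
    text_cmd = ["pdftotext", *first_last, str(pdf_path), f"{stem}.txt"]
    image_cmd = ["pdfimages", "-j", *first_last, str(pdf_path), stem]
    return text_cmd, image_cmd


def plan(pdf_path, pdfinfo_output, out_root="page"):
    """List the command pairs for every page of the document."""
    pages = parse_page_count(pdfinfo_output)
    return [per_page_commands(pdf_path, n, out_root) for n in range(1, pages + 1)]
```

Each command in the plan can be run with subprocess.run(cmd, check=True). The text of page 7 would then land in page-p0007.txt and its images in files starting with page-p0007-, which makes matching them up a sorting job rather than manual work. It does not recover where on the page each image sat.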

    --
    Spartanicus
     
    Spartanicus, Feb 10, 2004
    #11
