Extract Text from PDF

Discussion in 'Ruby' started by Mark Dodwell, Apr 13, 2007.

  1. Mark Dodwell

    Mark Dodwell Guest

    Hi,

    Does anyone know a way to extract plain text from a PDF using Ruby?

    Many Thanks,

    ~ Mark

    --
    Posted via http://www.ruby-forum.com/.
     
    Mark Dodwell, Apr 13, 2007
    #1

  2. On 13.04.2007 14:06, Mark Dodwell wrote:
    > Does anyone know a way to extract plain text from a PDF using Ruby?


    IIRC there is a project under way to extend PDFWriter with reading
    capabilities. I don't know the current status of that. HTH

    robert
     
    Robert Klemme, Apr 13, 2007
    #2

  3. Chris Lowis

    Chris Lowis Guest

    Robert Klemme wrote:
    > On 13.04.2007 14:06, Mark Dodwell wrote:
    >> Does anyone know a way to extract plain text from a PDF using Ruby?

    >
    > IIRC there is a project under way to extend PDFWriter with reading
    > capabilities. I don't know the current status of that. HTH


    In the meantime, you could use the command-line tools pdf2ps and ps2ascii
    (I think they use Ghostscript as a backend), and read the resulting
    ASCII file with Ruby in the usual way.
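
    A minimal sketch of that two-step approach, assuming pdf2ps and
    ps2ascii are installed and on the PATH. The method names
    ghostscript_commands and extract_text_with_ghostscript are
    hypothetical, chosen only for illustration:

```ruby
require "open3"
require "tmpdir"

# Hypothetical helper: the two commands to run, in order.
# pdf2ps writes PostScript to ps_path; ps2ascii writes text to txt_path.
def ghostscript_commands(pdf_path, ps_path, txt_path)
  [["pdf2ps", pdf_path, ps_path], ["ps2ascii", ps_path, txt_path]]
end

# Hypothetical helper: converts the PDF in a temporary directory and
# returns the resulting text. Raises if either tool exits with an error.
def extract_text_with_ghostscript(pdf_path)
  Dir.mktmpdir do |dir|
    ps_path  = File.join(dir, "out.ps")
    txt_path = File.join(dir, "out.txt")
    ghostscript_commands(pdf_path, ps_path, txt_path).each do |cmd|
      output, status = Open3.capture2e(*cmd)
      raise "#{cmd.first} failed: #{output.strip}" unless status.success?
    end
    File.read(txt_path)
  end
end
```

    Calling extract_text_with_ghostscript("report.pdf") (the file name
    is made up) should then hand back a String you can process as usual.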

    Regards,


    Chris

    --
    Posted via http://www.ruby-forum.com/.
     
    Chris Lowis, Apr 13, 2007
    #3
  4. Kouhei Sutou

    Kouhei Sutou Guest

    Kouhei Sutou, Apr 13, 2007
    #4
  5. Robert Klemme wrote:
    > On 13.04.2007 14:06, Mark Dodwell wrote:
    >> Does anyone know a way to extract plain text from a PDF using Ruby?

    >
    > IIRC there is a project under way to extend PDFWriter with reading
    > capabilities. I don't know the current status of that. HTH
    >
    > robert

    At least on Linux, there is "pdftotext", which is part of the "poppler"
    package. So you can simply shell out to it if it's installed. If you're
    more ambitious, you could write an extension to use the underlying
    libraries in poppler.
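
    Shelling out could look roughly like this. It is only a sketch,
    assuming pdftotext is on the PATH; passing "-" as the output file
    asks pdftotext to write to stdout. The method names
    pdftotext_command and extract_text_with_pdftotext are made up for
    the example:

```ruby
require "open3"

# Hypothetical helper: argument list for pdftotext.
# The trailing "-" sends the extracted text to stdout.
def pdftotext_command(pdf_path)
  ["pdftotext", pdf_path, "-"]
end

# Hypothetical helper: returns the extracted text as a String.
# Raises if pdftotext is not installed or exits with an error.
def extract_text_with_pdftotext(pdf_path)
  stdout, stderr, status = Open3.capture3(*pdftotext_command(pdf_path))
  raise "pdftotext failed: #{stderr.strip}" unless status.success?
  stdout
rescue Errno::ENOENT
  raise "pdftotext not found; is poppler installed?"
end
```

    Passing the arguments as a list rather than one command string
    avoids going through the shell, so file names with spaces or quotes
    need no escaping.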



    --
    M. Edward (Ed) Borasky, FBG, AB, PTA, PGS, MS, MNLP, NST, ACMC(P)
    http://borasky-research.net/

    If God had meant for carrots to be eaten cooked, He would have given rabbits fire.
     
    M. Edward (Ed) Borasky, Apr 13, 2007
    #5
  6. John Joyce

    John Joyce Guest

    The trouble is that PDFs are not all alike. Sometimes there is no
    text at all in a PDF: it can be all vector outlines or all raster
    images. There is never a guarantee that you will get any or all of
    the text that is otherwise human-readable in a PDF. PDF has really
    become a kitchen-sink format, so it is good to anticipate trouble
    when parsing PDF files.
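
    One way to anticipate that, whatever tool does the extraction, is
    to check the result before relying on it. A rough sketch follows;
    the method name usable_text? and the min_chars threshold are
    assumptions for illustration, not part of any library:

```ruby
# Hypothetical helper: true when the extracted text holds at least
# min_chars non-whitespace characters. Whitespace here includes the
# form feeds that some extractors emit between pages.
def usable_text?(text, min_chars: 1)
  # scrub replaces invalid byte sequences so the regexp cannot raise.
  text.to_s.scrub.gsub(/\s+/, "").length >= min_chars
end
```

    A PDF that is all outlines or scanned images would typically give
    an empty or whitespace-only result, which this check reports as
    unusable so the caller can fall back to something else.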
     
    John Joyce, Apr 14, 2007
    #6