Read and extract text from pdf

Discussion in 'Python' started by Julien ARNOUX, Apr 21, 2006.

  1. Hi,
    I have a problem :) I just want to extract text from a PDF file with
    Python. There are different libraries for that, but they don't work...

    pyPdf and pdfTools: I don't know why, but they don't work with some
    PDFs. For example, space characters are deleted from the text.
    Pdf playground: I don't understand how it works.

    If you have an idea, a tutorial, a library, or anything else that could
    help me do that, please let me know.
    Julien ARNOUX, Apr 21, 2006
    #1

  2. Rene Pijlman (Guest)

    Julien ARNOUX:
    >I have a problem :) I just want to extract text from a PDF file with
    >Python. There are different libraries for that, but they don't work...
    >
    >pyPdf and pdfTools: I don't know why, but they don't work with some
    >PDFs...


    Text can be represented in different ways in PDF: as tagged text, bitmap
    and vector images, and even algorithms (IIRC). Most tools will only be
    able to retrieve text represented as tagged text. So some tools may work
    on some texts in some files and fail on others.

    --
    René Pijlman

    What do you want to learn? http://www.leren.nl
    Rene Pijlman, Apr 21, 2006
    #2
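
One way to see the effect described in post #2 is to check which pages come back empty after extraction, since a page stored as an image yields little or no text. This is a minimal sketch, assuming the text was already extracted by a tool that ends each page with a form feed character (pdftotext does this by default). The names `split_pages`, `pages_without_text`, and `min_chars` are hypothetical and chosen only for this example.

```python
from typing import List


def split_pages(extracted: str) -> List[str]:
    """Split extractor output into pages on form feed characters."""
    pages = extracted.split("\f")
    # A trailing form feed leaves one empty string at the end; drop it.
    if pages and pages[-1] == "":
        pages.pop()
    return pages


def pages_without_text(extracted: str, min_chars: int = 1) -> List[int]:
    """Return 1-based numbers of pages with fewer than min_chars visible characters."""
    flagged = []
    for number, page in enumerate(split_pages(extracted), start=1):
        visible = "".join(page.split())  # remove all whitespace
        if len(visible) < min_chars:
            flagged.append(number)
    return flagged


if __name__ == "__main__":
    sample = "First page text\f\fThird page text\f"
    print(pages_without_text(sample))  # prints [2]
```

A flagged page is only a hint that its text may be stored as an image or in some other form the tool cannot read. It does not prove that the page has no text.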

  3. avishay (Guest)

    You can use Ghostscript for that purpose. Look at the ps2ascii script
    (or batch file) in the Ghostscript distribution. You can either call
    Ghostscript from the command line or use its DLL (I don't know whether
    a Python binding already exists...). The limitations the previous
    poster mentioned, however, still apply.

    Avishay
    avishay, Apr 21, 2006
    #3
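
Calling the ps2ascii script from Python, as post #3 suggests, could look roughly like this. It is a sketch, not a tested recipe: it assumes a `ps2ascii` script from a Ghostscript installation is on the PATH and that it writes the text to standard output when no output file is given. The function names and the `executable` parameter are hypothetical.

```python
import shutil
import subprocess
from typing import List


def build_ps2ascii_command(pdf_path: str, executable: str = "ps2ascii") -> List[str]:
    """Build the argument list; no output file is passed, so text goes to stdout."""
    return [executable, pdf_path]


def extract_with_ps2ascii(pdf_path: str, executable: str = "ps2ascii") -> str:
    """Run ps2ascii on pdf_path and return whatever text it prints."""
    if shutil.which(executable) is None:
        raise RuntimeError(f"{executable!r} was not found on PATH")
    result = subprocess.run(
        build_ps2ascii_command(pdf_path, executable),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout
```

Usage would be something like `text = extract_with_ps2ascii("document.pdf")`, where `document.pdf` is a placeholder file name. Text stored as images will still be missing from the result.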
  4. Jim (Guest)

    There is a pdftotext executable, at least on Linux.
    Jim, Apr 21, 2006
    #4
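
The pdftotext executable from post #4 can be driven from Python with the standard `subprocess` module. The sketch below assumes `pdftotext` is installed and on the PATH. The function names and the `layout` parameter are hypothetical. The `-layout` option asks pdftotext to keep the physical layout of the page, which may help when spaces go missing, though that is not guaranteed for every file.

```python
import shutil
import subprocess
from typing import List


def build_pdftotext_command(
    pdf_path: str, layout: bool = False, executable: str = "pdftotext"
) -> List[str]:
    """Build a pdftotext argument list that sends UTF-8 text to stdout."""
    command = [executable]
    if layout:
        command.append("-layout")  # try to keep the original physical layout
    # "-" as the output file name makes pdftotext write to stdout.
    command += ["-enc", "UTF-8", pdf_path, "-"]
    return command


def extract_with_pdftotext(pdf_path: str, layout: bool = False) -> str:
    """Run pdftotext on pdf_path and return the extracted text."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on PATH")
    result = subprocess.run(
        build_pdftotext_command(pdf_path, layout),
        capture_output=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout.decode("utf-8", errors="replace")
```

For example, `extract_with_pdftotext("document.pdf", layout=True)` would return the text of a placeholder file `document.pdf` as one string, with pages separated by form feed characters.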