Script to extract text from PDF files

Discussion in 'Python' started by David Boddie, Sep 26, 2007.

  1. David Boddie


    On Wed Sep 26 15:06:54 CEST 2007, byte8bits wrote:

    > On Sep 25, 10:19 pm, Lawrence D'Oliveiro
    > <l... at geek-central.gen.new_zealand> wrote:
    >
    > > This is inherent in the nature of PDF: it's a page-description language,
    > > not a document-interchange language. Each text-drawing command can put a
    > > block of text anywhere on the page, so you have no idea, just from
    > > parsing the PDF content, how to join these blocks up into lines,
    > > paragraphs, columns etc.
    >
    > So (I'm not being a wise guy) how does pdftotext do it so well?


    There's a little information on that online:

    http://www.glyphandcog.com/textext.html

    You would need to look at the source code to see exactly what it does.
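
    As a rough illustration of the problem Lawrence describes, here is a
    minimal Python sketch of one naive way to join positioned text fragments
    into lines: sort by vertical position, group fragments whose baselines
    are close together, then order each group left to right. The Fragment
    type, the group_into_lines function and the y_tolerance value are all
    hypothetical, invented for this example. This is not how pdftotext
    works, and it ignores columns, rotation and font metrics.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """A hypothetical text fragment as a PDF content stream might place it."""
    x: float  # horizontal position where the fragment starts
    y: float  # baseline position (PDF coordinates: larger y is higher up)
    text: str


def group_into_lines(fragments, y_tolerance=2.0):
    """Naively join fragments into lines of text.

    A fragment joins the current line when its baseline is within
    y_tolerance of the first fragment on that line.
    """
    lines = []
    current = []
    # Visit fragments from the top of the page down.
    for frag in sorted(fragments, key=lambda f: -f.y):
        if current and abs(current[0].y - frag.y) > y_tolerance:
            lines.append(current)
            current = []
        current.append(frag)
    if current:
        lines.append(current)
    # Within each line, read left to right.
    return [
        " ".join(f.text for f in sorted(line, key=lambda f: f.x))
        for line in lines
    ]


# Fragments deliberately listed out of reading order, as a PDF may emit them.
fragments = [
    Fragment(x=120.0, y=700.0, text="world"),
    Fragment(x=72.0, y=680.0, text="Second"),
    Fragment(x=72.0, y=700.5, text="Hello"),
    Fragment(x=115.0, y=680.0, text="line"),
]
lines = group_into_lines(fragments)
print(lines)
```

    A two-column page defeats this straight away, because fragments from
    both columns share the same baselines and get merged into one line.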

    > The text I can extract from PDFs is extracted as it appears in the doc.
    > Although there are various ways to insert and encode text in PDFs,
    > it's also well documented in the PDF specifications
    > (http://www.adobe.com/devnet/pdf/pdf_reference.html).


    Just because inserting and encoding is well documented doesn't mean that the
    reverse processes are easy. :-/

    > Going back to pdftotext... it works well at extracting text from PDF.
    > I'd like a native Python library that does the same.


    Maybe you should look at the source code for pdftotext, if that's an option.
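
    Until a native library exists, one stopgap is to call the pdftotext
    command-line tool from Python. Below is a minimal sketch, assuming
    pdftotext (from Xpdf or Poppler) is installed and on the PATH. The
    helper names build_pdftotext_command and extract_text are made up for
    this example. Passing "-" as the output file asks pdftotext to write to
    stdout, and -layout asks it to keep the original physical layout as
    best it can.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, layout=False):
    """Build the argument list for a pdftotext call that writes to stdout."""
    command = ["pdftotext", "-enc", "UTF-8"]
    if layout:
        command.append("-layout")
    # "-" as the text-file argument sends the extracted text to stdout.
    command += [pdf_path, "-"]
    return command


def extract_text(pdf_path, layout=False):
    """Return the text of pdf_path; raise if pdftotext is missing or fails."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on the PATH")
    result = subprocess.run(
        build_pdftotext_command(pdf_path, layout),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout.decode("utf-8", errors="replace")
```

    Usage would look something like extract_text("example.pdf", layout=True),
    where example.pdf is a placeholder file name. This is a wrapper around an
    external program, not the native Python library being asked for.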

    > This can be done.
    > And, it can be done in Python. I've made a small start, my hope was that
    > others would be interested in helping, but I can do it on my own
    > too... it'll just take a lot longer :)


    Can I suggest that you approach one or more authors of the existing Python
    PDF solutions and work with them on this? There are at least four PDF parsers
    written in Python out there.

    David
     
    David Boddie, Sep 26, 2007
    #1

  2. brad

    David Boddie wrote:
    > There's a little information on that online:
    > http://www.glyphandcog.com/textext.html


    Thanks, I'll read that.

    > Just because inserting and encoding is well documented doesn't mean that the
    > reverse processes are easy. :-/


    Boy, that's an understatement... most of the PDF tools (in fact almost
    all) I come across write PDF docs... they output things to PDF. It's
    like anyone can generate PDF files... it's dead simple, but extracting
    text out of them in an accurate, reliable manner is much more difficult.

    > Maybe you should look at the source code for pdftotext, if that's an option.


    I'm not sure it's open source/free software with source available, but
    I'll look into that.

    > Can I suggest that you approach one or more authors of the existing Python
    > PDF solutions and work with them on this? There are at least four PDF parsers
    > written in Python out there.


    I appreciate that suggestion, but again, none of the current solutions
    I've seen and tried extract text from PDF documents. I'd love to be
    proven wrong on this point. So if one of those four current PDF
    solutions you mention does that, please let me know.

    Thanks,

    Brad
     
    brad, Sep 26, 2007
    #2