Script to extract text from PDF files

Discussion in 'Python' started by brad, Sep 25, 2007.

  1. brad

    brad Guest

    I have a very crude Python script that extracts text from some (and I
    emphasize some) PDF documents. On many PDF docs, I cannot extract text,
    but this is because I'm doing something wrong. The PDF spec is large and
    complex and there are various ways in which to store and encode text. I
    wanted to post here and ask if anyone is interested in helping make the
    script better which means it should accurately extract text from most
    any pdf file... not just some.

    I know the topic of reading/extracting the text from a PDF document
    natively in Python comes up every now and then on comp.lang.python...
    I've posted about it in the past myself. After searching for other
    solutions, I've resorted to attempting this on my own in my spare time.
    Using apps external to Python (pdftotext, etc.) is not really an option
    for me. If someone knows of a free native Python app that does this now,
    let me know and I'll use that instead!

    So, if other more experienced programmers are interested in helping make
    the script better, please let me know. I can host a website and the
    latest revision and do all of the grunt work.

    Thanks,

    Brad
     
    brad, Sep 25, 2007
    #1

  2. brad

    Paul Hankin Guest

    On Sep 25, 6:41 pm, brad <> wrote:
    > I have a very crude Python script that extracts text from some (and I
    > emphasize some) PDF documents. On many PDF docs, I cannot extract text,
    > but this is because I'm doing something wrong. The PDF spec is large and
    > complex and there are various ways in which to store and encode text. I
    > wanted to post here and ask if anyone is interested in helping make the
    > script better which means it should accurately extract text from most
    > any pdf file... not just some.
    >
    > I know the topic of reading/extracting the text from a PDF document
    > natively in Python comes up every now and then on comp.lang.python...
    > I've posted about it in the past myself. After searching for other
    > solutions, I've resorted to attempting this on my own in my spare time.
    > Using apps external to Python (pdftotext, etc.) is not really an option
    > for me. If someone knows of a free native Python app that does this now,
    > let me know and I'll use that instead!


    Googling for 'pdf to text python' and following the first link gives
    http://pybrary.net/pyPdf/

    --
    Paul Hankin
     
    Paul Hankin, Sep 25, 2007
    #2

  3. brad

    Guest

    On Sep 25, 3:02 pm, Paul Hankin <> wrote:
    > Googling for 'pdf to text python' and following the first link gives http://pybrary.net/pyPdf/


    Doesn't work that well, I've tried it, you should too... the author
    even admits this:

    extractText()

    Locate all text drawing commands, in the order they are provided
    in the content stream, and extract the text. This works well for some
    PDF files, but poorly for others, depending on the generator used.
    This will be refined in the future. Do not rely on the order of text
    coming out of this function, as it will change if this function is
    made more sophisticated. - source http://pybrary.net/pyPdf/pythondoc-pyPdf.pdf.html
     
    , Sep 25, 2007
    #3
  4. In message <>,
    wrote:

    > On Sep 25, 3:02 pm, Paul Hankin <> wrote:
    >
    >> Googling for 'pdf to text python' and following the first link
    >> gives http://pybrary.net/pyPdf/

    >
    > Doesn't work that well...


    This is inherent in the nature of PDF: it's a page-description language, not
    a document-interchange language. Each text-drawing command can put a block
    of text anywhere on the page, so you have no idea, just from parsing the
    PDF content, how to join these blocks up into lines, paragraphs, columns
    etc.
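
    To make the point concrete, here is a minimal sketch of the kind of
    position-based guessing an extractor has to do once it knows where each
    block of text was drawn. The (x, y, text) tuples, the function name
    fragments_to_lines and the y_tolerance value are all made up for
    illustration; a real extractor would first have to compute the
    coordinates by tracking the text-positioning operators in the content
    stream.

```python
def fragments_to_lines(fragments, y_tolerance=2.0):
    """Join (x, y, text) fragments into lines: top to bottom, left to right.

    Hypothetical helper: the fragments are assumed to be already positioned
    in page coordinates, which is the hard part in a real PDF.
    """
    # PDF y coordinates grow upwards, so the top of the page has the largest y.
    ordered = sorted(fragments, key=lambda f: (-f[1], f[0]))

    lines = []  # each entry: (baseline_y, [(x, text), ...])
    for x, y, text in ordered:
        if lines and abs(lines[-1][0] - y) <= y_tolerance:
            # Close enough to the current baseline: same line.
            lines[-1][1].append((x, text))
        else:
            lines.append((y, [(x, text)]))

    # Within a line, order the fragments by x and join them with spaces.
    return [" ".join(t for _, t in sorted(parts)) for _, parts in lines]


# Fragments deliberately listed out of reading order.
sample = [
    (150, 700, "world"),
    (72, 700, "Hello"),
    (72, 686, "second"),
    (120, 686.5, "line"),
]
print(fragments_to_lines(sample))
```

    This only groups fragments that share a baseline. Columns, footnotes and
    captions need further heuristics on top, which is where the guesswork
    comes in.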
     
    Lawrence D'Oliveiro, Sep 26, 2007
    #4
  5. brad

    Guest

    On Sep 25, 10:19 pm, Lawrence D'Oliveiro <l...@geek-central.gen.new_zealand> wrote:

    > > Doesn't work that well...

    >
    > This is inherent in the nature of PDF: it's a page-description language, not
    > a document-interchange language. Each text-drawing command can put a block
    > of text anywhere on the page, so you have no idea, just from parsing the
    > PDF content, how to join these blocks up into lines, paragraphs, columns
    > etc.


    So (I'm not being a wise guy) how does pdftotext do it so well? The
    text I can extract from PDFs is extracted as it appears in the doc.
    Although there are various ways to insert and encode text in PDFs,
    they are all well documented in the PDF specification
    (http://www.adobe.com/devnet/pdf/pdf_reference.html). Going back to
    pdftotext... it works well at extracting text from PDF. I'd like a
    native Python library that does the same. This can be done. And, it
    can be done in Python. I've made a small start, my hope was that
    others would be interested in helping, but I can do it on my own
    too... it'll just take a lot longer :)

    Brad
     
    , Sep 26, 2007
    #5
  6. On Sep 25, 9:18 pm, wrote:
    > On Sep 25, 3:02 pm, Paul Hankin <> wrote:
    >
    > > Googling for 'pdf to text python' and following the first link gives http://pybrary.net/pyPdf/

    >
    > Doesn't work that well, I've tried it, you should too... the author
    > even admits this:
    >
    > extractText()
    >
    > Locate all text drawing commands, in the order they are provided
    > in the content stream, and extract the text. This works well for some
    > PDF files, but poorly for others, depending on the generator used.
    > This will be refined in the future. Do not rely on the order of text
    > coming out of this function, as it will change if this function is
    > made more sophisticated. - source http://pybrary.net/pyPdf/pythondoc-pyPdf.pdf.html


    I have downloaded this package and installed it and found that the
    text extraction is more or less useless. Looking into the code and
    comparing it with the PDF spec shows a very early implementation of
    text extraction. Luckily, it is possible to override the text-extraction
    method of the base class without having to fiddle with the original
    code. I tried to contact the developer to offer some help on
    implementing text extraction, but he didn't answer my emails.
    --
    Svenn
     
    Svenn Are Bjerkem, Sep 26, 2007
    #6
  7. brad

    Guest

    On Sep 26, 4:49 pm, Svenn Are Bjerkem <>
    wrote:

    > I have downloaded this package and installed it and found that the
    > text-extraction is more or less useless. Looking into the code and
    > comparing with the PDF spec show a very early implementation of text
    > extraction. Luckily it is possible to overwrite the textextraction
    > method in the base class without having to fiddle with the original
    > code. I tried to contact the developer to offer some help on
    > implementing text extraction, but he didn't answer my emails.
    > --
    > Svenn


    Well, feel free to send any ideas or help to me! It seems simple... Do
    a binary read. Find 'stream' and 'endstream' sections.
    zlib.decompress() all the streams. Find BT and ET markers (Begin Text
    & End Text) and finally locate the parens within those and string the
    text together. This works great on 3 out of 10 PDF documents, but my
    main issue seems to be the zlib compressed streams. Some of them don't
    seem to be FlateDecodeable (although they claim to be) or the header
    is somehow incorrect. But, once I get a good stream and decompress it,
    things are OK from that point on. Seriously, if you have ideas, please
    let me know. I'll be glad to share what I've got so far.
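
    Roughly, the steps above can be sketched like this. It is a minimal
    illustration and not the actual script: the names STREAM_RE,
    TEXT_BLOCK_RE, LITERAL_RE, inflate and extract_text are made up here. It
    assumes every interesting stream is plain FlateDecode, skips anything
    that does not inflate, and ignores hex strings, nested parentheses and
    font encodings.

```python
import re
import zlib

# Bytes between a "stream" keyword (followed by an end-of-line) and "endstream".
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
# Text objects: everything between the BT and ET operators.
TEXT_BLOCK_RE = re.compile(rb"\bBT\b(.*?)\bET\b", re.DOTALL)
# Literal strings in parentheses; backslash escapes allowed, nesting is not.
LITERAL_RE = re.compile(rb"\(((?:\\.|[^\\()])*)\)")


def inflate(raw):
    """Return the inflated bytes, or None if raw is not zlib data."""
    try:
        # A decompress object stops at the end of the deflate data and leaves
        # any trailing bytes (such as the end-of-line before "endstream")
        # in its unused_data attribute.
        return zlib.decompressobj().decompress(raw)
    except zlib.error:
        return None


def extract_text(pdf_bytes):
    """Collect the literal strings found inside BT ... ET blocks."""
    pieces = []
    for stream in STREAM_RE.finditer(pdf_bytes):
        data = inflate(stream.group(1))
        if data is None:
            continue  # not Flate data, or a filter this sketch does not handle
        for block in TEXT_BLOCK_RE.finditer(data):
            for literal in LITERAL_RE.finditer(block.group(1)):
                pieces.append(literal.group(1).decode("latin-1"))
    return pieces


# Tiny hand-made fragment, just enough to exercise the code (not a valid PDF).
content = b"BT /F1 12 Tf 72 720 Td (Hello) Tj (World) Tj ET"
fake_pdf = (
    b"1 0 obj\n<< /Filter /FlateDecode >>\nstream\n"
    + zlib.compress(content)
    + b"\nendstream\nendobj\n"
)
print(extract_text(fake_pdf))
```

    Searching for 'endstream' is itself a shortcut: the stream dictionary's
    /Length entry gives the real size of the data, and a stream whose /Filter
    entry names more than one filter will not inflate directly like this.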

    Not many people seem to be interested. I'll stop adding to this
    thread... I don't want to beat a dead horse. Anyone interested in
    helping can contact me via email.

    Thanks,

    Brad
     
    , Sep 26, 2007
    #7
  8. On Sep 26, 11:50 pm, wrote:
    > On Sep 26, 4:49 pm, Svenn Are Bjerkem <>
    > wrote:
    >
    > > I have downloaded this package and installed it and found that the
    > > text-extraction is more or less useless. Looking into the code and
    > > comparing with the PDF spec show a very early implementation of text
    > > extraction. Luckily it is possible to overwrite the textextraction
    > > method in the base class without having to fiddle with the original
    > > code. I tried to contact the developer to offer some help on
    > > implementing text extraction, but he didn't answer my emails.
    > > --
    > > Svenn

    >
    > Well, feel free to send any ideas or help to me! It seems simple... Do
    > a binary read. Find 'stream' and 'endstream' sections.
    > zlib.decompress() all the streams. Find BT and ET markers (Begin Text
    > & End Text) and finally locate the parens within those and string the
    > text together. This works great on 3 out of 10 PDF documents, but my
    > main issue seems to be the zlib compressed streams. Some of them don't
    > seem to be FlateDecodeable (although they claim to be) or the header
    > is somehow incorrect. But, once I get a good stream and decompress it,
    > things are OK from that point on. Seriously, if you have ideas, please
    > let me know. I'll be glad to share what I've got so far.


    So far I have found that extracting text from the IEEE journal papers
    is not as simple as described above. The IEEE journals are typeset in
    typical journal style, with two-column body text, a one-column
    abstract, and a blob of header and author information. Add figures,
    formulas, and footnotes spread around the page, and you are basically
    using every block text layout command there is in PDF.

    I wanted to get pdftotext from the xpdf package to see what that
    tool does with the IEEE PDFs, and to decide whether I should dive into
    its sources to see how it gets things right. I have not got that far
    yet. The purpose of my work was to extract the abstract of each paper
    and put it into a database for later searching, but IEEE also has a
    search engine on their journal DVD => postpone Python work.

    Got my Gentoo machine back on track, so that may change
    again......
    --
    Svenn
     
    Svenn Are Bjerkem, Sep 27, 2007
    #8
