Script to extract text from PDF files

brad · Sep 25, 2007

I have a very crude Python script that extracts text from some (and I
emphasize some) PDF documents. On many PDF docs, I cannot extract text,
but this is because I'm doing something wrong. The PDF spec is large and
complex and there are various ways in which to store and encode text. I
wanted to post here and ask if anyone is interested in helping make the
script better, which means it should accurately extract text from almost
any PDF file... not just some.

I know the topic of reading/extracting the text from a PDF document
natively in Python comes up every now and then on comp.lang.python...
I've posted about it in the past myself. After searching for other
solutions, I've resorted to attempting this on my own in my spare time.
Using apps external to Python (pdftotext, etc.) is not really an option
for me. If someone knows of a free native Python app that does this now,
let me know and I'll use that instead!

So, if other more experienced programmers are interested in helping make
the script better, please let me know. I can host a website and the
latest revision and do all of the grunt work.

Thanks,

Brad

Paul Hankin · Sep 25, 2007

Googling for 'pdf to text python' and following the first link gives
http://pybrary.net/pyPdf/

byte8bits · Sep 25, 2007

Googling for 'pdf to text python' and following the first link gives http://pybrary.net/pyPdf/

Doesn't work that well, I've tried it, you should too... the author
even admits this:

extractText()

Locate all text drawing commands, in the order they are provided
in the content stream, and extract the text. This works well for some
PDF files, but poorly for others, depending on the generator used.
This will be refined in the future. Do not rely on the order of text
coming out of this function, as it will change if this function is
made more sophisticated. - source http://pybrary.net/pyPdf/pythondoc-pyPdf.pdf.html

Lawrence D'Oliveiro · Sep 26, 2007

Doesn't work that well...

This is inherent in the nature of PDF: it's a page-description language, not
a document-interchange language. Each text-drawing command can put a block
of text anywhere on the page, so you have no idea, just from parsing the
PDF content, how to join these blocks up into lines, paragraphs, columns
etc.
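A minimal sketch of this point, not taken from any script in this thread. It assumes a toy content stream in which every fragment is placed with an absolute text matrix (`1 0 0 1 x y Tm`) and drawn with `(text) Tj`; real PDFs use many more operators, and the names `CONTENT`, `fragments`, `in_stream_order` and `in_reading_order` are made up for illustration. Reading the fragments in stream order gives scrambled text, while sorting them by position recovers the lines on this simple page.

```python
import re

# Toy content stream (an assumption for illustration). The drawing order is
# deliberately different from the reading order.
CONTENT = b"""
BT 1 0 0 1 72 700 Tm (second line) Tj ET
BT 1 0 0 1 300 720 Tm (world) Tj ET
BT 1 0 0 1 72 720 Tm (Hello) Tj ET
"""

# Matches only the simple "1 0 0 1 x y Tm (text) Tj" pattern used above.
FRAGMENT = re.compile(
    rb"1 0 0 1 (-?\d+(?:\.\d+)?) (-?\d+(?:\.\d+)?) Tm\s*\(([^()\\]*)\)\s*Tj"
)


def fragments(content):
    """Yield (x, y, text) for each Tm/Tj pair, in stream order."""
    for x, y, text in FRAGMENT.findall(content):
        yield float(x), float(y), text.decode("latin-1")


def in_stream_order(content):
    """Text exactly as the drawing commands appear in the stream."""
    return [text for _, _, text in fragments(content)]


def in_reading_order(content):
    """Text sorted top to bottom, then left to right."""
    # The PDF y axis points up, so a larger y is higher on the page.
    ordered = sorted(fragments(content), key=lambda f: (-f[1], f[0]))
    return [text for _, _, text in ordered]
```

Sorting by position is only a heuristic: it works for a single column of plain lines and says nothing about paragraphs or multi-column layouts.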

byte8bits · Sep 26, 2007

So (I'm not being a wise guy) how does pdftotext do it so well? The
text I can extract from PDFs is extracted as it appears in the doc.
Although there are various ways to insert and encode text in PDFs,
it's also well documented in the PDF specifications
(http://www.adobe.com/devnet/pdf/pdf_reference.html). Going back to
pdftotext... it works well at extracting text from PDF. I'd like a
native Python library that does the same. This can be done. And, it
can be done in Python. I've made a small start, my hope was that
others would be interested in helping, but I can do it on my own
too... it'll just take a lot longer.

Brad

Svenn Are Bjerkem · Sep 26, 2007

I have downloaded this package and installed it and found that the
text extraction is more or less useless. Looking into the code and
comparing it with the PDF spec shows a very early implementation of text
extraction. Luckily it is possible to override the text-extraction
method in the base class without having to fiddle with the original
code. I tried to contact the developer to offer some help on
implementing text extraction, but he didn't answer my emails.

byte8bits · Sep 26, 2007

Well, feel free to send any ideas or help to me! It seems simple... Do
a binary read. Find 'stream' and 'endstream' sections.
zlib.decompress() all the streams. Find BT and ET markers (Begin Text
& End Text) and finally locate the parens within those and string the
text together. This works great on 3 out of 10 PDF documents, but my
main issue seems to be the zlib compressed streams. Some of them don't
seem to be FlateDecodeable (although they claim to be) or the header
is somehow incorrect. But, once I get a good stream and decompress it,
things are OK from that point on. Seriously, if you have ideas, please
let me know. I'll be glad to share what I've got so far.
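The steps described above can be sketched roughly as follows. This is not the poster's script (which was not shared in this thread); the names `inflate` and `extract_text` are made up for illustration. Two possible explanations for streams that refuse to inflate, neither confirmed here: the stream uses a different filter or a chain of filters (or the file is encrypted), or the bytes were sliced with the end-of-line that follows the `stream` keyword still attached. The sketch skips that end-of-line and uses `zlib.decompressobj()`, which stops at the end of the zlib data and so tolerates the end-of-line before `endstream`.

```python
import re
import zlib

# Raw bytes between "stream<EOL>" and "endstream".
STREAM = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
# Text objects: everything between the BT and ET operators.
TEXT_BLOCK = re.compile(rb"\bBT\b(.*?)\bET\b", re.DOTALL)
# Literal strings in parentheses; escapes such as \) are kept undecoded and
# nested parentheses are not handled.
LITERAL = re.compile(rb"\(((?:\\.|[^()\\])*)\)", re.DOTALL)


def inflate(data):
    """Return the inflated bytes, or None if data is not zlib data."""
    try:
        # decompressobj ignores bytes after the end of the zlib stream,
        # such as the end-of-line before 'endstream'.
        return zlib.decompressobj().decompress(data)
    except zlib.error:
        return None


def extract_text(pdf_bytes):
    """Collect literal strings found inside BT...ET in inflatable streams."""
    pieces = []
    for raw in STREAM.findall(pdf_bytes):
        content = inflate(raw)
        if content is None:
            # Not Flate data: another filter, an image, an uncompressed
            # stream, or an encrypted file. This sketch simply skips it.
            continue
        for block in TEXT_BLOCK.findall(content):
            for literal in LITERAL.findall(block):
                pieces.append(literal.decode("latin-1"))
    return pieces
```

Even when every stream inflates, this approach ignores fonts with custom encodings, hex strings in angle brackets, and the position of each fragment, so it can only be expected to work on a subset of documents.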

Not many people seem to be interested. I'll stop adding to this
thread... I don't want to beat a dead horse. Anyone interested in
helping can contact me via email.

Thanks,

Brad

Svenn Are Bjerkem · Sep 27, 2007

So far I have found that extracting text from the IEEE journal papers
is not as simple as described above. The IEEE journals are typeset
in typical journal style, with two-column body text, a one-column
abstract, and a blob of header and author information. Take
figures, formulas, and footnotes and spread them around the
journal, and you are basically using every block text layout command
there is in PDF.
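To illustrate why a two-column layout needs more than a top-to-bottom sort, here is a rough sketch under strong assumptions: the fragments have already been extracted as `(x, y, text)` tuples, the page has exactly two columns of equal width, and nothing spans both columns. The function name `two_column_order` is hypothetical, and this is not what pdftotext is known to do.

```python
def two_column_order(fragments, page_width):
    """Return the text of (x, y, text) fragments, left column first.

    Assumes two equal-width columns split at the middle of the page.
    """
    middle = page_width / 2.0
    left = [f for f in fragments if f[0] < middle]
    right = [f for f in fragments if f[0] >= middle]
    ordered = []
    for column in (left, right):
        # Within a column: top to bottom (larger y first), then left to right.
        ordered.extend(sorted(column, key=lambda f: (-f[1], f[0])))
    return [text for _, _, text in ordered]
```

A one-column abstract or header above the two-column body breaks the equal-split assumption, which is one reason journal pages like these are hard to handle.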

I wanted to get pdftotext from the xpdf package to see what that
tool does with the IEEE PDFs, and then decide whether I should dive into
its sources to see how it gets things right. So far I have not got
that far. The purpose of my work was to extract the abstract of each paper
to put into a database for later searching, but IEEE also has a search
engine on their journal DVD => postpone Python work.

Now that I've got my Gentoo machine back on track, that may change
again......
