Script to extract text from PDF files

David Boddie · Sep 26, 2007

So (I'm not being a wise guy) how does pdftotext do it so well?

There's a little information on that online:

http://www.glyphandcog.com/textext.html

You would need to look at the source code to see exactly what it does.

The text I can extract from PDFs is extracted as it appears in the doc.
Although there are various ways to insert and encode text in PDFs,
it's also well documented in the PDF specifications
(http://www.adobe.com/devnet/pdf/pdf_reference.html).

Just because inserting and encoding is well documented doesn't mean that the
reverse processes are easy. :-/

Going back to pdftotext... it works well at extracting text from PDF.
I'd like a native Python library that does the same.

Maybe you should look at the source code for pdftotext, if that's an option.

This can be done, and it can be done in Python. I've made a small start;
my hope was that others would be interested in helping, but I can do it
on my own too... it'll just take a lot longer.

Can I suggest that you approach one or more authors of the existing Python
PDF solutions and work with them on this? There are at least four PDF parsers
written in Python out there.

David

brad · Sep 26, 2007

David said:
There's a little information on that online:
http://www.glyphandcog.com/textext.html

Thanks, I'll read that.

Just because inserting and encoding is well documented doesn't mean that the
reverse processes are easy. :-/

Boy, that's an understatement... most of the PDF tools (in fact almost
all) I come across write PDF docs... they output things to PDF. It's
like anyone can generate PDF files... it's dead simple, but extracting
text out of them in an accurate, reliable manner is much more difficult.

Maybe you should look at the source code for pdftotext, if that's an option.

I'm not sure it's opensource/free software with source available, but
I'll look into that.

Can I suggest that you approach one or more authors of the existing Python
PDF solutions and work with them on this? There are at least four PDF parsers
written in Python out there.

I appreciate that suggestion, but again, none of the current solutions
I've seen and tried extract text from PDF documents. I'd love to be
proven wrong on this point. So if one of those four current PDF
solutions you mention does that, please let me know.

Thanks,

Brad
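
For readers following the thread: until a native Python library does what pdftotext does, one stopgap is to call the pdftotext command-line tool from Python. The sketch below is only an illustration, not anything either poster wrote. It assumes pdftotext is installed and on the PATH; the function names build_pdftotext_command and extract_text are made up for this example. The -layout and -enc options and the "-" output argument (write to stdout) are standard pdftotext options.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, keep_layout=True):
    """Build the argument list for a pdftotext call that writes to stdout.

    Hypothetical helper for this sketch; it only assembles arguments.
    """
    command = ["pdftotext"]
    if keep_layout:
        # Ask pdftotext to preserve the physical layout of the page.
        command.append("-layout")
    # Request UTF-8 output, then give the input file; "-" means stdout.
    command.extend(["-enc", "UTF-8", pdf_path, "-"])
    return command


def extract_text(pdf_path, keep_layout=True):
    """Run pdftotext on pdf_path and return the extracted text as a string.

    Assumes the pdftotext executable is available on PATH.
    """
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on PATH")
    result = subprocess.run(
        build_pdftotext_command(pdf_path, keep_layout),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Calling extract_text("report.pdf") (a placeholder file name) would return whatever pdftotext prints for that file. This is a wrapper around an external program, so it does not address the wish for a pure-Python extractor.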
