Read and extract text from a PDF

Julien ARNOUX

Hi,
I have a problem :) I just want to extract text from a PDF file with
Python. There are different libraries for that, but none of them works
reliably...

pyPdf and pdfTools: I don't know why, but they don't work with some
PDFs. For example, space characters are deleted from the text.
Pdf playground: I don't understand how it works.

If you have an idea, a tutorial, a library, or anything else that could
help me do that, please let me know.
 
Rene Pijlman

Julien ARNOUX:
I have a problem :) I just want to extract text from a PDF file with
Python. There are different libraries for that, but none of them works
reliably...

pyPdf and pdfTools: I don't know why, but they don't work with some
PDFs...

Text can be represented in different ways in PDF: as tagged text, bitmap
and vector images, and even algorithms (IIRC). Most tools will only be
able to retrieve text represented as tagged text. So some tools may work
on some texts in some files and fail on others.
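
To illustrate the point above, here is a minimal sketch that flags pages on which a library finds no extractable text (for example, scanned pages stored as images). It assumes the third-party pypdf package, the maintained successor of pyPdf, is installed; the helper names extract_page_texts and pages_without_text are made up for this example.

```python
def extract_page_texts(pdf_path):
    """Return the extracted text of each page as a list of strings.

    Assumption: the third-party 'pypdf' package is installed.
    """
    from pypdf import PdfReader  # imported here so the helper below works without pypdf

    reader = PdfReader(pdf_path)
    # extract_text() may yield an empty result for image-only pages
    return [page.extract_text() or "" for page in reader.pages]


def pages_without_text(page_texts):
    """Return the 0-based indices of pages whose extracted text is empty or whitespace."""
    return [i for i, text in enumerate(page_texts) if not (text or "").strip()]


if __name__ == "__main__":
    import sys

    texts = extract_page_texts(sys.argv[1])
    empty = pages_without_text(texts)
    print(f"{len(texts)} pages, {len(empty)} with no extractable text: {empty}")
```

Pages reported as empty probably hold their text as images or outlines, so a text extractor will not help there; OCR would be needed instead.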
 
avishay

You can use Ghostscript for that purpose. Look at the ps2ascii script (or
batch file) in the Ghostscript distribution. You can either call
Ghostscript from the command line or use its DLL (I don't know whether a
Python binding already exists...). The limitations the previous poster
mentioned, however, still apply.

Avishay
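
A rough sketch of the command-line route from Python, using the standard subprocess module. It assumes a Ghostscript build that includes the txtwrite output device (a more recent alternative to the ps2ascii script) and that the executable is called gs (on Windows it is usually gswin64c). The function names are hypothetical.

```python
import shutil
import subprocess


def build_gs_command(pdf_path, gs_executable="gs"):
    """Build the Ghostscript argument list that writes extracted text to stdout."""
    return [
        gs_executable,
        "-q",                 # suppress startup banner
        "-dNOPAUSE",          # do not wait for a key press between pages
        "-dBATCH",            # exit when the input has been processed
        "-dSAFER",            # restrict file access from the PDF/PostScript side
        "-sDEVICE=txtwrite",  # text extraction device (assumed to be available)
        "-sOutputFile=-",     # write to standard output
        pdf_path,
    ]


def pdf_to_text(pdf_path, gs_executable="gs"):
    """Run Ghostscript on pdf_path and return the extracted text."""
    if shutil.which(gs_executable) is None:
        raise FileNotFoundError(f"Ghostscript executable not found: {gs_executable}")
    result = subprocess.run(
        build_gs_command(pdf_path, gs_executable),
        capture_output=True,
        check=True,  # raise CalledProcessError if Ghostscript fails
    )
    return result.stdout.decode("utf-8", errors="replace")


if __name__ == "__main__":
    import sys

    print(pdf_to_text(sys.argv[1]))
```

As noted above, this only recovers text that is actually stored as text in the file.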
 
