convert .pdf files to .txt files


Davor

Hi, my name is David.
I need to read information from .pdf files and convert it to .txt files,
and I have to do this in Python.
I have been looking for Python libraries and pdftools seems to
be the solution, but I do not know how to use it well.
This is the example that I found on the internet:


    # Python 2 syntax (print statements)
    from pdftools.pdffile import PDFDocument
    from pdftools.pdftext import Text

    def contents_to_text(contents):
        # Walk the (possibly nested) contents list and yield the text
        # of every Text item found.
        for item in contents:
            if isinstance(item, list):
                for i in contents_to_text(item):
                    yield i
            elif isinstance(item, Text):
                yield item.text

    doc = PDFDocument("/home/dave/pruebas_ficheros/carlos.pdf")
    n_pages = doc.count_pages()
    text = []

    # Pages are numbered from 1.
    for n_page in range(1, n_pages + 1):
        print "Page", n_page
        page = doc.read_page(n_page)
        contents = page.read_contents().contents
        text.extend(contents_to_text(contents))

    print "".join(text)

The problem is that on some PDFs it runs words together, and in
Spanish text the accents ("acentos") come out wrong:
a word like "camión" becomes cami/86n, and
"IMPLEMENTACIÓN" becomes "IMPLEMENTACI?", with strange
characters.
If someone knows how to use pdftools and can help me, I would be
very happy.

Another thing is that I can see the text read from the .pdf on the
screen, but I do not know how to create a .txt file and save this
information in it.


Sorry for my English.
Thanks for everything.
 

Baiju M

Davor said:
> Hi, my name is David.
> I need to read information from .pdf files and convert it to .txt files,
> and I have to do this in Python.

If you have 'xpdf' installed on your system, the
'pdftotext' command will be available.

To convert a PDF to text from Python, use a system call.
For example:

    import os
    os.system("pdftotext -layout my_pdf_file.pdf")

This will create the file 'my_pdf_file.txt'.
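
The same call can also be made without going through the shell. The following is only a minimal sketch, assuming Python 3 and a pdftotext binary on PATH; the helper names build_pdftotext_command and pdf_to_text are made up for illustration, and passing "-enc UTF-8" (an option listed in the pdftotext man page) is an assumption about what suits accented Spanish text.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdftotext_command(pdf_path, txt_path=None, encoding="UTF-8"):
    """Return the argument list for a pdftotext call (nothing is executed)."""
    cmd = ["pdftotext", "-layout", "-enc", encoding, str(pdf_path)]
    if txt_path is not None:
        cmd.append(str(txt_path))
    return cmd


def pdf_to_text(pdf_path, txt_path=None, encoding="UTF-8"):
    """Run pdftotext and return the path of the text file it should write."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on PATH")
    # check=True raises CalledProcessError if pdftotext exits with an error.
    subprocess.run(build_pdftotext_command(pdf_path, txt_path, encoding),
                   check=True)
    # Without an explicit output name, pdftotext writes next to the PDF,
    # replacing the .pdf suffix with .txt.
    if txt_path is not None:
        return Path(txt_path)
    return Path(pdf_path).with_suffix(".txt")
```

Calling pdf_to_text("my_pdf_file.pdf") would then be equivalent to the os.system line above, with the difference that a missing binary or a failed conversion raises an exception instead of passing silently.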

Regards,
Baiju M
 
vasudevram

If you don't already have xpdf, you can get it here:

http://glyphandcog.com/Xpdf.html

Install it and then try what Baiju said; it should work.
I've used it and it's good, which is why I say it should work. If you
have any problems, post here again.

-------------------------------------------------------------------------------------------
Vasudev Ram
Independent software consultant
Personal site: http://www.geocities.com/vasudevram
PDF conversion tools: http://sourceforge.net/projects/xtopdf
-------------------------------------------------------------------------------------------
 
David Boddie

Davor said:
> Hi, my name is David.
> I need to read information from .pdf files and convert it to .txt files,
> and I have to do this in Python.
> I have been looking for Python libraries and pdftools seems to
> be the solution, but I do not know how to use it well.
> This is the example that I found on the internet:
> [...]
>
>     for n_page in range(1, n_pages + 1):
>         print "Page", n_page
>         page = doc.read_page(n_page)
>         contents = page.read_contents().contents
>         text.extend(contents_to_text(contents))
>
>     print "".join(text)
>
> The problem is that on some PDFs it runs words together, and in
> Spanish text the accents ("acentos") come out wrong:
> a word like "camión" becomes cami/86n, and
> "IMPLEMENTACIÓN" becomes "IMPLEMENTACI?", with strange
> characters.

pdftools just extracts the textual data in the file and stores it in
Text instances - it doesn't try to interpret or decode the text. I'd
like to fix the library so that it does try and decode the text
properly and put it into unicode strings, but I don't have the time
right now.

Remember that text can be stored in PDF files in many different
ways, and that the text cannot always be extracted in its original
form.
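
To illustrate what "decoding" means here, a small sketch in Python 3. It is not pdftools code, and the byte string is made up: it assumes, purely for the example, that the text was stored as Latin-1. Real PDF files can use font-specific encodings, in which case a plain codec lookup like this is not enough.

```python
# Hypothetical raw bytes for "camión", assuming Latin-1 storage.
raw = b"cami\xf3n"

# Decoding with the matching codec recovers the accented character.
decoded = raw.decode("latin-1")

# Decoding with the wrong codec loses it; errors="replace" substitutes
# U+FFFD for the byte that ASCII cannot represent.
lossy = raw.decode("ascii", errors="replace")

print(decoded)
print(lossy)
```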

> If someone knows how to use pdftools and can help me, I would be
> very happy.
>
> Another thing is that I can see the text read from the .pdf on the
> screen, but I do not know how to create a .txt file and save this
> information in it.

You need to do something like this:

    with open("myfilename", "w") as f:
        f.write("".join(text))
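
For accented text it can help to state the output encoding explicitly. A minimal sketch, assuming Python 3 and that the extracted pieces are already proper strings; the save_text helper, the UTF-8 default and the demo file name are illustrative choices, not part of pdftools.

```python
import os
import tempfile


def save_text(chunks, path, encoding="utf-8"):
    """Join the text chunks, write them to path, return the character count."""
    with open(path, "w", encoding=encoding) as out:
        return out.write("".join(chunks))


# Demo: write to a temporary directory and read the file back.
demo_chunks = ["cami", "\u00f3n", "\n"]  # "\u00f3" is the letter o with acute accent
demo_path = os.path.join(tempfile.mkdtemp(), "carlos.txt")
written = save_text(demo_chunks, demo_path)

with open(demo_path, encoding="utf-8") as f:
    round_trip = f.read()
```

Reading the file back with the same encoding returns the original text, accents included.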

> Sorry for my English.

Don't worry about it. It's much better than my Spanish will ever be.

Sorry I couldn't give you more help with this. You may find that the
other tools mentioned by people in this thread will do what you
need better than pdftools can at the moment.

David
 
Davor

Thanks for everything you wrote; it will be very useful to me. In the end I
used the code below. The file I give it is converted to .txt in the
directory where the file is located, and for documents written in Spanish
it gives no problems with accents in words like "camión" or
"introducción", which was very important to me. Thanks!

    import os
    os.system("pdftotext -layout my_pdf_file.pdf")

    # This will create the 'my_pdf_file.txt' file.
 
