pdf2txt

B P

Is there a way, via Perl or even Python, to capture records from a PDF and
output a delimited text file? My workplace has a truckload of data forms
that were scanned as PDFs.

The data needs to be taken from the forms and moved into a database, so
I figure that comma-delimited format will work fine. The number of
man-hours it would take to do this manually is cost-prohibitive given
what we have to work with.

I know that a txt2pdf exists, and I was checking to see whether the
opposite exists as well. I'm not very well versed in this language yet,
but I am learning. Any assistance or push in the right direction would be
appreciated.

BP
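
For the output half of the question, here is a minimal sketch of writing records as comma-delimited text with Python's standard csv module. The field names and sample values are made up for illustration; the real forms will have their own. It assumes the records have already been pulled off the forms somehow.

```python
import csv
import io

# Hypothetical field names; the real forms will define their own.
FIELDS = ["name", "date", "amount"]

def records_to_csv(records, out):
    """Write an iterable of dicts to the file-like object `out` as CSV."""
    writer = csv.DictWriter(out, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for record in records:
        writer.writerow(record)

# Made-up sample data, chosen to show that embedded commas get quoted.
sample = [
    {"name": "Smith, John", "date": "2004-01-15", "amount": "120.50"},
    {"name": "Jane Doe", "date": "2004-01-16", "amount": "75.00"},
]

buffer = io.StringIO()
records_to_csv(sample, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Letting the csv module handle quoting avoids broken rows when a field itself contains a comma. To write to disk instead of a string buffer, pass a file opened with newline="" in place of the StringIO object.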
 
Peter J Lusby

If the documents were scanned, then the textual information has been
lost: all the PDF files will contain is TIFF images of the pages. There
is a program called eXtr@ct It! that will extract the information from
the image files and give you RTF output for the text and AutoCAD DXF or
HPGL for the graphic data. If your company is interested, let me know
and I will put you in touch with the software company. My return
address is not spamtrapped.

Regards
Peter

--
"A dust whom England bore, shaped, made aware"- Rupert Brooke, "The Soldier"

Peter J. Lusby
San Diego, California, USA
http://www.lusby.org
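
One way to check which kind of PDF you have before choosing a tool is a rough heuristic like the sketch below: a purely scanned file tends to contain image objects and no font resources. This is an assumption-laden hint, not a reliable test. It only inspects the raw bytes, so PDFs that keep their objects in compressed streams can fool it, and a scan that went through OCR may carry fonts for a hidden text layer. The function name and the sample fragments are hypothetical.

```python
import re

def looks_scanned(pdf_bytes):
    """Guess whether a PDF is image-only (scanned) from its raw bytes.

    Heuristic only: returns True when an image XObject marker is present
    and no font marker is found. Compressed object streams can hide both.
    """
    has_image = re.search(rb"/Subtype\s*/Image", pdf_bytes) is not None
    has_font = re.search(rb"/Font", pdf_bytes) is not None
    return has_image and not has_font

# Hand-written fragments standing in for real files, for illustration only.
scanned_fragment = b"<< /Type /XObject /Subtype /Image /Width 1700 /Height 2200 >>"
text_fragment = b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>"

print(looks_scanned(scanned_fragment))
print(looks_scanned(text_fragment))

# For a real file (the path is a placeholder):
# with open("form.pdf", "rb") as handle:
#     print(looks_scanned(handle.read()))
```

If the check suggests the file is image-only, a text extractor will return nothing useful and some form of OCR is needed first.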
 
Rene van Leeuwen

You may be able to use pdf2txt.pl from Shigeru Ishida.
ftp://www.isl.intec.co.jp/pub/person/ishida/freeware/pdf2txt/
but the site seems to be unavailable at this time...
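
Whichever extractor is used, its plain-text output still has to be split into fields. The sketch below assumes, purely for illustration, that each form comes out as "Label: value" lines with a blank line between forms; real output will almost certainly need a different pattern. It only helps for PDFs that actually contain text (or after OCR), not for bare scanned images.

```python
import re

# Assumed layout: one "Label: value" pair per line, blank line between forms.
LINE = re.compile(r"^\s*([^:]+?)\s*:\s*(.*?)\s*$")

def parse_records(text):
    """Turn extracted text into a list of dicts, one dict per form."""
    records = []
    current = {}
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current form, if one is open.
            if current:
                records.append(current)
                current = {}
            continue
        match = LINE.match(line)
        if match:
            label, value = match.groups()
            current[label.lower()] = value
    if current:
        records.append(current)
    return records

# Made-up sample standing in for the output of a text extractor.
sample_text = """\
Name: John Smith
Date: 2004-01-15

Name: Jane Doe
Date: 2004-01-16
"""

parsed = parse_records(sample_text)
print(parsed)
```

Lines that do not match the pattern are silently skipped here; for real data it may be safer to log them so that malformed forms can be checked by hand.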
 
