Extract Text from PDF


Mark Dodwell

Hi,

Does anyone know a way to extract plain text from a PDF using Ruby?

Many Thanks,

~ Mark
 

Robert Klemme

Mark said:
Does anyone know a way to extract plain text from a PDF using Ruby?

IIRC there is a project under way to extend PDFWriter with reading
capabilities. I don't know the current status of that. HTH

robert
 

Chris Lowis

Robert said:
IIRC there is a project under way to extend PDFWriter with reading
capabilities. I don't know the current status of that. HTH

In the meantime, you could use the command-line tools pdf2ps and ps2ascii
(I think they use Ghostscript as a backend), and read the resulting
ASCII file with Ruby in the usual way.
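
A minimal sketch of that two-step route, assuming pdf2ps and ps2ascii are both installed and on the PATH. The method name extract_text_via_postscript and the runner parameter are made up for illustration; runner only exists so the external commands can be swapped out in a test.

```ruby
require "open3"
require "tmpdir"

# Hypothetical helper: converts PDF -> PostScript -> ASCII text by
# shelling out to pdf2ps and ps2ascii, then reads the text file back.
# `runner` defaults to Open3.capture3 and is injectable for testing.
def extract_text_via_postscript(pdf_path, runner: Open3.method(:capture3))
  Dir.mktmpdir do |dir|
    ps_path  = File.join(dir, "out.ps")
    txt_path = File.join(dir, "out.txt")

    # Each command is passed as an argument list, so no shell quoting is needed.
    steps = [
      ["pdf2ps", pdf_path, ps_path],
      ["ps2ascii", ps_path, txt_path]
    ]

    steps.each do |cmd|
      _out, err, status = runner.call(*cmd)
      raise "#{cmd.first} failed: #{err.strip}" unless status.success?
    end

    # The temporary directory is removed when the block exits,
    # so the text is read before leaving it.
    File.read(txt_path)
  end
end
```

Something like text = extract_text_via_postscript("document.pdf") would then give you a plain Ruby string to work with. If either tool is missing, Open3 raises Errno::ENOENT.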

Regards,


Chris
 

M. Edward (Ed) Borasky

Robert said:
IIRC there is a project under way to extend PDFWriter with reading
capabilities. I don't know the current status of that. HTH

robert

At least on Linux, there is "pdftotext", which is part of the "poppler"
package. So you can simply shell out to it if it's installed. If you're
more ambitious, you could write an extension to use the underlying
libraries in poppler.
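
Shelling out could look roughly like the sketch below. It assumes pdftotext is installed and on the PATH, and relies on pdftotext writing to standard output when the output file is given as "-". The names pdftotext_command and extract_text_with_pdftotext, and the runner parameter, are hypothetical and only there for illustration and testing.

```ruby
require "open3"

# Argument list for pdftotext; "-" as the output file means "write to stdout".
def pdftotext_command(pdf_path)
  ["pdftotext", pdf_path, "-"]
end

# Hypothetical helper: runs pdftotext and returns the extracted text.
# `runner` defaults to Open3.capture3 and is injectable for testing.
def extract_text_with_pdftotext(pdf_path, runner: Open3.method(:capture3))
  stdout, stderr, status = runner.call(*pdftotext_command(pdf_path))
  raise "pdftotext failed: #{stderr.strip}" unless status.success?
  stdout
rescue Errno::ENOENT
  # Raised by Open3 when the executable itself cannot be found.
  raise "pdftotext not found; is poppler installed?"
end
```

Calling extract_text_with_pdftotext("document.pdf") should return the text as a single string. Passing the command as separate arguments avoids shell quoting problems with odd file names.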
 

John Joyce

The trouble is, PDF is not always the same thing. Sometimes there is
no text at all in a PDF. It can be all vector art outlines or even
all raster images. There is never a guarantee that you will get any
or all of the text that is otherwise human-readable in a PDF.
PDF has really become a kitchen-sink format, so it is good to
anticipate trouble parsing PDF files.
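
One way to anticipate that trouble is to sanity-check whatever an extractor hands back before trusting it. The sketch below is only a rough heuristic: the method name usable_text? and the 20-character threshold are assumptions chosen for illustration, not anything a PDF tool defines.

```ruby
# Hypothetical heuristic: treat extracted output as usable only if it
# contains at least `min_chars` letters or digits. Whitespace and the
# form feeds that some extractors emit between pages do not count.
def usable_text?(text, min_chars: 20)
  # to_s handles nil; scrub replaces invalid byte sequences so scan cannot raise.
  text.to_s.scrub.scan(/[[:alnum:]]/).length >= min_chars
end
```

A scanned or outlined PDF typically comes back empty or as a few stray characters, so a false result here is a hint that the file may need OCR or manual handling rather than proof that it has no readable content.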
 
