Extract Text from PDF

Mark Dodwell · Apr 13, 2007

Hi,

Does anyone know a way to extract plain text from a PDF using Ruby?

Many Thanks,

~ Mark

Robert Klemme · Apr 13, 2007

Does anyone know a way to extract plain text from a PDF using Ruby?

IIRC there is a project under way to extend PDFWriter with reading
capabilities. I don't know the current status of that. HTH

robert

Chris Lowis · Apr 13, 2007

Robert said:
IIRC there is a project under way to extend PDFWriter with reading
capabilities. I don't know the current status of that. HTH

In the meantime, you could use the command-line tools pdf2ps and ps2ascii
(I think they use Ghostscript as a backend), and read the resulting
ASCII file with Ruby in the usual way.
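A minimal sketch of that two-step approach, assuming both pdf2ps and ps2ascii are installed and on the PATH. The helper names ghostscript_commands and pdf_to_ascii are made up for illustration, and the sketch relies on ps2ascii writing to standard output when no output file is given:

```ruby
require "open3"
require "tmpdir"

# Build the two commands as argument arrays. No shell is involved,
# so file names containing spaces need no quoting.
def ghostscript_commands(pdf_path, ps_path)
  [["pdf2ps", pdf_path, ps_path], ["ps2ascii", ps_path]]
end

# Convert PDF -> PostScript -> plain text and return the text.
# Raises Errno::ENOENT if either tool is not installed.
def pdf_to_ascii(pdf_path)
  Dir.mktmpdir do |dir|
    ps_path = File.join(dir, "out.ps")
    to_ps, to_ascii = ghostscript_commands(pdf_path, ps_path)

    _out, err, status = Open3.capture3(*to_ps)
    raise "pdf2ps failed: #{err.strip}" unless status.success?

    text, err, status = Open3.capture3(*to_ascii)
    raise "ps2ascii failed: #{err.strip}" unless status.success?

    text
  end
end
```

Something like `puts pdf_to_ascii("report.pdf")` would then print whatever text the tools manage to recover; the file name here is only a placeholder.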

Regards,

Chris

Kouhei Sutou · Apr 13, 2007

Hi,

2007/4/13 said:
Does anyone know a way to extract plain text from a PDF using Ruby?

You can use Ruby/Poppler:
http://ruby-gnome2.sourceforge.jp/hiki.cgi?Ruby/Poppler

Here is an example to do that:
http://ruby-gnome2.cvs.sourceforge..../sample/pdf2text.rb?revision=HEAD&view=markup

Thanks,

M. Edward (Ed) Borasky · Apr 13, 2007

Robert said:
IIRC there is a project under way to extend PDFWriter with reading
capabilities. I don't know the current status of that. HTH

At least on Linux, there is "pdftotext", which is part of the "poppler"
package. So you can simply shell out to it if it's installed. If you're
more ambitious, you could write an extension to use the underlying
libraries in poppler.
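Shelling out could look roughly like this. It is a sketch that assumes pdftotext is on the PATH; the helper names pdftotext_command and extract_text are invented for the example. Passing "-" as the output file asks pdftotext to write to standard output:

```ruby
require "open3"

# Argument array for pdftotext: force UTF-8 output and send the
# text to stdout ("-") instead of writing a .txt file.
def pdftotext_command(pdf_path)
  ["pdftotext", "-enc", "UTF-8", pdf_path, "-"]
end

# Run pdftotext and return the extracted text as a UTF-8 string.
# Raises Errno::ENOENT if pdftotext is not installed.
def extract_text(pdf_path)
  out, err, status = Open3.capture3(*pdftotext_command(pdf_path))
  raise "pdftotext failed: #{err.strip}" unless status.success?
  out.force_encoding("UTF-8")
end
```

Using an argument array rather than a single command string avoids any shell quoting problems with unusual file names.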

John Joyce · Apr 14, 2007

The trouble is that PDFs are not all built the same way. Sometimes there
is no text in a PDF at all: the content can be all vector outlines or even
all raster images. There is never a guarantee that you will get any, let
alone all, of the text that a human can read on the page. PDF has really
become a kitchen-sink format, so it is good to anticipate trouble when
parsing PDF files.
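One way to anticipate that trouble is to check whether extraction actually produced anything. The sketch below assumes the text came from pdftotext with default options, which ends each page with a form feed character; blank_pages is a hypothetical helper name, not part of any library:

```ruby
# Given raw extractor output with a form feed ("\f") after each page,
# return the 1-based numbers of pages that contain no visible text.
def blank_pages(raw_text)
  pages = raw_text.split("\f", -1)
  # The piece after the final form feed is not a page; drop it if empty.
  pages.pop if pages.last && pages.last.empty?
  pages.each_with_index
       .select { |page, _index| page.strip.empty? }
       .map { |_page, index| index + 1 }
end
```

A page reported here may still show readable words on screen (scanned images, outlined fonts), so a blank result is a hint that OCR or manual handling is needed, not proof that the page is empty.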
