Pdf Parsing Project Example

Felipe Espinoza · May 9, 2011

Hi,

I'm looking for an example of parsing PDF files. I tried to implement this
with Ruby and the docsplit gem, but it relies on an external tool to extract
the text, it has problems with numbered references, and you still have to
parse the resulting text file with regular expressions.

I want to parse some papers in PDF format to extract their titles,
keywords, authors, authors' email addresses, institutions, etc.

I'm hoping an experienced Ruby developer knows a better way to do
this than parsing a text file with regular expressions.

Greetings

James · May 9, 2011

Regular Expressions are pretty much the standard way of parsing text files,
aren't they? Certainly they're what I've been using for years now.

What's the problem you're having with them?
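
For reference, once the text has been extracted, a regex pass over it can be fairly short. The following is only a minimal sketch: the `EMAIL` pattern is deliberately loose, and the assumption that a paper carries a line starting with "Keywords:" will not hold for every layout. The method names are made up for illustration.

```ruby
# Loose e-mail pattern; good enough for a first pass, not RFC-complete.
EMAIL = /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/

# Return every distinct e-mail address found in the extracted text.
def extract_emails(text)
  text.scan(EMAIL).uniq
end

# Assumption: keywords sit on one line such as "Keywords: a, b; c".
# Returns an empty array when no such line exists.
def extract_keywords(text)
  line = text[/^\s*key\s?words\s*[:\-]\s*(.+)$/i, 1]
  return [] unless line
  line.split(/[,;]/).map(&:strip).reject(&:empty?)
end

sample = <<~TEXT
  The Quest for Quantum Gravity
  alice@example.org, bob@uni.example.edu
  Keywords: quantum gravity, loops; strings
TEXT

p extract_emails(sample)
p extract_keywords(sample)
```

Titles, authors and institutions are harder because they have no fixed marker, which may be where the trouble starts.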

Phillip Gawlowski · May 9, 2011

> Regular Expressions are pretty much the standard way of parsing text files,
> aren't they? Certainly they're what I've been using for years now.

PDFs aren't "just" text files.

A randomly-chosen excerpt from a random PDF I have lying about:

11 0 obj
<< /Title(1. The Quest for Quantum Gravity)
/Dest/section.1
/Parent 10 0 R
/Next 12 0 R
>>
endobj

Source: <http://arxiv.org/abs/1010.3420v1>

I could have excerpted parts of the binary blob this PDF includes at
the start, but I'd rather not break anyone's email client without
intending to.

--
Phillip Gawlowski

Though the folk I have met,
(Ah, how soon!) they forget
When I've moved on to some other place,
There may be one or two,
When I've played and passed through,
Who'll remember my song or my face.

Martin Boßlet · May 10, 2011

I recently spotted

https://github.com/yob/pdf-reader

but haven't had the time to play with it yet.
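
A minimal sketch of how it might be used, assuming the pdf-reader gem is installed. `describe_pdf` and `guess_title` are hypothetical helper names, and treating the first non-blank line of page one as the title is only a heuristic, not something the library provides. The Info dictionary is often empty in papers, so the metadata fields may come back nil.

```ruby
# Assumes: gem install pdf-reader

# Heuristic (an assumption): the first non-blank line of a page is its title.
def guess_title(page_text)
  page_text.to_s.each_line.map(&:strip).find { |line| !line.empty? }
end

# Collect document metadata plus a guessed title from the first page.
def describe_pdf(path)
  require "pdf/reader" # loaded lazily so guess_title works without the gem
  reader = PDF::Reader.new(path)
  info = reader.info || {}
  first_page = reader.pages.first
  {
    metadata_title: info[:Title],
    metadata_author: info[:Author],
    page_count: reader.page_count,
    guessed_title: first_page ? guess_title(first_page.text) : nil
  }
end

# Usage: ruby describe_pdf.rb paper.pdf
if __FILE__ == $PROGRAM_NAME && ARGV[0]
  p describe_pdf(ARGV[0])
end
```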

Regards,
Martin

Alex Young · May 10, 2011

Whenever I've done this in the past, I've used pdftohtml to produce an
HTML file which Nokogiri can then handle. Yes, it's an external tool,
but it's been reliable for me.
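
Roughly, the pipeline looks like the sketch below. It assumes poppler's pdftohtml is on the PATH and that its `-xml` output contains `fontspec` and `text` elements inside each `page`; check `pdftohtml -h` for the flags your version supports. The method names are made up, and "the title is the first-page text in the largest font" is a heuristic assumption that will not suit every paper.

```ruby
require "open3"

# Command line for pdftohtml: XML output, skip images, write to stdout.
def pdftohtml_command(pdf_path)
  ["pdftohtml", "-xml", "-i", "-stdout", pdf_path]
end

# Run the external tool and return its XML output as a string.
def pdf_to_xml(pdf_path)
  out, err, status = Open3.capture3(*pdftohtml_command(pdf_path))
  raise "pdftohtml failed: #{err}" unless status.success?
  out
end

# Heuristic: join all first-page text fragments set in the largest font.
def largest_text_on_first_page(xml)
  require "nokogiri" # loaded lazily; assumes the nokogiri gem is installed
  doc = Nokogiri::XML(xml)
  page = doc.at_xpath("//page")
  return nil unless page

  # Map each font id to its point size.
  sizes = {}
  doc.xpath("//fontspec").each { |f| sizes[f["id"]] = f["size"].to_i }

  texts = page.xpath("./text")
  return nil if texts.empty?

  biggest = texts.map { |t| sizes.fetch(t["font"], 0) }.max
  texts.select { |t| sizes.fetch(t["font"], 0) == biggest }
       .map { |t| t.text.strip }
       .join(" ")
end

# Usage: ruby title.rb paper.pdf
if __FILE__ == $PROGRAM_NAME && ARGV[0]
  puts largest_text_on_first_page(pdf_to_xml(ARGV[0]))
end
```

The position attributes on each `text` element can be used in the same way, for example to separate the author block from the abstract.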
