PDF Parsing Project Example

  • Thread starter Felipe Espinoza

Felipe Espinoza

Hi,

I'm looking for an example of parsing PDFs. I tried to implement this
with Ruby and the docsplit gem, but it uses an external tool to extract
the text, there are problems with number references, and you still have
to parse the resulting text file with regular expressions.

I want to parse some papers in PDF format to extract their titles,
keywords, authors, authors' email addresses, institutions, etc.

I'm looking for an experienced Ruby developer who knows a better way to
do this without parsing a text file through regular expressions.

Greetings
 

James

[Note: parts of this message were removed to make it a legal post.]

Regular Expressions are pretty much the standard way of parsing text files,
aren't they? Certainly they're what I've been using for years now.

What's the problem you're having with them?
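
For reference, this is roughly what the regular-expression approach looks like once the PDF has already been converted to plain text. It is only a sketch: the sample text, the `EMAIL` pattern, and the helper names `extract_emails` and `extract_keywords` are made up for illustration, and real extractor output is usually much messier.

```ruby
# Hypothetical text standing in for the first page of a paper after
# conversion to plain text.
SAMPLE = <<~TEXT
  A Study of Something
  Jane Doe, John Roe
  jane.doe@example.org, john.roe@example.org
  Keywords: parsing, pdf, ruby
TEXT

# Deliberately loose pattern; it is not a full RFC 5322 address matcher.
EMAIL = /[\w.+-]+@[\w-]+(?:\.[\w-]+)+/

# Return every distinct email-looking token in the text.
def extract_emails(text)
  text.scan(EMAIL).uniq
end

# Return the comma- or semicolon-separated items on a "Keywords:" line,
# or an empty array when no such line exists.
def extract_keywords(text)
  line = text[/^Keywords?:\s*(.+)$/i, 1]
  line ? line.split(/\s*[,;]\s*/).map(&:strip) : []
end

p extract_emails(SAMPLE)
p extract_keywords(SAMPLE)
```

Titles, authors, and institutions have no such fixed pattern, which is where this approach tends to get fragile.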
 

Phillip Gawlowski

> Regular Expressions are pretty much the standard way of parsing text files,
> aren't they? Certainly they're what I've been using for years now.

PDFs aren't "just" text files.

A randomly-chosen excerpt from a random PDF I have lying about:

11 0 obj
<< /Title(1. The Quest for Quantum Gravity)
/Dest/section.1
/Parent 10 0 R
/Next 12 0 R
>>
endobj

Source: <http://arxiv.org/abs/1010.3420v1>

I could have excerpted parts of the binary blob this PDF includes at
the start, but I'd rather not break anyone's email client without
intending to. ;)
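
To make the "binary blob" point concrete, here is a small sketch of why a regex over the raw file often finds nothing. It assumes the common case where page content streams are stored compressed with the /FlateDecode filter; `CONTENT` is a made-up stream for illustration, not bytes taken from the PDF above.

```ruby
require "zlib"

# Hypothetical page content: PDF operators that would draw two lines.
CONTENT = "BT /F1 18 Tf 72 720 Td (The Quest for Quantum Gravity) Tj ET\n" \
          "BT /F1 12 Tf 72 690 Td (The Quest for Quantum Gravity, continued) Tj ET\n"

# Roughly what a PDF writer stores when the stream uses /FlateDecode.
STORED = Zlib::Deflate.deflate(CONTENT)

# Searching the stored bytes for a word from the text fails...
puts STORED.include?("Quantum")

# ...while the same search succeeds once the stream is inflated.
puts Zlib::Inflate.inflate(STORED).include?("Quantum")
```

So anything that works on the text has to decode the streams first, which is the job of the external tools (or of a PDF library).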

--
Phillip Gawlowski

Though the folk I have met,
(Ah, how soon!) they forget
When I've moved on to some other place,
There may be one or two,
When I've played and passed through,
Who'll remember my song or my face.
 

Alex Young

Whenever I've done this, I've used pdftohtml to produce an HTML file
which Nokogiri can then handle. Yes, it's an external tool, but it's
been reliable for me.
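
A rough sketch of that idea, with several assumptions: it uses pdftohtml's `-xml` output rather than plain HTML, because that output keeps font and position attributes. It uses REXML instead of Nokogiri so that it runs without extra gems; Nokogiri would do the same job with its own API. `SAMPLE_XML` is a hand-written stand-in for the converter's output, and `guess_title` with its "largest font on the first page" rule is a heuristic made up for illustration.

```ruby
require "rexml/document"

# Hypothetical sample shaped like `pdftohtml -xml` output: <fontspec>
# declares fonts, and each <text> carries a position and a font id.
SAMPLE_XML = <<~XML
  <pdf2xml>
    <page number="1" height="1188" width="918">
      <fontspec id="0" size="24" family="Times" color="#000000"/>
      <fontspec id="1" size="12" family="Times" color="#000000"/>
      <text top="120" left="150" width="600" height="28" font="0"><b>An Example</b></text>
      <text top="150" left="150" width="600" height="28" font="0"><b>Paper Title</b></text>
      <text top="200" left="150" width="300" height="14" font="1">Jane Doe, John Roe</text>
    </page>
  </pdf2xml>
XML

# Concatenate all text beneath an element, including nested <b>/<i>.
def inner_text(element)
  element.children.map do |child|
    case child
    when REXML::Text    then child.value
    when REXML::Element then inner_text(child)
    else ""
    end
  end.join
end

# Heuristic (an assumption, not a rule): the title is whatever is set
# in the largest font on the first page.
def guess_title(xml)
  page = REXML::Document.new(xml).elements["//page"]
  sizes = {}
  page.elements.each("fontspec") do |font|
    sizes[font.attributes["id"]] = font.attributes["size"].to_i
  end
  largest = sizes.max_by { |_id, size| size }&.first
  lines = []
  page.elements.each("text") do |text|
    lines << inner_text(text).strip if text.attributes["font"] == largest
  end
  lines.join(" ")
end

puts guess_title(SAMPLE_XML)
```

The point of going through markup instead of plain text is that layout hints (font size, position on the page) survive the conversion, so the title or author block can be picked by structure rather than by regex alone.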
 
