Pdf Parsing Challenge

Discussion in 'Ruby' started by Felipe Espinoza, May 17, 2011.

  1. Hi Everyone,

    I'm trying to use the pdf-reader gem, but I'm having trouble
    understanding how the gem works.

    If someone can help me with this, I'll be really grateful.

    The Problem:

    I have to extract some data from a paper in PDF format. I just need
    some data from page 1: the title of the paper, the list of authors,
    the universities of these authors, their e-mail addresses, the
    abstract and the keywords.

    How can I extract this data from this paper?
    http://dl.dropbox.com/u/6928078/CLEI_2008_002.pdf

    A simple string that contains the complete contents of a field
    (keywords, abstract, etc.) would be enough.

    It's not necessary to use this gem, but I need a string for each field
    with this info. How can I do that?

    --
    Posted via http://www.ruby-forum.com/.
     
    Felipe Espinoza, May 17, 2011
    #1

  2. On Tue, May 17, 2011 at 11:04 PM, Felipe Espinoza
    <> wrote:
    >
    > I have to extract some data from a paper in a pdf format. I just need
    > some data from the page 1, like the title of the paper, the authors
    > list, the universities of these authors, their mails, the abstract and
    > keywords
    >
    > how I can extract this data from this paper?
    > http://dl.dropbox.com/u/6928078/CLEI_2008_002.pdf


    Mark the text, copy it.

    > It's not necessary to use this gem, but I need a string for each field
    > with this info, how can I do that?


    Open a text editor, paste it, and construct the data you need.

    Doing the research for how to do what you want, and then writing and
    debugging a script that does it, takes longer than just doing it by
    hand. ;)

    --
    Phillip Gawlowski

    Though the folk I have met,
    (Ah, how soon!) they forget
    When I've moved on to some other place,
    There may be one or two,
    When I've played and passed through,
    Who'll remember my song or my face.
     
    Phillip Gawlowski, May 17, 2011
    #2

  3. I need to do this automatically: I'll be doing it for a lot of papers
    and then loading that data into a database.

    --
    Posted via http://www.ruby-forum.com/.
     
    Felipe Espinoza, May 17, 2011
    #3
  4. On Tue, May 17, 2011 at 11:38 PM, Felipe Espinoza
    <> wrote:
    > I need to do this automatically, I'll be doing it for a lot of papers
    > and then take that data to a database


    Unless the papers are all (near) identical in layout, this will be
    difficult, since PDFs lack semantic information.

    Can you instead query a DB for the DOI of the paper (getting the DOI
    via the filename, or via the title of the paper, assuming the title is
    easy to grab), and use said DOI DB to get the information in a way
    that's much easier to process?

    --
    Phillip Gawlowski

     
    Phillip Gawlowski, May 17, 2011
    #4
  5. Inkscape has a command-line conversion option.
    I've only used it on a Linux instance.
    It converts one page at a time, though.
    From memory, it offers more than three output formats.
    It's not exactly a pure-Ruby approach, though scripting such a task is
    certainly Ruby's domain.
    Your example is still loading here,
    so this reply may be completely out of context.

    MarkT

    > I have to extract some data from a paper in a pdf format. I just need
    > some data from the page 1, like the title of the paper, the authors
    > list, the universities of these authors, their mails, the abstract and
    > keywords


    I _top_ _post_ _so_ _there_
     
    Mark T, May 18, 2011
    #5
  7. Hi,

    In <>
    "Pdf Parsing Challenge" on Wed, 18 May 2011 06:04:19 +0900,
    Felipe Espinoza <> wrote:

    > The Problem:
    >
    > I have to extract some data from a paper in a pdf format. I just need
    > some data from the page 1, like the title of the paper, the authors
    > list, the universities of these authors, their mails, the abstract and
    > keywords
    >
    > how I can extract this data from this paper?
    > http://dl.dropbox.com/u/6928078/CLEI_2008_002.pdf
    >
    > with a simple string that contains the information of a complete field
    > (keywords, abstract, etc) would help me


    % gem install poppler
    % cat extract-data-from-paper.rb
    require 'tempfile'
    require 'open-uri'
    require 'poppler'

    ARGV.each do |url|
      # Download the PDF to a temporary file so Poppler can open it by path.
      pdf = Tempfile.new(["extract-data-from-paper", ".pdf"])
      pdf.binmode
      # open-uri lets open() fetch a URL on Ruby 1.9; on Ruby 3.0 and
      # later, use URI.open(url) instead.
      open(url) do |input|
        pdf.write(input.read)
      end
      pdf.close

      document = Poppler::Document.new(pdf.path)
      title_page = document.pages.first
      text = title_page.get_text
      lines = text.lines.to_a
      # In this paper the title spans the first two lines of page 1
      # and the author list the next two.
      title = lines[0, 2].collect(&:strip).join(" ")
      puts title
      authors = lines[2, 2].collect(&:strip).join(" ")
      puts authors
      # ...
    end
    % ruby1.9 extract-data-from-paper.rb http://dl.dropbox.com/u/6928078/CLEI_2008_002.pdf
    Query Routing Process for Adapted Information Retrieval using Agents
    Angela Carrillo-Ramos2, Jérôme Gensel1, Marlène Villanova-Oliver1, Hervé Martin1, and Miguel Torres-Moreno2
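
The script above stops at the authors. For the remaining fields (abstract, keywords, e-mail addresses), one possible continuation is to search the extracted page text for its headings instead of counting lines. This is only a sketch: `extract_fields` is a hypothetical helper, the sample text is made up, and it assumes the page text has "Abstract" and "Keywords" at the start of a line, which has not been checked against the linked paper.

```ruby
# Hypothetical helper: split first-page text into labelled fields.
# Assumes "Abstract" and "Keywords" each start a line and that the
# keyword list ends at the next blank line (layout assumptions).
def extract_fields(text)
  fields = {}

  # Everything between the "Abstract" heading and the "Keywords" heading.
  if (m = text.match(/^Abstract\b[.:]?\s*(.*?)\s*(?=^Keywords\b|\z)/m))
    fields[:abstract] = m[1].gsub(/\s+/, " ").strip
  end

  # Everything after "Keywords" up to the next blank line.
  if (m = text.match(/^Keywords\b\s*[.:]?\s*(.*?)(?:\n\s*\n|\z)/m))
    fields[:keywords] = m[1].split(/[,;]/)
                            .map { |k| k.gsub(/\s+/, " ").strip.sub(/\.\z/, "") }
                            .reject(&:empty?)
  end

  # A deliberately loose e-mail pattern; good enough for a first pass.
  fields[:emails] = text.scan(/[\w.+-]+@[\w-]+(?:\.[\w-]+)+/)
  fields
end

# Made-up sample text standing in for title_page.get_text.
sample = <<~TEXT
  A Sample Paper Title
  Jane Doe1 and John Roe2
  jane@example.org, john@example.com

  Abstract. This is a short
  abstract spanning two lines.

  Keywords: parsing, PDF, metadata.

  1 Introduction
TEXT

fields = extract_fields(sample)
puts fields[:abstract]
puts fields[:keywords].inspect
puts fields[:emails].inspect
```

Matching on headings tends to survive small layout differences better than fixed line offsets, but papers that omit or rename these headings would still need their own rules.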


    Thanks,
    --
    kou
     
    Kouhei Sutou, May 18, 2011
    #7
  8. Do you need that for your own application, or do you want to build up a
    literature database of your own?
    For the latter, you could try [Mendeley][1]. It's a tool (web-based and
    desktop-based) for managing your research literature. It can parse PDFs,
    and much more.
    Once the PDFs are parsed, you can parse the generated BibTeX file …

    [1]: http://www.mendeley.com
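
If you go the BibTeX route, pulling the fields out of an exported entry could look roughly like the sketch below. `parse_bibtex_entry` is a hypothetical helper and the entry is made up. It assumes simple `name = {value}` or `name = "value"` fields with no nested braces, so a dedicated BibTeX parser would be the safer choice for real exports.

```ruby
# Hypothetical helper: read the fields of one simple BibTeX entry.
# Assumes values are wrapped in {...} or "..." with no nested braces.
def parse_bibtex_entry(entry)
  fields = {}
  entry.scan(/(\w+)\s*=\s*(?:\{([^{}]*)\}|"([^"]*)")/) do |name, braced, quoted|
    fields[name.downcase] = (braced || quoted).strip
  end
  fields
end

# Made-up entry standing in for an exported record.
entry = <<~BIB
  @article{doe2008sample,
    title = {A Sample Paper Title},
    author = {Doe, Jane and Roe, John},
    year = "2008"
  }
BIB

fields = parse_bibtex_entry(entry)
puts fields["title"]
# BibTeX separates authors with the word "and".
puts fields["author"].split(/\s+and\s+/).inspect
puts fields["year"]
```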

    --
    Gruß, Johannes
     
    Johannes Held, May 19, 2011
    #8
