Parse Word/HTML Docs for database inserts

Discussion in 'Ruby' started by Margaret Smith, Jul 16, 2009.

  1. I am new to Ruby and have searched the forum, but I am asking this
    question because other posts didn't answer it.

    The documents have no structure except for a unique number that appears
    first in the document. The rest of the data I am looking for is
    preceded by keywords that can help me identify a country code, the
    hour something was started or finished, and maybe a subject here and
    there. The HTML docs are just snippets from news pages on the
    Internet, pictures and all; from those I need the title and dates
    extracted.

    I also need to extract the mimetype, file name and
    last_update_date of each document. Can I do this with Ruby? I know Ruby
    has several gems that can help, but which one would be best for
    something like this?

    Most of the postings I have read deal with semi-structured data, such
    as data preceded by a column name, but these files are
    completely unstructured.

    Also, I don't want to enter filenames one by one. I have about 6000
    documents to parse. Is there a way to handle something like that with a
    script?

    Any direction would be greatly appreciated. I have never written Ruby
    code, so I am looking for a good tutorial on parsing or an example app
    that handles something like this.
    --
    Posted via http://www.ruby-forum.com/.
    Margaret Smith, Jul 16, 2009
    #1

  2. Margaret Smith

    Dylan Guest

    I'm not able to help with the parsing, but if you want to check all
    files in a folder you can use this:

    all_files = []
    Dir.chdir(dir) do
      # Inside the block the working directory is dir, so "*" matches its entries
      all_files += Dir["*"]
    end

    where dir is the directory the files are in. That gives you an
    array of all the filenames, which you can then iterate through.
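
    The iteration step could be sketched roughly as follows. This is only
    an illustration, not a tested solution for real Word files: the helper
    names (list_files, first_number, scan_directory) are made up here, and
    it assumes the unique number shows up as the first run of digits in
    the raw file contents.

```ruby
# Return the names of the regular files directly inside +dir+, sorted.
# Subdirectories are skipped.
def list_files(dir)
  Dir.children(dir).select { |name| File.file?(File.join(dir, name)) }.sort
end

# Hypothetical helper: return the first run of digits in +text+, or nil
# when there is none. Assumes the unique number is plain ASCII digits.
def first_number(text)
  text[/\d+/]
end

# Visit every file in +dir+ and pair its name with the number found in it.
def scan_directory(dir)
  list_files(dir).map do |name|
    # Read in binary mode so non-text files do not raise encoding errors.
    text = File.read(File.join(dir, name), mode: "rb")
    [name, first_number(text)]
  end
end
```

    Calling scan_directory(dir) returns an array of [filename, number]
    pairs, so no filename has to be typed by hand.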

    Dylan, Jul 16, 2009
    #2

  3. Margaret Smith

    James Britt Guest

    Dylan wrote:
    > I'm not able to help with the parsing, but if you want to check all
    > files in a folder you can use this:
    >
    > $all_files = []
    > Dir.chdir dir do
    > $all_files += Dir["*"]
    > end


    Might not Find be more useful overall?


    http://www.ruby-doc.org/stdlib/libdoc/find/rdoc/classes/Find.html
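
    As a rough sketch of what that might look like, the code below uses
    Find.find from the standard library to walk a directory tree and
    record the file name, a guessed mimetype and the last modification
    time of each file. The file_metadata method and the MIME_BY_EXT table
    are assumptions made for this example; the mimetype is guessed from
    the file extension only, not from the file contents.

```ruby
require "find"

# Assumed extension-to-mimetype table, for illustration only.
MIME_BY_EXT = {
  ".doc"  => "application/msword",
  ".html" => "text/html",
  ".htm"  => "text/html"
}.freeze

# Walk +root+ recursively and return one hash per regular file.
def file_metadata(root)
  records = []
  Find.find(root) do |path|
    next unless File.file?(path) # skip directories and other non-files
    ext = File.extname(path).downcase
    records << {
      file_name: File.basename(path),
      # Fall back to a generic type when the extension is not in the table.
      mimetype: MIME_BY_EXT.fetch(ext, "application/octet-stream"),
      last_update_date: File.mtime(path)
    }
  end
  records
end
```

    Unlike Dir["*"], Find.find also descends into subdirectories, which
    may matter if the 6000 documents are spread over several folders.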

    --
    James Britt

    www.jamesbritt.com - Playing with Better Toys
    www.ruby-doc.org - Ruby Help & Documentation
    www.rubystuff.com - The Ruby Store for Ruby Stuff
    www.neurogami.com - Smart application development
    James Britt, Jul 16, 2009
    #3
  4. Margaret Smith wrote:
    > I am new to Ruby and have perused the forum but I will ask this question
    > as I couldn't seem to answer my questions with other posts.
    >



    Hi Smith,

    It's very tough to answer your question. I like the Hpricot gem very
    much, but I am not saying it is the best; it depends on what works
    for you. Please also try:

    http://rfeedparser.rubyforge.org/


    Thanks,
    P.Raveendran
    http://raveendran.wordpress.com




    Raveendran Perumalsamy, Jul 16, 2009
    #4
