Parse Word/HTML Docs for database inserts

Margaret Smith

I am new to Ruby and have perused the forum, but I couldn't find an
answer to my question in other posts, so I will ask it here.

The documents have no structure except for a unique number that appears
first in each document. The rest of the data I am looking for is
preceded by keywords that can help me identify a country code, the
hour something was started or finished, and maybe a subject here and
there. The HTML docs are just snippets from news pages on the Internet,
pictures and all; from those I need the title and dates extracted.

I also need to extract the mimetype, file name and last_update_date of
each document. Can I do this with Ruby? I know Ruby has several gems
that can help, but which one would be best for something like this?

Most of the postings I have read deal with semi-structured data: data
that is preceded by a column name, perhaps. These files are completely
unstructured.

Also, I don't want to enter filenames one by one. I have about 6000
documents to parse. Is there a way to handle something like that with a
script?

Any direction would be greatly appreciated. I have never written Ruby
code, so I am looking for a good tutorial on parsing or an example app
that handles something like this.
 
Dylan

I'm not able to help with the parsing, but if you want to check all
files in a folder you can use this:

# dir is the path of the folder holding the documents
all_files = Dir.chdir(dir) { Dir["*"] }

where dir is the directory the files are in. That will get you an
array with all the filenames. Then you can just iterate through them
with each.
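
Something along these lines might work for the iteration, and it also picks up the file name and last update date asked about above. This is only a rough sketch: file_info, all_files and the MIME_BY_EXT table are names made up for illustration, and guessing the mimetype from the file extension is an assumption, not real content detection.

```ruby
# Hypothetical lookup table: guesses a MIME type from the extension only.
MIME_BY_EXT = {
  ".doc"  => "application/msword",
  ".html" => "text/html",
  ".htm"  => "text/html"
}.freeze

# Returns the file name, guessed MIME type and last-modified time of one file.
def file_info(path)
  {
    file_name: File.basename(path),
    mimetype: MIME_BY_EXT.fetch(File.extname(path).downcase, "application/octet-stream"),
    last_update_date: File.mtime(path)
  }
end

# Lists the regular files directly inside dir (subdirectories are skipped).
def all_files(dir)
  Dir.glob(File.join(dir, "*")).select { |path| File.file?(path) }.sort
end

# Example run against the current directory.
all_files(".").each do |path|
  info = file_info(path)
  puts "#{info[:file_name]}  #{info[:mimetype]}  #{info[:last_update_date]}"
end
```

Each hash could then be handed to whatever does the database insert, so none of the 6000 filenames has to be typed by hand.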
 
James Britt

Dylan said:
I'm not able to help with the parsing, but if you want to check all
files in a folder you can use this:

$all_files = []
Dir.chdir dir do
$all_files += Dir["*"]
end

Might not Find be more useful overall?


http://www.ruby-doc.org/stdlib/libdoc/find/rdoc/classes/Find.html
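
For what it's worth, a small sketch of the Find approach, which walks subdirectories as well. The find_documents name and the extension list are assumptions made for the example; adjust them to whatever the 6000 files actually look like.

```ruby
require "find"

# Walks dir recursively and returns the paths of regular files whose
# extension is in exts (hypothetical defaults for Word/HTML documents).
def find_documents(dir, exts = [".doc", ".html", ".htm"])
  found = []
  Find.find(dir) do |path|
    if File.directory?(path)
      # Do not descend into hidden directories below the starting point.
      Find.prune if path != dir && File.basename(path).start_with?(".")
      next
    end
    found << path if exts.include?(File.extname(path).downcase)
  end
  found.sort
end

# Example run against the current directory.
puts find_documents(".")
```

Unlike Dir["*"], this reaches files in nested folders, and Find.prune lets you skip whole directories you don't care about.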

--
James Britt

www.jamesbritt.com - Playing with Better Toys
www.ruby-doc.org - Ruby Help & Documentation
www.rubystuff.com - The Ruby Store for Ruby Stuff
www.neurogami.com - Smart application development
 
