Parse Word/HTML Docs for database inserts

Margaret Smith · Jul 16, 2009

I am new to Ruby and have perused the forum but I will ask this question
as I couldn't seem to answer my questions with other posts.

The documents have no structure except for a unique number that appears
first in the document. The rest of the data I am looking for is
preceded by keywords that can help me identify a country code, the
hour something was started or finished, and maybe a subject here and
there. The HTML docs are just snippets from Internet news pages,
pictures and all; from those I need the title and dates extracted.

What I need to do is also extract the mimetype, file name and
last_update_date of the document. Can I do this with Ruby? I know Ruby
has several gems that can help but which one would be the best for
something like this?

Most of the postings I have read deal with semi-structured data: data
that is preceded by a column name, perhaps. These files, however, are
completely unstructured.

Also I don't want to be entering filenames one by one. I have about 6000
documents to parse. Is there a way to handle something like that with a
script?

Any direction would be greatly appreciated. I have never written Ruby
code, so I am looking for a good tutorial on parsing or an example app
that handles something like this.

Dylan · Jul 16, 2009

I'm not able to help with the parsing, but if you want to check all
files in a folder you can use this:

$all_files = []
Dir.chdir dir do
  # Dir["*"] lists every entry in the current directory
  $all_files += Dir["*"]
end

where dir is the directory the files are in. That will get you an
array with all the filenames, which you can then iterate through.
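
A rough sketch of that iteration, assuming all you want per file is the name, a guessed mimetype, and the last modification time. The helper name file_summaries and the MIME_BY_EXTENSION table are made up for illustration, and the mimetype is guessed from the file extension only, not from the file contents:

```ruby
# Hypothetical extension-to-MIME table; extend it for your own file types.
MIME_BY_EXTENSION = {
  ".doc"  => "application/msword",
  ".html" => "text/html",
  ".htm"  => "text/html"
}.freeze

# Hypothetical helper: returns one hash of metadata per regular file in dir.
def file_summaries(dir)
  # Keep regular files only; subdirectories are skipped.
  paths = Dir.glob(File.join(dir, "*")).select { |path| File.file?(path) }
  paths.sort.map do |path|
    {
      :file_name        => File.basename(path),
      # Guessed from the extension only; unknown extensions get a generic type.
      :mimetype         => MIME_BY_EXTENSION.fetch(File.extname(path).downcase,
                                                   "application/octet-stream"),
      :last_update_date => File.mtime(path)
    }
  end
end

# Example use (the directory name is made up):
# file_summaries("docs").each { |info| puts info.inspect }
```

Each hash could then be turned into one database row, so nobody has to type 6000 filenames by hand.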

James Britt · Jul 16, 2009

Dylan said:
I'm not able to help with the parsing, but if you want to check all
files in a folder you can use this:

$all_files = []
Dir.chdir dir do
  $all_files += Dir["*"]
end

Might not Find be more useful overall?

http://www.ruby-doc.org/stdlib/libdoc/find/rdoc/classes/Find.html
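
Something along these lines, as a sketch only: it assumes the documents may sit in nested subfolders and can be recognised by extension. The method name find_documents and the default extension list are invented for the example:

```ruby
require "find"

# Hypothetical helper: walk root recursively and return the paths whose
# extension is in the given list (compared case-insensitively).
def find_documents(root, extensions = [".doc", ".html", ".htm"])
  matches = []
  Find.find(root) do |path|
    if File.directory?(path)
      # Skip hidden directories such as .svn, but never the root itself.
      Find.prune if path != root && File.basename(path).start_with?(".")
    elsif extensions.include?(File.extname(path).downcase)
      matches << path
    end
  end
  matches.sort
end

# Example use (the directory name is made up):
# find_documents("docs").each { |path| puts path }
```

Unlike Dir["*"], Find descends into subdirectories, and Find.prune lets you skip whole branches you don't care about.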

--
James Britt


Raveendran Perumalsamy · Jul 16, 2009

Margaret said:
I am new to Ruby and have perused the forum but I will ask this question
as I couldn't seem to answer my questions with other posts.

Hi Smith,

It's very tough to answer your question. I like the Hpricot gem very
much, but I am not saying it is the best; it depends on what works for
you. Also, please try:

http://rfeedparser.rubyforge.org/

Thanks,
P.Raveendran
http://raveendran.wordpress.com
