Find/Replace In Files Using Lookup Table

Andrew Porter

I have a directory full of HTML files. Some have anchor tags (<a
href="directory/filename.html">), some do not. I also have a tab-delimited
text file with, among other things, an ID, title, and filename.

What I need to do is create a script that will:

1. Search all of the HTML files in a directory for anchor tags
2. Strip out the file name from the href attribute
3. Use the file name to look up the correlating ID in the lookup file
4. Replace the contents of the href attribute with the ID

Being new to Ruby and command-line scripting, I'm not sure where to
begin looking for examples of how to do this. Any help is appreciated.
 
Eric I.

Obviously your goal is to do this processing. But are you hoping to use
this to learn Ruby? If so, this is a nice-sized project that will
help you to learn the language. Here are some pointers to help you
figure out where to look or start with certain aspects of the project
(the numbers match up with your numbers above):

1. To get a list of all of the HTML files in a given directory, you
can use Dir.glob.

2. To parse an HTML file you can use the hpricot gem. Alternatively,
you could open the file and use regular expressions.

3. To read your tab-delimited file at the start of the program,
you can use the CSV class in the standard library or the fastercsv
gem. You can put the data into a hash where the file name is the key
and the ID is the value. Lookup becomes trivial then.

4. How you do this depends on whether you're using hpricot or regular
expressions. If you're using regular expressions,
you might want to do a gsub! call with a block that would allow you to
do your lookup and replacement.
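
As a rough illustration of how the four pointers above could fit together, here is a minimal sketch that takes the regular-expression route rather than hpricot. It rests on several assumptions that are not stated in the thread: the lookup file's first three columns are ID, title, and filename, in that order; href values are double-quoted; and the method names (load_lookup, rewrite_hrefs, process_directory) are made up for this example. It splits each line on tabs instead of using the CSV class, which sidesteps stray quote characters in titles; CSV.foreach with col_sep: "\t" would be the library alternative.

```ruby
# Build a filename => ID hash from the tab-delimited lookup file.
# ASSUMPTION: the first three columns are ID, title, filename.
def load_lookup(path)
  lookup = {}
  File.foreach(path) do |line|
    id, _title, filename = line.chomp.split("\t")
    lookup[filename] = id if id && filename
  end
  lookup
end

# Matches an <a ...> tag up to and including a double-quoted href value.
# Group 1 is everything before the value, group 2 is the value itself.
# ASSUMPTION: hrefs are double-quoted; single-quoted ones are not matched.
ANCHOR_HREF = /(<a\b[^>]*?\shref=")([^"]*)"/i

# Replace each anchor's href with the ID found for its file name.
# Links whose file name is not in the lookup are left unchanged.
def rewrite_hrefs(html, lookup)
  html.gsub(ANCHOR_HREF) do |match|
    prefix = $1
    name = File.basename($2) # "directory/filename.html" -> "filename.html"
    lookup.key?(name) ? %(#{prefix}#{lookup[name]}") : match
  end
end

# Rewrite every .html file directly inside dir, in place.
def process_directory(dir, lookup)
  Dir.glob(File.join(dir, "*.html")).each do |path|
    original = File.read(path)
    updated = rewrite_hrefs(original, lookup)
    File.write(path, updated) unless updated == original
  end
end

# Example call (both paths are placeholders):
#   process_directory("html", load_lookup("lookup.txt"))
```

Because the files are overwritten in place, it would be sensible to try this on a copy of the directory first. A regular expression will also miss unusual markup, which is where an HTML parser such as hpricot would be the sturdier choice.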

Some relevant information sources:

You should have one of the Ruby books to help you with basic syntax
and all that. They will also help you with regular expressions,
hashes, and file I/O.

For documentation on File (and IO), Dir, CSV, Regexp, and Hash, you can
use:

http://ruby-doc.org/core/

For hpricot:

http://code.whytheluckystiff.net/hpricot/

For fastercsv:

http://fastercsv.rubyforge.org/

I hope that's helpful,

Eric

====

LearnRuby.com offers Rails & Ruby HANDS-ON public & ON-SITE
workshops.
Ruby Fundamentals Wkshp June 16-18 Ann Arbor, Mich.
Ready for Rails Ruby Wkshp June 23-24 Ann Arbor, Mich.
Ruby on Rails Wkshp June 25-27 Ann Arbor, Mich.
Ruby Plus Rails Combo Wkshp June 23-27 Ann Arbor, Mich
Please visit http://LearnRuby.com for all the details.
 
Andrew Porter

Thanks, Eric. These are excellent tips.


 
