Find/Replace In Files Using Lookup Table

Discussion in 'Ruby' started by Andrew Porter, May 28, 2008.

  1. I have a directory full of HTML files. Some have anchor tags (<a
    href="directory/filename.html">), some do not. I also have a
    tab-delimited text file with, among other things, an ID, title, and
    filename.

    What I need to do is create a script that will:

    1. Search all of the HTML files in a directory for anchor tags
    2. Strip out the file name from the href attribute
    3. Use the file name to look up the correlating ID in the lookup file
    4. Replace the contents of the href attribute with the ID

    Being new to Ruby and command-line scripting, I'm not sure where to
    begin looking for examples of how to do this. Any help is appreciated.
     
    Andrew Porter, May 28, 2008
    #1

  2. Andrew Porter

    Eric I. Guest

    On May 28, 6:18 pm, Andrew Porter <> wrote:
    > I have a directory full of HTML files. Some have anchor tags (<a  
    > href="directory/filename.html">), some do not. I also have a
    > tab-delimited text file with, among other things, an ID, title, and filename.
    >
    > What I need to do is create a script that will:
    >
    > 1. Search all of the HTML files in a directory for anchor tags
    > 2. Strip out the file name from the href attribute
    > 3. Use the file name to look up the correlating ID in the lookup file
    > 4. Replace the contents of the href attribute with the ID
    >
    > Being new to Ruby and command-line scripting, I'm not sure where to  
    > begin looking for examples of how to do this. Any help is appreciated.


    Obviously your goal is to get this processing done. But are you hoping to use
    this to learn Ruby? If so, this is a nice-sized project that will
    help you to learn the language. Here are some pointers to help you
    figure out where to look or start with certain aspects of the project
    (the numbers match up with your numbers above):

    1. To get a list of all of the HTML files in a given directory, you
    can use Dir.glob.

    2. To parse an HTML file you can use the hpricot gem. Alternatively,
    you could open the file and use regular expressions.

    3. To read your tab-delimited file at the start of the program,
    you can use the CSV class in the standard library or the fastercsv
    gem. You can put the data into a hash where the file name is the key
    and the ID is the value. Lookup becomes trivial then.

    4. Whether you're using hpricot or regular expressions
    will determine how you do this. If you're using regular expressions,
    you might want to do a gsub! call with a block that would allow you to
    do your lookup and replacement.
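
    As a rough illustration of how the four steps above might fit
    together with regular expressions, here is a minimal sketch, not a
    definitive implementation. It assumes the lookup file has no header
    row and that its columns are ID, title, filename, in that order; the
    method names (build_lookup, replace_hrefs, process_directory) and the
    ANCHOR_HREF pattern are made up for this example. The pattern only
    handles quoted href values inside <a> tags.

```ruby
require "csv"

# Build a { "filename.html" => "ID" } hash from tab-delimited text.
# Assumption: columns are ID, title, filename, with no header row.
def build_lookup(tsv_text)
  CSV.parse(tsv_text, col_sep: "\t").each_with_object({}) do |row, lookup|
    id, _title, filename = row
    next if filename.nil? # skip blank or short rows
    lookup[filename] = id
  end
end

# Captures: (1) everything from "<a" up to "href=", (2) the quote
# character, (3) the href value. Only quoted values are handled.
ANCHOR_HREF = /(<a\b[^>]*?\bhref\s*=\s*)(["'])(.*?)\2/i

# Return a copy of html in which each href whose file name is in the
# lookup table is replaced by the matching ID. Other links are kept.
def replace_hrefs(html, lookup)
  html.gsub(ANCHOR_HREF) do |match|
    prefix, quote, target = $1, $2, $3
    id = lookup[File.basename(target)]
    id ? "#{prefix}#{quote}#{id}#{quote}" : match
  end
end

# Rewrite every .html file directly inside dir, in place.
def process_directory(dir, lookup_path)
  lookup = build_lookup(File.read(lookup_path))
  Dir.glob(File.join(dir, "*.html")).each do |path|
    original = File.read(path)
    updated = replace_hrefs(original, lookup)
    File.write(path, updated) unless updated == original
  end
end
```

    Since process_directory overwrites files, it is probably worth
    trying replace_hrefs on a copy of the directory first. A title that
    contains double-quote characters may also need extra CSV options.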

    Some relevant information sources:

    You should have one of the Ruby books to help you with basic syntax
    and all that. They will also help you with regular expressions,
    hashes, and file I/O.

    For documentation on File (and IO), Dir, CSV, Regexp, and Hash, you can
    use:

    http://ruby-doc.org/core/

    For hpricot:

    http://code.whytheluckystiff.net/hpricot/

    For fastercsv:

    http://fastercsv.rubyforge.org/

    I hope that's helpful,

    Eric

     
    Eric I., May 29, 2008
    #2

  3. On Wednesday 28 May 2008 18:05:15 Eric I. wrote:
    > On May 28, 6:18 pm, Andrew Porter <> wrote:


    > 2. To parse an HTML file you can use the hpricot gem. Alternatively,
    > you could open the file and use regular expressions.


    I'd suggest hpricot or REXML if the files are reasonably well-formed and/or
    XML-ish, and regex if they're not.
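
    For the well-formed case, a sketch along these lines might work with
    REXML. It assumes each file parses as XML (REXML raises a parse
    error otherwise) and that a hash mapping file names to IDs has
    already been built; replace_hrefs_xml is a name invented for this
    example.

```ruby
require "rexml/document"

# Return the document text with each <a href="..."> whose file name
# appears in lookup rewritten to the matching ID.
# Assumption: xml_text is well-formed XML.
def replace_hrefs_xml(xml_text, lookup)
  doc = REXML::Document.new(xml_text)
  REXML::XPath.each(doc, "//a[@href]") do |anchor|
    id = lookup[File.basename(anchor.attributes["href"])]
    anchor.attributes["href"] = id if id
  end
  doc.to_s
end
```

    Note that REXML re-serializes the whole document, so quoting and
    whitespace in the output may differ from the input even where no
    link was changed.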
     
    David Masover, May 29, 2008
    #3
  4. Thanks, Eric. These are excellent tips.


    Andrew Porter, May 29, 2008
    #4
