Odd Regexp Issue

Discussion in 'Ruby' started by Kyle Heck, May 23, 2007.

  1. Kyle Heck

    I'm writing a web crawler, and in that crawler I want to remove all
    scripts in the pages I crawl.

    I should be able to do a simple gsub!(/<!--.*-->/,"") right? Well, I do
    that and unfortunately it doesn't remove some scripts. Take Google, for
    instance. It removes the first script, but not the second. I'm really
    confused. Google's page has two scripts, so <!-- occurs twice and so
    does -->; it's not like the full regexp should ever fail to match.

    Any insight on the issue would be GREAT?! :D

    Thanks,
    Kyle Heck
     
    Kyle Heck, May 23, 2007
    #1

  2. gsub(/<!--.*?-->/m,"")

    If there are newlines inside the string, you need the m modifier so
    that the dot (.) matches newlines as well. The ? makes the match
    non-greedy. Without it, a single match would run from the start of the
    first script to the end of the last one, taking everything in between
    with it.
     
    Luis Parravicini, May 23, 2007
    #2
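
The difference between the three variants discussed in this thread can be seen in a small sketch. The `html` string below is a made-up fragment, not Google's actual markup, and the variable names are only illustrative.

```ruby
# Hypothetical page fragment with two comments; the second one spans lines.
html = "<p>a</p><!-- one --><p>b</p><!--\ntwo\n--><p>c</p>"

# No /m: '.' stops at a newline, so the multi-line comment survives.
greedy_single = html.gsub(/<!--.*-->/, "")
# => "<p>a</p><p>b</p><!--\ntwo\n--><p>c</p>"

# /m but greedy: one match from the first <!-- to the last -->,
# which also deletes the <p>b</p> in between.
greedy_multi = html.gsub(/<!--.*-->/m, "")
# => "<p>a</p><p>c</p>"

# /m and non-greedy: each comment is matched and removed separately.
lazy_multi = html.gsub(/<!--.*?-->/m, "")
# => "<p>a</p><p>b</p><p>c</p>"
```

The first result reproduces the symptom from the original question: the one-line comment goes away and the multi-line one stays.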

  3. Try multiline mode: gsub!(/<!--.*-->/m,"")
     
    Joel VanderWerf, May 23, 2007
    #3
  4. Joel VanderWerf wrote:
    ...
    Luis is right, it needs to be non-greedy, as well:

    gsub!(/<!--.*?-->/m,"")
     
    Joel VanderWerf, May 23, 2007
    #4
  5. Kyle Heck

    Well, that seemed to do the trick :D

    Thanks a lot, I didn't know that regexp only applied to one line by
    default, HRM!

    Thanks,
    Kyle Heck
     
    Kyle Heck, May 23, 2007
    #5
  6. Of course, you do realize that you're saying "scripts" but you're
    removing "comments" with this regexp.

    I can have:
    <script type="text/javascript">
    //<![CDATA[
    <%= yield :page_scripts %>
    //]]>
    </script>

    in a page with not a <!-- or --> in sight!

    -Rob

    Rob Biedenharn http://agileconsultingllc.com
     
    Rob Biedenharn, May 23, 2007
    #6
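
To illustrate Rob's point, here is a rough sketch in which the comment regexp finds nothing to remove, while a pattern aimed at `<script>` elements does. The fragment and the `<script>` pattern are assumptions made for illustration; the pattern is a naive one and will not cope with every page, which is why a real parser is suggested later in the thread.

```ruby
# Hypothetical page fragment: one script block, no HTML comments at all.
html = <<~HTML
  <p>before</p>
  <script type="text/javascript">
  //<![CDATA[
  alert("hi");
  //]]>
  </script>
  <p>after</p>
HTML

# The comment regexp from earlier in the thread leaves this page untouched.
comment_stripped = html.gsub(/<!--.*?-->/m, "")

# A naive script-element pattern: /m lets '.' cross newlines, /i ignores
# tag case, and .*? stops at the first closing tag.
script_stripped = html.gsub(/<script\b.*?<\/script\s*>/mi, "")
# => "<p>before</p>\n\n<p>after</p>\n"
```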
  7. Actually, it is the . expression that doesn't match a newline without
    the 'm' option. That option just changes '.' from matching "any
    character except a newline" to matching "any character". The Regular
    Expression section of chapter 22 in the Pickaxe covers all this (pp.
    324-328).

    -Rob

    Rob Biedenharn http://agileconsultingllc.com
     
    Rob Biedenharn, May 23, 2007
    #7
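
A short sketch of that distinction: a Ruby regexp is not limited to one line, it is only the dot that refuses newlines unless the m option is given. The sample string is made up.

```ruby
text = "a\nb"

# Without /m the dot will not match the newline between a and b.
no_m = text =~ /a.b/            # => nil

# With /m the dot matches any character, newline included.
with_m = text =~ /a.b/m         # => 0

# A pattern can still span lines without /m if it does not rely on the dot.
literal_newline = text =~ /a\nb/       # => 0
char_class      = text =~ /a[\s\S]b/   # => 0

# The same option is available as a constant when building a Regexp by hand.
built = Regexp.new("a.b", Regexp::MULTILINE)
built_match = text =~ built     # => 0
```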
  8. I'm not sure what you are after actually, but apart from the <script>
    tags Rob mentioned, you might need to remove the onClick, onMouseOver
    and other handlers. And since the handlers can be within almost any tag
    it would be very hard to find and remove them correctly with just a few
    regexps. You should use a real HTML parser (the preferred Ruby one seems
    to be called Hpricot ... I guess the author wanted to be funny). If this
    is meant to make the display of the pages secure you should also rather
    "keep only the tags and attributes that are safe" than "remove stuff
    that's not safe". You might easily overlook something.

    If you happened to use the-language-that-mustn't-be-named, you'd just use
    HTML::TagFilter
    (http://search.cpan.org/~wross/HTML-TagFilter-1.03/TagFilter.pm). Good
    luck.

    Jenda
     
    Jenda Krynicky, May 25, 2007
    #8
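
The "keep only what is safe" advice can be sketched without committing to any particular parser. The example below assumes the attributes of a single tag have already been parsed into a Hash with lowercase names; the attribute values and both lists are hypothetical.

```ruby
# Hypothetical attributes of one <a> tag, as a parser might hand them over.
attrs = {
  "href"        => "/home",
  "title"       => "Home",
  "onclick"     => "track()",
  "onmouseover" => "hint()",
  "onfocus"     => "steal()"   # a handler the denylist below forgot
}

denied  = %w[onclick onmouseover]   # "remove stuff that's not safe"
allowed = %w[href title]            # "keep only what is safe"

# Denylist: anything not explicitly listed slips through.
denylisted = attrs.reject { |name, _value| denied.include?(name) }
# => keys: ["href", "title", "onfocus"]

# Allowlist: anything not explicitly listed is dropped.
allowlisted = attrs.select { |name, _value| allowed.include?(name) }
# => keys: ["href", "title"]
```

The overlooked `onfocus` handler survives the denylist but not the allowlist, which is the failure mode the post warns about.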
