regexp and stack overflow

Discussion in 'Ruby' started by Une bévue, Mar 27, 2006.

  1. Une bévue

    Une bévue Guest

    I have a regexp:
    SCRIPT_RE=Regexp.new('<script[^>]*>((.|\n)(?!/script))*</script>',
    Regexp::EXTENDED, 'N')

    which is supposed to strip out everything inside:
    <script ...>(part suppressed...)</script>

    It works well for some HTML files but crashes on others with the following
    error message:
    RegexpError: Stack overflow in regexp matcher:
    /<script[^>]*>((.|\n)(?!\/script))*<\/script>/xn
    method gsub
    in check_files.rb at line 38
    method stripHTML
    in check_files.rb at line 38
    [...]

    Line 38 is:

    self.gsub(SCRIPT_RE, '').gsub(TAGS_RE, '').gsub(/\s+/, ' ').gsub(NBSP_RE, '')

    with:
    SCRIPT_RE=Regexp.new('<script[^>]*>((.|\n)(?!/script))*</script>',
    Regexp::EXTENDED, 'N')
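
One possible way around the overflow, as a minimal sketch only: the pattern above repeats a `(.|\n)` group once per character, and a common assumption is that this per-character group repetition is what exhausts the matcher's stack on long inputs. A lazy `.*?` with the `m` flag (so that `.` also matches newlines) expresses the same intent without the group. The constant name `SCRIPT_BLOCK_RE` and the sample HTML are hypothetical, and the heredoc syntax assumes a recent Ruby.

```ruby
# Hypothetical replacement for SCRIPT_RE (not the original constant):
# a lazy ".*?" plus the /m flag, so "." also matches newlines,
# instead of repeating a (.|\n) group for every character.
SCRIPT_BLOCK_RE = %r{<script\b[^>]*>.*?</script>}mi

# Illustrative input modelled on the snippets quoted in this post.
html = <<~HTML
  <p>before</p>
  <script type="text/javascript"> if (window.isMSIE55) fixalpha();
  </script>
  <p>after</p>
HTML

# Remove every <script ...>...</script> block, including multi-line ones.
stripped = html.gsub(SCRIPT_BLOCK_RE, '')
puts stripped
```

The lazy quantifier stops at the first `</script>`, so two separate script blocks are removed individually rather than as one span.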


    What I want to do:

    strip out all the contents of scripts and all the HTML tags with their
    attributes; I also have to add stripping out any CSS declarations (not
    done yet).
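
The goal described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `SCRIPT_OR_STYLE_RE`, `ANY_TAG_RE`, `NBSP_ENTITY_RE` and `strip_html` are hypothetical names, not the `TAGS_RE` / `NBSP_RE` / `stripHTML` used in check_files.rb (whose definitions are not shown here), and the sample string is made up from fragments quoted in this post.

```ruby
# All patterns below are illustrative assumptions, not the original constants.
# Matches a whole <script>...</script> or <style>...</style> block;
# \1 forces the closing tag to match the opening one.
SCRIPT_OR_STYLE_RE = %r{<(script|style)\b[^>]*>.*?</\1>}mi
# Matches any remaining tag together with its attributes.
ANY_TAG_RE         = /<[^>]+>/
# Matches the non-breaking-space entity.
NBSP_ENTITY_RE     = /&nbsp;/i

# Strip script/style blocks first, then leftover tags, then tidy whitespace.
def strip_html(text)
  text.gsub(SCRIPT_OR_STYLE_RE, ' ')
      .gsub(ANY_TAG_RE, ' ')
      .gsub(NBSP_ENTITY_RE, ' ')
      .gsub(/\s+/, ' ')
      .strip
end

sample = '<style type="text/css">/*<![CDATA[*/ @import "x.css"; /*]]>*/</style>' \
         '<body class="ns-0 ltr"><script type="text/javascript"> if (window.runOnloadHook) ' \
         'runOnloadHook();</script><p>Hello&nbsp;world</p></body>'

puts strip_html(sample)
```

Removing the script and style blocks before the generic tag pass matters: otherwise only the tags themselves would be stripped and the JavaScript or CSS between them would remain in the text.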

    The program fails for a file with the following script parts:
    <script type="text/javascript"
    src="Mac-roman-utf-8_fichiers/wikibits.js"><!-- wikibits js --></script>
    <script type="text/javascript"
    src="Mac-roman-utf-8_fichiers/index.php"><!-- site js --></script>
    <style type="text/css">/*<![CDATA[*/
    @import "/w/index.php?title=MediaWiki:Common.css&action=raw&ctype=text/css&smaxage=2678400";
    @import "/w/index.php?title=MediaWiki:Monobook.css&action=raw&ctype=text/css&smaxage=2678400";
    @import "/w/index.php?title=-&action=raw&gen=css&maxage=2678400";
    /*]]>*/</style></head><body class="ns-0 ltr">


    and it also has some scripts inside divs of the body:
    <script type="text/javascript"> if (window.isMSIE55) fixalpha();
    </script>
    [...]
    <script type="text/javascript"> if (window.runOnloadHook)
    runOnloadHook();</script>


    I can't use tidy for this, because the reason for stripping
    out all the HTML, keeping only the text, is to help another program
    detect the encoding of the file.
    --
    une bévue
     
    Une bévue, Mar 27, 2006
