re pattern for matching JS/CSS

Discussion in 'Python' started by i80and, Dec 15, 2006.

  1. i80and

    i80and Guest

    I'm working on a program to remove tags from an HTML document, leaving
    just the content, and I want to keep it simple. I've finished a system
    to remove simple tags, but I also want all CSS and JS to be removed.
    What re pattern could I use to do that?

    I've tried
    '<script[\S\s]*/script>'
    but that didn't work properly. My knowledge of Python is fairly basic,
    and I'm still learning re.
    What pattern would work?
     
    i80and, Dec 15, 2006
    #1

  2. ina

    ina Guest

    i80and wrote:
    > I'm working on a program to remove tags from a HTML document, leaving
    > just the content, but I want to do it simply. I've finished a system
    > to remove simple tags, but I want all CSS and JS to be removed. What
    > re pattern could I use to do that?
    >
    > I've tried
    > '<script[\S\s]*/script>'
    > but that didn't work properly. I'm fairly basic in my knowledge of
    > Python, so I'm still trying to learn re.
    > What pattern would work?


    I use re.compile("<script.*?</script>", re.DOTALL)
    for scripts. I strip these out first, since my tag-stripping re would
    otherwise strip the script tags as well (and leave the script body
    behind). Hope this helps.
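
    A minimal sketch of why the original pattern misbehaves and the
    non-greedy one does not. The sample markup, the variable names, and
    the <style> pattern for CSS are assumptions added for illustration;
    they do not come from the posts above.

```python
import re

# Sample markup (made up for this sketch): two script blocks with content between.
html = (
    "<p>one</p><script>a()</script>"
    "<p>two</p><script>\nb()\n</script><p>three</p>"
)

# Greedy: [\S\s]* runs to the LAST "/script>", so "<p>two</p>" is swallowed too.
greedy = re.compile(r"<script[\S\s]*/script>")

# Non-greedy: .*? stops at the FIRST "</script>"; DOTALL lets "." match newlines.
lazy = re.compile(r"<script.*?</script>", re.DOTALL)

# Same idea applied to CSS blocks (an assumption, not given in the thread).
style = re.compile(r"<style.*?</style>", re.DOTALL)


def strip_js_css(text):
    """Remove script blocks first, then style blocks."""
    return style.sub("", lazy.sub("", text))


greedy_result = greedy.sub("", html)
lazy_result = lazy.sub("", html)
```

    Here greedy_result loses the middle paragraph, while lazy_result keeps
    all three paragraphs and drops only the two script blocks.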
     
    ina, Dec 15, 2006
    #2

  3. Tim Chase

    Tim Chase Guest

    >> I've tried
    >> '<script[\S\s]*/script>'
    >> but that didn't work properly. I'm fairly basic in my knowledge of
    >> Python, so I'm still trying to learn re.
    >> What pattern would work?

    >
    > I use re.compile("<script.*?</script>",re.DOTALL)
    > for scripts. I strip this out first since my tag stripping re will
    > strip out script tags as well hope this was of help.


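    This won't catch variations such as

    <script
    >
    doEvil()
    </script
    >

    which is valid html/xhtml: whitespace is allowed between the tag
    name and the closing ">" (though not directly after the "<"), so
    the literal "</script>" in the pattern never matches. In html,
    tag names may also be upper or mixed case.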
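
    One variation that HTML parsers do accept is whitespace before the
    closing ">" of a tag, along with mixed-case tag names; the pattern
    quoted above misses both. A minimal sketch of a somewhat more
    tolerant pattern follows. The names strict and tolerant are
    illustrative, and this is still not a substitute for a real parser.

```python
import re

# The pattern quoted above: case-sensitive, requires a literal "</script>".
strict = re.compile(r"<script.*?</script>", re.DOTALL)

# More tolerant sketch: allows attributes and whitespace before each ">",
# and ignores case. It still breaks if an attribute value contains ">".
tolerant = re.compile(
    r"<script\b[^>]*>.*?</script\s*>",
    re.DOTALL | re.IGNORECASE,
)

spaced = "<script\n>\ndoEvil()\n</script\n>after"
shouting = "<SCRIPT TYPE='text/javascript'>doEvil()</SCRIPT >ok"

strict_spaced = strict.sub("", spaced)      # left untouched by the strict pattern
tolerant_spaced = tolerant.sub("", spaced)  # script block removed
tolerant_shouting = tolerant.sub("", shouting)
```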

    For less valid html that someone might still attempt, you could
    find something like

    <scrip<script>hah</script>t>doEvil()</script>

    which, once you nuke everything your pattern matches, leaves the
    valid but unwanted

    <script>doEvil()</script>
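
    A short sketch of that effect: one substitution pass reassembles a
    working script tag from the leftovers. Repeating the substitution
    until the text stops changing closes this particular hole, though it
    remains a patch rather than a proper fix. The function names are
    made up for illustration.

```python
import re

SCRIPT_RE = re.compile(r"<script.*?</script>", re.DOTALL)


def strip_once(text):
    """Single pass: removes only the matches present in the original text."""
    return SCRIPT_RE.sub("", text)


def strip_until_stable(text):
    """Repeat the substitution until a pass changes nothing."""
    while True:
        stripped = SCRIPT_RE.sub("", text)
        if stripped == text:
            return text
        text = stripped


nested = "<scrip<script>hah</script>t>doEvil()</script>"

after_one_pass = strip_once(nested)        # the fragments join into a new tag
after_all_passes = strip_until_stable(nested)
```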

    I'd propose that it's better to use something such as
    BeautifulSoup that actually parses the HTML, and then skim
    through it whitelisting the tags you plan to allow, and skipping
    the emission of any tags that don't make the whitelist.
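
    A rough sketch of the whitelist idea. To stay self-contained it uses
    the standard library's html.parser in place of BeautifulSoup, so it
    shows the approach rather than BeautifulSoup's API. The whitelist
    contents, the class name WhitelistFilter, and the helper clean() are
    all assumptions made for illustration.

```python
import html
from html.parser import HTMLParser

# Hypothetical whitelist; pick whatever tags you actually want to keep.
ALLOWED_TAGS = {"p", "b", "i", "em", "strong"}
# Elements whose text content is dropped along with the tag itself.
DROP_CONTENT = {"script", "style"}


class WhitelistFilter(HTMLParser):
    def __init__(self):
        super().__init__()  # character references are converted to text
        self.out = []
        self.skip_depth = 0  # > 0 while inside a script/style element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in ALLOWED_TAGS:
            # Attributes are discarded, so onclick= and friends never survive.
            self.out.append("<%s>" % tag)

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            if self.skip_depth > 0:
                self.skip_depth -= 1
        elif self.skip_depth == 0 and tag in ALLOWED_TAGS:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if self.skip_depth == 0:
            # Re-escape text so stray "<" or "&" cannot form new markup.
            self.out.append(html.escape(data, quote=False))


def clean(markup):
    """Return markup with only whitelisted tags and no script/style content."""
    parser = WhitelistFilter()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.out)
```

    For example, clean('<p>Hi</p><script>doEvil()</script>') keeps the
    paragraph and drops the script block entirely. Tags that are not on
    the whitelist are never emitted, so nothing has to be matched and
    removed after the fact.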

    -tkc
     
    Tim Chase, Dec 15, 2006
    #3