Regex, replacing THIS|THAT

Discussion in 'Perl Misc' started by Jason C, Dec 17, 2011.

  1. Jason C

    Jason C Guest

    Before putting this into production, can you guys confirm if the logic here is correct?

    my $lt = chr(1);
    my $gt = chr(2);

    $text =~ s/<(\/{0,1})(div|span|table|tr|td|font|img)(.*?)>/$lt$1$2$3$gt/gsi;

    What I'm not sure about is if (div|span...) will work correctly, or if it's going to read "di, followed by either v or s, followed by pa", and so on.

    (FWIW, the next step in the process is to remove all other HTML code, so that only these tags are allowed. Then, I go back and change $lt and $gt back to < and >. This concept works well, so my only real question is whether the regex will work as expected.)
    Jason C, Dec 17, 2011
    #1

  2. Jason C wrote:
    [misguided attempt at using REs to manage HTML snipped]
    >(FWIW, the next step in the process is to remove all other HTML code, so that only these tags are allowed. Then, I go back and change $lt and $gt back to < and >. This concept works well, so my only real question is whether the regex will work as expected.)


    No, it doesn't work at all. You are aware of 'perldoc -q "remove HTML"'?
    The examples given there for why REs are not suitable to parse HTML
    apply just as well for your limited scope of only 7 tags.

    If you want to parse HTML then use a parser for HTML, but don't dawdle
    with home-brewed RE approaches. Those can't work, as has been discussed ad
    nauseam before.
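
    To make the objection concrete, here is a minimal sketch of two inputs on which the substitution from the original post misbehaves. The helper name mark_allowed and both test strings are made up for illustration; only the regex itself comes from the thread. The placeholders chr(1) and chr(2) are shown as { and } when printing.

```perl
use strict;
use warnings;

# The substitution from the original post, wrapped in a helper.
# The sub name is hypothetical; it is not part of the thread.
sub mark_allowed {
    my ($text) = @_;
    my $lt = chr(1);
    my $gt = chr(2);
    $text =~ s/<(\/{0,1})(div|span|table|tr|td|font|img)(.*?)>/$lt$1$2$3$gt/gsi;
    return $text;
}

# Case 1: a '>' inside a quoted attribute ends the match too early,
# so the rest of the tag is left behind as plain text.
my $img = mark_allowed('<img alt="a > b" src="x.png">');

# Case 2: nothing forces the tag name to end after "tr",
# so the unknown tag <trash> is accepted as "tr" followed by "ash".
my $trash = mark_allowed('<trash>');

# Print with the placeholder characters made visible.
for my $result ($img, $trash) {
    (my $shown = $result) =~ tr/\x01\x02/{}/;
    print "$shown\n";
}
```

    Adding \b after the tag-name group would probably address the second case, but the first one (and comments, CDATA, and so on) is the kind of thing a real parser handles and a single regex does not.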

    jue
    Jürgen Exner, Dec 17, 2011
    #2

  3. Jason C wrote:
    > Before putting this into production, can you guys confirm if the logic here is correct?
    >
    > my $lt = chr(1);
    > my $gt = chr(2);
    >
    > $text =~ s/<(\/{0,1})(div|span|table|tr|td|font|img)(.*?)>/$lt$1$2$3$gt/gsi;


    You have nothing between $1 and $2 or between $2 and $3 so why not just
    use one pair of capturing parentheses:

    $text =~ s/<(\/?(?:div|span|table|tr|td|font|img).*?)>/$lt$1$gt/gsi;
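
    As a quick check that the single-group form behaves like the original, the sketch below runs both substitutions on the same string and compares the results. The sample input is made up, and one sample is of course not a proof of equivalence.

```perl
use strict;
use warnings;

my ($lt, $gt) = (chr(1), chr(2));

# Made-up test input: mixed case, an attribute, a disallowed tag, a closing tag.
my $sample = '<DIV class="x"><b>hi</b></div><td>';

# Original form with three capture groups.
(my $three = $sample) =~ s/<(\/{0,1})(div|span|table|tr|td|font|img)(.*?)>/$lt$1$2$3$gt/gsi;

# Simplified form with one capture group and a non-capturing alternation.
(my $one = $sample) =~ s/<(\/?(?:div|span|table|tr|td|font|img).*?)>/$lt$1$gt/gsi;

print $three eq $one ? "same result\n" : "different result\n";
```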


    > What I'm not sure about is if (div|span...) will work correctly,


    Yes, that is how alternation works. Each alternative can be any valid
    pattern, including strings.


    > or if it's going to read "di, followed by either v or s,
    > followed by pa", and so on.


    No, that would not make sense.
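
    A small sketch of the difference, using made-up test words: the first pattern is how Perl reads div|span inside a group, the second is roughly what the "di, then v or s, then pan" reading would have to look like if you actually wanted it.

```perl
use strict;
use warnings;

my $alternation = qr/^(?:div|span)$/;   # whole alternatives: "div" or "span"
my $misreading  = qr/^di(?:v|s)pan$/;   # "di", then "v" or "s", then "pan"

# Test words are invented for this illustration.
for my $word (qw(div span divpan dispan)) {
    printf "%-7s alternation=%-3s misreading=%s\n",
        $word,
        ($word =~ $alternation ? 'yes' : 'no'),
        ($word =~ $misreading  ? 'yes' : 'no');
}
```

    The | operator has the lowest precedence in a regex, so each alternative extends as far as the enclosing parentheses allow.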



    John
    --
    Any intelligent fool can make things bigger and
    more complex... It takes a touch of genius -
    and a lot of courage to move in the opposite
    direction. -- Albert Einstein
    John W. Krahn, Dec 17, 2011
    #3
