Regular Expression

Discussion in 'Perl Misc' started by fritz-bayer@web.de, Sep 7, 2007.

  1. Guest

    Hi,

    I'm looking for a regular expression which will find a certain word
    in a text and replace it if and only if it does not appear inside an
    HTML link or inside a tag, for example as an attribute or tag name.

    So, for example, the word in the following text should not be matched or replaced:

    <a href='/index.html'>WORD TO MATCH</a> ....
    <image alt='WORD TO MATCH' src='../image.gif'> ..

    but the following should be replaced

    <body><h1>WORD TO MATCH</h1>...

    I guess I would have to use a positive lookahead or lookaround
    construct to achieve this. I have tried, but could not come up with
    anything that will do the job.

    Can some pro help me out?

    Fritz
    , Sep 7, 2007
    #1

  2. Klaus Guest

    On Sep 7, 2:28 pm, "" <> wrote:
    > I'm looking for a regular expression which will find a certain word
    > in a text and replace it if and only if it does not appear inside an
    > HTML link or inside a tag


    see Perlfaq 4 - How do I find matching/nesting anything?

    ==================================
    This isn't something that can be done in one regular expression, no
    matter how complicated. To find something between two single
    characters, a pattern like /x([^x]*)x/ will get the intervening bits
    in $1. For multiple ones, then something more like /alpha(.*?)omega/
    would be needed. But none of these deals with nested patterns. For
    balanced expressions using (, {, [ or < as delimiters, use the CPAN
    module Regexp::Common, or see (??{ code }) in the perlre manpage. For
    other cases, you'll have to write a parser.

    If you are serious about writing a parser, there are a number of
    modules or oddities that will make your life a lot easier. There are
    the CPAN modules Parse::RecDescent, Parse::Yapp, and Text::Balanced;
    and the byacc program. Starting from perl 5.8 the Text::Balanced is
    part of the standard distribution.

    One simple destructive, inside-out approach that you might try is to
    pull out the smallest nesting parts one at a time:

    while (s/BEGIN((?:(?!BEGIN)(?!END).)*)END//gs) {
    # do something with $1
    }

    A more complicated and sneaky approach is to make Perl's regular
    expression engine do it for you. This is courtesy Dean Inada, and
    rather has the nature of an Obfuscated Perl Contest entry, but it
    really does work:

    # $_ contains the string to parse
    # BEGIN and END are the opening and closing markers for the
    # nested text.

    @( = ('(','');
    @) = (')','');
    ($re=$_)=~s/((BEGIN)|(END)|.)/$)[!$3]\Q$1\E$([!$2]/gs;
    @$ = (eval{/$re/},$@!~/unmatched/i);
    print join("\n",@$[0..$#$]) if( $$[-1] );
    ==================================
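
    The inside-out loop quoted from the FAQ can be tried on a small
    string. This is only a sketch: the sub name innermost_first and the
    sample text are made up for illustration, and the /g flag is dropped
    so that exactly one BEGIN...END pair is removed, and its $1 recorded,
    per iteration.

```perl
use strict;
use warnings;

# Repeatedly remove the innermost BEGIN...END pair and collect its body.
# Returns a reference to the list of bodies plus whatever text is left.
sub innermost_first {
    my ($text) = @_;
    my @pieces;
    # (?!BEGIN)(?!END). consumes one character at a time, but never one
    # that starts another marker, so only an innermost pair can match.
    while ($text =~ s/BEGIN((?:(?!BEGIN)(?!END).)*)END//s) {
        push @pieces, $1;
    }
    return (\@pieces, $text);
}

my ($pieces, $rest) = innermost_first('a BEGIN b BEGIN c END d END e');
print "piece: [$_]\n" for @$pieces;
print "left over: [$rest]\n";
```

    The inner pair comes out first, then the outer one with the hole
    where the inner pair used to be.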

    --
    Klaus
    Klaus, Sep 7, 2007
    #2

  3. On Sep 7, 8:28 am, "" <> wrote:
    > Hi,
    >
    > I'm looking for a regular expression which will find a certain word
    > in a text and replace it if and only if it does not appear inside an
    > HTML link or inside a tag, for example as an attribute or tag name.
    >
    > So, for example, the word in the following text should not be matched or replaced:
    >
    > <a href='/index.html'>WORD TO MATCH</a> ....
    > <image alt='WORD TO MATCH' src='../image.gif'> ..
    >
    > but the following should be replaced
    >
    > <body><h1>WORD TO MATCH</h1>...
    >
    > I guess I would have to use a positive lookahead or lookaround
    > construct to achieve this. I have tried, but could not come up with
    > anything that will do the job.
    >
    > Can some pro help me out?
    >
    > Fritz


    I'm sure there is some WAY BETTER WAY to do this..

    But here is a solution that seems to work.

    ----------------8<--------------------------------------
    #!/usr/bin/perl
    use strict;
    use warnings;

    my $to_replace  = "WORD";
    my $replacement = "BLEH";

    my @list = ("<a href='/index.html'>WORD</a> ....",
                "<image alt='WORD' src='../image.gif'> ..",
                "<body><h1>this is my WORD !</h1>... ");

    foreach my $line (@list) {
        # Rewrite every run of text that sits between a '>' and the next '<'.
        # /e evaluates the replacement as code, so each run is edited on its
        # own; \Q...\E keeps regex metacharacters in the word literal.
        $line =~ s{>([^<]*)<}{
            my $text = $1;
            $text =~ s/\Q$to_replace\E/$replacement/g;
            ">$text<";
        }ge;
        print $line . "\n";
    }
    --------------------------------------------------------

    output:
    <a href='/index.html'>BLEH</a> ....
    <image alt='WORD' src='../image.gif'> ..
    <body><h1>this is my BLEH !</h1>...
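
    Note that the first output line still has the word replaced inside
    the link, which the original question wanted left alone. A minimal
    sketch of one way around that, assuming simple markup (no '>' inside
    attribute values, no comments or scripts): split the line into tags
    and text, keep a counter for open <a> elements, and only edit text
    while the counter is zero. The sub name replace_outside_links is
    made up for this example.

```perl
use strict;
use warnings;

# Replace $word with $replacement in text that is neither part of a tag
# nor between <a ...> and </a>. Fragile by design: it assumes that every
# '<' starts a tag and that the next '>' ends it.
sub replace_outside_links {
    my ($html, $word, $replacement) = @_;
    my $in_anchor = 0;
    my $out       = '';
    # The capturing group makes split return the tags as well as the text.
    my @chunks = split /(<[^>]*>)/, $html;
    for my $chunk (@chunks) {
        if ($chunk =~ /^</) {
            # A tag: copy it untouched, but track anchor nesting.
            $in_anchor++ if $chunk =~ /^<a\b/i;
            $in_anchor-- if $chunk =~ m{^</a\b}i && $in_anchor > 0;
        }
        elsif (!$in_anchor) {
            # Plain text outside any link: safe to edit.
            $chunk =~ s/\Q$word\E/$replacement/g;
        }
        $out .= $chunk;
    }
    return $out;
}

for my $line ("<a href='/index.html'>WORD</a> ....",
              "<image alt='WORD' src='../image.gif'> ..",
              "<body><h1>this is my WORD !</h1>... ") {
    print replace_outside_links($line, 'WORD', 'BLEH'), "\n";
}
```

    With the three sample lines, only the <h1> text should change.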
    Benoit Lefebvre, Sep 7, 2007
    #3
  4. Guest

    On 7 Sep., 17:41, Klaus <> wrote:
    > On Sep 7, 2:28 pm, "" <> wrote:
    >
    > > I'm looking for a regular expression which will find a certain word
    > > in a text and replace it if and only if it does not appear inside an
    > > HTML link or inside a tag

    >
    > see Perlfaq 4 - How do I find matching/nesting anything?
    >
    > ==================================
    > This isn't something that can be done in one regular expression, no
    > matter how complicated. To find something between two single
    > characters, a pattern like /x([^x]*)x/ will get the intervening bits
    > in $1. For multiple ones, then something more like /alpha(.*?)omega/
    > would be needed. But none of these deals with nested patterns. For
    > balanced expressions using (, {, [ or < as delimiters, use the CPAN
    > module Regexp::Common, or see (??{ code }) in the perlre manpage. For
    > other cases, you'll have to write a parser.
    >
    > If you are serious about writing a parser, there are a number of
    > modules or oddities that will make your life a lot easier. There are
    > the CPAN modules Parse::RecDescent, Parse::Yapp, and Text::Balanced;
    > and the byacc program. Starting from perl 5.8 the Text::Balanced is
    > part of the standard distribution.
    >
    > One simple destructive, inside-out approach that you might try is to
    > pull out the smallest nesting parts one at a time:
    >
    > while (s/BEGIN((?:(?!BEGIN)(?!END).)*)END//gs) {
    > # do something with $1
    > }
    >
    > A more complicated and sneaky approach is to make Perl's regular
    > expression engine do it for you. This is courtesy Dean Inada, and
    > rather has the nature of an Obfuscated Perl Contest entry, but it
    > really does work:
    >
    > # $_ contains the string to parse
    > # BEGIN and END are the opening and closing markers for the
    > # nested text.
    >
    > @( = ('(','');
    > @) = (')','');
    > ($re=$_)=~s/((BEGIN)|(END)|.)/$)[!$3]\Q$1\E$([!$2]/gs;
    > @$ = (eval{/$re/},$@!~/unmatched/i);
    > print join("\n",@$[0..$#$]) if( $$[-1] );
    > ==================================
    >
    > --
    > Klaus



    Well, I don't know whether it's possible, but positive and negative
    lookaheads seem worth considering. The following shows how:

    http://frank.vanpuffelen.net/2007/04/how-to-optimize-regular-expression.html
    , Sep 7, 2007
    #4
  5. Klaus Guest

    On Sep 7, 4:51 pm, "" <> wrote:
    > On 7 Sep., 17:41, Klaus <> wrote:
    >
    > > On Sep 7, 2:28 pm, "" <> wrote:

    >
    > > > I'm looking for a regular expression which will find a certain word
    > > > in a text and replace it if and only if it does not appear inside an
    > > > HTML link or inside a tag

    >
    > > see Perlfaq 4 - How do I find matching/nesting anything?


    [ snip contents of Perlfaq 4 ]

    > Well, I don't know whether it's possible, but positive and negative
    > lookaheads seem worth considering. The following shows how:
    >
    > http://frank.vanpuffelen.net/2007/04/how-to-optimize-regular-expression.html


    The document claims:
    " [...] apparently there aren't many good HTML parsers available
    for .NET [...] "

    That might be true for .NET, but as far as Perl is concerned, there
    are many HTML parsers available on CPAN, and HTML::Parser looks
    perfect for the job (although I have to admit that I haven't yet
    tested it myself):

    http://search.cpan.org/~gaas/HTML-Parser-3.56/Parser.pm

    ========================================
    Here is an extract from the HTML::Parser documentation:
    ========================================
    HTML::Parser is not a generic SGML parser. We have tried to make it
    able to deal with the HTML that is actually "out there", and it
    normally parses as closely as possible to the way the popular web
    browsers do it instead of strictly following one of the many HTML
    specifications from W3C. Where there is disagreement, there is often
    an option that you can enable to get the official behaviour.

    The document to be parsed may be supplied in arbitrary chunks. This
    makes on-the-fly parsing as documents are received from the network
    possible.

    If event driven parsing does not feel right for your application, you
    might want to use HTML::PullParser. This is an HTML::Parser subclass
    that allows a more conventional program structure.
    ========================================

    --
    Klaus
    Klaus, Sep 7, 2007
    #5
  6. Guest

    On 7 Sep., 18:28, Klaus <> wrote:
    > On Sep 7, 4:51 pm, "" <> wrote:
    >
    > > On 7 Sep., 17:41, Klaus <> wrote:

    >
    > > > On Sep 7, 2:28 pm, "" <> wrote:

    >
    > > > > I'm looking for a regular expression which will find a certain word
    > > > > in a text and replace it if and only if it does not appear inside an
    > > > > HTML link or inside a tag

    >
    > > > see Perlfaq 4 - How do I find matching/nesting anything?

    >
    > [ snip contents of Perlfaq 4 ]
    >
    > > Well, I don't know whether it's possible, but positive and negative
    > > lookaheads seem worth considering. The following shows how:

    >
    > >http://frank.vanpuffelen.net/2007/04/how-to-optimize-regular-expressi...

    >
    > The document claims:
    > " [...] apparently there aren't many good HTML parsers available
    > for .NET [...] "
    >
    > That might be true for .NET, but as far as Perl is concerned, there
    > are many HTML parsers available on CPAN, and HTML::Parser looks
    > perfect for the job (although I have to admit that I haven't yet
    > tested it myself):
    >
    > http://search.cpan.org/~gaas/HTML-Parser-3.56/Parser.pm
    >
    > ========================================
    > Here is an extract from the HTML::Parser documentation:
    > ========================================
    > HTML::Parser is not a generic SGML parser. We have tried to make it
    > able to deal with the HTML that is actually "out there", and it
    > normally parses as closely as possible to the way the popular web
    > browsers do it instead of strictly following one of the many HTML
    > specifications from W3C. Where there is disagreement, there is often
    > an option that you can enable to get the official behaviour.
    >
    > The document to be parsed may be supplied in arbitrary chunks. This
    > makes on-the-fly parsing as documents are received from the network
    > possible.
    >
    > If event driven parsing does not feel right for your application, you
    > might want to use HTML::PullParser. This is an HTML::Parser subclass
    > that allows a more conventional program structure.
    > ========================================
    >
    > --
    > Klaus


    I'm looking for a regular expression which is platform independent
    and works for Java, Perl or .NET.
    , Sep 7, 2007
    #6
  7. Ben Morrow Guest

    Quoth "" <>:
    >
    > I'm looking for a regular expression, [to parse HTML] which is
    > platform independent and works for Java, Perl or .NET.


    <sigh> Here we go again. Clpmisc is for discussing Perl. If you want to
    discuss Java or .NET their newsgroups are -->thataway.

    In any case, regular expressions (and Perl5 regexps, which are not quite
    the same thing) are not an appropriate tool to parse HTML with. If you
    have a limited set of documents you may be able to hack up something
    that works, but it will be fragile.

    Now, did you have a Perl question?

    Ben
    Ben Morrow, Sep 7, 2007
    #7
  8. <> wrote:

    > I'm looking for a regular expression which will find a certain word
    > in a text and replace it if and only if it does not appear inside an
    > HTML link or inside a tag, for example as an attribute or tag name.


    > Can some pro help me out?



    Sure.

    A regular expression is not the Right Tool for this job.

    Use a real parser instead.


    --
    Tad McClellan
    email: perl -le "print scalar reverse qq/moc.noitatibaher\100cmdat/"
    Tad McClellan, Sep 7, 2007
    #8
  9. Guest

    On 8 Sep., 07:50, Joe Smith <> wrote:
    > wrote:
    > > I'm looking for a regular expression which is platform independent
    > > and works for Java, Perl or .NET.

    >
    > I'd say you have an impossible task. The advanced parts of perl
    > regular expressions that almost do what you want are not implemented
    > the same way (if at all) on the other platforms.
    >
    > -Joe



    What about finding all words which are not inside a href tag? So if
    I'm looking for the word OUTSIDE, then it should match, if it's not
    inside a href. So the following should not match
    <a href='/somethin.html'>OUTSIDE</a>

    but this should match twice!

    OUTSIDE <a href='/somethin.html'>SOME OTHER TEXT</a> OUTSIDE

    Can somebody come up with a regular expression that does the job?
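
    For input known to be this simple, a pair of negative lookaheads can
    approximate it. The following is only a sketch under strong
    assumptions: anchors are not nested, every '<' starts a tag, and
    there are no comments or '>' characters inside attribute values. The
    sub name replace_with_lookahead is invented for the example. The
    pattern itself uses only lookahead and non-capturing groups, which
    Java and .NET regexes are generally understood to support as well,
    though that is not tested here.

```perl
use strict;
use warnings;

# Replace $word unless it is inside a tag or between <a ...> and </a>.
sub replace_with_lookahead {
    my ($html, $word, $replacement) = @_;
    $html =~ s{
        \Q$word\E
        # Not inside a tag: the next angle bracket ahead must not be a closing one.
        (?! [^<>]* > )
        # Not inside a link: no closing anchor tag ahead before the next opening one.
        (?! (?: (?! (?i: <a\b ) ) . )*? (?i: </a> ) )
    }{$replacement}gxs;
    return $html;
}

print replace_with_lookahead(
    "OUTSIDE <a href='/somethin.html'>SOME OTHER TEXT</a> OUTSIDE",
    'OUTSIDE', 'FOUND'), "\n";
print replace_with_lookahead(
    "<a href='/somethin.html'>OUTSIDE</a>", 'OUTSIDE', 'FOUND'), "\n";
```

    The second lookahead rescans the rest of the string for every
    candidate match, so this is slow on large documents as well as
    fragile.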
    , Sep 11, 2007
    #9
  10. <> wrote:
    > On 8 Sep., 07:50, Joe Smith <> wrote:
    >> wrote:
    >> > I'm looking for a regular expression which is platform independent
    >> > and works for Java, Perl or .NET.

    >>
    >> I'd say you have an impossible task. The advanced parts of perl
    >> regular expressions that almost do what you want are not implemented
    >> the same way (if at all) on the other platforms.
    >>
    >> -Joe

    >
    >
    > What about finding all words which are not inside a href tag? So if
    > I'm looking for the word OUTSIDE, then it should match, if it's not
    > inside a href. So the following should not match
    ><a href='/somethin.html'>OUTSIDE</a>
    >
    > but this should match twice!
    >
    > OUTSIDE <a href='/somethin.html'>SOME OTHER TEXT</a> OUTSIDE



    So the below should match twice also?

    <!--
    OUTSIDE <a href='/somethin.html'>SOME OTHER TEXT</a> OUTSIDE
    -->

    And the below should match once (since it does not appear in an anchor)?

    <!--
    <a href='/somethin.html'>OUTSIDE</a>
    -->


    > Can somebody come up with a regular expression that does the job?



    A regular expression is not the Right Tool for this job.

    Use a real parser instead.

    Strip all of the anchor elements, then match against what remains.
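
    A rough sketch of the parser route, assuming the CPAN module
    HTML::Parser mentioned earlier in the thread is installed (it is not
    part of core Perl). Instead of stripping the anchors, it copies the
    document through unchanged and only edits text events that arrive
    while no <a> element is open. The sub name replace_outside_anchors
    is made up for the example. Comments are passed through by the
    default handler, so a word inside <!-- ... --> is left alone.

```perl
use strict;
use warnings;
use HTML::Parser ();    # CPAN module, assumed to be installed

# Return $html with $word replaced in text that is not inside an <a> element.
sub replace_outside_anchors {
    my ($html, $word, $replacement) = @_;
    my $depth = 0;      # number of currently open <a> elements
    my $out   = '';

    my $p = HTML::Parser->new(
        api_version => 3,
        # Tags are copied verbatim; 'tagname' is the lower-cased element name.
        start_h => [ sub {
            my ($tagname, $text) = @_;
            $depth++ if $tagname eq 'a';
            $out .= $text;
        }, 'tagname, text' ],
        end_h => [ sub {
            my ($tagname, $text) = @_;
            $depth-- if $tagname eq 'a' && $depth > 0;
            $out .= $text;
        }, 'tagname, text' ],
        # Text is edited only while no anchor is open.
        text_h => [ sub {
            my ($text) = @_;
            $text =~ s/\Q$word\E/$replacement/g unless $depth;
            $out .= $text;
        }, 'text' ],
        # Comments, declarations and so on pass through untouched.
        default_h => [ sub { $out .= shift }, 'text' ],
    );
    $p->unbroken_text(1);    # deliver each text run as a single event
    $p->parse($html);
    $p->eof;
    return $out;
}

print replace_outside_anchors(
    "OUTSIDE <a href='/somethin.html'>SOME OTHER TEXT</a> OUTSIDE",
    'OUTSIDE', 'FOUND'), "\n";
```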


    --
    Tad McClellan
    email: perl -le "print scalar reverse qq/moc.noitatibaher\100cmdat/"
    Tad McClellan, Sep 11, 2007
    #10
