Regular Expression

fritz-bayer · Sep 7, 2007

Hi,

I'm looking for a regular expression which will find a certain word
in a text and replace it if and only if it does not appear inside an
HTML link or inside a tag, for example as an attribute or tag name.

So, for example, the following text should not match or be replaced:

<a href='/index.html'>WORD TO MATCH</a> ....
<image alt='WORD TO MATCH' src='../image.gif'> ..

but the following should be replaced:

<body><h1>WORD TO MATCH</h1>...

I guess I would have to use a positive lookahead or lookaround
construct to achieve this. I have tried, but could not come up with
anything that will do the job.

Can some pro help me out?

Fritz

Klaus · Sep 7, 2007

> I'm looking for a regular expression which will find a certain word
> in a text and replace it if and only if it does not appear inside an
> HTML link or inside a tag

see Perlfaq 4 - How do I find matching/nesting anything?

==================================
This isn't something that can be done in one regular expression, no
matter how complicated. To find something between two single
characters, a pattern like /x([^x]*)x/ will get the intervening bits
in $1. For multiple ones, then something more like /alpha(.*?)omega/
would be needed. But none of these deals with nested patterns. For
balanced expressions using (, {, [ or < as delimiters, use the CPAN
module Regexp::Common, or see (??{ code }) in the perlre manpage. For
other cases, you'll have to write a parser.

If you are serious about writing a parser, there are a number of
modules or oddities that will make your life a lot easier. There are
the CPAN modules Parse::RecDescent, Parse::Yapp, and Text::Balanced;
and the byacc program. Starting from perl 5.8 the Text::Balanced is
part of the standard distribution.

One simple destructive, inside-out approach that you might try is to
pull out the smallest nesting parts one at a time:

while (s/BEGIN((?:(?!BEGIN)(?!END).)*)END//gs) {
    # do something with $1
}

A more complicated and sneaky approach is to make Perl's regular
expression engine do it for you. This is courtesy Dean Inada, and
rather has the nature of an Obfuscated Perl Contest entry, but it
really does work:

# $_ contains the string to parse
# BEGIN and END are the opening and closing markers for the
# nested text.

@( = ('(','');
@) = (')','');
($re=$_)=~s/((BEGIN)|(END)|.)/$)[!$3]\Q$1\E$([!$2]/gs;
@$ = (eval{/$re/},$@!~/unmatched/i);
print join("\n",@$[0..$#$]) if( $$[-1] );
==================================
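
As a rough illustration of the inside-out approach quoted above, here is a minimal sketch on a made-up sample string. The sample text and the @found array are assumptions for the demo, not part of the FAQ. The /g flag is dropped so that $1 can be collected after every single removal.

```perl
use strict;
use warnings;

# Sample input; BEGIN and END are the literal markers from the FAQ text.
my $text = "a BEGIN b BEGIN c END d END e";
my @found;

# Without /g, each pass removes the first innermost BEGIN...END pair,
# so $1 always holds the contents of the pair that was just removed.
while ($text =~ s/BEGIN((?:(?!BEGIN)(?!END).)*)END//s) {
    push @found, $1;
}

print "[$_]\n" for @found;
print "left over: [$text]\n";
```

The innermost pair is reported first, then the enclosing one. This works for literal markers; it does not by itself know anything about HTML tags or attributes.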

Benoit Lefebvre · Sep 7, 2007

I'm sure there is some WAY BETTER WAY to do this..

But here is a solution that seems to work.

----------------8<--------------------------------------
#!/usr/bin/perl -w

use strict;

my $to_replace = "WORD";
my $replacement = "BLEH";

my @list = ("<a href='/index.html'>WORD</a> ....",
            "<image alt='WORD' src='../image.gif'> ..",
            "<body><h1>this is my WORD !</h1>... ");

foreach my $line (@list) {
    if ($line =~ m/>([^<]*$to_replace[^>]*)</) {
        my $match = $1;
        $match =~ s/$to_replace/$replacement/g;
        $line =~ s/>([^<]*$to_replace[^>]*)</>$match</g;
    }
    print $line . "\n";
}
--------------------------------------------------------

output:
<a href='/index.html'>BLEH</a> ....
<image alt='WORD' src='../image.gif'> ..
<body><h1>this is my BLEH !</h1>...
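
Note that in the output above the link text in the first line gets replaced as well, which the original question wanted to avoid. A minimal sketch that also skips text inside <a> elements could look like the following. It assumes that no attribute value contains a literal ">" and ignores comments and script blocks; the helper name replace_outside_links is made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: replace $word with $replacement only in text that
# is outside every tag and outside <a>...</a> elements.
sub replace_outside_links {
    my ($html, $word, $replacement) = @_;
    my $anchor_depth = 0;
    my $out = '';

    # The capturing group makes split return the tags as separate chunks.
    for my $chunk (split /(<[^>]*>)/, $html) {
        if ($chunk =~ /^</) {
            # A tag: copy it unchanged, but track anchor open/close.
            if ($chunk =~ m{^<a\b}i) {
                $anchor_depth++;
            }
            elsif ($chunk =~ m{^</a\s*>}i) {
                $anchor_depth-- if $anchor_depth > 0;
            }
        }
        elsif ($anchor_depth == 0) {
            # Plain text outside any link: safe to replace here.
            $chunk =~ s/\Q$word\E/$replacement/g;
        }
        $out .= $chunk;
    }
    return $out;
}

print replace_outside_links($_, 'WORD', 'BLEH'), "\n"
    for "<a href='/index.html'>WORD</a> ....",
        "<image alt='WORD' src='../image.gif'> ..",
        "<body><h1>this is my WORD !</h1>... ";
```

This is still a hack on top of split and will be fragile on real-world HTML, but it keeps the "am I inside a link?" state out of the regular expression itself.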

fritz-bayer · Sep 7, 2007

Well, I wouldn't know if it's possible, but positive and negative
lookaheads seem to be something to consider. The following shows how:

http://frank.vanpuffelen.net/2007/04/how-to-optimize-regular-expression.html

Klaus · Sep 7, 2007

[ snip contents of Perlfaq 4 ]

> Well, I wouldn't know if it's possible, but positive and negative
> lookaheads seem to be something to consider. The following shows how:
>
> http://frank.vanpuffelen.net/2007/04/how-to-optimize-regular-expression.html

The document claims:
" [...] apparently there aren't many good HTML parsers available
for .NET [...] "

That might be true for .NET, but as far as Perl is concerned, there
are many HTML parsers available on CPAN, and HTML::Parser looks
perfect for the job (although I have to admit that I haven't yet
tested it myself):

http://search.cpan.org/~gaas/HTML-Parser-3.56/Parser.pm

========================================
Here is an extract from the HTML::Parser documentation:
========================================
HTML::Parser is not a generic SGML parser. We have tried to make it
able to deal with the HTML that is actually "out there", and it
normally parses as closely as possible to the way the popular web
browsers do it instead of strictly following one of the many HTML
specifications from W3C. Where there is disagreement, there is often
an option that you can enable to get the official behaviour.

The document to be parsed may be supplied in arbitrary chunks. This
makes on-the-fly parsing as documents are received from the network
possible.

If event driven parsing does not feel right for your application, you
might want to use HTML::PullParser. This is an HTML::Parser subclass
that allows a more conventional program structure.
========================================
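
A minimal, untested-in-anger sketch of how the event driven interface might be used for the replacement task. It assumes HTML::Parser is installed from CPAN; the sample input, the variable names and the WORD/BLEH values are made up for illustration.

```perl
use strict;
use warnings;
use HTML::Parser ();

my ($word, $replacement) = ('WORD', 'BLEH');   # illustrative values
my $in_anchor = 0;    # greater than zero while inside <a>...</a>
my $out       = '';   # the rewritten document is collected here

my $p = HTML::Parser->new(
    api_version => 3,
    # Start tags: copy the original source text, note opening anchors.
    start_h => [ sub {
        my ($tagname, $text) = @_;
        $in_anchor++ if $tagname eq 'a';
        $out .= $text;
    }, 'tagname, text' ],
    # End tags: copy the original source text, note closing anchors.
    end_h => [ sub {
        my ($tagname, $text) = @_;
        $in_anchor-- if $tagname eq 'a' && $in_anchor > 0;
        $out .= $text;
    }, 'tagname, text' ],
    # Text: replace only outside links and outside script/style content.
    text_h => [ sub {
        my ($text, $is_cdata) = @_;
        $text =~ s/\Q$word\E/$replacement/g unless $in_anchor || $is_cdata;
        $out .= $text;
    }, 'text, is_cdata' ],
    # Comments, declarations and so on are passed through untouched.
    default_h => [ sub { $out .= shift }, 'text' ],
);
$p->unbroken_text(1);   # deliver each run of text in one piece

$p->parse("<a href='/index.html'>WORD</a> <h1 title='WORD'>WORD</h1>");
$p->eof;
print "$out\n";
```

Attribute values never reach the text handler, so the word inside title='WORD' is left alone without any extra effort.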

fritz-bayer · Sep 7, 2007

I'm looking for a regular expression which is platform independent
and works for Java, Perl or .NET.

Ben Morrow · Sep 7, 2007

Quoth fritz-bayer:
> I'm looking for a regular expression, [to parse HTML] which is
> platform independent and works for Java, Perl or .NET.

<sigh> Here we go again. Clpmisc is for discussing Perl. If you want to
discuss Java or .NET their newsgroups are -->thataway.

In any case, regular expressions (and Perl5 regexps, which are not quite
the same thing) are not an appropriate tool to parse HTML with. If you
have a limited set of documents you may be able to hack up something
that works, but it will be fragile.

Now, did you have a Perl question?

Ben

Tad McClellan · Sep 7, 2007

> I'm looking for a regular expression which will find a certain word
> in a text and replace it if and only if it does not appear inside an
> HTML link or inside a tag, for example as an attribute or tag name.
>
> Can some pro help me out?

Sure.

A regular expression is not the Right Tool for this job.

Use a real parser instead.

fritz-bayer · Sep 11, 2007

> I'd say you have an impossible task. The advanced parts of Perl
> regular expressions that almost do what you want are not implemented
> the same way (if at all) on the other platforms.
>
> -Joe

What about finding all words which are not inside an href tag? So if
I'm looking for the word OUTSIDE, then it should match if it's not
inside an href. So the following should not match:
<a href='/somethin.html'>OUTSIDE</a>

but this should match twice!

OUTSIDE <a href='/somethin.html'>SOME OTHER TEXT</a> OUTSIDE

Can somebody come up with a regular expression that does the job?

Tad McClellan · Sep 11, 2007

> What about finding all words which are not inside an href tag? So if
> I'm looking for the word OUTSIDE, then it should match if it's not
> inside an href. So the following should not match:
> <a href='/somethin.html'>OUTSIDE</a>
>
> but this should match twice!
>
> OUTSIDE <a href='/somethin.html'>SOME OTHER TEXT</a> OUTSIDE

So the below should match twice also?



And the below should match once (since it does not appear in an anchor)?

> Can somebody come up with a regular expression that does the job?

A regular expression is not the Right Tool for this job.

Use a real parser instead.

Strip all of the anchor elements, then match against what remains.
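
A minimal sketch of that strip-then-match idea, using a plain substitution to drop the anchor elements. It assumes anchors are not nested and that no attribute value contains ">"; the helper name count_outside_anchors is made up for this example. A real parser would be the more robust way to do the stripping step.

```perl
use strict;
use warnings;

# Hypothetical helper: count occurrences of $word outside <a>...</a>.
sub count_outside_anchors {
    my ($html, $word) = @_;

    # Work on a copy with every anchor element (tags and content) removed.
    (my $stripped = $html) =~ s{<a\b[^>]*>.*?</a\s*>}{}gis;

    # Count whole-word matches in what remains.
    my $count = () = $stripped =~ /\b\Q$word\E\b/g;
    return $count;
}

print count_outside_anchors(
    "<a href='/somethin.html'>OUTSIDE</a>", 'OUTSIDE'), "\n";
print count_outside_anchors(
    "OUTSIDE <a href='/somethin.html'>SOME OTHER TEXT</a> OUTSIDE",
    'OUTSIDE'), "\n";
```

Note that this only removes anchors: the word would still be counted if it showed up inside the attributes of some other tag.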
