Regular Expression

fritz-bayer

Hi,

I'm looking for a regular expression which will find a certain word
in a text and replace it if and only if it does not appear inside an
HTML link or inside a tag, for example as an attribute or tag name.

So, for example, the word should not match or be replaced in the following text:

<a href='/index.html'>WORD TO MATCH</a> ....
<image alt='WORD TO MATCH' src='../image.gif'> ..

but the following should be replaced

<body><h1>WORD TO MATCH</h1>...

I guess I would have to use a positive lookahead or lookaround
construct to achieve this. I have tried, but could not come up with
anything that will do the job.

Can some pro help me out?

Fritz
 
Klaus

I'm looking for a regular expression which will find a certain word
in a text and replace it if and only if it does not appear inside an
HTML link or inside a tag

see Perlfaq 4 - How do I find matching/nesting anything?

==================================
This isn't something that can be done in one regular expression, no
matter how complicated. To find something between two single
characters, a pattern like /x([^x]*)x/ will get the intervening bits
in $1. For multiple ones, then something more like /alpha(.*?)omega/
would be needed. But none of these deals with nested patterns. For
balanced expressions using (, {, [ or < as delimiters, use the CPAN
module Regexp::Common, or see (??{ code }) in the perlre manpage. For
other cases, you'll have to write a parser.

If you are serious about writing a parser, there are a number of
modules or oddities that will make your life a lot easier. There are
the CPAN modules Parse::RecDescent, Parse::Yapp, and Text::Balanced;
and the byacc program. Starting from perl 5.8 the Text::Balanced is
part of the standard distribution.

One simple destructive, inside-out approach that you might try is to
pull out the smallest nesting parts one at a time:

while (s/BEGIN((?:(?!BEGIN)(?!END).)*)END//gs) {
    # do something with $1
}

A more complicated and sneaky approach is to make Perl's regular
expression engine do it for you. This is courtesy Dean Inada, and
rather has the nature of an Obfuscated Perl Contest entry, but it
really does work:

# $_ contains the string to parse
# BEGIN and END are the opening and closing markers for the
# nested text.

@( = ('(','');
@) = (')','');
($re=$_)=~s/((BEGIN)|(END)|.)/$)[!$3]\Q$1\E$([!$2]/gs;
@$ = (eval{/$re/},$@!~/unmatched/i);
print join("\n",@$[0..$#$]) if( $$[-1] );
==================================
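
To make the inside-out approach from the FAQ extract concrete, here is a minimal sketch. The input string and the @found array are made up for illustration, and the /g flag is dropped so that $1 is available after every single removal; treat it as a demonstration of the idea, not as the FAQ's own code.

```perl
use strict;
use warnings;

# Hypothetical input: BEGIN/END stand in for any pair of nesting markers.
my $text = 'x BEGIN a BEGIN b END c END y';
my @found;

# Each pass removes one block that contains no further BEGIN or END,
# so an inner block is captured before the block that encloses it.
while ($text =~ s/BEGIN((?:(?!BEGIN)(?!END).)*)END//s) {
    push @found, $1;
}

print "[$_]\n" for @found;    # innermost block first
print "left: $text\n";
```

The loop is destructive: whatever remains in $text afterwards is only the material that was never inside a BEGIN/END pair.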
 
Benoit Lefebvre

I'm sure there is some WAY BETTER WAY to do this..

But here is a solution that seems to work.

----------------8<--------------------------------------
#!/usr/bin/perl -w

use strict;

my $to_replace = "WORD";
my $replacement = "BLEH";

my @list = ("<a href='/index.html'>WORD</a> ....",
            "<image alt='WORD' src='../image.gif'> ..",
            "<body><h1>this is my WORD !</h1>... ");

foreach my $line (@list) {
    # Only look at text that sits between a '>' and the next '<'.
    # \Q...\E keeps regex metacharacters in the search word literal.
    if ($line =~ m/>([^<]*\Q$to_replace\E[^>]*)</) {
        my $match = $1;
        $match =~ s/\Q$to_replace\E/$replacement/g;
        $line =~ s/>([^<]*\Q$to_replace\E[^>]*)</>$match</g;
    }
    print $line . "\n";
}
--------------------------------------------------------

output:
<a href='/index.html'>BLEH</a> ....
<image alt='WORD' src='../image.gif'> ..
<body><h1>this is my BLEH !</h1>...
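
Note that in the output above the word inside the <a> element was replaced too, which the original question wanted to avoid. One possible way around that, as a rough sketch only: split the line into tags and text, and remember whether an anchor is currently open. The helper name replace_outside_links is made up here, and the tokenizer is deliberately naive (it assumes no '>' inside attribute values and knows nothing about comments or scripts), so it is not a substitute for a real parser.

```perl
use strict;
use warnings;

# Hypothetical helper, not from the thread: replace $word in text only,
# skipping tags and anything between <a ...> and </a>.
sub replace_outside_links {
    my ($html, $word, $replacement) = @_;
    my $in_anchor = 0;
    my $out       = '';

    # split with a capturing group keeps the tags as separate list items
    for my $piece (split /(<[^>]*>)/, $html) {
        if ($piece =~ /^</) {
            # a tag: update the anchor state, copy it through unchanged
            if    ($piece =~ m{^<a[\s>]}i)  { $in_anchor++ }
            elsif ($piece =~ m{^</a\s*>}i)  { $in_anchor-- if $in_anchor }
        }
        elsif (!$in_anchor) {
            # plain text outside any anchor
            $piece =~ s/\Q$word\E/$replacement/g;
        }
        $out .= $piece;
    }
    return $out;
}

print replace_outside_links($_, 'WORD', 'BLEH'), "\n"
    for "<a href='/index.html'>WORD</a> ....",
        "<image alt='WORD' src='../image.gif'> ..",
        "<body><h1>this is my WORD !</h1>... ";
```

With the three test lines from the script above, only the <h1> text should change.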
 
fritz-bayer

I'm looking for a regular expression which will find a certain word
in a text and replace it if and only if it does not appear inside an
HTML link or inside a tag

see Perlfaq 4 - How do I find matching/nesting anything?

[ snip contents of Perlfaq 4 ]

Well, I wouldn't know if it's possible, but positive and negative
lookaheads seem to be something to consider. The following shows how:

http://frank.vanpuffelen.net/2007/04/how-to-optimize-regular-expression.html
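
For what it's worth, a lookahead-only attempt could look roughly like the sketch below. It is not taken from the linked article; the helper name replace_with_lookahead is invented, and the pattern assumes well-formed markup with no '>' inside attribute values, no nested anchors and no comments, so it is fragile by design.

```perl
use strict;
use warnings;

# Hypothetical helper: two negative lookaheads after the word.
sub replace_with_lookahead {
    my ($html, $word, $replacement) = @_;
    $html =~ s{
        \Q$word\E
        (?! [^<>]* > )                             # no closing angle bracket before the next tag opens
        (?! (?: (?! <[aA][\s>] ) . )*? </[aA]\s*> ) # no closing anchor ahead without a new anchor first
    }{$replacement}gsx;
    return $html;
}

print replace_with_lookahead("<body><h1>WORD TO MATCH</h1>...", 'WORD TO MATCH', 'BLEH'), "\n";
```

The first lookahead rejects a word that sits inside a tag, the second rejects a word that is followed by a closing </a> before any new <a ...> opens. Plain lookaheads like these are also available in the Java and .NET regex engines, but each lookahead rescans the rest of the input, so this gets slow on large documents.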
 
Klaus

[ snip contents of Perlfaq 4 ]
Well, I wouldn't know if it's possible, but positive and negative
lookaheads seem to be something to consider. The following shows how:

http://frank.vanpuffelen.net/2007/04/how-to-optimize-regular-expression.html

The document claims:
" [...] apparently there aren't many good HTML parsers available
for .NET [...] "

That might be true for .NET, but as far as Perl is concerned, there
are many HTML parsers available on CPAN, and HTML::Parser looks
perfect for the job (although I have to admit that I haven't yet
tested it myself):

http://search.cpan.org/~gaas/HTML-Parser-3.56/Parser.pm

========================================
Here is an extract from the HTML::Parser documentation:
========================================
HTML::Parser is not a generic SGML parser. We have tried to make it
able to deal with the HTML that is actually "out there", and it
normally parses as closely as possible to the way the popular web
browsers do it instead of strictly following one of the many HTML
specifications from W3C. Where there is disagreement, there is often
an option that you can enable to get the official behaviour.

The document to be parsed may be supplied in arbitrary chunks. This
makes on-the-fly parsing as documents are received from the network
possible.

If event driven parsing does not feel right for your application, you
might want to use HTML::PullParser. This is an HTML::Parser subclass
that allows a more conventional program structure.
========================================
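
As a rough idea of how that could look with the version 3 handler API of HTML::Parser (a CPAN module, so it has to be installed first): the sketch below copies everything through unchanged and only rewrites text events while no <a> element is open. The variable names, the sample input and the WORD/BLEH pair are placeholders, and text inside <script> or <style> would also be rewritten, which may or may not be wanted.

```perl
use strict;
use warnings;
use HTML::Parser ();    # CPAN module, not part of core Perl

my ($word, $replacement) = ('WORD', 'BLEH');    # placeholder values
my $in_anchor = 0;
my $out       = '';

my $p = HTML::Parser->new(
    api_version => 3,
    # comments, declarations and the like: copy through untouched
    default_h => [ sub { $out .= shift }, 'text' ],
    start_h   => [ sub {
        my ($tag, $text) = @_;
        $in_anchor++ if $tag eq 'a';
        $out .= $text;
    }, 'tagname,text' ],
    end_h     => [ sub {
        my ($tag, $text) = @_;
        $in_anchor-- if $tag eq 'a' && $in_anchor;
        $out .= $text;
    }, 'tagname,text' ],
    text_h    => [ sub {
        my ($text) = @_;
        # only text outside of anchors is rewritten
        $text =~ s/\Q$word\E/$replacement/g unless $in_anchor;
        $out .= $text;
    }, 'text' ],
);
$p->unbroken_text(1);    # deliver each run of text in one piece
$p->parse("<a href='/index.html'>WORD</a> <h1>my WORD</h1>");
$p->eof;

print "$out\n";
```

Attribute values and tag names never reach the text handler, so they are left alone without any extra work.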
 
fritz-bayer

I'm looking for a regular expression which is platform independent
and works for Java, Perl or .NET.
 
Ben Morrow

Quoth fritz-bayer:
I'm looking for a regular expression, [to parse HTML] which is
platform independent and works for Java, Perl or .NET.

<sigh> Here we go again. Clpmisc is for discussing Perl. If you want to
discuss Java or .NET their newsgroups are -->thataway.

In any case, regular expressions (and Perl5 regexps, which are not quite
the same thing) are not an appropriate tool to parse HTML with. If you
have a limited set of documents you may be able to hack up something
that works, but it will be fragile.

Now, did you have a Perl question?

Ben
 
Tad McClellan

I'm looking for a regular expression which will find a certain word
in a text and replace it if and only if it does not appear inside an
HTML link or inside a tag, for example as an attribute or tag name.
Can some pro help me out?


Sure.

A regular expression is not the Right Tool for this job.

Use a real parser instead.
 
fritz-bayer

I'd say you have an impossible task. The advanced parts of perl
regular expressions that almost do what you want are not implemented
the same way (if at all) on the other platforms.

-Joe


What about finding all occurrences of a word that are not inside an
<a href> element? So if I'm looking for the word OUTSIDE, it should
match only if it's not inside an <a href> element. So the following should not match
<a href='/somethin.html'>OUTSIDE</a>

but this should match twice!

OUTSIDE <a href='/somethin.html'>SOME OTHER TEXT</a> OUTSIDE

Can somebody come up with a regular expression that does the job?
 
Tad McClellan

What about finding all occurrences of a word that are not inside an
<a href> element? So if I'm looking for the word OUTSIDE, it should
match only if it's not inside an <a href> element. So the following should not match
<a href='/somethin.html'>OUTSIDE</a>

but this should match twice!

OUTSIDE <a href='/somethin.html'>SOME OTHER TEXT</a> OUTSIDE


So the below should match twice also?

<!--
OUTSIDE <a href='/somethin.html'>SOME OTHER TEXT</a> OUTSIDE
-->

And the below should match once (since it does not appear in an anchor)?

<!--
<a href='/somethin.html'>OUTSIDE</a>
-->

Can somebody come up with a regular expression that does the job?


A regular expression is not the Right Tool for this job.

Use a real parser instead.

Strip all of the anchor elements, then match against what remains.
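
A quick-and-dirty sketch of that "strip, then match" idea follows. The helper name count_outside_anchors is made up, the stripping is itself done with naive regexes (no nested anchors, no '>' inside attribute values), and it assumes that commented-out markup should not count, which is only one possible answer to the comment examples above. A real parser would do the stripping more reliably.

```perl
use strict;
use warnings;

# Hypothetical helper: count $word outside <a>...</a> elements.
sub count_outside_anchors {
    my ($html, $word) = @_;
    (my $rest = $html) =~ s/<!--.*?-->//gs;    # assumption: comments do not count
    $rest =~ s{<a[\s>].*?</a\s*>}{}gis;        # drop whole anchor elements
    $rest =~ s/<[^>]*>//g;                     # drop remaining tags and their attributes
    my $count = () = $rest =~ /\Q$word\E/g;    # count matches in what is left
    return $count;
}

print count_outside_anchors(
    "OUTSIDE <a href='/somethin.html'>SOME OTHER TEXT</a> OUTSIDE", 'OUTSIDE'), "\n";
print count_outside_anchors("<a href='/somethin.html'>OUTSIDE</a>", 'OUTSIDE'), "\n";
```

Because the match runs against a stripped copy, this is fine for finding or counting; for replacing in place the positions would have to be mapped back to the original document.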
 
