Simple problem

Jayashree

I am a newbie to Perl and its regular expressions. I would like to know
the regular expression that would let me match anything in a
string except for "textarea" or "script" within angle brackets.

string = "abracadabra<this should match><textarea><script><textarea abracadabra>"

In the above string it should match
<this should match>
<textarea abracadabra>
and not match
<textarea>
<script>
and, of course, the leading abracadabra.

Thanks in advance,
Jayashree
 
Gunnar Hjalmarsson

Jayashree said:
I am a newbie to Perl and its regular expressions. I would like to
know the regular expression that would let me match anything
in a string except for "textarea" or "script" within angle
brackets.

perldoc perlrequick
 
Josef Moellers

Jayashree said:
I am a newbie to Perl and its regular expressions. I would like to know
the regular expression that would let me match anything in a
string except for "textarea" or "script" within angle brackets.

string = "abracadabra<this should match><textarea><script><textarea abracadabra>"

In the above string it should match
<this should match>
<textarea abracadabra>
and not match
<textarea>
<script>
and, of course, the leading abracadabra.

What do you mean by "should match"?
Usually you match an entire line (or parts thereof) and extract the
interesting portions. So: what exactly do you want to do with the line?
If you are interested in the various parts, you might want to have a
look at "split":

#! /usr/bin/perl
use warnings;
use strict;

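# The sample string from the question, kept on one line.
my $string = "abracadabra<this should match><textarea><script><textarea abracadabra>";
# Split on either tag; the text between the separators is returned.
my @f = split(/<textarea>|<script>/, $string);
print join("\n", @f), "\n";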

Note that this will extract the empty string between "<textarea>" and
"<script>".
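You can avoid this if you use a non-capturing group with "+", so that a
run of adjacent separators counts as one (capturing parentheses would
make split return the separators as well):
my @f = split(/(?:<textarea>|<script>)+/, $string);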
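
To make the difference between the split patterns concrete, here is a small sketch that runs three variants side by side on the sample string. The variable names (@single, @capturing, @grouped) are only illustrative, and the sample string is assumed to be on one line with a single space in "<textarea abracadabra>".

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $string = "abracadabra<this should match><textarea><script><textarea abracadabra>";

# One separator per split: the adjacent "<textarea>" and "<script>"
# leave an empty field between them.
my @single = split /<textarea>|<script>/, $string;

# Capturing parentheses: split also returns the captured separator text.
my @capturing = split /(<textarea>|<script>)+/, $string;

# Non-capturing group with "+": a run of separators counts as one,
# and only the surrounding text is returned.
my @grouped = split /(?:<textarea>|<script>)+/, $string;

for my $result (\@single, \@capturing, \@grouped) {
    print scalar(@$result), " fields: ", join(" | ", @$result), "\n";
}
```

Only the non-capturing variant returns just the two pieces of text around the unwanted tags.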

HTH,

Josef
 
Anno Siegel

Your subject "Simple problem" is at once vacuous and annoying. If you
can't solve it, don't tell us it's simple!

Jayashree said:
I am a newbie to Perl and its regular expressions. I would like to know
the regular expression that would let me match anything in a
string except for "textarea" or "script" within angle brackets.

string = "abracadabra<this should match><textarea><script><textarea abracadabra>"

In the above string it should match
<this should match>
<textarea abracadabra>
and not match
<textarea>
<script>
and, of course, the leading abracadabra.

This isn't simple, it's impossible.

A regex match must be definite. It isn't enough to decide whether
a match occurs, but it must say where it occurs and how many characters
are involved. So where in the string "abracadabra" is the place where
/<textarea>|<script>/ isn't contained, and for how many characters
isn't it contained?
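
As a small illustration of what "definite" means here: a successful match always has a start offset and a length, which Perl exposes through the built-in arrays @- and @+. This is only a sketch on the sample string; the names $start and $length are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $string = "abracadabra<this should match><textarea><script><textarea abracadabra>";

my ($start, $length);
if ($string =~ /<textarea>|<script>/) {
    $start  = $-[0];          # offset where the match begins
    $length = $+[0] - $-[0];  # number of characters matched
    print "matched '", substr($string, $start, $length),
          "' at offset $start, length $length\n";
}
```

A "does not contain" condition has no such position or length to report.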

The answer is, your regex would have to match every substring that
isn't exactly one of "<textarea>" or "<script>". That is a lot
of substrings, including strings like "match><textar", "textarea>",
and so on. A single regex simply won't do it.

You could first extract the parts that are enclosed in <> and
then de-select the unwanted matches, as in

print "$_\n" for grep ! /<textarea>|<script>/, /(<.*?>)/g;

but that isn't a single regex, and it will break easily on more
complicated HTML.
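
Spelled out as a complete script, the two-step approach might look like the sketch below. It uses the same two patterns as the one-liner, applied to the sample string from the question. The names @tags and @wanted are illustrative, and the string is assumed to be on a single line.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $string = "abracadabra<this should match><textarea><script><textarea abracadabra>";

# Step 1: collect every <...> group (non-greedy, so each stops at the first ">").
my @tags = $string =~ /(<.*?>)/g;

# Step 2: discard the groups that are exactly "<textarea>" or "<script>".
my @wanted = grep { !/<textarea>|<script>/ } @tags;

print "$_\n" for @wanted;
```

Since "." does not match a newline by default, a tag that spans lines would be missed. That is one of the ways this breaks on real HTML.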

What you really should do is get yourself an HTML parser from CPAN
and use that.

Anno
 
