Regular expression problem with tags

steeve_dun

Hi,
I want to do some pattern replacement, i.e. to delete everything
that's between two tags.
For example, for

1<tag> 2</tag>3
x<tag> a<tag> b </tag> c</tag>z

I want to get

1 3
x z

But I have a problem with embedded tags.
I've tried:
$text =~ s/\<tag\>(.*?)\<\/tag\>//sg;
but it doesn't work for embedded tags. It gives:
13
x c</tag>z

Is there a way to deal with this?

Thank you

-steeve
 
David Squire

> Is there a way to deal with this?

Yep. Don't try to use regular expressions to parse XML. Use a module
that understands XML. Go to CPAN and you will find many.


DS
 
anno4000

> Is there a way to deal with this?

Not using regular expressions directly. Use one of the HTML-parsing
modules from CPAN.
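
To illustrate why a parser copes with nesting where a single non-greedy match does not, here is a minimal pure-Perl sketch that tokenizes the text and tracks nesting depth. It is only a stand-in for the idea, not what any CPAN module does internally; the helper name strip_tagged is hypothetical, and it assumes the only markup is the literal <tag> and </tag> from the question.

```perl
use strict;
use warnings;

# Hypothetical helper (not from any CPAN module): remove every
# <tag>...</tag> region, including nested ones, by tracking depth.
sub strip_tagged {
    my ($text) = @_;
    my $depth = 0;
    my $out   = '';

    # The capturing group makes split return the tags as separate tokens.
    for my $token (split m{(</?tag>)}, $text) {
        if ($token eq '<tag>') {
            $depth++;
        }
        elsif ($token eq '</tag>') {
            $depth-- if $depth > 0;    # drop a stray closing tag
        }
        elsif ($depth == 0) {
            $out .= $token;            # keep text only outside all tags
        }
    }
    return $out;
}

print strip_tagged("1<tag> 2</tag>3"), "\n";                    # prints "13"
print strip_tagged("x<tag> a<tag> b </tag> c</tag>z"), "\n";    # prints "xz"
```

For real HTML or XML (attributes, comments, entities), a proper parser module remains the safer choice.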

Anno
 
Xicheng Jia

> Is there a way to deal with this?

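Since you are using Perl, and XML is quite well formatted, you may try
something like:

my $ptn;
# (??{$ptn}) re-enters the pattern at match time, so a nested
# <tag>...</tag> is consumed as a unit before the closing </tag>.
$ptn = qr(<tag>(?:(??{$ptn})|.)*?</tag>)s;
$line =~ s/$ptn//g;    # $line holds the text to clean

I am not encouraging you to use regexes at work. But for small
programs, using regexes might be much faster/easier if you know
what you are doing.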
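
The same recursive idea can be sketched with the (?R) construct, which recurses into the whole pattern. This assumes Perl 5.10 or later; the helper name strip_nested is hypothetical and the tags are taken literally from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: delete balanced <tag>...</tag> regions.
# Assumes Perl 5.10+, where (?R) recurses into the whole pattern.
sub strip_nested {
    my ($text) = @_;
    # Inside a tag, first try to match a complete nested tag via (?R),
    # otherwise consume a single character (/s lets . match newlines).
    $text =~ s{<tag>(?:(?R)|.)*?</tag>}{}gs;
    return $text;
}

print strip_nested("1<tag> 2</tag>3"), "\n";                    # prints "13"
print strip_nested("x<tag> a<tag> b </tag> c</tag>z"), "\n";    # prints "xz"
print strip_nested("<tag>a</tag> extra <tag>b</tag>"), "\n";    # prints " extra "
```

Like the version above, this only suits small, predictable inputs; unbalanced tags can produce surprising matches.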

Regards,
Xicheng
 
Ted Zlatanov

> Is there a way to deal with this?

For the first example, you're getting exactly what you asked for ("13").
Look at your input data: the space sits inside the tag, so it is
deleted along with it.

For the second example, your requirements are underspecified. You don't
say whether you want to replace the outermost tags (in which case a regex
would work) or you want to balance tags. For outermost tag
replacement, use

$text =~ s/\<tag\>(.*)\<\/tag\>//sg;

but note that this will also replace "<tag>a</tag> extra <tag>b</tag>"
with "" and not " extra " as you may expect.
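
A small sketch comparing the two quantifiers side by side may make the trade-off concrete. The helper names strip_lazy and strip_greedy are hypothetical; the patterns are the ones from this thread with the unnecessary backslashes before < and > dropped.

```perl
use strict;
use warnings;

# Hypothetical helpers wrapping the two substitutions discussed above.
sub strip_lazy   { my ($t) = @_; $t =~ s/<tag>.*?<\/tag>//sg; return $t }
sub strip_greedy { my ($t) = @_; $t =~ s/<tag>.*<\/tag>//sg;  return $t }

for my $input ("x<tag> a<tag> b </tag> c</tag>z",
               "<tag>a</tag> extra <tag>b</tag>") {
    printf "input:  '%s'\n", $input;
    printf "lazy:   '%s'\n", strip_lazy($input);
    printf "greedy: '%s'\n", strip_greedy($input);
}
# Nested input:  lazy gives 'x c</tag>z', greedy gives 'xz'.
# Sibling input: lazy gives ' extra ',    greedy gives ''.
```

Neither quantifier handles both inputs, which is why balancing the tags is the more robust approach.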

My guess is that you do want to balance tags, and you can use
Text::Balanced for that (especially if your text is not valid XML or
even SGML). If you are doing SGML/HTML/XML/etc. tagged formats then
you should search CPAN for the appropriate parser, as others have
suggested. Look at "perldoc -q html" as well.

Ted
 
