Help extracting tag with boost::regex

MCH

Hi there,
I am working with HTML-like text using boost::regex. For example,
the following pattern might occur in my text:

<abc efg>
[TAG] <p>EFG</p> [/TAG] 12<3>

In this case, I would like to extract everything between [TAG] and
[/TAG], replacing [TAG] with <pre> and [/TAG] with </pre>. Meanwhile,
everything outside [TAG][/TAG] should be unchanged except that < is
replaced by &lt; and > is replaced by &gt;

In a far more complicated case, a nested [TAG] might occur as follows:

<abc efg>
[TAG] <p>EF [TAG] eee [/TAG] G</p> [/TAG] 12<3>

In this case, the program should handle only the outermost TAG and
leave the inner TAG as it is. I tried to implement this with
boost::regex; however, it never seems to succeed in extracting the TAG,
even in the simple case.
 
David Harmon

On 12 Mar 2007 13:07:26 -0700 in comp.lang.c++ said:
<abc efg>
[TAG] <p>EFG</p> [/TAG] 12<3>

In this case, I would like to extract everything between [TAG] [/TAG]
and replace [TAG] with <pre>, [/TAG] with </pre>. Meanwhile,
everything outside [TAG][/TAG] should be unchanged except that < is
replaced by &lt; and > is replaced by &gt;

This exceeds what I know how to do with regex. Too much context.
I would do it with std::string, std::find_first_of(),
and a couple of if statements.
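
A minimal sketch of that string-based approach, assuming a depth counter is enough to tell the outermost pair from nested ones. The function name escape_outside_tags is hypothetical, and the sketch assumes every [TAG] is eventually closed; an unclosed [TAG] leaves the rest of the text unescaped.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not from the thread: turns the outermost
// [TAG]...[/TAG] pair into <pre>...</pre>, copies nested tags verbatim,
// and escapes < and > only outside any [TAG] block.
std::string escape_outside_tags(const std::string& in) {
    const std::string open = "[TAG]";
    const std::string close = "[/TAG]";
    std::string out;
    int depth = 0;  // number of [TAG] blocks currently open
    std::size_t i = 0;
    while (i < in.size()) {
        if (in.compare(i, open.size(), open) == 0) {
            // Only the outermost opener becomes <pre>.
            if (depth == 0) out += "<pre>"; else out += open;
            ++depth;
            i += open.size();
        } else if (depth > 0 && in.compare(i, close.size(), close) == 0) {
            --depth;
            // Only the closer that returns to depth 0 becomes </pre>.
            if (depth == 0) out += "</pre>"; else out += close;
            i += close.size();
        } else {
            const char c = in[i++];
            if (depth == 0 && c == '<') out += "&lt;";
            else if (depth == 0 && c == '>') out += "&gt;";
            else out += c;
        }
    }
    return out;
}
```

Calling escape_outside_tags on the nested example from the original post should leave the inner [TAG] eee [/TAG] untouched while converting the outer pair.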

Maybe you could pretend you are writing a Perl program and get some help
from the real regex experts over on comp.lang.perl.misc
 
Fei Liu

Try:
const boost::regex expression("\\[TAG\\].*?\\[/TAG\\]");

The trick is the '?' after .*, which turns off greedy matching:
instead of matching up to the last occurrence of [/TAG], the pattern
stops at the first [/TAG], which may or may not be what you want.
Your problem seems to be less about boost::regex than about
regular-expression pattern matching in general. I would recommend
consulting the regular expression documentation first and
experimenting with simpler string patterns in boost::regex.
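
A small sketch of the non-greedy pattern on the simple, non-nested case. It uses std::regex from C++11 as a stand-in for boost::regex, on the assumption that the non-greedy syntax behaves the same way in both. The helper name tags_to_pre is hypothetical, and [\s\S] replaces the dot because the dot in std::regex's default grammar does not match newlines. The sketch does not escape < and > outside the tags; that would need a separate pass.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: rewrites each non-nested [TAG]...[/TAG] pair
// as <pre>...</pre>, keeping the text in between unchanged.
std::string tags_to_pre(const std::string& text) {
    // Backslashes are doubled because this is a C++ string literal;
    // the regex engine itself sees \[TAG\]([\s\S]*?)\[/TAG\].
    static const std::regex expression("\\[TAG\\]([\\s\\S]*?)\\[/TAG\\]");
    // $1 refers to the captured text between the two tags.
    return std::regex_replace(text, expression, "<pre>$1</pre>");
}
```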

Fei
 
Pete Becker

Fei said:
Try:
const boost::regex expression("\\[TAG\\].*?\\[/TAG\\]");

The trick is the '?' after .*, which turns off greedy matching:
instead of matching up to the last occurrence of [/TAG], the pattern
stops at the first [/TAG], which may or may not be what you want.

It's definitely not what's needed, since it will match the [TAG] of
the outer block to the [/TAG] of the inner block. See the "far more
complicated case" above.
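
To make the mismatch concrete, here is a short sketch that returns what the non-greedy pattern actually matches. It again assumes std::regex as a stand-in for boost::regex, and the helper name first_match is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the first non-greedy [TAG]...[/TAG]
// match in text, or an empty string if there is none.
std::string first_match(const std::string& text) {
    static const std::regex expression("\\[TAG\\][\\s\\S]*?\\[/TAG\\]");
    std::smatch m;
    return std::regex_search(text, m, expression) ? m.str(0) : std::string();
}
```

On the nested input, the match begins at the outer [TAG] and ends at the inner [/TAG], so the outer block is cut short and " G</p> [/TAG]" is left behind.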
Your problem seems to be less about boost::regex than about
regular-expression pattern matching in general. I would recommend
consulting the regular expression documentation first and
experimenting with simpler string patterns in boost::regex.

Right. Regular expressions don't deal well with recursive patterns.

--

-- Pete
Roundhouse Consulting, Ltd. (www.versatilecoding.com)
Author of "The Standard C++ Library Extensions: a Tutorial and
Reference." (www.petebecker.com/tr1book)
 
