Regular Expressions

Patrick Herber

Hello!
I am having trouble solving a regular expression problem.
I get a file that looks like this:
[...]
<w:p>
<w:r>
<w:c>TEXT ONE</w:c>
</w:r>
</w:p>
<w:p>
<w:r>
<w:c>CRITICAL TEXT</w:c>
</w:r>
</w:p>
<w:p>
<w:r>
<w:c>ANOTHER TEXT</w:c>
</w:r>
</w:p>
[...]
When I find the text "CRITICAL TEXT" inside a <w:c> tag, I have to
remove the whole enclosing block, from its opening <w:p> tag to its
closing </w:p> tag. Other tags can also be present inside this <w:p>
tag (this is actually an MS Word file saved in XML format), and the
text is not always neatly formatted (it can all be on one single line).
My attempt was to say: find text that contains <w:p>, followed by any
characters but not </w:p>, then CRITICAL TEXT, then any characters but
not <w:p>, then </w:p>.
I tried several patterns but did not find the correct one:

$content =~ s/<w:p>[\s\S]*?(?!<\/w:p>)CRITICAL TEXT[\s\S]*?(?!<w:p>)<\/w:p>//ig;

$content =~ s/<w:p>(?!.*?<\/w:p>)*?CRITICAL TEXT(?!.*?<w:p>)*?<\/w:p>//ig;

Can you please help me?

Thanks a lot!
Regards,
Patrick
 
Anno Siegel

Patrick Herber said:
[...]
My try was to say: find a text, which contains <w:p> followed by any
characters but not by </w:p> then followed by CRITICAL TEXT then
followed by any characters but not by <w:p> then followed by </w:p>.

It may be possible to do this with a single pattern match, but (as
you have found) the regular expressions you build soon get out of hand.
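
For the record, here is a sketch of what a single-pattern version could look like. It assumes <w:p> elements never nest and that the tags carry no attributes. In the attempts above, a lookahead such as (?!<\/w:p>) placed after [\s\S]*? is checked only at the one position where the lazy run stops, so it does not stop the run from crossing a </w:p>. The lookahead has to be checked before every single character instead. The sample string and the name $no_p are made up for illustration.

```perl
use strict;
use warnings;

# Sample input on a single physical line per paragraph (illustrative only).
my $content = join '', map { "<w:p><w:r><w:c>$_</w:c></w:r></w:p>\n" }
    'TEXT ONE', 'CRITICAL TEXT', 'ANOTHER TEXT';

# Exactly one character, provided it does not begin a <w:p> or </w:p> tag.
my $no_p = qr{(?:(?!</?w:p>).)}s;

# <w:p>, the critical text, then </w:p>, with no other <w:p> or </w:p>
# tag anywhere in between. The optional \n swallows a trailing newline.
$content =~ s{<w:p>$no_p*?CRITICAL TEXT$no_p*?</w:p>\n?}{}gi;

print $content;
```

Because every step between the tags is guarded, the match can neither start in an earlier paragraph nor run on into a later one.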

Consider a different approach: For each occurrence of
"<w:c>CRITICAL TEXT</w:c>" find the nearest <w:p> preceding it and
the nearest </w:p> succeeding it, and delete the text they span:

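my $str = do { local $/; <DATA> };    # slurp the whole input

while ( $str =~ m{<w:c>\s*CRITICAL TEXT\s*</w:c>}g ) {
    # \G marks where the match above ended. Keep everything before the
    # nearest preceding <w:p> ($1) and delete only the paragraph ($2).
    $str =~ s{(.*)(<w:p>.*\G.*?</w:p>)}{$1}s and print "deleted: |$2|\n";
}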

__DATA__
<w:p>
<w:r>
<w:c>TEXT ONE</w:c>
</w:r>
</w:p>
<w:p>
<w:r>
<w:c>CRITICAL TEXT</w:c>
</w:r>
</w:p>
<w:p>
<w:r>
<w:c>ANOTHER TEXT</w:c>
</w:r>
</w:p>
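
The same idea can also be sketched without a pattern that has \G in the middle, which the Perl documentation cautions is only fully supported at the very start of a pattern. This variant assumes non-nested <w:p> elements and uses rindex, index and four-argument substr. The names $open, $close and @deleted are illustrative.

```perl
use strict;
use warnings;

# Sample input laid out like the data above (illustrative only).
my $str = join '', map { "<w:p>\n<w:r>\n<w:c>$_</w:c>\n</w:r>\n</w:p>\n" }
    'TEXT ONE', 'CRITICAL TEXT', 'ANOTHER TEXT';

my ( $open, $close ) = ( '<w:p>', '</w:p>' );
my @deleted;

# No /g here: the string changes on every pass, so search from the start.
while ( $str =~ m{<w:c>\s*CRITICAL TEXT\s*</w:c>} ) {
    my $start = rindex( $str, $open,  $-[0] );    # last <w:p> at or before the match
    my $end   = index( $str, $close, $+[0] );     # first </w:p> at or after the match
    last if $start < 0 or $end < 0;               # unbalanced input: give up

    # Four-argument substr replaces the span with '' and returns what it removed.
    push @deleted, substr( $str, $start, $end + length($close) - $start, '' );
}

print "deleted: |$_|\n" for @deleted;
print $str;
```

Each pass removes a span that contains the match itself, so the loop always terminates.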

Anno
 
Paul Lalli

Patrick said:
[...]
When inside the <w:c> Tag I find the text "CRITICAL TEXT", then I have
to remove its whole row starting with the <w:p> tag to its closing
</w:p> tag. Inside this <w:p> Tag also other tags can be present

A critical piece of information is whether or not <w:p> tags themselves
can be nested. Assuming they cannot, I recommend this solution:
Read the file by "chunks", one <w:p>..</w:p> at a time. If the chunk
does not contain your CRITICAL TEXT, print/use/store it. Otherwise,
skip it.

Example:
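#!/usr/bin/perl
use strict;
use warnings;

# Read one record per closing </w:p> tag instead of one per line.
local $/ = '</w:p>';
while (my $chunk = <DATA>) {
    # Print only the chunks that do not contain the critical cell.
    print $chunk if index($chunk, '<w:c>CRITICAL TEXT</w:c>') == -1;
}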
__DATA__
<w:p>
<w:r>
<w:c>TEXT ONE</w:c>
</w:r>
</w:p>
<w:p>
<w:r>
<w:c>CRITICAL TEXT</w:c>
</w:r>
</w:p>
<w:p>
<w:r>
<w:c>ANOTHER TEXT</w:c>
</w:r>
</w:p>

Output:
<w:p>
<w:r>
<w:c>TEXT ONE</w:c>
</w:r>
</w:p>
<w:p>
<w:r>
<w:c>ANOTHER TEXT</w:c>
</w:r>
</w:p>


You may wish to also look into the $^I variable if you wish to edit the
file "in-place", as well as the File::Stream module if you want to set
the input record separator to anything more complicated than a string.

perldoc perlvar
http://search.cpan.org/~smueller/File-Stream-2.00/lib/File/Stream.pm
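
A minimal sketch of the $^I idea, under the same no-nesting assumption: localizing $^I and @ARGV switches <> to in-place editing for one file, so that print writes back into that file. The temporary file stands in for your real document and is created here only to keep the example self-contained.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical input file holding the sample paragraphs.
my ( $fh, $file ) = tempfile( UNLINK => 1 );
print {$fh} map { "<w:p>\n<w:r>\n<w:c>$_</w:c>\n</w:r>\n</w:p>\n" }
    'TEXT ONE', 'CRITICAL TEXT', 'ANOTHER TEXT';
close $fh or die "close $file: $!";

{
    # A defined $^I enables in-place editing for files read through <>.
    # The empty string keeps no backup; use e.g. '.bak' to keep one.
    local ( $^I, @ARGV ) = ( '', $file );
    local $/ = '</w:p>';
    while ( my $chunk = <> ) {
        # Inside this loop, print goes to the rewritten file.
        print $chunk if index( $chunk, '<w:c>CRITICAL TEXT</w:c>' ) == -1;
    }
}

# Read the edited file back to show the result.
open my $in, '<', $file or die "open $file: $!";
my $result = do { local $/; <$in> };
close $in;
print $result;
```

Keeping a backup suffix in $^I is the safer choice while you are still testing against real documents.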

Hope this helps,
Paul Lalli
 
Mark Clements

Patrick said:
[...]
When inside the <w:c> Tag I find the text "CRITICAL TEXT", then I have
to remove its whole row starting with the <w:p> tag to its closing
</w:p> tag. Inside this <w:p> Tag also other tags can be present (this
is actually a MS Word File saved in XML Format) and the text is also
not so clear formatted (it can be on one single line).

I may be missing something here, but if it's XML, why not use an XML
parser? There are various modules that might be suitable for your needs
on CPAN.
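
As a rough sketch of what that could look like, here is one way to do it with XML::LibXML, which is only one of the candidate modules. The wrapper element <w:body> and the namespace URI urn:example:word are placeholders I made up so the sample parses; a real Word XML file declares its own URI for the w: prefix on its root element, and you would register that one instead.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Placeholder namespace URI and wrapper element (assumptions, see above).
my $ns  = 'urn:example:word';
my $xml = qq{<w:body xmlns:w="$ns">}
    . join( '', map { "<w:p><w:r><w:c>$_</w:c></w:r></w:p>" }
        'TEXT ONE', 'CRITICAL TEXT', 'ANOTHER TEXT' )
    . '</w:body>';

my $doc = XML::LibXML->load_xml( string => $xml );
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs( w => $ns );    # bind the w: prefix for XPath queries

# Every <w:p> with a <w:c> descendant whose trimmed text is the critical text.
my $xpath = '//w:p[.//w:c[normalize-space(.) = "CRITICAL TEXT"]]';
for my $p ( $xpc->findnodes($xpath) ) {
    $p->unbindNode;              # detach the paragraph from the document
}

my $result = $doc->toString;
print $result, "\n";
```

The parser takes care of line breaks, attributes and other tags inside <w:p>, which are exactly the cases that make the regular expression approach fragile.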

Mark
 
