Regular Expressions

Patrick Herber · Nov 18, 2005

Hello!
I have some trouble trying to solve a regular expression problem:
I get a file, which looks like this one:
[...]
<w>
<w:r>
<w:c>TEXT ONE</w:c>
</w:r>
</w>
<w>
<w:r>
<w:c>CRITICAL TEXT</w:c>
</w:r>
</w>
<w>
<w:r>
<w:c>ANOTHER TEXT</w:c>
</w:r>
</w>
[...]
When inside the <w:c> tag I find the text "CRITICAL TEXT", then I have
to remove its whole row, starting with the <w> tag up to its closing
</w> tag. Inside this <w> tag other tags can also be present (this
is actually an MS Word file saved in XML format) and the text is also
not so clearly formatted (it can be on one single line).
My try was to say: find a text which contains <w> followed by any
characters but not by </w>, then followed by CRITICAL TEXT, then
followed by any characters but not by <w>, then followed by </w>.
I tried several patterns but I didn't find the correct one:

$content =~ s/<w>[\s\S]*?(?!<\/w>)CRITICAL TEXT[\s\S]*?(?!<w>)<\/w>//ig;

$content =~ s/<w>(?!.*?<\/w>)*?CRITICAL TEXT(?!.*?<w>)*?<\/w>//ig;

Can you please help me?

Thanks a lot!
Regards,
Patrick

Anno Siegel · Nov 18, 2005

Patrick Herber said:
Hello!
I have some trouble trying to solve a regular expression problem:
I get a file, which looks like this one:
[...]
<w>
<w:r>
<w:c>TEXT ONE</w:c>
</w:r>
</w>
<w>
<w:r>
<w:c>CRITICAL TEXT</w:c>
</w:r>
</w>
<w>
<w:r>
<w:c>ANOTHER TEXT</w:c>
</w:r>
</w>
[...]
When inside the <w:c> tag I find the text "CRITICAL TEXT", then I have
to remove its whole row, starting with the <w> tag up to its closing
</w> tag. Inside this <w> tag other tags can also be present (this
is actually an MS Word file saved in XML format) and the text is also
not so clearly formatted (it can be on one single line).
My try was to say: find a text which contains <w> followed by any
characters but not by </w>, then followed by CRITICAL TEXT, then
followed by any characters but not by <w>, then followed by </w>.

It may be possible to do this with a single pattern match, but (as
you have found) the regular expressions you build soon get out of hand.
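
For reference, a single-pattern version might look roughly like the sketch below. It is only an illustration under assumptions: the helper name remove_critical_blocks is made up here, the tags are taken to be <w> and </w> as in the quoted sample, and <w> blocks are assumed not to nest. The idea is to let "any character" advance only while no <w> or </w> starts at the current position, so a match can never run from one block into the next.

```perl
use strict;
use warnings;

# Hypothetical helper, not from the thread: remove each <w>...</w> block
# whose <w:c> cell holds exactly $needle. Assumes <w> blocks are not nested.
sub remove_critical_blocks {
    my ($content, $needle) = @_;
    $content =~ s{
        <w>                                 # opening tag of the block
        (?: (?! </?w> ) . )*?               # anything, but never across a block tag
        <w:c> \s* \Q$needle\E \s* </w:c>    # the cell with the critical text
        (?: (?! </?w> ) . )*?               # rest of the same block
        </w>                                # closing tag of the block
    }{}gsx;
    return $content;
}

# Build the three-block sample from the question and filter it.
my $sample = join '', map { "<w>\n<w:r>\n<w:c>$_</w:c>\n</w:r>\n</w>\n" }
    'TEXT ONE', 'CRITICAL TEXT', 'ANOTHER TEXT';
print remove_critical_blocks($sample, 'CRITICAL TEXT');
```

This prints the TEXT ONE and ANOTHER TEXT blocks; the newline that followed the removed block is left behind. The pattern is already harder to read than the problem statement, which is the point made above.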

Consider a different approach: For each occurrence of
"<w:c>CRITICAL TEXT</w:c>" find the nearest <w> preceding it and
the nearest </w> following it, and delete the text they span:

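my $str = do { local $/; <DATA> };   # slurp the whole document

while ( $str =~ m{<w:c>\s*CRITICAL TEXT\s*</w:c>}g ) {
    # \G is where the match above ended. $1 keeps the text before the
    # enclosing <w> so that only the block captured in $2 is deleted.
    $str =~ s{(.*)(<w>.*\G.*?</w>)}{$1}s and print "deleted: |$2|\n";
}

__DATA__
<w>
<w:r>
<w:c>TEXT ONE</w:c>
</w:r>
</w>
<w>
<w:r>
<w:c>CRITICAL TEXT</w:c>
</w:r>
</w>
<w>
<w:r>
<w:c>ANOTHER TEXT</w:c>
</w:r>
</w>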

Anno

Paul Lalli · Nov 18, 2005

Patrick said:
I have some trouble trying to solve a regular expression problem:
I get a file, which looks like this one:
[...]
<w>
<w:r>
<w:c>TEXT ONE</w:c>
</w:r>
</w>
<w>
<w:r>
<w:c>CRITICAL TEXT</w:c>
</w:r>
</w>
<w>
<w:r>
<w:c>ANOTHER TEXT</w:c>
</w:r>
</w>
[...]
When inside the <w:c> Tag I find the text "CRITICAL TEXT", then I have
to remove its whole row starting with the <w> tag to its closing
</w> tag. Inside this <w> Tag also other tags can be present

A critical piece of information is whether or not <w> tags themselves
can be nested. Assuming they cannot, I recommend this solution:
Read the file by "chunks", one <w>..</w> at a time. If the chunk
does not contain your CRITICAL TEXT, print/use/store it. Otherwise,
skip it.

Example:
#!/usr/bin/perl
use strict;
use warnings;

local $/ = '</w>';
while (my $chunk = <DATA>){
    print $chunk if index($chunk, '<w:c>CRITICAL TEXT</w:c>') == -1;
}
__DATA__
<w>
<w:r>
<w:c>TEXT ONE</w:c>
</w:r>
</w>
<w>
<w:r>
<w:c>CRITICAL TEXT</w:c>
</w:r>
</w>
<w>
<w:r>
<w:c>ANOTHER TEXT</w:c>
</w:r>
</w>

Output:
<w>
<w:r>
<w:c>TEXT ONE</w:c>
</w:r>
</w>
<w>
<w:r>
<w:c>ANOTHER TEXT</w:c>
</w:r>
</w>

You may wish to also look into the $^I variable if you wish to edit the
file "in-place", as well as the File::Stream module if you want to set
the input record separator to anything more complicated than a string.

perldoc perlvar
http://search.cpan.org/~smueller/File-Stream-2.00/lib/File/Stream.pm
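
As a rough illustration of the $^I suggestion, the chunk filter could be wrapped so that it rewrites a file in place. This is a sketch under assumptions, not tested against real Word output: the helper name drop_chunks_in_place is invented here, the closing tag is taken to be </w>, and a backup copy with the suffix .bak is kept next to the file.

```perl
use strict;
use warnings;

# Hypothetical helper: drop every chunk of $path (up to and including a
# closing </w>) that contains $needle, rewriting the file in place.
sub drop_chunks_in_place {
    my ($path, $needle) = @_;

    local $^I   = '.bak';     # enable in-place editing, keep "$path.bak"
    local @ARGV = ($path);    # <> reads the files named in @ARGV
    local $/    = '</w>';     # one record per closing tag

    while (my $chunk = <>) {
        # While $^I is in effect, print writes to the replacement file.
        print $chunk if index($chunk, $needle) == -1;
    }
    return;
}

# Usage: perl script.pl file1.xml file2.xml ...
my @files = @ARGV;
drop_chunks_in_place($_, '<w:c>CRITICAL TEXT</w:c>') for @files;
```

Copying @ARGV into @files first keeps the outer loop independent of the localized @ARGV inside the helper.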

Hope this helps,
Paul Lalli

Mark Clements · Nov 18, 2005

Patrick said:
Hello!
I have some trouble trying to solve a regular expression problem:
I get a file, which looks like this one:
[...]
<w>
<w:r>
<w:c>TEXT ONE</w:c>
</w:r>
</w>
<w>
<w:r>
<w:c>CRITICAL TEXT</w:c>
</w:r>
</w>
<w>
<w:r>
<w:c>ANOTHER TEXT</w:c>
</w:r>
</w>
[...]
When inside the <w:c> tag I find the text "CRITICAL TEXT", then I have
to remove its whole row, starting with the <w> tag up to its closing
</w> tag. Inside this <w> tag other tags can also be present (this
is actually an MS Word file saved in XML format) and the text is also
not so clearly formatted (it can be on one single line).

I may be missing something here, but if it's XML, why not use an XML
parser? There are various modules that might be suitable for your needs
on CPAN.
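
To give an idea of what that could look like, here is a minimal sketch using XML::LibXML from CPAN. Several things in it are assumptions rather than facts from the thread: the element names simply follow the quoted sample, the namespace URI urn:example:wordml and the <doc> wrapper are placeholders added so the fragment is well-formed, and remove_critical_rows is a made-up name. A real Word file declares its own namespace, which would have to be registered instead.

```perl
use strict;
use warnings;
use XML::LibXML;

# Placeholder document: element names as in the thread, namespace URI and
# <doc> wrapper invented so that the fragment parses.
my $xml = <<'XML';
<doc xmlns:w="urn:example:wordml">
<w><w:r><w:c>TEXT ONE</w:c></w:r></w>
<w><w:r><w:c>CRITICAL TEXT</w:c></w:r></w>
<w><w:r><w:c>ANOTHER TEXT</w:c></w:r></w>
</doc>
XML

# Hypothetical helper: remove every <w> element that contains a <w:c>
# cell whose trimmed text equals $needle, and return the new document.
sub remove_critical_rows {
    my ($source, $needle) = @_;
    my $doc = XML::LibXML->load_xml(string => $source);
    my $xpc = XML::LibXML::XPathContext->new($doc);
    $xpc->registerNs(w => 'urn:example:wordml');

    for my $cell ($xpc->findnodes('//w:c')) {
        my $text = $cell->textContent;
        $text =~ s/^\s+|\s+$//g;
        next unless $text eq $needle;
        # Nearest enclosing <w>, wherever the cell sits inside it.
        my ($row) = $xpc->findnodes('ancestor::w[1]', $cell);
        $row->unbindNode if $row;
    }
    return $doc->toString;
}

print remove_critical_rows($xml, 'CRITICAL TEXT');
```

Because the parser works on the element tree, this does not care whether the file is on one line or many, or which other tags sit between <w> and <w:c>.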

Mark
