Serious Perl Regular Expression deficiency?

robic0

I don't see a solution to this problem: regular
expressions can't exclude a string when processing.
They can exclude individual characters just fine.
I started doing Perl 2 years ago and have run into
this nagging problem several times.

After extensive reading of the Perl docs on regexes
(especially in the last 2 days) I have come to the
conclusion that regular expressions have a serious
deficiency. It is serious because "not this string"
is a fundamental piece of search logic, and a touted
master search engine should support it.
To a degree it works with a known subset of characters,
but it won't work to the degree shown below. This is a
serious flaw in regular expressions!

I hope you masters can prove me wrong! I really do.
If not, I would hope that the Perl authors can provide
some insight on when this construct can be fixed,
aka implemented.

Beat this code if you can (you can't). Don't look
at the code in this example, look instead at the
output.
Don't comment on any code syntax because that's not
welcome or the point.
Instead, refer your comments to the output IDs.

If you know of a way Perl regex can do this
please reply. I'm almost 99% sure Perl regex
can't do this. In fact the 1% is thrown out here
to either verify that or prove otherwise.

Thanks for your help...



print <<EOM;
\n# Serious Regular Expression deficiency,
# "not string", shown by XML comments..
# ----------------------------------------
EOM

use strict;
use warnings;

my $gabage1 = '
<big name="asdf" date="33" >
asdf
<!-- howdy folks -->
<in2>jjjj</in2>
<!-- and still more -->
asdfb
</big>
';

my $gabage2 = '
<big name="asdf" date="33" >
asdf
<!-- howdy folks %SYSTEM is down <who cares?> -->
<in2>jjjj</in2>
<!-- and still more -->
asdfb
</big>
';

my @sarrys = ($gabage1, $gabage2);
my $cnt = 1;
foreach my $xml (@sarrys) {
    print "\n\n","/"x40,"\nXML $cnt:\n$xml\n";
    # -------------
    $_ = $xml;
    print "="x40,
          "\n** regex: s/<!--(.*)-->//s\n",
          "-"x40,"\n";
    print "id: $cnt","1\n";
    while (s/<!--(.*)-->//s) { print "$1\n"; }
    # -------------
    $_ = $xml;
    print "\n","="x40,
          "\n** regex: s/<!--([^<>]*)-->//s\n",
          "-"x40,"\n";
    print "id: $cnt","2\n";
    while (s/<!--([^<>]*)-->//s) { print "$1\n"; }
    # -------------
    $_ = $xml;
    print "\n","="x40,
          "\n** regex: s/<!--([\\w\\s]*)(?!<!--)-->//s\n",
          "-"x40,"\n";
    print "id: $cnt","3\n";
    while (s/<!--([\w\s]*)(?!<!--)-->//s) { print "$1\n"; }
    # -------------
    $_ = $xml;
    print "\n","="x40,
          "\n** regex: s/<!--(.*)(?!<!--)-->//s\n",
          "-"x40,"\n";
    print "id: $cnt","4\n";
    while (s/<!--(.*)(?!<!--)-->//s) { print "$1\n"; }
    $cnt++;
}
__END__

C:\Drvs14\PerlMiscTest\Eraser\ESP\XMLP>perl test.pl

# Serious Regular Expression deficiency,
# "not string", shown by XML comments..
# ----------------------------------------


////////////////////////////////////////
XML 1:

<big name="asdf" date="33" >
asdf
<!-- howdy folks -->
<in2>jjjj</in2>
<!-- and still more -->
asdfb
</big>

========================================
** regex: s/<!--(.*)-->//s
----------------------------------------
id: 11
howdy folks -->
<in2>jjjj</in2>
<!-- and still more

========================================
** regex: s/<!--([^<>]*)-->//s
----------------------------------------
id: 12
howdy folks
and still more

========================================
** regex: s/<!--([\w\s]*)(?!<!--)-->//s
----------------------------------------
id: 13
howdy folks
and still more

========================================
** regex: s/<!--(.*)(?!<!--)-->//s
----------------------------------------
id: 14
howdy folks -->
<in2>jjjj</in2>
<!-- and still more


////////////////////////////////////////
XML 2:

<big name="asdf" date="33" >
asdf
<!-- howdy folks %SYSTEM is down <who cares?> -->
<in2>jjjj</in2>
<!-- and still more -->
asdfb
</big>

========================================
** regex: s/<!--(.*)-->//s
----------------------------------------
id: 21
howdy folks %SYSTEM is down <who cares?> -->
<in2>jjjj</in2>
<!-- and still more

========================================
** regex: s/<!--([^<>]*)-->//s
----------------------------------------
id: 22
and still more

========================================
** regex: s/<!--([\w\s]*)(?!<!--)-->//s
----------------------------------------
id: 23
and still more

========================================
** regex: s/<!--(.*)(?!<!--)-->//s
 
MikeGee

robic0 said:
while (s/<!--(.*)-->//s) { print "$1\n"; }

Your post is longer than I can concentrate to read carefully, but in the
above line, try:
s/<!--(.*?)-->//s
and see if there is a difference.
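
To see what the extra "?" changes, here is a minimal sketch. The sub name extract_comments and the sample string are made up for illustration. A greedy (.*) runs on to the last "-->" in the input, while the non-greedy (.*?) stops at the first one:

```perl
use strict;
use warnings;

# Return the text of every <!-- ... --> match in $xml, using either a
# greedy (.*) or a non-greedy (.*?) quantifier for the comment body.
sub extract_comments {
    my ($xml, $greedy) = @_;
    my $re = $greedy ? qr/<!--(.*)-->/s : qr/<!--(.*?)-->/s;
    my @found;
    while ($xml =~ s/$re//) {
        push @found, $1;
    }
    return @found;
}

my $xml = "<a><!-- one --><b>x</b><!-- two --></a>";

print "greedy:     [$_]\n" for extract_comments($xml, 1);
print "non-greedy: [$_]\n" for extract_comments($xml, 0);
```

The greedy call reports a single "comment" that spans both real comments and the markup between them, which is the pattern seen in output ids 11 and 21; the non-greedy call reports the two comments separately.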

I don't think you should make overarching comments on the deficiencies
of a system that works fine for everyone else. I bet most of the
really knowledgeable folks reading this newsgroup ignore your question
just because of the bad attitude.
 
castillo.bryan

robic0 said:
If you know of a way Perl regex can do this
please reply. I'm almost 99% sure Perl regex
can't do this. In fact the 1% is thrown out here
to either verify that or prove otherwise.

It's not clear what "this" is. Are you asking if Perl can do a negative
match on a string, pull out XML comments with a regex, or both?

If you are wondering about a negative string match, look at the perlre
documentation, specifically negative lookahead and lookbehind
assertions.
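
For the "not this string" case specifically, a negative lookahead can be tested in front of every character, so the body matches any run of text that does not contain the terminator. This is a minimal sketch of that idiom, not the only way to do it; the sub name comment_bodies is hypothetical:

```perl
use strict;
use warnings;

# (?:(?!-->).)* reads as: "take one character at a time, as long as
# the string '-->' does not start at the current position".
sub comment_bodies {
    my ($xml) = @_;
    my @bodies;
    while ($xml =~ /<!--((?:(?!-->).)*)-->/gs) {
        push @bodies, $1;
    }
    return @bodies;
}

my $xml = "<big>\n<!-- howdy folks %SYSTEM is down <who cares?> -->\n"
        . "<in2>jjjj</in2>\n<!-- and still more -->\n</big>\n";

print "Comment [$_]\n" for comment_bodies($xml);
```

Unlike the [^<>]* character class, this excludes only the three-character string "-->", so a comment containing "<" or ">" is still captured whole.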

If you want to pull out the contents of XML comments you could do this.


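# Print the document, then strip and report each comment in turn.
sub test_xml_comment_parse {
    my ($xml) = @_;
    print "XML\n", '-' x 40, "\n", $xml, "\n", '-' x 40, "\n";
    # Non-greedy .*? stops at the first "-->"; /s lets . match newlines.
    while ($xml =~ s/<!--(.*?)-->//ms) {
        print "Comment [$1]\n";
    }
    print "\n", '-' x 40, "\n\n\n";
}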

my $gabage1 = '
<big name="asdf" date="33" >
asdf
<!-- howdy folks -->
<in2>jjjj</in2>
<!-- and still more -->
asdfb
</big>
';

my $gabage2 = '
<big name="asdf" date="33" >
asdf
<!-- howdy folks %SYSTEM is down <who cares?> -->
<in2>jjjj</in2>
<!-- and still more -->
asdfb
</big>
';

test_xml_comment_parse($_) foreach ($gabage1,$gabage2);

output:

XML
----------------------------------------

<big name="asdf" date="33" >
asdf
<!-- howdy folks -->
<in2>jjjj</in2>
<!-- and still more -->
asdfb
</big>

----------------------------------------
Comment [ howdy folks ]
Comment [ and still more ]

----------------------------------------


XML
----------------------------------------

<big name="asdf" date="33" >
asdf
<!-- howdy folks %SYSTEM is down <who cares?> -->
<in2>jjjj</in2>
<!-- and still more -->
asdfb
</big>

----------------------------------------
Comment [ howdy folks %SYSTEM is down <who cares?> ]
Comment [ and still more ]

----------------------------------------







There is a problem though. If you need to retrieve data from XML
documents, you should generally use an XML parser instead of writing your
own regular expressions.

Here is one case where the code I posted above would pull out the text
"not really a comment", which isn't really a comment.

<test_xml>
<value>
<![CDATA[ <!-- not really a comment --> ]]>
</value>
</test_xml>
 
Eric J. Roode

robic0 wrote in
I don't see a solution to this problem that
regular expressions can't exclude a string when
processing. It can exclude individual characters
fine. I started doing Perl 2 years ago and have
run into this nagging problem several times.

It's hard to figure out what you're expecting to find. You never once said
what you *want* the output to be.

I'm *guessing* that you want only the XML comments to be printed, and
nothing else.

I came up with a regex in about two minutes that produces this output:

id: 15
howdy folks
and still more

id: 25
howdy folks %SYSTEM is down <who cares?>
and still more

Is that the output you wanted?

--
Eric
`$=`;$_=\%!;($_)=/(.)/;$==++$|;($.,$/,$,,$\,$",$;,$^,$#,$~,$*,$:,@%)=(
$!=~/(.)(.).(.)(.)(.)(.)..(.)(.)(.)..(.)......(.)/,$"),$=++;$.++;$.++;
$_++;$_++;($_,$\,$,)=($~.$"."$;$/$%[$?]$_$\$,$:$%[$?]",$"&$~,$#,);$,++
;$,++;$^|=$";`$_$\$,$/$:$;$~$*$%[$?]$.$~$*${#}$%[$?]$;$\$"$^$~$*.>&$=`
 
robic0

Thanks a lot.

Yes the first occurrence (?) does the trick: /<!--(.*?)-->/
And given that nesting is not allowed here, this will do it.
This had worked for me before; I should have stuck with it.
The //m is not really of help here since the XML could
be without newlines.

I found the XML spec at
http://www.w3.org/TR/1998/REC-xml-19980210#sec-cdata-sect
I will use that to finish this code.

About the CDATA thing you mentioned: no, that's not really a
problem. The order of the regexes is such that "all" non-markup
items are processed out first.

So in this case all CDATA will be removed first, followed by
all comments and any other weird ones like versioning.

I like the spec; it makes it easy to write the regex.
Quote:

[18] CDSect ::= CDStart CData CDEnd
[19] CDStart ::= '<![CDATA['
[20] CData ::= (Char* - (Char* ']]>' Char*))
[21] CDEnd ::= ']]>'


Within a CDATA section, only the CDEnd string is recognized as markup,
so that left angle brackets and ampersands may occur in their literal
form; they need not (and cannot) be escaped using "&lt;" and "&amp;".
CDATA sections cannot nest.

An example of a CDATA section, in which "<greeting>" and "</greeting>"
are recognized as character data, not markup:

<![CDATA[<greeting>Hello, world!</greeting>]]>

..
..
..
One more thing:
If you are wondering about a negative string match, look at the perlre
documentation, specifically negative lookahead and lookbehind
assertions.

Yes, I looked at it and tried the assertions quite a bit.
In this context, /(.*)(?!string)/s, it doesn't seem to work.
This, however, /(\w*)(?!string)/, seems to work, but only if the
string has certain characters.
Don't know why.

I won't be on for a couple of days while I install a new raid array.
Anyway thanks for the help.
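
A likely explanation for the behaviour described above, sketched with made-up sample strings: a lookahead only tests the one position where the engine currently stands. In /(.*)(?!string)/s the greedy .* runs to the end of the input, nothing follows there, and so (?!string) succeeds trivially. With \w* the match stops at the first non-word character, and if the excluded string starts right there the engine simply backs off one character. The sub name text_before is hypothetical.

```perl
use strict;
use warnings;

my $text = "keep this string out";

# Greedy .* swallows the whole input first; the lookahead is then tested
# at the very end, where nothing follows, so it succeeds.
my ($all) = $text =~ /(.*)(?!string)/s;
print "[$all]\n";     # [keep this string out]

# \w* stops in front of "-->", the lookahead fails there, and the engine
# backs off one character to a position where it succeeds.
my ($word) = "abc-->" =~ /^(\w*)(?!-->)/;
print "[$word]\n";    # [ab]

# Testing the lookahead before every single character gives the
# "anything except this string" behaviour.
sub text_before {
    my ($text, $stop) = @_;
    my ($head) = $text =~ /^((?:(?!\Q$stop\E).)*)/s;
    return $head;
}

print "[", text_before($text, "string"), "]\n";    # [keep this ]
```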
 
Matt Garrish

About the CDATA thing you mentioned. No, thats not really a
problem. The order of the regex is such that "all" non-markup
items are processed out first.

So in this case all CDATA will be removed first followed by
all comments and any other weird ones like versioning.

Please *read* the spec. CDATA blocks have nothing to do with comments;
they're sections of data where all the characters inside are treated as
literals (sort of like how single quoting in Perl lets you use $, @ and
% as plain characters).

Matt
 
Tad McClellan

robic0 said:
Yes the first occurance (?) does the trick /<!--(.*?)-->/
The //m is not really of help here

Right.


since the xml could
be without newlines.


But not for that reason.

The //m is not really of help here because it modifies the meaning
of ^ and $, but your pattern does not contain either of those.

The //m modifier is a no-op with the pattern you are using.
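
A small sketch of the distinction, using an invented sample string: /s changes what "." matches, while /m changes where ^ and $ match.

```perl
use strict;
use warnings;

my $text = "first\n<!-- a\nb -->\nlast";

# Without /s, "." refuses to cross a newline, so the comment is missed.
print "no flags: ", ($text =~ /<!--(.*?)-->/  ? "match" : "no match"), "\n";
# /s lets "." match newlines; this is the flag the pattern needs.
print "/s:       ", ($text =~ /<!--(.*?)-->/s ? "match" : "no match"), "\n";
# /m alone changes only ^ and $, which this pattern does not use.
print "/m:       ", ($text =~ /<!--(.*?)-->/m ? "match" : "no match"), "\n";

# /m matters once an anchor appears: ^ now matches after each newline.
my @starts = $text =~ /^(\w+)/mg;
print "line starts: @starts\n";    # line starts: first b last
```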

I found xml specs from
http://www.w3.org/TR/1998/REC-xml-19980210#sec-cdata-sect
I will use that to finish this code.


What a revolutionary idea!

Sometimes it takes a true visionary to come up with a radically
beneficial paradigm shift!

Yes I looked at it and tried the assertions quite a bit,
in this context /(.*)(?!string)/s it doesen't seem to work.


If you post a short and complete program that we can run that
duplicates your problem, then we might have a chance at
solving your problem.

But since you haven't, all we can do is offer our sympathy.

Sorry it doesn't seem to work.
 
castillo.bryan

robic0 wrote:
I found xml specs from
http://www.w3.org/TR/1998/REC-xml-19980210#sec-cdata-sect
I will use that to finish this code.

Just out of curiosity, is there a reason you don't want to use an
existing module for parsing XML, such as Expat, LibXML, etc...?
 
castillo.bryan

robic0 wrote:
About the CDATA thing you mentioned. No, thats not really a
problem. The order of the regex is such that "all" non-markup
items are processed out first.

So in this case all CDATA will be removed first followed by
all comments and any other weird ones like versioning.

Don't forget about XML processing instructions; you should handle those
too.

<test_xml>
<value>
<?proc <!-- not really a comment --> ?>
</value>
</test_xml>
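
One way to keep comments, CDATA sections and processing instructions from hiding inside one another is to scan left to right with a single alternation, so that whichever construct opens first consumes everything up to its own terminator. A rough sketch follows. The sub name scan_non_markup and the token labels are invented here, and this is nowhere near a conforming XML parser (no DOCTYPE handling, no well-formedness checks):

```perl
use strict;
use warnings;

# Walk the input once. At each position the first alternative that
# matches wins, so a "<!--" inside a CDATA section or a processing
# instruction is never seen as the start of a comment.
sub scan_non_markup {
    my ($xml) = @_;
    my @tokens;
    while ($xml =~ /\G(?:
              <!\[CDATA\[ (.*?) \]\]>   # group 1: CDATA contents
            | <!--        (.*?) -->     # group 2: comment body
            | <\?         (.*?) \?>     # group 3: processing instruction
            | [^<]+                     # plain text up to the next markup
            | <                         # any other markup start
        )/gcsx) {
        if    (defined $1) { push @tokens, [ CDATA   => $1 ] }
        elsif (defined $2) { push @tokens, [ COMMENT => $2 ] }
        elsif (defined $3) { push @tokens, [ PI      => $3 ] }
    }
    return @tokens;
}

my $xml = <<'XML';
<test_xml>
<value><![CDATA[ <!-- not really a comment --> ]]></value>
<?proc <!-- not a comment either --> ?>
<!-- a real comment -->
</test_xml>
XML

printf "%-7s [%s]\n", @$_ for scan_non_markup($xml);
```

Run against the two problem cases from this thread, only the last span comes back tagged COMMENT; the comment-shaped text inside the CDATA section and inside the processing instruction stays part of its enclosing token.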
 
robic0

On Fri, 23 Dec 2005 15:17:21 -0800, robic0 wrote:

Thanks for the patients folks. Hope you had a happy
25th. I started back on this problem a few hours ago.

Initially, this was a nesting problem I couldn't figure
out how to solve with regular expressions. I'm doing a xml
parser using just regex so I want to get this right.
I have concentrated on the docs on regex for this and
oh my god its got problems. I would like the writers
of Perl and Larry Wall to take a look at the code below.
It encapsulates the logic, however cumbersome it may be,
of what it takes to implement the "not this string"
in the regular expression machine. Don't ask me to
explain that phrase. I think this is a pristine solution
to what I'm doing however. In other words, given the
XML specifications, this will always work.

XML in general doesn't allow markup nesting (or so
I imagine), for the obvious reason: "markup" is the
set of characters that act as delimiters, both start
and end of an expression.

The only problem (for regex, that is) is that some constructs
like "Comments" and "CDATA" conflict, in that each may contain
the other's delimiters, which can result in a deadlock.

Most SAX or stream parsers get away with this because
they have anchors and process from beginning to end.

I use a substitution method in the parser code I've written
that nullifies anchors. I've been using this method
for years on other things. Hey now, doesn't that sound like
something the regular expression authors use?
Yeah, but they fell down on this one.
Look at what I did here.
I've assumed CDATA nesting and comment nesting is illegal,
and it is. I've "assumed" an anchor on one; it could have
been either one. The logic uses the limited ability of
regex to capture (hog) all the data; indeed it depends
upon it.

Look at this code very carefully: nesting is not allowed,
and that is the "only" reason it works. Of course nesting will
throw an error in production code. This code will be
merged with the primary code, along with more XML-spec-specific changes.

Why am I doing this? I don't know, I have a couple of weeks
free I guess...

Thanks for the comments!

use strict;
use warnings;

$_ = '
<![CDATA[ <!-- imbed comment --> some text <!-- imbed as well -->]]>

<!--
wasdfvgasvbg <![CDATA[ not really a CDATA ]]>
<tag>at tag in a real comment</tag>
<![CDATA[ not a CDATA ]]>
-->

<!-- This is a real comment -->

';

#### This section of the parser deals with
#### circular non-markup imbedding issues
#### (one inside the other, and so forth).
#### So far just comments & CDATA.
#### Use the general substitution magic.
#### This is valid because nesting of neither
#### comments nor CDATA is allowed.

my $cnt = 1;
my %root = ();
my %cdata_elements = ();

print "\n";

# -- Comments (done first) --
# Replace every comment-shaped span with a numbered placeholder like [1].
while (s/(<!--(.*?)-->)/[$cnt]/s) {
    $root{$cnt} = $1;
    print "$cnt = Questionable comment: $1\n";
    $cnt++;
}
print "\n\n", '=' x 60, "\n\nThe \"Real\" Stuff -->\n\n";

# -- CDATA (done second) --
while (s/<!\[CDATA\[(.*?)\]\]>/[$cnt]/s) {
    # Reconstitute the CDATA contents by putting back any "comments"
    # that were really just text inside the CDATA section.
    my $cdata_contents = $1;
    my $str = '';
    while ($cdata_contents =~ s/([^\[\]]+)|\[(\d+)\]//) {
        if (defined $1) {
            $str .= $1;
        }
        elsif (defined $2 && exists $root{$2}) {
            $str .= $root{$2};
            delete $root{$2};
        }
        else {
            my $j = 0;    # shouldn't get here
        }
    }
    $root{$cnt} = $str;
    $cdata_elements{$cnt} = '';

    print "\n$cnt = REAL CDATA: $root{$cnt}\n";
    $cnt++;
}

# -- Process leftover comments that are real --
while (my ($key, $val) = each %root) {
    if (!defined $cdata_elements{$key}) {
        # This $root re-assignment is not really necessary
        # since $1 will contain the processing text that
        # will be processed here, then never used again.
        $root{$key} =~ s/<!--(.*?)-->/$1/s;
        print "\n$key = REAL COMMENT: $root{$key}\n";    # Or $1
    }
}
__END__

1 = Questionable comment: <!-- imbed comment -->
2 = Questionable comment: <!-- imbed as well -->
3 = Questionable comment: <!--
wasdfvgasvbg <![CDATA[ not really a CDATA ]]>
<tag>at tag in a real comment</tag>
<![CDATA[ not a CDATA ]]>
-->
4 = Questionable comment: <!-- This is a real comment -->

============================================================

The "Real" Stuff -->


5 = REAL CDATA: <!-- imbed comment --> some text <!-- imbed as well
-->

4 = REAL COMMENT: This is a real comment

3 = REAL COMMENT:
wasdfvgasvbg <![CDATA[ not really a CDATA ]]>
<tag>at tag in a real comment</tag>
<![CDATA[ not a CDATA ]]>
 
robic0

castillo.bryan wrote:
Just out of curiosity, is there a reason you don't want to use an
existing module for parsing XML, such as Expat, LibXML, etc...?

I'm thinking that this thing I'm doing is going to blow the doors
off SAX. But, who knows...
 
robic0

MikeGee said:
I don't think you should make overarching comments on the deficiencies
of a system that works fine for everyone else. I bet most of the
really knowledgeable folks reading this newsgroup ignore your question
just because of the bad attitude.

I think you should look past your navel on these issues.
You are only looking at the tip of the iceberg. Why should these issues
be of concern to anyone? It's a simple capability that regex
falls down on really badly: to have a match expression that
excludes a specific "string", then resets the counter. The
match won't happen on (.*) but not this "ASDF".
Do you understand that, Mike?
 
robic0

On Fri, 23 Dec 2005 15:17:21 -0800, robic0 wrote:

I'm back on the job.
I'm going to post some new code this week that
complies with XML spec.

This is the solution for the Comment/CDATA paradigm
that will be incorporated in the new version. It is the
same code, with the same output, as in my earlier post
in this thread.
 
Tad McClellan

robic0 said:
On Fri, 23 Dec 2005 15:17:21 -0800, robic0 wrote:

Thanks for the patients folks.


You crack me up, doctor!

I'm doing a xml
parser using just regex so I want to get this right.


That is mathematically impossible you know.

You will be working on it for a long long time, and never have it right.

I have concentrated on the docs on regex for this and
oh my god its got problems.


Yes, parsing a Context Free language using a Regular grammar
is simply not possible.

(but Perl's regular expressions aren't actually "regular" at all.)
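
As an illustration of that last point, and assuming Perl 5.10 or later (which postdates this thread), a pattern can call itself recursively and so match arbitrarily deep nesting, something no truly regular expression can do. A minimal sketch, with the sub name is_balanced made up for the example:

```perl
use strict;
use warnings;

# (?1) re-enters capture group 1, so the group can contain nested
# copies of itself. This recognises balanced parentheses.
my $balanced = qr/^(\((?:[^()]++|(?1))*\))$/;

sub is_balanced {
    my ($s) = @_;
    return $s =~ $balanced ? 1 : 0;
}

print "$_ => ", (is_balanced($_) ? "balanced" : "not balanced"), "\n"
    for "(a(b)(c(d)))", "(a(b)", "())";
```

Being able to write such a pattern does not make a regex-only XML parser a good idea; it only shows that the limits of textbook regular grammars are not exactly the limits of Perl's engine.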

I would like the writers
of Perl and Larry Wall to take a look at the code below.


I would like you to read the Dragon Book.

In other words, given the
XML specifications, this will always work.


Yeah, right.
 
robic0

You crack me up dude, it works great. The code is integrated into
the main work. No problems whatsoever. I'm not going to invest time
in code that doesn't work. Never have, never will.
 
