Serious Perl Regular Expression deficiency?

Discussion in 'Perl Misc' started by robic0, Dec 23, 2005.

  1. robic0

    robic0 Guest

    I don't see a solution to this problem that
    regular expressions can't exclude a string when
    processing. It can exclude individual characters
    fine. I started doing Perl 2 years ago and have
    run into this nagging problem several times.

    After extensive reading of the Perl docs on re's
    (especially in the last 2 days) I have come to the
    conclusion that regular expressions have a serious
    deficiency. This is serious because the "not string"
    is a fundamental basic logic idea in a search from
    a touted master search engine, or should be.
    To a degree it works with a known subset, but it
    won't work to the degree shown below. This is a
    serious flaw in regular expressions!

    I hope you masters can prove me wrong! I really do.
    If not I would hope that the Perl authors can provide
    some insight on when this construct can be fixed,
    aka implemented.

    Beat this code if you can (you can't). Don't look
    at the code in this example, look instead at the
    output.
    Don't comment on any code syntax because that's not
    welcome or the point.
    Instead, refer your comments to the output IDs.

    If you know of a way Perl regex can do this
    please reply. I'm almost 99% sure Perl regex
    can't do this. In fact the 1% is thrown out here
    to either verify that or prove otherwise.

    Thanks for your help...



    print <<EOM;
    \n# Serious Regular Expression deficiency,
    # "not string", shown by XML comments..
    # ----------------------------------------
    EOM

    use strict;
    use warnings;

    my $gabage1 = '
    <big name="asdf" date="33" >
    asdf
    <!-- howdy folks -->
    <in2>jjjj</in2>
    <!-- and still more -->
    asdfb
    </big>
    ';

    my $gabage2 = '
    <big name="asdf" date="33" >
    asdf
    <!-- howdy folks %SYSTEM is down <who cares?> -->
    <in2>jjjj</in2>
    <!-- and still more -->
    asdfb
    </big>
    ';

    my @sarrys = ($gabage1, $gabage2);
    my $cnt = 1;
    foreach my $xml (@sarrys) {
    print "\n\n","/"x40,"\nXML $cnt:\n$xml\n";
    # -------------
    $_ = $xml;
    print "="x40,
    "\n** regex: s/<!--(.*)-->//s\n",
    "-"x40,"\n";
    print "id: $cnt","1\n";
    while (s/<!--(.*)-->//s) { print "$1\n"; }
    # -------------
    $_ = $xml;
    print "\n","="x40,
    "\n** regex: s/<!--([^<>]*)-->//s\n",
    "-"x40,"\n";
    print "id: $cnt","2\n";
    while (s/<!--([^<>]*)-->//s) { print "$1\n"; }
    # -------------
    $_ = $xml;
    print "\n","="x40,
    "\n** regex: s/<!--([\\w\\s]*)(?!<!--)-->//s\n",
    "-"x40,"\n";
    print "id: $cnt","3\n";
    while (s/<!--([\w\s]*)(?!<!--)-->//s) { print "$1\n"; }
    # -------------
    $_ = $xml;
    print "\n","="x40,
    "\n** regex: s/<!--(.*)(?!<!--)-->//s\n",
    "-"x40,"\n";
    print "id: $cnt","4\n";
    while (s/<!--(.*)(?!<!--)-->//s) { print "$1\n"; }
    $cnt++;
    }
    __END__

    C:\Drvs14\PerlMiscTest\Eraser\ESP\XMLP>perl test.pl

    # Serious Regular Expression deficiency,
    # "not string", shown by XML comments..
    # ----------------------------------------


    ////////////////////////////////////////
    XML 1:

    <big name="asdf" date="33" >
    asdf
    <!-- howdy folks -->
    <in2>jjjj</in2>
    <!-- and still more -->
    asdfb
    </big>

    ========================================
    ** regex: s/<!--(.*)-->//s
    ----------------------------------------
    id: 11
    howdy folks -->
    <in2>jjjj</in2>
    <!-- and still more

    ========================================
    ** regex: s/<!--([^<>]*)-->//s
    ----------------------------------------
    id: 12
    howdy folks
    and still more

    ========================================
    ** regex: s/<!--([\w\s]*)(?!<!--)-->//s
    ----------------------------------------
    id: 13
    howdy folks
    and still more

    ========================================
    ** regex: s/<!--(.*)(?!<!--)-->//s
    ----------------------------------------
    id: 14
    howdy folks -->
    <in2>jjjj</in2>
    <!-- and still more


    ////////////////////////////////////////
    XML 2:

    <big name="asdf" date="33" >
    asdf
    <!-- howdy folks %SYSTEM is down <who cares?> -->
    <in2>jjjj</in2>
    <!-- and still more -->
    asdfb
    </big>

    ========================================
    ** regex: s/<!--(.*)-->//s
    ----------------------------------------
    id: 21
    howdy folks %SYSTEM is down <who cares?> -->
    <in2>jjjj</in2>
    <!-- and still more

    ========================================
    ** regex: s/<!--([^<>]*)-->//s
    ----------------------------------------
    id: 22
    and still more

    ========================================
    ** regex: s/<!--([\w\s]*)(?!<!--)-->//s
    ----------------------------------------
    id: 23
    and still more

    ========================================
    ** regex: s/<!--(.*)(?!<!--)-->//s
    ----------------------------------------
    id: 24
    howdy folks %SYSTEM is down <who cares?> -->
    <in2>jjjj</in2>
    <!-- and still more
    robic0, Dec 23, 2005
    #1

  2. robic0

    MikeGee Guest

    robic0 wrote:
    > while (s/<!--(.*)-->//s) { print "$1\n"; }


    Your post is longer than I can concentrate to read carefully, but in the
    above line, try:
    s/<!--(.*?)-->//s
    and see if there is a difference.
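    As a minimal sketch of what that one-character change does (the sample
    string and variable names below are made up for illustration, cut down
    from the XML in the first post):

```perl
use strict;
use warnings;

# Shortened copy of the sample XML from the first post.
my $xml = "<!-- howdy folks -->\n<in2>jjjj</in2>\n<!-- and still more -->\n";

# Greedy: .* runs to the end of the string and backs off only as far as
# the LAST "-->", so both comments are swallowed by a single match.
my @greedy = $xml =~ /<!--(.*)-->/sg;

# Non-greedy: .*? stops at the FIRST "-->" that follows each "<!--".
my @lazy = $xml =~ /<!--(.*?)-->/sg;

print scalar(@greedy), " greedy match(es)\n";
print scalar(@lazy), " non-greedy match(es)\n";
print "[$_]\n" for @lazy;
```

    With the greedy pattern both comments and the markup between them land
    in a single capture, which is the same shape as the id 11 output in the
    first post.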

    I don't think you should make over-arching comments on the deficiencies
    of a system that works fine for everyone else. I bet most of the
    really knowledgeable folks reading this newsgroup ignore your question
    just because of the bad attitude.
    MikeGee, Dec 24, 2005
    #2

  3. robic0

    Guest

    robic0 wrote:
    > I don't see a solution to this problem that
    > regular expressions can't exclude a string when
    > processing. It can exclude individual characters
    > fine. I started doing Perl 2 years ago and have
    > run into this nagging problem several times.
    >
    > After extensive reading of the Perl docs on re's
    > (especially in the last 2 days) I have come to the
    > conclusion that regular expressions have a serious
    > deficiency. This is serious because the "not string"
    > is a fundamental basic logic idea in a search from
    > a touted master search engine, or should be.
    > To a degree it works with a known subset, but it
    > won't work to the degree shown below. This is a
    > serious flaw in regular expressions!
    >
    > I hope you masters can prove me wrong! I really do.
    > If not I would hope that the Perl authors can provide
    > some insight on when this construct can be fixed,
    > aka implemented.
    >
    > Beat this code if you can (you can't). Don't look
    > at the code in this example, look instead at the
    > output.
    > Don't comment on any code syntax because that's not
    > welcome or the point.
    > Instead, refer your comments to the output IDs.
    >
    > If you know of a way Perl regex can do this
    > please reply. I'm almost 99% sure Perl regex
    > can't do this. In fact the 1% is thrown out here
    > to either verify that or prove otherwise.
    >


    It's not clear what "this" is. Are you asking if Perl can do a negative
    match on a string, pull out XML comments with a regex, or both?

    If you are wondering about a negative string match, look at the perlre
    documentation, specifically negative lookahead and lookbehind
    assertions.
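    One common way to spell "anything except this string" with a negative
    lookahead is to test the assertion before every single character. A
    sketch, assuming the goal is the comment bodies from the second sample
    (variable names are illustrative only):

```perl
use strict;
use warnings;

my $xml = "<!-- howdy folks %SYSTEM is down <who cares?> -->\n"
        . "<in2>jjjj</in2>\n"
        . "<!-- and still more -->\n";

# (?:(?!-->).)* reads: "any character, as long as the string '-->' does
# not start at this position", repeated. The lookahead is tested before
# EVERY character, which is what makes it behave like "not this string".
my @comments = $xml =~ /<!--((?:(?!-->).)*)-->/sg;

print "[$_]\n" for @comments;
```

    Unlike [^<>]*, this does not care which individual characters appear in
    the comment, only whether the whole terminator "-->" starts at the
    current position.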

    If you want to pull out the contents of XML comments you could do this.


    sub test_xml_comment_parse {
    my ($xml) = @_;
    print "XML\n", '-' x 40, "\n", $xml, "\n", '-' x 40, "\n";
    while ($xml =~ s/<!--(.*?)-->//ms) {
    print "Comment [$1]\n"
    }
    print "\n", '-' x 40, "\n\n\n";
    }

    my $gabage1 = '
    <big name="asdf" date="33" >
    asdf
    <!-- howdy folks -->
    <in2>jjjj</in2>
    <!-- and still more -->
    asdfb
    </big>
    ';

    my $gabage2 = '
    <big name="asdf" date="33" >
    asdf
    <!-- howdy folks %SYSTEM is down <who cares?> -->
    <in2>jjjj</in2>
    <!-- and still more -->
    asdfb
    </big>
    ';

    test_xml_comment_parse($_) foreach ($gabage1,$gabage2);

    output:

    XML
    ----------------------------------------

    <big name="asdf" date="33" >
    asdf
    <!-- howdy folks -->
    <in2>jjjj</in2>
    <!-- and still more -->
    asdfb
    </big>

    ----------------------------------------
    Comment [ howdy folks ]
    Comment [ and still more ]

    ----------------------------------------


    XML
    ----------------------------------------

    <big name="asdf" date="33" >
    asdf
    <!-- howdy folks %SYSTEM is down <who cares?> -->
    <in2>jjjj</in2>
    <!-- and still more -->
    asdfb
    </big>

    ----------------------------------------
    Comment [ howdy folks %SYSTEM is down <who cares?> ]
    Comment [ and still more ]

    ----------------------------------------







    There is a problem though. If you need to retrieve data from XML
    documents, you should generally use an XML parser instead of using your
    own regular expressions.

    Here is one case where the code I posted above would pull out the text
    "not really a comment", which isn't really a comment.

    <test_xml>
    <value>
    <![CDATA[ <!-- not really a comment --> ]]>
    </value>
    </test_xml>
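
    A quick way to see this, as a sketch with the snippet above pasted into
    a string (variable names are illustrative):

```perl
use strict;
use warnings;

my $xml = "<test_xml>\n"
        . "<value>\n"
        . "<![CDATA[ <!-- not really a comment --> ]]>\n"
        . "</value>\n"
        . "</test_xml>\n";

# The comment regex has no notion of CDATA sections, so it happily
# matches the comment-like text inside one.
my @found = $xml =~ /<!--(.*?)-->/sg;

print "Comment [$_]\n" for @found;
```

    The regex reports one "comment" even though, to an XML parser, this
    document contains only character data inside <value>.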
    , Dec 24, 2005
    #3
  4. robic0 wrote in news::

    > I don't see a solution to this problem that
    > regular expressions can't exclude a string when
    > processing. It can exclude individual characters
    > fine. I started doing Perl 2 years ago and have
    > run into this nagging problem several times.


    It's hard to figure out what you're expecting to find. You never once said
    what you *want* the output to be.

    I'm *guessing* that you want only the XML comments to be printed, and
    nothing else.

    I came up with a regex in about two minutes that produces this output:

    id: 15
    howdy folks
    and still more

    id: 25
    howdy folks %SYSTEM is down <who cares?>
    and still more

    Is that the output you wanted?

    --
    Eric
    `$=`;$_=\%!;($_)=/(.)/;$==++$|;($.,$/,$,,$\,$",$;,$^,$#,$~,$*,$:,@%)=(
    $!=~/(.)(.).(.)(.)(.)(.)..(.)(.)(.)..(.)......(.)/,$"),$=++;$.++;$.++;
    $_++;$_++;($_,$\,$,)=($~.$"."$;$/$%[$?]$_$\$,$:$%[$?]",$"&$~,$#,);$,++
    ;$,++;$^|=$";`$_$\$,$/$:$;$~$*$%[$?]$.$~$*${#}$%[$?]$;$\$"$^$~$*.>&$=`
    Eric J. Roode, Dec 24, 2005
    #4
  5. robic0

    robic0 Guest

    On 23 Dec 2005 20:13:08 -0800, wrote:

    >robic0 wrote:
    >> I don't see a solution to this problem that
    >> regular expressions can't exclude a string when
    >> processing. It can exclude individual characters
    >> fine. I started doing Perl 2 years ago and have
    >> run into this nagging problem several times.
    >>
    >> After extensive reading of the Perl docs on re's
    >> (especially in the last 2 days) I have come to the
    >> conclusion that regular expressions have a serious
    >> deficiency. This is serious because the "not string"
    >> is a fundamental basic logic idea in a search from
    >> a touted master search engine, or should be.
    >> To a degree it works with a known subset, but it
    >> won't work to the degree shown below. This is a
    >> serious flaw in regular expressions!
    >>
    >> I hope you masters can prove me wrong! I really do.
    >> If not I would hope that the Perl authors can provide
    >> some insight on when this construct can be fixed,
    >> aka implemented.
    >>
    >> Beat this code if you can (you can't). Don't look
    >> at the code in this example, look instead at the
    >> output.
    >> Don't comment on any code syntax because that's not
    >> welcome or the point.
    >> Instead, refer your comments to the output IDs.
    >>
    >> If you know of a way Perl regex can do this
    >> please reply. I'm almost 99% sure Perl regex
    >> can't do this. In fact the 1% is thrown out here
    >> to either verify that or prove otherwise.
    >>

    >
    >It's not clear what "this" is. Are you asking if Perl can do a negative
    >match on a string, pull out XML comments with a regex, or both?
    >
    >If you are wondering about a negative string match, look at the perlre
    >documentation, specifically negative lookahead and lookbehind
    >assertions.
    >
    >If you want to pull out the contents of XML comments you could do this.
    >
    >
    >sub test_xml_comment_parse {
    > my ($xml) = @_;
    > print "XML\n", '-' x 40, "\n", $xml, "\n", '-' x 40, "\n";
    > while ($xml =~ s/<!--(.*?)-->//ms) {
    > print "Comment [$1]\n"
    > }
    > print "\n", '-' x 40, "\n\n\n";
    >}
    >
    >my $gabage1 = '
    ><big name="asdf" date="33" >
    > asdf
    > <!-- howdy folks -->
    > <in2>jjjj</in2>
    > <!-- and still more -->
    > asdfb
    ></big>
    >';
    >
    >my $gabage2 = '
    ><big name="asdf" date="33" >
    > asdf
    > <!-- howdy folks %SYSTEM is down <who cares?> -->
    > <in2>jjjj</in2>
    > <!-- and still more -->
    > asdfb
    ></big>
    >';
    >
    >test_xml_comment_parse($_) foreach ($gabage1,$gabage2);
    >
    >output:
    >
    >XML
    >----------------------------------------
    >
    ><big name="asdf" date="33" >
    > asdf
    > <!-- howdy folks -->
    > <in2>jjjj</in2>
    > <!-- and still more -->
    > asdfb
    ></big>
    >
    >----------------------------------------
    >Comment [ howdy folks ]
    >Comment [ and still more ]
    >
    >----------------------------------------
    >
    >
    >XML
    >----------------------------------------
    >
    ><big name="asdf" date="33" >
    > asdf
    > <!-- howdy folks %SYSTEM is down <who cares?> -->
    > <in2>jjjj</in2>
    > <!-- and still more -->
    > asdfb
    ></big>
    >
    >----------------------------------------
    >Comment [ howdy folks %SYSTEM is down <who cares?> ]
    >Comment [ and still more ]
    >
    >----------------------------------------
    >
    >
    >
    >
    >
    >
    >
    >There is a problem though. If you need to retrieve data from XML
    >documents, you should generally use an XML parser instead of using your
    >own regular expressions.
    >
    >Here is one case where the code I posted above would pull out the text
    >"not really a comment", which isn't really a comment.
    >
    ><test_xml>
    > <value>
    > <![CDATA[ <!-- not really a comment --> ]]>
    > </value>
    ></test_xml>



    Thanks a lot

    Yes the first occurrence (?) does the trick /<!--(.*?)-->/
    And given nesting is not allowed here this will do it.
    This had worked for me before, I should have stuck with it.
    The //m is not really of help here since the xml could
    be without newlines.

    I found xml specs from
    http://www.w3.org/TR/1998/REC-xml-19980210#sec-cdata-sect
    I will use that to finish this code.

    About the CDATA thing you mentioned. No, that's not really a
    problem. The order of the regex is such that "all" non-markup
    items are processed out first.

    So in this case all CDATA will be removed first followed by
    all comments and any other weird ones like versioning.

    I like the specs, it makes it easy to write the regex.
    quote:

    CDSect ::= CDStart CData CDEnd
    [19] CDStart ::= '<![CDATA['
    [20] CData ::= (Char* - (Char* ']]>' Char*))
    [21] CDEnd ::= ']]>'


    Within a CDATA section, only the CDEnd string is recognized as markup,
    so that left angle brackets and ampersands may occur in their literal
    form; they need not (and cannot) be escaped using "&lt;" and "&amp;".
    CDATA sections cannot nest.

    An example of a CDATA section, in which "<greeting>" and "</greeting>"
    are recognized as character data, not markup:

    <![CDATA[<greeting>Hello, world!</greeting>]]>

    ..
    ..
    ..
    One more thing:
    >If you are wondering about a negative string match, look at the perlre
    >documentation, specifically negative lookahead and lookbehind
    >assertions.


    Yes I looked at it and tried the assertions quite a bit,
    in this context /(.*)(?!string)/s it doesn't seem to work.
    This however /(\w*)(?!string)/ seems to work but only if the
    string has certain characters.
    Don't know why.

    I won't be on for a couple of days while I install a new raid array.
    Anyway thanks for the help.
    robic0, Dec 24, 2005
    #5
  6. robic0

    Matt Garrish Guest

    <robic0> wrote in message news:...
    > On 23 Dec 2005 20:13:08 -0800, wrote:
    >>
    >>Here is one case where the code I posted above would pull out the text
    >>"not really a comment", which isn't really a comment.
    >>
    >><test_xml>
    >> <value>
    >> <![CDATA[ <!-- not really a comment --> ]]>
    >> </value>
    >></test_xml>

    >
    > I found xml specs from
    > http://www.w3.org/TR/1998/REC-xml-19980210#sec-cdata-sect
    > I will use that to finish this code.
    >
    > About the CDATA thing you mentioned. No, that's not really a
    > problem. The order of the regex is such that "all" non-markup
    > items are processed out first.
    >
    > So in this case all CDATA will be removed first followed by
    > all comments and any other weird ones like versioning.
    >


    Please *read* the spec. CDATA blocks have nothing to do with comments;
    they're sections of data where all the characters inside are treated as
    literals (sort of like how single quoting in perl allows you to use $,@ and
    %).

    Matt
    Matt Garrish, Dec 24, 2005
    #6
  7. robic0 <> wrote:

    > Yes the first occurrence (?) does the trick /<!--(.*?)-->/


    > The //m is not really of help here



    Right.


    > since the xml could
    > be without newlines.



    But not for that reason.

    The //m is not really of help here because it modifies the meaning
    of ^ and $, but your pattern does not contain either of those.

    The //m modifier is a no-op with the pattern you are using.
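
    To make the distinction concrete, here is a small sketch (the sample
    text and variable names are invented for illustration) of what each
    modifier changes:

```perl
use strict;
use warnings;

my $text = "first line\nsecond line\n";

# /s changes what "." matches: with it, "." also matches a newline.
my $dot_plain = $text =~ /first.*second/  ? 1 : 0;
my $dot_s     = $text =~ /first.*second/s ? 1 : 0;

# /m changes where "^" and "$" match: with it, they also match at line
# boundaries inside the string.
my $anchor_plain = $text =~ /^second/  ? 1 : 0;
my $anchor_m     = $text =~ /^second/m ? 1 : 0;

# With no "^" or "$" in the pattern, adding /m changes nothing.
my $with_m    = () = $text =~ /line/mg;
my $without_m = () = $text =~ /line/g;

print "dot:    plain=$dot_plain s=$dot_s\n";
print "anchor: plain=$anchor_plain m=$anchor_m\n";
print "count:  with_m=$with_m without_m=$without_m\n";
```

    So for a pattern like /<!--(.*?)-->/ it is the /s that lets a comment
    span several lines; the /m contributes nothing.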


    > I found xml specs from
    > http://www.w3.org/TR/1998/REC-xml-19980210#sec-cdata-sect
    > I will use that to finish this code.



    What a revolutionary idea!

    Sometimes it takes a true visionary to come up with a radically
    beneficial paradigm shift!


    > Yes I looked at it and tried the assertions quite a bit,
    > in this context /(.*)(?!string)/s it doesn't seem to work.



    If you post a short and complete program that we can run that
    duplicates your problem, then we might have a chance at
    solving your problem.

    But since you haven't, all we can do is offer our sympathy.

    Sorry it doesn't seem to work.
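
    For what it's worth, a short and complete program of that kind might
    look like the sketch below. It is only a guess at what was tried: the
    test data and variable names are assumptions, not the poster's actual
    code.

```perl
use strict;
use warnings;

my $data = "keep this string and this too";

# The lookahead is checked only ONCE, at the spot where (.*) stopped.
# Greedy .* runs to the end of the data, nothing follows it there, so
# (?!string) succeeds trivially and "string" ends up inside the capture.
my ($all) = $data =~ /(.*)(?!string)/s;

# \w* stops at the first non-word character (the space after "keep"),
# which is why this variant appears to depend on the data's characters.
my ($word) = $data =~ /(\w*)(?!string)/;

# Testing the lookahead before every character gives "everything up to,
# but not including, 'string'".
my ($before) = $data =~ /^((?:(?!string).)*)/s;

print "all:    [$all]\n";
print "word:   [$word]\n";
print "before: [$before]\n";
```

    In other words, /(.*)(?!string)/s does work as documented; it just
    asserts something much weaker than "the capture must not contain
    string".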


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
    Tad McClellan, Dec 24, 2005
    #7
  8. robic0

    Guest

    robic0 wrote:
    <snip>
    > I found xml specs from
    > http://www.w3.org/TR/1998/REC-xml-19980210#sec-cdata-sect
    > I will use that to finish this code.
    >
    > About the CDATA thing you mentioned. No, that's not really a
    > problem. The order of the regex is such that "all" non-markup
    > items are processed out first.
    >
    > So in this case all CDATA will be removed first followed by
    > all comments and any other weird ones like versioning.
    >
    > I like the specs, it makes it easy to write the regex.
    > quote:
    >
    > CDSect ::= CDStart CData CDEnd
    > [19] CDStart ::= '<![CDATA['
    > [20] CData ::= (Char* - (Char* ']]>' Char*))
    > [21] CDEnd ::= ']]>'
    >
    >
    > Within a CDATA section, only the CDEnd string is recognized as markup,
    > so that left angle brackets and ampersands may occur in their literal
    > form; they need not (and cannot) be escaped using "&lt;" and "&amp;".
    > CDATA sections cannot nest.
    >
    > An example of a CDATA section, in which "<greeting>" and "</greeting>"
    > are recognized as character data, not markup:
    >
    > <![CDATA[<greeting>Hello, world!</greeting>]]>
    >


    Just out of curiosity, is there a reason you don't want to use an
    existing module for parsing XML, such as Expat, LibXML, etc...?
    , Dec 26, 2005
    #8
  9. robic0

    Guest

    robic0 wrote:
    <snip>

    > Thanks a lot
    >
    > Yes the first occurrence (?) does the trick /<!--(.*?)-->/
    > And given nesting is not allowed here this will do it.
    > This had worked for me before, I should have stuck with it.
    > The //m is not really of help here since the xml could
    > be without newlines.
    >
    > I found xml specs from
    > http://www.w3.org/TR/1998/REC-xml-19980210#sec-cdata-sect
    > I will use that to finish this code.
    >
    > About the CDATA thing you mentioned. No, that's not really a
    > problem. The order of the regex is such that "all" non-markup
    > items are processed out first.
    >
    > So in this case all CDATA will be removed first followed by
    > all comments and any other weird ones like versioning.
    >
    > I like the specs, it makes it easy to write the regex.
    > quote:
    >
    > CDSect ::= CDStart CData CDEnd
    > [19] CDStart ::= '<![CDATA['
    > [20] CData ::= (Char* - (Char* ']]>' Char*))
    > [21] CDEnd ::= ']]>'
    >
    >
    > Within a CDATA section, only the CDEnd string is recognized as markup,
    > so that left angle brackets and ampersands may occur in their literal
    > form; they need not (and cannot) be escaped using "&lt;" and "&amp;".
    > CDATA sections cannot nest.
    >
    > An example of a CDATA section, in which "<greeting>" and "</greeting>"
    > are recognized as character data, not markup:
    >
    > <![CDATA[<greeting>Hello, world!</greeting>]]>
    >


    Don't forget about XML processing instructions; you should handle those
    too.

    <test_xml>
    <value>
    <?proc <!-- not really a comment --> ?>
    </value>
    </test_xml>
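
    One way to respect all three constructs at once is to scan the text in
    a single left-to-right pass, so that whichever construct opens first
    consumes everything up to its own terminator. The following is a rough
    sketch under that assumption, not a conforming XML parser (it ignores
    DOCTYPE declarations, for instance), and the sample data and variable
    names are made up:

```perl
use strict;
use warnings;

my $xml = "<test_xml>\n"
        . "<![CDATA[ <!-- in cdata --> ]]>\n"
        . "<?proc <!-- in pi --> ?>\n"
        . "<!-- real comment -->\n"
        . "</test_xml>\n";

my @comments;

# One left-to-right pass. Whichever construct OPENS first wins, and the
# scan resumes after its terminator, so comment-like text inside a CDATA
# section or processing instruction is never examined on its own.
while ($xml =~ /<!\[CDATA\[.*?\]\]>    # CDATA section: skip over it
              | <\?.*?\?>             # processing instruction: skip
              | <!--(.*?)-->          # comment: capture the body
              /gsx) {
    push @comments, $1 if defined $1;
}

print "Comment [$_]\n" for @comments;
```

    Because the alternatives are tried at each position in turn, no
    ordering of separate substitution passes is needed to decide which
    construct is the outer one.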
    , Dec 26, 2005
    #9
  10. robic0

    robic0 Guest

    On Fri, 23 Dec 2005 15:17:21 -0800, robic0 wrote:

    Thanks for the patients folks. Hope you had a happy
    25'th. I started back on this problem a few hours ago.

    Initially, this was a nesting problem I couldn't figure
    out how to solve with regular expressions. I'm doing an XML
    parser using just regex so I want to get this right.
    I have concentrated on the docs on regex for this and
    oh my god it's got problems. I would like the writers
    of Perl and Larry Wall to take a look at the code below.
    It encapsulates the logic, however cumbersome it may be,
    of what it takes to implement the "not this string"
    in the regular expression machine. Don't ask me to
    explain that phrase. I think this is a pristine solution
    to what I'm doing however. In other words, given the
    XML specifications, this will always work.

    XML in general doesn't allow markup nesting (or so
    I imagine) because of the obvious,
    "Markup" being the set of characters that act as
    delimiters, both start and end of an expression.

    The only problem is (for regex that is) some constructs
    like "Comments" and "CDATA" conflict in that their
    paradigm can result in deadlocks.

    Most SAX or stream parsers get away with this because
    they have anchors and process from begin to end.

    I use a substitution method in the parser code I've written
    that nullifies anchors. I've been using this method
    for years on other things. Hey now, doesn't that sound like
    something the regular expression authors use?
    Yeah but they fell down on this one.
    Look at what I did here.
    I've assumed cdata nesting and comment nesting is illegal,
    and it is. I've "assumed" an anchor on one, could have
    been either one. The logic uses the limited ability of
    regex to capture (hog) all the data, indeed it depends
    upon it.

    Look at this code very carefully, nesting is not allowed
    and is the "only" reason it works. Of course nesting will
    throw an error in production code. This code will be
    merged with the primary and more XML spec specific changes.

    Why am I doing this? I don't know, I have a couple of weeks
    free I guess...

    Thanks for the comments!

    use strict;
    use warnings;

    $_ = '
    <![CDATA[ <!-- imbed comment --> some text <!-- imbed as well -->]]>

    <!--
    wasdfvgasvbg <![CDATA[ not really a CDATA ]]>
    <tag>at tag in a real comment</tag>
    <![CDATA[ not a CDATA ]]>
    -->

    <!-- This is a real comment -->

    ';

    #### This section of parser deals with
    #### circular non-markup imbedding issues.
    #### (one inside the other, and so forth)
    #### So far just comments & cdata.
    #### Use the general substitution magic.
    #### This is valid because nesting of neither
    #### comments nor cdata is allowed.

    my $cnt = 1;
    my %root = ();
    my %cdata_elements = ();

    print "\n";

    # -- Comments (done first) --
    while (s/(<!--(.*?)-->)/[$cnt]/s) {
    $root{$cnt} = $1;
    print "$cnt = Questionable comment: $1\n"; $cnt++;
    }
    print "\n\n",'='x60,"\n\nThe \"Real\" Stuff -->\n\n";
    # -- CDATA (done second) --
    while (s/<!\[CDATA\[(.*?)\]\]>/[$cnt]/s)
    {
    # reconstitute cdata element contents
    my $cdata_contents = $1;
    my $str = '';
    while ( $cdata_contents =~ s/([^\[\]]+)|\[([\d]+)\]//i )
    {
    if (defined $1)
    {
    $str .= $1;
    }
    elsif (defined $2 && exists $root{$2})
    {
    $str .= $root{$2};
    delete $root{$2};
    }
    else {
    my $j = 0; # shouldn't get here
    }
    }
    $root{$cnt} = $str;
    $cdata_elements{$cnt} = '';

    print "\n$cnt = REAL CDATA: $root{$cnt}\n"; $cnt++;
    }
    # -- Process leftover comments that are real --
    while (my ($key,$val) = each (%root)) {
    if (!defined $cdata_elements{$key}) {
    # This $root re-assignment is not really necessary
    # since $1 will contain the processing text that
    # will be processed here, then never used again.
    $root{$key} =~ s/<!--(.*?)-->/$1/s;
    print "\n$key = REAL COMMENT: $root{$key}\n"; # Or $1
    }
    }
    __END__

    1 = Questionable comment: <!-- imbed comment -->
    2 = Questionable comment: <!-- imbed as well -->
    3 = Questionable comment: <!--
    wasdfvgasvbg <![CDATA[ not really a CDATA ]]>
    <tag>at tag in a real comment</tag>
    <![CDATA[ not a CDATA ]]>
    -->
    4 = Questionable comment: <!-- This is a real comment -->

    ============================================================

    The "Real" Stuff -->


    5 = REAL CDATA: <!-- imbed comment --> some text <!-- imbed as well
    -->

    4 = REAL COMMENT: This is a real comment

    3 = REAL COMMENT:
    wasdfvgasvbg <![CDATA[ not really a CDATA ]]>
    <tag>at tag in a real comment</tag>
    <![CDATA[ not a CDATA ]]>
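
    One assumption the substitution approach above seems to rely on is that
    the text never contains something that already looks like a "[n]"
    marker. The sketch below is a condensed restatement of the two
    substitution steps (not the code above verbatim; the sample input is
    invented) that shows what happens when that assumption fails:

```perl
use strict;
use warnings;

# Input whose CDATA text happens to contain the literal characters "[1]".
my $doc = '<!-- a real comment --> <![CDATA[ see footnote [1] ]]>';

my ($cnt, %root) = (1);

# Step 1, as in the code above: swap each comment for a "[n]" marker.
while ($doc =~ s/(<!--.*?-->)/[$cnt]/s) {
    $root{$cnt++} = $1;
}

# Step 2: pull out the CDATA body and expand any "[n]" markers in it.
my ($cdata) = $doc =~ /<!\[CDATA\[(.*?)\]\]>/s;
(my $rebuilt = $cdata) =~ s/\[(\d+)\]/exists $root{$1} ? delete $root{$1} : "[$1]"/ge;

print "CDATA body after expansion: [$rebuilt]\n";
print "comments left over: ", scalar(keys %root), "\n";
```

    Here the literal "[1]" in the CDATA text is mistaken for marker 1, so
    the real comment is pasted into the CDATA body and disappears from the
    list of comments. A marker that cannot occur in the input would be
    needed to rule this out.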
    robic0, Dec 27, 2005
    #10
  11. robic0

    robic0 Guest

    On 26 Dec 2005 12:38:35 -0800, wrote:

    >
    >robic0 wrote:
    ><snip>
    >> I found xml specs from
    >> http://www.w3.org/TR/1998/REC-xml-19980210#sec-cdata-sect
    >> I will use that to finish this code.
    >>
    >> About the CDATA thing you mentioned. No, that's not really a
    >> problem. The order of the regex is such that "all" non-markup
    >> items are processed out first.
    >>
    >> So in this case all CDATA will be removed first followed by
    >> all comments and any other weird ones like versioning.
    >>
    >> I like the specs, it makes it easy to write the regex.
    >> quote:
    >>
    >> CDSect ::= CDStart CData CDEnd
    >> [19] CDStart ::= '<![CDATA['
    >> [20] CData ::= (Char* - (Char* ']]>' Char*))
    >> [21] CDEnd ::= ']]>'
    >>
    >>
    >> Within a CDATA section, only the CDEnd string is recognized as markup,
    >> so that left angle brackets and ampersands may occur in their literal
    >> form; they need not (and cannot) be escaped using "&lt;" and "&amp;".
    >> CDATA sections cannot nest.
    >>
    >> An example of a CDATA section, in which "<greeting>" and "</greeting>"
    >> are recognized as character data, not markup:
    >>
    >> <![CDATA[<greeting>Hello, world!</greeting>]]>
    >>

    >
    >Just out of curiosity, is there a reason you don't want to use an
    >existing module for parsing XML, such as Expat, LibXML, etc...?


    I'm thinking that this thing I'm doing is going to blow the doors
    off SAX. But, who knows...
    robic0, Dec 27, 2005
    #11
  12. robic0

    robic0 Guest

    On 23 Dec 2005 20:04:33 -0800, "MikeGee" <>
    wrote:

    >robic0 wrote:
    >> while (s/<!--(.*)-->//s) { print "$1\n"; }

    >
    >Your post is longer than I can concentrate to read carefully, but in the
    >above line, try:
    >s/<!--(.*?)-->//s
    >and see if there is a difference.
    >
    >I don't think you should make over-arching comments on the deficiencies
    >of a system that works fine for everyone else. I bet most of the
    >really knowledgeable folks reading this newsgroup ignore your question
    >just because of the bad attitude.


    I think you should look past your navel in these issues.
    You just look at the tip of the iceberg. Why should these issues
    be of concern to anyone? It's a simple capability that regex
    really badly falls down on. To have a match expression that
    excludes a specific "string", then resets the counter. The
    match won't happen on (.*) but not this "ASDF".
    Do you understand that Mike?
    robic0, Dec 27, 2005
    #12
  13. robic0

    robic0 Guest

    On Fri, 23 Dec 2005 15:17:21 -0800, robic0 wrote:

    I'm back on the job.
    I'm going to post some new code this week that
    complies with XML spec.

    This is the solution for the Comment/CDATA paradigm
    that will be incorporated in the new version:

    use strict;
    use warnings;

    $_ = '
    <![CDATA[ <!-- imbed comment --> some text <!-- imbed as well -->]]>

    <!--
    wasdfvgasvbg <![CDATA[ not really a CDATA ]]>
    <tag>at tag in a real comment</tag>
    <![CDATA[ not a CDATA ]]>
    -->

    <!-- This is a real comment -->

    ';

    #### This section of parser deals with
    #### circular non-markup imbedding issues.
    #### (one inside the other, and so forth)
    #### So far just comments & cdata.
    #### Use the general substitution magic.
    #### This is valid because nesting of neither
    #### comments nor cdata is allowed.

    my $cnt = 1;
    my %root = ();
    my %cdata_elements = ();

    print "\n";

    # -- Comments (done first) --
    while (s/(<!--(.*?)-->)/[$cnt]/s) {
    $root{$cnt} = $1;
    print "$cnt = Questionable comment: $1\n"; $cnt++;
    }
    print "\n\n",'='x60,"\n\nThe \"Real\" Stuff -->\n\n";
    # -- CDATA (done second) --
    while (s/<!\[CDATA\[(.*?)\]\]>/[$cnt]/s)
    {
    # reconstitute cdata element contents
    my $cdata_contents = $1;
    my $str = '';
    while ( $cdata_contents =~ s/([^\[\]]+)|\[([\d]+)\]//i )
    {
    if (defined $1)
    {
    $str .= $1;
    }
    elsif (defined $2 && exists $root{$2})
    {
    $str .= $root{$2};
    delete $root{$2};
    }
    else {
    my $j = 0; # shouldn't get here
    }
    }
    $root{$cnt} = $str;
    $cdata_elements{$cnt} = '';

    print "\n$cnt = REAL CDATA: $root{$cnt}\n"; $cnt++;
    }
    # -- Process leftover comments that are real --
    while (my ($key,$val) = each (%root)) {
    if (!defined $cdata_elements{$key}) {
    # This $root re-assignment is not really necessary
    # since $1 will contain the processing text that
    # will be processed here, then never used again.
    $root{$key} =~ s/<!--(.*?)-->/$1/s;
    print "\n$key = REAL COMMENT: $root{$key}\n"; # Or $1
    }
    }


    __END__

    1 = Questionable comment: <!-- imbed comment -->
    2 = Questionable comment: <!-- imbed as well -->
    3 = Questionable comment: <!--
    wasdfvgasvbg <![CDATA[ not really a CDATA ]]>
    <tag>at tag in a real comment</tag>
    <![CDATA[ not a CDATA ]]>
    -->
    4 = Questionable comment: <!-- This is a real comment -->


    ============================================================

    The "Real" Stuff -->


    5 = REAL CDATA: <!-- imbed comment --> some text <!-- imbed as well
    -->

    4 = REAL COMMENT: This is a real comment

    3 = REAL COMMENT:
    wasdfvgasvbg <![CDATA[ not really a CDATA ]]>
    <tag>at tag in a real comment</tag>
    <![CDATA[ not a CDATA ]]>
    robic0, Dec 27, 2005
    #13
  14. robic0

    robic0 Guest

    On Mon, 26 Dec 2005 19:17:04 -0800, robic0 wrote:

    >On Fri, 23 Dec 2005 15:17:21 -0800, robic0 wrote:
    >
    >I'm back on the job.


    disregard, wrong thread...
    robic0, Dec 27, 2005
    #14
  15. robic0 <> wrote:
    > On Fri, 23 Dec 2005 15:17:21 -0800, robic0 wrote:
    >
    > Thanks for the patients folks.



    You crack me up, doctor!


    > I'm doing an XML
    > parser using just regex so I want to get this right.



    That is mathematically impossible you know.

    You will be working on it for a long long time, and never have it right.


    > I have concentrated on the docs on regex for this and
    > oh my god it's got problems.



    Yes, parsing a Context Free language using a Regular grammar
    is simply not possible.

    (but Perl's regular expressions aren't actually "regular" at all.)
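
    As an illustration of that last remark: perl 5.10 and later (released
    after this thread) accept recursive patterns, which can match nested
    constructs that no truly regular expression can. A minimal sketch,
    assuming a single hard-coded tag name <big> and a hypothetical helper
    called is_balanced:

```perl
use strict;
use warnings;
use 5.010;   # recursive patterns such as (?1) need perl 5.10 or later

# Returns 1 if the string is exactly one <big> element whose content is
# plain text and/or properly nested <big> elements, else 0.
sub is_balanced {
    my ($s) = @_;
    return $s =~ m{
        ^
        (                        # group 1: one complete <big> element
            <big>
            (?: [^<]++ | (?1) )* # plain text, or a nested element
            </big>
        )
        $
    }x ? 1 : 0;
}

print is_balanced('<big>a<big>b</big>c</big>'), "\n";
print is_balanced('<big>a<big>b</big>c'), "\n";
```

    This only shows that nesting to arbitrary depth can be recognised; it
    is still a long way from parsing XML, where attributes, entities,
    CDATA, comments and arbitrary tag names all have to be handled.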


    > I would like the writers
    > of Perl and Larry Wall to take a look at the code below.



    I would like you to read the Dragon Book.


    > In other words, given the
    > XML specifications, this will always work.



    Yeah, right.


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
    Tad McClellan, Dec 28, 2005
    #15
  16. robic0

    robic0 Guest

    On Tue, 27 Dec 2005 20:57:09 -0600, Tad McClellan <> wrote:

    >robic0 <> wrote:
    >> On Fri, 23 Dec 2005 15:17:21 -0800, robic0 wrote:
    >>
    >> Thanks for the patients folks.

    >
    >
    >You crack me up, doctor!
    >
    >
    >> I'm doing an XML
    >> parser using just regex so I want to get this right.

    >
    >
    >That is mathematically impossible you know.
    >
    >You will be working on it for a long long time, and never have it right.
    >
    >
    >> I have concentrated on the docs on regex for this and
    >> oh my god it's got problems.

    >
    >
    >Yes, parsing a Context Free language using a Regular grammar
    >is simply not possible.
    >
    >(but Perl's regular expressions aren't actually "regular" at all.)
    >
    >
    >> I would like the writers
    >> of Perl and Larry Wall to take a look at the code below.

    >
    >
    >I would like you to read the Dragon Book.
    >
    >
    >> In other words, given the
    >> XML specifications, this will always work.

    >
    >
    >Yeah, right.


    You crack me up dude, it works great. The code is integrated into
    the main work. No problems whatsoever. I'm not going to invest time
    in code that doesn't work. Never have, never will.
    robic0, Dec 29, 2005
    #16
