My Regexp XML Parser -> Structured Perl Data, Cut & Paste Version, No Modules (Vol I)

Discussion in 'Perl Misc' started by robic0, Dec 21, 2005.

  1. robic0

    This post is in response to someone who asked for help parsing
    XML into a data structure. The poster couldn't install
    XML::Parser or XML::Simple. I replied a few times with some
    partial code. True to my word, here is the core of a cut & paste,
    module-free, raw, robust XML parser that builds Perl data
    structures. It's about 140 lines of code. I imagine it's about
    3 times faster than the XML parsers out there, but I haven't
    timed it. It doesn't carry the overhead of SAX or nodes.

    This installment is released early, without the fancy
    XML::Simple options. It is a typical "force array" version
    (see the subs below). I wanted to wait until tomorrow to post
    it: I already know how to do the rest but don't have the time
    tonight. The code is fairly final, though, so I'm releasing it
    with the understanding that its shortcomings will be fixed in
    a day or so.
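
    For anyone unsure what "force array" means here: each child
    element is stored under its tag name inside an array reference,
    even when there is only one child. A minimal sketch of the idea,
    separate from the parser itself (the helper name add_child and
    the sample values are made up for illustration):

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the parser: store a child value
# under its tag name, always inside an array reference.
sub add_child {
    my ($parent, $tag, $value) = @_;
    push @{ $parent->{$tag} }, $value;   # autovivifies the array on first use
    return $parent;
}

my %doc;
add_child(\%doc, 'node',  { name => 'RTSP' });
add_child(\%doc, 'node',  { name => 'RTP' });
add_child(\%doc, 'title', 'only one');

# Both keys hold array refs, no matter how many children were added.
print scalar(@{ $doc{node} }), "\n";   # prints 2
print ref($doc{title}), "\n";          # prints ARRAY
```

    The price is that a lone child still has to be read as element
    [0]; the benefit is that the shape never changes with the input.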

    I've spent 4 days on this. You have to add your own code to
    open the XML file, or just cut and paste your XML into
    $gabage1; I've left that part up to you. The output and
    parsing are completely legitimate, but the result won't look
    like XML::Simple's default output: I maintain a root here,
    among other things. I will post a mod tomorrow. The parsing
    is probably much faster than the modules on CPAN.

    Let me know if you have any suggestions for improvement.
    I want to keep it under 200 lines for a complete cut & paste
    solution. It doesn't use any existing parser; its parser is
    built in. I don't think this method is used anywhere in the
    XML world, so it may be worth checking whether it gives a
    several-fold speed improvement.
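
    To make the method concrete: the parser keeps replacing every
    innermost element (one with no '<' left in its content) with a
    numbered placeholder and stores what it removed, until no markup
    remains. A stripped-down sketch of that loop, handling only
    attribute-free tags; the sample string and the names %store and
    $xml are made up for illustration, and this is not the full code:

```perl
use strict;
use warnings;

my $xml = '<a><b>one</b><c>two</c></a>';
my %store;      # placeholder number => [tag, raw content]
my $cnt = 1;

# Collapse innermost elements first: [^<]* cannot cross another tag,
# so an element matches only once its children are already placeholders.
while ($xml =~ s/<([0-9a-zA-Z]+)>([^<]*)<\/\1>/[$cnt]/) {
    $store{$cnt} = [$1, $2];
    $cnt++;
}

print "$xml\n";                           # prints [3]
print "$store{3}[0] => $store{3}[1]\n";   # prints a => [1][2]
```

    The nesting is recovered afterwards by following the placeholder
    numbers found in each stored content string.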

    I'll post changes to this tomorrow.
    Contact info:
    email: robic0-AT-yahoo.com

    ========================================================
    use strict;
    use warnings;
    use Data::Dumper;

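    # Slurp the whole XML file; a plain <$fh> read would return only the first line.
    open(my $fh, '<', 'datafile') or die "can't open datafile: $!";
    my $gabage1 = do { local $/; <$fh> };
    close $fh;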

    my @xml_files = ($gabage1);

    my $debug = 0;
    my $rmv_white_space = 1;

    ## -- XML start & end regexp substitution delimiter chars --
    ## match side , substitution side
    ## -----------------------/-------------------------
    my @S_dlim = ('\[' , '['); # use these for reading (debug)
    my @E_dlim = ('\]' , ']');
    #my @S_dlim = (chr(140) , chr(140)); # use these for production
    #my @E_dlim = (chr(141) , chr(141));


    for (@xml_files)
    {
        if ($rmv_white_space) {
            s/>[\s]+</></g;
            s/[\s]+</</g;
            s/>[\s]+/>/g;
        }
        print "\n",'='x30,"\n$_\n\n" if ($debug);

        my $ROOT = {}; # container
        my ($last_cnt, $cnt, $i) = (-1, 1, 0);

        # should only need 2 iterations max, but wth
        while ($cnt != $last_cnt && $i < 20)
        {
            $last_cnt = $cnt;

            ## <?XML-Version ?> , have to check the format of '<?'
            while (s/<\?([^<>]*)\?>//i) {} # to void xml versioning
            # while (s/<\?([^<>]*)\?>/$S_dlim[1]$cnt$E_dlim[1]/i) { print "$cnt <$1> = \n" if ($debug); $cnt++}

            ## <!-- Comments -->
            # while (s/<!--([^<>]*)-->//i) {} # to void comments
            while (s/<!--([^<>]*)-->/$S_dlim[1]$cnt$E_dlim[1]/i) {
                print "$cnt <!-- --> = $1\n" if ($debug);
                $ROOT->{$cnt} = { comment => $1 };
                $cnt++;
            }
            # Comments, need to have "anything but <!-- nor --> here" (revisit)
            # while (s/<!--([^(<!--)^(-->)]*)-->/$S_dlim[1]$cnt$E_dlim[1]/i) { print "$cnt <!-- --> = $1\n" if ($debug); $cnt++}

            ## <Tag/> , no content
            while (s/<([0-9a-zA-Z]+)\/>/$S_dlim[1]$cnt$E_dlim[1]/i) {
                print "$cnt <$1> = \n" if ($debug);
                $ROOT->{$cnt} = { $1 => '' };
                $cnt++;
            }
            ## <Tag Attributes/> , no content
            ## \s lets attributes be split across lines; the outer capture keeps
            ## every attribute in $2 (a repeated group keeps only its last match)
            while (s/<([0-9a-zA-Z]+)((?:\s+[0-9a-zA-Z]+\s*=\s*"[^"]*")+)\s*\/>/$S_dlim[1]$cnt$E_dlim[1]/i) {
                my ($tag, $attrs) = ($1, $2);
                print "$cnt <$tag> = attr: $attrs\n" if ($debug);
                $ROOT->{$cnt} = { $tag => getAttrHash($attrs) };
                $cnt++;
            }
            ## <Tag> Content </Tag>
            while (s/<([0-9a-zA-Z]+)>([^<]*)<\/\1>/$S_dlim[1]$cnt$E_dlim[1]/i) {
                # copy the captures before calling a sub that runs its own regexes
                my ($tag, $content) = ($1, $2);
                print "$cnt <$tag> = $content\n" if ($debug);
                my $unknown = '';
                if (length($content) > 0) {
                    my $hcontent = getContentHash($content, $ROOT);
                    if (keys (%{$hcontent}) > 1) {
                        $unknown = $hcontent;
                    }
                    else {
                        # single entry: store its value directly
                        ($unknown) = values (%{$hcontent});
                    }
                }
                $ROOT->{$cnt} = { $tag => $unknown };
                $cnt++;
            }
            ## <Tag Attributes> Content </Tag>
            while (s/<([0-9a-zA-Z]+)((?:\s+[0-9a-zA-Z]+\s*=\s*"[^"]*")+)\s*>([^<]*)<\/\1>/$S_dlim[1]$cnt$E_dlim[1]/i) {
                my ($tag, $attrs, $content) = ($1, $2, $3);
                print "$cnt <$tag> = attr: $attrs, content: $content\n" if ($debug);
                my $hattrib = getAttrHash($attrs);
                my $hcontent = getContentHash($content, $ROOT);

                # merge child elements and text into the attribute hash
                while (my ($key,$val) = each (%{$hcontent})) {
                    $hattrib->{$key} = $val;
                }
                $ROOT->{$cnt} = { $tag => $hattrib };
                $cnt++;
            }
            $i++ if ($last_cnt != $cnt);
        }
        if (/<|>/) {
            print "($i) XML problem, malformed, syntax or tag closure:\n$_";
        } else {
            print "$i iterations\n\n";
            #print Dumper($ROOT);
            my $outer_element = $cnt-1;
            if (exists $ROOT->{$outer_element}) {
                my $tmp = {};
                %{$tmp} = %{$ROOT->{$outer_element}};
                print Dumper($tmp);
            }
        }
    }
    ##
    sub getAttrHash
    {
        my $attstr = shift;
        my $ahref = {};
        return $ahref unless (defined $attstr);
        # a value runs up to its closing quote, so '=' inside a value is fine
        while ($attstr =~ s/\s*([0-9a-zA-Z]+)\s*=\s*"([^"]*)"\s*//i) {
            $ahref->{$1} = $2;
        }
        return $ahref;
    }
    ##
    sub getContentHash
    {
        my ($attstr,$hStore) = @_;
        my $ahref = {};
        return $ahref unless (defined $attstr && defined $hStore);
        my @ary = ();
        while ($attstr =~ s/([^<$S_dlim[0]$E_dlim[0]]+)|$S_dlim[0]([\d]+)$E_dlim[0]//i) {
            if (defined $1) {
                # plain text between placeholders
                push (@ary, $1);
            }
            elsif (defined $2 && exists $hStore->{$2}) {
                # each stored entry is a one-key hash: { tag => value }
                my ($key) = keys (%{$hStore->{$2}});
                my $val = $hStore->{$2}{$key};

                # here, force array is in effect (aka: simple)
                # (this will be modified in a day or so)
                ################
                if (exists $ahref->{$key})
                {
                    #print "getChash - $key\n";
                    push (@{$ahref->{$key}}, $val);
                } else {
                    $ahref->{$key} = [$val];
                    # $ahref->{$key} = $val;
                }
                ################
            }
        }
        if (scalar(@ary) == 1) {
            $ahref->{'content'} = $ary[0];
        } elsif (scalar(@ary) > 1) {
            $ahref->{'content'} = [@ary];
        }
        return $ahref;
    }

    __END__

    $VAR1 = {
    'document' => {
    'WMSNameSpaceVersion' => '2.0',
    'comment' => [
    ' Control Protocol ',
    ' Data Protocol ',
    ' Feedback Protocol ',
    ' Network Source '
    ],
    'node' => [
    {
    'opcode' => 'create',
    'comment' => [
    ' Object Store '
    ],
    'name' => 'Control Protocol',
    'node' => [
    {
    'opcode' => 'create',
    'comment' => [
    ' RTSP ',
    ' Sessionless Multicast '
    ],
    'name' => 'Object Store',
    'node' => [
    {

    'opcode' => 'create',

    'comment' => [

    ' Properties '

    ],

    'name' => 'RTSP',

    'node' => [

    {

    'opcode' => 'create',

    'value' => '{308786f0-8b15-11d2-b25f-006097d2e41e}',

    'name' => 'CLSID',

    'type' => 'string'

    },

    {

    'opcode' => 'create',

    'value' => '0x1',

    'name' => 'Enabled',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'name' => 'Properties',

    'node' => [

    {

    'opcode' => 'create',

    'value' => 'RTSP,RTSPA,RTSPT,RTSPU,RTSPM',

    'name' => 'Protocol',

    'type' => 'string'

    }

    ]

    }

    ]
    },
    {

    'opcode' => 'create',

    'comment' => [

    ' Properties '

    ],

    'name' => 'Sessionless Multicast',

    'node' => [

    {

    'opcode' => 'create',

    'value' => '{f9377800-f38d-11d2-b26c-006097d2e41e}',

    'name' => 'CLSID',

    'type' => 'string'

    },

    {

    'opcode' => 'create',

    'value' => '0x1',

    'name' => 'Enabled',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'name' => 'Properties',

    'node' => [

    {

    'opcode' => 'create',

    'value' => 'MCAST,RTP',

    'name' => 'Protocol',

    'type' => 'string'

    }

    ]

    }

    ]
    }
    ]
    },
    {
    'opcode' => 'create',
    'name' => 'Shared Properties'
    }
    ]
    },
    {
    'opcode' => 'create',
    'comment' => [
    ' Object Store '
    ],
    'name' => 'Data Protocol',
    'node' => [
    {
    'opcode' => 'create',
    'comment' => [
    ' RTP ',
    ' RTP/ASF ',
    ' RTP/AVP ',
    ' RTP/FEC ',
    ' RTP/WMS-FEC '
    ],
    'name' => 'Object Store',
    'node' => [
    {

    'opcode' => 'create',

    'comment' => [

    ' Properties '

    ],

    'name' => 'RTP',

    'node' => [

    {

    'opcode' => 'create',

    'value' => '{cbfb2e20-ab7b-11d2-b261-006097d2e41e}',

    'name' => 'CLSID',

    'type' => 'string'

    },

    {

    'opcode' => 'create',

    'value' => '0x1',

    'name' => 'Enabled',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'name' => 'Properties',

    'node' => [

    {

    'opcode' => 'create',

    'value' => 'x-asf-pf',

    'name' => 'Format',

    'type' => 'string'

    },

    {

    'opcode' => 'create',

    'value' => 'RTP/AVP',

    'name' => 'Protocol',

    'type' => 'string'

    }

    ]

    }

    ]
    },
    {

    'opcode' => 'create',

    'comment' => [

    ' Properties '

    ],

    'name' => 'RTP/ASF',

    'node' => [

    {

    'opcode' => 'create',

    'value' => '{149a44be-dc14-4e94-9cb0-c0268e77df9e}',

    'name' => 'CLSID',

    'type' => 'string'

    },

    {

    'opcode' => 'create',

    'value' => '0x1',

    'name' => 'Enabled',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'name' => 'Properties',

    'node' => [

    {

    'opcode' => 'create',

    'value' => 'x-asfv2-pf,x-asfv2-grp-pf,x-asfv2-frag-pf',

    'name' => 'Format',

    'type' => 'string'

    },

    {

    'opcode' => 'create',

    'value' => 'RTP/AVP',

    'name' => 'Protocol',

    'type' => 'string'

    }

    ]

    }

    ]
    },
    {

    'opcode' => 'create',

    'comment' => [

    ' Properties '

    ],

    'name' => 'RTP/AVP',

    'node' => [

    {

    'opcode' => 'create',

    'value' => '{d7335e2e-62eb-4ad0-96cd-b31c9d0f9f85}',

    'name' => 'CLSID',

    'type' => 'string'

    },

    {

    'opcode' => 'create',

    'value' => '0x1',

    'name' => 'Enabled',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'name' => 'Properties',

    'node' => [

    {

    'opcode' => 'create',

    'value' => 'PCMU,L8,L16,MPA,G726-24,G726-40',

    'name' => 'Format',

    'type' => 'string'

    },

    {

    'opcode' => 'create',

    'value' => 'RTP/AVP',

    'name' => 'Protocol',

    'type' => 'string'

    }

    ]

    }

    ]
    },
    {

    'opcode' => 'create',

    'comment' => [

    ' Properties '

    ],

    'name' => 'RTP/FEC',

    'node' => [

    {

    'opcode' => 'create',

    'value' => '{02DEFE42-F8FC-11d2-8670-00C04F6890ED}',

    'name' => 'CLSID',

    'type' => 'string'

    },

    {

    'opcode' => 'create',

    'value' => '0x1',

    'name' => 'Enabled',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'name' => 'Properties',

    'node' => [

    {

    'opcode' => 'create',

    'value' => 'parityfec',

    'name' => 'Format',

    'type' => 'string'

    },

    {

    'opcode' => 'create',

    'value' => 'RTP/AVP',

    'name' => 'Protocol',

    'type' => 'string'

    }

    ]

    }

    ]
    },
    {

    'opcode' => 'create',

    'comment' => [

    ' Properties '

    ],

    'name' => 'RTP/WMS-FEC',

    'node' => [

    {

    'opcode' => 'create',

    'value' => '{EDAB8E6B-746C-40db-A885-9E4A9EEF27A2}',

    'name' => 'CLSID',

    'type' => 'string'

    },

    {

    'opcode' => 'create',

    'value' => '0x1',

    'name' => 'Enabled',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'name' => 'Properties',

    'node' => [

    {

    'opcode' => 'create',

    'value' => 'wms-fec',

    'name' => 'Format',

    'type' => 'string'

    },

    {

    'opcode' => 'create',

    'value' => 'RTP/AVP',

    'name' => 'Protocol',

    'type' => 'string'

    }

    ]

    }

    ]
    }
    ]
    },
    {
    'opcode' => 'create',
    'name' => 'Shared Properties'
    }
    ]
    },
    {
    'opcode' => 'create',
    'comment' => [
    ' Object Store '
    ],
    'name' => 'Feedback Protocol',
    'node' => [
    {
    'opcode' => 'create',
    'comment' => [
    ' RTCP '
    ],
    'name' => 'Object Store',
    'node' => [
    {

    'opcode' => 'create',

    'comment' => [

    ' Properties '

    ],

    'name' => 'RTCP',

    'node' => [

    {

    'opcode' => 'create',

    'value' => '{ecfddc81-184e-11d3-ae84-00a0c95ec3f0}',

    'name' => 'CLSID',

    'type' => 'string'

    },

    {

    'opcode' => 'create',

    'value' => '0x1',

    'name' => 'Enabled',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'name' => 'Properties',

    'node' => [

    {

    'opcode' => 'create',

    'value' => 'x-wms-rtx',

    'name' => 'Format',

    'type' => 'string'

    },

    {

    'opcode' => 'create',

    'value' => 'RTP/AVP',

    'name' => 'Protocol',

    'type' => 'string'

    }

    ]

    }

    ]
    }
    ]
    },
    {
    'opcode' => 'create',
    'name' => 'Shared Properties'
    }
    ]
    },
    {
    'opcode' => 'create',
    'comment' => [
    ' Object Store ',
    ' Shared Properties '
    ],
    'name' => 'Network Source',
    'node' => [
    {
    'opcode' => 'create',
    'comment' => [
    ' WMS Http Network Source ',
    ' WMS Mms Network Source ',
    ' WMS Msbd Network Source ',
    ' WMS Network Source '
    ],
    'name' => 'Object Store',
    'node' => [
    {

    'opcode' => 'create',

    'comment' => [

    ' Properties '

    ],

    'name' => 'WMS Http Network Source',

    'node' => [

    {

    'opcode' => 'create',

    'value' => '{566A2EFF-5651-4020-AC1A-EB48E4571EA3}',

    'name' => 'CLSID',

    'type' => 'string'

    },

    {

    'opcode' => 'create',

    'value' => '0x1',

    'name' => 'Enabled',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'name' => 'Properties',

    'node' => [

    {

    'opcode' => 'create',

    'value' => 'HTTP',

    'name' => 'Source Type',

    'type' => 'string'

    },

    {

    'opcode' => 'create',

    'value' => '0x50',

    'name' => 'DefaultHttpServerPort',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'value' => '0x1bb',

    'name' => 'DefaultHttpServerSSLPort',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'value' => '0x8',

    'name' => 'PacketBuffers',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'value' => '0x1',

    'name' => 'EnableHTTP1_1',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'value' => '0x1e',

    'name' => 'OpenTimeout',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'value' => '0x64',

    'name' => 'SecondSegmentTimeout',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'value' => '',

    'name' => 'ControlAdapter',

    'type' => 'string'

    },

    {

    'opcode' => 'create',

    'value' => '0x55',

    'name' => 'PercentBWUsageForAccelStreaming',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'value' => '0x3',

    'name' => 'Proxy Setting',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'value' => '',

    'name' => 'ProxyHostName',

    'type' => 'string'

    },

    {

    'opcode' => 'create',

    'value' => '0x50',

    'name' => 'ProxyPort',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'value' => '0x0',

    'name' => 'ProxyBypassForLocal',

    'type' => 'int32'

    }

    ]

    }

    ]
    },
    {

    'opcode' => 'create',

    'comment' => [

    ' Properties '

    ],

    'name' => 'WMS Mms Network Source',

    'node' => [

    {

    'opcode' => 'create',

    'value' => '{DCF6C8B2-F6C0-461b-82DA-35945EADF54A}',

    'name' => 'CLSID',

    'type' => 'string'

    },

    {

    'opcode' => 'create',

    'value' => '0x1',

    'name' => 'Enabled',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'name' => 'Properties',

    'node' => [

    {

    'opcode' => 'create',

    'value' => 'MMS,MMST,MMSU',

    'name' => 'Source Type',

    'type' => 'string'

    },

    {

    'opcode' => 'create',

    'value' => '0x6db',

    'name' => 'DefaultServerPort',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'value' => '0x4',

    'name' => 'MaxReadHeaderRetries',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'value' => '0x8',

    'name' => 'PacketBuffers',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'value' => '0x0',

    'name' => 'DropProb',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'value' => '0x0',

    'name' => 'DropGracePeriod',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'value' => '0x0',

    'name' => 'FirstDropGracePeriod',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'value' => '0x0',

    'name' => 'DropBurstDuration',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'value' => '0x0',

    'name' => 'PacketPairDropProb',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'value' => '0x2',

    'name' => 'NackAlgorithm',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'value' => '0x1',

    'name' => 'NackRateMultiplier',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'value' => '0x5dc',

    'name' => 'NackBurst',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'value' => '0x3e8',

    'name' => 'NackTraceInterval',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'value' => '0x1',

    'name' => 'NackRetry',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'value' => '0x0',

    'name' => 'IgnoreServerVersion',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'value' => '0x0',

    'name' => 'EnableMmsDistribution',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'value' => '0x0',

    'name' => 'AssertStrangeErrors',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'value' => '0x5a',

    'name' => 'InactivityTimeout',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'value' => '0x20',

    'name' => 'OpenTimeout',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'value' => '0x55',

    'name' => 'PercentBWUsageForAccelStreaming',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'value' => '',

    'name' => 'FunnelAdapter',

    'type' => 'string'

    },

    {

    'opcode' => 'create',

    'value' => '',

    'name' => 'ControlAdapter',

    'type' => 'string'

    },

    {

    'opcode' => 'create',

    'value' => '0x0',

    'name' => 'Proxy Setting',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'value' => '',

    'name' => 'ProxyHostName',

    'type' => 'string'

    },

    {

    'opcode' => 'create',

    'value' => '0x6db',

    'name' => 'ProxyPort',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'value' => '0x0',

    'name' => 'ProxyBypassForLocal',

    'type' => 'int32'

    }

    ]

    }

    ]
    },
    {

    'opcode' => 'create',

    'comment' => [

    ' Properties '

    ],

    'name' => 'WMS Msbd Network Source',

    'node' => [

    {

    'opcode' => 'create',

    'value' => '{FB74F625-7D25-4455-B840-7B870B5B9322}',

    'name' => 'CLSID',

    'type' => 'string'

    },

    {

    'opcode' => 'create',

    'value' => '0x1',

    'name' => 'Enabled',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'name' => 'Properties',

    'node' => [

    {

    'opcode' => 'create',

    'value' => 'ASFM',

    'name' => 'Source Type',

    'type' => 'string'

    },

    {

    'opcode' => 'create',

    'value' => '0x8',

    'name' => 'PacketBuffers',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'value' => '0x0',

    'name' => 'DropProb',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'value' => '0x0',

    'name' => 'DropGracePeriod',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'value' => '0x0',

    'name' => 'FirstDropGracePeriod',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'value' => '0x0',

    'name' => 'DropBurstDuration',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'value' => '0x3a98',

    'name' => 'McastTimeout',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'value' => '0x1',

    'name' => 'EnableIGMPv3',

    'type' => 'int32'

    }

    ]

    }

    ]
    },
    {

    'opcode' => 'create',

    'comment' => [

    ' Properties '

    ],

    'name' => 'WMS Network Source',

    'node' => [

    {

    'opcode' => 'create',

    'value' => '{ad763fa6-3b90-41ab-bd44-4f832beee55f}',

    'name' => 'CLSID',

    'type' => 'string'

    },

    {

    'opcode' => 'create',

    'value' => '0x1',

    'name' => 'Enabled',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'name' => 'Properties',

    'node' => [

    {

    'opcode' => 'create',

    'value' => 'RTSP,XSDP,RTP,RTSPA,RTSPT,RTSPU,RTSPM',

    'name' => 'Source Type',

    'type' => 'string'

    },

    {

    'opcode' => 'create',

    'value' => '0x1',

    'name' => 'EnableATM',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'value' => '0x0',

    'name' => 'MaximumMTU',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'value' => '0x14',

    'name' => 'FirewallTimeout',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'value' => '0x1e',

    'name' => 'OpenTimeout',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'value' => '0x0',

    'name' => 'RtxDropProb',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'value' => '0x0',

    'name' => 'DropProb',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'value' => '0x0',

    'name' => 'DropGracePeriod',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'value' => '0x0',

    'name' => 'FirstDropGracePeriod',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'value' => '0x0',

    'name' => 'DropBurstDuration',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'value' => '0x0',

    'name' => 'PacketPairDropProb',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'value' => '0x2',

    'name' => 'NackAlgorithm',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'value' => '0x1',

    'name' => 'NackRateMultiplier',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'value' => '0x5dc',

    'name' => 'NackBurst',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'value' => '0x3e8',

    'name' => 'NackTraceInterval',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'value' => '0x1',

    'name' => 'NackRetry',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'value' => '0x0',

    'name' => 'BurstProtection',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'value' => '0x0',

    'name' => 'EmulateNetworkDisconnect',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'value' => '0x0',

    'name' => 'AssertStrangeErrors',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'value' => '0x55',

    'name' => 'PercentBWUsageForAccelStreaming',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'value' => '0x0',

    'name' => 'Proxy Setting',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'value' => '',

    'name' => 'ProxyHostName',

    'type' => 'string'

    },

    {

    'opcode' => 'create',

    'value' => '0x22a',

    'name' => 'ProxyPort',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'value' => '0x0',

    'name' => 'ProxyBypassForLocal',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'value' => '0x3e8',

    'name' => 'PktGracePeriodAtEOSForBPP',

    'type' => 'int32'

    },

    {

    'opcode' => 'create',

    'value' => '0x9c4',

    'name' => 'PktGracePeriodAtEOSForODP',

    'type' => 'int32'

    }

    ]

    }

    ]
    }
    ]
    },
    {
    'opcode' => 'create',
    'name' => 'Shared Properties',
    'node' => [
    {

    'opcode' => 'create',

    'name' => 'Local'
    }
    ]
    }
    ]
    }
    ]
    }
    };
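
    Reading values back out of a structure like the one dumped
    above is plain dereferencing. A small sketch, using a hand-built
    fragment with the same shape as the dump (trimmed to two levels
    for illustration; the sub name node_names is made up):

```perl
use strict;
use warnings;

# Hand-copied fragment with the same shape as the dump above.
my $data = {
    document => {
        WMSNameSpaceVersion => '2.0',
        node => [
            { name => 'Control Protocol', opcode => 'create',
              node => [ { name => 'Object Store', opcode => 'create' } ] },
            { name => 'Data Protocol', opcode => 'create' },
        ],
    },
};

# Collect every node name, indented by nesting depth.
sub node_names {
    my ($nodes, $depth) = @_;
    my @out;
    for my $n (@{ $nodes || [] }) {      # a missing 'node' key means no children
        push @out, ('  ' x $depth) . $n->{name};
        push @out, node_names($n->{node}, $depth + 1);
    }
    return @out;
}

my @names = node_names($data->{document}{node}, 0);
print "$_\n" for @names;
```

    Because of force array, 'node' is always an array reference
    when it exists, so the same loop works at every level.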

    __DATA__

    <document WMSNameSpaceVersion="2.0">

    <node name="Control Protocol" opcode="create" >
    <node name="Object Store" opcode="create" >
    <node name="RTSP" opcode="create" >
    <node name="CLSID" opcode="create" type="string"
    value="{308786f0-8b15-11d2-b25f-006097d2e41e}" />
    <node name="Enabled" opcode="create" type="int32" value="0x1"
    />
    <node name="Properties" opcode="create" >
    <node name="Protocol" opcode="create" type="string"
    value="RTSP,RTSPA,RTSPT,RTSPU,RTSPM" />
    </node> <!-- Properties -->

    </node> <!-- RTSP -->

    <node name="Sessionless Multicast" opcode="create" >
    <node name="CLSID" opcode="create" type="string"
    value="{f9377800-f38d-11d2-b26c-006097d2e41e}" />
    <node name="Enabled" opcode="create" type="int32" value="0x1"
    />
    <node name="Properties" opcode="create" >
    <node name="Protocol" opcode="create" type="string"
    value="MCAST,RTP" />
    </node> <!-- Properties -->

    </node> <!-- Sessionless Multicast -->

    </node> <!-- Object Store -->

    <node name="Shared Properties" opcode="create" />
    </node> <!-- Control Protocol -->

    <node name="Data Protocol" opcode="create" >
    <node name="Object Store" opcode="create" >
    <node name="RTP" opcode="create" >
    <node name="CLSID" opcode="create" type="string"
    value="{cbfb2e20-ab7b-11d2-b261-006097d2e41e}" />
    <node name="Enabled" opcode="create" type="int32" value="0x1"
    />
    <node name="Properties" opcode="create" >
    <node name="Format" opcode="create" type="string"
    value="x-asf-pf" />
    <node name="Protocol" opcode="create" type="string"
    value="RTP/AVP" />
    </node> <!-- Properties -->

    </node> <!-- RTP -->

    <node name="RTP/ASF" opcode="create" >
    <node name="CLSID" opcode="create" type="string"
    value="{149a44be-dc14-4e94-9cb0-c0268e77df9e}" />
    <node name="Enabled" opcode="create" type="int32" value="0x1"
    />
    <node name="Properties" opcode="create" >
    <node name="Format" opcode="create" type="string"
    value="x-asfv2-pf,x-asfv2-grp-pf,x-asfv2-frag-pf" />
    <node name="Protocol" opcode="create" type="string"
    value="RTP/AVP" />
    </node> <!-- Properties -->

    </node> <!-- RTP/ASF -->

    <node name="RTP/AVP" opcode="create" >
    <node name="CLSID" opcode="create" type="string"
    value="{d7335e2e-62eb-4ad0-96cd-b31c9d0f9f85}" />
    <node name="Enabled" opcode="create" type="int32" value="0x1"
    />
    <node name="Properties" opcode="create" >
    <node name="Format" opcode="create" type="string"
    value="PCMU,L8,L16,MPA,G726-24,G726-40" />
    <node name="Protocol" opcode="create" type="string"
    value="RTP/AVP" />
    </node> <!-- Properties -->

    </node> <!-- RTP/AVP -->

    <node name="RTP/FEC" opcode="create" >
    <node name="CLSID" opcode="create" type="string"
    value="{02DEFE42-F8FC-11d2-8670-00C04F6890ED}" />
    <node name="Enabled" opcode="create" type="int32" value="0x1"
    />
    <node name="Properties" opcode="create" >
    <node name="Format" opcode="create" type="string"
    value="parityfec" />
    <node name="Protocol" opcode="create" type="string"
    value="RTP/AVP" />
    </node> <!-- Properties -->

    </node> <!-- RTP/FEC -->

    <node name="RTP/WMS-FEC" opcode="create" >
    <node name="CLSID" opcode="create" type="string"
    value="{EDAB8E6B-746C-40db-A885-9E4A9EEF27A2}" />
    <node name="Enabled" opcode="create" type="int32" value="0x1"
    />
    <node name="Properties" opcode="create" >
    <node name="Format" opcode="create" type="string"
    value="wms-fec" />
    <node name="Protocol" opcode="create" type="string"
    value="RTP/AVP" />
    </node> <!-- Properties -->

    </node> <!-- RTP/WMS-FEC -->

    </node> <!-- Object Store -->

    <node name="Shared Properties" opcode="create" />
    </node> <!-- Data Protocol -->

    <node name="Feedback Protocol" opcode="create" >
    <node name="Object Store" opcode="create" >
    <node name="RTCP" opcode="create" >
    <node name="CLSID" opcode="create" type="string"
    value="{ecfddc81-184e-11d3-ae84-00a0c95ec3f0}" />
    <node name="Enabled" opcode="create" type="int32" value="0x1"
    />
    <node name="Properties" opcode="create" >
    <node name="Format" opcode="create" type="string"
    value="x-wms-rtx" />
    <node name="Protocol" opcode="create" type="string"
    value="RTP/AVP" />
    </node> <!-- Properties -->

    </node> <!-- RTCP -->

    </node> <!-- Object Store -->

    <node name="Shared Properties" opcode="create" />
    </node> <!-- Feedback Protocol -->

    <node name="Network Source" opcode="create" >
    <node name="Object Store" opcode="create" >
    <node name="WMS Http Network Source" opcode="create" >
    <node name="CLSID" opcode="create" type="string"
    value="{566A2EFF-5651-4020-AC1A-EB48E4571EA3}" />
    <node name="Enabled" opcode="create" type="int32" value="0x1"
    />
    <node name="Properties" opcode="create" >
    <node name="Source Type" opcode="create" type="string"
    value="HTTP" />
    <node name="DefaultHttpServerPort" opcode="create"
    type="int32" value="0x50" />
    <node name="DefaultHttpServerSSLPort" opcode="create"
    type="int32" value="0x1bb" />
    <node name="PacketBuffers" opcode="create" type="int32"
    value="0x8" />
    <node name="EnableHTTP1_1" opcode="create" type="int32"
    value="0x1" />
    <node name="OpenTimeout" opcode="create" type="int32"
    value="0x1e" />
    <node name="SecondSegmentTimeout" opcode="create"
    type="int32" value="0x64" />
    <node name="ControlAdapter" opcode="create" type="string"
    value="" />
    <node name="PercentBWUsageForAccelStreaming" opcode="create"
    type="int32" value="0x55" />
    <node name="Proxy Setting" opcode="create" type="int32"
    value="0x3" />
    <node name="ProxyHostName" opcode="create" type="string"
    value="" />
    <node name="ProxyPort" opcode="create" type="int32"
    value="0x50" />
    <node name="ProxyBypassForLocal" opcode="create"
    type="int32" value="0x0" />
    </node> <!-- Properties -->

    </node> <!-- WMS Http Network Source -->

    <node name="WMS Mms Network Source" opcode="create" >
    <node name="CLSID" opcode="create" type="string"
    value="{DCF6C8B2-F6C0-461b-82DA-35945EADF54A}" />
    <node name="Enabled" opcode="create" type="int32" value="0x1"
    />
    <node name="Properties" opcode="create" >
    <node name="Source Type" opcode="create" type="string"
    value="MMS,MMST,MMSU" />
    <node name="DefaultServerPort" opcode="create" type="int32"
    value="0x6db" />
    <node name="MaxReadHeaderRetries" opcode="create"
    type="int32" value="0x4" />
    <node name="PacketBuffers" opcode="create" type="int32"
    value="0x8" />
    <node name="DropProb" opcode="create" type="int32"
    value="0x0" />
    <node name="DropGracePeriod" opcode="create" type="int32"
    value="0x0" />
    <node name="FirstDropGracePeriod" opcode="create"
    type="int32" value="0x0" />
    <node name="DropBurstDuration" opcode="create" type="int32"
    value="0x0" />
    <node name="PacketPairDropProb" opcode="create" type="int32"
    value="0x0" />
    <node name="NackAlgorithm" opcode="create" type="int32"
    value="0x2" />
    <node name="NackRateMultiplier" opcode="create" type="int32"
    value="0x1" />
    <node name="NackBurst" opcode="create" type="int32"
    value="0x5dc" />
    <node name="NackTraceInterval" opcode="create" type="int32"
    value="0x3e8" />
    <node name="NackRetry" opcode="create" type="int32"
    value="0x1" />
    <node name="IgnoreServerVersion" opcode="create"
    type="int32" value="0x0" />
    <node name="EnableMmsDistribution" opcode="create"
    type="int32" value="0x0" />
    <node name="AssertStrangeErrors" opcode="create"
    type="int32" value="0x0" />
    <node name="InactivityTimeout" opcode="create" type="int32"
    value="0x5a" />
    <node name="OpenTimeout" opcode="create" type="int32"
    value="0x20" />
    <node name="PercentBWUsageForAccelStreaming" opcode="create"
    type="int32" value="0x55" />
    <node name="FunnelAdapter" opcode="create" type="string"
    value="" />
    <node name="ControlAdapter" opcode="create" type="string"
    value="" />
    <node name="Proxy Setting" opcode="create" type="int32"
    value="0x0" />
    <node name="ProxyHostName" opcode="create" type="string"
    value="" />
    <node name="ProxyPort" opcode="create" type="int32"
    value="0x6db" />
    <node name="ProxyBypassForLocal" opcode="create"
    type="int32" value="0x0" />
    </node> <!-- Properties -->
    robic0, Dec 21, 2005
    #1

  2. robic0 <> wrote:


    > ## -- XML start & end regexp substitution delimeter chars --



    delimeter: noun, scale used to weigh and price cold cuts.
    also the unit of length for salamis. -- Uri Guttman

    (Message-ID: <>)


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
    Tad McClellan, Dec 21, 2005
    #2

  3. robic0

    mirod Guest

    Re: My Regexp XML Parser -> Structured Perl Data, Cut & Paste Version, No Module's (Vol I)

    robic0 wrote:
    > This post is in response to someone who asked for help trying to
    > parse xml into a data structure. The poster couldn't install
    > XML::parser or XML::Simple. I replied a few times with some
    > partial code. Good to my word, here is the core of a cut & paste
    > non-Perl-module based, raw, robust data xml parser into Perl
    > data structures. Its about 140 lines of code. I imagine its
    > about 3 times faster than the XML parsers out there, didn't time
    > it. It doesen't use the overhead of SAX or nodes.


    This does not seem to be an XML parser. For example a (very!) cursory
    glance seems to indicate that it considers [0-9a-zA-Z]+ to be a NAME
    (tag or attribute name), where the XML spec shows it is a tad more
    complex (see http://www.xml.com/axml/target.html#NT-Name).
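
    To illustrate the point about names, here is a minimal sketch, assuming an
    ASCII-only approximation of the XML 1.0 Name production (the spec also
    allows a large set of non-ASCII characters). $NAME and is_xml_name are
    hypothetical names, not part of the posted code.

```perl
use strict;
use warnings;

# ASCII-only approximation of the XML 1.0 Name production (an assumption made
# for brevity; the spec also allows many non-ASCII characters).
# A Name may start with a letter, "_" or ":", and may continue with
# digits, "-" and "." as well.
my $NAME = qr/[A-Za-z_:][A-Za-z0-9_:.\-]*/;

# Hypothetical helper: true if the whole string is a Name under this approximation.
sub is_xml_name {
    my ($s) = @_;
    return $s =~ /\A$NAME\z/ ? 1 : 0;
}

for my $candidate ('in2', 'html:br', 'my-tag', '_private', '2fast', 'a b') {
    printf "%-10s %s\n", $candidate, is_xml_name($candidate) ? 'ok' : 'not a Name';
}
```

    Note that [0-9a-zA-Z]+ rejects names such as html:br and my-tag, and
    accepts 2fast, which is not a legal Name because it starts with a digit.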

    Writing a complete XML parser is fairly hard, indeed a lot harder than
    writing a quasi-XML parser, like what you wrote.

    You could have referred the OP to SOAP::Lite
    (http://search.cpan.org/dist/SOAP-Lite/), which includes a pure-perl
    XML::Parser replacement (with some explicit limitations).

    As it is I think your code is a bit dangerous, as it risks being
    re-used by people who will not understand its limitations.

    --
    mirod
    mirod, Dec 21, 2005
    #3
  4. robic0

    Matt Garrish Guest

    Re: My Regexp XML Parser -> Structured Perl Data, Cut & Paste Version, No Module's (Vol I)

    "mirod" <> wrote in message
    news:...
    > robic0 wrote:
    >> This post is in response to someone who asked for help trying to
    >> parse xml into a data structure. The poster couldn't install
    >> XML::parser or XML::Simple. I replied a few times with some
    >> partial code. Good to my word, here is the core of a cut & paste
    >> non-Perl-module based, raw, robust data xml parser into Perl
    >> data structures. Its about 140 lines of code. I imagine its
    >> about 3 times faster than the XML parsers out there, didn't time
    >> it. It doesen't use the overhead of SAX or nodes.

    >
    > This does not seem to be an XML parser. For example a (very!) cursory
    > glance seems to indicate that it considers [0-9a-zA-Z]+ to be a NAME
    > (tag or attribute name), where the XML spec shows it is a tad more
    > complex (see http://www.xml.com/axml/target.html#NT-Name).
    >
    > Writing a complete XML parser is fairly hard, indeed a lot harder than
    > writing a quasi-XML parser, like what you wrote.
    >


    It's always good to point out garbage when one sees it, but it's well known
    (proven through numerous posts) that rob knows nothing about xml or markup
    languages in general. He's probably just looking for an excuse to swear and
    call himself a code god (or whatever he's into these days), so don't be
    surprised if that's what you get (i.e., don't bother responding).

    Matt
    Matt Garrish, Dec 21, 2005
    #4
  5. robic0

    robic0 Guest

    On Tue, 20 Dec 2005 23:59:06 -0800, robic0 wrote:

    >Posting changes tommorow on this.
    >Contact info:
    >email: robic0-AT-yahoo.com
    >


    A lot of bug fixes and modifications.
    The first version had many problems.
    This is a clean version (.9) with options:
    ForceArray
    KeepRoot
    KeepComments

    This works exceptionally well... Let me know
    if you try it.
    I'm so burned out on this there probably won't
    be any updates for a long time unless
    I change my mind.

    See ya

    print <<EOM;
    # XML Regex Parser
    # Version .9
    # 12/21/05
    # Copyright 2005,
    # by robic0-At-yahoo.com
    # -----------------------
    EOM

    use strict;
    use warnings;
    use Data::Dumper;

    #open DATA, "datafile" or die "can't open datafile...";
    #my $gabage1 = join ('', <DATA>); # slurp the whole file, not just the first line
    #close DATA;


    my $gabage2 = '

    <XMLDATA>
    <Submission SubmissionID="688904">
    <Category CategoryName="Storage/Adapter or Controller">
    <Driver FolderName="driver000">
    <Language LanguageName="English">
    <PackageCreationLocation
    FolderName="G:\truyen\WHQL\Athena\raid\driver" />
    </Language>
    </Driver>
    </Category>
    </Submission>
    </XMLDATA>
    ';

    my $gabage3 = '

    <big name="asdf" date="33" >
    asdf
    <in1>
    <!-- howdy folks -->
    <in2>jjjj</in2>
    <small biz="wefwf" ueue = "second" />
    <in3>asbefas</in3>
    </in1>
    asdfb
    </big>

    ';

    my @xml_strings = ($gabage2, $gabage3);

    my $VERSION = .9;
    my $debug = 0;
    my $rmv_white_space = 1;
    my $ForceArray = 0;
    my $KeepRoot = 0;
    my $KeepComments = 1;

    ## -- XML, start & end regexp substitution delimiter chars --
    ## match side , substitution side
    ## -----------------------/-------------------------
    my @S_dlim = ('\[' , '['); # use these for reading (debug)
    my @E_dlim = ('\]' , ']');
    #my @S_dlim = (chr(140) , chr(140)); # use these for production
    #my @E_dlim = (chr(141) , chr(141));


    for (@xml_strings)
    {
    print "\n",'='x30,"\n$_\n\n";

    if ($rmv_white_space) {
    s/>[\s]+</></g;
    s/[\s]+</</g;
    s/>[\s]+/>/g;
    }
    my $ROOT = {}; # container
    my ($last_cnt, $cnt, $i) = (-1, 1, 0);

    # should only need 2 iterations max, but wth
    while ($cnt != $last_cnt && $i < 20)
    {
    $last_cnt = $cnt;

    ## <?XML-Version ?> , have to check the format of '<?'
    while (s/<\?([^<>]*)\?>//i) {} # to void xml versioning
    # while (s/<\?([^<>]*)\?>/$S_dlim[1]$cnt$E_dlim[1]/i) { print "$cnt <$1> = \n" if ($debug); $cnt++}

    ## <!-- Comments -->
    if (!$KeepComments) {
    while (s/<!--([^<>]*)-->//i) {} # to void comments
    } else {
    while (s/<!--([^<>]*)-->/$S_dlim[1]$cnt$E_dlim[1]/i) {
    print "$cnt <!-- --> = $1\n" if ($debug);
    $ROOT->{$cnt} = { comment => $1 };
    $cnt++;
    }
    # Comments, need to have "anything but <!-- nor --> here" (revisit)
    # while (s/<!--([^(<!--)^(-->)]*)-->/$S_dlim[1]$cnt$E_dlim[1]/i) { print "$cnt <!-- --> = $1\n" if ($debug); $cnt++}
    }
    ## <Tag/> , no content
    while (s/<([0-9a-zA-Z]+)\/>/$S_dlim[1]$cnt$E_dlim[1]/i) {
    print "$cnt <$1> = \n" if ($debug);
    $ROOT->{$cnt} = { $1 => '' };
    $cnt++;
    }
    ## <Tag Attributes/> , no content
    while (s/<([0-9a-zA-Z]+)([ ]+[0-9a-zA-Z]+[ ]*=[ ]*"[^<]*")+[ ]*\/>/$S_dlim[1]$cnt$E_dlim[1]/i) {
    print "$cnt <$1> = attr: $2\n" if ($debug);
    $ROOT->{$cnt} = { $1 => getAttrHash($2) };
    $cnt++;
    }
    ## <Tag> Content </Tag>
    while (s/<([0-9a-zA-Z]+)>([^<]*)<\/\1>/$S_dlim[1]$cnt$E_dlim[1]/i) {
    print "$cnt <$1> = $2\n" if ($debug);
    my $unknown = '';
    if (length($2) > 0) {
    my $hcontent = getContentHash($2, $ROOT);
    $unknown = $hcontent;
    if (keys (%{$hcontent}) > 1) {
    if (!$ForceArray) { adjustForSingleItemArrays ($hcontent); }
    } elsif (exists $hcontent->{'content'} && scalar(@{$hcontent->{'content'}}) == 1) {
    if ($ForceArray ) {
    $unknown = $hcontent->{'content'};
    } else {
    $unknown = ${$hcontent->{'content'}}[0];
    }
    }
    }
    $ROOT->{$cnt} = { $1 => $unknown };
    $cnt++;
    }
    ## <Tag Attributes> Content </Tag>
    while (s/<([0-9a-zA-Z]+)([ ]+[0-9a-zA-Z]+[ ]*=[ ]*"[^<]*")+[ ]*>([^<]*)<\/\1>/$S_dlim[1]$cnt$E_dlim[1]/i) {
    print "$cnt <$1> = attr: $2, content: $3\n" if ($debug);
    my $hattrib = getAttrHash($2);
    if (length($3) > 0) {
    my $hcontent = getContentHash($3, $ROOT);
    if (keys (%{$hcontent}) > 1) {
    if (!$ForceArray) { adjustForSingleItemArrays ($hcontent); }
    }
    while (my ($key,$val) = each (%{$hcontent})) {
    $hattrib->{$key} = $val;
    }
    }
    $ROOT->{$cnt} = { $1 => $hattrib };
    $cnt++;
    }
    if ($last_cnt != $cnt) {
    $i++ ; print "** End pass $i\n" if ($debug);
    }
    }
    if (/<|>/) {
    print "($i) XML problem: malformed, syntax or tag closure:\n$_";
    } else {
    print "\n** Itterations = $i\n** ForceArray = $ForceArray\n** KeepRoot = $KeepRoot\n** KeepComments = $KeepComments\n\n";
    #print Dumper($ROOT);
    my $outer_element = $cnt-1;
    if (exists $ROOT->{$outer_element}) {
    my $htodump = $ROOT->{$outer_element};
    if (!$KeepRoot && keys (%{$htodump}) == 1) {
    my ($key,$val) = each (%{$htodump});
    $htodump = $val;
    }
    my $tmp = {};
    %{$tmp} = %{$htodump};
    print Dumper($tmp);
    } else {print "nothing to output!\n";}
    }
    }
    ##
    sub adjustForSingleItemArrays
    {
    my $href = shift;
    ## if $val is an array ref and has one element
    ## set $href->{$key} equal to the element
    while (my ($key,$val) = each (%{$href})) {
    if (ref($val) eq "ARRAY") {
    if (scalar(@{$val}) == 1) {
    $href->{$key} = $val->[0];
    }
    }
    }
    }
    ##
    sub getAttrHash
    {
    my $attstr = shift;
    my $ahref = {};
    return $ahref unless (defined $attstr);
    while ($attstr =~ s/[ ]*([0-9a-zA-Z]+)[ ]*=[ ]*"([^=]*)"[ ]*//i) {
    $ahref->{$1} = $2;
    }
    return $ahref;
    }
    ##
    sub getContentHash
    {
    my ($attstr,$hStore) = @_;
    my $ahref = {};
    return $ahref unless (defined $attstr && defined $hStore);
    my @ary = ();
    while ($attstr =~ s/([^<$S_dlim[0]$E_dlim[0]]+)|$S_dlim[0]([\d]+)$E_dlim[0]//i) {
    if (defined $1) {
    push (@ary, $1);
    }
    elsif (defined $2 && exists $hStore->{$2}) {
    my ($key,$val) = each (%{$hStore->{$2}});
    if (exists $ahref->{$key}) {
    push (@{$ahref->{$key}}, $val);
    } else {
    $ahref->{$key} = [$val];
    }
    }
    }
    if (scalar(@ary) > 0) { $ahref->{'content'} = [@ary]; }
    ## if $val is an array ref and has one element and it
    ## is a hash ref, set {$key} equal to hash ref
    if (!$ForceArray) {
    while (my ($key,$val) = each (%{$ahref})) {
    if (ref($val) eq "ARRAY") {
    if (scalar(@{$val}) == 1 && ref($val->[0]) eq "HASH") {
    $ahref->{$key} = $val->[0];
    }
    }
    }
    }
    return $ahref;
    }

    __END__


    # XML Regex Parser
    # Version .9
    # 12/21/05
    # Copyright 2005,
    # by robic0-At-yahoo.com
    # -----------------------

    ==============================


    <XMLDATA>
    <Submission SubmissionID="688904">
    <Category CategoryName="Storage/Adapter or Controller">
    <Driver FolderName="driver000">
    <Language LanguageName="English">
    <PackageCreationLocation
    FolderName="G:\truyen\WHQL\Athena\raid\driver" />
    </Language>
    </Driver>
    </Category>
    </Submission>
    </XMLDATA>



    ** Itterations = 2
    ** ForceArray = 0
    ** KeepRoot = 0
    ** KeepComments = 1

    $VAR1 = {
    'Submission' => {
    'SubmissionID' => '688904',
    'Category' => {
    'Driver' => {
    'Language' => {
    'LanguageName' => 'English',
    'PackageCreationLocation' => {
    'FolderName' => 'G:\\truyen\\WHQL\\Athena\\raid\\driver'
    }
    },
    'FolderName' => 'driver000'
    },
    'CategoryName' => 'Storage/Adapter or Controller'
    }
    }
    };

    ==============================


    <big name="asdf" date="33" >
    asdf
    <in1>
    <!-- howdy folks -->
    <in2>jjjj</in2>
    <small biz="wefwf" ueue = "second" />
    <in3>asbefas</in3>
    </in1>
    asdfb
    </big>




    ** Itterations = 1
    ** ForceArray = 0
    ** KeepRoot = 0
    ** KeepComments = 1

    $VAR1 = {
    'date' => '33',
    'name' => 'asdf',
    'content' => [
    'asdf',
    'asdfb'
    ],
    'in1' => {
    'small' => {
    'ueue' => 'second',
    'biz' => 'wefwf'
    },
    'in2' => 'jjjj',
    'comment' => ' howdy folks ',
    'in3' => 'asbefas'
    }
    };
    robic0, Dec 22, 2005
    #5
  6. robic0

    robic0 Guest

    Re: My Regexp XML Parser -> Structured Perl Data, Cut & Paste Version, No Module's (Vol I)

    On 21 Dec 2005 13:01:21 -0800, "mirod" <> wrote:

    >robic0 wrote:
    >> This post is in response to someone who asked for help trying to
    >> parse xml into a data structure. The poster couldn't install
    >> XML::parser or XML::Simple. I replied a few times with some
    >> partial code. Good to my word, here is the core of a cut & paste
    >> non-Perl-module based, raw, robust data xml parser into Perl
    >> data structures. Its about 140 lines of code. I imagine its
    >> about 3 times faster than the XML parsers out there, didn't time
    >> it. It doesen't use the overhead of SAX or nodes.

    >
    >This does not seem to be an XML parser. For example a (very!) cursory
    >glance seems to indicate that it considers [0-9a-zA-Z]+ to be a NAME
    >(tag or attribute name), where the XML spec shows it is a tad more
    >complex (see http://www.xml.com/axml/target.html#NT-Name).
    >
    >Writing a complete XML parser is fairly hard, indeed a lot harder than
    >writing a quasi-XML parser, like what you wrote.
    >
    >You could have refered the OP to SOAP::Lite
    >(http://search.cpan.org/dist/SOAP-Lite/), which includes a pure-perl
    >XML::parser replacement (with some explicit limitations).
    >
    >As it is I think your code is a bit dangerous, as it risks being
    >re-used by people who will not understand its limitations


    Hey, I don't know how but you started a new "Re:" thread.
    I just posted mildly reworked code on the original thread.
    If you would like to try it out feel free.

    This is indeed xml parser framework logic. There is nothing left now
    but incidentals to bring it up to the XML spec, like tag naming and
    special character escape sequences ("&amp;", ...). It's not made the
    same as XML::Parser or SAX. This is something entirely different.
    The thrust was to parse the xml into a valid data structure.

    The direction this could take is anybody's guess but I have a lot
    of imagination. I don't think writing a complete xml parser is
    fairly hard. I wrote this framework in 4 days and I've used xml
    parsers before. The parsing is done purely with regexp however
    pulling out the data is real-time as the substitution progresses.
    As the substitution moves forward, the xml string shrinks so the
    subsequent regex searches get exponentially short resulting in
    an extremely efficient and fast parse.
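
    The substitution approach described above can be sketched as follows.
    This is a simplified illustration, not the posted parser: it assumes
    elements without attributes, names limited to ASCII alphanumerics, and
    "[n]" as the placeholder (as in the debug delimiter setting).
    reduce_leaves is a hypothetical name.

```perl
use strict;
use warnings;

# Simplified illustration only: attribute-less elements, ASCII alphanumeric
# names, and "[n]" placeholders (all assumptions of this sketch).
sub reduce_leaves {
    my ($xml) = @_;
    my %store;
    my $cnt = 1;
    # An element whose content holds no "<" has no unreduced children left,
    # so it can be replaced by a numbered placeholder and stored.
    while ($xml =~ s{<([0-9a-zA-Z]+)>([^<]*)</\1>}{[$cnt]}) {
        $store{$cnt} = { tag => $1, content => $2 };
        $cnt++;
    }
    return ($xml, \%store);
}

my ($rest, $store) = reduce_leaves('<a><b>x</b><c>y</c></a>');
print "remaining: $rest\n";
for my $n (sort { $a <=> $b } keys %$store) {
    print "$n <$store->{$n}{tag}> = $store->{$n}{content}\n";
}
```

    Each pass replaces an innermost element, so a parent only becomes
    matchable once all of its children have been turned into placeholders.
    Anything still containing "<" or ">" at the end was not reduced.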

    I welcome you to try it out. Perhaps do some time comparisons
    with any other parser out there. I may do some more on it
    in the next few days.

    Post to the thread I'm posting the code to so I can get your
    feedback. That is where I will post the next version.

    And pay no attention to Matt Garrish or Tad McClellan... my
    underlings!

    robic0
    --------------------------------
    "AMERICAN" and proud of it!
    robic0, Dec 22, 2005
    #6
  7. robic0

    robic0 Guest

    On Tue, 20 Dec 2005 23:59:06 -0800, robic0 wrote:

    >This post is in response to someone who asked for help trying to
    >parse xml into a data structure.


    This will fix the final issues with "ForceArray".
    Comments have an issue with enclosed "<" or ">" in this
    version, other than that they will process normally.
    It's a regex issue (shortcoming in my opinion) that can't
    match a "not" string. Where I need <!--(all but "<!--")-->.
    Where (.*)(?!<!--) won't work in an expression. But I'll
    work around that.
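
    For reference, Perl regexes can express "everything up to the first -->".
    A minimal sketch with a made-up sample string, showing a non-greedy match
    and the equivalent per-character negative lookahead:

```perl
use strict;
use warnings;

# Sample input (an assumption for illustration): comments containing "<", ">"
# and a newline.
my $xml = "<r><!-- a < b, and c > d --><x>1</x><!-- second\ncomment --></r>";

# Non-greedy: ".*?" stops at the first "-->"; /s lets "." match newlines.
my @lazy = $xml =~ /<!--(.*?)-->/sg;

# Equivalent "tempered" form: consume one character at a time as long as
# the closing "-->" does not start at the current position.
my @tempered = $xml =~ /<!--((?:(?!-->).)*)-->/sg;

print "lazy:     [$_]\n" for @lazy;
print "tempered: [$_]\n" for @tempered;
```

    Both forms capture the same two comment bodies here. The lookahead has to
    be applied before each character; placing (?!...) once after (.*) only
    tests a single position, which is why that form does not help.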

    Version .901 from 12-22-05 is the one you want.
    This is close to the last post as far as this newsgroup.
    Sorry, but I had to get it stable. I've run this on every
    big and weird xml file I could get my hands on. I'm
    satisfied with it.

    See ya...


    print <<EOM;

    # XML Regex Parser
    # Version .901 - 12/22/05
    # Copyright 2005,
    # by robic0-At-yahoo.com
    # -----------------------
    EOM

    use strict;
    use warnings;
    use Data::Dumper;

    #open DATA, "sumfile.xml" or die "can't open datafile...";
    #my $gabage1 = join ('', <DATA>);
    #close DATA;


    my $gabage3 = '

    <big name="asdf" date="33" >
    asdf
    <in1>
    <!-- howdy f*%$olks -->
    <in2>jjjj</in2>
    <small biz="wefwf" ueue = "second" />
    <!-- and still more -->
    <bar><inside>asgfasdf<insF>2</insF>sdfb</inside></bar>
    </in1>
    <in2>some in3 content</in2>
    asdfb
    </big>

    ';

    my @xml_strings = ($gabage3);

    my $VERSION = .901;
    my $debug = 1;
    my $rmv_white_space = 1;
    my $ForceArray = 0;
    my $KeepRoot = 0;
    my $KeepComments = 0;

    ## -- XML, start & end regexp substitution delimiter chars --
    ## match side , substitution side
    ## ----------------------/-------------------------------
    my @S_dlim = ('\[' , '['); # use these for debug
    my @E_dlim = ('\]' , ']');
    #my @S_dlim = (chr(140) , chr(140)); # use these for production
    #my @E_dlim = (chr(141) , chr(141));


    ## -- Process xml data --
    ##
    for (@xml_strings)
    {
    print "\n",'='x30,"\n$_\n\n";

    if ($rmv_white_space) {
    s/>[\s]+</></g;
    s/[\s]+</</g;
    s/>[\s]+/>/g;
    }
    my $ROOT = {}; # container
    my ($last_cnt, $cnt, $i) = (-1, 1, 0);

    # should only need 2 iterations max, but wth
    while ($cnt != $last_cnt && $i < 20)
    {
    $last_cnt = $cnt;

    ## <?XML-Version ?> , have to check the format of '<?'
    while (s/<\?([^<>]*)\?>//i) {} # to void xml versioning
    # while (s/<\?([^<>]*)\?>/$S_dlim[1]$cnt$E_dlim[1]/i) { print "$cnt <$1> = \n" if ($debug); $cnt++}

    ## <!-- Comments -->, nesting not processed,
    ## also comments can't have "<" or ">" this version.
    if (!$KeepComments) {
    while (s/<!--[^<>]*-->//s) {} # to void comments
    } else {
    while (s/<!--([^<>]*)-->/$S_dlim[1]$cnt$E_dlim[1]/s) {
    # while (s/<!--([\w\s]*)(?!<!--)-->/$S_dlim[1]$cnt$E_dlim[1]/s) {
    print "$cnt <!-- --> = $1\n" if ($debug);
    $ROOT->{$cnt} = { comment => $1 };
    $cnt++;
    }
    }
    ## <Tag/> , no content
    while (s/<([0-9a-zA-Z]+)\/>/$S_dlim[1]$cnt$E_dlim[1]/i) {
    print "$cnt <$1> = \n" if ($debug);
    $ROOT->{$cnt} = { $1 => '' };
    $cnt++;
    }
    ## <Tag Attributes/> , no content
    while (s/<([0-9a-zA-Z]+)([ ]+[0-9a-zA-Z]+[ ]*=[ ]*"[^<]*")+[ ]*\/>/$S_dlim[1]$cnt$E_dlim[1]/i) {
    print "$cnt <$1> = attr: $2\n" if ($debug);
    $ROOT->{$cnt} = { $1 => getAttrHash($2) };
    $cnt++;
    }
    ## <Tag> Content </Tag>
    while (s/<([0-9a-zA-Z]+)>([^<]*)<\/\1>/$S_dlim[1]$cnt$E_dlim[1]/i) {
    print "$cnt <$1> = $2\n" if ($debug);
    my $unknown = '';
    if (length($2) > 0) {
    my $hcontent = getContentHash($2, $ROOT);
    $unknown = $hcontent;
    if (keys (%{$hcontent}) > 1) {
    if (!$ForceArray) { adjustForSingleItemArrays ($hcontent); }
    } else {
    if (exists $hcontent->{'content'} && scalar(@{$hcontent->{'content'}}) == 1) {
    if (!$ForceArray ) {
    $unknown = ${$hcontent->{'content'}}[0];
    } else {$unknown = $hcontent->{'content'}; }
    }
    if (!$ForceArray) { adjustForSingleItemArrays ($hcontent); }
    }
    }
    $ROOT->{$cnt} = { $1 => $unknown };
    $cnt++;
    }
    ## <Tag Attributes> Content </Tag>
    while (s/<([0-9a-zA-Z]+)([ ]+[0-9a-zA-Z]+[ ]*=[ ]*"[^<]*")+[ ]*>([^<]*)<\/\1>/$S_dlim[1]$cnt$E_dlim[1]/i) {
    print "$cnt <$1> = attr: $2, content: $3\n" if ($debug);
    my $hattrib = getAttrHash($2);
    if (length($3) > 0) {
    my $hcontent = getContentHash($3, $ROOT);
    if (!$ForceArray) { adjustForSingleItemArrays ($hcontent); }
    while (my ($key,$val) = each (%{$hcontent})) {
    $hattrib->{$key} = $val;
    }
    }
    $ROOT->{$cnt} = { $1 => $hattrib };
    $cnt++;
    }
    if ($last_cnt != $cnt) {
    $i++ ; print "** End pass $i\n" if ($debug);
    }
    }
    if (/<|>/) {
    print "($i) XML problem: malformed, syntax or tag closure:\n$_";
    } else {
    print "\n** Itterations = $i\n** ForceArray = $ForceArray\n** KeepRoot = $KeepRoot\n** KeepComments = $KeepComments\n\n";
    #print Dumper($ROOT);
    my $outer_element = $cnt-1;
    if (exists $ROOT->{$outer_element}) {
    my $htodump = $ROOT->{$outer_element};
    if (!$KeepRoot && keys (%{$htodump}) == 1) {
    my ($key,$val) = each (%{$htodump});
    $htodump = $val;
    }
    my $tmp = {};
    %{$tmp} = %{$htodump};
    print Dumper($tmp);
    } else {print "nothing to output!\n";}
    }
    }
    ##
    sub adjustForSingleItemArrays
    {
    my $href = shift;
    ## if $val is an array ref and has one element
    ## set $href->{$key} equal to the element
    while (my ($key,$val) = each (%{$href})) {
    if (ref($val) eq "ARRAY") {
    if (scalar(@{$val}) == 1) {
    $href->{$key} = $val->[0];
    }
    }
    }
    }
    ##
    sub getAttrHash
    {
    my $attstr = shift;
    my $ahref = {};
    return $ahref unless (defined $attstr);
    while ($attstr =~ s/[ ]*([0-9a-zA-Z]+)[ ]*=[ ]*"([^=]*)"[ ]*//i) {
    $ahref->{$1} = $2;
    }
    return $ahref;
    }
    ##
    sub getContentHash
    {
    my ($attstr,$hStore) = @_;
    my $ahref = {};
    return $ahref unless (defined $attstr && defined $hStore);
    my @ary = ();
    while ($attstr =~ s/([^<$S_dlim[0]$E_dlim[0]]+)|$S_dlim[0]([\d]+)$E_dlim[0]//i) {
    if (defined $1) {
    push (@ary, $1);
    }
    elsif (defined $2 && exists $hStore->{$2}) {
    my ($key,$val) = each (%{$hStore->{$2}});
    if (exists $ahref->{$key}) {
    push (@{$ahref->{$key}}, $val);
    } else {
    $ahref->{$key} = [$val];
    }
    }
    }
    if (scalar(@ary) > 0) { $ahref->{'content'} = [@ary]; }
    ## if $val is an array ref and has one element and it
    ## is a hash ref, set {$key} equal to hash ref
    if (!$ForceArray) {
    while (my ($key,$val) = each (%{$ahref})) {
    if (ref($val) eq "ARRAY") {
    if (scalar(@{$val}) == 1 && ref($val->[0]) eq "HASH") {
    $ahref->{$key} = $val->[0];
    }
    }
    }
    }
    return $ahref;
    }

    __END__


    # XML Regex Parser
    # Version .901 - 12/22/05
    # Copyright 2005,
    # by robic0-At-yahoo.com
    # -----------------------

    ==============================


    <big name="asdf" date="33" >
    asdf
    <in1>
    <!-- howdy f*%$olks -->
    <in2>jjjj</in2>
    <small biz="wefwf" ueue = "second" />
    <!-- and still more -->
    <bar><inside>asgfasdf<insF>2</insF>sdfb</inside></bar>
    </in1>
    <in2>some in3 content</in2>
    asdfb
    </big>



    1 <small> = attr: biz="wefwf" ueue = "second"
    2 <in2> = jjjj
    3 <insF> = 2
    4 <inside> = asgfasdf[3]sdfb
    5 <bar> = [4]
    6 <in1> = [2][1][5]
    7 <in2> = some in3 content
    8 <big> = attr: name="asdf" date="33", content: asdf[6][7]asdfb
    ** End pass 1

    ** Itterations = 1
    ** ForceArray = 0
    ** KeepRoot = 0
    ** KeepComments = 0

    $VAR1 = {
    'in2' => 'some in3 content',
    'date' => '33',
    'name' => 'asdf',
    'content' => [
    'asdf',
    'asdfb'
    ],
    'in1' => {
    'small' => {
    'ueue' => 'second',
    'biz' => 'wefwf'
    },
    'bar' => {
    'inside' => {
    'insF' => '2',
    'content' => [
    'asgfasdf',
    'sdfb'
    ]
    }
    },
    'in2' => 'jjjj'
    }
    };
    robic0, Dec 23, 2005
    #7
  8. robic0 <> wrote:

    > Comments have an issue with enclosed "<" or ">" in this
    > version, other than that they will process normally.
    > Its a regex issue (shortcoming in my opinion)



    Then you do not understand the mathematics underpinning
    regular expressions (ie. set theory).


    > that can't
    > match a "not" string. Where I need <!--(all but "<!--")-->.



    If you are processing XML, then you do not need that, as
    Comment Declarations cannot be nested.


    > This is version .901 from 12-22-05 is the one you want.



    No sensible person will want XML processing code written by
    someone who has demonstrated repeatedly that they do not
    understand the data that is being processed.


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
    Tad McClellan, Dec 23, 2005
    #8
  9. Re: My Regexp XML Parser -> Structured Perl Data, Cut & Paste Version, No Module's (Vol I)

    robic0 wrote:

    > On Tue, 20 Dec 2005 23:59:06 -0800, robic0 wrote:
    >
    > >This post is in response to someone who asked for help trying to
    > >parse xml into a data structure.

    >
    > This will fix the final issues with "ForceArray".
    > Comments have an issue with enclosed "<" or ">" in this
    > version, other than that they will process normally.
    > Its a regex issue (shortcoming in my opinion) that can't
    > match a "not" string. Where I need <!--(all but "<!--")-->.
    > Where (.*)(?!<!--) won't work in an expression. But I'll
    > work around that.
    >
    > This is version .901 from 12-22-05 is the one you want.
    > This is close to the last post as far as this newsgroup.
    > Sorry, but I had to get it stable. I've run this on every
    > big and wierd xml file I could get my hands on. I'm
    > satisfied with it.


    [ code snipped ]

    It's very hard to run your code. You are messing up the line ends in
    your post. I've uploaded a corrected version to
    www.dotinternet.be/temp/code.txt.

    Your software produces errors when using namespaces:

    <?xml version="1.0" encoding="UTF-8"?>
    <root xmlns:html="http://www.w3.org/TR/REC-html-4.0">
    <mytag>content</mytag>
    <html:br/>
    </root>

    Your software produces errors when using a DOCTYPE:

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <root>
    <mytag>content</mytag>
    </root>

    Your software produces errors when attribute values are enclosed by
    single quotes (') instead of double quotes ("):

    <?xml version='1.0' encoding='UTF-8'?>
    <root>
    <mytag myargument='argvalue'>content</mytag>
    </root>
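
    A minimal sketch of attribute parsing that accepts either quote style,
    assuming the same ASCII-only names as the posted code. parse_attrs is a
    hypothetical helper, not the posted getAttrHash.

```perl
use strict;
use warnings;

# Hypothetical variant of an attribute parser that accepts either quote style.
# Names are still limited to ASCII alphanumerics, as in the posted code.
sub parse_attrs {
    my ($attstr) = @_;
    my %attr;
    # \2 requires the closing quote to be the same character as the opening
    # one, so a value may contain the other kind of quote.
    while ($attstr =~ /([0-9a-zA-Z]+)\s*=\s*(["'])(.*?)\2/sg) {
        $attr{$1} = $3;
    }
    return \%attr;
}

my $attrs = parse_attrs(q{myargument='argvalue' other="it's"});
print "$_ => $attrs->{$_}\n" for sort keys %$attrs;
```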

    XML is case sensitive; your program doesn't seem to bother:

    <?xml version="1.0" encoding="UTF-8"?>
    <root>
    <mYTag myargument="argvalue">content</mytag>
    </root>
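
    The case issue comes from the /i modifier, which also makes the \1
    backreference compare case-insensitively. A small sketch using the same
    element pattern as the posted code, on a simplified version of the
    example above (attribute dropped for brevity):

```perl
use strict;
use warnings;

my $bad = '<mYTag>content</mytag>';   # not well-formed: names differ in case

# With /i the backreference \1 also compares case-insensitively, so the
# mismatched pair is accepted.
my $with_i    = $bad =~ m{<([0-9a-zA-Z]+)>([^<]*)</\1>}i ? 1 : 0;
# Without /i the end tag must repeat the start tag's name exactly.
my $without_i = $bad =~ m{<([0-9a-zA-Z]+)>([^<]*)</\1>}  ? 1 : 0;

print "with /i: $with_i, without /i: $without_i\n";
```

    Since the character class already lists both cases, dropping /i loses
    nothing for name matching.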

    I'm using Microsoft XP's XML parser to check the XML well-formedness.

    Your program has many shortcomings.

    --
    Bart
    Bart Van der Donck, Dec 23, 2005
    #9
  10. robic0

    robic0 Guest

    Re: My Regexp XML Parser -> Structured Perl Data, Cut & Paste Version, No Module's (Vol I)

    On 23 Dec 2005 15:31:03 -0800, "Bart Van der Donck" <>
    wrote:
    >
    >It's very hard to run your code. You are messing up the line ends in
    >your post. I 've uploaded a corrected version to
    >www.dotinternet.be/temp/code.txt.
    >

    Please don't correct and post code I've written on this.
    I'm taking it to a higher level every day. My thoughts on
    this won't take it where you want to go. It's my idea
    and I'll do just about anything I want with it! The code
    strain emanates from my creativity, I gave it birth and
    I will progress it. Email me, or post code on specific xml
    that doesn't work. Either you get an exception bail out
    or you get my general error. Not all xml constructs are
    implemented. !DOCTYPE not done yet. It's an infant now,
    just the basics. Trust me, I'm gonna do it all.

    If you got a host for me that would be great!
    I'm going to expand this to every xml construct out there.
    robic0, Dec 23, 2005
    #10
  11. robic0

    robic0 Guest

    Re: My Regexp XML Parser -> Structured Perl Data, Cut & Paste Version, No Module's (Vol I)

    On 23 Dec 2005 15:31:03 -0800, "Bart Van der Donck" <>
    wrote:

    >robic0 wrote:
    >
    >> On Tue, 20 Dec 2005 23:59:06 -0800, robic0 wrote:
    >>
    >> >This post is in response to someone who asked for help trying to
    >> >parse xml into a data structure.

    [snip]
    >It's very hard to run your code. You are messing up the line ends in
    >your post.

    I'm not messing up "line ends"..
    > I 've uploaded a corrected version to
    >www.dotinternet.be/temp/code.txt.

    You didn't write the code, you can't correct it..
    >
    >Your software produces errors when using namespaces:
    >
    > <?xml version="1.0" encoding="UTF-8"?>
    > <root xmlns:html="http://www.w3.org/TR/REC-html-4.0">
    > <mytag>content</mytag>
    > <html:br/>
    > </root>
    >

    Uh, namespaces? wha where?
    >Your software produces errors when using a DOCTYPE:
    >
    > <?xml version="1.0" encoding="UTF-8"?>
    > <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    > "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    > <root>
    > <mytag>content</mytag>
    > </root>
    >

    "<!DOCTYPE..." is not implemented, don't use that xml

    >Your software produces errors when argument values are enclosed by `` '
    >´´ instead of `` " ´´:
    >
    > <?xml version='1.0' encoding='UTF-8'?>
    > <root>
    > <mytag myargument='argvalue'>content</mytag>
    > </root>
    >

    Ok, I'll give you that. If either ' or " is ok for attributes then
    I'll put it in.
    >XML is case sensitive; your program doesn't seem to bother:

    Thought that was the case, I turned off case sensitivity, I'll
    put it back on
    >
    > <?xml version="1.0" encoding="UTF-8"?>
    > <root>
    > <mYTag myargument="argvalue">content</mytag>
    > </root>
    >
    >I'm using Microsoft XP's XML parser to check the XML well-formedness.
    >
    >Your program has many shortcomings.
    >

    My program has a solid framework I wrote in 4 days. I've run it on
    every single MShit OS xml on my machine. It works perfect ...

    Don't know what you want. Either you want what I wrote or you just
    want to bust balls of a software designer. Can't figure out which you
    want. One more comment like the one above and I won't post a personal
    reply like this one!
    if ever you should
    robic0, Dec 24, 2005
    #11
  12. robic0

    Matt Garrish Guest

    Re: My Regexp XML Parser -> Structured Perl Data, Cut & Paste Version, No Module's (Vol I)

    <robic0> wrote in message news:...
    > On 23 Dec 2005 15:31:03 -0800, "Bart Van der Donck" <>
    > wrote:
    >>
    >>It's very hard to run your code. You are messing up the line ends in
    >>your post. I 've uploaded a corrected version to
    >>www.dotinternet.be/temp/code.txt.
    >>

    > Please don't correct and post code I've written on this.
    > I'm taking it to a higher level every day. My thoughts on
    > this won't take it where you want to go. It's my idea
    > and I'll do just about anything I want with it! The code
    > strain emanates from my creativity, I gave it birth and
    > I will progress it. Email me, or post code on specific xml
    > that doesn't work.


    I don't think anyone wants your garbage.

    Now how about the part where you start dealing with the fact that xml is not
    constrained to single lines. Your little toy has a lot of trouble with:

    <!-- comment out this section
    <oldroot>
    <oldstuff>oops!</oldstuff>
    </oldroot>
    -->

    and also:

    <myplace
    city="here"
    province="there"/>

    Maybe you should learn XML *before* trying to write this parser of yours.

    Matt
    Matt Garrish, Dec 24, 2005
    #12
  13. robic0

    robic0 Guest

    Re: My Regexp XML Parser -> Structured Perl Data, Cut & Paste Version, No Module's (Vol I)

    On Fri, 23 Dec 2005 16:19:25 -0800, robic0 wrote:

    >On 23 Dec 2005 15:31:03 -0800, "Bart Van der Donck" <>
    >wrote:
    >
    >>robic0 wrote:
    >>
    >>> On Tue, 20 Dec 2005 23:59:06 -0800, robic0 wrote:
    >>>
    >>> >This post is in response to someone who asked for help trying to
    >>> >parse xml into a data structure.

    >[snip]
    >>It's very hard to run your code. You are messing up the line ends in
    >>your post.

    >I'm not messing up "line ends"..
    >> I 've uploaded a corrected version to
    >>www.dotinternet.be/temp/code.txt.

    >You didn't write the code, you can't correct it..
    >>
    >>Your software produces errors when using namespaces:
    >>
    >> <?xml version="1.0" encoding="UTF-8"?>
    >> <root xmlns:html="http://www.w3.org/TR/REC-html-4.0">
    >> <mytag>content</mytag>
    >> <html:br/>
    >> </root>
    >>

    >Uh, namespaces? wha where?

    <html:br/>
    ^
    Only \w characters are allowed in tag names right now.
    The ":" character can be allowed, but I won't do it until
    its ramifications are clear.
    Send me the spec on tag names, and on delimiters that run on
    without a space inside tags.
    I'll see what I can do.
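
    For context: in XML 1.0 a name may start with a letter, "_" or ":",
    and may continue with letters, digits, ".", "-", "_" or ":". The
    Namespaces spec then treats a single ":" as the separator between a
    prefix and a local name. A rough sketch, assuming ASCII-only names
    (the real production also admits many Unicode ranges); $name_re and
    split_qname are made-up names for illustration:

```perl
use strict;
use warnings;

# Simplified, ASCII-only approximation of an XML Name.
my $name_re = qr/[A-Za-z_:][A-Za-z0-9_:.\-]*/;

# Hypothetical helper: split a tag name into (prefix, local part).
# Returns an empty list if the string is not a valid name at all.
sub split_qname {
    my ($qname) = @_;
    return unless $qname =~ /\A$name_re\z/;
    my ($prefix, $local) = $qname =~ /\A([^:]+):([^:]+)\z/;
    return defined $local ? ($prefix, $local) : ('', $qname);
}

for my $tag ('html:br', 'mytag', '1bad') {
    my ($prefix, $local) = split_qname($tag);
    if (defined $local) { print "$tag: prefix='$prefix' local='$local'\n" }
    else                { print "$tag: not a valid name\n" }
}
```

    Resolving the prefix to the URI given in the xmlns:html attribute is
    a separate step that this sketch does not attempt.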
    robic0, Dec 24, 2005
    #13
  14. robic0

    robic0 Guest

    Re: My Regexp XML Parser -> Structured Perl Data, Cut & Paste Version, No Module's (Vol I)

    On Fri, 23 Dec 2005 19:29:45 -0500, "Matt Garrish"
    <> wrote:
    >Now how about the part where you start dealing with the fact that xml is not
    >constrained to single lines. Your little toy has a lot of trouble with:
    >

    Huh, constrained to single lines?
    Wha, where?

    ><!-- comment out this section
    ><oldroot>
    > <oldstuff>oops!</oldstuff>
    ></oldroot>
    >-->
    >

    Comments are a problem for now. I have a workaround
    for the near future. I've posted a general complaint
    about this Regex problem to the general forum.

    >and also:
    >
    ><myplace
    > city="here"
    > province="there"/>
    >

    General whitespace is not treated as a separator yet, only a literal
    space (" "). If that is XML compliant, I will implement it.
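
    For reference, XML does allow space, tab, CR and LF between a tag
    name and its attributes, so a start tag can legally span lines. A
    minimal sketch, assuming the whole document is held in one string
    rather than read line by line; match_start_tag is a hypothetical
    name, and the name pattern is again ASCII-only:

```perl
use strict;
use warnings;

# Hypothetical helper: match the first start tag (or empty-element tag)
# in $xml and return (tag name, attribute hashref, is_empty flag).
sub match_start_tag {
    my ($xml) = @_;
    # \s matches space, tab, CR and LF, so attributes may sit on
    # separate lines.
    return unless $xml =~
        m{<([\w:.-]+)((?:\s+[\w:.-]+\s*=\s*(?:"[^"]*"|'[^']*'))*)\s*(/?)>};
    my ($tag, $attr_text, $empty) = ($1, $2, $3);
    my %attr;
    while ($attr_text =~ /([\w:.-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')/g) {
        $attr{$1} = defined $2 ? $2 : $3;
    }
    return ($tag, \%attr, $empty eq '/' ? 1 : 0);
}

my $xml = qq{<myplace\n  city="here"\n\tprovince="there"/>};
my ($tag, $attrs, $is_empty) = match_start_tag($xml);
print "tag=$tag empty=$is_empty\n";
print "  $_=$attrs->{$_}\n" for sort keys %$attrs;
```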

    >Maybe you should learn XML *before* trying to write this parser of yours.


    Maybe you should not get or use any of my software. If I find out you did,
    I will sue you!!!!
    >
    >Matt
    >
    robic0, Dec 24, 2005
    #14
  15. Re: My Regexp XML Parser -> Structured Perl Data, Cut & Paste Version, No Module's (Vol I)

    robic0 <> wrote:


    > I will sue you!!!!



    I doubt it.

    You'd have to stop cowering behind anonymity to sue.

    You don't have the guts for it.


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
    Tad McClellan, Dec 24, 2005
    #15
  16. robic0

    Matt Garrish Guest

    Re: My Regexp XML Parser -> Structured Perl Data, Cut & Paste Version, No Module's (Vol I)

    <robic0> wrote in message news:...
    > On Fri, 23 Dec 2005 19:29:45 -0500, "Matt Garrish"
    > <> wrote:
    >>Now how about the part where you start dealing with the fact that xml is
    >>not
    >>constrained to single lines. Your little toy has a lot of trouble with:
    >>

    > Huh, constrained to single lines?
    > Wha, where?
    >
    >><!-- comment out this section
    >><oldroot>
    >> <oldstuff>oops!</oldstuff>
    >></oldroot>
    >>-->
    >>

    > Comments are a problem for now. I have a workaround
    > for the near future. I've posted a general complaint
    > about this Regex problem to the general forum.
    >
    >>and also:
    >>
    >><myplace
    >> city="here"
    >> province="there"/>
    >>

    > General whitespace is not treated as a separator yet, only a literal
    > space (" "). If that is XML compliant, I will implement it.
    >


    Exactly my point. The last XML processor I built took three weeks just to
    write the design for and another 1.5 months to build. And I didn't write my
    own parsers; I used a combination of DOM and SAX parsing. You don't know XML
    and are proud that you've spent four days designing and writing on the fly
    this parser of yours. Are you beginning to see why we don't take you
    seriously?

    >
    > Maybe you should not get or use any of my software. If I find out you did,
    > I will sue you!!!!
    >


    Maybe you should consider the legal ramifications of what you've done. You
    posted the code here asking for help fixing it on the premise that it is
    free and open code. By doing so, you've entered an agreement with everyone
    on clpm who responds in any way to your code that this will always be the
    case. Though I don't believe you could ever make a cent off it, bear in mind
    that I have a real cause for legal action if I find out you use this code in
    any commercial product (and that includes reproducing it for an employer).

    By the way, have you put any thought into the public interface for this
    thing? It's nice that it runs line-by-line and uses regexes to find tags,
    but that's totally useless for XML parsing. Does it handle events like a SAX
    parser? (Not that I see.) Does it build a parent/child tree? (Again, I don't
    see anywhere that you can tell what the relationship is between any set of
    tags.) Or is this just an exercise in writing regular expressions?

    Matt
    Matt Garrish, Dec 24, 2005
    #16
  17. robic0

    robic0 Guest

    Re: My Regexp XML Parser -> Structured Perl Data, Cut & Paste Version, No Module's (Vol I)

    On Sat, 24 Dec 2005 11:57:13 -0500, "Matt Garrish"
    <> wrote:

    >
    ><robic0> wrote in message news:...
    >> On Fri, 23 Dec 2005 19:29:45 -0500, "Matt Garrish"
    >> <> wrote:
    >>>Now how about the part where you start dealing with the fact that xml is
    >>>not
    >>>constrained to single lines. Your little toy has a lot of trouble with:
    >>>

    >> Huh, constrained to single lines?
    >> Wha, where?
    >>
    >>><!-- comment out this section
    >>><oldroot>
    >>> <oldstuff>oops!</oldstuff>
    >>></oldroot>
    >>>-->
    >>>

    >> Comments are a problem for now. I have a workaround
    >> for the near future. I've posted a general complaint
    >> about this Regex problem to the general forum.
    >>
    >>>and also:
    >>>
    >>><myplace
    >>> city="here"
    >>> province="there"/>
    >>>

    >> General whitespace is not treated as a separator yet, only a literal
    >> space (" "). If that is XML compliant, I will implement it.
    >>

    >
    >Exactly my point. The last XML processor I built took three weeks just to
    >write the design for and another 1.5 months to build. And I didn't write my
    >own parsers; I used a combination of DOM and SAX parsing. You don't know XML
    >and are proud that you've spent four days designing and writing on the fly
    >this parser of yours. Are you beginning to see why we don't take you
    >seriously?
    >
    >>
    >> Maybe you should not get or use any of my software. If I find out you did,
    >> I will sue you!!!!
    >>

    >
    >Maybe you should consider the legal ramifications of what you've done. You
    >posted the code here asking for help fixing it on the premise that it is
    >free and open code. By doing so, you've entered an agreement with everyone
    >on clpm who responds in any way to your code that this will always be the
    >case. Though I don't believe you could ever make a cent off it, bear in mind
    >that I have a real cause for legal action if I find out you use this code in
    >any commercial product (and that includes reproducing it for an employer).
    >

    Man you make me laff!

    >By the way, have you put any thought into the public interface for this
    >thing? It's nice that it runs line-by-line and uses regexes to find tags,
    >but that's totally useless for XML parsing. Does it handle events like a SAX
    >parser? (Not that I see.) Does it build a parent/child tree? (Again, I don't
    >see anywhere that you can tell what the relationship is between any set of
    >tags.) Or is this just an exercise in writing regular expressions?
    >
    >Matt
    >

    Since it's out of sequence, it's totally useless for event-driven SAX.
    However, in-line handling of contents could be redirected for
    special character handling.
    Specific accumulation of special "tag" data could be handled too.
    You have to think outside the box on this. The nesting of the data
    structure is definitely right on the money. Modifying that data
    in-line, or pulling off just the data you want, is no problem.

    To tell you the truth, there's a bunch this can do.
    You'd better try to stay off the "negative" machine a little more.
    Try the "positive" machine for a while. And oh well, if it flops,
    who cares, but it punches out some awesome timed data right now.
    The technique is new; in my opinion it's worth the effort.

    Keep the comments coming... I don't care if they're negative;
    they lead me in the right direction. If I have to swear to get
    some feedback, so be it.
    robic0, Dec 27, 2005
    #17
  18. robic0

    robic0 Guest

    On Tue, 20 Dec 2005 23:59:06 -0800, robic0 wrote:

    I'm back on the job.
    I'm going to post some new code this week that
    complies with XML spec.

    This is the solution for the Comment/CDATA paradigm
    that will be incorporated in the new version:

    use strict;
    use warnings;

    $_ = '
    <![CDATA[ <!-- imbed comment --> some text <!-- imbed as well -->]]>

    <!--
    wasdfvgasvbg <![CDATA[ not really a CDATA ]]>
    <tag>at tag in a real comment</tag>
    <![CDATA[ not a CDATA ]]>
    -->

    <!-- This is a real comment -->

    ';

    #### This section of the parser deals with
    #### non-markup constructs that can be embedded
    #### in each other (one inside the other, and so forth).
    #### So far just comments & CDATA.
    #### Use the general substitution magic.
    #### This is valid because neither comments
    #### nor CDATA sections can be nested.

    my $cnt = 1;
    my %root = ();
    my %cdata_elements = ();

    print "\n";

    # -- Comments (done first) --
    while (s/(<!--(.*?)-->)/[$cnt]/s) {
        $root{$cnt} = $1;
        print "$cnt = Questionable comment: $1\n";
        $cnt++;
    }
    print "\n\n", '=' x 60, "\n\nThe \"Real\" Stuff -->\n\n";

    # -- CDATA (done second) --
    while (s/<!\[CDATA\[(.*?)\]\]>/[$cnt]/s) {
        # reconstitute cdata element contents
        my $cdata_contents = $1;
        my $str = '';
        while ($cdata_contents =~ s/([^\[\]]+)|\[(\d+)\]//) {
            if (defined $1) {
                $str .= $1;
            }
            elsif (defined $2 && exists $root{$2}) {
                $str .= $root{$2};
                delete $root{$2};
            }
            else {
                my $j = 0;    # shouldn't get here
            }
        }
        $root{$cnt} = $str;
        $cdata_elements{$cnt} = '';

        print "\n$cnt = REAL CDATA: $root{$cnt}\n";
        $cnt++;
    }

    # -- Process leftover comments that are real --
    while (my ($key, $val) = each(%root)) {
        if (!defined $cdata_elements{$key}) {
            # This $root re-assignment is not really necessary
            # since $1 will contain the processing text that
            # will be processed here, then never used again.
            $root{$key} =~ s/<!--(.*?)-->/$1/s;
            print "\n$key = REAL COMMENT: $root{$key}\n";    # Or $1
        }
    }


    __END__

    1 = Questionable comment: <!-- imbed comment -->
    2 = Questionable comment: <!-- imbed as well -->
    3 = Questionable comment: <!--
    wasdfvgasvbg <![CDATA[ not really a CDATA ]]>
    <tag>at tag in a real comment</tag>
    <![CDATA[ not a CDATA ]]>
    -->
    4 = Questionable comment: <!-- This is a real comment -->


    ============================================================

    The "Real" Stuff -->


    5 = REAL CDATA: <!-- imbed comment --> some text <!-- imbed as well -->

    4 = REAL COMMENT: This is a real comment

    3 = REAL COMMENT:
    wasdfvgasvbg <![CDATA[ not really a CDATA ]]>
    <tag>at tag in a real comment</tag>
    <![CDATA[ not a CDATA ]]>
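
    A possible alternative to the two-pass placeholder substitution
    above (a sketch, not the author's method): scan the text once from
    left to right with a single alternation, so that whichever construct
    opens first swallows everything up to its own terminator. A comment
    opener inside a CDATA section is then never seen as a comment, and
    the reverse holds too. The sub name split_comments_cdata is
    hypothetical:

```perl
use strict;
use warnings;

# Hypothetical helper: return a list of [type, text] pairs, where type
# is 'comment', 'cdata' or 'other' (everything else, still unparsed).
sub split_comments_cdata {
    my ($xml) = @_;
    my @tokens;
    # \G anchors each match where the previous one ended, so nothing
    # is skipped and the constructs are seen in document order.
    while ($xml =~ /\G(?:<!--(.*?)-->|<!\[CDATA\[(.*?)\]\]>|((?:(?!<!--|<!\[CDATA\[).)+))/gcs) {
        if    (defined $1) { push @tokens, ['comment', $1] }
        elsif (defined $2) { push @tokens, ['cdata',   $2] }
        else               { push @tokens, ['other',   $3] }
    }
    return @tokens;
}

my $sample = '<![CDATA[ <!-- imbed comment --> some text ]]>'
           . '<!-- x <![CDATA[ not a CDATA ]]> -->'
           . '<tag>text</tag>';
printf "%-7s |%s|\n", @$_ for split_comments_cdata($sample);
```

    An unterminated comment or CDATA section simply stops the loop; a
    real parser would report that as an error instead.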
    robic0, Dec 27, 2005
    #18
  19. Re: My Regexp XML Parser -> Structured Perl Data, Cut & Paste Version, No Module's (Vol I)

    robic0 wrote:

    > I'm back on the job.
    > I'm going to post some new code this week that
    > complies with XML spec.


    There is more than meets the eye.

    An XML file may be well-formed, but invalid if it doesn't comply with
    its DTD. Would your program complain about that ?

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE root [
    <!ELEMENT root ((mytag|mytag2),myothertag+,notrequiredtag?)>
    <!ELEMENT mytag (#PCDATA)>
    <!ELEMENT myothertag (#PCDATA)>
    ]>
    <root>
    <mytag>content 1</mytag>
    <myothertag>content 2</myothertag>
    </root>

    What about the declaration of entities ?

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE root [
    <!ENTITY my_entity "this content was set by !ENTITY">
    ]>
    <root>
    <mytag>&my_entity;</mytag>
    <myothertag>content 2</myothertag>
    </root>
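
    To make the entity example concrete, here is a rough sketch of
    expanding internal general entities, assuming a single DOCTYPE with
    an internal subset and only simple quoted entity values. External
    (SYSTEM/PUBLIC) entities, parameter entities and nested entity
    references are deliberately not handled, and the sub name
    expand_internal_entities is made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: collect <!ENTITY name "value"> declarations from
# the DOCTYPE internal subset, drop the DOCTYPE, and expand &name; in
# the remaining text. Unknown references are left untouched.
sub expand_internal_entities {
    my ($xml) = @_;
    my %entity;
    if ($xml =~ s/<!DOCTYPE\s+[\w:.-]+\s*\[(.*?)\]\s*>//s) {
        my $subset = $1;
        while ($subset =~ /<!ENTITY\s+([\w:.-]+)\s+(["'])(.*?)\2\s*>/gs) {
            $entity{$1} = $3;
        }
    }
    $xml =~ s/&([\w:.-]+);/exists $entity{$1} ? $entity{$1} : "&$1;"/ge;
    return $xml;
}

my $doc = <<'XML';
<!DOCTYPE root [
<!ENTITY my_entity "this content was set by !ENTITY">
]>
<root>
<mytag>&my_entity;</mytag>
</root>
XML
print expand_internal_entities($doc);
```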

    What about an ATTLIST ?

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE root [
    <!ATTLIST mytag
    att1 CDATA #REQUIRED
    att2 CDATA #IMPLIED>
    <!ATTLIST myothertag att3 CDATA #FIXED
    "this content was set by !ATTLIST">
    ]>
    <root>
    <mytag att1="attvalue1" att2="attvalue2">content 1</mytag>
    <myothertag>content 2</myothertag>
    </root>

    What are you going to do with specific XSL tags ?

    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <root>
    <xsl:sort select="@ID" order="ascending" />
    <mytag>
    <xsl:attribute name='{name()}'>
    <xsl:value-of select="." />
    </xsl:attribute>
    </mytag>
    </root>
    </xsl:stylesheet>

    What about the rules from an XML schema ?

    <?xml version="1.0" encoding="UTF-8"?>
    <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
    <xsd:element name="root">
    <xsd:complexType>
    <xsd:sequence>
    <xsd:element ref="mytag" maxOccurs="unbounded" />
    </xsd:sequence>
    </xsd:complexType>
    </xsd:element>
    </xsd:schema>

    It would be a good idea to decode numeric character references:

    <?xml version="1.0" encoding="UTF-8"?>
    <root>
    <mytag>&#105;</mytag>
    </root>

    Same for the non-numeric ones:

    <?xml version="1.0" encoding="UTF-8"?>
    <root>
    <mytag>&amp;</mytag>
    </root>
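
    Both kinds of reference can be decoded in one substitution pass. A
    minimal sketch, assuming only the five entities predefined by XML
    plus decimal and hexadecimal character references; decode_refs and
    %predefined are hypothetical names. Unknown entity names are left
    untouched:

```perl
use strict;
use warnings;

# The five entities every XML processor must recognise.
my %predefined = (
    amp => '&', lt => '<', gt => '>', quot => '"', apos => "'",
);

# Hypothetical helper: decode &#NNN;, &#xHH; and the predefined names.
sub decode_refs {
    my ($text) = @_;
    $text =~ s{&(?:#x([0-9A-Fa-f]+)|#([0-9]+)|(\w+));}{
        defined $1 ? chr(hex $1)
      : defined $2 ? chr($2)
      : exists $predefined{$3} ? $predefined{$3}
      : "&$3;"
    }ge;
    return $text;
}

print decode_refs('<mytag>&#105; &#x69; &amp; &unknown;</mytag>'), "\n";
```

    Doing it in a single pass matters: decoding &amp; first and the
    other names afterwards would turn "&amp;lt;" into "<" instead of
    the intended "&lt;".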

    I would recommend "Perl & XML - XML Processing with Perl" by Erik T.
    Ray & Jason McIntosh (published by O'Reilly). Very good book. See
    http://www.oreilly.com/catalog/perlxml/.

    You need to learn more about XML:

    http://www.w3.org/XML/
    http://www.xml.com/
    http://www.w3schools.com/xml/default.asp (tip!)

    --
    Bart
    Bart Van der Donck, Dec 27, 2005
    #19
  20. robic0

    Matt Garrish Guest

    Re: My Regexp XML Parser -> Structured Perl Data, Cut & Paste Version, No Module's (Vol I)

    <robic0> wrote in message news:...
    > On Sat, 24 Dec 2005 11:57:13 -0500, "Matt Garrish"
    > <> wrote:
    >
    > Man you make me laff!
    >


    Well, at least you're getting as much out of this as I am. It would be nice
    if you could drop the script-kiddie talk and write proper English sentences
    in the future, though.

    >
    >>By the way, have you put any thought into the public interface for this
    >>thing? It's nice that it runs line-by-line and uses regexes to find tags,
    >>but that's totally useless for XML parsing. Does it handle events like a
    >>SAX
    >>parser? (Not that I see.) Does it build a parent/child tree? (Again, I
    >>don't
    >>see anywhere that you can tell what the relationship is between any set of
    >>tags.) Or is this just an exercise in writing regular expressions?
    >>
    >>

    > Since it's out of sequence, it's totally useless for event-driven SAX.


    That's exactly my point. What is this thing supposed to do? The (very
    simple) point of an XML parser is to verify the integrity of the document
    (validation: either well-formedness or compliance to a dtd or schema) and/or
    allow you to access the content.

    Your parser has no appreciation of nesting beyond the very trivial, so there
    is no way that it can check well-formedness. It also doesn't understand
    DTDs or schemas (and neither do you), and you don't realize how nearly
    impossible it's going to be for your parser to validate against one.
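
    To illustrate the nesting part of well-formedness only: the usual
    technique is a stack of open element names, compared
    case-sensitively against each end tag. A minimal sketch, assuming
    comments, CDATA sections and the DOCTYPE have already been removed
    and that no attribute value contains "<" or ">"; check_nesting is a
    hypothetical name, and this is far from a full well-formedness
    check:

```perl
use strict;
use warnings;

# Hypothetical helper: returns '' if every start tag is closed by an
# end tag with exactly the same name, otherwise a short error string.
sub check_nesting {
    my ($xml) = @_;
    my @open;
    while ($xml =~ m{<(/?)([A-Za-z_:][\w:.-]*)[^<>]*?(/?)>}g) {
        my ($close, $name, $empty) = ($1, $2, $3);
        next if $empty;                  # <br/> opens and closes itself
        if (!$close) { push @open, $name; next }
        my $expected = pop @open;
        return "unexpected </$name>" unless defined $expected;
        return "<$expected> closed by </$name>" if $expected ne $name;
    }
    return @open ? "unclosed <$open[-1]>" : '';
}

for my $doc ('<root><mytag>content</mytag><html:br/></root>',
             '<root><mYTag myargument="argvalue">content</mytag></root>') {
    my $err = check_nesting($doc);
    print $err eq '' ? "ok\n" : "error: $err\n";
}
```

    The same stack is also the natural place to hang a parent/child
    tree: the element on top of the stack is the parent of whatever
    opens next.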

    To get back to my original point, however, your parser does not build a
    tree, so that makes it useless for half the applications of a parser. It
    also doesn't handle events like a SAX parser, which makes it useless for the
    other half. I'm honestly curious: what real-world application do you
    think this is going to have?

    Oh, and when are you going to start handling xpath queries?

    Matt
    Matt Garrish, Dec 27, 2005
    #20