My Regexp XML Parser -> Structured Perl Data, Cut & Paste Version, No Modules (Vol I)

robic0

This post is in response to someone who asked for help trying to
parse XML into a data structure. The poster couldn't install
XML::Parser or XML::Simple. I replied a few times with some
partial code. True to my word, here is the core of a cut & paste,
module-free, raw, robust XML parser that builds Perl data
structures. It's about 140 lines of code. I imagine it's about
3 times faster than the XML parsers out there, but I didn't time
it. It doesn't carry the overhead of SAX or nodes.
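
Since the speed claim above is untimed, here is a minimal sketch of how it could be checked with the core Benchmark module. Everything in it is an assumption for illustration: regex_leaves() is a hypothetical stand-in that does far less work than the parser posted below, the test document is synthetic, and XML::Simple is only included in the comparison if it happens to be installed. No timings are claimed here.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Synthetic test document (assumption: 200 simple leaf elements).
my $xml = '<doc>' . join('', map { "<n>$_</n>" } 1 .. 200) . '</doc>';

# Hypothetical stand-in for the regex approach: collect leaf element contents.
sub regex_leaves {
    my ($s) = @_;
    my @leaves;
    push @leaves, $2 while $s =~ m{<([0-9a-zA-Z]+)>([^<]*)</\1>}g;
    return \@leaves;
}

my %candidates = (regex => sub { regex_leaves($xml) });

# Only compare against XML::Simple when it is actually installed.
if (eval { require XML::Simple; 1 }) {
    $candidates{xml_simple} = sub { XML::Simple::XMLin($xml) };
}

# A negative count means "run each candidate for at least that many CPU seconds".
cmpthese(-1, \%candidates);
```

To get a meaningful number, the stand-in would have to be swapped for the real parsing loop wrapped in a sub, and both candidates would have to build comparable data structures.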

This installment is released early, without the fancy
XML::Simple options yet. This is a typical "force array"
version (see the subs below). I wanted to wait until tomorrow
to post this. I already know how to do the rest but don't have
the time tonight. This is fairly final, though, so I'm releasing
it with the understanding that its shortcomings will be fixed
in a day or so.

I've spent 4 days on this. You have to read between the lines
to insert your XML file open, or just cut and paste your XML
into $gabage1. I've left that part up to you. The output
and data are legitimate. It won't look like XML::Simple output
in the default settings: I maintain a root here and some other
things. However, I will post a mod tomorrow. The output and
parsing are completely legitimate. The parsing is probably
much faster than the modules on CPAN.

Let me know if you have any suggestions for improvement.
I want to keep it under 200 lines for a complete cut & paste
solution. It doesn't use any existing parser; its parser
is built in. I don't think this method is used anywhere
in the XML world, so you may want to check whether it gives
a severalfold speed improvement.
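
For readers skimming, the core idea of the code below can be sketched in a few lines: repeatedly replace the innermost elements (those whose content has no '<' left in it) with numbered placeholders, and remember what each placeholder stood for. This is only a minimal sketch under stated assumptions, not the full parser: collapse_innermost() is a hypothetical name, attributes and comments are ignored, and the [N] placeholder would be ambiguous if the text itself contained square brackets (which is why the listing offers chr(140)/chr(141) as delimiters).

```perl
use strict;
use warnings;

# Hypothetical helper (illustrative name, not from the listing below):
# collapse leaf elements into [N] placeholders, innermost first.
sub collapse_innermost {
    my ($xml) = @_;
    my %store;
    my $n = 1;
    # A leaf is <tag>content</tag> where content contains no '<'.
    while ($xml =~ s{<([0-9a-zA-Z]+)>([^<]*)</\1>}{[$n]}) {
        $store{$n} = { tag => $1, content => $2 };
        $n++;
    }
    return ($xml, \%store);
}

my ($left, $store) = collapse_innermost('<a><b>x</b><c>y</c></a>');
print "remaining: $left\n";
for my $id (sort { $a <=> $b } keys %$store) {
    print "$id: <$store->{$id}{tag}> = $store->{$id}{content}\n";
}
```

Once an element's children have all been collapsed, the element itself becomes a leaf, so the loop works its way outward until only one placeholder (the root) remains. Anything still containing '<' or '>' at that point did not match and is treated as malformed.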

Posting changes tomorrow on this.
Contact info:
email: robic0-AT-yahoo.com

========================================================
use strict;
use warnings;
use Data::Dumper;

open(my $fh, '<', 'datafile') or die "can't open datafile: $!";
my $gabage1 = do { local $/; <$fh> }; # slurp the whole file, not just the first line
close $fh;

my @xml_files = ($gabage1);

my $debug = 0;
my $rmv_white_space = 1;

## -- XML start & end regexp substitution delimiter chars --
## match side , substitution side
## -----------------------/-------------------------
my @S_dlim = ('\[' , '['); # use these for reading (debug)
my @E_dlim = ('\]' , ']');
#my @S_dlim = (chr(140) , chr(140)); # use these for production
#my @E_dlim = (chr(141) , chr(141));


for (@xml_files)
{
if ($rmv_white_space) {
s/>[\s]+</></g;
s/[\s]+</</g;
s/>[\s]+/>/g;
}
print "\n",'='x30,"\n$_\n\n" if ($debug);

my $ROOT = {}; # container
my ($last_cnt, $cnt, $i) = (-1, 1, 0);

# should only need 2 iterations max, but wth
while ($cnt != $last_cnt && $i < 20)
{
$last_cnt = $cnt;

## <?XML-Version ?> , have to check the format of '<?'
while (s/<\?([^<>]*)\?>//i) {} # to void xml versioning
# while (s/<\?([^<>]*)\?>/$S_dlim[1]$cnt$E_dlim[1]/i) { print "$cnt <$1> = \n" if ($debug); $cnt++}

## <!-- Comments -->
# while (s/<!--([^<>]*)-->//i) {} # to void comments
while (s/<!--([^<>]*)-->/$S_dlim[1]$cnt$E_dlim[1]/i) {
print "$cnt <!-- --> = $1\n" if ($debug);
$ROOT->{$cnt} = { comment => $1 };
$cnt++;
}
# Comments, need to have "anything but <!-- nor --> here" (revisit)
# while (s/<!--([^(<!--)^(-->)]*)-->/$S_dlim[1]$cnt$E_dlim[1]/i) { print "$cnt <!-- --> = $1\n" if ($debug); $cnt++}

## <Tag/> , no content
while (s/<([0-9a-zA-Z]+)\/>/$S_dlim[1]$cnt$E_dlim[1]/i) {
print "$cnt <$1> = \n" if ($debug);
$ROOT->{$cnt} = { $1 => '' };
$cnt++;
}
## <Tag Attributes/> , no content
while (s/<([0-9a-zA-Z]+)([ ]+[0-9a-zA-Z]+[ ]*=[ ]*"[^<]*")+[ ]*\/>/$S_dlim[1]$cnt$E_dlim[1]/i) {
print "$cnt <$1> = attr: $2\n" if ($debug);
$ROOT->{$cnt} = { $1 => getAttrHash($2) };
$cnt++;
}
## <Tag> Content </Tag>
while (s/<([0-9a-zA-Z]+)>([^<]*)<\/\1>/$S_dlim[1]$cnt$E_dlim[1]/i) {
print "$cnt <$1> = $2\n" if ($debug);
my $unknown = '';
if (length($2) > 0) {
my ($key); my $hcontent = getContentHash($2, $ROOT);
if (keys (%{$hcontent}) > 1) {
$unknown = $hcontent;
}
else { ($key,$unknown) = each (%{$hcontent}); }
}
$ROOT->{$cnt} = { $1 => $unknown };
$cnt++;
}
## <Tag Attributes> Content </Tag>
while (s/<([0-9a-zA-Z]+)([ ]+[0-9a-zA-Z]+[ ]*=[ ]*"[^<]*")+[ ]*>([^<]*)<\/\1>/$S_dlim[1]$cnt$E_dlim[1]/i) {
print "$cnt <$1> = attr: $2, content: $3\n" if ($debug);
my $hattrib = getAttrHash($2);
my $hcontent = getContentHash($3, $ROOT);

while (my ($key,$val) = each (%{$hcontent})) {
$hattrib->{$key} = $val;
}
$ROOT->{$cnt} = { $1 => $hattrib };
$cnt++;
}
$i++ if ($last_cnt != $cnt);
}
if (/<|>/) {
print "($i) XML problem, malformed, syntax or tag closure:\n$_";
} else {
print "$i iterations\n\n";
#print Dumper($ROOT);
my $outer_element = $cnt-1;
if (exists $ROOT->{$outer_element}) {
my $tmp = {};
%{$tmp} = %{$ROOT->{$outer_element}};
print Dumper($tmp);
}
}
}
##
sub getAttrHash
{
my $attstr = shift;
my $ahref = {};
return $ahref unless (defined $attstr);
# value is everything up to the closing quote, so values containing '=' are kept
while ($attstr =~ s/[ ]*([0-9a-zA-Z]+)[ ]*=[ ]*"([^"]*)"[ ]*//i) {
$ahref->{$1} = $2;
}
return $ahref;
}
##
sub getContentHash
{
my ($attstr,$hStore) = @_;
my $ahref = {};
return $ahref unless (defined $attstr && defined $hStore);
my @ary = ();
while ($attstr =~
s/([^<$S_dlim[0]$E_dlim[0]]+)|$S_dlim[0]([\d]+)$E_dlim[0]//i) {
if (defined $1) {
push (@ary, $1);
}
elsif (defined $2 && exists $hStore->{$2}) {
my ($key,$val) = each (%{$hStore->{$2}});

# here, force array is in effect (aka: simple)
# (this will be modified in a day or so)
################
if (exists $ahref->{$key})
{
#print "getChash - $key\n";
push (@{$ahref->{$key}}, $val);

} else {
$ahref->{$key} = [$val];
# $ahref->{$key} = $val;
}
################
}
}
if (scalar(@ary) == 1) {
$ahref->{'content'} = $ary[0];
} elsif (scalar(@ary) > 1) {
$ahref->{'content'} = [@ary];
}
return $ahref;
}

__END__

$VAR1 = {
'document' => {
'WMSNameSpaceVersion' => '2.0',
'comment' => [
' Control Protocol ',
' Data Protocol ',
' Feedback Protocol ',
' Network Source '
],
'node' => [
{
'opcode' => 'create',
'comment' => [
' Object Store
'
],
'name' => 'Control Protocol',
'node' => [
{
'opcode' =>
'create',
'comment' => [
'
RTSP ',
'
Sessionless Multicast '
],
'name' =>
'Object Store',
'node' => [
{

'opcode' => 'create',

'comment' => [

' Properties '

],

'name' => 'RTSP',

'node' => [

{

'opcode' => 'create',

'value' => '{308786f0-8b15-11d2-b25f-006097d2e41e}',

'name' => 'CLSID',

'type' => 'string'

},

{

'opcode' => 'create',

'value' => '0x1',

'name' => 'Enabled',

'type' => 'int32'

},

{

'opcode' => 'create',

'name' => 'Properties',

'node' => [

{

'opcode' => 'create',

'value' => 'RTSP,RTSPA,RTSPT,RTSPU,RTSPM',

'name' => 'Protocol',

'type' => 'string'

}

]

}

]
},
{

'opcode' => 'create',

'comment' => [

' Properties '

],

'name' => 'Sessionless Multicast',

'node' => [

{

'opcode' => 'create',

'value' => '{f9377800-f38d-11d2-b26c-006097d2e41e}',

'name' => 'CLSID',

'type' => 'string'

},

{

'opcode' => 'create',

'value' => '0x1',

'name' => 'Enabled',

'type' => 'int32'

},

{

'opcode' => 'create',

'name' => 'Properties',

'node' => [

{

'opcode' => 'create',

'value' => 'MCAST,RTP',

'name' => 'Protocol',

'type' => 'string'

}

]

}

]
}
]
},
{
'opcode' =>
'create',
'name' =>
'Shared Properties'
}
]
},
{
'opcode' => 'create',
'comment' => [
' Object Store
'
],
'name' => 'Data Protocol',
'node' => [
{
'opcode' =>
'create',
'comment' => [
'
RTP ',
'
RTP/ASF ',
'
RTP/AVP ',
'
RTP/FEC ',
'
RTP/WMS-FEC '
],
'name' =>
'Object Store',
'node' => [
{

'opcode' => 'create',

'comment' => [

' Properties '

],

'name' => 'RTP',

'node' => [

{

'opcode' => 'create',

'value' => '{cbfb2e20-ab7b-11d2-b261-006097d2e41e}',

'name' => 'CLSID',

'type' => 'string'

},

{

'opcode' => 'create',

'value' => '0x1',

'name' => 'Enabled',

'type' => 'int32'

},

{

'opcode' => 'create',

'name' => 'Properties',

'node' => [

{

'opcode' => 'create',

'value' => 'x-asf-pf',

'name' => 'Format',

'type' => 'string'

},

{

'opcode' => 'create',

'value' => 'RTP/AVP',

'name' => 'Protocol',

'type' => 'string'

}

]

}

]
},
{

'opcode' => 'create',

'comment' => [

' Properties '

],

'name' => 'RTP/ASF',

'node' => [

{

'opcode' => 'create',

'value' => '{149a44be-dc14-4e94-9cb0-c0268e77df9e}',

'name' => 'CLSID',

'type' => 'string'

},

{

'opcode' => 'create',

'value' => '0x1',

'name' => 'Enabled',

'type' => 'int32'

},

{

'opcode' => 'create',

'name' => 'Properties',

'node' => [

{

'opcode' => 'create',

'value' => 'x-asfv2-pf,x-asfv2-grp-pf,x-asfv2-frag-pf',

'name' => 'Format',

'type' => 'string'

},

{

'opcode' => 'create',

'value' => 'RTP/AVP',

'name' => 'Protocol',

'type' => 'string'

}

]

}

]
},
{

'opcode' => 'create',

'comment' => [

' Properties '

],

'name' => 'RTP/AVP',

'node' => [

{

'opcode' => 'create',

'value' => '{d7335e2e-62eb-4ad0-96cd-b31c9d0f9f85}',

'name' => 'CLSID',

'type' => 'string'

},

{

'opcode' => 'create',

'value' => '0x1',

'name' => 'Enabled',

'type' => 'int32'

},

{

'opcode' => 'create',

'name' => 'Properties',

'node' => [

{

'opcode' => 'create',

'value' => 'PCMU,L8,L16,MPA,G726-24,G726-40',

'name' => 'Format',

'type' => 'string'

},

{

'opcode' => 'create',

'value' => 'RTP/AVP',

'name' => 'Protocol',

'type' => 'string'

}

]

}

]
},
{

'opcode' => 'create',

'comment' => [

' Properties '

],

'name' => 'RTP/FEC',

'node' => [

{

'opcode' => 'create',

'value' => '{02DEFE42-F8FC-11d2-8670-00C04F6890ED}',

'name' => 'CLSID',

'type' => 'string'

},

{

'opcode' => 'create',

'value' => '0x1',

'name' => 'Enabled',

'type' => 'int32'

},

{

'opcode' => 'create',

'name' => 'Properties',

'node' => [

{

'opcode' => 'create',

'value' => 'parityfec',

'name' => 'Format',

'type' => 'string'

},

{

'opcode' => 'create',

'value' => 'RTP/AVP',

'name' => 'Protocol',

'type' => 'string'

}

]

}

]
},
{

'opcode' => 'create',

'comment' => [

' Properties '

],

'name' => 'RTP/WMS-FEC',

'node' => [

{

'opcode' => 'create',

'value' => '{EDAB8E6B-746C-40db-A885-9E4A9EEF27A2}',

'name' => 'CLSID',

'type' => 'string'

},

{

'opcode' => 'create',

'value' => '0x1',

'name' => 'Enabled',

'type' => 'int32'

},

{

'opcode' => 'create',

'name' => 'Properties',

'node' => [

{

'opcode' => 'create',

'value' => 'wms-fec',

'name' => 'Format',

'type' => 'string'

},

{

'opcode' => 'create',

'value' => 'RTP/AVP',

'name' => 'Protocol',

'type' => 'string'

}

]

}

]
}
]
},
{
'opcode' =>
'create',
'name' =>
'Shared Properties'
}
]
},
{
'opcode' => 'create',
'comment' => [
' Object Store
'
],
'name' => 'Feedback Protocol',
'node' => [
{
'opcode' =>
'create',
'comment' => [
'
RTCP '
],
'name' =>
'Object Store',
'node' => [
{

'opcode' => 'create',

'comment' => [

' Properties '

],

'name' => 'RTCP',

'node' => [

{

'opcode' => 'create',

'value' => '{ecfddc81-184e-11d3-ae84-00a0c95ec3f0}',

'name' => 'CLSID',

'type' => 'string'

},

{

'opcode' => 'create',

'value' => '0x1',

'name' => 'Enabled',

'type' => 'int32'

},

{

'opcode' => 'create',

'name' => 'Properties',

'node' => [

{

'opcode' => 'create',

'value' => 'x-wms-rtx',

'name' => 'Format',

'type' => 'string'

},

{

'opcode' => 'create',

'value' => 'RTP/AVP',

'name' => 'Protocol',

'type' => 'string'

}

]

}

]
}
]
},
{
'opcode' =>
'create',
'name' =>
'Shared Properties'
}
]
},
{
'opcode' => 'create',
'comment' => [
' Object Store
',
' Shared
Properties '
],
'name' => 'Network Source',
'node' => [
{
'opcode' =>
'create',
'comment' => [
'
WMS Http Network Source ',
'
WMS Mms Network Source ',
'
WMS Msbd Network Source ',
'
WMS Network Source '
],
'name' =>
'Object Store',
'node' => [
{

'opcode' => 'create',

'comment' => [

' Properties '

],

'name' => 'WMS Http Network Source',

'node' => [

{

'opcode' => 'create',

'value' => '{566A2EFF-5651-4020-AC1A-EB48E4571EA3}',

'name' => 'CLSID',

'type' => 'string'

},

{

'opcode' => 'create',

'value' => '0x1',

'name' => 'Enabled',

'type' => 'int32'

},

{

'opcode' => 'create',

'name' => 'Properties',

'node' => [

{

'opcode' => 'create',

'value' => 'HTTP',

'name' => 'Source Type',

'type' => 'string'

},

{

'opcode' => 'create',

'value' => '0x50',

'name' => 'DefaultHttpServerPort',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x1bb',

'name' => 'DefaultHttpServerSSLPort',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x8',

'name' => 'PacketBuffers',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x1',

'name' => 'EnableHTTP1_1',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x1e',

'name' => 'OpenTimeout',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x64',

'name' => 'SecondSegmentTimeout',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '',

'name' => 'ControlAdapter',

'type' => 'string'

},

{

'opcode' => 'create',

'value' => '0x55',

'name' => 'PercentBWUsageForAccelStreaming',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x3',

'name' => 'Proxy Setting',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '',

'name' => 'ProxyHostName',

'type' => 'string'

},

{

'opcode' => 'create',

'value' => '0x50',

'name' => 'ProxyPort',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x0',

'name' => 'ProxyBypassForLocal',

'type' => 'int32'

}

]

}

]
},
{

'opcode' => 'create',

'comment' => [

' Properties '

],

'name' => 'WMS Mms Network Source',

'node' => [

{

'opcode' => 'create',

'value' => '{DCF6C8B2-F6C0-461b-82DA-35945EADF54A}',

'name' => 'CLSID',

'type' => 'string'

},

{

'opcode' => 'create',

'value' => '0x1',

'name' => 'Enabled',

'type' => 'int32'

},

{

'opcode' => 'create',

'name' => 'Properties',

'node' => [

{

'opcode' => 'create',

'value' => 'MMS,MMST,MMSU',

'name' => 'Source Type',

'type' => 'string'

},

{

'opcode' => 'create',

'value' => '0x6db',

'name' => 'DefaultServerPort',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x4',

'name' => 'MaxReadHeaderRetries',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x8',

'name' => 'PacketBuffers',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x0',

'name' => 'DropProb',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x0',

'name' => 'DropGracePeriod',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x0',

'name' => 'FirstDropGracePeriod',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x0',

'name' => 'DropBurstDuration',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x0',

'name' => 'PacketPairDropProb',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x2',

'name' => 'NackAlgorithm',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x1',

'name' => 'NackRateMultiplier',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x5dc',

'name' => 'NackBurst',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x3e8',

'name' => 'NackTraceInterval',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x1',

'name' => 'NackRetry',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x0',

'name' => 'IgnoreServerVersion',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x0',

'name' => 'EnableMmsDistribution',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x0',

'name' => 'AssertStrangeErrors',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x5a',

'name' => 'InactivityTimeout',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x20',

'name' => 'OpenTimeout',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x55',

'name' => 'PercentBWUsageForAccelStreaming',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '',

'name' => 'FunnelAdapter',

'type' => 'string'

},

{

'opcode' => 'create',

'value' => '',

'name' => 'ControlAdapter',

'type' => 'string'

},

{

'opcode' => 'create',

'value' => '0x0',

'name' => 'Proxy Setting',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '',

'name' => 'ProxyHostName',

'type' => 'string'

},

{

'opcode' => 'create',

'value' => '0x6db',

'name' => 'ProxyPort',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x0',

'name' => 'ProxyBypassForLocal',

'type' => 'int32'

}

]

}

]
},
{

'opcode' => 'create',

'comment' => [

' Properties '

],

'name' => 'WMS Msbd Network Source',

'node' => [

{

'opcode' => 'create',

'value' => '{FB74F625-7D25-4455-B840-7B870B5B9322}',

'name' => 'CLSID',

'type' => 'string'

},

{

'opcode' => 'create',

'value' => '0x1',

'name' => 'Enabled',

'type' => 'int32'

},

{

'opcode' => 'create',

'name' => 'Properties',

'node' => [

{

'opcode' => 'create',

'value' => 'ASFM',

'name' => 'Source Type',

'type' => 'string'

},

{

'opcode' => 'create',

'value' => '0x8',

'name' => 'PacketBuffers',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x0',

'name' => 'DropProb',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x0',

'name' => 'DropGracePeriod',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x0',

'name' => 'FirstDropGracePeriod',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x0',

'name' => 'DropBurstDuration',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x3a98',

'name' => 'McastTimeout',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x1',

'name' => 'EnableIGMPv3',

'type' => 'int32'

}

]

}

]
},
{

'opcode' => 'create',

'comment' => [

' Properties '

],

'name' => 'WMS Network Source',

'node' => [

{

'opcode' => 'create',

'value' => '{ad763fa6-3b90-41ab-bd44-4f832beee55f}',

'name' => 'CLSID',

'type' => 'string'

},

{

'opcode' => 'create',

'value' => '0x1',

'name' => 'Enabled',

'type' => 'int32'

},

{

'opcode' => 'create',

'name' => 'Properties',

'node' => [

{

'opcode' => 'create',

'value' => 'RTSP,XSDP,RTP,RTSPA,RTSPT,RTSPU,RTSPM',

'name' => 'Source Type',

'type' => 'string'

},

{

'opcode' => 'create',

'value' => '0x1',

'name' => 'EnableATM',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x0',

'name' => 'MaximumMTU',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x14',

'name' => 'FirewallTimeout',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x1e',

'name' => 'OpenTimeout',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x0',

'name' => 'RtxDropProb',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x0',

'name' => 'DropProb',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x0',

'name' => 'DropGracePeriod',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x0',

'name' => 'FirstDropGracePeriod',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x0',

'name' => 'DropBurstDuration',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x0',

'name' => 'PacketPairDropProb',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x2',

'name' => 'NackAlgorithm',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x1',

'name' => 'NackRateMultiplier',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x5dc',

'name' => 'NackBurst',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x3e8',

'name' => 'NackTraceInterval',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x1',

'name' => 'NackRetry',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x0',

'name' => 'BurstProtection',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x0',

'name' => 'EmulateNetworkDisconnect',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x0',

'name' => 'AssertStrangeErrors',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x55',

'name' => 'PercentBWUsageForAccelStreaming',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x0',

'name' => 'Proxy Setting',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '',

'name' => 'ProxyHostName',

'type' => 'string'

},

{

'opcode' => 'create',

'value' => '0x22a',

'name' => 'ProxyPort',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x0',

'name' => 'ProxyBypassForLocal',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x3e8',

'name' => 'PktGracePeriodAtEOSForBPP',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x9c4',

'name' => 'PktGracePeriodAtEOSForODP',

'type' => 'int32'

}

]

}

]
}
]
},
{
'opcode' =>
'create',
'name' =>
'Shared Properties',
'node' => [
{

'opcode' => 'create',

'name' => 'Local'
}
]
}
]
}
]
}
};

__DATA__

<document WMSNameSpaceVersion="2.0">

<node name="Control Protocol" opcode="create" >
<node name="Object Store" opcode="create" >
<node name="RTSP" opcode="create" >
<node name="CLSID" opcode="create" type="string"
value="{308786f0-8b15-11d2-b25f-006097d2e41e}" />
<node name="Enabled" opcode="create" type="int32" value="0x1"
/>
<node name="Properties" opcode="create" >
<node name="Protocol" opcode="create" type="string"
value="RTSP,RTSPA,RTSPT,RTSPU,RTSPM" />
</node> <!-- Properties -->

</node> <!-- RTSP -->

<node name="Sessionless Multicast" opcode="create" >
<node name="CLSID" opcode="create" type="string"
value="{f9377800-f38d-11d2-b26c-006097d2e41e}" />
<node name="Enabled" opcode="create" type="int32" value="0x1"
/>
<node name="Properties" opcode="create" >
<node name="Protocol" opcode="create" type="string"
value="MCAST,RTP" />
</node> <!-- Properties -->

</node> <!-- Sessionless Multicast -->

</node> <!-- Object Store -->

<node name="Shared Properties" opcode="create" />
</node> <!-- Control Protocol -->

<node name="Data Protocol" opcode="create" >
<node name="Object Store" opcode="create" >
<node name="RTP" opcode="create" >
<node name="CLSID" opcode="create" type="string"
value="{cbfb2e20-ab7b-11d2-b261-006097d2e41e}" />
<node name="Enabled" opcode="create" type="int32" value="0x1"
/>
<node name="Properties" opcode="create" >
<node name="Format" opcode="create" type="string"
value="x-asf-pf" />
<node name="Protocol" opcode="create" type="string"
value="RTP/AVP" />
</node> <!-- Properties -->

</node> <!-- RTP -->

<node name="RTP/ASF" opcode="create" >
<node name="CLSID" opcode="create" type="string"
value="{149a44be-dc14-4e94-9cb0-c0268e77df9e}" />
<node name="Enabled" opcode="create" type="int32" value="0x1"
/>
<node name="Properties" opcode="create" >
<node name="Format" opcode="create" type="string"
value="x-asfv2-pf,x-asfv2-grp-pf,x-asfv2-frag-pf" />
<node name="Protocol" opcode="create" type="string"
value="RTP/AVP" />
</node> <!-- Properties -->

</node> <!-- RTP/ASF -->

<node name="RTP/AVP" opcode="create" >
<node name="CLSID" opcode="create" type="string"
value="{d7335e2e-62eb-4ad0-96cd-b31c9d0f9f85}" />
<node name="Enabled" opcode="create" type="int32" value="0x1"
/>
<node name="Properties" opcode="create" >
<node name="Format" opcode="create" type="string"
value="PCMU,L8,L16,MPA,G726-24,G726-40" />
<node name="Protocol" opcode="create" type="string"
value="RTP/AVP" />
</node> <!-- Properties -->

</node> <!-- RTP/AVP -->

<node name="RTP/FEC" opcode="create" >
<node name="CLSID" opcode="create" type="string"
value="{02DEFE42-F8FC-11d2-8670-00C04F6890ED}" />
<node name="Enabled" opcode="create" type="int32" value="0x1"
/>
<node name="Properties" opcode="create" >
<node name="Format" opcode="create" type="string"
value="parityfec" />
<node name="Protocol" opcode="create" type="string"
value="RTP/AVP" />
</node> <!-- Properties -->

</node> <!-- RTP/FEC -->

<node name="RTP/WMS-FEC" opcode="create" >
<node name="CLSID" opcode="create" type="string"
value="{EDAB8E6B-746C-40db-A885-9E4A9EEF27A2}" />
<node name="Enabled" opcode="create" type="int32" value="0x1"
/>
<node name="Properties" opcode="create" >
<node name="Format" opcode="create" type="string"
value="wms-fec" />
<node name="Protocol" opcode="create" type="string"
value="RTP/AVP" />
</node> <!-- Properties -->

</node> <!-- RTP/WMS-FEC -->

</node> <!-- Object Store -->

<node name="Shared Properties" opcode="create" />
</node> <!-- Data Protocol -->

<node name="Feedback Protocol" opcode="create" >
<node name="Object Store" opcode="create" >
<node name="RTCP" opcode="create" >
<node name="CLSID" opcode="create" type="string"
value="{ecfddc81-184e-11d3-ae84-00a0c95ec3f0}" />
<node name="Enabled" opcode="create" type="int32" value="0x1"
/>
<node name="Properties" opcode="create" >
<node name="Format" opcode="create" type="string"
value="x-wms-rtx" />
<node name="Protocol" opcode="create" type="string"
value="RTP/AVP" />
</node> <!-- Properties -->

</node> <!-- RTCP -->

</node> <!-- Object Store -->

<node name="Shared Properties" opcode="create" />
</node> <!-- Feedback Protocol -->

<node name="Network Source" opcode="create" >
<node name="Object Store" opcode="create" >
<node name="WMS Http Network Source" opcode="create" >
<node name="CLSID" opcode="create" type="string"
value="{566A2EFF-5651-4020-AC1A-EB48E4571EA3}" />
<node name="Enabled" opcode="create" type="int32" value="0x1"
/>
<node name="Properties" opcode="create" >
<node name="Source Type" opcode="create" type="string"
value="HTTP" />
<node name="DefaultHttpServerPort" opcode="create"
type="int32" value="0x50" />
<node name="DefaultHttpServerSSLPort" opcode="create"
type="int32" value="0x1bb" />
<node name="PacketBuffers" opcode="create" type="int32"
value="0x8" />
<node name="EnableHTTP1_1" opcode="create" type="int32"
value="0x1" />
<node name="OpenTimeout" opcode="create" type="int32"
value="0x1e" />
<node name="SecondSegmentTimeout" opcode="create"
type="int32" value="0x64" />
<node name="ControlAdapter" opcode="create" type="string"
value="" />
<node name="PercentBWUsageForAccelStreaming" opcode="create"
type="int32" value="0x55" />
<node name="Proxy Setting" opcode="create" type="int32"
value="0x3" />
<node name="ProxyHostName" opcode="create" type="string"
value="" />
<node name="ProxyPort" opcode="create" type="int32"
value="0x50" />
<node name="ProxyBypassForLocal" opcode="create"
type="int32" value="0x0" />
</node> <!-- Properties -->

</node> <!-- WMS Http Network Source -->

<node name="WMS Mms Network Source" opcode="create" >
<node name="CLSID" opcode="create" type="string"
value="{DCF6C8B2-F6C0-461b-82DA-35945EADF54A}" />
<node name="Enabled" opcode="create" type="int32" value="0x1"
/>
<node name="Properties" opcode="create" >
<node name="Source Type" opcode="create" type="string"
value="MMS,MMST,MMSU" />
<node name="DefaultServerPort" opcode="create" type="int32"
value="0x6db" />
<node name="MaxReadHeaderRetries" opcode="create"
type="int32" value="0x4" />
<node name="PacketBuffers" opcode="create" type="int32"
value="0x8" />
<node name="DropProb" opcode="create" type="int32"
value="0x0" />
<node name="DropGracePeriod" opcode="create" type="int32"
value="0x0" />
<node name="FirstDropGracePeriod" opcode="create"
type="int32" value="0x0" />
<node name="DropBurstDuration" opcode="create" type="int32"
value="0x0" />
<node name="PacketPairDropProb" opcode="create" type="int32"
value="0x0" />
<node name="NackAlgorithm" opcode="create" type="int32"
value="0x2" />
<node name="NackRateMultiplier" opcode="create" type="int32"
value="0x1" />
<node name="NackBurst" opcode="create" type="int32"
value="0x5dc" />
<node name="NackTraceInterval" opcode="create" type="int32"
value="0x3e8" />
<node name="NackRetry" opcode="create" type="int32"
value="0x1" />
<node name="IgnoreServerVersion" opcode="create"
type="int32" value="0x0" />
<node name="EnableMmsDistribution" opcode="create"
type="int32" value="0x0" />
<node name="AssertStrangeErrors" opcode="create"
type="int32" value="0x0" />
<node name="InactivityTimeout" opcode="create" type="int32"
value="0x5a" />
<node name="OpenTimeout" opcode="create" type="int32"
value="0x20" />
<node name="PercentBWUsageForAccelStreaming" opcode="create"
type="int32" value="0x55" />
<node name="FunnelAdapter" opcode="create" type="string"
value="" />
<node name="ControlAdapter" opcode="create" type="string"
value="" />
<node name="Proxy Setting" opcode="create" type="int32"
value="0x0" />
<node name="ProxyHostName" opcode="create" type="string"
value="" />
<node name="ProxyPort" opcode="create" type="int32"
value="0x6db" />
<node name="ProxyBypassForLocal" opcode="create"
type="int32" value="0x0" />
</node> <!-- Properties -->
 
mirod

robic0 said:
This post is in response to someone who asked for help trying to
parse XML into a data structure. The poster couldn't install
XML::Parser or XML::Simple. I replied a few times with some
partial code. True to my word, here is the core of a cut & paste,
module-free, raw, robust XML parser that builds Perl data
structures. It's about 140 lines of code. I imagine it's about
3 times faster than the XML parsers out there, but I didn't time
it. It doesn't carry the overhead of SAX or nodes.

This does not seem to be an XML parser. For example a (very!) cursory
glance seems to indicate that it considers [0-9a-zA-Z]+ to be a NAME
(tag or attribute name), where the XML spec shows it is a tad more
complex (see http://www.xml.com/axml/target.html#NT-Name).
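
To make the point about names concrete, here is a small sketch comparing the pattern from the posted code with a rough approximation of the spec's Name production. The second pattern is an assumption for illustration only: it is ASCII-only, whereas the real production also admits a large set of non-ASCII characters, and the sample names are made up.

```perl
use strict;
use warnings;

# The name pattern used throughout the posted parser.
my $post_name = qr/\A[0-9a-zA-Z]+\z/;

# Assumption: ASCII-only approximation of the XML 1.0 Name production.
# Start char: letter, '_' or ':'; following chars also allow digits, '-' and '.'.
my $ascii_name = qr/\A[A-Za-z_:][A-Za-z0-9_:.\-]*\z/;

for my $name (qw(node xs:element my-tag _private item.1 1abc)) {
    printf "%-12s posted pattern: %-3s  approx. spec: %s\n",
        $name,
        ($name =~ $post_name  ? 'yes' : 'no'),
        ($name =~ $ascii_name ? 'yes' : 'no');
}
```

Namespaced names such as xs:element and names with hyphens, dots or underscores are legal XML but are rejected by the posted pattern, while a name starting with a digit is accepted by it even though XML forbids that.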

Writing a complete XML parser is fairly hard, indeed a lot harder than
writing a quasi-XML parser, like what you wrote.

You could have referred the OP to SOAP::Lite
(http://search.cpan.org/dist/SOAP-Lite/), which includes a pure-Perl
XML::Parser replacement (with some explicit limitations).

As it is I think your code is a bit dangerous, as it risks being
re-used by people who will not understand its limitations.
 
Matt Garrish

mirod said:
robic0 said:
This post is in response to someone who asked for help trying to
parse XML into a data structure. The poster couldn't install
XML::Parser or XML::Simple. I replied a few times with some
partial code. True to my word, here is the core of a cut & paste,
module-free, raw, robust XML parser that builds Perl data
structures. It's about 140 lines of code. I imagine it's about
3 times faster than the XML parsers out there, but I didn't time
it. It doesn't carry the overhead of SAX or nodes.

This does not seem to be an XML parser. For example a (very!) cursory
glance seems to indicate that it considers [0-9a-zA-Z]+ to be a NAME
(tag or attribute name), where the XML spec shows it is a tad more
complex (see http://www.xml.com/axml/target.html#NT-Name).

Writing a complete XML parser is fairly hard, indeed a lot harder than
writing a quasi-XML parser, like what you wrote.

It's always good to point out garbage when one sees it, but it's well known
(proven through numerous posts) that rob knows nothing about xml or markup
languages in general. He's probably just looking for an excuse to swear and
call himself a code god (or whatever he's into these days), so don't be
surprised if that's what you get (i.e., don't bother responding).

Matt
 
robic0

Posting changes tomorrow on this.
Contact info:
email: robic0-AT-yahoo.com

A lot of bug fixes and modifications.
The first version had many problems.
This is a clean version (.9) with options:
ForceArray
KeepRoot
KeepComments
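
For anyone unfamiliar with the first switch, here is a minimal sketch of what a ForceArray-style option usually means, going by how XML::Simple uses the term: with it on, every child element is stored as an array ref even if it occurs once; with it off, a single child stays a plain value and is only promoted to an array when a second one with the same name shows up. fold_children() is a hypothetical helper for illustration, not a sub from the listing, and it assumes the child values are plain strings.

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical helper: fold [name, value] child pairs into a hash.
# Assumption: values are plain strings, so an array ref always means "promoted".
sub fold_children {
    my ($force_array, @children) = @_;
    my %h;
    for my $child (@children) {
        my ($name, $value) = @$child;
        if (!exists $h{$name}) {
            # first occurrence: wrap in an array only when forced
            $h{$name} = $force_array ? [$value] : $value;
        }
        elsif (ref($h{$name}) eq 'ARRAY') {
            push @{ $h{$name} }, $value;
        }
        else {
            # second occurrence of an unforced name: promote to an array
            $h{$name} = [ $h{$name}, $value ];
        }
    }
    return \%h;
}

my @kids = ([in2 => 'jjjj'], [in3 => 'asbefas'], [in3 => 'again']);
$Data::Dumper::Sortkeys = 1;
print Dumper(fold_children(0, @kids));   # in2 is a string, in3 an array
print Dumper(fold_children(1, @kids));   # both in2 and in3 are arrays
```

The trade-off is the usual one: forcing arrays makes the structure predictable for code that walks it, while leaving it off gives shorter access paths for elements that only ever occur once.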

This works exceptionally well... Let me know
if you try it.
I'm so burned out on this there probably won't
be any updates for a long time, unless
I change my mind.

See ya

print <<EOM;
# XML Regex Parser
# Version .9
# 12/21/05
# Copyright 2005,
# by robic0-At-yahoo.com
# -----------------------
EOM

use strict;
use warnings;
use Data::Dumper;

#open DATA, "datafile" or die "can't open datafile...";
#my $gabage1 = <DATA>;
#close DATA;


my $gabage2 = '

<XMLDATA>
<Submission SubmissionID="688904">
<Category CategoryName="Storage/Adapter or Controller">
<Driver FolderName="driver000">
<Language LanguageName="English">
<PackageCreationLocation
FolderName="G:\truyen\WHQL\Athena\raid\driver" />
</Language>
</Driver>
</Category>
</Submission>
</XMLDATA>
';

my $gabage3 = '

<big name="asdf" date="33" >
asdf
<in1>
<!-- howdy folks -->
<in2>jjjj</in2>
<small biz="wefwf" ueue = "second" />
<in3>asbefas</in3>
</in1>
asdfb
</big>

';

my @xml_strings = ($gabage2, $gabage3);

my $VERSION = .9;
my $debug = 0;
my $rmv_white_space = 1;
my $ForceArray = 0;
my $KeepRoot = 0;
my $KeepComments = 1;

## -- XML, start & end regexp substitution delimiter chars --
## match side , substitution side
## -----------------------/-------------------------
my @S_dlim = ('\[' , '['); # use these for reading (debug)
my @E_dlim = ('\]' , ']');
#my @S_dlim = (chr(140) , chr(140)); # use these for production
#my @E_dlim = (chr(141) , chr(141));


for (@xml_strings)
{
print "\n",'='x30,"\n$_\n\n";

if ($rmv_white_space) {
s/>[\s]+</></g;
s/[\s]+</</g;
s/>[\s]+/>/g;
}
my $ROOT = {}; # container
my ($last_cnt, $cnt, $i) = (-1, 1, 0);

# should only need 2 iterations max, but wth
while ($cnt != $last_cnt && $i < 20)
{
$last_cnt = $cnt;

## <?XML-Version ?> , have to check the format of '<?'
while (s/<\?([^<>]*)\?>//i) {} # to void xml versioning
# while (s/<\?([^<>]*)\?>/$S_dlim[1]$cnt$E_dlim[1]/i) { print "$cnt <$1> = \n" if ($debug); $cnt++}

## <!-- Comments -->
if (!$KeepComments) {
while (s/<!--([^<>]*)-->//i) {} # to void comments
} else {
while (s/<!--([^<>]*)-->/$S_dlim[1]$cnt$E_dlim[1]/i) {
print "$cnt <!-- --> = $1\n" if ($debug);
$ROOT->{$cnt} = { comment => $1 };
$cnt++;
}
# Comments, need to have "anything but <!-- nor --> here" (revisit)
# while (s/<!--([^(<!--)^(-->)]*)-->/$S_dlim[1]$cnt$E_dlim[1]/i) { print "$cnt <!-- --> = $1\n" if ($debug); $cnt++}
}
## <Tag/> , no content
while (s/<([0-9a-zA-Z]+)\/>/$S_dlim[1]$cnt$E_dlim[1]/i) {
print "$cnt <$1> = \n" if ($debug);
$ROOT->{$cnt} = { $1 => '' };
$cnt++;
}
## <Tag Attributes/> , no content
while (s/<([0-9a-zA-Z]+)([ ]+[0-9a-zA-Z]+[ ]*=[ ]*"[^<]*")+[ ]*\/>/$S_dlim[1]$cnt$E_dlim[1]/i) {
print "$cnt <$1> = attr: $2\n" if ($debug);
$ROOT->{$cnt} = { $1 => getAttrHash($2) };
$cnt++;
}
## <Tag> Content </Tag>
while (s/<([0-9a-zA-Z]+)>([^<]*)<\/\1>/$S_dlim[1]$cnt$E_dlim[1]/i) {
print "$cnt <$1> = $2\n" if ($debug);
my $unknown = '';
if (length($2) > 0) {
my $hcontent = getContentHash($2, $ROOT);
$unknown = $hcontent;
if (keys (%{$hcontent}) > 1) {
if (!$ForceArray) { adjustForSingleItemArrays ($hcontent); }
} elsif (exists $hcontent->{'content'} && scalar(@{$hcontent->{'content'}}) == 1) {
if ($ForceArray ) {
$unknown = $hcontent->{'content'};
} else {
$unknown = ${$hcontent->{'content'}}[0];
}
}
}
$ROOT->{$cnt} = { $1 => $unknown };
$cnt++;
}
## <Tag Attributes> Content </Tag>
while (s/<([0-9a-zA-Z]+)([ ]+[0-9a-zA-Z]+[ ]*=[ ]*"[^<]*")+[ ]*>([^<]*)<\/\1>/$S_dlim[1]$cnt$E_dlim[1]/i) {
print "$cnt <$1> = attr: $2, content: $3\n" if ($debug);
my $hattrib = getAttrHash($2);
if (length($3) > 0) {
my $hcontent = getContentHash($3, $ROOT);
if (keys (%{$hcontent}) > 1) {
if (!$ForceArray) { adjustForSingleItemArrays ($hcontent); }
}
while (my ($key,$val) = each (%{$hcontent})) {
$hattrib->{$key} = $val;
}
}
$ROOT->{$cnt} = { $1 => $hattrib };
$cnt++;
}
if ($last_cnt != $cnt) {
$i++ ; print "** End pass $i\n" if ($debug);
}
}
if (/<|>/) {
print "($i) XML problem: malformed, syntax or tag closure:\n$_";
} else {
print "\n** Itterations = $i\n** ForceArray = $ForceArray\n** KeepRoot = $KeepRoot\n** KeepComments = $KeepComments\n\n";
#print Dumper($ROOT);
my $outer_element = $cnt-1;
if (exists $ROOT->{$outer_element}) {
my $htodump = $ROOT->{$outer_element};
if (!$KeepRoot && keys (%{$htodump}) == 1) {
my ($key,$val) = each (%{$htodump});
$htodump = $val;
}
my $tmp = {};
%{$tmp} = %{$htodump};
print Dumper($tmp);
} else {print "nothing to output!\n";}
}
}
##
sub adjustForSingleItemArrays
{
my $href = shift;
## if $val is an array ref and has one element
## set $href->{$key} equal to the element
while (my ($key,$val) = each (%{$href})) {
if (ref($val) eq "ARRAY") {
if (scalar(@{$val}) == 1) {
$href->{$key} = $val->[0];
}
}
}
}
##
sub getAttrHash
{
my $attstr = shift;
my $ahref = {};
return $ahref unless (defined $attstr);
while ($attstr =~ s/[ ]*([0-9a-zA-Z]+)[ ]*=[ ]*"([^=]*)"[ ]*//i) {
$ahref->{$1} = $2;
}
return $ahref;
}
##
sub getContentHash
{
my ($attstr,$hStore) = @_;
my $ahref = {};
return $ahref unless (defined $attstr && defined $hStore);
my @ary = ();
while ($attstr =~ s/([^<$S_dlim[0]$E_dlim[0]]+)|$S_dlim[0]([\d]+)$E_dlim[0]//i) {
if (defined $1) {
push (@ary, $1);
}
elsif (defined $2 && exists $hStore->{$2}) {
my ($key,$val) = each (%{$hStore->{$2}});
if (exists $ahref->{$key}) {
push (@{$ahref->{$key}}, $val);
} else {
$ahref->{$key} = [$val];
}
}
}
if (scalar(@ary) > 0) { $ahref->{'content'} = [@ary]; }
## if $val is an array ref and has one element and it
## is a hash ref, set {$key} equal to hash ref
if (!$ForceArray) {
while (my ($key,$val) = each (%{$ahref})) {
if (ref($val) eq "ARRAY") {
if (scalar(@{$val}) == 1 && ref($val->[0]) eq "HASH") {
$ahref->{$key} = $val->[0];
}
}
}
}
return $ahref;
}

__END__


# XML Regex Parser
# Version .9
# 12/21/05
# Copyright 2005,
# by robic0-At-yahoo.com
# -----------------------

==============================


<XMLDATA>
<Submission SubmissionID="688904">
<Category CategoryName="Storage/Adapter or Controller">
<Driver FolderName="driver000">
<Language LanguageName="English">
<PackageCreationLocation
FolderName="G:\truyen\WHQL\Athena\raid\driver" />
</Language>
</Driver>
</Category>
</Submission>
</XMLDATA>



** Itterations = 2
** ForceArray = 0
** KeepRoot = 0
** KeepComments = 1

$VAR1 = {
  'Submission' => {
    'SubmissionID' => '688904',
    'Category' => {
      'Driver' => {
        'Language' => {
          'LanguageName' => 'English',
          'PackageCreationLocation' => {
            'FolderName' => 'G:\\truyen\\WHQL\\Athena\\raid\\driver'
          }
        },
        'FolderName' => 'driver000'
      },
      'CategoryName' => 'Storage/Adapter or Controller'
    }
  }
};

==============================


<big name="asdf" date="33" >
asdf
<in1>
<!-- howdy folks -->
<in2>jjjj</in2>
<small biz="wefwf" ueue = "second" />
<in3>asbefas</in3>
</in1>
asdfb
</big>




** Itterations = 1
** ForceArray = 0
** KeepRoot = 0
** KeepComments = 1

$VAR1 = {
  'date' => '33',
  'name' => 'asdf',
  'content' => [
    'asdf',
    'asdfb'
  ],
  'in1' => {
    'small' => {
      'ueue' => 'second',
      'biz' => 'wefwf'
    },
    'in2' => 'jjjj',
    'comment' => ' howdy folks ',
    'in3' => 'asbefas'
  }
};
 
R

robic0

robic0 said:
This post is in response to someone who asked for help trying to
parse xml into a data structure. The poster couldn't install
XML::parser or XML::Simple. I replied a few times with some
partial code. Good to my word, here is the core of a cut & paste
non-Perl-module based, raw, robust data xml parser into Perl
data structures. It's about 140 lines of code. I imagine it's
about 3 times faster than the XML parsers out there, didn't time
it. It doesn't use the overhead of SAX or nodes.

This does not seem to be an XML parser. For example a (very!) cursory
glance seems to indicate that it considers [0-9a-zA-Z]+ to be a NAME
(tag or attribute name), where the XML spec shows it is a tad more
complex (see http://www.xml.com/axml/target.html#NT-Name).

Writing a complete XML parser is fairly hard, indeed a lot harder than
writing a quasi-XML parser, like what you wrote.

You could have referred the OP to SOAP::Lite
(http://search.cpan.org/dist/SOAP-Lite/), which includes a pure-Perl
XML::Parser replacement (with some explicit limitations).

As it is, I think your code is a bit dangerous, as it risks being
re-used by people who will not understand its limitations.

Hey, I don't know how, but you started a new "Re:" thread.
I just posted mildly reworked code on the original thread.
If you would like to try it out feel free.

This is indeed XML parser framework logic. There is nothing left now
but incidentals to bring it up to the XML spec, like tag naming and
special character escape sequences ("&amp;", ...). It's not made the
same way as XML::Parser or SAX. This is something entirely different.
The thrust was to parse the xml into a valid data structure.

The direction this could take is anybody's guess, but I have a lot
of imagination. I don't think writing a complete xml parser is
fairly hard. I wrote this framework in 4 days and I've used xml
parsers before. The parsing is done purely with regexps; however,
pulling out the data happens in real time as the substitution progresses.
As the substitution moves forward, the xml string shrinks, so the
subsequent regex searches get progressively shorter, resulting in
an extremely efficient and fast parse.
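
The leaf-first substitution described above can be sketched in a few lines. This is a simplified illustration, not the posted parser: collapse_leaves() is a hypothetical name, and attributes, comments and empty tags are left out. It only shows how each pass replaces an innermost element with a numbered placeholder while the working string gets shorter.

```perl
use strict;
use warnings;

# Minimal sketch of leaf-first substitution: repeatedly replace the innermost
# <tag>text</tag> with a numbered placeholder and remember what it stood for.
# Attributes, comments and empty tags are deliberately left out.
sub collapse_leaves {
    my ($xml) = @_;
    my (%store, @lengths);
    my $id = 1;
    while ($xml =~ s{<([0-9A-Za-z]+)>([^<]*)</\1>}{[$id]}) {
        $store{$id} = { tag => $1, content => $2 };
        push @lengths, length $xml;   # the working string shrinks each pass
        $id++;
    }
    return ($xml, \%store, \@lengths);
}

my ($rest, $store, $lengths) = collapse_leaves('<a><b>x</b><c>y</c></a>');
print "remaining: $rest\n";
print "$_ = <$store->{$_}{tag}> $store->{$_}{content}\n"
    for sort { $a <=> $b } keys %$store;
print "lengths: @$lengths\n";
```

For this input the string collapses to [3], with element 3 holding the content [1][2]. Text that already contains something like [1] would be mistaken for a placeholder, which is presumably why the posted code switches to chr(140) and chr(141) as delimiters for production use.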

I welcome you to try it out. Perhaps do some time comparisons
with any other parser out there. I may do some more on it
in the next few days.

Post to the thread I'm posting the code to so I can get your
feedback. That is where I will post the next version.

And pay no attention to Matt Garrish or Tad McClellan... my
underlings!

robic0
 
R

robic0

This post is in response to someone who asked for help trying to
parse xml into a data structure.

This will fix the final issues with "ForceArray".
Comments have an issue with enclosed "<" or ">" in this
version, other than that they will process normally.
It's a regex issue (shortcoming in my opinion) that can't
match a "not" string. Where I need <!--(all but "<!--")-->.
Where (.*)(?!<!--) won't work in an expression. But I'll
work around that.
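
For reference, a pattern does not need to express "anything but <!--" to pick out comments. Because XML comments cannot nest, a non-greedy match up to the first "-->" is enough, and the /s modifier lets it span line breaks. The sketch below assumes that approach; extract_comments() is a hypothetical helper and is not taken from the posted parser.

```perl
use strict;
use warnings;

# extract_comments() is a hypothetical helper for illustration only.
# A non-greedy .*? under /s stops at the first "-->", which is sufficient
# because XML comments cannot nest. A tempered form such as
# (?:(?!-->).)* expresses "anything that does not start -->" explicitly.
sub extract_comments {
    my ($xml) = @_;
    my @comments;
    while ($xml =~ /<!--(.*?)-->/sg) {
        push @comments, $1;
    }
    return @comments;
}

my $doc = "<a><!-- one < two --><b/><!--\n<old>oops!</old>\n--></a>";
print "[$_]\n" for extract_comments($doc);
```

Both comments are found here, including the one that contains "<" and the one that spans several lines and wraps markup.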

Version .901 from 12-22-05 is the one you want.
This is close to the last post as far as this newsgroup goes.
Sorry, but I had to get it stable. I've run this on every
big and weird xml file I could get my hands on. I'm
satisfied with it.

See ya...


print <<EOM;

# XML Regex Parser
# Version .901 - 12/22/05
# Copyright 2005,
# by robic0-At-yahoo.com
# -----------------------
EOM

use strict;
use warnings;
use Data::Dumper;

#open DATA, "sumfile.xml" or die "can't open datafile...";
#my $gabage1 = join ('', <DATA>);
#close DATA;


my $gabage3 = '

<big name="asdf" date="33" >
asdf
<in1>
<!-- howdy f*%$olks -->
<in2>jjjj</in2>
<small biz="wefwf" ueue = "second" />
<!-- and still more -->
<bar><inside>asgfasdf<insF>2</insF>sdfb</inside></bar>
</in1>
<in2>some in3 content</in2>
asdfb
</big>

';

my @xml_strings = ($gabage3);

my $VERSION = .901;
my $debug = 1;
my $rmv_white_space = 1;
my $ForceArray = 0;
my $KeepRoot = 0;
my $KeepComments = 0;

## -- XML, start & end regexp substitution delimiter chars --
## match side , substitution side
## ----------------------/-------------------------------
my @S_dlim = ('\[' , '['); # use these for debug
my @E_dlim = ('\]' , ']');
#my @S_dlim = (chr(140) , chr(140)); # use these for production
#my @E_dlim = (chr(141) , chr(141));


## -- Process xml data --
##
for (@xml_strings)
{
print "\n",'='x30,"\n$_\n\n";

if ($rmv_white_space) {
s/>[\s]+</></g;
s/[\s]+</</g;
s/>[\s]+/>/g;
}
my $ROOT = {}; # container
my ($last_cnt, $cnt, $i) = (-1, 1, 0);

# should only need 2 iterations max, but wth
while ($cnt != $last_cnt && $i < 20)
{
$last_cnt = $cnt;

## <?XML-Version ?> , have to check the format of '<?'
while (s/<\?([^<>]*)\?>//i) {} # to void xml versioning
# while (s/<\?([^<>]*)\?>/$S_dlim[1]$cnt$E_dlim[1]/i) { print "$cnt <$1> = \n" if ($debug); $cnt++}

## <!-- Comments -->, nesting not processed,
## also comments can't have "<" or ">" this version.
if (!$KeepComments) {
while (s/<!--[^<>]*-->//s) {} # to void comments
} else {
while (s/<!--([^<>]*)-->/$S_dlim[1]$cnt$E_dlim[1]/s) {
# while (s/<!--([\w\s]*)(?!<!--)-->/$S_dlim[1]$cnt$E_dlim[1]/s) {
print "$cnt <!-- --> = $1\n" if ($debug);
$ROOT->{$cnt} = { comment => $1 };
$cnt++;
}
}
## <Tag/> , no content
while (s/<([0-9a-zA-Z]+)\/>/$S_dlim[1]$cnt$E_dlim[1]/i) {
print "$cnt <$1> = \n" if ($debug);
$ROOT->{$cnt} = { $1 => '' };
$cnt++;
}
## <Tag Attributes/> , no content
while (s/<([0-9a-zA-Z]+)([ ]+[0-9a-zA-Z]+[ ]*=[ ]*"[^<]*")+[ ]*\/>/$S_dlim[1]$cnt$E_dlim[1]/i) {
print "$cnt <$1> = attr: $2\n" if ($debug);
$ROOT->{$cnt} = { $1 => getAttrHash($2) };
$cnt++;
}
## <Tag> Content </Tag>
while (s/<([0-9a-zA-Z]+)>([^<]*)<\/\1>/$S_dlim[1]$cnt$E_dlim[1]/i) {
print "$cnt <$1> = $2\n" if ($debug);
my $unknown = '';
if (length($2) > 0) {
my $hcontent = getContentHash($2, $ROOT);
$unknown = $hcontent;
if (keys (%{$hcontent}) > 1) {
if (!$ForceArray) { adjustForSingleItemArrays ($hcontent); }
} else {
if (exists $hcontent->{'content'} && scalar(@{$hcontent->{'content'}}) == 1) {
if (!$ForceArray ) {
$unknown = ${$hcontent->{'content'}}[0];
} else { $unknown = $hcontent->{'content'}; }
}
if (!$ForceArray) { adjustForSingleItemArrays ($hcontent); }
}
}
$ROOT->{$cnt} = { $1 => $unknown };
$cnt++;
}
## <Tag Attributes> Content </Tag>
while (s/<([0-9a-zA-Z]+)([ ]+[0-9a-zA-Z]+[ ]*=[ ]*"[^<]*")+[ ]*>([^<]*)<\/\1>/$S_dlim[1]$cnt$E_dlim[1]/i) {
print "$cnt <$1> = attr: $2, content: $3\n" if ($debug);
my $hattrib = getAttrHash($2);
if (length($3) > 0) {
my $hcontent = getContentHash($3, $ROOT);
if (!$ForceArray) { adjustForSingleItemArrays ($hcontent); }
while (my ($key,$val) = each (%{$hcontent})) {
$hattrib->{$key} = $val;
}
}
$ROOT->{$cnt} = { $1 => $hattrib };
$cnt++;
}
if ($last_cnt != $cnt) {
$i++ ; print "** End pass $i\n" if ($debug);
}
}
if (/<|>/) {
print "($i) XML problem: malformed, syntax or tag closure:\n$_";
} else {
print "\n** Itterations = $i\n** ForceArray = $ForceArray\n** KeepRoot = $KeepRoot\n** KeepComments = $KeepComments\n\n";
#print Dumper($ROOT);
my $outer_element = $cnt-1;
if (exists $ROOT->{$outer_element}) {
my $htodump = $ROOT->{$outer_element};
if (!$KeepRoot && keys (%{$htodump}) == 1) {
my ($key,$val) = each (%{$htodump});
$htodump = $val;
}
my $tmp = {};
%{$tmp} = %{$htodump};
print Dumper($tmp);
} else {print "nothing to output!\n";}
}
}
##
sub adjustForSingleItemArrays
{
my $href = shift;
## if $val is an array ref and has one element
## set $href->{$key} equal to the element
while (my ($key,$val) = each (%{$href})) {
if (ref($val) eq "ARRAY") {
if (scalar(@{$val}) == 1) {
$href->{$key} = $val->[0];
}
}
}
}
##
sub getAttrHash
{
my $attstr = shift;
my $ahref = {};
return $ahref unless (defined $attstr);
while ($attstr =~ s/[ ]*([0-9a-zA-Z]+)[ ]*=[ ]*"([^=]*)"[ ]*//i) {
$ahref->{$1} = $2;
}
return $ahref;
}
##
sub getContentHash
{
my ($attstr,$hStore) = @_;
my $ahref = {};
return $ahref unless (defined $attstr && defined $hStore);
my @ary = ();
while ($attstr =~ s/([^<$S_dlim[0]$E_dlim[0]]+)|$S_dlim[0]([\d]+)$E_dlim[0]//i) {
if (defined $1) {
push (@ary, $1);
}
elsif (defined $2 && exists $hStore->{$2}) {
my ($key,$val) = each (%{$hStore->{$2}});
if (exists $ahref->{$key}) {
push (@{$ahref->{$key}}, $val);
} else {
$ahref->{$key} = [$val];
}
}
}
if (scalar(@ary) > 0) { $ahref->{'content'} = [@ary]; }
## if $val is an array ref and has one element and it
## is a hash ref, set {$key} equal to hash ref
if (!$ForceArray) {
while (my ($key,$val) = each (%{$ahref})) {
if (ref($val) eq "ARRAY") {
if (scalar(@{$val}) == 1 && ref($val->[0]) eq "HASH") {
$ahref->{$key} = $val->[0];
}
}
}
}
return $ahref;
}

__END__


# XML Regex Parser
# Version .901 - 12/22/05
# Copyright 2005,
# by robic0-At-yahoo.com
# -----------------------

==============================


<big name="asdf" date="33" >
asdf
<in1>
<!-- howdy f*%$olks -->
<in2>jjjj</in2>
<small biz="wefwf" ueue = "second" />
<!-- and still more -->
<bar><inside>asgfasdf<insF>2</insF>sdfb</inside></bar>
</in1>
<in2>some in3 content</in2>
asdfb
</big>



1 <small> = attr: biz="wefwf" ueue = "second"
2 <in2> = jjjj
3 <insF> = 2
4 <inside> = asgfasdf[3]sdfb
5 <bar> = [4]
6 <in1> = [2][1][5]
7 <in2> = some in3 content
8 <big> = attr: name="asdf" date="33", content: asdf[6][7]asdfb
** End pass 1

** Itterations = 1
** ForceArray = 0
** KeepRoot = 0
** KeepComments = 0

$VAR1 = {
  'in2' => 'some in3 content',
  'date' => '33',
  'name' => 'asdf',
  'content' => [
    'asdf',
    'asdfb'
  ],
  'in1' => {
    'small' => {
      'ueue' => 'second',
      'biz' => 'wefwf'
    },
    'bar' => {
      'inside' => {
        'insF' => '2',
        'content' => [
          'asgfasdf',
          'sdfb'
        ]
      }
    },
    'in2' => 'jjjj'
  }
};
 
T

Tad McClellan

robic0 said:
Comments have an issue with enclosed "<" or ">" in this
version, other than that they will process normally.
It's a regex issue (shortcoming in my opinion)


Then you do not understand the mathematics underpinning
regular expressions (ie. set theory).

that can't
match a "not" string. Where I need <!--(all but "<!--")-->.


If you are processing XML, then you do not need that, as
Comment Declarations cannot be nested.

Version .901 from 12-22-05 is the one you want.


No sensible person will want XML processing code written by
someone who has demonstrated repeatedly that they do not
understand the data that is being processed.
 
B

Bart Van der Donck

robic0 said:
This will fix the final issues with "ForceArray".
Comments have an issue with enclosed "<" or ">" in this
version, other than that they will process normally.
It's a regex issue (shortcoming in my opinion) that can't
match a "not" string. Where I need <!--(all but "<!--")-->.
Where (.*)(?!<!--) won't work in an expression. But I'll
work around that.

Version .901 from 12-22-05 is the one you want.
This is close to the last post as far as this newsgroup goes.
Sorry, but I had to get it stable. I've run this on every
big and weird xml file I could get my hands on. I'm
satisfied with it.

[ code snipped ]

It's very hard to run your code. You are messing up the line ends in
your post. I've uploaded a corrected version to
www.dotinternet.be/temp/code.txt.

Your software produces errors when using namespaces:

<?xml version="1.0" encoding="UTF-8"?>
<root xmlns:html="http://www.w3.org/TR/REC-html-4.0">
<mytag>content</mytag>
<html:br/>
</root>

Your software produces errors when using a DOCTYPE:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<root>
<mytag>content</mytag>
</root>

Your software produces errors when attribute values are enclosed in
single quotes (') instead of double quotes ("):

<?xml version='1.0' encoding='UTF-8'?>
<root>
<mytag myargument='argvalue'>content</mytag>
</root>
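
One way to accept both quote styles is an alternation with a separate capture for each. The following is only a sketch: parse_attributes() is a hypothetical stand-in for the posted getAttrHash(), and the name pattern is a simplification of what XML actually allows.

```perl
use strict;
use warnings;

# parse_attributes() is a hypothetical stand-in for the posted getAttrHash().
# It accepts either quote style and allows any whitespace around "=".
sub parse_attributes {
    my ($text) = @_;
    my %attr;
    while ($text =~ /([A-Za-z_:][\w:.\-]*)\s*=\s*(?:"([^"<]*)"|'([^'<]*)')/g) {
        $attr{$1} = defined $2 ? $2 : $3;
    }
    return \%attr;
}

my $attr = parse_attributes(q{myargument='argvalue' other = "a=b"});
print "$_ => $attr->{$_}\n" for sort keys %$attr;
```

Stopping each value at its own closing quote, rather than at the next "=", also lets a value such as "a=b" survive intact.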

XML is case sensitive; your program doesn't seem to bother:

<?xml version="1.0" encoding="UTF-8"?>
<root>
<mYTag myargument="argvalue">content</mytag>
</root>
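
The effect is easy to reproduce in isolation. In Perl, the /i modifier also makes the backreference \1 match case-insensitively, so a mismatched end tag slips through. The sketch below uses a cut-down version of the posted element pattern, without attributes, purely for illustration.

```perl
use strict;
use warnings;

my $xml = '<mYTag>content</mytag>';

# With /i the backreference \1 also matches case-insensitively, so the
# mismatched end tag is accepted. Without /i the same pattern rejects it.
my $loose  = $xml =~ m{<([0-9A-Za-z]+)>([^<]*)</\1>}i ? 1 : 0;
my $strict = $xml =~ m{<([0-9A-Za-z]+)>([^<]*)</\1>}  ? 1 : 0;

print "with /i:    ", ($loose  ? "accepted" : "rejected"), "\n";
print "without /i: ", ($strict ? "accepted" : "rejected"), "\n";
```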

I'm using Microsoft XP's XML parser to check the XML well-formedness.

Your program has many shortcomings.
 
R

robic0

It's very hard to run your code. You are messing up the line ends in
your post. I've uploaded a corrected version to
www.dotinternet.be/temp/code.txt.
Please don't correct and post code I've written on this.
I'm taking it to a higher level every day. My thoughts on
this won't take it where you want to go. It's my idea
and I'll do just about anything I want with it! The code
strain emanates from my creativity, I gave it birth and
I will progress it. Email me, or post code on specific xml
that doesn't work. Either you get an exception bail out
or you get my general error. Not all xml constructs are
implemented. !DOCTYPE is not done yet. It's an infant now,
just the basics. Trust me, I'm gonna do it all.

If you got a host for me that would be great!
I'm going to expand this to every xml construct out there.
 
R

robic0

[snip]
It's very hard to run your code. You are messing up the line ends in
your post.
I'm not messing up "line ends"..
I've uploaded a corrected version to
www.dotinternet.be/temp/code.txt.
You didn't write the code, you can't correct it..
Your software produces errors when using namespaces:

<?xml version="1.0" encoding="UTF-8"?>
<root xmlns:html="http://www.w3.org/TR/REC-html-4.0">
<mytag>content</mytag>
<html:br/>
</root>
Uh, namespaces? wha where?
Your software produces errors when using a DOCTYPE:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<root>
<mytag>content</mytag>
</root>
"<!DOCTYPE..." is not implemented, don't use that xml
Your software produces errors when argument values are enclosed by `` '
´´ instead of `` " ´´:

<?xml version='1.0' encoding='UTF-8'?>
<root>
<mytag myargument='argvalue'>content</mytag>
</root>
Ok, I'll give you that. If ' or " is OK for attributes then I'll put it
in.
XML is case sensitive; your program doesn't seem to bother:
Thought that was the case. I turned off case sensitivity; I'll
put it back on.
<?xml version="1.0" encoding="UTF-8"?>
<root>
<mYTag myargument="argvalue">content</mytag>
</root>

I'm using Microsoft XP's XML parser to check the XML well-formedness.

Your program has many shortcomings.
My program has a solid framework I wrote in 4 days. I've run it on
every single MShit OS xml on my machine. It works perfect ...

Don't know what you want. Either you want what I wrote or you just
want to bust balls of a software designer. Can't figure out which you
want. One more comment like the one above and I won't post a personal
reply like this one!
if ever you should
 
M

Matt Garrish

Please don't correct and post code I've written on this.
I'm taking it to a higher level every day. My thoughts on
this won't take it where you want to go. It's my idea
and I'll do just about anything I want with it! The code
strain emanates from my creativity, I gave it birth and
I will progress it. Email me, or post code on specific xml
that doesn't work.

I don't think anyone wants your garbage.

Now how about the part where you start dealing with the fact that xml is not
constrained to single lines. Your little toy has a lot of trouble with:

<!-- comment out this section
<oldroot>
<oldstuff>oops!</oldstuff>
</oldroot>
-->

and also:

<myplace
city="here"
province="there"/>
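
The second case comes down to which whitespace the pattern accepts between attributes. The sketch below compares a pattern that only allows literal spaces with one that uses \s. Both patterns are simplified stand-ins written for this comparison, not the exact expressions from the posted parser, and \w+ is used for names only to keep them short.

```perl
use strict;
use warnings;

my $xml = qq{<myplace\n  city="here"\n  province="there"/>};

# [ ]+ only accepts literal spaces between attributes; \s+ also accepts
# tabs and line breaks, which XML allows inside a tag.
my $spaces_only = qr{<(\w+)((?:[ ]+\w+[ ]*=[ ]*"[^"<]*")+)[ ]*/>};
my $any_space   = qr{<(\w+)((?:\s+\w+\s*=\s*"[^"<]*")+)\s*/>};

print "spaces only: ", ($xml =~ $spaces_only ? "match" : "no match"), "\n";
print "any space:   ", ($xml =~ $any_space   ? "match" : "no match"), "\n";
```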

Maybe you should learn XML *before* trying to write this parser of yours.

Matt
 
R

robic0

robic0 said:
On Tue, 20 Dec 2005 23:59:06 -0800, robic0 wrote:

This post is in response to someone who asked for help trying to
parse xml into a data structure.
[snip]
It's very hard to run your code. You are messing up the line ends in
your post.
I'm not messing up "line ends"..
I've uploaded a corrected version to
www.dotinternet.be/temp/code.txt.
You didn't write the code, you can't correct it..
Your software produces errors when using namespaces:

<?xml version="1.0" encoding="UTF-8"?>
<root xmlns:html="http://www.w3.org/TR/REC-html-4.0">
<mytag>content</mytag>
<html:br/>
</root>
Uh, namespaces? wha where?
<html:br/>
^
Only \w are allowed in tag names now.
This character can be allowed.
I won't do it until the ramifications of a ":" are clear.
Send me the spec on tags and delimiters that run on without space
within tags.
I'll see what I can do.
 
R

robic0

Now how about the part where you start dealing with the fact that xml is not
constrained to single lines. Your little toy has a lot of trouble with:
Huh, constrained to single lines?
Wha, where?
<!-- comment out this section
<oldroot>
<oldstuff>oops!</oldstuff>
</oldroot>
-->
Comments are a problem for now. I have a workaround
for the near future. I've posted a general complaint
about this Regex problem to the general forum.
and also:

<myplace
city="here"
province="there"/>
"White space" is not considered as a separator yet, only " ". If it's
XML compliant I will enact it.
Maybe you should learn XML *before* trying to write this parser of yours.

Maybe you should not get or use any of my software. If I find out you did
I will sue you!!!!
 
M

Matt Garrish

Huh, constrained to single lines?
Wha, where?

Comments are a problem for now. I have a workaround
for the near future. I've posted a general complaint
about this Regex problem to the general forum.

"White space" is not considered as a separator yet, only " ". If it's
XML compliant I will enact it.

Exactly my point. The last XML processor I built took three weeks just to
write the design for and another 1.5 months to build. And I didn't write my
own parsers; I used a combination of DOM and SAX parsing. You don't know XML
and are proud that you've spent four days designing and writing on the fly
this parser of yours. Are you beginning to see why we don't take you
seriously?
Maybe you should not get or use any of my software. If I find out you did
I will sue you!!!!

Maybe you should consider the legal ramifications of what you've done. You
posted the code here asking for help fixing it on the premise that it is
free and open code. By doing so, you've entered an agreement with everyone
on clpm who responds in any way to your code that this will always be the
case. Though I don't believe you could ever make a cent off it, bear in mind
that I have a real cause for legal action if I find out you use this code in
any commercial product (and that includes reproducing it for an employer).

By the way, have you put any thought into the public interface for this
thing? It's nice that it runs line-by-line and uses regexes to find tags,
but that's totally useless for XML parsing. Does it handle events like a SAX
parser? (Not that I see.) Does it build a parent/child tree? (Again, I don't
see anywhere that you can tell what the relationship is between any set of
tags.) Or is this just an exercise in writing regular expressions?

Matt
 
R

robic0

Exactly my point. The last XML processor I built took three weeks just to
write the design for and another 1.5 months to build. And I didn't write my
own parsers; I used a combination of DOM and SAX parsing. You don't know XML
and are proud that you've spent four days designing and writing on the fly
this parser of yours. Are you beginning to see why we don't take you
seriously?


Maybe you should consider the legal ramifications of what you've done. You
posted the code here asking for help fixing it on the premise that it is
free and open code. By doing so, you've entered an agreement with everyone
on clpm who responds in any way to your code that this will always be the
case. Though I don't believe you could ever make a cent off it, bear in mind
that I have a real cause for legal action if I find out you use this code in
any commercial product (and that includes reproducing it for an employer).
Man you make me laff!
By the way, have you put any thought into the public interface for this
thing? It's nice that it runs line-by-line and uses regexes to find tags,
but that's totally useless for XML parsing. Does it handle events like a SAX
parser? (Not that I see.) Does it build a parent/child tree? (Again, I don't
see anywhere that you can tell what the relationship is between any set of
tags.) Or is this just an exercise in writing regular expressions?

Matt
Since it's out of sequence, it's totally useless for event-driven SAX.
However, in-line handling of contents could be redirected for
special character handling.
Specific accumulation of special "tag" data could be handled too.
You have to think outside the box on this. Definitely the data
structure indenture is right on the money. To modify that data
in-line or pull off just the data you want is no problem.

To tell you the truth, there's a bunch this can do.
You better try to stay off the "negative" machine a little more.
Try the "positive" machine for a while. And oh well, if it flops
who cares, but it punches out some awesome timed data right now.
The technique is new; in my opinion it's worth the effort.

Keep the comments coming... I don't care if they're negative,
it leads me in the right direction. If I have to swear to get
some feedback, so be it.
 
R

robic0

On Tue, 20 Dec 2005 23:59:06 -0800, robic0 wrote:

I'm back on the job.
I'm going to post some new code this week that
complies with XML spec.

This is the solution for the Comment/CDATA paradigm
that will be incorporated in the new version:

use strict;
use warnings;

$_ = '
<![CDATA[ <!-- imbed comment --> some text <!-- imbed as well -->]]>

<!--
wasdfvgasvbg <![CDATA[ not really a CDATA ]]>
<tag>at tag in a real comment</tag>
<![CDATA[ not a CDATA ]]>
-->

<!-- This is a real comment -->

';

#### This section of the parser deals with
#### circular non-markup embedding issues
#### (one inside the other, and so forth).
#### So far just comments & CDATA.
#### Use the general substitution magic.
#### This is valid because neither comments
#### nor CDATA sections can be nested.

my $cnt = 1;
my %root = ();
my %cdata_elements = ();

print "\n";

# -- Comments (done first) --
while (s/(<!--(.*?)-->)/[$cnt]/s) {
$root{$cnt} = $1;
print "$cnt = Questionable comment: $1\n"; $cnt++;
}
print "\n\n",'='x60,"\n\nThe \"Real\" Stuff -->\n\n";
# -- CDATA (done second) --
while (s/<!\[CDATA\[(.*?)\]\]>/[$cnt]/s)
{
# reconstitute cdata element contents
my $cdata_contents = $1;
my $str = '';
while ( $cdata_contents =~ s/([^\[\]]+)|\[([\d]+)\]//i )
{
if (defined $1)
{
$str .= $1;
}
elsif (defined $2 && exists $root{$2})
{
$str .= $root{$2};
delete $root{$2};
}
else {
my $j = 0; # shouldn't get here
}
}
$root{$cnt} = $str;
$cdata_elements{$cnt} = '';

print "\n$cnt = REAL CDATA: $root{$cnt}\n"; $cnt++;
}
# -- Process leftover comments that are real --
while (my ($key,$val) = each (%root)) {
if (!defined $cdata_elements{$key}) {
# This $root re-assignment is not really necessary
# since $1 will contain the processing text that
# will be processed here, then never used again.
$root{$key} =~ s/<!--(.*?)-->/$1/s;
print "\n$key = REAL COMMENT: $root{$key}\n"; # Or $1
}
}


__END__

1 = Questionable comment: <!-- imbed comment -->
2 = Questionable comment: <!-- imbed as well -->
3 = Questionable comment: <!--
wasdfvgasvbg <![CDATA[ not really a CDATA ]]>
<tag>at tag in a real comment</tag>
<![CDATA[ not a CDATA ]]>
-->
4 = Questionable comment: <!-- This is a real comment -->


============================================================

The "Real" Stuff -->


5 = REAL CDATA: <!-- imbed comment --> some text <!-- imbed as well -->

4 = REAL COMMENT: This is a real comment

3 = REAL COMMENT:
wasdfvgasvbg <![CDATA[ not really a CDATA ]]>
<tag>at tag in a real comment</tag>
<![CDATA[ not a CDATA ]]>
 
B

Bart Van der Donck

robic0 said:
I'm back on the job.
I'm going to post some new code this week that
complies with XML spec.

There is more than meets the eye.

An XML file may be well-formed, but invalid if it doesn't comply with
its DTD. Would your program complain about that ?

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE root [
<!ELEMENT root ((mytag|mytag2),myothertag+,notrequiredtag?)>
<!ELEMENT mytag (#PCDATA)>
<!ELEMENT myothertag (#PCDATA)>
]>
<root>
<mytag>content 1</mytag>
<myothertag>content 2</myothertag>
</root>

What about the declaration of entities ?

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE root [
<!ENTITY my_entity "this content was set by !ENTITY">
]>
<root>
<mytag>&my_entity;</mytag>
<myothertag>content 2</myothertag>
</root>

What about an ATTLIST ?

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE root [
<!ATTLIST mytag
att1 CDATA #REQUIRED
att2 CDATA #IMPLIED>
<!ATTLIST myothertag att3 CDATA #FIXED
"this content was set by !ATTLIST">
]>
<root>
<mytag att1="attvalue1" att2="attvalue2">content 1</mytag>
<myothertag>content 2</myothertag>
</root>

What are you going to do with specific XSL tags ?

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<root>
<xsl:sort select="@ID" order="ascending" />
<mytag>
<xsl:attribute name='{name()}'>
<xsl:value-of select="." />
</xsl:attribute>
</mytag>
</root>
</xsl:stylesheet>

What about the rules from an XML schema ?

<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<xsd:element name="root">
<xsd:complexType>
<xsd:sequence>
<xsd:element ref="mytag" maxOccurs="unbounded" />
</xsd:sequence>
</xsd:complexType>
</xsd:element>
</xsd:schema>

It would be a good idea to decode numeric character references:

<?xml version="1.0" encoding="UTF-8"?>
<root>
<mytag>&#105;</mytag>
</root>

Same for the non-numeric ones:

<?xml version="1.0" encoding="UTF-8"?>
<root>
<mytag>&amp;</mytag>
</root>
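
Decoding both kinds of reference can be done in a single substitution pass, which also avoids decoding the same text twice. This is a sketch under stated assumptions: decode_references() is a hypothetical helper, it knows only the five entities that XML predefines, and anything declared in a DTD is left untouched.

```perl
use strict;
use warnings;

# The five entities predefined by XML itself.
my %PREDEFINED = (amp => '&', lt => '<', gt => '>', quot => '"', apos => "'");

# decode_references() is a hypothetical helper; entities declared in a DTD
# are not handled and are left untouched.
sub decode_references {
    my ($text) = @_;
    $text =~ s{&(?:#x([0-9A-Fa-f]+)|#([0-9]+)|(\w+));}{
        defined $1 ? chr(hex $1)
      : defined $2 ? chr($2)
      : exists $PREDEFINED{$3} ? $PREDEFINED{$3}
      : "&$3;"
    }ge;
    return $text;
}

print decode_references('&#105; &#x69; &amp; &lt;tag&gt; &my_entity;'), "\n";
```

Doing everything in one pass matters for input such as &amp;lt;, which should decode to the literal text &lt; rather than to "<".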

I would recommend "Perl & XML - XML Processing with Perl" by Erik T.
Ray & Jason McIntosh (published by O'Reilly). Very good book. See
http://www.oreilly.com/catalog/perlxml/.

You need to learn more about XML:

http://www.w3.org/XML/
http://www.xml.com/
http://www.w3schools.com/xml/default.asp (tip!)
 
M

Matt Garrish

On Sat, 24 Dec 2005 11:57:13 -0500, "Matt Garrish"

Man you make me laff!

Well, at least you're getting as much out of this as I am. It would be nice
if you could drop the script-kiddie talk and write proper English sentences
in the future, though.
Since it's out of sequence, it's totally useless for event-driven SAX.

That's exactly my point. What is this thing supposed to do? The (very
simple) point of an XML parser is to verify the integrity of the document
(validation: either well-formedness or compliance to a dtd or schema) and/or
allow you to access the content.

Your parser has no appreciation of nesting beyond the very trivial, so there
is no way that it can check well-formedness. It (you) also doesn't
understand dtds or schemas, and don't realize how nearly impossible it's
going to be for your parser to validate against one.

To get back to my original point, however, your parser does not build a
tree, so that makes it useless for half the applications of a parser. It
also doesn't handle events like a SAX parser, which makes it useless for the
other half. I'm honestly curious what real world application you think this
is going to have?

Oh, and when are you going to start handling xpath queries?

Matt
 
