I'm writing a module to parse an XML file of records. It
will be used by a variety of different applications, e.g.,
loading into a relational database, etc. I'll be using
a SAX-based approach (ExpatXS), as the XML files can be
very large.
In the past I've written such modules by assembling a huge
data structure in memory then returning it to the calling
application as, say, a reference to an array of hashes.
This was tremendously convenient yet very, very slow. Some
applications would take hours to execute. This time around
I'd like to learn something new and approach it differently.
Is there some way to design this, module plus application,
so that as a record is read the application can process it
immediately? Is this what is known as a pull-based architecture?
How does the application "know" when a new record is available?
Does it listen for something that the module emits? I'm
thinking maybe it can be done with a callback. The callback
subroutine is written in the calling application and when
the end of the record is parsed, that subroutine is called.
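To make that concrete, here is roughly the shape I have in mind, sketched
against the SAX handler interface (the <record> element name and the
on_record callback are made up for illustration; I haven't tested this):

#!/usr/bin/perl
use strict;
use warnings;

# Handler that assembles one record at a time and hands it to the
# application's callback as soon as the record's end tag is seen.
package RecordHandler;
use base qw(XML::SAX::Base);

sub new {
    my ($class, %args) = @_;
    my $self = $class->SUPER::new();
    $self->{on_record} = $args{on_record};    # supplied by the application
    return $self;
}

sub start_element {
    my ($self, $el) = @_;
    $self->{rec} = {} if $el->{Name} eq 'record';    # start a fresh record
    $self->{field} = $el->{Name};
}

sub characters {
    my ($self, $chars) = @_;
    $self->{rec}{ $self->{field} } .= $chars->{Data}
        if $self->{rec} && defined $self->{field} && $self->{field} ne 'record';
}

sub end_element {
    my ($self, $el) = @_;
    $self->{on_record}->(delete $self->{rec})        # fire the callback
        if $el->{Name} eq 'record';
    delete $self->{field};
}

package main;
use XML::SAX::ParserFactory;

$XML::SAX::ParserPackage = 'XML::SAX::ExpatXS';      # request the ExpatXS driver
my $parser = XML::SAX::ParserFactory->parser(
    Handler => RecordHandler->new(
        on_record => sub {
            my ($rec) = @_;
            # process one record immediately, e.g. a database insert
        },
    ),
);
$parser->parse_uri('records.xml');

That way the module would never hold more than one record in memory, and the
application decides what to do with each one as it arrives.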
I'm sure this is a basic question but it's new to me. Is my
callback idea worth exploring? Are there any design patterns
people can point me to? Example programs? Articles online?
Thanks!
Arvin
I guess I'm coming late to this question but will give it a shot
for you. I see some code thrown around, so I'll throw some too.
First and foremost, if you're processing an extra-large XML file
you want to use a "stream" processor, where you get start-tag/end-tag/content
notifications. The stream processor, if passed a filehandle, should do something
like this ("$p" is a parsing object instantiated by your program):
$p->parse(*DATA);
-------------
here's what happens:
"module"
============
sub parse {
    my ($self, $data) = @_;
    throwX('30') if $self->{'InParse'};        # re-entrant call
    throwX('31') unless defined $data;
    $self->{'InParse'} = 1;
    # call the processor, dispatching on what was passed in
    if (ref($data) eq 'SCALAR') {              # reference to a buffer
        print "SCALAR ref\n" if $self->{'debug'};
        eval { Processor($self, 1, $data); };
        if ($@) { Cleanup($self); die $@; }
    }
    elsif (ref(\$data) eq 'SCALAR') {          # plain string
        print "SCALAR string\n" if $self->{'debug'};
        eval { Processor($self, 1, \$data); };
        if ($@) { Cleanup($self); die $@; }
    }
    else {                                     # filehandle, or a reference to one
        if (ref($data) ne 'GLOB' && ref(\$data) ne 'GLOB') {
            $self->{'InParse'} = 0;
            die "rp_error_parse, data source not a string or filehandle nor reference to one\n";
        }
        print "GLOB ref or filehandle\n" if $self->{'debug'};
        eval { Processor($self, 0, $data); };
        if ($@) { Cleanup($self); die $@; }
    }
    $self->{'InParse'} = 0;
}
sub Processor {
    my ($obj, $BUFFERED, $rpl_mk) = @_;
    my ($markup_file);
    my $parse_ln     = '';
    my $dyna_ln      = '';
    my $ref_parse_ln = \$parse_ln;
    my $ref_dyna_ln  = \$dyna_ln;
    if ($BUFFERED) {
        $ref_parse_ln = $rpl_mk;    # parse directly from the caller's buffer
        $ref_dyna_ln  = \$dyna_ln;
    } else {
        # assume it's a ref to a glob, or a glob itself
        $markup_file = $rpl_mk;
        $ref_dyna_ln = $ref_parse_ln;
    }
    # parse state (some of it is used only by the core, elided below)
    my $ln_cnt           = 0;
    my $complete_comment = 0;
    my $complete_cdata   = 0;
    my @Tags             = ();
    my $havroot          = 0;
    my $last_cpos        = 0;
    my $done             = 0;
    my $content          = '';
    my $altcontent       = undef;
    $obj->{'origcontent'} = \$content;

    while (!$done) {
        $ln_cnt++;
        # stream processing (if not buffered)
        if (!$BUFFERED) {
            if (!($_ = <$markup_file>)) {
                # eof: just parse what we have
                $done = 1;
                # boundary check for a runaway comment/cdata section
                if (($complete_comment + $complete_cdata) > 0) {
                    $ln_cnt--;
                }
            } else {
                $$ref_parse_ln .= $_;
                ## keep buffering if a comment/cdata section still needs closure
                next if ($complete_comment && !/-->/);
                next if ($complete_cdata   && !/\]\]>/);
                ## reset comment/cdata flags
                $complete_comment = 0;
                $complete_cdata   = 0;
                ## flag serialized comment/cdata buffering
                if (/(<!--)|(<!\[CDATA\[)/) {
                    if (defined $1) {       # comment opened; complete yet?
                        if ($$ref_parse_ln !~ /<!--.*?-->/s) {
                            $complete_comment = 1;
                            next;
                        }
                    }
                    elsif (defined $2) {    # cdata opened; complete yet?
                        if ($$ref_parse_ln !~ /<!\[CDATA\[.*?\]\]>/s) {
                            $complete_cdata = 1;
                            next;
                        }
                    }
                }
                ## buffer until '>' or eof
                next if (!/>/);
            }
        } else {
            $ln_cnt = 1;
            $done   = 1;
        }
        ## REGEX parsing loop ($RxParse is the big token regex, defined elsewhere)
        while ($$ref_parse_ln =~ /$RxParse/g) {
            # ... the core: fires the callbacks and does a great deal more;
            # this section totals about 3000 lines ...
        }
    }
}
=============
This handles EOL-oriented file I/O (RAM-based I/O too) as well as
a buffer passed by reference (from a slurped file; this is about 20% faster).
The passed-filehandle technique above will read past as many EOLs as
necessary to get a target processing character; the EOL is *not* even a
factor. Beyond the *target*, this technique does not buffer file data,
so a very *large* file consumes almost no RAM in the transaction.
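For example, say your calling program looks like this (My::RecordParser is
just a stand-in name for the module above, assuming a conventional new()):

use strict;
use warnings;
use My::RecordParser;    # stand-in package name for the module above

my $p = My::RecordParser->new(debug => 0);

# 1) Stream from a filehandle: input is consumed a chunk at a time,
#    so even a very large file uses almost no RAM.
open my $fh, '<', 'records.xml' or die "open: $!";
$p->parse($fh);
close $fh;

# 2) Pass a slurped buffer by reference: one in-memory pass (~20% faster).
my $xml = do {
    open my $in, '<', 'records.xml' or die "open: $!";
    local $/;            # slurp mode
    <$in>;
};
$p->parse(\$xml);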
To guarantee the structural integrity of your records, you have
to use a schema checker ahead of time. If you don't care and there's a problem,
well... shucks!
In a programming sense, you will want to validate that integrity;
otherwise everything said below is invalid.
Use a schema checker (and have a schema file; this is mandatory) before you
parse. Then, when you parse, you have safeguards.
If you want detailed instructions on how to install XML::Xerces, let me know.
XML::Xerces is a machine-generated interface to Apache's Xerces-C++ library.
A note about schema: schema is not a guarantee of structural integrity.
That becomes apparent when you have complex possibilities and the schema is only
general, as in level-oriented. I can give examples and solutions if requested.
When you parse, it is you who is parsing the data into records (structures), not
the parser. You should be in control of the start and end of your records.
You have control in your program over *when* the end-of-record (eor) state is valid.
That validity will be reached by *you*, in the context of *your* program, as in
the sketch below.
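To illustrate (a bare sketch: on_start_tag/on_end_tag stand in for whatever
notifications your stream processor fires, and process_record is your own
routine):

use strict;
use warnings;

my %current;      # the record currently being assembled
my $depth = 0;    # element nesting depth inside <record>

# The application's per-record work (placeholder).
sub process_record {
    my ($rec) = @_;
    print 'record complete: ', join(',', keys %$rec), "\n";
}

# Called on every start tag your processor reports.
sub on_start_tag {
    my ($tag, %attr) = @_;
    if ($tag eq 'record') { %current = %attr; $depth = 1; }
    elsif ($depth)        { $depth++; }
}

# Called on every end tag: *you* decide here when eor is valid.
sub on_end_tag {
    my ($tag) = @_;
    return unless $depth;
    $depth--;
    if ($depth == 0 && $tag eq 'record') {
        process_record({ %current });    # eor declared by *your* rule
        %current = ();
    }
}

The parser only reports tags; the record boundary is entirely your decision.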
Before you initiate parsing, you should validate the file (if you can), something
like this:
your program..
use XML::Xerces;
die "not valid\n" unless ValidateSchema($Xmlfilename, $SchemaFilename);
sub ValidateSchema {
    my ($xfile, $SchemaFile) = @_;
    # Docs:
    # http://xml.apache.org/xerces-c/apiDocs/classAbstractDOMParser.html#z869_9
    my $Xparser = XML::Xerces::XercesDOMParser->new();
    $Xparser->setValidationScheme(1);
    $Xparser->setDoNamespaces(1);
    $Xparser->setDoSchema(1);
    #$Xparser->setValidationSchemaFullChecking(1); # full constraint checking (if enabled, may be time-consuming)
    $Xparser->setExternalNoNamespaceSchemaLocation($SchemaFile);
    # XLoggingErrorHandler and the LogX_* routines are my own logging helpers (not shown).
    my $ERROR_HANDLER = XLoggingErrorHandler->new(\&LogX_warn, \&LogX_error, \&LogX_ferror);
    $Xparser->setErrorHandler($ERROR_HANDLER);
    eval { $Xparser->parse(XML::Xerces::LocalFileInputSource->new($xfile)); };
    return 0 if $@;    # parse or validation failure
    return 1;
}
OK, I got an important phone call, so I'll continue this a little later.
The important stuff comes next: how to easily
pick off your records and process them in-stream. No problem whatsoever.
Brb..