roy said:
I receive an XML file of up to 1 GB full of orders every day. I have to
split the orders and load them into a database for further processing.
I split this job across multiple processes. This runs properly now.
This seems dangerous to me. Generally XML should be parsed by an XML
parser, not by something that happens to parse some restricted subset of
XML with a particular whitespace pattern that works for one example. If
the person who generates the file changes the white-space, for example,
then your program would break, while they could correctly claim that the
file they produced is valid XML and your program shouldn't have broken on
it. OTOH, if you have assurances the file you receive will always be in a
particular subset of XML that matches the expected white space, etc., then
perhaps the trade-off you are making is acceptable.
If parsing is not the bottleneck, then I'd just use something like
XML::Twig to read this one XML file and parcel it out to 10 XML files as a
preprocessing step. Then run each of those 10 files separately. Of
course, if parsing is the bottleneck, then this would defeat the purpose of
parallelizing.
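A sketch of that preprocessing step, assuming the feed's root element is
<orders> wrapping <order> elements (adjust the tag names and output
filenames to the real feed) and that you have the CPAN module XML::Twig
installed:

```perl
use strict;
use warnings;
use XML::Twig;   # CPAN module, not core

# Split one big order file into $n smaller files, round-robin.  XML::Twig
# parses the whole file but hands us one <order> at a time, so memory
# stays flat as long as we purge each order after writing it out.
sub split_orders {
    my ($infile, $n) = @_;
    my @out;
    for my $i (0 .. $n - 1) {
        open $out[$i], '>', "part$i.xml" or die $!;
        print {$out[$i]} qq{<?xml version="1.0" encoding="UTF-8"?>\n<orders>\n};
    }
    my $count = 0;
    XML::Twig->new(
        twig_handlers => {
            order => sub {
                my ($t, $order) = @_;
                print {$out[$count++ % $n]} $order->sprint, "\n";
                $t->purge;    # release the parsed order from memory
            },
        },
    )->parsefile($infile);
    for my $fh (@out) {
        print {$fh} "</orders>\n";
        close $fh or die $!;
    }
}
```

Each of the resulting part files is itself valid XML, so the per-process
loaders can use a real XML parser too.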
Also, you should probably adapt your code to support "use strict;"
sub insert_orders {
    my ($filename, $from, $to) = @_;
    my $xml = new IO::File;
    open ($xml, "< $filename");
    if ($xml = set_handle ($xml, $from)) {
        while (defined ($_dat = <$xml>)) {
            $_temp = "\U$_dat\E";    # Convert into capital letters
            $_temp =~ s/\s+//g;      # Remove blanks
            if ($_temp eq '<ORDER>') {
                $_mode = 'order';
                $_order = '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
            }
If you start fractionally through an order, the code above "burns" lines
until you get to the start of the first full order. Yet your set_handle
code also burns the initial (potentially fractional) line. It would be
cleaner if all the code that burns data were in one place.
            $_order .= $_dat if $_mode eq 'order';
            if ($_temp eq '</ORDER>') {
                # load $_order into the database ...
This tries to load $_order into the database even when there is no
order to load, i.e. when you started out in the middle of a previous order.
$_order will be empty, but you try to load it anyway. Is that a problem?
You want to load $_order only when $_mode eq 'order', i.e. when you have
actually collected a complete order.
                $_order = '';
                $_mode = '';
            last if ($to <= tell ($xml));
This has the potential to lose orders. Let's say that $to is 1000, and
an order starts exactly at position 1000. This job will not process that
order, because $to <= 1000 is true. The next-higher job, whose
$from is 1000, also will not process this order, as the "partial" first
line it burned just happened to be a true full line, and that order
therefore gets forgotten. (I've verified this does in fact happen in a test
case.)
last if ($to < tell($xml));
(Or change the way you burn data, as suggested above, so it all happens
in only one place.)
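A minimal, self-contained sketch of the boundary problem, using an
in-memory file handle and four 4-byte lines in place of real orders
(the chunking logic mirrors the code above; names are mine):

```perl
use strict;
use warnings;

# Return the lines a chunk [$from, $to) would process.  $strict selects
# the corrected test ($to < tell) over the original ($to <= tell).
sub chunk_lines {
    my ($data, $from, $to, $strict) = @_;
    open my $fh, '<', \$data or die $!;    # in-memory file handle
    seek $fh, $from, 0;
    if ($from > 0) {
        my $discard = <$fh>;               # burn the (possibly partial) first line
    }
    my @lines;
    while (defined(my $line = <$fh>)) {
        push @lines, $line;
        last if $strict ? $to < tell $fh : $to <= tell $fh;
    }
    return @lines;
}

my $data = "aaa\nbbb\nccc\nddd\n";         # four 4-byte lines

# Original test: "ccc\n" starts exactly at offset 8 and is lost by both chunks.
my @buggy = (chunk_lines($data, 0, 8, 0), chunk_lines($data, 8, 16, 0));

# Corrected test: every line is processed exactly once.
my @fixed = (chunk_lines($data, 0, 8, 1), chunk_lines($data, 8, 16, 1));
```

With the strict test, the line that straddles a boundary is handled by the
earlier chunk, and the later chunk's burned first line accounts for it.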
....
sub set_handle {
    my ($handle, $pos) = @_;
    seek($handle, $pos, SEEK_CUR);
    if (defined (<$handle>))    # start new line
You probably only want to burn a line when $pos > 0. When $pos == 0, you
know the first line you read will be complete, so there is no reason
to burn it. Generally the burned line that starts out each chunk will
be processed in the "previous" chunk, but when $pos == 0 there was no
previous chunk. This will not actually be a problem unless the first
line of your XML file contains data you need.
    { return $handle; }
    else
    { return; }
I don't understand the above. If $handle returned an undefined
value this time you read from it, won't it do so next time as well?
(I think the only time this isn't true is when $handle is an alias for
ARGV). So why not just return $handle regardless?
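Putting both suggestions together, a simplified set_handle might look like
this (a sketch, not roy's code; I use SEEK_SET with an absolute offset,
which is equivalent here since the handle was just opened):

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET);

# Burn a line only when we land mid-file, and return the handle
# unconditionally, letting the read loop discover EOF by itself.
sub set_handle {
    my ($handle, $pos) = @_;
    seek $handle, $pos, SEEK_SET;
    if ($pos > 0) {
        my $discard = <$handle>;   # skip the (possibly partial) line we landed in
    }
    return $handle;
}

my $data = "first\nsecond\n";
open my $fh, '<', \$data or die $!;
set_handle($fh, 0);
my $line = <$fh>;                  # nothing burned at $pos == 0
```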
Xho