Simple XML question ...

Stephen O'D

I have a big file that looks similar to this:

<file>
<item>
<itemtags>foo</itemtags>
</item>
<item>
<itemtags>foo</itemtags>
</item>
... thousands of repetitions
<item>
<itemtags>foo</itemtags>
</item>
</file>

I want to ensure the file is valid and grab each item (without the
item wrapper, i.e. all child tags of each item).

I was hoping to do this using XML::Parser, but I just cannot work out
how to get the actual markup text contained in a tag. I know I can use
it in Subs mode and set a handler for item. How can I use that to
extract the parts I care about?

Thanks,

Stephen.
 
Stephen O'D

Subject: Simple XML question ...

If it's Simple and XML, how 'bout XML::Simple?
I have a big file that looks similar to this:
<file>
<item>
<itemtags>foo</itemtags>
</item>

[snip]

Seems simple enough...
I want to ensure the file is valid and grab each item (without the
item wrapper, i.e. all child tags of each item).
I was hoping to do this using XML::Parser, but I just cannot work out
how to get the actual markup text contained in a tag. I know I can use
it in Subs mode and set a handler for item. How can I use that to
extract the parts I care about?

Well, let's see if X::S can actually do that:

#!/usr/bin/perl

use strict;
use warnings;
use XML::Simple;

die "D'Oh!\n" unless @ARGV;

print $_->{itemtags}, "\n"
for @{ (XMLin shift)->{item} };

__END__

Yes, it seems to do the job, provided that I understood the job
correctly. Error checking is left as an exercise for the reader: here I
am assuming everything will go fine in any case.
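
For the error checking left as an exercise above, here is one possible sketch, not a definitive version. XMLin dies on malformed input, so it is wrapped in eval. The ForceArray and KeyAttr options are standard XML::Simple options; the function name item_tags and the inline sample document are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Simple;

# Hypothetical helper: return the text of every <itemtags> child.
# $source may be a file name or a string of XML, as XMLin accepts both.
sub item_tags {
    my ($source) = @_;

    # XMLin dies on a parse error, so trap it and report it ourselves.
    my $doc = eval {
        XMLin($source, ForceArray => ['item'], KeyAttr => []);
    };
    die "Could not parse XML: $@" if $@ || !defined $doc;

    # ForceArray guarantees an array ref even when there is only one <item>.
    return map { $_->{itemtags} } @{ $doc->{item} || [] };
}

# Illustrative input only; a real run would pass a file name instead.
my $xml = '<file><item><itemtags>foo</itemtags></item>'
        . '<item><itemtags>bar</itemtags></item></file>';

print "$_\n" for item_tags($xml);
```

Without ForceArray, a file containing a single item would come back as a hash rather than an array, and the original one-liner would fail on it.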

Michele

That's not exactly what I want. My file is more like:

<file>
<item>
<itemtags>
<tag1>foo</tag1>
<tag2>bar</tag2>
</itemtags>
</item>
<item>
<itemtags>
<tag1>foo</tag1>
<tag2>bar</tag2>
</itemtags>
</item>
... thousands of repetitions
</file>

So I need a series of XML chunks like the following as my output (they
will be passed to another process for processing one at a time):

<itemtags>
<tag1>foo</tag1>
<tag2>bar</tag2>
</itemtags>

Also, the files I am dealing with are going to be large, and each
itemtags section is about 32K in size.

I am struggling to find some way to get me just the output. I have
something working with XML::Twig:

[sodonnel@millhouse]$ more twig.pl
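use strict;
use warnings;
use XML::Twig;

# Print the content of each <item> without the enclosing <item> tags.
sub print_it {
    my ($t, $elt) = @_;
    $elt->set_asis;
    print $elt->sprint(1), "\n";   # 1 = omit the enclosing tag
    $t->purge;                     # free the memory used so far
}

my $t = XML::Twig->new(
    twig_handlers => { 'item' => \&print_it },
);
$t->parsefile('data.xml');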

[sodonnel@millhouse]$ perl twig.pl
<itemtags>foo</itemtags>
<itemtags>foo</itemtags>
<itemtags>foo</itemtags>

I have no real experience parsing big XML files in Perl (or
anything else). My file has 10 items at a total size of ~400K, and it takes
~5.2 CPU seconds to parse it and print each chunk. That seems slow
to me. Can I expect to parse the file faster than that?

Stephen.
 
mirod

Hi,

Your code looks about right. I believe the $elt->set_asis is useless
(and actually dangerous), so you should probably get rid of it.
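
The danger is that set_asis marks the element so its text is printed without XML escaping, which means a literal & or < in the data would produce a chunk that is no longer well-formed. A minimal sketch of the handler without it, assuming the goal is still the content of each item minus the wrapper (the function name item_chunks and the inline sample are only for illustration):

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper: collect the inner XML of every <item>.
sub item_chunks {
    my ($xml) = @_;
    my @chunks;

    my $twig = XML::Twig->new(
        twig_handlers => {
            item => sub {
                my ($t, $elt) = @_;
                # sprint(1) omits the enclosing <item> tag; without
                # set_asis the text is escaped again on output.
                push @chunks, $elt->sprint(1);
                $t->purge;    # discard what has been parsed so far
            },
        },
    );

    $twig->parse($xml);       # use parsefile($path) for a file on disk
    return @chunks;
}

# Illustrative input with an escaped ampersand in the text.
my $xml = '<file><item><itemtags><tag1>foo</tag1>'
        . '<tag2>a &amp; b</tag2></itemtags></item></file>';

print "$_\n" for item_chunks($xml);
```

With set_asis left in, the ampersand in tag2 would be printed bare, and the process receiving the chunk would reject it.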

As far as speed goes, it depends on your system. You can have a look at
the various benchmarks in the Ways to Rome series:
http://xmltwig.com/article/index_wtr.html (basically, XML::LibXML is very
fast; most other modules are slower).
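
For anyone who wants to try the XML::LibXML route, here is a rough sketch using its pull parser, XML::LibXML::Reader, which walks the file node by node instead of loading it whole. This is an untimed illustration, not a benchmark entry; the function name item_chunks and the inline sample are assumptions made for the example.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

# Hypothetical helper: return the inner XML of every <item>.
sub item_chunks {
    my ($xml) = @_;

    # Use location => $path instead of string => $xml for a file on disk.
    my $reader = XML::LibXML::Reader->new(string => $xml);
    my @chunks;

    # nextElement returns 1 each time it reaches another <item>.
    while ($reader->nextElement('item') == 1) {
        # Copy just this subtree into a small DOM fragment ...
        my $node = $reader->copyCurrentNode(1);
        # ... and serialize its children, leaving out the <item> wrapper.
        push @chunks, join '', map { $_->toString } $node->childNodes;
    }
    return @chunks;
}

# Illustrative input only.
my $xml = '<file>'
        . '<item><itemtags><tag1>foo</tag1></itemtags></item>'
        . '<item><itemtags><tag1>bar</tag1></itemtags></item>'
        . '</file>';

print "$_\n" for item_chunks($xml);
```

Only one item is held as a DOM fragment at a time, so memory use should stay roughly proportional to the size of a single item rather than the whole file.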
 
Stephen O'D

That's not exactly what I want. My file is more like:
<file>
<item>
<itemtags>
<tag1>foo</tag1>
<tag2>bar</tag2>
[snip]
So I need a series of XML chunks like the following as my output (they
will be passed to another process for processing one at a time):
<itemtags>
<tag1>foo</tag1>
<tag2>bar</tag2>
</itemtags>

Well, then it would be even easier, but...


Also, the files I am dealing with are going to be large, and each
itemtags section is about 32K in size. [snip]
I have no real experience parsing big XML files in Perl (or
anything else). My file has 10 items at a total size of ~400K, and it takes
~5.2 CPU seconds to parse it and print each chunk. That seems slow
to me. Can I expect to parse the file faster than that?

OK, for those who are interested, I now have two ways of doing this,
using XML::Twig or XML::Parser.

XML::Twig code:

[sodonnel@millhouse]$ more twig.pl
use strict;
use warnings;
use XML::Twig;
use Benchmark;

my $item;

sub print_it {
    my ($t, $elt) = @_;
    $elt->set_asis;
    # Putting this into $item and then clearing it is stupid,
    # but it makes this a fair comparison with what I am doing
    # with XML::Parser.
    $item = $elt->sprint(1) . "\n";   # 1 = omit the enclosing tag
    $item = '';
    $t->purge;
}

my $t = XML::Twig->new(
    twig_handlers => { 'cloudItem' => \&print_it },
);
my $bstart = Benchmark->new;
$t->parsefile('cloud.xml');
my $bend = Benchmark->new;

print timestr(timediff($bend, $bstart)), "\n";

XML::Parser code:

[sodonnel@millhouse]$ more xml_parser.pl
use strict;
use warnings;
use XML::Parser;
use Benchmark;

my $in_item   = 0;    # nesting depth inside <cloudItem>
my $item_text = '';   # markup collected for the current item

my $bstart = Benchmark->new;

my $parser = XML::Parser->new(
    Handlers => {
        Start => \&tag_start,
        End   => \&tag_end,
        Char  => \&characters,
    },
);

$parser->parsefile('cloud.xml');
my $bend = Benchmark->new;

print timestr(timediff($bend, $bstart)), "\n";

exit(0);

sub tag_start {
    my ($xp, $el) = @_;
    # Copy every start tag except the opening cloudItem tag itself.
    if ($in_item >= 1)      { $item_text .= $xp->recognized_string }
    if ($el eq 'cloudItem') { $in_item += 1 }
}

sub tag_end {
    my ($xp, $el) = @_;
    if ($el eq 'cloudItem') { $in_item -= 1 }
    if ($in_item == 0) {
        #print $item_text;
        $item_text = '';
    } else {
        # Copy every end tag except the closing cloudItem tag.
        $item_text .= $xp->recognized_string;
    }
}

sub characters {
    my ($xp, $txt) = @_;
    # Note: $txt arrives entity-decoded (&amp; becomes &).
    if ($in_item) { $item_text .= $txt }
}

[sodonnel@millhouse]$ perl xml_parser.pl
1 wallclock secs ( 1.59 usr + 0.00 sys = 1.59 CPU)
[sodonnel@millhouse]$ perl twig.pl
5 wallclock secs ( 5.14 usr + 0.02 sys = 5.16 CPU)

So XML::Parser wins by quite a way, probably because it doesn't build an
in-memory structure of the tags. Goodness only knows if this is the best
way, but it's good enough for now.
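
One caveat with the XML::Parser version: the Char handler receives text with entities already decoded, so an &amp; in the file arrives as a bare & and the collected chunk may no longer be well-formed. A possible variant that re-escapes the text with the parser's xml_escape method is sketched below. The function name item_chunks, the closure-based handlers and the inline sample are illustrative assumptions, and this variant has not been benchmarked.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical helper: return the inner markup of every $item_tag element.
sub item_chunks {
    my ($xml, $item_tag) = @_;
    my ($depth, $text, @chunks) = (0, '');

    my $parser = XML::Parser->new(
        Handlers => {
            Start => sub {
                my ($xp, $el) = @_;
                # Copy start tags found inside an item, not the item tag itself.
                $text .= $xp->recognized_string if $depth > 0;
                $depth++ if $el eq $item_tag;
            },
            End => sub {
                my ($xp, $el) = @_;
                return unless $depth > 0;    # ignore tags outside any item
                $depth-- if $el eq $item_tag;
                if ($depth == 0) {           # outermost item just closed
                    push @chunks, $text;
                    $text = '';
                }
                else {
                    $text .= $xp->recognized_string;
                }
            },
            Char => sub {
                my ($xp, $txt) = @_;
                # Re-escape & and < so the chunk stays well-formed.
                $text .= $xp->xml_escape($txt) if $depth > 0;
            },
        },
    );

    $parser->parse($xml);    # use parsefile($path) for a file on disk
    return @chunks;
}

# Illustrative input with an escaped ampersand in the text.
my $xml = '<file><item><a>x &amp; y</a></item><item><a>z</a></item></file>';
print "$_\n" for item_chunks($xml, 'item');
```

Collecting the chunks in a list is only for demonstration; with large files each chunk would be handed to the next process as soon as its item closes.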

Cheers,

Stephen.
 
