Simple XML question ...

Stephen O'D · Feb 5, 2007

I have a big file that looks similar to this:

<file>
<item>
<itemtags>foo</itemtags>
</item>
<item>
<itemtags>foo</itemtags>
</item>
... 1000's of repetitions
<item>
<itemtags>foo</itemtags>
</item>
</file>

I want to ensure the file is valid and grab each item (without the
item wrapper - ie all child tags of each item).

I was hoping to do this using XML::Parser, but I just cannot work out
how to get the actual markup text contained in a tag. I know I can use
it in Subs mode, and set a handler for item. How can I use that to
extract the parts I care about?

Thanks,

Stephen.

Stephen O'D · Feb 5, 2007

Subject: Simple XML question ...

If it's Simple and XML, how 'bout XML::Simple?

I have a big file that looks similar to this:

<file>
<item>
<itemtags>foo</itemtags>
</item>

[snip]

Seems simple enough...

I want to ensure the file is valid and grab each item (without the
item wrapper - ie all child tags of each item).

I was hoping to do this using XML::Parser, but I just cannot work out
how to get the actual markup text contained in a tag. I know I can use
it in Subs mode, and set a handler for item. How can I use that to
extract the parts I care about?

Well, let's see if X::S can actually do that:

#!/usr/bin/perl

use strict;
use warnings;
use XML::Simple;

die "D'Oh!\n" unless @ARGV;

print $_->{itemtags}, "\n"
for @{ (XMLin shift)->{item} };

__END__

Yes, it seems to do the job. Provided that I understood the job
correctly. Error checking is left as an exercise to the reader: here I
am assuming everything will go fine in any case.

Michele
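
As for the error checking left as an exercise above, here is a minimal sketch of one way to do it. It assumes XML::Simple is installed; item_tags is a made-up helper name, and the ForceArray option is my addition so that a file with a single item still yields a list.

```perl
use strict;
use warnings;
use XML::Simple;

# Hypothetical helper: takes a file name or a string of XML and
# returns the itemtags value of every <item>.
sub item_tags {
    my ($source) = @_;
    # XMLin dies on a parse error, so trap it with eval.
    # ForceArray keeps {item} an array ref even with only one <item>.
    my $doc = eval { XMLin($source, ForceArray => ['item']) };
    die "Could not parse XML: $@" if $@;
    return map { $_->{itemtags} } @{ $doc->{item} || [] };
}

print "$_\n"
    for item_tags('<file><item><itemtags>foo</itemtags></item></file>');
```

Note that XML::Simple builds the whole document in memory, so this is probably only reasonable for small files.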

That's not exactly what I want. My file is more like:

<file>
<item>
<itemtags>
<tag1>foo</tag1>
<tag2>bar</tag2>
</itemtags>
</item>
<item>
<itemtags>
<tag1>foo</tag1>
<tag2>bar</tag2>
</itemtags>
</item>
... 1000's of repetitions
</file>

So I need a series of XML chunks like the following as my output (they
will be passed to another process for processing one at a time):

<itemtags>
<tag1>foo</tag1>
<tag2>bar</tag2>
</itemtags>

Also, the files I am dealing with are going to be large, and each
itemtags section is about 32K in size.

I am struggling to find some way to get me just the output. I have
something working with XML::Twig:

[sodonnel@millhouse]$ more twig.pl
use XML::Twig;

sub print_it {
    my ($t, $elt) = @_;
    $elt->set_asis;
    # a true argument makes sprint omit the enclosing <item> tags
    print $elt->sprint(1), "\n";
    $t->purge;
}

my $t = XML::Twig->new( twig_handlers =>
    { 'item' => \&print_it }
);
$t->parsefile('data.xml');

[sodonnel@millhouse]$ perl twig.pl
<itemtags>foo</itemtags>
<itemtags>foo</itemtags>
<itemtags>foo</itemtags>

I have no real experience parsing big xml files in Perl (or
anything). My file has 10 items at a total size of ~400K and it takes
~ 5.2 CPU seconds to parse it and print each chunk. That seems slow
to me - can I expect to parse the file faster than that?

Stephen.

mirod · Feb 5, 2007

Hi,

Your code looks about right. I believe the $elt->set_asis call is
useless (and actually dangerous); you should probably get rid of it.

As far as speed goes, it depends on your system. You can have a look at
the various benchmarks in the Ways to Rome series:
http://xmltwig.com/article/index_wtr.html (basically, XML::LibXML is very
fast; most other modules are slower).
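
For reference, a minimal sketch of the handler with set_asis dropped, as suggested above. It assumes XML::Twig is installed; item_chunks is a made-up helper name, and it parses a string rather than data.xml only to keep the example self-contained.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper: returns the inner XML of every <item> in $xml.
sub item_chunks {
    my ($xml) = @_;
    my @chunks;
    my $twig = XML::Twig->new(
        twig_handlers => {
            item => sub {
                my ($t, $elt) = @_;
                push @chunks, $elt->sprint(1);  # true argument: omit the <item> tags
                $t->purge;                      # free what has been parsed so far
            },
        },
    );
    $twig->parse($xml);   # dies if $xml is not well-formed
    return @chunks;
}

print "$_\n"
    for item_chunks('<file><item><itemtags>foo</itemtags></item></file>');
```

For a real file, $twig->parsefile('data.xml') would replace the parse call; both die on a document that is not well-formed, which covers the validity check asked about in the first post.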

Stephen O'D · Feb 6, 2007

That's not exactly what I want. My file is more like:

<file>
<item>
<itemtags>
<tag1>foo</tag1>
<tag2>bar</tag2>

[snip]
So I need a series of XML chunks like the following as my output (they
will be passed to another process for processing one at a time):

<itemtags>
<tag1>foo</tag1>
<tag2>bar</tag2>
</itemtags>

Well, then it would be even easier, but...

Also, the files I am dealing with are going to be large, and each
itemtags section is about 32K in size. [snip]
I have no real experience parsing big xml files in Perl (or
anything). My file has 10 items at a total size of ~400K and it takes
~ 5.2 CPU seconds to parse it and print each chunk. That seems slow
to me - can I expect to parse the file faster than that?

OK, for those who are interested, I now have two ways of doing this,
using XML::Twig or XML::Parser.

XML::Twig code:

[sodonnel@millhouse]$ more twig.pl
use XML::Twig;
use Benchmark;

my $item;

sub print_it {
    my ($t, $elt) = @_;
    $elt->set_asis;
    # putting this into $item and then clearing it is stupid
    # but it's to make it a fair test to what I am doing with
    # XML::Parser.
    # sprint(1) gives the inner XML, without the <cloudItem> tags
    $item = $elt->sprint(1) . "\n";
    $item = '';
    $t->purge;
}

my $t = XML::Twig->new( twig_handlers =>
    { 'cloudItem' => \&print_it }
);
my $bstart = Benchmark->new;
$t->parsefile('cloud.xml');
my $bend = Benchmark->new;

print timestr(timediff($bend, $bstart)), "\n";

XML::Parser code:

[sodonnel@millhouse]$ more xml_parser.pl
use XML::Parser;
use Benchmark;

# nesting depth inside cloudItem, and the markup collected so far
my ($in_item, $item_text) = (0, '');

my $bstart = Benchmark->new;

my $parser = XML::Parser->new(Handlers => { Start => \&tag_start,
                                            End   => \&tag_end,
                                            Char  => \&characters,
                                          });

$parser->parsefile('cloud.xml');
my $bend = Benchmark->new;

print timestr(timediff($bend, $bstart)), "\n";

exit(0);

sub tag_start {
    my ($xp, $el) = @_;
    # this will copy all but the first occurrence into item text
    if ($in_item >= 1) { $item_text .= $xp->recognized_string }
    if ($el eq 'cloudItem') { $in_item += 1 }
}

sub tag_end {
    my ($xp, $el) = @_;
    if ($el eq 'cloudItem') { $in_item -= 1 }
    if ($in_item == 0) {
        #print $item_text;
        $item_text = '';
    } else {
        # copies everything but the closing cloudItem tag
        $item_text .= $xp->recognized_string;
    }
}

sub characters {
    my ($xp, $txt) = @_;
    if ($in_item) { $item_text .= $txt }
}

[sodonnel@millhouse]$ perl xml_parser.pl
1 wallclock secs ( 1.59 usr + 0.00 sys = 1.59 CPU)
[sodonnel@millhouse]$ perl twig.pl
5 wallclock secs ( 5.14 usr + 0.02 sys = 5.16 CPU)

So XML::Parser wins by quite a way, probably because it doesn't build
an in-memory structure of the tags. Goodness only knows if this is the
best way, but it's good enough for now.

Cheers,

Stephen.
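
One caveat with the XML::Parser version above: as far as I know, the Char handler receives text with entity references already decoded (so &amp;amp; in the file arrives as a bare &), which means a rebuilt chunk may not be well-formed if the data contains & or <. A minimal sketch of re-escaping the text before appending it; escape_text is a made-up helper name, not part of XML::Parser.

```perl
use strict;
use warnings;

# Hypothetical helper: re-escape character data so it can be
# written back out as XML text content.
sub escape_text {
    my ($txt) = @_;
    $txt =~ s/&/&amp;/g;   # must come first, or the next two get double-escaped
    $txt =~ s/</&lt;/g;
    $txt =~ s/>/&gt;/g;
    return $txt;
}

print escape_text('AT&T <rocks>'), "\n";   # prints AT&amp;T &lt;rocks&gt;
```

In the characters handler this would be used as $item_text .= escape_text($txt). If the sample data never contains those characters, the original handler behaves the same.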
