Simple XML question ...

Discussion in 'Perl Misc' started by Stephen O'D, Feb 5, 2007.

  1. Stephen O'D

    Stephen O'D Guest

    I have a big file that looks similar to this:

    <file>
    <item>
    <itemtags>foo</itemtags>
    </item>
    <item>
    <itemtags>foo</itemtags>
    </item>
    ... 1000s of repetitions
    <item>
    <itemtags>foo</itemtags>
    </item>
    </file>

    I want to ensure the file is valid and grab each item (without the
    item wrapper, i.e. all child tags of each item).

    I was hoping to do this using XML::Parser, but I just cannot work out
    how to get the actual markup text contained in a tag. I know I can use
    it in Subs mode and set a handler for item. How can I use that to
    extract the parts I care about?

    Thanks,

    Stephen.
     
    Stephen O'D, Feb 5, 2007
    #1

  2. Stephen O'D

    Stephen O'D Guest

    On Feb 5, 2:43 pm, Michele Dondi wrote:
    > On 5 Feb 2007 05:40:25 -0800, "Stephen O'D" wrote:
    > >Subject: Simple XML question ...
    >
    > If it's Simple and XML, how 'bout XML::Simple?
    >
    > >I have a big file that looks similar to this:

    >
    > ><file>
    > > <item>
    > > <itemtags>foo</itemtags>
    > > </item>

    >
    > [snip]
    >
    > Seems simple enough...
    >
    > >I want to ensure the file is valid and grab each item (without the
    > >item wrapper - ie all child tags of each item).

    >
    > >I was hoping to do this using XML::Parser, but I just cannot work out
    > >how to get the actual markup text contained in a tag. I know I can use
    > >it in Subs mode and set a handler for item. How can I use that to
    > >extract the parts I care about?

    >
    > Well, let's see if X::S can actually do that:
    >
    > #!/usr/bin/perl
    >
    > use strict;
    > use warnings;
    > use XML::Simple;
    >
    > die "D'Oh!\n" unless @ARGV;
    >
    > print $_->{itemtags}, "\n"
    > for @{ (XMLin shift)->{item} };
    >
    > __END__
    >
    > Yes, it seems to do the job. Provided that I understood the job
    > correctly. Error checking is left as an exercise to the reader: here I
    > am assuming everything will go fine in any case.
    >
    > Michele


    That's not exactly what I want. My file is more like:

    <file>
    <item>
    <itemtags>
    <tag1>foo</tag1>
    <tag2>bar</tag2>
    </itemtags>
    </item>
    <item>
    <itemtags>
    <tag1>foo</tag1>
    <tag2>bar</tag2>
    </itemtags>
    </item>
    ... 1000s of repetitions
    </file>

    So I need a series of XML chunks like the following as my output (they
    will be passed to another process, one at a time):

    <itemtags>
    <tag1>foo</tag1>
    <tag2>bar</tag2>
    </itemtags>

    Also, the files I am dealing with are going to be large, and each
    itemtags section is about 32K in size.

    I am struggling to find some way to get me just the output. I have
    something working with XML::Twig:

    [sodonnel@millhouse]$ more twig.pl
    use XML::Twig;

    sub print_it {
        my ($t, $elt) = @_;
        $elt->set_asis;
        print $elt->sprint($elt,1), "\n";
        $t->purge;
    }

    my $t = XML::Twig->new( twig_handlers =>
        { 'item' => \&print_it }
    );
    $t->parsefile( 'data.xml');

    [sodonnel@millhouse]$ perl twig.pl
    <itemtags>foo</itemtags>
    <itemtags>foo</itemtags>
    <itemtags>foo</itemtags>

    I have no real experience parsing big xml files in Perl (or
    anything). My file has 10 items at a total size of ~400K and it takes
    ~ 5.2 CPU seconds to parse it and print each chunk. That seems slow
    to me - can I expect to parse the file faster than that?

    Stephen.
     
    Stephen O'D, Feb 5, 2007
    #2

  3. mirod

    mirod Guest

    Stephen O'D wrote:


    > I am struggling to find some way to get me just the output. I have
    > something working with XML::Twig:
    >
    > [sodonnel@millhouse]$ more twig.pl
    > use XML::Twig;
    >
    > sub print_it {
    >     my ($t, $elt) = @_;
    >     $elt->set_asis;
    >     print $elt->sprint($elt,1), "\n";
    >     $t->purge;
    > }
    >
    > my $t = XML::Twig->new( twig_handlers =>
    >     { 'item' => \&print_it }
    > );
    > $t->parsefile( 'data.xml');
    >
    > [sodonnel@millhouse]$ perl twig.pl
    > <itemtags>foo</itemtags>
    > <itemtags>foo</itemtags>
    > <itemtags>foo</itemtags>
    >
    > I have no real experience parsing big xml files in Perl (or
    > anything). My file has 10 items at a total size of ~400K and it takes
    > ~ 5.2 CPU seconds to parse it and print each chunk. That seems slow
    > to me - can I expect to parse the file faster than that?


    Hi,

    Your code looks about right. The $elt->set_asis I believe is useless
    (and dangerous actually), you should probably get rid of it.
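
A minimal sketch of the same handler without set_asis, for illustration only. It assumes the wrapper element is called item, as in the sample file, and uses XML::Twig's xml_string method, which the XML::Twig documentation describes as returning the element's content without its own start and end tags. The function name item_chunks is made up for this example, and it parses a string rather than a file so it can be tried in isolation.

```perl
use strict;
use warnings;
use XML::Twig;

# Return the inner markup of every <item>, i.e. its children
# without the <item> wrapper itself.
sub item_chunks {
    my ($xml) = @_;
    my @chunks;
    my $twig = XML::Twig->new(
        twig_handlers => {
            item => sub {
                my ($t, $elt) = @_;
                # xml_string: serialized content, enclosing tags excluded
                push @chunks, $elt->xml_string;
                # release everything parsed so far to keep memory flat
                $t->purge;
            },
        },
    );
    $twig->parse($xml);
    return @chunks;
}
```

For a file on disk, `$twig->parsefile('data.xml')` would replace the `parse` call, and each chunk could be handed to the downstream process inside the handler instead of being collected in an array.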

    As far as speed goes, it depends on your system. You can have a look at
    the various benchmarks in the Ways to Rome series:
    http://xmltwig.com/article/index_wtr.html (basically, XML::LibXML is very
    fast; most other modules are slower).
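
Since XML::LibXML comes up here, this is a rough sketch of how the same extraction might look with its pull-parser interface, XML::LibXML::Reader. It is not something benchmarked in this thread. The element name item comes from the sample file; the function name item_chunks_libxml is invented for the example.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

# Walk the document with a streaming reader and collect the inner
# markup of each <item> element.
sub item_chunks_libxml {
    my ($xml) = @_;
    my $reader = XML::LibXML::Reader->new(string => $xml)
        or die "cannot create reader\n";
    my @chunks;
    # nextElement returns 1 while another <item> is found
    while ($reader->nextElement('item') == 1) {
        # readInnerXml serializes the children of the current node
        push @chunks, $reader->readInnerXml;
    }
    return @chunks;
}
```

To read from a file, `XML::LibXML::Reader->new(location => 'data.xml')` would be used instead of the `string` argument. Whether this beats the other approaches on a given file would need to be measured.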
     
    mirod, Feb 5, 2007
    #3
  4. Stephen O'D

    Stephen O'D Guest

    > <> wrote:
    > >That's not exactly what I want. My file is more like:

    >
    > ><file>
    > > <item>
    > > <itemtags>
    > > <tag1>foo</tag1>
    > > <tag2>bar</tag2>

    > [snip]
    > >So I need a series of XML chunks like the following as my output (they
    > >will be passed to another process for processing one at a time):

    >
    > > <itemtags>
    > > <tag1>foo</tag1>
    > > <tag2>bar</tag2>
    > > </itemtags>

    >
    > Well, then it would be even easier, but...
    >
    >
    >
    > >Also, the files I am dealing with are going to be large, and each
    > >itemtags section is about 32K in size.

    > [snip]
    > >I have no real experience parsing big xml files in Perl (or
    > >anything). My file has 10 items at a total size of ~400K and it takes
    > >~ 5.2 CPU seconds to parse it and print each chunk. That seems slow
    > >to me - can I expect to parse the file faster than that?


    OK, for those who are interested, I now have two ways of doing this,
    using XML::Twig or XML::Parser.

    XML::Twig code:

    [sodonnel@millhouse]$ more twig.pl
    use XML::Twig;
    use Benchmark;

    my $item;

    sub print_it {
        my ($t, $elt) = @_;
        $elt->set_asis;
        # Storing the text in $item and then clearing it is pointless,
        # but it keeps the test fair against what I am doing with
        # XML::Parser.
        $item = $elt->sprint($elt,1) . "\n";
        $item = '';
        $t->purge;
    }

    my $t = XML::Twig->new( twig_handlers =>
        { 'cloudItem' => \&print_it }
    );
    my $bstart = new Benchmark;
    $t->parsefile( 'cloud.xml');
    my $bend = new Benchmark;

    print timestr(timediff($bend,$bstart)), "\n";

    XML::Parser code:

    [sodonnel@millhouse]$ more xml_parser.pl
    use XML::Parser;
    use Benchmark;

    # start outside any cloudItem, with an empty buffer
    my ($in_item, $item_text) = (0, '');

    my $bstart = new Benchmark;

    my $parser = XML::Parser->new(Handlers => { Start => \&tag_start,
                                                End   => \&tag_end,
                                                Char  => \&characters,
                                              });

    $parser->parsefile('cloud.xml');
    my $bend = new Benchmark;

    print timestr(timediff($bend,$bstart)), "\n";

    exit(0);

    sub tag_start {
        my ($xp, $el) = @_;
        # copies every start tag except the outermost cloudItem into $item_text
        if ($in_item >= 1) { $item_text .= $xp->recognized_string }
        if ($el eq 'cloudItem') { $in_item += 1 }
    }

    sub tag_end {
        my ($xp, $el) = @_;
        if ($el eq 'cloudItem') { $in_item -= 1 }
        if ($in_item == 0) {
            #print $item_text;
            $item_text = '';
        } else {
            # copies everything but the closing cloudItem tag
            $item_text .= $xp->recognized_string;
        }
    }

    sub characters {
        my ($xp, $txt) = @_;
        # $txt is the decoded text, so entities such as &amp; arrive unescaped
        if ($in_item) { $item_text .= $txt }
    }

    [sodonnel@millhouse]$ perl xml_parser.pl
    1 wallclock secs ( 1.59 usr + 0.00 sys = 1.59 CPU)
    [sodonnel@millhouse]$ perl twig.pl
    5 wallclock secs ( 5.14 usr + 0.02 sys = 5.16 CPU)

    So XML::Parser wins by quite a way, probably because it doesn't build
    an in-memory structure of the tags. Goodness only knows if this is the
    best way, but it's good enough for now.
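
One thing to watch with the handler approach: the Char handler receives decoded text, so an entity such as &amp;amp; comes through as a bare &amp; and the collected chunk may no longer be well-formed. A possible variation, sketched below under the assumption that the goal is to keep the markup exactly as written, is to append recognized_string in the Char handler as well. The function name item_chunks_expat and the $wrapper parameter are invented for this example; it parses a string so it can be tried on its own.

```perl
use strict;
use warnings;
use XML::Parser;

# Collect the raw markup found inside each $wrapper element,
# tracking nesting depth with a counter.
sub item_chunks_expat {
    my ($xml, $wrapper) = @_;
    my ($depth, $text) = (0, '');
    my @chunks;
    my $parser = XML::Parser->new(
        Handlers => {
            Start => sub {
                my ($xp, $el) = @_;
                # inside a wrapper: keep the start tag as written
                $text .= $xp->recognized_string if $depth;
                $depth++ if $el eq $wrapper;
            },
            End => sub {
                my ($xp, $el) = @_;
                $depth-- if $el eq $wrapper;
                if ($depth) {
                    # still inside a wrapper: keep the end tag
                    $text .= $xp->recognized_string;
                }
                elsif ($el eq $wrapper) {
                    # outermost wrapper just closed: emit the chunk
                    push @chunks, $text;
                    $text = '';
                }
            },
            Char => sub {
                my ($xp) = @_;
                # use the original document text, not the decoded string
                $text .= $xp->recognized_string if $depth;
            },
        },
    );
    $parser->parse($xml);
    return @chunks;
}
```

Called as `item_chunks_expat($xml, 'cloudItem')`, this follows the same depth-counting idea as the script above; only the Char handler and the point at which a chunk is emitted differ.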

    Cheers,

    Stephen.
     
    Stephen O'D, Feb 6, 2007
    #4
