How to import only part of a large XML file?

Discussion in 'Perl Misc' started by Dwight Army of Champions, Nov 11, 2011.

  1. I have a very large XML file that I want to load, but I don't
    necessarily want to load the entire document; that takes too long. What
    I want to do instead is load only the key/value pairs that meet certain
    criteria, like grabbing only entries whose value falls within a certain
    date range for a key date_of_entry. Can I just use XML::Simple for this,
    or do I need a better module?
     
    Dwight Army of Champions, Nov 11, 2011
    #1

  2. On Nov 11, 5:59 pm, Ben Morrow <> wrote:
    > Quoth Dwight Army of Champions <>:
    >
    > > I have a very large XML file that I want to load, but I don't want to
    > > necessarily load the entire document; that takes too long. What I want
    > > to do instead is only key/value pairs that meet certain criteria, like
    > > only grab entries whose value fall within a certain date for a key
    > > date_of_entry. Can I just use XML::Simple for this or do I need a
    > > better module?

    >
    > It sounds like you want either XML::Twig or one of the SAX modules.
    > XML::Simple, at least in non-SAX mode, will load the entire document
    > into a tree structure before letting you see any of it.
    >
    > Ben


    I'm glancing at XML::Twig on search.cpan.org. What methods can I use
    to accomplish these tasks? I don't see any kind of "filter" method...
     
    Dwight Army of Champions, Nov 11, 2011
    #2

  3. * Dwight Army of Champions wrote in comp.lang.perl.misc:
    >I have a very large XML file that I want to load, but I don't want to
    >necessarily load the entire document; that takes too long. What I want
    >to do instead is only key/value pairs that meet certain criteria, like
    >only grab entries whose value fall within a certain date for a key
    >date_of_entry. Can I just use XML::Simple for this or do I need a
    >better module?


    It depends on what you mean by "key/value pairs". If you want to filter
    elements based on attributes, and don't particularly need to look at
    child elements, then the SAX modules are likely a good fit: they report
    events like "start of element plus attributes" and "end of element", and
    you have to manage state between the events. Generally, this should help
    <http://perl-xml.sourceforge.net/faq/#parser_selection>, and if you have
    special needs, is likely to give the
    best advice.
    --
    Björn Höhrmann · mailto: · http://bjoern.hoehrmann.de
    Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
    25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
     
    Bjoern Hoehrmann, Nov 11, 2011
    #3
  4. On Nov 11, 6:26 pm, Bjoern Hoehrmann <> wrote:
    > * Dwight Army of Champions wrote in comp.lang.perl.misc:
    >
    > >I have a very large XML file that I want to load, but I don't want to
    > >necessarily load the entire document; that takes too long. What I want
    > >to do instead is only key/value pairs that meet certain criteria, like
    > >only grab entries whose value fall within a certain date for a key
    > >date_of_entry. Can I just use XML::Simple for this or do I need a
    > >better module?

    >
    > It depends on what you mean by "key/value pairs". If you want to filter
    > elements based on attributes, and don't particularily need to look at
    > child elements, then the SAX modules are likely a good fit, they report
    > events like "start of element plus attributes" and "end of element" and
    > you have to manage state between the events. Generally, this should help
    > <http://perl-xml.sourceforge.net/faq/#parser_selection>, and if you have
    > special needs, is likely to give the
    > best advice.


    For example, suppose I have the following XML input file:

    <?xml version="1.0"?>
    <library>
    <book>
    <title>Dreamcatcher</title>
    <author>Stephen King</author>
    <genre>Horror</genre>
    <pages>899</pages>
    <price>23.99</price>
    <rating>5</rating>
    <publication_date>11/27/2001</publication_date>
    </book>
    <book>
    <title>Mystic River</title>
    <author>Dennis Lehane</author>
    <genre>Thriller</genre>
    <pages>390</pages>
    <price>17.49</price>
    <rating>4</rating>
    <publication_date>07/22/2003</publication_date>
    </book>
    <book>
    <title>The Lord Of The Rings</title>
    <author>J. R. R. Tolkien</author>
    <genre>Fantasy</genre>
    <pages>3489</pages>
    <price>10.99</price>
    <rating>5</rating>
    <publication_date>10/12/2005</publication_date>
    </book>
    </library>


    Suppose I only want to import books that were published after January
    1, 2002. If I apply such a filter when I do my initial import, the
    result should look like this:

    $VAR1 = {
    'book' => [
    {
    'publication_date' => '07/22/2003',
    'price' => '17.49',
    'author' => 'Dennis Lehane',
    'title' => 'Mystic River',
    'rating' => '4',
    'pages' => '390',
    'genre' => 'Thriller'
    },
    {
    'publication_date' => '10/12/2005',
    'price' => '10.99',
    'author' => 'J. R. R. Tolkien',
    'title' => 'The Lord Of The Rings',
    'rating' => '5',
    'pages' => '3489',
    'genre' => 'Fantasy'
    }
    ]
    };

    The import will completely ignore entries that don't meet the
    specified criteria (in this case, publication_date >= '1/1/2002').
     
    Dwight Army of Champions, Nov 12, 2011
    #4
  5. * Dwight Army of Champions wrote in comp.lang.perl.misc:
    >For example, suppose I have the following XML input file:


    >Suppose I only want to import books that were published after January
    >1, 2002. If I apply such a filter when I do my initial import, the
    >result should look like this:


    One way to do this would be with a SAX filter: you look for "book"
    elements, store all events until you can decide whether you are
    interested in this branch, and then re-emit or discard the events.
    You can then use some module that turns the SAX stream into some
    more Perl-ish data structure. There are some libraries that allow
    you to filter in this fashion automatically ("xpath filtering"),
    but I am not sure which, if any, modules for Perl do this for you.
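
A minimal sketch of the event-driven idea, assuming the XML::SAX distribution is installed. It is simpler than a true re-emitting filter: instead of buffering and replaying events, the handler (the hypothetical class BookCollector) just accumulates the text of each child of a <book> and decides at the closing tag whether to keep the record. The element names and the MM/DD/YYYY date format are taken from the sample data in this thread.

```perl
use strict;
use warnings;

package BookCollector;    # hypothetical handler class, for illustration only
use parent 'XML::SAX::Base';

sub new {
    my ($class, %args) = @_;
    my $self = $class->SUPER::new();
    $self->{cutoff} = $args{cutoff};    # ISO date string, e.g. '2002-01-01'
    $self->{kept}   = [];
    return $self;
}

sub start_element {
    my ($self, $el) = @_;
    if ($el->{Name} eq 'book') {
        $self->{current} = {};                      # start a new record
    }
    elsif ($self->{current}) {
        $self->{field} = $el->{Name};               # a child such as <title>
        $self->{current}{ $self->{field} } = '';
    }
}

sub characters {
    my ($self, $data) = @_;
    # Ignore text outside a child element (e.g. whitespace between tags).
    return unless $self->{current} && defined $self->{field};
    $self->{current}{ $self->{field} } .= $data->{Data};
}

sub end_element {
    my ($self, $el) = @_;
    if ($el->{Name} eq 'book') {
        my $book = delete $self->{current};
        my ($month, $day, $year) =
            ($book->{publication_date} // '') =~ m{\A(\d+)/(\d+)/(\d+)\z};
        if (defined $year) {
            my $iso = sprintf '%04d-%02d-%02d', $year, $month, $day;
            push @{ $self->{kept} }, $book if $iso ge $self->{cutoff};
        }
    }
    else {
        $self->{field} = undef;
    }
}

package main;
use XML::SAX::ParserFactory;

my $xml = <<'XML';
<library>
  <book><title>Dreamcatcher</title><publication_date>11/27/2001</publication_date></book>
  <book><title>Mystic River</title><publication_date>07/22/2003</publication_date></book>
</library>
XML

my $handler = BookCollector->new(cutoff => '2002-01-01');
my $parser  = XML::SAX::ParserFactory->parser(Handler => $handler);
$parser->parse_string($xml);

print "$_->{title}\n" for @{ $handler->{kept} };
```

Only one <book> record is held in memory at a time, which is the point of the event-driven approach; the price is the manual state tracking in the three callbacks.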

    Note that size is quite important here, with 100 MB you might just
    suffer "too long" but with 5 GB you might suffer "impossible" for
    some possible solutions. Some "reader"-style APIs allow you to go
    to a "book" element, read everything up to the end of the element
    into some DOM-style representation, and then make it easy to check
    if you are interested in this branch as you have DOM-style access,
    but only to the interesting part, so you save memory. Similar to
    the SAX filter solution, except that you trade some memory and
    perhaps speed for ease of programming.
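
One possible instance of such a "reader"-style API in Perl is XML::LibXML::Reader. The sketch below assumes XML::LibXML is installed and uses the element names and MM/DD/YYYY dates from the sample data in this thread; it jumps to each <book>, copies just that subtree into a small DOM node, and queries it with XPath.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

my $xml = <<'XML';
<library>
  <book><title>Dreamcatcher</title><publication_date>11/27/2001</publication_date></book>
  <book><title>Mystic River</title><publication_date>07/22/2003</publication_date></book>
</library>
XML

# For a file on disk, pass location => 'huge.xml' instead of string => $xml.
my $reader = XML::LibXML::Reader->new(string => $xml);

my @kept;
while ($reader->nextElement('book')) {
    # Deep-copy only the current <book> subtree into a DOM node.
    my $book = $reader->copyCurrentNode(1);

    my ($month, $day, $year) =
        $book->findvalue('publication_date') =~ m{\A(\d+)/(\d+)/(\d+)\z}
        or next;    # skip entries without a parsable date

    my $iso = sprintf '%04d-%02d-%02d', $year, $month, $day;
    push @kept, $book->findvalue('title') if $iso ge '2002-01-01';
}

print "$_\n" for @kept;
```

Each copied node goes out of scope at the end of the loop body, so memory use stays proportional to one <book>, not to the whole file.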
    --
    Björn Höhrmann · mailto: · http://bjoern.hoehrmann.de
    Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
    25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
     
    Bjoern Hoehrmann, Nov 12, 2011
    #5
  6. On 12 nov, 01:11, Dwight Army of Champions <> wrote:
    > Suppose I only want to import books that were published after January
    > 1, 2002. If I apply such a filter when I do my initial import, the
    > result should look like this:
    >
    > $VAR1 = {
    >           'book' => [
    >                     {
    >                       'publication_date' => '07/22/2003',
    >                       'price' => '17.49',
    >                       'author' => 'Dennis Lehane',
    >                       'title' => 'Mystic River',
    >                       'rating' => '4',
    >                       'pages' => '390',
    >                       'genre' => 'Thriller'
    >                     },
    >                     {
    >                       'publication_date' => '10/12/2005',
    >                       'price' => '10.99',
    >                       'author' => 'J. R. R. Tolkien',
    >                       'title' => 'The Lord Of The Rings',
    >                       'rating' => '5',
    >                       'pages' => '3489',
    >                       'genre' => 'Fantasy'
    >                     }
    >                   ]
    >         };


    That's a perfect Job for XML::Reader

    use strict;
    use warnings;

    use XML::Reader;
    use XML::Simple;
    use Data::Dumper;

    my $huge_xml =
    q{<?xml version="1.0"?>
    <library>
    <book>
    <title>Dreamcatcher</title>
    <author>Stephen King</author>
    <genre>Horror</genre>
    <pages>899</pages>
    <price>23.99</price>
    <rating>5</rating>
    <publication_date>11/27/2001</publication_date>
    </book>
    <book>
    <title>Mystic River</title>
    <author>Dennis Lehane</author>
    <genre>Thriller</genre>
    <pages>390</pages>
    <price>17.49</price>
    <rating>4</rating>
    <publication_date>07/22/2003</publication_date>
    </book>
    <book>
    <title>The Lord Of The Rings</title>
    <author>J. R. R. Tolkien</author>
    <genre>Fantasy</genre>
    <pages>3489</pages>
    <price>10.99</price>
    <rating>5</rating>
    <publication_date>10/12/2005</publication_date>
    </book>
    </library>
    };

    my $selected = { book => [] };

    my $rdr = XML::Reader->new(\$huge_xml, {mode => 'branches'},
    { root => '/library/book', branch => '*' });

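    while ($rdr->iterate) {
        # Each iteration yields one <book>...</book> chunk as an XML string.
        my $small_ref = XMLin($rdr->rvalue);

        # The sample dates are written MM/DD/YYYY, so the month comes first.
        my ($month, $day, $year) =
          $small_ref->{'publication_date'} =~
          m{\A (\d+) / (\d+) / (\d+) \z}xms;

        # Fall back to zeros so an unparsable date sorts before the cutoff.
        unless (defined $day)   { $day   = 0; }
        unless (defined $month) { $month = 0; }
        unless (defined $year)  { $year  = 0; }

        # ISO-formatted dates compare correctly as plain strings.
        my $date = sprintf('%04d-%02d-%02d', $year, $month, $day);

        if ($date ge '2002-01-01') {
            push @{$selected->{book}}, $small_ref;
        }
    }
    print Dumper($selected);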

    > The import will completely ignore entries that don't meet the
    > specified criteria (in this case, publication_date >= '1/1/2002').


    Yes, the way it works is that XML::Reader reads only small chunks from
    the huge XML (via $rdr->rvalue), a small chunk being one
    '<book>...</book>' part. This small chunk is then fed into
    XML::Simple::XMLin() to generate a small structure in memory, which can
    then be used to extract the date. If the date is >= 1/1/2002, that small
    structure in memory is pushed onto the selected structure.
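
The date comparison only works because the string is first rewritten as YYYY-MM-DD, which sorts correctly with 'ge'. A small sketch of that step on its own, using core Perl only; the helper name us_date_to_iso is made up for this illustration, and it assumes the MM/DD/YYYY layout of the sample data.

```perl
use strict;
use warnings;

# Convert a US-style MM/DD/YYYY string to ISO YYYY-MM-DD.
# Returns undef (in scalar context) when the input is missing or malformed.
sub us_date_to_iso {
    my ($us_date) = @_;
    return unless defined $us_date;
    my ($month, $day, $year) =
        $us_date =~ m{\A (\d{1,2}) / (\d{1,2}) / (\d{4}) \z}xms
        or return;
    return sprintf '%04d-%02d-%02d', $year, $month, $day;
}

for my $raw ('11/27/2001', '07/22/2003', 'unknown') {
    my $iso  = us_date_to_iso($raw);
    my $keep = defined $iso && $iso ge '2002-01-01';
    printf "%-10s -> %s\n", $raw, $keep ? "keep ($iso)" : 'skip';
}
```

Returning undef for a malformed date, and skipping it, is a design choice of this sketch; mapping such dates to zeros so that they sort before every cutoff would have the same effect for a ">=" filter.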
     
    Klaus, Nov 12, 2011
    #6
  7. On 12 nov, 11:28, Klaus <> wrote:
    > That's a perfect Job for XML::Reader
    > [...]
    > my $huge_xml =
    > q{<?xml version="1.0"?>
    > <library>
    > [...]
    > </library>
    > };
    > [...]
    > my $rdr = XML::Reader->new(\$huge_xml, {mode => 'branches'},
    >   { root => '/library/book', branch => '*' });


    That's, of course, better written with an external file ('huge.xml'):

    open my $fh, '>', 'huge.xml' or die $!;
    print {$fh}
    q{<?xml version="1.0"?>
    <library>
    [...]
    </library>
    };
    close $fh;
    [...]
    my $rdr = XML::Reader->new('huge.xml', {mode => 'branches'},
    { root => '/library/book', branch => '*' });
    [...]

    The rest stays exactly the same:

    > while ($rdr->iterate) {
    >     my $small_ref = XMLin($rdr->rvalue);
    >
    >     my ($day, $month, $year) =
    >       $small_ref->{'publication_date'} =~
    >       m{\A (\d+) / (\d+) / (\d+) \z}xms;
    >
    >     unless (defined $day)   { $day   = 0; }
    >     unless (defined $month) { $month = 0; }
    >     unless (defined $year)  { $year  = 0; }
    >
    >     my $date = sprintf('%04d-%02d-%02d', $year, $month, $day);
    >
    >     if ($date ge '2002-01-01') {
    >         push @{$selected->{book}}, $small_ref;
    >     }}
    >
    > print Dumper($selected);
    >
    > > The import will completely ignore entries that don't meet the
    > > specified criteria (in this case, publication_date >= '1/1/2002').

    >
    > Yes, the way it works is that XML::Reader reads from a huge XML only
    > small chunks (via $rdr->rvalue) (a small chunk being the '<book>...</
    > book> part). This small chunk is then fed into XML::Simple::XMLin() to
    > generate a small structure in memory which can then be used to extract
    > the date. if the date is >= 1/1/2002, then that small structure in
    > memory is pushed to a selected structure.
     
    Klaus, Nov 12, 2011
    #7
  8. On Nov 12, 5:28 am, Klaus <> wrote:
    > On 12 nov, 01:11, Dwight Army of Champions <> wrote:
    > > Suppose I only want to import books that were published after January
    > > 1, 2002. If I apply such a filter when I do my initial import, the
    > > result should look like this:

    >
    > > $VAR1 = {
    > >           'book' => [
    > >                     {
    > >                       'publication_date' => '07/22/2003',
    > >                       'price' => '17.49',
    > >                       'author' => 'Dennis Lehane',
    > >                       'title' => 'Mystic River',
    > >                       'rating' => '4',
    > >                       'pages' => '390',
    > >                       'genre' => 'Thriller'
    > >                     },
    > >                     {
    > >                       'publication_date' => '10/12/2005',
    > >                       'price' => '10.99',
    > >                       'author' => 'J. R. R. Tolkien',
    > >                       'title' => 'The Lord Of The Rings',
    > >                       'rating' => '5',
    > >                       'pages' => '3489',
    > >                       'genre' => 'Fantasy'
    > >                     }
    > >                   ]
    > >         };

    >
    > That's a perfect Job for XML::Reader
    >
    > use strict;
    > use warnings;
    >
    > use XML::Reader;
    > use XML::Simple;
    > use Data::Dumper;
    >
    > my $huge_xml =
    > q{<?xml version="1.0"?>
    > <library>
    >     <book>
    >         <title>Dreamcatcher</title>
    >         <author>Stephen King</author>
    >         <genre>Horror</genre>
    >         <pages>899</pages>
    >         <price>23.99</price>
    >         <rating>5</rating>
    >         <publication_date>11/27/2001</publication_date>
    >     </book>
    >     <book>
    >         <title>Mystic River</title>
    >         <author>Dennis Lehane</author>
    >         <genre>Thriller</genre>
    >         <pages>390</pages>
    >         <price>17.49</price>
    >         <rating>4</rating>
    >         <publication_date>07/22/2003</publication_date>
    >     </book>
    >     <book>
    >         <title>The Lord Of The Rings</title>
    >         <author>J. R. R. Tolkien</author>
    >         <genre>Fantasy</genre>
    >         <pages>3489</pages>
    >         <price>10.99</price>
    >         <rating>5</rating>
    >         <publication_date>10/12/2005</publication_date>
    >     </book>
    > </library>
    >
    > };
    >
    > my $selected = { book => [] };
    >
    > my $rdr = XML::Reader->new(\$huge_xml, {mode => 'branches'},
    >   { root => '/library/book', branch => '*' });
    >
    > while ($rdr->iterate) {
    >     my $small_ref = XMLin($rdr->rvalue);
    >
    >     my ($day, $month, $year) =
    >       $small_ref->{'publication_date'} =~
    >       m{\A (\d+) / (\d+) / (\d+) \z}xms;
    >
    >     unless (defined $day)   { $day   = 0; }
    >     unless (defined $month) { $month = 0; }
    >     unless (defined $year)  { $year  = 0; }
    >
    >     my $date = sprintf('%04d-%02d-%02d', $year, $month, $day);
    >
    >     if ($date ge '2002-01-01') {
    >         push @{$selected->{book}}, $small_ref;
    >     }}
    >
    > print Dumper($selected);
    >
    > > The import will completely ignore entries that don't meet the
    > > specified criteria (in this case, publication_date >= '1/1/2002').

    >
    > Yes, the way it works is that XML::Reader reads from a huge XML only
    > small chunks (via $rdr->rvalue) (a small chunk being the '<book>...</
    > book> part). This small chunk is then fed into XML::Simple::XMLin() to
    > generate a small structure in memory which can then be used to extract
    > the date. if the date is >= 1/1/2002, then that small structure in
    > memory is pushed to a selected structure.

    Yes that is exactly what I need. Thank you!

    Follow-up question: Suppose that the library contains more than just
    books. Let's say we expand the XML file to include music items, like
    so:

    <music>
    <title>The Future Will Come</title>
    <artist>The Juan Maclean</artist>
    <release_date>04/21/2009</release_date>
    <label>DFA</label>
    </music>
    <music>
    <title>Laughing Stock</title>
    <artist>Talk Talk</artist>
    <release_date>09/16/1991</release_date>
    <label>Verve</label>
    </music>
    <music>
    <title>Hardcore Will Never Die, But You Will</title>
    <artist>Mogwai</artist>
    <release_date>02/14/2011</release_date>
    <label>Rock Action Records</label>
    </music>

    Can we take the January 1, 2002 date and apply it to both
    publication_date for books and release_date for music?

    if ($item_is_a_book && $publication_date ge '2002-01-01') {
        push @{$selected->{book}}, $small_ref;
    }
    elsif ($item_is_a_music_item && $release_date ge '2002-01-01') {
        push @{$selected->{music}}, $small_ref;
    }

    I mean, I'm sure we could create an entirely separate XML::Reader
    object and do another traversal of the input file in another while
    loop (this time looking for music instead of books), but that would
    double the execution time of the program. I was wondering if we could
    look for both types of items in one go.
     
    Dwight Army of Champions, Nov 13, 2011
    #8
  9. On 2011-11-11 23:10, Dwight Army of Champions <> wrote:
    > On Nov 11, 5:59 pm, Ben Morrow <> wrote:
    >> Quoth Dwight Army of Champions <>:
    >>
    >> > I have a very large XML file that I want to load, but I don't want to
    >> > necessarily load the entire document; that takes too long. What I want
    >> > to do instead is only key/value pairs that meet certain criteria, like
    >> > only grab entries whose value fall within a certain date for a key
    >> > date_of_entry. Can I just use XML::Simple for this or do I need a
    >> > better module?

    >>
    >> It sounds like you want either XML::Twig or one of the SAX modules.
    >> XML::Simple, at least in non-SAX mode, will load the entire document
    >> into a tree structure before letting you see any of it.
    >>
    >> Ben

    >
    > I'm glancing at XML::Twig on search.cpan.org, What methods can I use
    > to accomplish these tasks? I don't see any kind of "filter" method...


    You specify the "filter" in the constructor. The twig_handlers attribute
    specifies which handler to call for each "twig" (i.e. an element and its
    descendants) that matches an XPath expression. So, if you can express
    your filter as an XPath, you just specify that and your handler will be
    called for each matching twig. If your filter is more complicated, you
    specify a more lenient XPath expression and then do additional filtering
    in the handler.

    For example, here is an excerpt from one of my scripts:

    [...]
    my $twig=XML::Twig->new(
        start_tag_handlers => {
            'table[@class="route"]' => sub {
                $in_route = 1;
            },
        },
        twig_handlers => {
            'title' => sub {
                my ($t, $title) = @_;
                my $stored_title = $title->children_trimmed_text();
                my $computed_title = "hjp: laufen: $date";
                unless ($stored_title eq $computed_title) {
                    $title->set_inner_xml($computed_title);
                    $modified = 1;
                }
            },
            'table[@class="route"]' => sub {
                $in_route = 0;
            },
            'tr' => sub {
                my ($t, $row) = @_;
                return unless $in_route;
                my @cells = $row->children;
                # print "# of cells: ", scalar(@cells), "\n";

                my @pl = $row->get_xpath('th');
                my $place = $pl[0]->children_trimmed_text if (@pl);

                # there doesn't seem to be an XPath expression
                # equivalent to the CSS selector [att~=val], so
                # we have to do it the hard way.
                my @dt = $row->get_xpath('td');
                @dt = grep { ($_->att('class') // '') =~ /\bdt\b/ } @dt;
                my $stored_dt = "";
                my $stored_q = "";
                if (@dt) {
                    $stored_dt = $dt[0]->children_trimmed_text;
                    if ($dt[0]->att('class') =~ /\bq([0-9])\b/) {
                        $stored_q = $1;
                    }
                }
    [...]

    I matched <title> and <table class="route"> elements directly, but I
    couldn't figure out how to match all td elements belonging to class
    "dt", so I matched on <tr> instead and then did a grep over the child
    elements. (I now also see that matching a <tr> within a <table
    class="route"> could be achieved in a much simpler way than I did it. I
    obviously didn't understand XPath very well when I wrote that.)
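
Applied to the library example from earlier in the thread, the same mechanism might look like the sketch below. It assumes XML::Twig is installed and that dates are MM/DD/YYYY as in the sample data: the handler is called for every complete <book>, does the date check itself (a lenient match plus filtering in the handler), copies what it wants to keep, and then purges the twig so memory does not grow with the file.

```perl
use strict;
use warnings;
use XML::Twig;

my $xml = <<'XML';
<library>
  <book><title>Dreamcatcher</title><publication_date>11/27/2001</publication_date></book>
  <book><title>Mystic River</title><publication_date>07/22/2003</publication_date></book>
</library>
XML

my @selected;

my $twig = XML::Twig->new(
    twig_handlers => {
        # Called once for each fully parsed <book> element.
        'book' => sub {
            my ($t, $book) = @_;

            my ($month, $day, $year) =
                $book->first_child_text('publication_date')
                    =~ m{\A(\d+)/(\d+)/(\d+)\z};

            if (defined $year
                && sprintf('%04d-%02d-%02d', $year, $month, $day)
                    ge '2002-01-01')
            {
                # Copy the child elements into a plain hash before purging.
                my %record;
                $record{ $_->tag } = $_->text for $book->children;
                push @selected, \%record;
            }

            $t->purge;    # release everything parsed so far
        },
    },
);

$twig->parse($xml);    # for a file on disk: $twig->parsefile('huge.xml')

print "$_->{title}\n" for @selected;
```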

    hp
     
    Peter J. Holzer, Nov 13, 2011
    #9
  10. On 13 nov, 06:44, Dwight Army of Champions <> wrote:
    > On Nov 12, 5:28 am, Klaus <> wrote:
    > > That's a perfect Job for XML::Reader
    > > [...]
    > > my $rdr = XML::Reader->new(\$huge_xml, {mode => 'branches'},
    > >   { root => '/library/book', branch => '*' });
    > > while ($rdr->iterate) {
    > >     my $small_ref = XMLin($rdr->rvalue);


    > Yes that is exactly what I need. Thank you!
    >
    > Follow-up question: Suppose that the library contains more than just
    > books. Let's say we expand the XML file to include music
    > items [...]
    >
    > Can we take the January 1, 2002 date and apply it to both
    > publication_date for books and release_date for music?
    >
    > if ($item_is_a_book && $publication_date ge '2002-01-01') {
    >   push @{$selected->{book}}, $small_ref;}
    >
    > else if ($item_is_a_music_item && $release_date ge '2002-01-01') {
    >   push @{$selected->{music}}, $small_ref;
    >
    > }
    >
    > I mean, I'm sure we could create an entirely separate XML::Reader
    > object and do another traversal of the input file in another while
    > loop (this time looking for music instead of books), but that would
    > double the execution time of the program. I was wondering if we could
    > look for both types of items in one go.


    Yes, that's in fact what XML::Reader is designed to do. You just need
    to add another line { root => '/library/music', branch => '*' } and
    then, inside your loop you just need to check $rdr->rx (which is 0 if
    it found a <book> item or 1 if it found a <music> item). With that
    logic, the file 'huge.xml' is parsed only once, while extracting
    <book> and/or <music> items as it goes along.

    *****************************************************

    The important lines are:

    [...]

    my $selected = { book => [], music => [] };

    my $rdr = XML::Reader->new('huge.xml', {mode => 'branches'},
    { root => '/library/book', branch => '*' },
    { root => '/library/music', branch => '*' });


    while ($rdr->iterate) {
    my $small_ref = XMLin($rdr->rvalue);
    my $topic = $rdr->rx == 0 ? 'book' : 'music';

    [...]

    *****************************************************

    Here is a complete program:

    use strict;
    use warnings;

    use XML::Reader;
    use XML::Simple;
    use Data::Dumper;

    open my $fh, '>', 'huge.xml' or die $!;

    print {$fh}
    q{<?xml version="1.0"?>
    <library>
    <book>
    <title>Dreamcatcher</title>
    <author>Stephen King</author>
    <genre>Horror</genre>
    <pages>899</pages>
    <price>23.99</price>
    <rating>5</rating>
    <publication_date>11/27/2001</publication_date>
    </book>
    <music>
    <title>The Future Will Come</title>
    <artist>The Juan Maclean</artist>
    <release_date>04/21/2009</release_date>
    <label>DFA</label>
    </music>
    <book>
    <title>Mystic River</title>
    <author>Dennis Lehane</author>
    <genre>Thriller</genre>
    <pages>390</pages>
    <price>17.49</price>
    <rating>4</rating>
    <publication_date>07/22/2003</publication_date>
    </book>
    <music>
    <title>Laughing Stock</title>
    <artist>Talk Talk</artist>
    <release_date>09/16/1991</release_date>
    <label>Verve</label>
    </music>
    <book>
    <title>The Lord Of The Rings</title>
    <author>J. R. R. Tolkien</author>
    <genre>Fantasy</genre>
    <pages>3489</pages>
    <price>10.99</price>
    <rating>5</rating>
    <publication_date>10/12/2005</publication_date>
    </book>
    <music>
    <title>Hardcore Will Never Die, But You Will</title>
    <artist>Mogwai</artist>
    <release_date>02/14/2011</release_date>
    <label>Rock Action Records</label>
    </music>
    </library>
    };

    close $fh;

    my $selected = { book => [], music => [] };

    my $rdr = XML::Reader->new('huge.xml', {mode => 'branches'},
    { root => '/library/book', branch => '*' },
    { root => '/library/music', branch => '*' });

    while ($rdr->iterate) {
        my $small_ref = XMLin($rdr->rvalue);
        # rx is the index of the matching { root => ... } entry above.
        my $topic = $rdr->rx == 0 ? 'book' : 'music';

        my $dat_ele = $topic eq 'book'
          ? $small_ref->{'publication_date'}
          : $small_ref->{'release_date'};

        # The sample dates are written MM/DD/YYYY, so the month comes first.
        my ($month, $day, $year) = $dat_ele =~
          m{\A (\d+) / (\d+) / (\d+) \z}xms;

        unless (defined $day)   { $day   = 0; }
        unless (defined $month) { $month = 0; }
        unless (defined $year)  { $year  = 0; }

        my $date = sprintf('%04d-%02d-%02d', $year, $month, $day);

        if ($topic eq 'book') {
            if ($date ge '2002-01-01') {
                push @{$selected->{book}}, $small_ref;
            }
        }
        elsif ($topic eq 'music') {
            if ($date ge '2002-01-01') {
                push @{$selected->{music}}, $small_ref;
            }
        }
    }

    print Dumper($selected);
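
If more item types are added later, the per-type logic could also be driven by a small table indexed by $rdr->rx. This is only a sketch in core Perl, with the parsed items faked as plain hashes so that it runs without XML::Reader; the names @rules and classify are invented for the illustration.

```perl
use strict;
use warnings;

# One entry per { root => ... } argument, in the same order, so that the
# array index corresponds to the value returned by $rdr->rx.
my @rules = (
    { topic => 'book',  date_field => 'publication_date' },
    { topic => 'music', date_field => 'release_date'     },
);

# Return the topic name if the item passes the cutoff, otherwise nothing.
sub classify {
    my ($rx, $item, $cutoff) = @_;
    my $rule = $rules[$rx] or return;

    my ($month, $day, $year) =
        ($item->{ $rule->{date_field} } // '') =~ m{\A(\d+)/(\d+)/(\d+)\z}
        or return;

    my $iso = sprintf '%04d-%02d-%02d', $year, $month, $day;
    return unless $iso ge $cutoff;
    return $rule->{topic};
}

# Stand-ins for ($rdr->rx, XMLin($rdr->rvalue)) pairs.
my @items = (
    [ 0, { title => 'Dreamcatcher',   publication_date => '11/27/2001' } ],
    [ 0, { title => 'Mystic River',   publication_date => '07/22/2003' } ],
    [ 1, { title => 'Laughing Stock', release_date     => '09/16/1991' } ],
    [ 1, { title => 'The Future Will Come', release_date => '04/21/2009' } ],
);

my $selected = { book => [], music => [] };
for my $pair (@items) {
    my ($rx, $item) = @$pair;
    if (my $topic = classify($rx, $item, '2002-01-01')) {
        push @{ $selected->{$topic} }, $item;
    }
}

printf "%s: %s\n", $_, join ', ', map { $_->{title} } @{ $selected->{$_} }
    for sort keys %$selected;
```

Inside the real loop, the call would be classify($rdr->rx, $small_ref, '2002-01-01'), and adding a third item type would mean one more { root => ... } argument and one more line in @rules.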
     
    Klaus, Nov 13, 2011
    #10
  11. On Nov 11, 5:39 pm, Dwight Army of Champions <> wrote:
    > I have a very large XML file that I want to load, but I don't want to
    > necessarily load the entire document; that takes too long. What I want
    > to do instead is only key/value pairs that meet certain criteria, like
    > only grab entries whose value fall within a certain date for a key
    > date_of_entry. Can I just use XML::Simple for this or do I need a
    > better module?


    This depends on the nature of your input. I do this kind of thing
    every day, and use a simple regular expression to filter the file. Of
    course, you still have to read every line of the file to make sure
    that you catch all of your intended targets, but you would have to do
    that anyway. This is the kind of task for which it's a lot easier to
    hand roll your own parser than it is to look for, evaluate, learn,
    install, and use some third party module. In my opinion anyway. For
    example:

    SCRIPT
    #! perl
    use warnings;
    use strict;
    my %filter;
    while (<DATA>)
    {
        next unless /\w/;
        chomp;
        if ($_ =~ m!<order>(\d+)</order>!)
        {
            my $key = $1;
            while (<DATA>)
            {
                last if $_ =~ m!</pres>!;
                next unless $_ =~ m!<last>(\w+)</last>!;
                $filter{$key} = $1;
            }
        }
    }
    print "Finished processing file\n";
    foreach my $key (sort keys %filter) { print "$key => $filter{$key}\n"; }
    exit(0);

    __DATA__
    <pres>
    <order>1</order>
    <first>George</first>
    <last>Washington</last>
    <year>1788</year>
    </pres>
    <pres>
    <order>2</order>
    <first>John</first>
    <last>Adams</last>
    <year>1796</year>
    </pres>
    <pres>
    <order>3</order>
    <first>Thomas</first>
    <last>Jefferson</last>
    <year>1800</year>
    </pres>

    OUTPUT
    $perl filter_test.plx
    Finished processing file
    1 => Washington
    2 => Adams
    3 => Jefferson
    4 => Madison
    5 => Monroe
    6 => Adams
     
    ccc31807, Nov 16, 2011
    #11
  12. ccc31807 <> writes:
    > On Nov 11, 5:39 pm, Dwight Army of Champions
    > <> wrote:
    >> I have a very large XML file that I want to load, but I don't want to
    >> necessarily load the entire document; that takes too long. What I want
    >> to do instead is only key/value pairs that meet certain criteria, like
    >> only grab entries whose value fall within a certain date for a key
    >> date_of_entry. Can I just use XML::Simple for this or do I need a
    >> better module?

    >
    > This depends on the nature of your input.


    [...]

    > while (<DATA>)
    > {
    > next unless /\w/;
    > chomp;
    > if ($_ =~ m!<order>(\d+)</order>!)
    > {
    > my $key = $1;
    > while (<DATA>)
    > {
    > last if $_ =~ m!</pres>!;
    > next unless $_ =~ m!<last>(\w+)</last>!;
    > $filter{$key} = $1;
    > }
    > }
    > }


    AFAIK, a well-formed XML file could have an order description looking like
    this:

    <order


    >1</order





    >


    meaning, it is not really possible to parse XML without doing a
    character-by-character lexical analysis of the input data stream
    first.
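
A small demonstration of the point, in core Perl only: the same record written two ways, both well-formed, fed line by line to the regular expression from the script quoted above. The helper name count_line_matches is invented for this sketch.

```perl
use strict;
use warnings;

# The same data, once in the compact layout the regex expects and once
# with whitespace before the closing '>' of each tag (still well-formed).
my $compact = "<pres>\n<order>1</order>\n<last>Washington</last>\n</pres>\n";
my $spread  = "<pres>\n<order\n>1</order\n>\n<last\n>Washington</last\n>\n</pres>\n";

# Count the lines on which the line-based pattern finds an <order> element.
sub count_line_matches {
    my ($text) = @_;
    my $hits = 0;
    for my $line (split /\n/, $text) {
        $hits++ if $line =~ m!<order>(\d+)</order>!;
    }
    return $hits;
}

printf "compact: %d, spread: %d\n",
    count_line_matches($compact), count_line_matches($spread);
# prints "compact: 1, spread: 0"
```

An XML parser reports the same <order> element for both inputs, while the line-based match silently finds nothing in the second one.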
     
    Rainer Weikusat, Nov 16, 2011
    #12
  13. Dwight Army of Champions <> writes:
    > I have a very large XML file that I want to load, but I don't want to
    > necessarily load the entire document; that takes too long. What I want
    > to do instead is only key/value pairs that meet certain criteria, like
    > only grab entries whose value fall within a certain date for a key
    > date_of_entry.


    This is impossible. Technically, XML is a sequential character stream
    and any structured data encoded as XML can only be recovered by
    aggregating characters into tokens based on the rules for XML tokens
    and parsing the resulting token stream.
     
    Rainer Weikusat, Nov 16, 2011
    #13
  14. Rainer Weikusat wrote:
    ) AFAIK, a well-formed XML file could have an order description looking like
    ) this:
    )
    )<order
    )
    )
    )>1</order
    )
    )
    )
    )
    )>
    )
    ) meaning, it is not really possible to parse XML without doing a
    ) character-by-character lexical analysis of the input data stream
    ) first.

    Indeed. To me, this is an argument that XML is usually a bad choice,
    especially when you use it to store, transmit and retrieve data.

    It's a _markup_ language, people! Not a data storage language.


    SaSW, Willem
    --
    Disclaimer: I am in no way responsible for any of the statements
    made in the above text. For all I know I might be
    drugged or something..
    No I'm not paranoid. You all think I'm paranoid, don't you !
    #EOT
     
    Willem, Nov 16, 2011
    #14
  15. On Nov 16, 11:50 am, Rainer Weikusat <> wrote:
    > AFAIK, a well-formed XML file could have an order description looking like
    > this:
    >
    > <order
    >
    > >1</order

    >
    > meaning, it is not really possible to parse XML without doing a
    > character-by-character lexical analysis of the input data stream
    > first.


    As I said, it depends on the nature of your input. XML handles
    'ragged' data as well as the kind of normalized data we would expect
    to use for an RDBMS. If you aren't sure of the format of your data,
    you obviously have to validate it somehow. Part of this might be
    removing whitespace at the beginning and ends of lines. Sometimes it
    might be removing newlines from several lines until you match some
    kind of closing tag.

    I don't advocate reinventing wheels. I also don't advocate searching
    for a CPAN module as the first step in solving a particular
    programming problem. If you need to run a script continually
    processing the same kind of input, it might pay to cobble together
    some code that does EXACTLY what you need, no more and no less, than
    to use someone else's code.

    I say this as a promiscuous user of CPAN modules -- hardly a week goes
    by that I don't install a new module for one reason or another -- and
    frequently I just look at the source, modify it to do what I need, and
    don't use or require the module.

    TIMTOWTDI, CC.
     
    ccc31807, Nov 16, 2011
    #15
  16. Dwight Army of Champions

    ccc31807 Guest

    On Nov 11, 7:11 pm, Dwight Army of Champions
    <>
    > <?xml version="1.0"?>
    > <library>
    > <book>
    >         <title>Dreamcatcher</title>
    >         <author>Stephen King</author>
    >         <genre>Horror</genre>
    >         <pages>899</pages>
    >         <price>23.99</price>
    >         <rating>5</rating>
    >         <publication_date>11/27/2001</publication_date>
    > </book>
    > ...
    > </library>


    If I had this kind of file, and it was a static file, I would read it
    into some kind of database. If you used something like SQLite, you
    could read it into a table <book> element by <book> element, and then
    use normal SQL to munge your data.
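
    A minimal sketch of that idea, assuming DBI and DBD::SQLite are
    installed from CPAN. The table layout, the iso_date helper and the
    hard-coded record are made up for illustration; in practice the
    records would come from whatever streaming parser you use. Dates are
    stored as YYYY-MM-DD so that a plain BETWEEN compares them correctly.

```perl
use strict;
use warnings;
use DBI;    # assumption: DBI and DBD::SQLite are available

# In-memory database for the sketch; a real run would name a file instead.
my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
    { RaiseError => 1, AutoCommit => 1 });

# Hypothetical schema mirroring the child elements of <book>.
$dbh->do(q{
    CREATE TABLE book (
        title TEXT, author TEXT, genre TEXT,
        pages INTEGER, price REAL, rating INTEGER,
        publication_date TEXT
    )
});

# Hypothetical helper: turn MM/DD/YYYY (the format in the sample file)
# into YYYY-MM-DD, which sorts and compares as a string.
sub iso_date {
    my ($mdy) = @_;
    my ($m, $d, $y) = $mdy =~ m{^(\d{1,2})/(\d{1,2})/(\d{4})$}
        or die "unexpected date format: $mdy";
    return sprintf '%04d-%02d-%02d', $y, $m, $d;
}

# One record, as it might arrive from a streaming parser.
my @books = (
    { title => 'Dreamcatcher', author => 'Stephen King', genre => 'Horror',
      pages => 899, price => 23.99, rating => 5,
      publication_date => '11/27/2001' },
);

# Insert book by book, converting the date on the way in.
my $ins = $dbh->prepare('INSERT INTO book VALUES (?, ?, ?, ?, ?, ?, ?)');
for my $book (@books) {
    $ins->execute(@{$book}{qw(title author genre pages price rating)},
                  iso_date($book->{publication_date}));
}

# "Normal SQL": titles published within a date range.
my $titles = $dbh->selectcol_arrayref(
    'SELECT title FROM book WHERE publication_date BETWEEN ? AND ?',
    undef, '2001-01-01', '2001-12-31');
print "$_\n" for @$titles;
```

    Once the data is loaded, the date filter is a one-line query and the
    XML never has to be parsed again.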

    Alternatively, you could convert the file into CSV format, which in many
    ways is a lot easier to handle than XML.

    It strikes me that using XML for this kind of work is overkill, unless
    you had a specific requirement to use XML. If you had to use XML it
    might pay to learn a little XSLT and use that instead of Perl. Perl is
    a great language for string processing, but in some cases XSLT works
    better.

    CC.
     
    ccc31807, Nov 16, 2011
    #16
  17. Dwight Army of Champions

    Klaus Guest

    On 16 nov, 18:17, ccc31807 <> wrote:
    > On Nov 11, 7:11 pm, Dwight Army of Champions
    > <>
    >
    > > <?xml version="1.0"?>
    > > <library>
    > > <book>
    > >         <title>Dreamcatcher</title>
    > >         <author>Stephen King</author>
    > >         <genre>Horror</genre>
    > >         <pages>899</pages>
    > >         <price>23.99</price>
    > >         <rating>5</rating>
    > >         <publication_date>11/27/2001</publication_date>
    > > </book>

    > ...
    > > </library>

    >
    > If I had this kind of file, and it was a static file, I would read it
    > into some kind of database. If you used something like SQLite, you
    > could read it into a table <book> element by <book> element, and then
    > use normal SQL to munge your data.
    >
    > Alternatively, you could convert the file into CSV format, which in many
    > ways is a lot easier to handle than XML.


    Converting to CSV is as easy as:

    use strict;
    use warnings;

    use XML::Reader;
    use Text::CSV_XS;

    my $rdr = XML::Reader->new('huge.xml', {mode => 'branches'},
        { root => '/library/book', branch => [
            '/title',
            '/author',
            '/genre',
            '/pages',
            '/price',
            '/rating',
            '/publication_date',
        ]},
        { root => '/library/music', branch => [
            '/title',
            '/artist',
            '/release_date',
            '/label',
        ]});

    my $csv = Text::CSV_XS->new({ sep_char => ',', binary => 1, eol => $/ });
    open my $ofh, '>', 'out.csv' or die $!;

    # One CSV row per matched element; rx tells which root matched
    # (0 for the first root, book; otherwise music).
    while ($rdr->iterate) {
        $csv->print($ofh, [ ($rdr->rx == 0 ? 'book' : 'music'), $rdr->value ]);
    }

    close $ofh;
     
    Klaus, Nov 16, 2011
    #17
  18. Dwight Army of Champions

    Klaus Guest

    On 16 nov, 17:32, ccc31807 <> wrote:
    > On Nov 11, 5:39 pm, Dwight Army of Champions
    >
    > <> wrote:
    > > I have a very large XML file that I want to load, but I don't want to
    > > necessarily load the entire document; that takes too long. What I want
    > > to do instead is only key/value pairs that meet certain criteria, like
    > > only grab entries whose value fall within a certain date for a key
    > > date_of_entry. Can I just use XML::Simple for this or do I need a
    > > better module?

    >
    > This depends on the nature of your input. I do this kind of thing
    > every day, and use a simple regular expression to filter the file. Of
    > course, you still have to read every line of the file to make sure
    > that you catch all of your intended targets, but you would have to do
    > that anyway. This is the kind of task for which it's a lot easier to
    > hand roll your own parser than it is to look for, evaluate, learn,
    > install, and use some third party module. In my opinion anyway. For
    > example:
    >
    > SCRIPT
    > #! perl
    > use warnings;
    > use strict;
    > my %filter;
    > while (<DATA>)
    > {
    >     next unless /\w/;
    >     chomp;
    >     if ($_ =~ m!<order>(\d+)</order>!)
    >     {
    >         my $key = $1;
    >         while (<DATA>)
    >         {
    >             last if $_ =~ m!</pres>!;
    >             next unless $_ =~ m!<last>(\w+)</last>!;
    >             $filter{$key} = $1;
    >         }
    >     }}
    >
    > print "Finished processing file\n";
    > foreach my $key (sort keys %filter) { print "$key => $filter{$key}\n"; }
    > exit(0);


    Using XML::Reader, it's even easier:

    use strict;
    use warnings;

    use XML::Reader;

    my %filter;

    my $rdr = XML::Reader->new(\*DATA,
        {mode => 'branches'},
        { root => '/data/pres', branch => [
            '/order',
            '/last',
        ]});

    while ($rdr->iterate) {
        my ($order, $last) = $rdr->value;
        $filter{$order} = $last;
    }

    print "Finished processing file\n";
    foreach my $key (sort keys %filter) {
        print "$key => $filter{$key}\n";
    }

    __DATA__
    <data>
    <pres>
    <order>1</order>
    <first>George</first>
    <last>Washington</last>
    <year>1788</year>
    </pres>
    <pres>
    <order>2</order>
    <first>John</first>
    <last>Adams</last>
    <year>1796</year>
    </pres>
    <pres>
    <order>3</order>
    <first>Thomas</first>
    <last>Jefferson</last>
    <year>1800</year>
    </pres>
    </data>
     
    Klaus, Nov 16, 2011
    #18
  19. ccc31807 <> writes:
    > On Nov 16, 11:50 am, Rainer Weikusat <> wrote:
    >> AFAIK, a well-formed XML file could have an order description looking like
    >> this:
    >>
    >> <order
    >>
    >> >1</order

    >>
    >> meaning, it is not really possible to parse XML without doing a
    >> character-by-character lexical analysis of the input data stream
    >> first.

    >
    > As I said, it depends on the nature of your input. XML handles
    > 'ragged' data as well as the kind of normalized data we would expect
    > to use for an RDBMS. If you aren't sure of the format of your data,
    > you obviously have to validate it somehow. Part of this might be
    > removing whitespace at the beginning and ends of lines. Sometimes it
    > might be removing newlines from several lines until you match some
    > kind of closing tag.


    The point I was trying to make is that the kind of input your (example) code
    can deal with needs to follow the rules of a grammar which is a proper
    subset of the XML grammar.
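
    To make that concrete, here is a minimal sketch in core Perl. The sub
    name orders_by_line and the two sample strings are made up for
    illustration. It applies the same per-line pattern as the quoted
    script to two well-formed encodings of the same element.

```perl
use strict;
use warnings;

# Hypothetical helper: apply the line-oriented pattern from the quoted
# script to a chunk of text and return every order number it finds.
sub orders_by_line {
    my ($text) = @_;
    my @found;
    for my $line (split /\n/, $text) {
        push @found, $1 if $line =~ m!<order>(\d+)</order>!;
    }
    return @found;
}

# Both strings are well-formed XML and carry the same information:
# whitespace is allowed before the closing '>' of a start or end tag.
my $compact = "<pres>\n<order>1</order>\n</pres>\n";
my $spread  = "<pres>\n<order\n\n>1</order\n\n>\n</pres>\n";

my @from_compact = orders_by_line($compact);
my @from_spread  = orders_by_line($spread);

printf "compact form: %d match(es)\n", scalar @from_compact;
printf "spread form:  %d match(es)\n", scalar @from_spread;
```

    The compact form yields one match and the spread form yields none,
    although an XML parser would treat the two identically. The pattern
    accepts only a subset of what the XML grammar allows.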
     
    Rainer Weikusat, Nov 17, 2011
    #19
  20. Dwight Army of Champions

    ccc31807 Guest

    On Nov 17, 2:33 pm, Rainer Weikusat <> wrote:
    > The point I was trying to make is that the kind of input your (example) code
    > can deal with needs to follow the rules of a grammar which is a proper
    > subset of the XML grammar.


    Yes, I understood your point. We all have to deal with messy data, and
    faulty input will kill an application with no hope of recovery if you
    don't deal with the possibility of corrupted data.

    That said, if you are confident of the format of your input (as you
    might be with an input file generated from a database) it might be
    quicker and easier to hand roll your own.

    If you have XML, you can use a SAX parser to process your input
    element by element, and I assume that it would handle your whitespace
    example without a problem.
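
    A minimal sketch of the event-driven approach, assuming XML::Parser
    is installed from CPAN. It is expat-based and SAX-like in style, not
    an actual SAX2 driver. The sub name last_names_by_order and the
    sample string are made up for illustration.

```perl
use strict;
use warnings;
use XML::Parser;    # assumption: installed from CPAN (needs expat)

# Hypothetical helper: stream through the XML and return order => last
# for every <pres> element, holding only one record in memory at a time.
sub last_names_by_order {
    my ($xml) = @_;
    my (%filter, %rec);
    my $text = '';

    my $parser = XML::Parser->new(Handlers => {
        Start => sub {
            my (undef, $name) = @_;
            %rec  = () if $name eq 'pres';   # new record begins
            $text = '';
        },
        Char => sub {
            my (undef, $chunk) = @_;
            $text .= $chunk;                 # text may arrive in pieces
        },
        End => sub {
            my (undef, $name) = @_;
            if ($name eq 'order' or $name eq 'last') {
                $rec{$name} = $text;
            }
            elsif ($name eq 'pres' and defined $rec{order}) {
                $filter{ $rec{order} } = $rec{last};
            }
            $text = '';
        },
    });
    $parser->parse($xml);
    return %filter;
}

# Same whitespace-inside-the-tag layout as in the quoted example.
my $xml = "<data><pres><order\n\n>1</order\n\n>"
        . "<last>Washington</last></pres></data>";
my %filter = last_names_by_order($xml);
print "$_ => $filter{$_}\n" for sort keys %filter;
```

    The tokenizing is left to the parser, so the odd tag layout makes no
    difference to the handlers. For a large file, parsefile can be used
    in place of parse so the document is never held in memory as a whole.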

    I don't deal with XML much, and I really appreciate the posts from
    others that illustrate scripts with XML::Reader and the like. I didn't
    have it but installed it yesterday, and have spent several hours
    piddling with it.

    CC.
     
    ccc31807, Nov 17, 2011
    #20