How to import only part of a large XML file?

  • Thread starter Dwight Army of Champions

Dwight Army of Champions

I have a very large XML file that I want to load, but I don't
necessarily want to load the entire document; that takes too long. What
I want to do instead is load only key/value pairs that meet certain
criteria, like only grabbing entries whose value falls within a certain
date range for a key date_of_entry. Can I just use XML::Simple for this,
or do I need a better module?
 

Dwight Army of Champions

> It sounds like you want either XML::Twig or one of the SAX modules.
> XML::Simple, at least in non-SAX mode, will load the entire document
> into a tree structure before letting you see any of it.
>
> Ben

I'm glancing at XML::Twig on search.cpan.org. What methods can I use
to accomplish these tasks? I don't see any kind of "filter" method...
 

Bjoern Hoehrmann

* Dwight Army of Champions wrote in comp.lang.perl.misc:
> I have a very large XML file that I want to load, but I don't want to
> necessarily load the entire document; that takes too long. What I want
> to do instead is only key/value pairs that meet certain criteria, like
> only grab entries whose value fall within a certain date for a key
> date_of_entry. Can I just use XML::Simple for this or do I need a
> better module?

It depends on what you mean by "key/value pairs". If you want to filter
elements based on attributes, and don't particularly need to look at
child elements, then the SAX modules are likely a good fit: they report
events like "start of element plus attributes" and "end of element", and
you have to manage state between the events. Generally,
<http://perl-xml.sourceforge.net/faq/#parser_selection> should help, and
if you have special needs, (e-mail address removed) is likely to give the
best advice.
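
To make "manage state between the events" concrete, here is a minimal sketch of a SAX handler. It is only an illustration: it assumes XML::SAX is installed, uses a small made-up <library> document with MM/DD/YYYY dates, and BookFilter is a hypothetical class name.

```perl
use strict;
use warnings;

# Hypothetical handler class, for illustration only.
package BookFilter;
use base 'XML::SAX::Base';

sub start_document {
    my ($self) = @_;
    $self->{books} = [];      # records that passed the filter
    $self->{book}  = undef;   # record currently being built
    $self->{field} = undef;   # child element currently being read
}

sub start_element {
    my ($self, $el) = @_;
    if ($el->{LocalName} eq 'book') {
        $self->{book} = {};
    }
    elsif ($self->{book}) {
        $self->{field} = $el->{LocalName};
        $self->{book}{ $self->{field} } = '';
    }
}

sub characters {
    my ($self, $chars) = @_;
    # Text can arrive in several pieces, so append rather than assign.
    $self->{book}{ $self->{field} } .= $chars->{Data}
        if $self->{book} && defined $self->{field};
}

sub end_element {
    my ($self, $el) = @_;
    if ($el->{LocalName} eq 'book') {
        my $book = $self->{book};
        my ($m, $d, $y) =
            ($book->{publication_date} // '') =~ m{\A(\d+)/(\d+)/(\d+)\z};
        push @{ $self->{books} }, $book
            if defined $y
            && sprintf('%04d-%02d-%02d', $y, $m, $d) ge '2002-01-01';
        $self->{book} = undef;    # forget the record either way
    }
    $self->{field} = undef;
}

package main;
use XML::SAX::ParserFactory;

my $xml = <<'XML';
<library>
  <book><title>Dreamcatcher</title><publication_date>11/27/2001</publication_date></book>
  <book><title>Mystic River</title><publication_date>07/22/2003</publication_date></book>
  <book><title>The Lord Of The Rings</title><publication_date>10/12/2005</publication_date></book>
</library>
XML

my $handler = BookFilter->new;
XML::SAX::ParserFactory->parser(Handler => $handler)->parse_string($xml);
print "$_->{title}\n" for @{ $handler->{books} };
```

Only one <book> record is held at a time, so memory use does not grow with the size of the input, only with the number of records kept.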
 

Dwight Army of Champions

For example, suppose I have the following XML input file:

<?xml version="1.0"?>
<library>
<book>
<title>Dreamcatcher</title>
<author>Stephen King</author>
<genre>Horror</genre>
<pages>899</pages>
<price>23.99</price>
<rating>5</rating>
<publication_date>11/27/2001</publication_date>
</book>
<book>
<title>Mystic River</title>
<author>Dennis Lehane</author>
<genre>Thriller</genre>
<pages>390</pages>
<price>17.49</price>
<rating>4</rating>
<publication_date>07/22/2003</publication_date>
</book>
<book>
<title>The Lord Of The Rings</title>
<author>J. R. R. Tolkien</author>
<genre>Fantasy</genre>
<pages>3489</pages>
<price>10.99</price>
<rating>5</rating>
<publication_date>10/12/2005</publication_date>
</book>
</library>


Suppose I only want to import books that were published on or after
January 1, 2002. If I apply such a filter when I do my initial import,
the result should look like this:

$VAR1 = {
'book' => [
{
'publication_date' => '07/22/2003',
'price' => '17.49',
'author' => 'Dennis Lehane',
'title' => 'Mystic River',
'rating' => '4',
'pages' => '390',
'genre' => 'Thriller'
},
{
'publication_date' => '10/12/2005',
'price' => '10.99',
'author' => 'J. R. R. Tolkien',
'title' => 'The Lord Of The Rings',
'rating' => '5',
'pages' => '3489',
'genre' => 'Fantasy'
}
]
};

The import will completely ignore entries that don't meet the
specified criteria (in this case, publication_date >= '1/1/2002').
 

Bjoern Hoehrmann

* Dwight Army of Champions wrote in comp.lang.perl.misc:
> For example, suppose I have the following XML input file:
> [...]
> Suppose I only want to import books that were published after January
> 1, 2002. If I apply such a filter when I do my initial import, the
> result should look like this:

One way to do this would be with a SAX filter: you look for "book"
elements, store all events until you can decide whether you are
interested in this branch, and then re-emit or discard the events.
You can then use some module that turns the SAX stream into some
more Perl-ish data structure. There are some libraries that allow
you to filter in this fashion automatically ("xpath filtering"),
but I am not sure which, if any, modules for Perl do this for you.

Note that size is quite important here, with 100 MB you might just
suffer "too long" but with 5 GB you might suffer "impossible" for
some possible solutions. Some "reader"-style APIs allow you to go
to a "book" element, read everything up to the end of the element
into some DOM-style representation, and then make it easy to check
if you are interested in this branch as you have DOM-style access,
but only to the interesting part, so you save memory. Similar to
the SAX filter solution, except that you trade some memory and
perhaps speed for ease of programming.
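
One "reader"-style API of this kind is XML::LibXML::Reader. The following is a rough sketch only, assuming XML::LibXML is installed and using a small made-up document with MM/DD/YYYY dates. It moves to each <book>, copies just that element into a DOM node, and checks it with XPath:

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

my $xml = <<'XML';
<library>
  <book><title>Dreamcatcher</title><publication_date>11/27/2001</publication_date></book>
  <book><title>Mystic River</title><publication_date>07/22/2003</publication_date></book>
  <book><title>The Lord Of The Rings</title><publication_date>10/12/2005</publication_date></book>
</library>
XML

# For a real file, use: XML::LibXML::Reader->new(location => 'huge.xml')
my $reader = XML::LibXML::Reader->new(string => $xml);
my @kept;

while ($reader->nextElement('book')) {
    # Deep copy of the current <book> only; the rest of the document
    # is never turned into a tree.
    my $book = $reader->copyCurrentNode(1);

    my ($m, $d, $y) =
        $book->findvalue('publication_date') =~ m{\A(\d+)/(\d+)/(\d+)\z};
    next unless defined $y;

    push @kept, $book->findvalue('title')
        if sprintf('%04d-%02d-%02d', $y, $m, $d) ge '2002-01-01';
}

print "$_\n" for @kept;
```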
 

Klaus

> Suppose I only want to import books that were published after January
> 1, 2002. If I apply such a filter when I do my initial import, the
> result should look like this:
> [...]

That's a perfect job for XML::Reader:

use strict;
use warnings;

use XML::Reader;
use XML::Simple;
use Data::Dumper;

my $huge_xml =
q{<?xml version="1.0"?>
<library>
<book>
<title>Dreamcatcher</title>
<author>Stephen King</author>
<genre>Horror</genre>
<pages>899</pages>
<price>23.99</price>
<rating>5</rating>
<publication_date>11/27/2001</publication_date>
</book>
<book>
<title>Mystic River</title>
<author>Dennis Lehane</author>
<genre>Thriller</genre>
<pages>390</pages>
<price>17.49</price>
<rating>4</rating>
<publication_date>07/22/2003</publication_date>
</book>
<book>
<title>The Lord Of The Rings</title>
<author>J. R. R. Tolkien</author>
<genre>Fantasy</genre>
<pages>3489</pages>
<price>10.99</price>
<rating>5</rating>
<publication_date>10/12/2005</publication_date>
</book>
</library>
};

my $selected = { book => [] };

my $rdr = XML::Reader->new(\$huge_xml, {mode => 'branches'},
  { root => '/library/book', branch => '*' });

while ($rdr->iterate) {
    my $small_ref = XMLin($rdr->rvalue);

    # The dates in the file are written MM/DD/YYYY
    my ($month, $day, $year) =
      $small_ref->{'publication_date'} =~
      m{\A (\d+) / (\d+) / (\d+) \z}xms;

    unless (defined $day)   { $day   = 0; }
    unless (defined $month) { $month = 0; }
    unless (defined $year)  { $year  = 0; }

    # Zero-padded YYYY-MM-DD strings compare correctly with 'ge'
    my $date = sprintf('%04d-%02d-%02d', $year, $month, $day);

    if ($date ge '2002-01-01') {
        push @{$selected->{book}}, $small_ref;
    }
}

print Dumper($selected);
> The import will completely ignore entries that don't meet the
> specified criteria (in this case, publication_date >= '1/1/2002').

Yes. XML::Reader reads the huge XML file only in small chunks (via
$rdr->rvalue), a small chunk being one '<book>...</book>' part. Each
chunk is fed into XML::Simple::XMLin() to generate a small structure in
memory, which can then be used to extract the date. If the date is on
or after 1/1/2002, that small structure is pushed onto the selected
structure.
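
The date test relies on zero-padded YYYY-MM-DD strings sorting chronologically under plain string comparison. A minimal sketch of that conversion, assuming the dates are MM/DD/YYYY (as 11/27/2001 in the sample data implies); iso_date is a hypothetical helper name, not part of any module:

```perl
use strict;
use warnings;

# Hypothetical helper: convert MM/DD/YYYY to YYYY-MM-DD.
# Returns undef when the input does not look like a date.
sub iso_date {
    my ($us_date) = @_;
    my ($month, $day, $year) =
        ($us_date // '') =~ m{\A (\d{1,2}) / (\d{1,2}) / (\d{4}) \z}x
        or return;
    return sprintf '%04d-%02d-%02d', $year, $month, $day;
}

for my $d ('11/27/2001', '07/22/2003') {
    my $iso = iso_date($d);
    printf "%s -> %s (%s)\n", $d, $iso,
        $iso ge '2002-01-01' ? 'keep' : 'skip';
}
```

Returning undef for unparsable input lets the caller decide whether to skip the record or report it, instead of silently comparing against an all-zero date.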
 

Klaus

> That's a perfect Job for XML::Reader
> [...]
> my $huge_xml =
> q{<?xml version="1.0"?>
> <library>
> [...]
> </library>
> };
> [...]
> my $rdr = XML::Reader->new(\$huge_xml, {mode => 'branches'},
>   { root => '/library/book', branch => '*' });

That's, of course, better written with an external file ('huge.xml'):

open my $fh, '>', 'huge.xml' or die $!;
print {$fh}
q{<?xml version="1.0"?>
<library>
[...]
</library>
};
close $fh;
[...]
my $rdr = XML::Reader->new('huge.xml', {mode => 'branches'},
{ root => '/library/book', branch => '*' });
[...]

The rest stays exactly the same.
 

Dwight Army of Champions

Yes, that is exactly what I need. Thank you!

Follow-up question: Suppose that the library contains more than just
books. Let's say we expand the XML file to include music items, like
so:

<music>
<title>The Future Will Come</title>
<artist>The Juan Maclean</artist>
<release_date>04/21/2009</release_date>
<label>DFA</label>
</music>
<music>
<title>Laughing Stock</title>
<artist>Talk Talk</artist>
<release_date>09/16/1991</release_date>
<label>Verve</label>
</music>
<music>
<title>Hardcore Will Never Die, But You Will</title>
<artist>Mogwai</artist>
<release_date>02/14/2011</release_date>
<label>Rock Action Records</label>
</music>

Can we take the January 1, 2002 date and apply it to both
publication_date for books and release_date for music?

if ($item_is_a_book && $publication_date ge '2002-01-01') {
    push @{$selected->{book}}, $small_ref;
}
elsif ($item_is_a_music_item && $release_date ge '2002-01-01') {
    push @{$selected->{music}}, $small_ref;
}

I mean, I'm sure we could create an entirely separate XML::Reader
object and do another traversal of the input file in another while
loop (this time looking for music instead of books), but that would
double the execution time of the program. I was wondering if we could
look for both types of items in one go.
 

Peter J. Holzer

> I'm glancing at XML::Twig on search.cpan.org, What methods can I use
> to accomplish these tasks? I don't see any kind of "filter" method...

You specify the "filter" in the constructor. The twig_handlers attribute
specifies which handler to call for each "twig" (i.e. an element and its
descendants) that matches an XPath expression. So, if you can express
your filter as an XPath, you just specify that and your handler will be
called for each matching twig. If your filter is more complicated, you
specify a more lenient XPath expression and then do additional filtering
in the handler.

For example, here is an excerpt from one of my scripts:

[...]
my $twig=XML::Twig->new(
    start_tag_handlers => {
        'table[@class="route"]' => sub {
            $in_route = 1;
        },
    },
    twig_handlers => {
        'title' => sub {
            my ($t, $title) = @_;
            my $stored_title = $title->children_trimmed_text();
            my $computed_title = "hjp: laufen: $date";
            unless ($stored_title eq $computed_title) {
                $title->set_inner_xml($computed_title);
                $modified = 1;
            }
        },
        'table[@class="route"]' => sub {
            $in_route = 0;
        },
        'tr' => sub {
            my ($t, $row) = @_;
            return unless $in_route;
            my @cells = $row->children;
            # print "# of cells: ", scalar(@cells), "\n";

            my @pl = $row->get_xpath('th');
            # avoid "my ... if ...", whose behaviour is undefined
            my $place = @pl ? $pl[0]->children_trimmed_text : undef;

            # there doesn't seem to be an XPath expression
            # equivalent to the CSS selector [att~=val], so
            # we have to do it the hard way.
            my @dt = $row->get_xpath('td');
            @dt = grep { ($_->att('class') // '') =~ /\bdt\b/ } @dt;
            my $stored_dt = "";
            my $stored_q = "";
            if (@dt) {
                $stored_dt = $dt[0]->children_trimmed_text;
                if ($dt[0]->att('class') =~ /\bq([0-9])\b/) {
                    $stored_q = $1;
                }
            }
[...]

I matched <title> and <table class="route"> elements directly, but I
couldn't figure out how to match all td elements belonging to class
"dt", so I matched on <tr> instead and then did a grep over the child
elements. (I now also see that matching a <tr> within a <table
class="route"> could be achieved in a much simpler way than I did it. I
obviously didn't understand XPath very well when I wrote that.)
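
Applied to the library example from earlier in the thread, the same idea might look like the sketch below. It is an illustration only, assuming XML::Twig is installed and the dates are MM/DD/YYYY; the sample document is shortened. The XPath-like trigger selects every <book>, the date check happens in the handler, and purge releases what has already been parsed.

```perl
use strict;
use warnings;
use XML::Twig;

my $xml = <<'XML';
<library>
  <book><title>Dreamcatcher</title><publication_date>11/27/2001</publication_date></book>
  <book><title>Mystic River</title><publication_date>07/22/2003</publication_date></book>
  <book><title>The Lord Of The Rings</title><publication_date>10/12/2005</publication_date></book>
</library>
XML

my @kept;

my $twig = XML::Twig->new(
    twig_handlers => {
        # called once per complete <book> element
        'library/book' => sub {
            my ($t, $book) = @_;

            my ($m, $d, $y) =
                $book->first_child_text('publication_date')
                    =~ m{\A(\d+)/(\d+)/(\d+)\z};

            if (defined $y
                && sprintf('%04d-%02d-%02d', $y, $m, $d) ge '2002-01-01') {
                # flatten the child elements into a plain hash
                push @kept,
                    { map { ($_->tag, $_->text) } $book->children('#ELT') };
            }

            $t->purge;    # free the memory used by everything parsed so far
        },
    },
);

$twig->parse($xml);    # for a real file: $twig->parsefile('huge.xml')

print "$_->{title}\n" for @kept;
```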

hp
 

Klaus

> > That's a perfect Job for XML::Reader
> > [...]
> > my $rdr = XML::Reader->new(\$huge_xml, {mode => 'branches'},
> >   { root => '/library/book', branch => '*' });
> > while ($rdr->iterate) {
> >     my $small_ref = XMLin($rdr->rvalue);
> Yes that is exactly what I need. Thank you!
>
> Follow-up question: Suppose that the library contains more than just
> books. Let's say we expand the XML file to include music
> items [...]
>
> Can we take the January 1, 2002 date and apply it to both
> publication_date for books and release_date for music?
> [...]
> I mean, I'm sure we could create an entirely separate XML::Reader
> object and do another traversal of the input file in another while
> loop (this time looking for music instead of books), but that would
> double the execution time of the program. I was wondering if we could
> look for both types of items in one go.

Yes, that's in fact what XML::Reader is designed to do. You just need
to add another line { root => '/library/music', branch => '*' } and
then, inside your loop you just need to check $rdr->rx (which is 0 if
it found a <book> item or 1 if it found a <music> item). With that
logic, the file 'huge.xml' is parsed only once, while extracting
<book> and/or <music> items as it goes along.

*****************************************************

The important lines are:

[...]

my $selected = { book => [], music => [] };

my $rdr = XML::Reader->new('huge.xml', {mode => 'branches'},
  { root => '/library/book',  branch => '*' },
  { root => '/library/music', branch => '*' });

while ($rdr->iterate) {
    my $small_ref = XMLin($rdr->rvalue);
    my $topic = $rdr->rx == 0 ? 'book' : 'music';

[...]

*****************************************************

Here is a complete program:

use strict;
use warnings;

use XML::Reader;
use XML::Simple;
use Data::Dumper;

open my $fh, '>', 'huge.xml' or die $!;

print {$fh}
q{<?xml version="1.0"?>
<library>
<book>
<title>Dreamcatcher</title>
<author>Stephen King</author>
<genre>Horror</genre>
<pages>899</pages>
<price>23.99</price>
<rating>5</rating>
<publication_date>11/27/2001</publication_date>
</book>
<music>
<title>The Future Will Come</title>
<artist>The Juan Maclean</artist>
<release_date>04/21/2009</release_date>
<label>DFA</label>
</music>
<book>
<title>Mystic River</title>
<author>Dennis Lehane</author>
<genre>Thriller</genre>
<pages>390</pages>
<price>17.49</price>
<rating>4</rating>
<publication_date>07/22/2003</publication_date>
</book>
<music>
<title>Laughing Stock</title>
<artist>Talk Talk</artist>
<release_date>09/16/1991</release_date>
<label>Verve</label>
</music>
<book>
<title>The Lord Of The Rings</title>
<author>J. R. R. Tolkien</author>
<genre>Fantasy</genre>
<pages>3489</pages>
<price>10.99</price>
<rating>5</rating>
<publication_date>10/12/2005</publication_date>
</book>
<music>
<title>Hardcore Will Never Die, But You Will</title>
<artist>Mogwai</artist>
<release_date>02/14/2011</release_date>
<label>Rock Action Records</label>
</music>
</library>
};

close $fh;

my $selected = { book => [], music => [] };

my $rdr = XML::Reader->new('huge.xml', {mode => 'branches'},
  { root => '/library/book',  branch => '*' },
  { root => '/library/music', branch => '*' });

while ($rdr->iterate) {
    my $small_ref = XMLin($rdr->rvalue);

    # rx is the index of the root that matched: 0 = book, 1 = music
    my $topic = $rdr->rx == 0 ? 'book' : 'music';

    my $dat_ele = $topic eq 'book'
      ? $small_ref->{'publication_date'}
      : $small_ref->{'release_date'};

    # The dates in the file are written MM/DD/YYYY
    my ($month, $day, $year) = $dat_ele =~
      m{\A (\d+) / (\d+) / (\d+) \z}xms;

    unless (defined $day)   { $day   = 0; }
    unless (defined $month) { $month = 0; }
    unless (defined $year)  { $year  = 0; }

    my $date = sprintf('%04d-%02d-%02d', $year, $month, $day);

    # Both item types use the same cutoff date
    if ($date ge '2002-01-01') {
        push @{$selected->{$topic}}, $small_ref;
    }
}

print Dumper($selected);
 

ccc31807

> I have a very large XML file that I want to load, but I don't want to
> necessarily load the entire document; that takes too long. What I want
> to do instead is only key/value pairs that meet certain criteria, like
> only grab entries whose value fall within a certain date for a key
> date_of_entry. Can I just use XML::Simple for this or do I need a
> better module?

This depends on the nature of your input. I do this kind of thing
every day, and use a simple regular expression to filter the file. Of
course, you still have to read every line of the file to make sure
that you catch all of your intended targets, but you would have to do
that anyway. This is the kind of task for which it's a lot easier to
hand roll your own parser than it is to look for, evaluate, learn,
install, and use some third party module. In my opinion anyway. For
example:

SCRIPT
#! perl
use warnings;
use strict;
my %filter;
while (<DATA>)
{
    next unless /\w/;
    chomp;
    if ($_ =~ m!<order>(\d+)</order>!)
    {
        my $key = $1;
        while (<DATA>)
        {
            last if $_ =~ m!</pres>!;
            next unless $_ =~ m!<last>(\w+)</last>!;
            $filter{$key} = $1;
        }
    }
}
print "Finished processing file\n";
foreach my $key (sort keys %filter) { print "$key => $filter{$key}\n"; }
exit(0);

__DATA__
<pres>
<order>1</order>
<first>George</first>
<last>Washington</last>
<year>1788</year>
</pres>
<pres>
<order>2</order>
<first>John</first>
<last>Adams</last>
<year>1796</year>
</pres>
<pres>
<order>3</order>
<first>Thomas</first>
<last>Jefferson</last>
<year>1800</year>
</pres>

OUTPUT (from a run over six records; only the first three are shown above)
$perl filter_test.plx
Finished processing file
1 => Washington
2 => Adams
3 => Jefferson
4 => Madison
5 => Monroe
6 => Adams
 

Rainer Weikusat

ccc31807 said:
> > I have a very large XML file that I want to load, but I don't want to
> > necessarily load the entire document; that takes too long. What I want
> > to do instead is only key/value pairs that meet certain criteria, like
> > only grab entries whose value fall within a certain date for a key
> > date_of_entry. Can I just use XML::Simple for this or do I need a
> > better module?
>
> This depends on the nature of your input.
> [...]
>
> while (<DATA>)
> {
>     next unless /\w/;
>     chomp;
>     if ($_ =~ m!<order>(\d+)</order>!)
>     {
>         my $key = $1;
>         while (<DATA>)
>         {
>             last if $_ =~ m!</pres>!;
>             next unless $_ =~ m!<last>(\w+)</last>!;
>             $filter{$key} = $1;
>         }
>     }
> }

AFAIK, a well-formed XML file could have an order description looking like
this:

<order


>1</order




>

meaning, it is not really possible to parse XML without doing a
character-by-character lexical analysis of the input data stream
first.
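
A small core-Perl sketch of the problem, using the same per-line pattern as the hand-rolled script. The two sample strings are made up for illustration; both are well-formed XML, because whitespace (including newlines) is allowed before the closing '>' of a tag. line_matches is a hypothetical helper name.

```perl
use strict;
use warnings;

my $compact = "<pres>\n<order>1</order>\n<last>Washington</last>\n</pres>\n";
my $spread  = "<pres>\n<order\n\n>1</order\n\n>\n<last>Washington</last>\n</pres>\n";

# Count the lines on which the line-by-line pattern finds an <order>.
sub line_matches {
    my ($xml) = @_;
    return scalar grep { m!<order>(\d+)</order>! } split /\n/, $xml;
}

printf "compact: %d match(es)\n", line_matches($compact);
printf "spread:  %d match(es)\n", line_matches($spread);
```

The first document yields one match and the second none, although a conforming XML parser reports the same element content for both.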
 

Rainer Weikusat

Dwight Army of Champions said:
> I have a very large XML file that I want to load, but I don't want to
> necessarily load the entire document; that takes too long. What I want
> to do instead is only key/value pairs that meet certain criteria, like
> only grab entries whose value fall within a certain date for a key
> date_of_entry.

This is impossible. Technically, XML is a sequential character stream
and any structured data encoded as XML can only be recovered by
aggregating characters into tokens based on the rules for XML tokens
and parsing the resulting token stream.
 

Willem

Rainer Weikusat wrote:
) AFAIK, a well-formed XML file could have an order description looking like
) this:
)
)<order
)
)
)>1</order
)
)
)
)
)>
)
) meaning, it is not really possible to parse XML without doing a
) character-by-character lexical analysis of the input data stream
) first.

Indeed. To me, this is an argument that XML is usually a bad choice,
especially when you use it to store, transmit and retrieve data.

It's a _markup_ language, people! Not a data storage language.


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
 

ccc31807

> AFAIK, a well-formed XML file could have an order description looking like
> this:
>
> <order
>
>
> >1</order
> [...]
>
> meaning, it is not really possible to parse XML without doing a
> character-by-character lexical analysis of the input data stream
> first.

As I said, it depends on the nature of your input. XML handles
'ragged' data as well as the kind of normalized data we would expect
to use for an RDBMS. If you aren't sure of the format of your data,
you obviously have to validate it somehow. Part of this might be
removing whitespace at the beginning and ends of lines. Sometimes it
might be removing newlines from several lines until you match some
kind of closing tag.

I don't advocate reinventing wheels. I also don't advocate searching
for a CPAN module as the first step in solving a particular
programming problem. If you need to run a script continually
processing the same kind of input, it might pay to cobble together
some code that does EXACTLY what you need, no more and no less, rather
than to use someone else's code.

I say this as a promiscuous user of CPAN modules -- hardly a week goes
by that I don't install a new module for one reason or another -- and
frequently I just look at the source, modify it to do what I need, and
don't use or require the module.

TIMTOWTDI, CC.
 

ccc31807

On Nov 11, 7:11 pm, Dwight Army of Champions wrote:
> <?xml version="1.0"?>
> <library>
> <book>
>         <title>Dreamcatcher</title>
>         <author>Stephen King</author>
>         <genre>Horror</genre>
>         <pages>899</pages>
>         <price>23.99</price>
>         <rating>5</rating>
>         <publication_date>11/27/2001</publication_date>
> </book> ....
> </library>

If I had this kind of file, and it was a static file, I would read it
into some kind of database. If you used something like SQLite, you
could read it into a table <book> element by <book> element, and then
use normal SQL to munge your data.
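
As a rough sketch of what that could look like, assuming DBI and DBD::SQLite are installed: the rows below stand in for records already extracted from the XML one <book> at a time, with dates stored as YYYY-MM-DD so that SQL string comparison orders them correctly. The table and column names are made up for illustration.

```perl
use strict;
use warnings;
use DBI;

# An in-memory database keeps the sketch self-contained;
# use a file name instead of :memory: to keep the data.
my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                       { RaiseError => 1 });

$dbh->do('CREATE TABLE book (title TEXT, author TEXT, publication_date TEXT)');

# Stand-in for records pulled out of the XML file.
my @books = (
    [ 'Dreamcatcher',          'Stephen King',     '2001-11-27' ],
    [ 'Mystic River',          'Dennis Lehane',    '2003-07-22' ],
    [ 'The Lord Of The Rings', 'J. R. R. Tolkien', '2005-10-12' ],
);

my $ins = $dbh->prepare('INSERT INTO book VALUES (?, ?, ?)');
$ins->execute(@$_) for @books;

my $rows = $dbh->selectall_arrayref(
    'SELECT title FROM book WHERE publication_date >= ?
     ORDER BY publication_date',
    undef, '2002-01-01');

print "$_->[0]\n" for @$rows;
```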

Alternatively, you could convert the file into CSV format, which in many
ways is a lot easier to handle than XML.

It strikes me that using XML for this kind of work is overkill, unless
you had a specific requirement to use XML. If you had to use XML it
might pay to learn a little XSLT and use that instead of Perl. Perl is
a great language for string processing, but in some cases XSLT works
better.

CC.
 

Klaus

> If I had this kind of file, and it was a static file, I would read it
> into some kind of database. If you used something like SQLite, you
> could read it into a table <book> element by <book> element, and then
> use normal SQL to munge your data.
>
> Alternative, you could convert the file into CSV format, which in many
> ways is a lot easier to handle than XML.

Converting to CSV is as easy as:

use strict;
use warnings;

use XML::Reader;
use Text::CSV_XS;

my $rdr = XML::Reader->new('huge.xml', {mode => 'branches'},
  { root => '/library/book', branch => [
    '/title',
    '/author',
    '/genre',
    '/pages',
    '/price',
    '/rating',
    '/publication_date',
  ]},
  { root => '/library/music', branch => [
    '/title',
    '/artist',
    '/release_date',
    '/label',
  ]});

my $csv = Text::CSV_XS->new({ sep_char => ',', binary => 1, eol => $/ });
open my $ofh, '>', 'out.csv' or die $!;

while ($rdr->iterate) {
    # first column names the record type, the rest are the branch values
    $csv->print($ofh, [ ($rdr->rx == 0 ? 'book' : 'music'), $rdr->value ]);
}

close $ofh;
 

Klaus

> This depends on the nature of your input. I do this kind of thing
> every day, and use a simple regular expression to filter the file. Of
> course, you still have to read every line of the file to make sure
> that you catch all of your intended targets, but you would have to do
> that anyway. This is the kind of task for which it's a lot easier to
> hand roll your own parser than it is to look for, evaluate, learn,
> install, and use some third party module. In my opinion anyway. For
> example:
> [...]

Using XML::Reader, it's even easier:

use strict;
use warnings;

use XML::Reader;

my %filter;

my $rdr = XML::Reader->new(\*DATA,
  {mode => 'branches'},
  { root => '/data/pres', branch => [
    '/order',
    '/last',
  ]});

while ($rdr->iterate) {
    my ($order, $last) = $rdr->value;
    $filter{$order} = $last;
}

print "Finished processing file\n";
foreach my $key (sort keys %filter) {
    print "$key => $filter{$key}\n";
}

__DATA__
<data>
<pres>
<order>1</order>
<first>George</first>
<last>Washington</last>
<year>1788</year>
</pres>
<pres>
<order>2</order>
<first>John</first>
<last>Adams</last>
<year>1796</year>
</pres>
<pres>
<order>3</order>
<first>Thomas</first>
<last>Jefferson</last>
<year>1800</year>
</pres>
</data>
 

Rainer Weikusat

ccc31807 said:
> As I said, it depends on the nature of your input. XML handles
> 'ragged' data as well as the kind of normalized data we would expect
> to use for an RDBMS. If you aren't sure of the format of your data,
> you obviously have to validate it somehow. Part of this might be
> removing whitespace at the beginning and ends of lines. Sometimes it
> might be removing newlines from several lines until you match some
> kind of closing tag.

The point I was trying to make is that the kind of input your (example) code
can deal with needs to follow the rules of a grammar which is a proper
subset of the XML grammar.
 

ccc31807

> The point I was trying to make is that the kind of input your (example) code
> can deal with needs to follow the rules of a grammar which is a proper
> subset of the XML grammar.

Yes, I understood your point. We all have to deal with messy data, and
faulty input will kill an application with no hope of recovery if you
don't deal with the possibility of corrupted data.

That said, if you are confident of the format of your input (as you
might have with an input file generated from a database) it might be
quicker and easier to hand roll your own.

If you have XML, you can use a SAX parser to process your input
element by element, and I assume that it would handle your whitespace
example without a problem.

I don't deal with XML much, and I really appreciate the posts from
others that illustrate scripts with XML::Reader and the like. I didn't
have it but installed it yesterday, and have spent several hours
piddling with it.

CC.
 
