Get XML content using XML::Twig

Discussion in 'Perl Misc' started by alwaysonnet, Apr 21, 2010.

  1. alwaysonnet

    alwaysonnet Guest

    Hello all,
    I'm trying to parse XML with the XML::Twig module, because my XML
    could be too large to handle with XML::Simple. Please help me work out
    how to print the values based on the following:

    - get the values of Sender and Receiver
    - get the FileType. In this case the possible values are
      InitTAP, FatalRAP and ReTxTAP

    Here is the XML content:
    <CODE>
    <?xml version="1.0" encoding="UTF-8"?>
    <Data>
    <ConnectionList>
    <Connection>
    <Sender>BRADD</Sender>
    <Receiver>SHANE</Receiver>
    <FileItemList>
    <FileItem>
    <FileID>378910</FileID>
    <Tmstp>2009-01-16T16:59:07+01:00</Tmstp>
    <FileType>
    <InitTAP>
    <TAPSeqNo>00083</TAPSeqNo>
    <NotifFileInd>false</NotifFileInd>
    <ChargeInfo>
    <TAPTxCutoffTmstp>2009-01-16T09:43:26+02:00</TAPTxCutoffTmstp>
    <TAPAvailTmstp>2009-01-16T16:59:07+01:00</TAPAvailTmstp>
    <TAPCurrency>XDR</TAPCurrency>
    <TotalNoOfCalls>39</TotalNoOfCalls>
    <TotalNetCharge>11.470</TotalNetCharge>
    <TotalTax>0.000</TotalTax>
    </ChargeInfo>
    </InitTAP>
    </FileType>
    </FileItem>
    <FileItem>
    <FileID>380582</FileID>
    <Tmstp>2009-01-20T18:00:00+01:00</Tmstp>
    <FileType>
    <ReTxTAP>
    <TAPSeqNo>00083</TAPSeqNo>
    <NotifFileInd>false</NotifFileInd>
    <RefRAPSeqNo>00044</RefRAPSeqNo>
    <RefRAPID>380573</RefRAPID>
    <ChargeInfo>
    <TAPTxCutoffTmstp>2009-01-16T09:43:26+02:00</TAPTxCutoffTmstp>
    <TAPAvailTmstp>2009-01-20T18:00:00+01:00</TAPAvailTmstp>
    <TAPCurrency>XDR</TAPCurrency>
    <TotalNoOfCalls>39</TotalNoOfCalls>
    <TotalNetCharge>11.470</TotalNetCharge>
    <TotalTax>0.000</TotalTax>
    </ChargeInfo>
    </ReTxTAP>
    </FileType>
    </FileItem>
    <FileItem>
    <FileID>380573</FileID>
    <Tmstp>2009-01-16T20:34:45+01:00</Tmstp>
    <FileType>
    <FatalRAP>
    <RAPSeqNo>00044</RAPSeqNo>
    <RAPStatus>Exchanged</RAPStatus>
    <RefTAPSeqNo>00083</RefTAPSeqNo>
    <RefTAPID>378910</RefTAPID>
    <RAPCreatTmstp>2009-01-16T20:21:30+01:00</RAPCreatTmstp>
    <RAPAvailTmstp>2009-01-16T20:21:30+01:00</RAPAvailTmstp>
    <ChargeInfo>
    <TAPTxCutoffTmstp>2009-01-16T09:43:26+02:00</TAPTxCutoffTmstp>
    <TAPAvailTmstp>2009-01-16T16:59:07+01:00</TAPAvailTmstp>
    <TAPCurrency>XDR</TAPCurrency>
    <TotalNoOfCalls>-39</TotalNoOfCalls>
    <TotalNetCharge>-11.470</TotalNetCharge>
    <TotalTax>0.000</TotalTax>
    </ChargeInfo>
    </FatalRAP>
    </FileType>
    </FileItem>
    </FileItemList>
    </Connection>
    </ConnectionList>
    </Data>
    </CODE>
     
    alwaysonnet, Apr 21, 2010
    #1

  2. alwaysonnet

    John Bokma Guest

    alwaysonnet <> writes:

    > Hello all,
    > I'm trying to parse the XML using XML::Twig Module as my XML could be
    > very large to handle using XML::Simple. Please help me out of how to
    > print the values based on the following...
    > <B>get the values of Sender, Receiver</B>
    > <B>get the FileType. In this case possible values are
    > InitTAP,FatalRAP,ReTxTAP</B>


    For something this simple I would probably (based on what I just
    read) use XML::SAX or even XML::Parser. For the latter,
    http://johnbokma.com/perl/ has some simple examples under "XML
    Processing using Perl".
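    As a rough illustration of the event-driven style suggested here, the following is a minimal sketch using XML::Parser's Start/Char/End handlers. The embedded XML is a shortened, hypothetical sample in the shape of the data from the first post, and the variable names (@stack, %conn, @rows) are illustrative assumptions, not anything taken from the linked examples.

```perl
use strict;
use warnings;
use XML::Parser;

# Shortened, hypothetical sample shaped like the XML in the first post.
my $xml = <<'XML';
<Data><ConnectionList><Connection>
<Sender>BRADD</Sender><Receiver>SHANE</Receiver>
<FileItemList>
<FileItem><FileType><InitTAP><TAPSeqNo>00083</TAPSeqNo></InitTAP></FileType></FileItem>
<FileItem><FileType><FatalRAP><RAPSeqNo>00044</RAPSeqNo></FatalRAP></FileType></FileItem>
</FileItemList>
</Connection></ConnectionList></Data>
XML

my (@stack, %conn, @rows);   # open-element stack, current Sender/Receiver, results
my $text = '';               # character data collected for the current element

my $parser = XML::Parser->new(Handlers => {
    Start => sub {
        my ($expat, $tag) = @_;
        # A direct child of <FileType> names the file type.
        push @rows, [ @conn{qw(Sender Receiver)}, $tag ]
            if @stack && $stack[-1] eq 'FileType';
        push @stack, $tag;
        $text = '';
    },
    # Text may be delivered in several chunks, so append rather than assign.
    Char => sub { $text .= $_[1] },
    End => sub {
        my ($expat, $tag) = @_;
        $conn{$tag} = $text if $tag eq 'Sender' || $tag eq 'Receiver';
        pop @stack;
        $text = '';
    },
});
$parser->parse($xml);

print join(', ', @$_), "\n" for @rows;
```

    Only the element stack and the current Sender/Receiver are kept in memory, which is why this style is usually suggested for large inputs. For a real file, `$parser->parsefile($path)` would replace the `parse` call.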

    --
    John Bokma j3b

    Hacking & Hiking in Mexico - http://johnbokma.com/
    http://castleamber.com/ - Perl & Python Development
     
    John Bokma, Apr 21, 2010
    #2

  3. alwaysonnet

    Klaus Guest

    On 21 avr, 14:35, alwaysonnet <> wrote:
    > Hello all,
    > I'm trying to parse the XML using XML::Twig Module as my XML could be
    > very large to handle using XML::Simple. Please help me out of how to
    > print the values based on the following...
    >  <B>get the values of Sender, Receiver</B>
    >  <B>get the FileType. In this case possible values are
    > InitTAP,FatalRAP,ReTxTAP</B>
    >
    > <CODE>
    >  get the values of Sender, Receiver
    >  get the FileType. In this case possible values are
    > InitTAP,FatalRAP,ReTxTAP
    > </CODE>


    What Tad McClellan and John Bokma suggested should be your first path
    of investigation.

    However, let me bring in a shameless plug:

    You could also use my module XML::Reader
    http://search.cpan.org/~keichner/XML-Reader-0.32/lib/XML/Reader.pm

    This module is specifically designed to handle very big XML files: it
    only uses the memory needed to hold one XML element at a time
    (plus a small amount of additional memory for buffering, which is
    independent of the size of the XML file).

    Here is a sample program:

    use strict;
    use warnings;
    use XML::Reader;

    my $rdr = XML::Reader->newhd(\*DATA, {filter => 5},
        { root   => '/Data/ConnectionList/Connection/Sender',
          branch => [ '/' ] },
        { root   => '/Data/ConnectionList/Connection/Receiver',
          branch => [ '/' ] },
        { root   => '/Data/ConnectionList/Connection/FileItemList/FileItem/FileType',
          branch => [
              '/InitTAP/TAPSeqNo',
              '/ReTxTAP/TAPSeqNo',
              '/FatalRAP/RAPSeqNo',
          ] },
    );

    my ($sender, $receiver);

    while ($rdr->iterate) {
        if    ($rdr->rx == 0) { $sender   = $rdr->rvalue->[0]; }
        elsif ($rdr->rx == 1) { $receiver = $rdr->rvalue->[0]; }
        else {
            my ($InitTAP, $ReTxTAP, $FatalRAP) = @{$rdr->rvalue};
            my ($type, $seqno) =
                  defined $InitTAP  ? ('InitTAP',  $InitTAP)
                : defined $ReTxTAP  ? ('ReTxTAP',  $ReTxTAP)
                : defined $FatalRAP ? ('FatalRAP', $FatalRAP)
                :                     ('???', '???');

            printf "Sender: %-5s, Receiver: %-5s, Type: %-8s, Seqno: %s\n",
                $sender, $receiver, $type, $seqno;
        }
    }

    __DATA__
    <?xml version="1.0" encoding="UTF-8"?>
    <Data>
    <ConnectionList>
    <Connection>
    <Sender>BRADD</Sender>
    <Receiver>SHANE</Receiver>
    <FileItemList>
    <FileItem>
    <FileID>378910</FileID>
    <Tmstp>2009-01-16T16:59:07+01:00</Tmstp>
    <FileType>
    <InitTAP>
    <TAPSeqNo>00083</TAPSeqNo>
    <NotifFileInd>false</NotifFileInd>
    <ChargeInfo>
    <TAPTxCutoffTmstp>2009-01-16T09:43:26+02:00</TAPTxCutoffTmstp>
    <TAPAvailTmstp>2009-01-16T16:59:07+01:00</TAPAvailTmstp>
    <TAPCurrency>XDR</TAPCurrency>
    <TotalNoOfCalls>39</TotalNoOfCalls>
    <TotalNetCharge>11.470</TotalNetCharge>
    <TotalTax>0.000</TotalTax>
    </ChargeInfo>
    </InitTAP>
    </FileType>
    </FileItem>
    <FileItem>
    <FileID>380582</FileID>
    <Tmstp>2009-01-20T18:00:00+01:00</Tmstp>
    <FileType>
    <ReTxTAP>
    <TAPSeqNo>00083</TAPSeqNo>
    <NotifFileInd>false</NotifFileInd>
    <RefRAPSeqNo>00044</RefRAPSeqNo>
    <RefRAPID>380573</RefRAPID>
    <ChargeInfo>
    <TAPTxCutoffTmstp>2009-01-16T09:43:26+02:00</TAPTxCutoffTmstp>
    <TAPAvailTmstp>2009-01-20T18:00:00+01:00</TAPAvailTmstp>
    <TAPCurrency>XDR</TAPCurrency>
    <TotalNoOfCalls>39</TotalNoOfCalls>
    <TotalNetCharge>11.470</TotalNetCharge>
    <TotalTax>0.000</TotalTax>
    </ChargeInfo>
    </ReTxTAP>
    </FileType>
    </FileItem>
    <FileItem>
    <FileID>380573</FileID>
    <Tmstp>2009-01-16T20:34:45+01:00</Tmstp>
    <FileType>
    <FatalRAP>
    <RAPSeqNo>00044</RAPSeqNo>
    <RAPStatus>Exchanged</RAPStatus>
    <RefTAPSeqNo>00083</RefTAPSeqNo>
    <RefTAPID>378910</RefTAPID>
    <RAPCreatTmstp>2009-01-16T20:21:30+01:00</RAPCreatTmstp>
    <RAPAvailTmstp>2009-01-16T20:21:30+01:00</RAPAvailTmstp>
    <ChargeInfo>
    <TAPTxCutoffTmstp>2009-01-16T09:43:26+02:00</TAPTxCutoffTmstp>
    <TAPAvailTmstp>2009-01-16T16:59:07+01:00</TAPAvailTmstp>
    <TAPCurrency>XDR</TAPCurrency>
    <TotalNoOfCalls>-39</TotalNoOfCalls>
    <TotalNetCharge>-11.470</TotalNetCharge>
    <TotalTax>0.000</TotalTax>
    </ChargeInfo>
    </FatalRAP>
    </FileType>
    </FileItem>
    </FileItemList>
    </Connection>
    </ConnectionList>
    </Data>

    =======
    Here is the output:

    Sender: BRADD, Receiver: SHANE, Type: InitTAP , Seqno: 00083
    Sender: BRADD, Receiver: SHANE, Type: ReTxTAP , Seqno: 00083
    Sender: BRADD, Receiver: SHANE, Type: FatalRAP, Seqno: 00044
     
    Klaus, Apr 21, 2010
    #3
  4. alwaysonnet

    Guest

    On Wed, 21 Apr 2010 10:06:14 -0700 (PDT), Klaus <> wrote:

    >On 21 avr, 14:35, alwaysonnet <> wrote:
    >> Hello all,
    >> I'm trying to parse the XML using XML::Twig Module as my XML could be
    >> very large to handle using XML::Simple. Please help me out of how to
    >> print the values based on the following...
    >>  <B>get the values of Sender, Receiver</B>
    >>  <B>get the FileType. In this case possible values are
    >> InitTAP,FatalRAP,ReTxTAP</B>
    >>
    >> <CODE>
    >>  get the values of Sender, Receiver
    >>  get the FileType. In this case possible values are
    >> InitTAP,FatalRAP,ReTxTAP
    >> </CODE>

    >
    >What Tad McClellan and John Bokma suggested should be your first path
    >of investigation.
    >
    >However, let me bring in a shameless plug:
    >
    >You could also use my module XML::Reader
    >http://search.cpan.org/~keichner/XML-Reader-0.32/lib/XML/Reader.pm

    Indeed shameless.
    >
    >This module is specifically designed to handle very big XML files, it
    >only uses the memory it needs to have one XML element at a time in
    >memory (plus a small additional memory for buffering, which is
    >independent of the size of the XML file)

    Is memory at a premium?
    >
    >Here is a sample program:
    >
    >use strict;
    >use warnings;
    >use XML::Reader;
    >
    >my $rdr = XML::Reader->newhd(\*DATA, {filter => 5},
    > { root => '/Data/ConnectionList/Connection/Sender', branch =>
    >[ '/' ] },
    > { root => '/Data/ConnectionList/Connection/Receiver', branch =>
    >[ '/' ] },
    > { root => '/Data/ConnectionList/Connection/FileItemList/FileItem/
    >FileType', branch => [
    > '/InitTAP/TAPSeqNo',
    > '/ReTxTAP/TAPSeqNo',
    > '/FatalRAP/RAPSeqNo',

    ^^^^^^^^^^^^
    What do these have to do with it?
    > ] },
    > );
    >
    >my ($sender, $receiver);
    >
    >while ($rdr->iterate) {
    > if ($rdr->rx == 0) { $sender = $rdr->rvalue->[0]; }
    > elsif ($rdr->rx == 1) { $receiver = $rdr->rvalue->[0]; }
    > else {
    > my ($InitTAP, $ReTxTAP, $FatalRAP) = @{$rdr->rvalue};

    ^^^^^^^^^^^^^^^^^^^^^^^^^^^
    Again, what do these have to do with it?
    [snip]
    >=======
    >Here is the output:
    >
    >Sender: BRADD, Receiver: SHANE, Type: InitTAP , Seqno: 00083
    >Sender: BRADD, Receiver: SHANE, Type: ReTxTAP , Seqno: 00083
    >Sender: BRADD, Receiver: SHANE, Type: FatalRAP, Seqno: 00044


    That's nice. Let's say he generally said "in this case it's:"
    InitTAP ReTxTAP FatalRAP
    Why? Because it's the file type.
    Maybe he wants all file types of the senders/receivers.
    But it's hard to know what the OP wants, isn't it?

    -sln
     
    , Apr 21, 2010
    #4
  5. alwaysonnet

    Klaus Guest

    On 21 avr, 20:07, wrote:
    > On Wed, 21 Apr 2010 10:06:14 -0700 (PDT), Klaus <> wrote:
    > >On 21 avr, 14:35, alwaysonnet <> wrote:
    > >> Hello all,
    > >> I'm trying to parse the XML using XML::Twig Module as my XML could be
    > >> very large to handle using XML::Simple. Please help me out of how to
    > >> print the values based on the following...
    > >>  <B>get the values of Sender, Receiver</B>
    > >>  <B>get the FileType. In this case possible values are
    > >> InitTAP,FatalRAP,ReTxTAP</B>


    > Thats nice. Lets say he generally said "in this case its:"
    > InitTAP  ReTxTAP  FatalRAP
    > Why? Because its the file type.
    > Maybe he wants all file types of the sender/reciever's.


    in that case you use XML::Reader->newhd(... {filter => 2});

    use strict;
    use warnings;
    use XML::Reader;

    my $rdr = XML::Reader->newhd(\*DATA, {filter => 2});

    my ($sender, $receiver);

    while ($rdr->iterate) {
        if ($rdr->path eq '/Data/ConnectionList/Connection/Sender') {
            $sender = $rdr->value;
        }
        elsif ($rdr->path eq '/Data/ConnectionList/Connection/Receiver') {
            $receiver = $rdr->value;
        }
        elsif ($rdr->is_start
           and $rdr->path =~ m{\A /Data/ConnectionList/Connection/FileItemList/FileItem/FileType/ (\w+) \z}xms) {
            printf "Sender: %-5s, Receiver: %-5s, Type: %s\n",
                $sender, $receiver, $1;
        }
    }

    Here is the output

    Sender: BRADD, Receiver: SHANE, Type: InitTAP
    Sender: BRADD, Receiver: SHANE, Type: ReTxTAP
    Sender: BRADD, Receiver: SHANE, Type: FatalRAP
     
    Klaus, Apr 21, 2010
    #5
  6. alwaysonnet

    Guest

    On Wed, 21 Apr 2010 11:48:59 -0700 (PDT), Klaus <> wrote:

    >On 21 avr, 20:07, wrote:
    >> On Wed, 21 Apr 2010 10:06:14 -0700 (PDT), Klaus <> wrote:
    >> >On 21 avr, 14:35, alwaysonnet <> wrote:
    >> >> Hello all,
    >> >> I'm trying to parse the XML using XML::Twig Module as my XML could be
    >> >> very large to handle using XML::Simple. Please help me out of how to
    >> >> print the values based on the following...
    >> >>  <B>get the values of Sender, Receiver</B>
    >> >>  <B>get the FileType. In this case possible values are
    >> >> InitTAP,FatalRAP,ReTxTAP</B>

    >
    >> Thats nice. Lets say he generally said "in this case its:"
    >> InitTAP  ReTxTAP  FatalRAP
    >> Why? Because its the file type.
    >> Maybe he wants all file types of the sender/reciever's.

    >
    >in that case you use XML::Reader->newhd(... {filter => 2});
    >
    >use strict;
    >use warnings;
    >use XML::Reader;
    >
    >my $rdr = XML::Reader->newhd(\*DATA, {filter => 2});
    >
    >my ($sender, $receiver);
    >
    >while ($rdr->iterate) {
    > if ($rdr->path eq '/Data/ConnectionList/Connection/Sender') {
    > $sender = $rdr->value;
    > }
    > elsif ($rdr->path eq '/Data/ConnectionList/Connection/Receiver') {
    > $receiver = $rdr->value;
    > }
    > elsif ($rdr->is_start
    > and $rdr->path =~ m{\A /Data/ConnectionList/Connection/
    >FileItemList/FileItem/FileType/ (\w+) \z}xms) {
    > printf "Sender: %-5s, Receiver: %-5s, Type: %s\n",
    > $sender, $receiver, $1;
    > }
    >}
    >
    >Here is the output
    >
    >Sender: BRADD, Receiver: SHANE, Type: InitTAP
    >Sender: BRADD, Receiver: SHANE, Type: ReTxTAP
    >Sender: BRADD, Receiver: SHANE, Type: FatalRAP


    This is pretty good. I assume it does attribute/value as well.
    It appears to be a lot of regex work the more unknown the
    elements become, but that's a tree stack.

    It would be good, though, to have a capture mechanism where
    XML capture can be triggered on/off by the user, later
    regurgitated to the user on demand, and given to an
    XML::Simple-style mechanism to turn it into filtered records.

    It wouldn't change the simple, low-memory stream parsing at all;
    the source would just be captured (appended) to a named buffer,
    switched on/off on demand.

    It's not as easy as it seems, though: CaptureON/OFF (bufname, before/after),
    nested captures, a single data pool. I think I've done this before.

    -sln
     
    , Apr 22, 2010
    #6
  7. alwaysonnet

    Klaus Guest

    On 22 avr, 02:31, wrote:
    > On Wed, 21 Apr 2010 11:48:59 -0700 (PDT), Klaus <> wrote:
    > >On 21 avr, 20:07, wrote:
    > >> On Wed, 21 Apr 2010 10:06:14 -0700 (PDT), Klaus <> wrote:
    > >> >On 21 avr, 14:35, alwaysonnet <> wrote:
    > >> >> Hello all,
    > >> >> I'm trying to parse the XML using XML::Twig Module as my XML could be
    > >> >> very large to handle using XML::Simple. Please help me out of how to
    > >> >> print the values based on the following...
    > >> >>  <B>get the values of Sender, Receiver</B>
    > >> >>  <B>get the FileType. In this case possible values are
    > >> >> InitTAP,FatalRAP,ReTxTAP</B>


    > This is pretty good. I assume it does attribute/value as well.


    Yes it does, just put an '@' symbol in the path, for example
    '/InitTAP/ChargeInfo/@attrib1'

    > It appears to be a lot of regex work, the more unknown the
    > elements become, but thats a tree stack.
    >
    > It would be good though to have a capture mechanism, where
    > xml capture can be triggered on/off by the user, later to
    > be regurgitated to the user (on demand), and given to an
    > xml::simple style mechanism to turn it into filtered records.


    For simple structures where you know exactly what you are looking for,
    you can use {filter => 5} like so

    use strict;
    use warnings;
    use XML::Reader;

    use Data::Dumper;

    my $rdr = XML::Reader->newhd(\*DATA, {filter => 5},
        { root   => '/Data/ConnectionList/Connection/FileItemList/FileItem/FileType',
          branch => [
              '/InitTAP/TAPSeqNo',
              '/ReTxTAP/TAPSeqNo',
              '/FatalRAP/RAPSeqNo',
              '/InitTAP/ChargeInfo/@attrib1',
              '/InitTAP/ChargeInfo/TAPCurrency',
              '/ReTxTAP/ChargeInfo/TAPCurrency',
              '/FatalRAP/ChargeInfo/TAPCurrency',
          ] },
    );

    while ($rdr->iterate) {
        print Dumper($rdr->rvalue), "\n";
    }

    > It wouldn't change the simple, low memmory stream parsing at all,
    > just the source would be captured (appended) on/off to a named buffer,
    > on demand.
    > Its not as easy as it seems though. CaptureON/OFF (bufname, before/after),
    > nested capture's, single data pool. I think I've done this before.


    For general capture into a buffer, you would use {filter => 3, using
    => '/Data/ConnectionList/Connection/FileItemList/FileItem/FileType'}

    use strict;
    use warnings;
    use XML::Reader;

    my $rdr = XML::Reader->newhd(\*DATA, {filter => 3,
        using => '/Data/ConnectionList/Connection/FileItemList/FileItem/FileType'});

    my $buffer = '';

    while ($rdr->iterate) {
        my $indentation = '  ' x ($rdr->level - 1);

        if ($rdr->path eq '/') {
            if ($rdr->is_start) {
                $buffer = '';
            }
            elsif ($rdr->is_end) {
                print "\n\n buffer ==>\n", $buffer, "\n\n";
            }
            next;
        }

        if ($rdr->is_start) {
            $buffer .= $indentation.'<'.$rdr->tag.
              join('', map{" $_='".$rdr->att_hash->{$_}."'"} sort keys %{$rdr->att_hash}).
              '>'."\n";
        }

        if ($rdr->type eq 'T' and $rdr->value ne '') {
            $buffer .= $indentation.'  '.$rdr->value."\n";
        }

        if ($rdr->is_end) {
            $buffer .= $indentation.'</'.$rdr->tag.'>'."\n";
        }
    }
     
    Klaus, Apr 22, 2010
    #7
  8. alwaysonnet

    alwaysonnet Guest

    On Apr 22, 12:39 pm, Klaus <> wrote:
    > On 22 avr, 02:31, wrote:
    >
    > > On Wed, 21 Apr 2010 11:48:59 -0700 (PDT), Klaus <> wrote:
    > > >On 21 avr, 20:07, wrote:
    > > >> On Wed, 21 Apr 2010 10:06:14 -0700 (PDT), Klaus <> wrote:
    > > >> >On 21 avr, 14:35, alwaysonnet <> wrote:
    > > >> >> Hello all,
    > > >> >> I'm trying to parse the XML using XML::Twig Module as my XML could be
    > > >> >> very large to handle using XML::Simple. Please help me out of how to
    > > >> >> print the values based on the following...
    > > >> >>  <B>get the values of Sender, Receiver</B>
    > > >> >>  <B>get the FileType. In this case possible values are
    > > >> >> InitTAP,FatalRAP,ReTxTAP</B>

    > > This is pretty good. I assume it does attribute/value as well.

    >
    > Yes it does, just put an '@' symbol in the path, for example
    > '/InitTAP/ChargeInfo/@attrib1'
    >
    > > It appears to be a lot of regex work, the more unknown the
    > > elements become, but thats a tree stack.

    >
    > > It would be good though to have a capture mechanism, where
    > > xml capture can be triggered on/off by the user, later to
    > > be regurgitated to the user (on demand), and given to an
    > > xml::simple style mechanism to turn it into filtered records.

    >
    > For simple structures where you know exactly what you are looking for,
    > you can use {filter => 5} like so
    >
    > use strict;
    > use warnings;
    > use XML::Reader;
    >
    > use Data::Dumper;
    >
    > my $rdr = XML::Reader->newhd(\*DATA, {filter => 5},
    >     { root => '/Data/ConnectionList/Connection/FileItemList/FileItem/
    > FileType', branch => [
    >       '/InitTAP/TAPSeqNo',
    >       '/ReTxTAP/TAPSeqNo',
    >       '/FatalRAP/RAPSeqNo',
    >       '/InitTAP/ChargeInfo/@attrib1',
    >       '/InitTAP/ChargeInfo/TAPCurrency',
    >       '/ReTxTAP/ChargeInfo/TAPCurrency',
    >       '/FatalRAP/ChargeInfo/TAPCurrency',
    >     ] },
    >   );
    >
    > while ($rdr->iterate) {
    >     print Dumper($rdr->rvalue), "\n";
    >
    > }
    > > It wouldn't change the simple, low memmory stream parsing at all,
    > > just the source would be captured (appended) on/off to a named buffer,
    > > on demand.
    > > Its not as easy as it seems though. CaptureON/OFF (bufname, before/after),
    > > nested capture's, single data pool. I think I've done this before.

    >
    > For general capture into a buffer, you would use {filter => 3, using
    > => '/Data/ConnectionList/Connection/FileItemList/FileItem/FileType'}
    >
    > use strict;
    > use warnings;
    > use XML::Reader;
    >
    > my $rdr = XML::Reader->newhd(\*DATA, {filter => 3,
    >     using => '/Data/ConnectionList/Connection/FileItemList/FileItem/
    > FileType'});
    >
    > my $buffer = '';
    >
    > while ($rdr->iterate) {
    >     my $indentation = '  ' x ($rdr->level - 1);
    >
    >     if ($rdr->path eq '/') {
    >         if ($rdr->is_start) {
    >             $buffer = '';
    >         }
    >         elsif ($rdr->is_end) {
    >             print "\n\n buffer ==>\n", $buffer, "\n\n";
    >         }
    >         next;
    >     }
    >
    >     if ($rdr->is_start) {
    >         $buffer .= $indentation.'<'.$rdr->tag.
    >           join('', map{" $_='".$rdr->att_hash->{$_}."'"} sort keys %{$rdr->att_hash}).
    >           '>'."\n";
    >     }
    >
    >     if ($rdr->type eq 'T' and $rdr->value ne '') {
    >         $buffer .= $indentation.'  '.$rdr->value."\n";
    >     }
    >
    >     if ($rdr->is_end) {
    >         $buffer .= $indentation.'</'.$rdr->tag.'>'."\n";
    >     }
    >
    > }
    >
    >


    My intention is to ~

    - Get each sender and receiver
    - Get the filetype ( could be InitTAP, FatalRAP etc )
    - For each of filetype get the TAPSeqNo, NoofCalls etc....

    Basically I want all the information in place for processing the
    data....

    Also, apart from XML::Twig, is there any module which can handle
    larger XML files..

    any help or suggestions are appreciated.
     
    alwaysonnet, Apr 22, 2010
    #8
  9. alwaysonnet

    Klaus Guest

    On 21 avr, 14:35, alwaysonnet <> wrote:
    > Hello all,
    > I'm trying to parse the XML using XML::Twig Module as my XML could be
    > very large to handle using XML::Simple.


    Klaus <> wrote:
    > However, let me bring in a shameless plug:
    > You could also use my module XML::Reader
    > http://search.cpan.org/~keichner/XML-Reader-0.32/lib/XML/Reader.pm


    wrote:
    > > Indeed shameless.
    > >
    > > [...]
    > >
    > > It would be good though to have a capture mechanism, where
    > > xml capture can be triggered on/off by the user, later to
    > > be regurgitated to the user (on demand), and given to an
    > > xml::simple style mechanism to turn it into filtered records.


    Here is an example of how to use XML::Reader to capture sub-trees from
    a (potentially very big) XML file into a buffer and pass that buffer
    to XML::Simple:

    use strict;
    use warnings;
    use XML::Reader;
    use XML::Simple;
    use Data::Dumper;

    my $rdr = XML::Reader->newhd(\*DATA, {filter => 3,
        using => '/Data/ConnectionList/Connection/FileItemList/FileItem/FileType'});

    my $buffer = '';

    while ($rdr->iterate) {

        if ($rdr->path eq '/') {
            if ($rdr->is_start) {
                $buffer = qq{<?xml version="1.0" encoding="UTF-8"?><FileType>};
            }
            if ($rdr->is_end) {
                $buffer .= qq{</FileType>};

                my $ref = XMLin($buffer);
                print Dumper($ref), "\n\n";
            }
            next;
        }

        if ($rdr->is_start) {
            $buffer .= '<'.$rdr->tag.
              join('', map{" $_='".$rdr->att_hash->{$_}."'"} sort keys %{$rdr->att_hash}).
              '>';
        }

        if ($rdr->type eq 'T' and $rdr->value ne '') {
            $buffer .= $rdr->value;
        }

        if ($rdr->is_end) {
            $buffer .= '</'.$rdr->tag.'>';
        }
    }
     
    Klaus, Apr 22, 2010
    #9
  10. alwaysonnet

    Klaus Guest

    On 21 avr, 14:35, alwaysonnet <> wrote:
    > Hello all,
    > I'm trying to parse the XML using XML::Twig Module as my XML could be
    > very large to handle using XML::Simple.


    On Wed, 21 Apr 2010 10:06:14, Klaus <> wrote:
    > What Tad McClellan and John Bokma suggested should be your first
    > path of investigation.
    > However, let me bring in a shameless plug:
    > You could also use my module XML::Reader
    > http://search.cpan.org/~keichner/XML-Reader-0.32/lib/XML/Reader.pm


    On 21 avr, 20:07, wrote:
    > Indeed shameless.


    On 22 avr, 10:24, alwaysonnet <> wrote:
    > My intention is to ~
    > - Get each sender and receiver
    > - Get the filetype ( could be InitTAP, FatalRAP etc )
    > - For each of filetype get the TAPSeqNo, NoofCalls etc....
    >
    > Basically I want all the information in place for processing the
    > data....
    >
    > Also, apart from XML::Twig, is there any module which can handle
    > larger XML files..


    As I said before, take the advice of Tad McClellan and John Bokma
    first.

    If, for whatever reason, you can't follow their advice, (and, for
    whatever reason, you can't use XML::Twig either) there is always my
    "shameless plug" XML::Reader:

    There are, in my opinion, two scenarios:

    Scenario 1:
    You already know how to parse your XML with XML::Simple, but the XML
    file is too big to fit entirely into memory.
    In that case, I suggest you follow my example (with XML::Reader) that
    I gave in this thread today (where I said: "...Here is an example of
    how to use XML::Reader to capture sub-trees...")
    see http://groups.google.com/group/comp.lang.perl.misc/msg/4bb3a769d96c1b2e

    Scenario 2:
    You know the general rules of your XML parsing, but you don't know
    which XML module to use (and you can't follow the advice from Tad
    McClellan and from John Bokma).
    In that case I suggest you follow my example (with XML::Reader) that I
    gave in this thread yesterday (where I said: "...use
    XML::Reader->newhd(... {filter => 2})...")

    see http://groups.google.com/group/comp.lang.perl.misc/msg/762534f342f939e6
     
    Klaus, Apr 22, 2010
    #10
  11. On 22/04/2010 09:24, alwaysonnet wrote:
    > On Apr 22, 12:39 pm, Klaus<> wrote:
    >>
    >> [XML::Reader examples and discussion omitted]
    >>

    >
    > My intention is to ~
    >
    > - Get each sender and receiver
    > - Get the filetype ( could be InitTAP, FatalRAP etc )
    > - For each of filetype get the TAPSeqNo, NoofCalls etc....
    >
    > Basically I want all the information in place for processing the
    > data....
    >
    > Also, apart from XML::Twig, is there any module which can handle
    > larger XML files..


    Well there's the XML::Reader that Klaus has thoughtfully spent time
    explaining and providing examples for. You didn't say whether there is
    some reason you'd not use that.

    >
    > any help or suggestions are appreciated.
    >


    For very large XML files, the obvious approach to consider is any SAX
    parser. Perl SAX modules I've used before include XML::Parser and XML::SAX.

    Have you Googled for "Perl SAX" and searched CPAN for SAX?

    --
    RGB
     
    RedGrittyBrick, Apr 22, 2010
    #11
  12. On 22/04/2010 10:34, RedGrittyBrick wrote:
    > On 22/04/2010 09:24, alwaysonnet wrote:
    >> On Apr 22, 12:39 pm, Klaus<> wrote:
    >>>
    >>> [XML::Reader examples and discussion omitted]
    >>>

    >>
    >> My intention is to ~
    >>
    >> - Get each sender and receiver
    >> - Get the filetype ( could be InitTAP, FatalRAP etc )
    >> - For each of filetype get the TAPSeqNo, NoofCalls etc....
    >>
    >> Basically I want all the information in place for processing the
    >> data....
    >>
    >> Also, apart from XML::Twig, is there any module which can handle
    >> larger XML files..

    >
    > Well there's the XML::Reader that Klaus has thoughtfully spent time
    > explaining and providing examples for. You didn't say whether there is
    > some reason you'd not use that.
    >
    >>
    >> any help or suggestions are appreciated.
    >>

    >
    > For very large XML files, the obvious approach to consider is any SAX
    > parser. Perl SAX modules I've used before include XML::Parser and XML::SAX.
    >
    > Have you Googled for "Perl SAX" and searched CPAN for SAX?
    >


    I recommend you read this
    http://xmltwig.com/article/ways_to_rome/ways_to_rome.html
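
    Since the original question asked about XML::Twig, here is a minimal sketch of how the stated goals (Sender, Receiver, file type, sequence number, number of calls) might be approached with twig handlers. The embedded XML is a shortened, hypothetical sample in the shape of the posted data, and the record layout (%conn, @records and their keys) is an assumption made for illustration, not something taken from the linked article.

```perl
use strict;
use warnings;
use XML::Twig;

# Shortened, hypothetical sample shaped like the XML in the first post.
my $xml = <<'XML';
<Data><ConnectionList><Connection>
<Sender>BRADD</Sender><Receiver>SHANE</Receiver>
<FileItemList>
<FileItem><FileID>378910</FileID><FileType><InitTAP>
  <TAPSeqNo>00083</TAPSeqNo>
  <ChargeInfo><TotalNoOfCalls>39</TotalNoOfCalls></ChargeInfo>
</InitTAP></FileType></FileItem>
<FileItem><FileID>380573</FileID><FileType><FatalRAP>
  <RAPSeqNo>00044</RAPSeqNo>
  <ChargeInfo><TotalNoOfCalls>-39</TotalNoOfCalls></ChargeInfo>
</FatalRAP></FileType></FileItem>
</FileItemList>
</Connection></ConnectionList></Data>
XML

my (%conn, @records);   # current Sender/Receiver and the collected records

my $twig = XML::Twig->new(
    twig_handlers => {
        # Handlers fire when the element's closing tag has been parsed.
        'Connection/Sender'   => sub { $conn{Sender}   = $_[1]->text },
        'Connection/Receiver' => sub { $conn{Receiver} = $_[1]->text },
        'FileItem' => sub {
            my ($t, $item) = @_;
            # The single element child of <FileType> names the file type.
            my $kind   = $item->first_child('FileType')->first_child('#ELT');
            my $charge = $kind->first_child('ChargeInfo');
            push @records, {
                %conn,
                FileID => $item->first_child_text('FileID'),
                Type   => $kind->tag,
                SeqNo  => $kind->first_child_text('TAPSeqNo')
                       || $kind->first_child_text('RAPSeqNo'),
                Calls  => $charge ? $charge->first_child_text('TotalNoOfCalls') : undef,
            };
            $t->purge;   # release everything parsed so far to keep memory flat
        },
    },
);
$twig->parse($xml);

printf "%s -> %s: %s seq=%s calls=%s\n",
    @{$_}{qw(Sender Receiver Type SeqNo Calls)} for @records;
```

    Calling `purge` once per FileItem is what keeps memory use roughly constant for large files. For input on disk, `$twig->parsefile($path)` would replace the `parse` call.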




    --
    RGB
     
    RedGrittyBrick, Apr 22, 2010
    #12
  13. alwaysonnet

    alwaysonnet Guest

    On Apr 22, 2:34 pm, RedGrittyBrick <>
    wrote:
    > On 22/04/2010 09:24, alwaysonnet wrote:
    >
    >
    >
    > > On Apr 22, 12:39 pm, Klaus<>  wrote:

    >
    > >> [XML::Reader examples and discussion omitted]

    >
    > > My intention is to ~

    >
    > > - Get each sender and receiver
    > > - Get the filetype ( could be InitTAP, FatalRAP etc )
    > > - For each of filetype get the TAPSeqNo, NoofCalls etc....

    >
    > > Basically I want all the information in place for processing the
    > > data....

    >
    > > Also, apart from XML::Twig, is there any module which can handle
    > > larger XML files..

    >
    > Well there's the XML::Reader that Klaus has thoughtfully spent time
    > explaining and providing examples for. You didn't say whether there is
    > some reason you'd not use that.
    >
    >
    >
    > > any help or suggestions are appreciated.

    >
    > For very large XML files, the obvious approach to consider is any SAX
    > parser. Perl SAX modules I've used before include XML::Parser and XML::SAX.
    >
    > Have you Googled for "Perl SAX" and searched CPAN for SAX?
    >
    > --
    > RGB


    I do find XML::Reader quite helpful for me.

    I'm comparing XML::Simple and XML::Reader in my existing code on a
    40 MB XML file to find out which fits my bill.
     
    alwaysonnet, Apr 22, 2010
    #13
  14. alwaysonnet

    alwaysonnet Guest

    I'll post my observations on the comparison times between XML::Simple
    and XML::Reader in my next post.

    Also, is it a good idea to use the Storable module to store my data
    structure on disk, or should I use it directly? I know this question
    is off-topic here, but I'm trying to understand the possible ways of
    parsing the XML file.

    Code I've tried so far:

    use strict;
    use warnings;
    use XML::Simple;
    use Storable;
    use Data::Dumper;

    my $XML_FILE = "sample.xml";

    # XMLin returns a hash reference for this document
    my $mldata = XMLin($XML_FILE);

    # store() expects a reference, so pass the hashref itself rather than \$mldata
    store $mldata, 'file';
    my $hashref = retrieve('file');

    #print Dumper($hashref);
     
    alwaysonnet, Apr 22, 2010
    #14
  15. alwaysonnet

    Klaus Guest

    On 22 avr, 10:29, Klaus <> wrote:
    > On 21 avr, 14:35, alwaysonnet <> wrote:
    > > Hello all,
    > > I'm trying to parse the XML using XML::Twig Module as my XML could be
    > > very large to handle using XML::Simple.

    > Klaus <> wrote:
    > > However, let me bring in a shameless plug:
    > > You could also use my module XML::Reader
    > >http://search.cpan.org/~keichner/XML-Reader-0.32/lib/XML/Reader.pm

    > wrote:
    > > > Indeed shameless.

    >
    > > > [...]

    >
    > > > It would be good though to have a capture mechanism, where
    > > > xml capture can be triggered on/off by the user, later to
    > > > be regurgitated to the user (on demand), and given to an
    > > > xml::simple style mechanism to turn it into filtered records.

    >
    > use XML::Reader;
    > my $rdr = XML::Reader->newhd(\*DATA, {filter => 3,
    >     using => '/Data/ConnectionList/Connection/FileItemList/FileItem/FileType'});


    I have now released XML::Reader 0.34
    http://search.cpan.org/~keichner/XML-Reader-0.34/lib/XML/Reader.pm

    This new version makes it possible to write the same program (the
    program that uses XML::Reader to capture sub-trees from a potentially
    very big XML file into a buffer and pass that buffer to XML::Simple)
    even more concisely:

    use strict;
    use warnings;
    use XML::Reader 0.34;

    use XML::Simple;
    use Data::Dumper;

    my $rdr = XML::Reader->newhd(\*DATA, {filter => 5},
        { root => '/Data/ConnectionList/Connection/FileItemList/FileItem/FileType',
          branch => '*' },
    );

    while ($rdr->iterate) {
        my $buffer = $rdr->rval;
        my $ref = XMLin($buffer);
        print Dumper($ref), "\n\n";
    }
     
    Klaus, Apr 26, 2010
    #15
  16. alwaysonnet

    Guest

    On Mon, 26 Apr 2010 13:13:24 -0700 (PDT), Klaus <> wrote:

    >On 22 avr, 10:29, Klaus <> wrote:
    >> On 21 avr, 14:35, alwaysonnet <> wrote:
    >> > Hello all,
    >> > I'm trying to parse the XML using XML::Twig Module as my XML could be
    >> > very large to handle using XML::Simple.

    >> Klaus <> wrote:
    >> > However, let me bring in a shameless plug:
    >> > You could also use my module XML::Reader
    >> >http://search.cpan.org/~keichner/XML-Reader-0.32/lib/XML/Reader.pm

    >> wrote:
    >> > > Indeed shameless.

    >>
    >> > > [...]

    >>
    >> > > It would be good though to have a capture mechanism, where
    >> > > xml capture can be triggered on/off by the user, later to
    >> > > be regurgitated to the user (on demand), and given to an
    >> > > xml::simple style mechanism to turn it into filtered records.

    >>
    >> use XML::Reader;
    >> my $rdr = XML::Reader->newhd(\*DATA, {filter => 3,
    >>     using => '/Data/ConnectionList/Connection/FileItemList/FileItem/FileType'});

    >
    >I have now released XML::Reader 0.34
    >http://search.cpan.org/~keichner/XML-Reader-0.34/lib/XML/Reader.pm
    >
    >This new version allows to write the same program (...the program that
    >uses XML::Reader to capture sub-trees from a potentially very big XML
    >file into a buffer and pass that buffer to XML::Simple...) even
    >shorter:
    >
    >use strict;
    >use warnings;
    >use XML::Reader 0.34;
    >
    >use XML::Simple;
    >use Data::Dumper;
    >
    >my $rdr = XML::Reader->newhd(\*DATA, {filter => 5},
    > { root => '/Data/ConnectionList/Connection/FileItemList/FileItem/FileType', branch => '*' },
    > );
    >
    >while ($rdr->iterate) {
    > my $buffer = $rdr->rval;
    > my $ref = XMLin($buffer);
    > print Dumper($ref), "\n\n";
    >}


    Good job on this.

    my $buffer = '';

    while ($rdr->iterate) {
        $buffer .= $rdr->rval;
    }

    if (length $buffer) {
        my $ref = XMLin('<FileItem>'.$buffer.'</FileItem>');
        print Dumper($ref), "\n\n";
    }
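
    To see why the wrapper element helps with the original question about FileType: XML::Simple drops the root element by default, so with a wrapper the top-level keys of the result should be the file types themselves. The sketch below walks such a structure. The hash is a hand-written stand-in for the XMLin() result (an assumption), so it runs without any XML module.

```perl
use strict;
use warnings;

# Hypothetical stand-in for what XMLin('<FileItem>...</FileItem>') might
# return: the root element is gone, so the keys are the file types.
my $ref = {
    InitTAP => { TAPSeqNo => '00083', NotifFileInd => 'false' },
    ReTxTAP => { TAPSeqNo => '00083', RefRAPSeqNo  => '00044' },
};

my @report;
for my $type (sort keys %$ref) {
    my $entry = $ref->{$type};
    # Repeated elements are folded into an array reference; normalise both cases.
    my @items = ref($entry) eq 'ARRAY' ? @$entry : ($entry);
    push @report, sprintf('%s: %d item(s), TAPSeqNo %s',
        $type, scalar(@items), $items[0]{TAPSeqNo});
}
print "$_\n" for @report;
```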

    -sln
     
    , Apr 26, 2010
    #16
  17. alwaysonnet

    John Bokma Guest

    Klaus <> writes:

    > my $rdr = XML::Reader->newhd(\*DATA, {filter => 5},


    To me, filter is very unclear. I understand that these are options to the
    program, but a bare 5 is very confusing. Maybe split "filter" into several
    options which, combined, result in 1,2,3,4,5?

    Why is the constructor called newhd?

    Anyway, thanks for mentioning this module; I will check it out when I
    have more time.

    --
    John Bokma j3b

    Hacking & Hiking in Mexico - http://johnbokma.com/
    http://castleamber.com/ - Perl & Python Development
     
    John Bokma, Apr 27, 2010
    #17
  18. alwaysonnet

    Klaus Guest

    On 26 avr, 23:58, wrote:
    > my $buffer = '';
    >
    > while ($rdr->iterate) {
    >    $buffer .= $rdr->rval;
    >
    > }
    >
    > if (length $buffer) {
    >    my $ref = XMLin('<FileItem>'.$buffer.'</FileItem>');
    >    print Dumper($ref), "\n\n";
    >
    > }


    If memory is not a concern, then you can use slurp_xml from
    XML::Reader 0.34:

    use strict;
    use warnings;
    use XML::Reader 0.34 qw(slurp_xml);

    use XML::Simple;
    use Data::Dumper;

    my $root = '/Data/ConnectionList/Connection/FileItemList/FileItem/FileType';
    my $lref = slurp_xml(\*DATA, {root => $root, branch => '*'});
    my $buffer = join '', map {$$_} @{$lref->[0]};
    my $ref = XMLin("<Item>$buffer</Item>");

    print Dumper($ref), "\n\n";
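
    The join/map line is the dense part, so here is a minimal sketch of just that step. The $lref value is a hand-written stand-in (an assumption based on the dereferencing used above): one inner list per root specification, each element a reference to an XML fragment string.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the slurp_xml() result: one inner list per
# root specification, each holding references to XML fragment strings.
my $lref = [
    [
        \'<InitTAP><TAPSeqNo>00083</TAPSeqNo></InitTAP>',
        \'<ReTxTAP><TAPSeqNo>00083</TAPSeqNo></ReTxTAP>',
    ],
];

# $$_ dereferences each scalar reference; join glues the fragments together.
my $buffer = join '', map { $$_ } @{ $lref->[0] };

# A single wrapper element turns the fragments into one well-formed document.
my $xml = "<Item>$buffer</Item>";
print "$xml\n";
```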
     
    Klaus, Apr 27, 2010
    #18
  19. alwaysonnet

    Klaus Guest

    On 27 avr, 02:01, John Bokma <> wrote:
    > Klaus <> writes:
    > > my $rdr = XML::Reader->newhd(\*DATA, {filter => 5},

    >
    > To me filter is very unclear. I understand that it are options to the
    > program, but just 5 is very confusing. Maybe split "filter" in several
    > options which combined result in 1,2,3,4,5 ?


    "filter => 2,3,4,5" is just a construction that has historically grown
    inside XML::Reader.

    But I agree very much with you: I also find that "filter => 2,3,4,5"
    is not expressive at all. I will think of a better way to select the
    mode of operation for XML::Reader.

    > why is the constructor called newhd?


    Thanks for the question.

    That, again, is a historical accident. Back in the old days of
    XML::Reader ver 0.01, there used to be an option {filter => 1}, and the
    constructor back then was called new() and defaulted to {filter => 1}.

    Then, in version 0.03 (or so), I decided to have the constructor
    default to {filter => 2}, but I didn't want to break code that already
    used the old default, so I came up with a second constructor called
    newhd() that defaults to {filter => 2}.

    In some later version of XML::Reader, {filter => 1} and, with it, the
    original constructor new() disappeared. It is therefore possible now to
    rename newhd() back to new(). I think I will go back to the constructor
    new() in a future version of XML::Reader.
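
    The backward-compatibility pattern described here (keep the old constructor with its old default, add a second constructor with the new default) can be sketched as follows. My::Reader is a hypothetical class made up for illustration; it is not XML::Reader's actual code.

```perl
use strict;
use warnings;

package My::Reader;    # hypothetical class, not XML::Reader's real code

# Original constructor: keeps the old default so existing callers still work.
sub new {
    my ($class, %opt) = @_;
    return bless { filter => 1, %opt }, $class;
}

# Second constructor: same interface, newer default.
sub newhd {
    my ($class, %opt) = @_;
    return bless { filter => 2, %opt }, $class;
}

package main;

my $legacy  = My::Reader->new;
my $current = My::Reader->newhd;
my $custom  = My::Reader->newhd(filter => 3);    # explicit option overrides the default

print join(', ', map { $_->{filter} } $legacy, $current, $custom), "\n";
```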
     
    Klaus, Apr 27, 2010
    #19
  20. alwaysonnet

    Klaus Guest

    On 27 avr, 09:10, Klaus <> wrote:
    > On 27 avr, 02:01, John Bokma <> wrote:
    >
    > > Klaus <> writes:
    > > > my $rdr = XML::Reader->newhd(\*DATA, {filter => 5},

    >
    > > To me filter is very unclear. I understand that it are options to the
    > > program, but just 5 is very confusing. Maybe split "filter" in several
    > > options which combined result in 1,2,3,4,5 ?

    >
    > I will think of a better way to select the
    > mode of operation for XML::Reader.
    >
    > > why is the constructor called newhd?

    >
    > [...] I think I will go back to constructor
    > new() in a future version of XML::Reader.


    I have now released a new version of XML::Reader (ver
    0.35) with some bug fixes, warts removed, relicensing, etc...
    http://search.cpan.org/~keichner/XML-Reader-0.35/lib/XML/Reader.pm

    The line I wrote in my previous post (which was for XML::Reader ver
    0.34) was:

    my $rdr = XML::Reader->newhd(\*DATA, {filter => 5},

    With the new version 0.35 of XML::Reader, the same line would be
    spelled:

    my $rdr = XML::Reader->new(\*DATA, {mode => 'branches'},
     
    Klaus, Apr 29, 2010
    #20