Looking for an efficient way to parse a file

Discussion in 'Perl Misc' started by Eric Martin, Jan 12, 2008.

  1. Eric Martin

    Eric Martin Guest

    Hello,

    I have a file with the following data structure:
    #category
    item name
    data1
    data2
    item name
    data1
    data2
    #category
    item name
    data1
    data2
    .... etc.

    Any line that starts with # indicates a new category. Between
    categories, there can be any number of items, with associated data.
    Each item has exactly two data properties.

    My plan was to just get an array that contained the index of each of
    the categories and then parse each item from there, since they are in
    a set format...but I was wondering if there were any suggestions for a
    more efficient way...
    Eric Martin, Jan 12, 2008
    #1

  2. Eric Martin wrote:
    > I have a file with the following data structure:
    > #category
    > item name
    > data1
    > data2
    > item name
    > data1
    > data2
    > #category
    > item name
    > data1
    > data2
    > ... etc.
    >
    > Any line that starts with #, indicates a new category. Between
    > categories, there can be any number of items, with associated data.
    > Each item has exactly two data properties.
    >
    > My plan was to just get an array that contained the index of each of
    > the categories and then parse each item from there, since they are in
    > a set format...


    Not sure what you mean by that. Could you please expand?

    > but I was wondering if there were any suggestions for a
    > more efficient way...


    Efficient - in what sense?

    To me, the described data structure would suggest a HoHoA (hash of
    hashes of arrays):

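    use strict;
    use warnings;
    use Data::Dumper;

    my (%HoHoA, $cat);
    while ( <DATA> ) {
        chomp;
        # A line starting with '#' begins a new category
        if ( substr($_, 0, 1) eq '#' ) {
            $cat = substr $_, 1;
            next;
        }
        # Otherwise $_ is an item name; read its two data lines
        for my $item ( 0, 1 ) {
            chomp( $HoHoA{$cat}{$_}[$item] = <DATA> );
        }
    }
    print Dumper \%HoHoA;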

    __DATA__
    #category1
    item1
    data1
    data2
    item2
    data1
    data2
    #category2
    item1
    data1
    data2

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
    Gunnar Hjalmarsson, Jan 12, 2008
    #2

  3. Eric Martin

    Guest

    Eric Martin <> wrote:
    > Hello,
    >
    > I have a file with the following data structure:
    > #category
    > item name
    > data1
    > data2
    > item name
    > data1
    > data2
    > #category
    > item name
    > data1
    > data2
    > ... etc.
    >
    > Any line that starts with #, indicates a new category. Between
    > categories, there can be any number of items, with associated data.
    > Each item has exactly two data properties.
    >
    > My plan was to just get an array that contained the index of each of
    > the categories


    That suggests the categories are already in an array, or else what is the
    index an index into? I'd probably not bother to load them into an array
    in the first place and would just parse the file on the fly. Then again,
    maybe not, depending on where the data comes from and how big I expect it
    to plausibly get.

    > and then parse each item from there, since they are in
    > a set format...but I was wondering if there were any suggestions for a
    > more efficient way...


    Efficient in what sense? Memory? CPU time? Programmer maintenance time?

    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    The costs of publication of this article were defrayed in part by the
    payment of page charges. This article must therefore be hereby marked
    advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
    this fact.
    , Jan 12, 2008
    #3
  4. Eric Martin <> wrote:
    >I have a file with the following data structure:
    >#category
    >item name
    >data1
    >data2
    >item name
    >data1
    >data2
    >#category
    >item name
    >data1
    >data2
    >... etc.
    >
    >Any line that starts with #, indicates a new category. Between
    >categories, there can be any number of items, with associated data.
    >Each item has exactly two data properties.


    That suggests to me a Hash(category) of Hash(item name) of Array (two data
    elements)
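
    As a rough illustration of that shape, here is a minimal sketch of a
    hash (category) of hashes (item name) of arrays (two data elements) and
    one way to walk it. The sample values and the @report array are made up
    for the example; they are not from the original file.

```perl
use strict;
use warnings;

# Hypothetical sample data in the shape described above:
# category => item name => [ data1, data2 ]
my %HoHoA = (
    category1 => {
        item1 => [ 'data1', 'data2' ],
        item2 => [ 'data1', 'data2' ],
    },
    category2 => {
        item1 => [ 'data1', 'data2' ],
    },
);

# Collect one "category/item: data1, data2" line per item,
# sorting the keys so the order is stable
my @report;
for my $cat ( sort keys %HoHoA ) {
    for my $item ( sort keys %{ $HoHoA{$cat} } ) {
        my ( $d1, $d2 ) = @{ $HoHoA{$cat}{$item} };
        push @report, "$cat/$item: $d1, $d2";
    }
}
print "$_\n" for @report;
```

    Note that item names are hash keys here, so two items with the same
    name inside one category would overwrite each other.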

    >My plan was to just get an array that contained the index of each of
    >the categories and then parse each item from there, since they are in


    What's an index of a category?

    >a set format...but I was wondering if there were any suggestions for a
    >more efficient way...


    Reading the file line by line in a linear manner is about as efficient as
    you can possibly get because you need to read each item at least once and
    you don't read it more than once, either. The suggested data structure would
    support a linear reading, too.
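
    A single linear pass could look roughly like the sketch below. The
    helper name parse_categories, the sample text, and the choice to skip
    lines before the first category are all assumptions made for the
    example; an in-memory filehandle stands in for the real file.

```perl
use strict;
use warnings;

# parse_categories is a hypothetical helper: it reads an open filehandle
# once, front to back, and returns a reference to a hash of hashes of arrays.
sub parse_categories {
    my ($fh) = @_;
    my ( %data, $cat );
    while ( defined( my $line = <$fh> ) ) {
        chomp $line;
        if ( $line =~ /^#(.*)/ ) {    # '#' starts a new category
            $cat = $1;
            next;
        }
        next unless defined $cat;     # skip anything before the first category
        # $line is an item name; the next two lines are its data
        for my $i ( 0, 1 ) {
            my $value = <$fh>;
            last unless defined $value;    # file ended early
            chomp $value;
            $data{$cat}{$line}[$i] = $value;
        }
    }
    return \%data;
}

# Sample input held in a string so the sketch runs without a real file
my $text = "#fruit\napple\nred\nsweet\npear\ngreen\nsoft\n#veg\nleek\nwhite\nmild\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
my $parsed = parse_categories($fh);
close $fh;

print join( ', ', @{ $parsed->{fruit}{pear} } ), "\n";    # prints: green, soft
```

    Each line is read exactly once, so the work grows linearly with the
    size of the file, and nothing beyond the resulting structure has to be
    held in memory.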

    jue
    Jürgen Exner, Jan 13, 2008
    #4
  5. Eric Martin

    Eric Martin Guest

    On Jan 12, 2:59 pm, Gunnar Hjalmarsson <> wrote:
    > Eric Martin wrote:
    > > I have a file with the following data structure:
    > > #category
    > > item name
    > > data1
    > > data2
    > > item name
    > > data1
    > > data2
    > > #category
    > > item name
    > > data1
    > > data2
    > > ... etc.

    >
    > > Any line that starts with #, indicates a new category. Between
    > > categories, there can be any number of items, with associated data.
    > > Each item has exactly two data properties.

    >
    > > My plan was to just get an array that contained the index of each of
    > > the categories and then parse each item from there, since they are in
    > > a set format...

    >
    > Not sure what you mean by that. Could you please expand?


    I was thinking of loading the file into an array, iterating over it to
    find the index values for each category, then parsing the data between
    each category, using the array of indexes I previously created.
    However, your suggestion to use a HoHoA, along with the code sample,
    proved to be exactly what I needed.
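
    For comparison, the index-based plan described above might be sketched
    as follows. This is only an illustration of that two-pass idea, not
    code from the thread; the @lines sample and the variable names are
    assumptions.

```perl
use strict;
use warnings;

# Hypothetical sample lines standing in for the file slurped into an array
my @lines = (
    '#category1', 'item1', 'data1', 'data2',
                  'item2', 'data1', 'data2',
    '#category2', 'item1', 'data1', 'data2',
);

# Pass 1: record the array index of every category line
my @cat_idx = grep { $lines[$_] =~ /^#/ } 0 .. $#lines;

# Pass 2: walk the lines between one category index and the next
my %HoHoA;
for my $n ( 0 .. $#cat_idx ) {
    my $start = $cat_idx[$n];
    my $end   = $n < $#cat_idx ? $cat_idx[ $n + 1 ] - 1 : $#lines;
    my $cat   = substr $lines[$start], 1;
    # Items come in groups of three lines: name, data1, data2
    for ( my $i = $start + 1 ; $i + 2 <= $end ; $i += 3 ) {
        $HoHoA{$cat}{ $lines[$i] } = [ @lines[ $i + 1, $i + 2 ] ];
    }
}

print scalar( keys %{ $HoHoA{category1} } ), " items in category1\n";
```

    This gives the same structure, but it needs the whole file in memory
    and touches every line twice, which is why the single-pass version is
    usually the simpler choice.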

    >
    > > but I was wondering if there were any suggestions for a
    > > more efficient way...

    >
    > Efficient - in what sense?


    I probably should have said effective ;)

    >
    > To me, the described data structure would suggest a HoHoA (hash of
    > hashes of arrays):
    >
    > use Data::Dumper;
    >
    > my (%HoHoA, $cat);
    > while ( <DATA> ) {
    > chomp;
    > if ( substr($_, 0, 1) eq '#' ) {
    > $cat = substr $_, 1;
    > next;
    > }
    > for my $item ( 0, 1 ) {
    > chomp( $HoHoA{$cat}{$_}[$item] = <DATA> );
    > }}
    >
    > print Dumper \%HoHoA;
    >
    > __DATA__
    > #category1
    > item1
    > data1
    > data2
    > item2
    > data1
    > data2
    > #category2
    > item1
    > data1
    > data2
    >
    > --
    > Gunnar Hjalmarsson
    > Email:http://www.gunnar.cc/cgi-bin/contact.pl


    Thanks for the code sample, it worked great! I didn't realize
    referencing <DATA> in the while block would "increment" the record of
    the data file.
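
    The behaviour is easy to see in isolation. In the sketch below (sample
    text and variable names are made up), every read from the handle in
    scalar context returns the next line, whether it happens in the while
    condition or inside the loop body:

```perl
use strict;
use warnings;

# In-memory "file" so the sketch runs on its own
my $text = "first\nsecond\nthird\nfourth\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";

my @seen;
while ( my $line = <$fh> ) {    # the loop condition reads one line
    chomp $line;
    my $extra = <$fh>;          # this read also moves the handle forward
    chomp $extra if defined $extra;
    push @seen, [ $line, $extra ];
}
close $fh;

# The loop body runs only twice, because each pass consumes two lines
print "$_->[0] + $_->[1]\n" for @seen;
```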

    -Eric
    Eric Martin, Jan 13, 2008
    #5
