Looking for an efficient way to parse a file

Eric Martin

Hello,

I have a file with the following data structure:
#category
item name
data1
data2
item name
data1
data2
#category
item name
data1
data2
... etc.

Any line that starts with # indicates a new category. Between
categories, there can be any number of items, each with its associated
data. Each item has exactly two data properties.

My plan was to just get an array that contained the index of each of
the categories and then parse each item from there, since they are in
a set format...but I was wondering if there were any suggestions for a
more efficient way...
 
Gunnar Hjalmarsson

Eric said:
I have a file with the following data structure:
#category
item name
data1
data2
item name
data1
data2
#category
item name
data1
data2
... etc.

Any line that starts with #, indicates a new category. Between
categories, there can be any number of items, with associated data.
Each item has exactly two data properties.

My plan was to just get an array that contained the index of each of
the categories and then parse each item from there, since they are in
a set format...

Not sure what you mean by that. Could you please expand?

Eric said:
but I was wondering if there were any suggestions for a
more efficient way...

Efficient - in what sense?

To me, the described data structure would suggest a HoHoA (hash of
hashes of arrays):

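use strict;
use warnings;
use Data::Dumper;

my (%HoHoA, $cat);
while ( <DATA> ) {
    chomp;
    # A line starting with '#' opens a new category.
    if ( substr($_, 0, 1) eq '#' ) {
        $cat = substr $_, 1;
        next;
    }
    # Otherwise $_ is an item name; the next two lines are its data.
    for my $item ( 0, 1 ) {
        chomp( $HoHoA{$cat}{$_}[$item] = <DATA> );
    }
}
print Dumper \%HoHoA;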

__DATA__
#category1
item1
data1
data2
item2
data1
data2
#category2
item1
data1
data2
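
Once the hash of hashes of arrays is built, reading it back is a pair of nested loops. The following is a minimal sketch, assuming a structure shaped like the one the code above produces; the sample hash and the helper name `describe` are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample, shaped like the structure the parser builds.
my %HoHoA = (
    category1 => {
        item1 => [ 'data1', 'data2' ],
        item2 => [ 'data1', 'data2' ],
    },
    category2 => {
        item1 => [ 'data1', 'data2' ],
    },
);

# Return one "category/item: d1, d2" line per item, in sorted order.
sub describe {
    my ($data) = @_;
    my @lines;
    for my $cat ( sort keys %$data ) {
        for my $item ( sort keys %{ $data->{$cat} } ) {
            my ( $d1, $d2 ) = @{ $data->{$cat}{$item} };
            push @lines, "$cat/$item: $d1, $d2";
        }
    }
    return @lines;
}

print "$_\n" for describe( \%HoHoA );
```

Hash keys have no inherent order, so the sketch sorts them to get stable output. The order in which categories appeared in the file is not preserved by a plain hash.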
 
xhoster

Eric Martin said:
Hello,

I have a file with the following data structure:
#category
item name
data1
data2
item name
data1
data2
#category
item name
data1
data2
... etc.

Any line that starts with #, indicates a new category. Between
categories, there can be any number of items, with associated data.
Each item has exactly two data properties.

My plan was to just get an array that contained the index of each of
the categories

That suggests the categories are already in an array, or else what is the
index the index to? I'd probably not bother to load them into an array
in the first place, just parse it on the fly. Maybe not, depending on
where it was coming from and how big I expected it to plausibly get.

Eric Martin said:
and then parse each item from there, since they are in
a set format...but I was wondering if there were any suggestions for a
more efficient way...

Efficient in what sense? Memory? CPU time? Programmer maintenance time?

Xho

 
Jürgen Exner

Eric Martin said:
I have a file with the following data structure:
#category
item name
data1
data2
item name
data1
data2
#category
item name
data1
data2
... etc.

Any line that starts with #, indicates a new category. Between
categories, there can be any number of items, with associated data.
Each item has exactly two data properties.

That suggests to me a Hash (category) of Hash (item name) of Array (two data
elements).

Eric Martin said:
My plan was to just get an array that contained the index of each of
the categories and then parse each item from there, since they are in

What's an index of a category?

Eric Martin said:
a set format...but I was wondering if there were any suggestions for a
more efficient way...

Reading the file line by line in a linear manner is about as efficient as
you can possibly get because you need to read each item at least once and
you don't read it more than once, either. The suggested data structure would
support a linear reading, too.
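
A single linear pass over a filehandle could look roughly like the sketch below. It assumes the format described in the question; the helper name `parse_categories`, the sample text, and the choice to die on a truncated item are illustrative assumptions, not something the original posts specify.

```perl
use strict;
use warnings;

# Hypothetical helper: read an open filehandle once, top to bottom.
sub parse_categories {
    my ($fh) = @_;
    my ( %data, $cat );
    while ( my $line = <$fh> ) {
        chomp $line;
        if ( $line =~ /^#(.*)/ ) {    # '#' starts a new category
            $cat = $1;
            next;
        }
        die "item '$line' appears before any category\n" unless defined $cat;
        for my $i ( 0, 1 ) {          # each item has exactly two data lines
            my $value = <$fh>;
            die "file ended inside item '$line'\n" unless defined $value;
            chomp $value;
            $data{$cat}{$line}[$i] = $value;
        }
    }
    return \%data;
}

# An in-memory filehandle stands in for a real file here.
my $text = "#fruit\napple\nred\nsweet\n#tools\nhammer\nsteel\nheavy\n";
open my $fh, '<', \$text or die "cannot open in-memory file: $!";
my $parsed = parse_categories($fh);
close $fh;
print "$parsed->{fruit}{apple}[1]\n";    # prints "sweet"
```

Every line is read exactly once and only the finished structure is kept in memory, which matches the point made above about linear reading.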

jue
 
Eric Martin

Gunnar Hjalmarsson said:
Not sure what you mean by that. Could you please expand?

I was thinking of loading the file into an array, iterating over it to
find the index values for each category, then parsing the data between
each category, using the array of indexes I previously created.
However, your suggestion to use a HoHoA, together with the code sample,
proved to be exactly what I needed.
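
For comparison, the index-based plan described above might be sketched as follows. This is only an illustration of that plan under the stated file format; the helper name `parse_by_index` and the sample lines are assumptions. It needs the whole file in memory and two passes, which is the main reason the single-pass version is simpler.

```perl
use strict;
use warnings;

# Hypothetical helper implementing the index-based plan.
sub parse_by_index {
    my (@lines) = @_;
    chomp @lines;

    # Pass 1: positions of the category lines.
    my @starts = grep { $lines[$_] =~ /^#/ } 0 .. $#lines;

    my %data;
    for my $n ( 0 .. $#starts ) {
        my $first = $starts[$n] + 1;
        my $last  = $n < $#starts ? $starts[ $n + 1 ] - 1 : $#lines;
        my $cat   = substr $lines[ $starts[$n] ], 1;

        # Pass 2: items sit in groups of three lines (name, data1, data2).
        for ( my $i = $first ; $i + 2 <= $last ; $i += 3 ) {
            $data{$cat}{ $lines[$i] } = [ @lines[ $i + 1, $i + 2 ] ];
        }
    }
    return \%data;
}

my @lines = (
    "#c1\n", "a\n", "1\n", "2\n", "b\n", "3\n", "4\n",
    "#c2\n", "x\n", "5\n", "6\n",
);
my $parsed = parse_by_index(@lines);
print join( ',', @{ $parsed->{c1}{b} } ), "\n";    # prints "3,4"
```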

Gunnar Hjalmarsson said:
Efficient - in what sense?

I probably should have said effective ;)

Gunnar Hjalmarsson said:
To me, the described data structure would suggest a HoHoA (hash of
hashes of arrays):

use Data::Dumper;

my (%HoHoA, $cat);
while ( <DATA> ) {
    chomp;
    if ( substr($_, 0, 1) eq '#' ) {
        $cat = substr $_, 1;
        next;
    }
    for my $item ( 0, 1 ) {
        chomp( $HoHoA{$cat}{$_}[$item] = <DATA> );
    }
}

print Dumper \%HoHoA;

__DATA__
#category1
item1
data1
data2
item2
data1
data2
#category2
item1
data1
data2

Thanks for the code sample, it worked great! I didn't realize
referencing <DATA> in the while block would "increment" the record of
the data file.
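
The behaviour mentioned here can be seen in isolation: each scalar-context read from a filehandle returns the next line and moves the read position forward, whether it happens in the `while` condition or inside the loop body. A small sketch, using an in-memory filehandle and made-up sample text as assumptions:

```perl
use strict;
use warnings;

my $text = "first\nsecond\nthird\n";
open my $fh, '<', \$text or die "cannot open in-memory file: $!";

my @seen;
while ( my $line = <$fh> ) {    # reads "first", later "third"
    chomp $line;
    my $next = <$fh>;           # consumes the following line
    chomp $next if defined $next;
    push @seen, [ $line, $next ];
}
close $fh;

# @seen is now ( [ 'first', 'second' ], [ 'third', undef ] )
printf "%s -> %s\n", $_->[0], $_->[1] // '(end of file)' for @seen;
```

The second pair ends in `undef` because the inner read hit end of file, which is why a parser that reads ahead like this may want to check for `undef` before calling `chomp`.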

-Eric
 
