lightweight access to large data structures?


ivowel

dear perl experts---

I have a 300MB .csv data file (60MB compressed) that I need to read,
but not write to:

key-part1,key-part2,data1,data2,data3
ibm,2003/01,0.2,0.3,0.4
ibm,1972/01,0.5,0.3,NaN
sunw,2003/01,0.3,NaN,0.1
....

the key-part1+key-part2 combination is unique, but neither key alone
is unique.

my first idea to use this data in perl was a bit naive: create a hash
of hashes, so that I can find data or iterate over all data items that
match only one of the two keys. Something like $data1->{ibm}->{192601}
and $data1->{192601}->{ibm}. great idea indeed, except that after it
gobbled up about 4GB of RAM, my perl program died. it would have been
nice if it had worked.

I can think of a couple of methods that I could use. I could read the
data with a C program, and then have perl query my C program (e.g.,
through a socket). yikes. I could copy (yikes) the data into a data
base and access it through a data base module, though I am not sure
what data base I should use for this purpose. (I need not one-key
access, but two key multiple-record access.) or I could do the
combination, and put the data into an SQL data base and learn SQL just
so that I can quickly access my data file. yikes and yikes. maybe
perl6 could do better, but perl6 isn't around yet. is there a way to
code so that perl5 becomes more memory efficient?

This can't be an obscure problem. What is the recommended lightweight
way of dealing with such large-data situations in perl5?
advice appreciated...

sincerely,

/iaw
 

usenet

I have a 300MB .csv data file...
it gobbled up about 4GB of RAM

There's no reason why 300 MB of input data should consume 4 GB when
imported into a Perl data structure. I have a feeling there is an
infinite loop or gross inefficiency or some other problem in your
program.

Consider using one of the many CSV parsers available on http://search.cpan.org.

Or show us your file reading and parsing code; maybe we can spot the
problem.
 

Tad McClellan

This can't be an obscure problem.


Perhaps it is a Question that is Asked Frequently...

perldoc -q memory

How can I make my Perl program take less memory?
What is the recommended lightweight way of dealing with such
large-data situations in perl5?


tie()
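
For example, tied to an on-disk DBM file (a rough sketch, not tested;
it assumes the DB_File module and a Berkeley DB library are available,
and the file name is made up):

use strict;
use warnings;
use Fcntl;
use DB_File;

# tie %data to a file on disk; lookups then hit the disk, not RAM
my %data;
tie %data, 'DB_File', 'quotes.dbm', O_RDWR|O_CREAT, 0644, $DB_BTREE
    or die "cannot tie quotes.dbm: $!";

# combined key, value kept as one unparsed string
$data{'ibm,2003/01'} = '0.2,0.3,0.4';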
 

xhoster

There's no reason why 300 MB of input data should consume 4 GB when
imported into a Perl data structure.

Sure there is. Perl has all kinds of memory overhead. Heck, I'm surprised
it doesn't take more, given that in his example, each chunk of data is only
a few bytes. For each chunk, you have to have a full scalar struct
(about 20 bytes), plus you need the string storage (starts at about 12
bytes, even if the string is only one character long). Then he has many,
many hash structures, each with high overhead.
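
If you want to see that overhead for yourself, something like this
quick sketch (using Devel::Size from CPAN) will report it:

use strict;
use warnings;
use Devel::Size qw(size total_size);

print size("0.2"), "\n";      # bytes taken by one tiny scalar
print total_size({ '2003/01' => [ 0.2, 0.3, 0.4 ] }), "\n";  # one inner hash plus its arrayref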



Xho
 

xhoster

dear perl experts---

I have a 300MB .csv data file (60MB compressed) that I need to read,
but not write to:

key-part1,key-part2,data1,data2,data3
ibm,2003/01,0.2,0.3,0.4
ibm,1972/01,0.5,0.3,NaN
sunw,2003/01,0.3,NaN,0.1
...

the key-part1+key-part2 combination is unique, but neither key alone
is unique.

What is the cardinality of the respective parts to the key? I.e. are there
only two possible values for key-part1, sunw and ibm, and all the rest of
the diversity comes from key-part2?
my first idea to use this data in perl was a bit naive: create a hash
of hashes, so that I can find data or iterate over all data items that
match only one of the two keys.

Do you actually need to be able to do that quickly in both directions?
What are the exact operations that have to be supported quickly?
Something like $data1->{ibm}->{192601} and $data1->{192601}->{ibm}.

What would the value in $data1->{ibm}->{192601} be? An arrayref of
[data1,data2,data3]?

Perhaps you could just tie your hash to something like DBM::Deep. I'm
quite fond of that module, especially when I need a pure-Perl-only solution.
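Something like this (a sketch; the file name is made up):

use strict;
use warnings;
use DBM::Deep;

# the whole nested structure lives in this file, not in RAM
my $db = DBM::Deep->new('quotes.db');

# fill it once, then use it like an ordinary hash of hashes
$db->{ibm}{'2003/01'} = [ 0.2, 0.3, 0.4 ];
my $row = $db->{ibm}{'2003/01'};
print "$row->[0]\n";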
great idea indeed, except
that after it gobbled up about 4GB of RAM, my perl program died. it
would have been nice if it had worked.

I can think of a couple of methods that I could use. I could read the
data with a C program, and then have perl query my C program (e.g.,
through a socket). yikes.

Yikes indeed. It is hard for me to dream up a situation that would
induce me to do this.
I could copy (yikes)

How about just moving it into a database, instead of copying it?
the data into a data
base and access it through a data base module, though I am not sure
what data base I should use for this purpose.

This seems like a good idea. I'm partial to mysql myself for such simple
projects.
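
The Perl side of that is not much code either; a sketch (the database
name, credentials, and the 'quotes' table layout are placeholders):

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:mysql:database=quotes_db', 'user', 'password',
                       { RaiseError => 1 });

# two-key lookup; drop one condition to get every record for one key
my $rows = $dbh->selectall_arrayref(
    'SELECT data1, data2, data3 FROM quotes WHERE key1 = ? AND key2 = ?',
    undef, 'ibm', '2003/01');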

But before that, how do you arrive at the queries that you will submit to
the would-be database in the first place? If they are not interactive, but
rather you have a fixed set of queries to process, you can usually come up
with better text-based methods. For example, instead of storing the data
in a hash/database and then reading the queries and applying them to the
data hash, you can store the queries in the hash, and read through the data
a line at a time figuring out what query it pertains to.
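
A sketch of that idea, assuming the queries are key1,key2 pairs sitting
in a file (queries.txt is a made-up name):

use strict;
use warnings;

# read the queries into a hash first
my %wanted;
open my $q, '<', 'queries.txt' or die "queries.txt: $!";
while (<$q>) {
    chomp;
    $wanted{$_} = 1;          # each line looks like "ibm,2003/01"
}
close $q;

# then stream the big file once, keeping only lines that match a query
open my $csv, '<', 'data.csv' or die "data.csv: $!";
<$csv>;                       # skip the header line
while (my $line = <$csv>) {
    my ($k1, $k2) = split /,/, $line;
    print $line if $wanted{"$k1,$k2"};
}
close $csv;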

Or, you can often just arrange things such that just sorting the data file
in a particular way will accomplish most of the work you need to do.
(I need not one-key
access, but two key multiple-record access.) or I could do the
combination, and put the data into an SQL data base and learn SQL just
so that I can quickly access my data file. yikes and yikes. maybe
perl6 could do better, but perl6 isn't around yet. is there a way to
code so that perl5 becomes more memory efficient?

Yes, but the options will be painful and inflexible, and you would have to give us
far more information about exactly what it is you are going to be doing.
For example, you could keep the last 3 columns for each record as one
string of text, and reparse that string each time you need to access the
record. A string takes up much less space than a three-element array which
contains the split-up contents of that string.
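Roughly like this (a sketch):

use strict;
use warnings;

my %row;    # $row{key1}{key2} = "data1,data2,data3", kept as one unsplit string

open my $fh, '<', 'data.csv' or die "data.csv: $!";
<$fh>;      # skip the header
while (<$fh>) {
    chomp;
    my ($k1, $k2, $rest) = split /,/, $_, 3;
    $row{$k1}{$k2} = $rest;
}
close $fh;

# reparse only when a record is actually needed
my @fields = split /,/, $row{ibm}{'2003/01'};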
This can't be an obscure problem.

But it also isn't a general problem. There are a billion ways you can have
way too much data to fit into memory, and a billion things you can want to
do with it. There isn't one solution that fits all of those situations.

Xho
 

Ilya Zakharevich

[A complimentary Cc of this posting was sent to

Sure there is. Perl has all kinds of memory overhead. Heck, I'm surprised
it doesn't take more, given that in his example, each chunk of data is only
a few bytes.
Correct.

For each chunk, you have to have a full scalar struct
(about 20 bytes), plus you need the string storage (starts at about 12
bytes, even if the string is only one character long).

Wrong. With Perl's malloc(), the minimal allocated buffer is 4 bytes.
With the system's malloc() it is, obviously, system-dependent.
Then he has many, many hash structures, each with high overhead.

Correct.

Hope this helps,
Ilya
 

xhoster

Ilya Zakharevich said:
[A complimentary Cc of this posting was sent to

Sure there is. Perl has all kinds of memory overhead. Heck, I'm
surprised it doesn't take more, given that in his example, each chunk
of data is only a few bytes.
Correct.

For each chunk, you have to have a full scalar struct
(about 20 bytes), plus you need the string storage (starts at about 12
bytes, even if the string is only one character long).

Wrong. With Perl's malloc(), the minimal allocated buffer is 4 bytes.

That may be the minimal amount of space that Perl is capable of allocating,
but that does mean it is not the minimal amount of space that Perl actually
does allocate in any given circumstance. Indeed, under my 64 bit version
of perl it seems that the minimum space allocated for string storage for a
small string is 24 bytes, not the 12 bytes it was last time I checked on a
32 bit system.

#!/usr/bin/perl
use strict;
use warnings;

# Create a million empty scalars, then grow each by one character per pass,
# printing the process's resident set size (RSS) as reported by ps.
my @x;
push @x, '' foreach 0..1_000_000;
print "start\t", +(`ps -p $$ -o rss`)[1];
foreach my $size (1..1000) {
    $_ .= 'x' foreach @x;
    print "$size\t", +(`ps -p $$ -o rss`)[1];
}
__END__

Notice below how no meaningful additional space is needed until the strings
reach 24 bytes long.

$ perl scalar_size2.pl
start 82432
1 82436
2 82436
3 82436
4 82436
5 82436
6 82436
7 82436
8 82436
9 82436
10 82436
11 82436
12 82436
13 82436
14 82436
15 82436
16 82436
17 82436
18 82436
19 82436
20 82436
21 82436
22 82436
23 82436
24 98280
25 98280
26 98280
27 98280
28 98280
29 98280
30 98280
31 98280
32 98280
33 98280
34 98280
35 98280
36 98280
37 98280
38 98280
39 98280
40 113988
41 113988
42 113988
43 113988
44 113988
45 113988

Xho
 

Ilya Zakharevich

[A complimentary Cc of this posting was sent to

That may be the minimal amount of space that Perl is capable of allocating,
but that does mean it is not the minimal amount of space that Perl actually
does allocate in any given circumstance.

This does not even parse.
Indeed, under my 64 bit version
of perl it seems that the minimum space allocated for string storage for a
small string is 24 bytes, not the 12 bytes it was last time I checked on a
32 bit system.

As I explained, there is no "indeed", unless you specify that your
perl was built with usemymalloc=y.

Hope this helps,
Ilya
 

xhoster

Ilya Zakharevich said:
[A complimentary Cc of this posting was sent to

That may be the minimal amount of space that Perl is capable of
allocating, but that does mean it is not the minimal amount of space
that Perl actually does allocate in any given circumstance.

This does not even parse.

Misplaced "not".

That may be the minimal amount of space that Perl is capable of
allocating, but that does not mean it is the minimal amount of space
that Perl actually does allocate in any given circumstance.

Xho
 

Ilya Zakharevich

[A complimentary Cc of this posting was sent to
Misplaced "not".

That may be the minimal amount of space that Perl is capable of
allocating, but that does not mean it is the minimal amount of space
that Perl actually does allocate in any given circumstance.

Now it parses. And is BS.

(As I said) there is one circumstance where one can know *exactly*
what happens. Perl's malloc() uses EXACTLY 4-byte (or 8-byte, on
64-bit compiles) areas for small allocations (up to 4/8 bytes). Well,
there is some overhead, about 1 byte per allocation; so about 400 small
allocations will use a 2K arena.

Hope this helps,
Ilya
 

ivowel

thank you for all the advice. I solved my problem in a pragmatic but
not very convenient way. I now have a single hash that combines both
keys into one key and whose values are indexes into one long string
which contains all the data. This seems rather memory efficient, even
though it is not convenient and not in the spirit of my problem.
better than an external data base, external C, or other kludges,
however...
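
in case it helps anyone else, the scheme is roughly the following (a
sketch from memory, not my exact code; the pack format and file name
are arbitrary):

use strict;
use warnings;

my $blob  = '';    # every record's data fields, concatenated
my %index;         # "key1,key2" => packed (offset, length) into $blob

open my $fh, '<', 'data.csv' or die "data.csv: $!";
<$fh>;             # skip header
while (my $line = <$fh>) {
    chomp $line;
    my ($k1, $k2, $rest) = split /,/, $line, 3;
    $index{"$k1,$k2"} = pack 'NN', length($blob), length($rest);
    $blob .= $rest;
}
close $fh;

# look up one record and split it only when needed
sub record {
    my ($k1, $k2) = @_;
    my $packed = $index{"$k1,$k2"} or return;
    my ($start, $len) = unpack 'NN', $packed;
    return split /,/, substr($blob, $start, $len);
}

my @fields = record('ibm', '2003/01');   # (0.2, 0.3, 0.4)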

/iaw
 
