lightweight access to large data structures?


ivowel

dear perl experts---

I have a 300MB .csv data file (60MB compressed) that I need to read,
but not write to:

key-part1,key-part2,data1,data2,data3
ibm,2003/01,0.2,0.3,0.4
ibm,1972/01,0.5,0.3,NaN
sunw,2003/01,0.3,NaN,0.1
....

the key-part1+key-part2 combination is unique, but neither key alone
is unique.

my first idea to use this data in perl was a bit naive: create a hash
of hashes, so that I can find data or iterate over all data items that
match only one of the two keys. Something like $data1->{ibm}->{192601}
and $data1->{192601}->{ibm}. great idea indeed, except that after it
gobbled up about 4GB of RAM, my perl program died. it would have been
nice if it had worked.

I can think of a couple of methods that I could use. I could read the
data with a C program, and then have perl query my C program (e.g.,
through a socket). yikes. I could copy (yikes) the data into a data
base and access it through a data base module, though I am not sure
what data base I should use for this purpose. (I need not one-key
access, but two key multiple-record access.) or I could do the
combination, and put the data into an SQL data base and learn SQL just
so that I can quickly access my data file. yikes and yikes. maybe
perl6 could do better, but perl6 isn't around yet. is there a way to
code so that perl5 becomes more memory efficient?

This can't be an obscure problem. What is the recommended lightweight
way of dealing with such large-data situations in perl5?
advice appreciated...

sincerely,

/iaw
 

usenet

I have a 300MB .csv data file...
it gobbled up about 4GB of RAM

There's no reason why 300 MB of input data should consume 4 GB when
imported into a Perl data structure. I have a feeling there is an
infinite loop or gross inefficiency or some other problem in your
program.

Consider using one of the many CSV parsers available on http://search.cpan.org.

Or show us your file reading and parsing code; maybe we can spot the
problem.
 

Tad McClellan

This can't be an obscure problem.


Perhaps it is a Question that is Asked Frequently...

perldoc -q memory

How can I make my Perl program take less memory?
What is the recommended lightweight way of dealing with such
large-data situations in perl5?


tie()
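
For example, tied to an on-disk DBM file (a rough sketch, not tested;
it assumes the DB_File module and a Berkeley DB library are available,
and the file name is made up):

use strict;
use warnings;
use Fcntl;
use DB_File;

# tie %data to a file on disk; lookups then hit the disk, not RAM
my %data;
tie %data, 'DB_File', 'quotes.dbm', O_RDWR|O_CREAT, 0644, $DB_BTREE
    or die "cannot tie quotes.dbm: $!";

# combined key, value kept as one unparsed string
$data{'ibm,2003/01'} = '0.2,0.3,0.4';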
 

xhoster

There's no reason why 300 MB of input data should consume 4 GB when
imported into a Perl data structure.

Sure there is. Perl has all kinds of memory overhead. Heck, I'm surprised
it doesn't take more, given that in his example, each chunk of data is only
a few bytes. For each chunk, you have to have a full scalar struct
(about 20 bytes), plus you need the string storage (starts at about 12
bytes, even if the string is only one character long). Then he has many,
many hash structures, each with high overhead.
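
If you want to see that overhead for yourself, something like this
quick sketch (using Devel::Size from CPAN) will report it:

use strict;
use warnings;
use Devel::Size qw(size total_size);

print size("0.2"), "\n";      # bytes taken by one tiny scalar
print total_size({ '2003/01' => [ 0.2, 0.3, 0.4 ] }), "\n";  # one inner hash plus its arrayref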



Xho
 

xhoster

dear perl experts---

I have a 300MB .csv data file (60MB compressed) that I need to read,
but not write to:

key-part1,key-part2,data1,data2,data3
ibm,2003/01,0.2,0.3,0.4
ibm,1972/01,0.5,0.3,NaN
sunw,2003/01,0.3,NaN,0.1
...

the key-part1+key-part2 combination is unique, but neither key alone
is unique.

What is the cardinality of the respective parts to the key? I.e. are there
only two possible values for key-part1, sunw and ibm, and all the rest of
the diversity comes from key-part2?
my first idea to use this data in perl was a bit naive: create a hash
of hashes, so that I can find data or iterate over all data items that
match only one of the two keys.

Do you actually need to be able to do that quickly in both directions?
What are the exact operations that have to be supported quickly?
Something like $data1->{ibm}->{192601} and $data1->{192601}->{ibm}.

What would the value in $data1->{ibm}->{192601} be? An arrayref of
[data1,data2,data3]?

Perhaps you could just tie your hash to something like DBM::Deep. I'm
quite fond of that module, especially when I need a pure-Perl-only solution.
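Something like this (a sketch; the file name is made up):

use strict;
use warnings;
use DBM::Deep;

# the whole nested structure lives in this file, not in RAM
my $db = DBM::Deep->new('quotes.db');

# fill it once, then use it like an ordinary hash of hashes
$db->{ibm}{'2003/01'} = [ 0.2, 0.3, 0.4 ];
my $row = $db->{ibm}{'2003/01'};
print "$row->[0]\n";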
great idea indeed, except
that after it gobbled up about 4GB of RAM, my perl program died. it
would have been nice if it had worked.

I can think of a couple of methods that I could use. I could read the
data with a C program, and then have perl query my C program (e.g.,
through a socket). yikes.

Yikes indeed. It is hard for me to dream up a situation that would
induce me to do this.
I could copy (yikes)

How about just moving it into a database, instead of copying it?
the data into a data
base and access it through a data base module, though I am not sure
what data base I should use for this purpose.

This seems like a good idea. I'm partial to mysql myself for such simple
projects.
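
The Perl side of that is not much code either; a sketch (the database
name, credentials, and the 'quotes' table layout are placeholders):

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:mysql:database=quotes_db', 'user', 'password',
                       { RaiseError => 1 });

# two-key lookup; drop one condition to get every record for one key
my $rows = $dbh->selectall_arrayref(
    'SELECT data1, data2, data3 FROM quotes WHERE key1 = ? AND key2 = ?',
    undef, 'ibm', '2003/01');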

But before that, how do you arrive at the queries that you will submit to
the would-be database in the first place? If they are not interactive, but
rather you have a fixed set of queries to process, you can usually come up
with better text-based methods. For example, instead of storing the data
in a hash/database and then reading the queries and applying them to the
data hash, you can store the queries in the hash, and read through the data
a line at a time figuring out what query it pertains to.
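
A sketch of that idea, assuming the queries are key1,key2 pairs sitting
in a file (queries.txt is a made-up name):

use strict;
use warnings;

# read the queries into a hash first
my %wanted;
open my $q, '<', 'queries.txt' or die "queries.txt: $!";
while (<$q>) {
    chomp;
    $wanted{$_} = 1;          # each line looks like "ibm,2003/01"
}
close $q;

# then stream the big file once, keeping only lines that match a query
open my $csv, '<', 'data.csv' or die "data.csv: $!";
<$csv>;                       # skip the header line
while (my $line = <$csv>) {
    my ($k1, $k2) = split /,/, $line;
    print $line if $wanted{"$k1,$k2"};
}
close $csv;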

Or, you can often just arrange things such that just sorting the data file
in a particular way will accomplish most of the work you need to do.
(I need not one-key
access, but two key multiple-record access.) or I could do the
combination, and put the data into an SQL data base and learn SQL just
so that I can quickly access my data file. yikes and yikes. maybe
perl6 could do better, but perl6 isn't around yet. is there a way to
code so that perl5 becomes more memory efficient?

Yes, but the options will be painful and inflexible, and you would have to give us
far more information about exactly what it is you are going to be doing.
For example, you could keep the last 3 columns for each record as one
string of text, and reparse that string each time you need to access the
record. A string takes up much less space than a three-element array which
contains the split-up contents of that string.
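Roughly like this (a sketch):

use strict;
use warnings;

my %row;    # $row{key1}{key2} = "data1,data2,data3", kept as one unsplit string

open my $fh, '<', 'data.csv' or die "data.csv: $!";
<$fh>;      # skip the header
while (<$fh>) {
    chomp;
    my ($k1, $k2, $rest) = split /,/, $_, 3;
    $row{$k1}{$k2} = $rest;
}
close $fh;

# reparse only when a record is actually needed
my @fields = split /,/, $row{ibm}{'2003/01'};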
This can't be an obscure problem.

But it also isn't a general problem. There are a billion ways you can have
way too much data to fit into memory, and a billion things you can want to
do with it. There isn't one solution that fits all of those situations.

Xho
 

Ilya Zakharevich

[A complimentary Cc of this posting was sent to

Sure there is. Perl has all kinds of memory overhead. Heck, I'm surprised
it doesn't take more, given that in his example, each chunk of data is only
a few bytes.
Correct.

For each chunk, you have to have a full scalar struct
(about 20 bytes), plus you need the string storage (starts at about 12
bytes, even if the string is only one character long).

Wrong. With Perl's malloc(), the minimal allocated buffer is 4 bytes.
With the system's malloc() it is, obviously, system-dependent.
Then he has many, many hash structures, each with high overhead.

Correct.

Hope this helps,
Ilya
 

xhoster

Ilya Zakharevich said:
[A complimentary Cc of this posting was sent to

Sure there is. Perl has all kinds of memory overhead. Heck, I'm
surprised it doesn't take more, given that in his example, each chunk
of data is only a few bytes.
Correct.

For each chunk, you have to have a full scalar struct
(about 20 bytes), plus you need the string storage (starts at about 12
bytes, even if the string is only one character long).

Wrong. With Perl's malloc(), the minimal allocated buffer is 4 bytes.

That may be the minimal amount of space that Perl is capable of allocating,
but that does mean it is not the minimal amount of space that Perl actually
does allocate in any given circumstance. Indeed, under my 64 bit version
of perl it seems that the minimum space allocated for string storage for a
small string is 24 bytes, not the 12 bytes it was last time I checked on a
32 bit system.

#!/usr/bin/perl
use strict;
use warnings;

# Create a million empty scalars, then grow each by one character per pass,
# printing the process's resident set size (RSS) as reported by ps.
my @x;
push @x, '' foreach 0..1_000_000;
print "start\t", +(`ps -p $$ -o rss`)[1];
foreach my $size (1..1000) {
    $_ .= 'x' foreach @x;
    print "$size\t", +(`ps -p $$ -o rss`)[1];
}
__END__

Notice below how no meaningful additional space is needed until the strings
reach 24 bytes long.

$ perl scalar_size2.pl
start 82432
1 82436
2 82436
3 82436
4 82436
5 82436
6 82436
7 82436
8 82436
9 82436
10 82436
11 82436
12 82436
13 82436
14 82436
15 82436
16 82436
17 82436
18 82436
19 82436
20 82436
21 82436
22 82436
23 82436
24 98280
25 98280
26 98280
27 98280
28 98280
29 98280
30 98280
31 98280
32 98280
33 98280
34 98280
35 98280
36 98280
37 98280
38 98280
39 98280
40 113988
41 113988
42 113988
43 113988
44 113988
45 113988

Xho
 

Ilya Zakharevich

[A complimentary Cc of this posting was sent to

That may be the minimal amount of space that Perl is capable of allocating,
but that does mean it is not the minimal amount of space that Perl actually
does allocate in any given circumstance.

This does not even parse.
Indeed, under my 64 bit version
of perl it seems that the minimum space allocated for string storage for a
small string is 24 bytes, not the 12 bytes it was last time I checked on a
32 bit system.

As I explained, there is no "indeed", unless you specify that your
perl was built with usemymalloc=y.

Hope this helps,
Ilya
 

xhoster

Ilya Zakharevich said:
[A complimentary Cc of this posting was sent to

That may be the minimal amount of space that Perl is capable of
allocating, but that does mean it is not the minimal amount of space
that Perl actually does allocate in any given circumstance.

This does not even parse.

Misplaced "not".

That may be the minimal amount of space that Perl is capable of
allocating, but that does not mean it is the minimal amount of space
that Perl actually does allocate in any given circumstance.

Xho
 

Ilya Zakharevich

[A complimentary Cc of this posting was sent to
Misplaced "not".

That may be the minimal amount of space that Perl is capable of
allocating, but that does not mean it is the minimal amount of space
that Perl actually does allocate in any given circumstance.

Now it parses. And is BS.

(As I said) there is one circumstance where one can know *exactly*
what happens. Perl's malloc() uses EXACTLY 4-byte (or 8-byte, on
64-bit compiles) areas for small allocations (up to 4/8 bytes). Well,
there is some overhead, about 1 byte per allocation; so about 400 small
allocations will use a 2K arena.

Hope this helps,
Ilya
 

ivowel

thank you for all the advice. I solved my problem in a pragmatic but
not very convenient way. I now have a single hash that combines both
keys into one key and whose values are indexes into one long string
which contains all the data. This seems rather memory efficient, even
though it is not convenient and not in the spirit of my problem.
better than an external data base, external C, or other kludges,
however...
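
in case it helps anyone else, the scheme is roughly the following (a
sketch from memory, not my exact code; the pack format and file name
are arbitrary):

use strict;
use warnings;

my $blob  = '';    # every record's data fields, concatenated
my %index;         # "key1,key2" => packed (offset, length) into $blob

open my $fh, '<', 'data.csv' or die "data.csv: $!";
<$fh>;             # skip header
while (my $line = <$fh>) {
    chomp $line;
    my ($k1, $k2, $rest) = split /,/, $line, 3;
    $index{"$k1,$k2"} = pack 'NN', length($blob), length($rest);
    $blob .= $rest;
}
close $fh;

# look up one record and split it only when needed
sub record {
    my ($k1, $k2) = @_;
    my $packed = $index{"$k1,$k2"} or return;
    my ($start, $len) = unpack 'NN', $packed;
    return split /,/, substr($blob, $start, $len);
}

my @fields = record('ibm', '2003/01');   # (0.2, 0.3, 0.4)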

/iaw
 
