I don't see how any module could reduce the memory usage of a standard
sort... well, except by sorting in place, as you said. Perl's sort works
like a pipe: a list goes in and a new list comes out, so it can't be
in-place.
If it's overflowing RAM, either switch to a database approach or get more
RAM (or increase virtual memory settings).
Try, for example:
open FH, "|sort"; # may fail on win32, or wherever sort is not available
#or worse yet call a horribly inefficient win32 sort command.
#but generally GNU sort is faster than perl sort for large datasets
#and I've compiled it to Win32 about 7 years ago; it wasn't too hard
Or use DB_File (there's a tie example further down), or one of the DBD::
database modules (a rough sketch of that follows).
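For the database route, the point is to let the database do the ordering on
disk instead of in RAM. Something like this, assuming DBD::SQLite is
installed (the file name and one-column schema here are made up for
illustration):

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect("dbi:SQLite:dbname=c:/upc_cache.sqlite", "", "",
                       { RaiseError => 1, AutoCommit => 1 });
$dbh->do("CREATE TABLE IF NOT EXISTS upc (code TEXT)");

# load the unsorted data once, inside one transaction for speed
$dbh->begin_work;
my $ins = $dbh->prepare("INSERT INTO upc (code) VALUES (?)");
open my $in, "<", "c:/upc_cache.dat" or die "Cannot open cache file: $!";
while (my $code = <$in>) {
    chomp $code;
    $ins->execute($code);
}
close $in;
$dbh->commit;

# let the database do the ordering on disk and hand rows back one at a time
my $sth = $dbh->prepare("SELECT code FROM upc ORDER BY code");
$sth->execute;
while (my ($code) = $sth->fetchrow_array) {
    # process codes in sorted order; the full list never sits in RAM
}
$dbh->disconnect;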

BM
My Cygwin Perl 5.8.2 stays at around 5MB of memory while running this script
against a 10MB file (c:/upc_cache.dat), whether it's building the cache DB
or reading it back.
# tie.pl
use DB_File;

$filename = "c:/upc_cache.db";
$existed  = -e $filename;
$flags    = O_CREAT|O_RDWR;
$mode     = 0666;
$X = tie %hash, 'DB_File', $filename, $flags, $mode, $DB_BTREE
    or die "Cannot tie $filename: $!"; # low-precedence 'or', and $! (not $@) holds the error
unless ($existed) {
    print "Reading cache file, building DB\n";
    open IN, "<c:/upc_cache.dat" or die "Cannot open cache file: $!";
    while (defined($_ = <IN>)) {
        chomp;
        $hash{$_} = 1;   # equivalently: $status = $X->put($_, 1);
    }
    close IN;
}
@upcs = qw(333 982000154986 982000154985 982000154985 badupc);
for (@upcs) {
    print "\$hash{$_} = $hash{$_} \n";
}
# spot-check the DB: print a tiny random sample of the keys
while (($k, $v) = each %hash) {
    print "$k\n" if rand() < 0.00001;
}
undef $X;
untie %hash;
__END__
Example output:
$ rm c:/upc_cache.db ; time perl tie.pl
Reading cache file, building DB
real 0m19.209s
user 0m16.243s
sys 0m1.031s
$ time perl tie.pl
$hash{333} =
$hash{982000154986} = 1
$hash{982000154985} =
$hash{982000154985} =
$hash{badupc} =
011170032586
08921833227
77796696762
80783900017
real 0m28.982s
user 0m27.319s
sys 0m0.690s
$ wc c:/upc_cache.dat
800506 800507 10029081 c:/upc_cache.dat
Okay, so for 800,000 rows it takes about half a minute... not too fast. (The
rand() calls aren't the bottleneck: 800,000 iterations of rand() take less
than 0.6 seconds.)
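Incidentally, DB_BTREE keeps its keys in sorted order on disk (lexical order
by default), so the tied hash above doubles as a low-memory sort: walking it
with each() hands the keys back already ordered, one pair at a time. You can
see that in the sample output above, where the keys came out in ascending
order. A minimal sketch that reuses the DB file built by the script:

use DB_File;
$filename = "c:/upc_cache.db";
$X = tie %hash, 'DB_File', $filename, O_RDWR, 0666, $DB_BTREE
    or die "Cannot tie $filename: $!";
# BTREE traversal comes back in key order, one pair at a time
while (($k, $v) = each %hash) {
    print "$k\n";
}
undef $X;
untie %hash;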
You might find GNU sort to be much faster:
$ time sort c:/upc_cache.dat -o c:/upc_cache.out
real 0m1.913s
user 0m1.261s
sys 0m0.360s
True, that file is already sorted... so maybe this is a fairer test:
$ time perl -le 'for (1..800000){print rand()}' > c:/randjunk
real 0m23.897s
user 0m22.993s
sys 0m0.690s
$ time sort c:/randjunk -o c:/randjunk.srt
real 0m3.955s
user 0m3.304s
sys 0m0.500s
Roughly 4 seconds instead of 20-odd, for 800k records; certainly faster.
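If you want to drive GNU sort from inside Perl for the whole job, one way is
to write the unsorted records to a temporary file, shell out to sort, and
stream the result back. A sketch, assuming a GNU-ish sort is on the PATH
(not the built-in Win32 sort command); the data and file names are made up
for illustration:

use strict;
use warnings;
use File::Temp qw(tempfile);

my @records = map { int rand 1_000_000 } 1 .. 100_000;   # stand-in data

# dump the unsorted records to a temp file
my ($fh, $unsorted) = tempfile();
print $fh map { "$_\n" } @records;
close $fh or die "close: $!";

# let the external sort do the work on disk
my $sorted = "$unsorted.srt";
system("sort", "-n", "-o", $sorted, $unsorted) == 0
    or die "external sort failed: $?";

# stream the sorted result back one line at a time
open my $in, "<", $sorted or die "Cannot open $sorted: $!";
while (my $line = <$in>) {
    chomp $line;
    # process $line; only one record is in memory at a time
}
close $in;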
But if anyone knows of a faster sort for Perl, or an in-place one, it would
be interesting to know about.
Hey, an MSN search for Sort::InPlace found this:
http://search.cpan.org/~hrafnkell/Algorithm-SISort-0.14/SISort.pm
....
Sort returns a sorted copy of the array, but Sort_inplace sorts the array in
place (as the name suggests) and returns the number of comparisons done.
(Note that the sorting is always done in place, Sort just copies the array
before calling the internal sort routine.)
....
Maybe you can try that, if O(n**1.5) running time is adequate.
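I haven't tried it, but the synopsis looks roughly like this (if I'm reading
the docs right, the comparison sub gets its operands in @_ rather than
$a/$b):

use strict;
use warnings;
use Algorithm::SISort qw(Sort Sort_inplace);

my @upcs = qw(982000154986 333 982000154985 badupc);

# Sort_inplace rearranges @upcs itself and returns the number of
# comparisons made; Sort would return a sorted copy instead
my $comparisons = Sort_inplace { $_[0] cmp $_[1] } @upcs;

print "$_\n" for @upcs;
print "($comparisons comparisons)\n";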
I'm sorting a simple (but very large) array of arrays. The machine
it's running on (Win2K, ActiveState 5.6.1) has limited memory
resources, and when I call the sort, I get "Out of memory!". The same
script runs fine on a PC with enough RAM. I've increased the virtual
memory setting, but I think that's for code only (??).
Is there a minimal memory footprint sort module? I can't find any
likely candidates with PPM. I haven't coded a sort routine since ....
well, let's say it's been a while.
Related bonus round question(s): Could the out-of-memory problem
be caused when 'sort()' completes and tries to pass an array back?
Will 2 arrays (my original and the returned) exist for a brief period?
Would an in-place sort solve all my problems, plus stop that annoying
hair loss?
TIA.