combinations script - out of memory - optimization?

A

Andrew

Hi, folks

a while ago i obtained a script that computes all unique subsets of a
set of N elements, which I am trying to apply to a new task.

The script works fine when N<14, but past that value my server gives up
with an "Out of memory!" message over CGI (server has 2GB RAM).

This is a one-time computation I need to do (not a regularly accessed
CGI script), so I can tolerate long waits, although anything beyond 30
minutes is worrisome (I may need to do last-minute update before a
deadline). A little about the application: I need to find out the most
balanced distribution (split) of a group of clusters, into two groups
of clusters (clusters themselves cannot be broken up). Clusters have
varying sizes, and it is the "size" quantity which I aggregate over the
two groups. The closer the two aggregate size quantities, the more
balanced the distribution. (There is a second-level sort to "break a
tie" but that's not essential to this post)


I realize that the underlying algorithm here is not Perl-specific, but
I am expressing it in Perl, and the "out of memory" phenomenon probably
has something to do with the Perl's memory management and/or with the
options of memory management available to script authors.

My current need is to compute combinations from a set of 32 elements.
At least process 32C15, 32C16, or thereabouts. Sadly, 32C16=601080390
.. The script below gives up at 32C6 (after processing only 242824
combinations). Is there nothing to be done besides seeking better
hardware? I'm afraid, though, that, given the exponential escalation of
memory usage, even some supercomputing, processor-clustering behemoth
at MIT couldn't handle this...(although that sounds unlikely).

Is there any way to optimize the code below, (clearing memory at
runtime, etc.)? Some completely different code? Could I discard
unwanted combinations (combinations with blatantly below-expected
balance) even while the recursion is in progress? (Can't quite figure
out how).

Here is the code:


sub combos {

my( $levels_left, $selection, @items ) = @_;
return () if $levels_left > @items;
return $selection unless $levels_left;
my @sets;

while( @items ) {
my $new_selection = $selection . ':'. shift @items;
push @sets,
combos( $levels_left-1, $new_selection, @items );
}

return @sets;
}


@combinations=combos( 16, '', (0..31));

-----------
returns an array of strings, each string containing colon-joined
members of array (0..31), a unique combination.

TIA

andrew
 
I

it_says_BALLS_on_your forehead

Andrew said:
Hi, folks

a while ago i obtained a script that computes all unique subsets of a
set of N elements, which I am trying to apply to a new task.

The script works fine when N<14, but past that value my server gives up
with an "Out of memory!" message over CGI (server has 2GB RAM).

--8 said:
Is there any way to optimize the code below, (clearing memory at
runtime, etc.)? Some completely different code? Could I discard
unwanted combinations (combinations with blatantly below-expected
balance) even while the recursion is in progress? (Can't quite figure
out how).

yes, in cases where memory and/or performance become an issue, favor an
iterative solution over a recursive one. i would be more helpful, but
need to wrap my brain around this one when i have time. for now, i hope
this points you in the right direction.
 
X

xhoster

Andrew said:
Hi, folks

a while ago i obtained a script that computes all unique subsets of a
set of N elements, which I am trying to apply to a new task.

The script works fine when N<14, but past that value my server gives up
with an "Out of memory!" message over CGI (server has 2GB RAM).
....

My current need is to compute combinations from a set of 32 elements.
At least process 32C15, 32C16, or thereabouts. Sadly, 32C16=601080390
. The script below gives up at 32C6 (after processing only 242824
combinations). Is there nothing to be done besides seeking better
hardware?
....

You need to iterate over the list rather than building the whole thing
in memory at once. Try Math::Combinatorics with the next_combination
iterator.

Xho
 
D

Dr.Ruud

Andrew schreef:
a while ago i obtained a script that computes all unique subsets of a
set of N elements, which I am trying to apply to a new task.

Take a buffer of size N, and an alfabet of size (N+1).

buffer : ' ' x 32
alfabet: [ 0-9A-V] (for example, leading space means empty)

My current need is to compute combinations from a set of 32 elements.
At least process 32C15, 32C16, or thereabouts.

OK, shrink the buffer to M (M=15 or 16). Loop while 'increasing' the
code in the buffer:

' '
' 0'
' 1'
:
' U'
' V'
' 00'
' 01'
:

' 0U'
' 0V'
' 000'
:
'VVVVVVVVVVVVVVVV'

If the same should not appear more than once in the combination, skip
that value in the loop.

next if /(.).*\1/;

Write interesting buffer values (or just its rank) to a file. Such code
should use hardly any memory.
 
A

Andrew

Dr.Ruud said:
Andrew schreef:
a while ago i obtained a script that computes all unique subsets of a
set of N elements, which I am trying to apply to a new task.

Take a buffer of size N, and an alfabet of size (N+1).

buffer : ' ' x 32
alfabet: [ 0-9A-V] (for example, leading space means empty)

My current need is to compute combinations from a set of 32 elements.
At least process 32C15, 32C16, or thereabouts.

OK, shrink the buffer to M (M=15 or 16). Loop while 'increasing' the
code in the buffer:

' '
' 0'
' 1'
:
' U'
' V'
' 00'
' 01'
:

' 0U'
' 0V'
' 000'
:
'VVVVVVVVVVVVVVVV'

If the same should not appear more than once in the combination, skip
that value in the loop.

next if /(.).*\1/;

Write interesting buffer values (or just its rank) to a file. Such code
should use hardly any memory.

To xhoster, regarding Math::Combinatorics and next_combination() :

it works, thanks. It's not fast but, at least, no memory failures.

On my athlon 3200, 1GB RAM it produces 32C16 combinations at the rate
of 1 million per 3 minutes - which means I'd have to run it for about
30 hours to produce all the 601 million combinations.

(I dread having to process an extra cluster or two at the last minute
:)

To Dr. Ruud, regarding buffer and alphabet:

Your idea leads to interesting twists and contemplations. Something
vaguely reminiscent of defining one's own operators and value types.

I fiddled with that a little bit, but got stuck on my incrementation
algorithm. I deviated from your schema by simply keeping an array of
purely numeric place holders:
E.g.,
map { 0 } (0..31)

as the starting, "zero" complex value (buffer).

(As opposed to, and equivalent to [0-9a-v]. I thought mine would
"increment" easier)

Then tried something like:

$N=5; $M=3 # (for starters, 5C3 )
$highest_digit = $N-1; # a pseudodigit - these can actually be
double-digit values

foreach $i (0..($N**$M)) {
@buffer=increment(\@buffer); # (the equivalent of $x++ in
R1-space :)
}


sub increment {

my @buff=@{$_[0]};
my $carryover=1;

foreach $place (reverse(0..$M)) {
return @buff unless($carryover);

if($buff[$place]<$highest_digit) {
$buff[$place]++;
$carryover=0;
} else {
$buff[$place]=0;
$carryover=1;
}

}

}


For some reason, the above worked but only to three places. I'll return
to this when I have more time. I would need to devise an efficient
redundancy elimination, as well, for my own version.

thanks to all who replied.

andrew
 
D

Dr.Ruud

Andrew schreef:
On my athlon 3200, 1GB RAM it produces 32C16 combinations at the rate
of 1 million per 3 minutes - which means I'd have to run it for about
30 hours to produce all the 601 million combinations.

Are all those combinations interesting?

Suppose the alphabet is '0123', where '0' means empty.
Without order there are only 7 unique non-empty combinations:

'001' (same as 010 and 100)
'002' (same as 020 and 200)
'003' (same as 030 and 300)
'012' (same as 021 and 102 and 120 and 201 and 210)
'013' (same as 031 and 103 and 130 and 301 and 310)
'023' (same as 032 and 203 and 230 and 302 and 320)
'123' (same as 132 and 213 and 231 and 312 and 321)

The '0' is the only character that can occur more than once, and only in
the leading positions. In each combination, the characters other than 0
can occur only once, and are in ascending order.
 
R

robic0

Hi, folks

a while ago i obtained a script that computes all unique subsets of a
set of N elements, which I am trying to apply to a new task.

The script works fine when N<14, but past that value my server gives up
with an "Out of memory!" message over CGI (server has 2GB RAM).

This is a one-time computation I need to do (not a regularly accessed
CGI script), so I can tolerate long waits, although anything beyond 30
minutes is worrisome (I may need to do last-minute update before a
deadline). A little about the application: I need to find out the most
balanced distribution (split) of a group of clusters, into two groups
of clusters (clusters themselves cannot be broken up). Clusters have
varying sizes, and it is the "size" quantity which I aggregate over the
two groups. The closer the two aggregate size quantities, the more
balanced the distribution. (There is a second-level sort to "break a
tie" but that's not essential to this post)


I realize that the underlying algorithm here is not Perl-specific, but
I am expressing it in Perl, and the "out of memory" phenomenon probably
has something to do with the Perl's memory management and/or with the
options of memory management available to script authors.

My current need is to compute combinations from a set of 32 elements.
At least process 32C15, 32C16, or thereabouts. Sadly, 32C16=601080390
. The script below gives up at 32C6 (after processing only 242824
combinations). Is there nothing to be done besides seeking better
hardware? I'm afraid, though, that, given the exponential escalation of
memory usage, even some supercomputing, processor-clustering behemoth
at MIT couldn't handle this...(although that sounds unlikely).

Is there any way to optimize the code below, (clearing memory at
runtime, etc.)? Some completely different code? Could I discard
unwanted combinations (combinations with blatantly below-expected
balance) even while the recursion is in progress? (Can't quite figure
out how).

Here is the code:


sub combos {

my( $levels_left, $selection, @items ) = @_;
return () if $levels_left > @items;
return $selection unless $levels_left;
my @sets;

while( @items ) {
my $new_selection = $selection . ':'. shift @items;
push @sets,
combos( $levels_left-1, $new_selection, @items );
}

return @sets;
}


@combinations=combos( 16, '', (0..31));

-----------
returns an array of strings, each string containing colon-joined
members of array (0..31), a unique combination.

TIA

andrew

Sadly, you should consult a math book for the formula, and not Perl !
If you don't know math, you should start with algegra-1 and take it from
there. Or perhaps you would like to calculate the sum, with Perl when
i->infinity... Although there's numerical techniques to do that but
you have to take the course, or go to the library.

Here's an interesting problem for you, I pose, instead and one that requires
a Phd to solve. I did it in 15 minutes with a hp-51c calculator, pencil and
paper in 1983.

I have a circle. I want to divide it up into 3 equal areas. The division
lines are horizontal. How far do I place the lines away from the center?

I'll give you 6 months to tell me...
 
A

Andrew

Dr.Ruud said:
Andrew schreef:


Are all those combinations interesting?

No, certainly not - and I intend to set a very high "filtration" rate,
and accept only, say, half a percent or some such. But I still have to
iterate over /everything/ and subject everything to an
interesting-vs-uninteresting test, don't I?

When I reported the above rate of 1 mil/3min, that's just the rate at
which Math::Combinatorics produces the unique combinations. I haven't
even run the script with the accept/reject test; obviously, the latter
would slow things down even more.

Suppose the alphabet is '0123', where '0' means empty.
Without order there are only 7 unique non-empty combinations:

'001' (same as 010 and 100)
'002' (same as 020 and 200)
'003' (same as 030 and 300)
'012' (same as 021 and 102 and 120 and 201 and 210)
'013' (same as 031 and 103 and 130 and 301 and 310)
'023' (same as 032 and 203 and 230 and 302 and 320)
'123' (same as 132 and 213 and 231 and 312 and 321)

The '0' is the only character that can occur more than once, and only in
the leading positions. In each combination, the characters other than 0
can occur only once, and are in ascending order.

Yes, of course, I do understand that "012", for my purposes, is the
same as "102", since I am not looking for permutations but rather just
combinations.

I don't follow you on the leading zeros. When I defined my set of
pseudodigits (0..31), these correspond to the elements in my superset
(just as your alphabet ([ 0-9A-V]) does) I was treating the "0" as an
element of the set, like any other element (3, 10, 23, etc.)

In my specific case, 0..31 are indexes of an array (of arrays), and 0
is the first index of that array, so I treat it as all the rest.

thanks

andrew
 
J

Jürgen Exner

Andrew said:
No, certainly not - and I intend to set a very high "filtration" rate,
and accept only, say, half a percent or some such. But I still have
to iterate over /everything/ and subject everything to an
interesting-vs-uninteresting test, don't I?

Not necessarily. Maybe (and I really don't know if this applies to your
specific data) it is possible to transform your interesting/uninteresting
test into a generator which generates only interesting combinations.

As I said, this may or may not be possible for your specific problem. But if
possible it would obviously reduce the load by serval orders of magnitude.

jue
 
A

Andrew

Jürgen Exner said:
Not necessarily. Maybe (and I really don't know if this applies to your
specific data) it is possible to transform your interesting/uninteresting
test into a generator which generates only interesting combinations.

OK, a good general idea to be mindful of (every now and then, revisit
the code of a frequently used module, and remember that it's there for
you to tweak and customize if need be (by "customize" i am trying to
suggest "demodularize" :)).

Thus, i can't see how I can optimize a while loop such as:

while(@combo=$c->next_combination()) {
.....

unless I hack the Math::Combinatorics module (and "demodularize" it).
Again, just iterating the above (without anything within the loop,
other than a counter!) yields the rate I reported.


A couple of self-corrections:
1) i believe the conventional term is "member", not "element" (member
of a set)
2) I said Dr. Ruud's alphabet ( [ 0-9A-V] ) was equivalent to my
(0..31). True, except for his blank space.

thanks

andrew
 
X

xhoster

Andrew said:
OK, a good general idea to be mindful of (every now and then, revisit
the code of a frequently used module, and remember that it's there for
you to tweak and customize if need be (by "customize" i am trying to
suggest "demodularize" :)).

Thus, i can't see how I can optimize a while loop such as:

while(@combo=3D$c->next_combination()) {
.....

unless I hack the Math::Combinatorics module (and "demodularize" it).


Yep. Thems the breaks. The fastest method for any given problem is rarely
the most general or modular method. If you aren't willing to wait, then
you have to be willing to work (to customize for your specific task).

It sounds like your problem is some kind of bin-packing problem. There are
almost certainly much more efficient solutions to that then iterating
through all possible combinations.


That said, here is an (almost) drop in replacement for this use-case of
Math::Combinatorics which is quite a bit faster:

#my $m=Math::Combinatorics->new(count=>16,data=>[0..31]);
my $m=combi->new(16,[0..31]);
while (my @foo = $m->next_combination()) {
#...
}

package combi;
sub new {
my ($class,$c,$data)=@_;
bless [0,[0..$c-1],$data]; # done, index, data
};
sub next_combination {
my $self=shift;
return if $self->[0];
my @return = @{$self->[1]}; # memorize current
if ( ++$self->[1][-1] == @{$self->[2]} ) { #run off the end, carry
--$self->[1][-1];
my $x=$#{$self->[1]}-1;
# Find first place with run to grow
$x-- until $x < 0 or $self->[1][$x]+1 < $self->[1][$x+1] ;
if ($x<0) {
$self->[0]=1; #Done!
} else {
$self->[1][$x]++;
## reset all things after the carried one to be as
## compact as possible
foreach my $i ($x+1..$#{$self->[1]}) {
$self->[1][$i]=$self->[1][$i-1]+1;
};
};
};
return map $self->[2][$_], @return;
};

1;
 
D

Dr.Ruud

Andrew schreef:
Dr.Ruud:

No, certainly not - and I intend to set a very high "filtration" rate,
and accept only, say, half a percent or some such. But I still have
to iterate over /everything/ and subject everything to an
interesting-vs-uninteresting test, don't I?

Not necessarily, as Jürgen already said.

I do understand that "012", for my purposes, is the
same as "102", since I am not looking for permutations but rather just
combinations.

I don't follow you on the leading zeros.

Read '-' as nothing. '--1' is a lonely '1', nothing else. '-12' is a '1'
and a '2'.

'--1' (same as '-1-' and '1--')
'--2' (same as '-2-' and '2--')
'--3' (same as '-3-' and '3--')
'-12' (same as '-21' and '1-2' and '12-' and '2-1' and '21-')
'-13' (same as '-31' and '1-3' and '13-' and '3-1' and '31-')
'-23' (same as '-32' and '2-3' and '23-' and '3-2' and '32-')
'123' (same as '132' and '213' and '231' and '312' and '321')

Now you need an iterator from ([1-3], |3|) to ('--1', '--2', '--3',
'-12', '-13', '-23', '123').

When I defined my set of
pseudodigits (0..31), these correspond to the elements in my superset
(just as your alphabet ([ 0-9A-V]) does) I was treating the "0" as an
element of the set, like any other element (3, 10, 23, etc.)

OK, but can an element also be undef, meaning: not occupied? This is not
really interesting, because 3C3 with one empty spot is 3C2, but could
simplify the generator.


See code below, that has considerable overhead in sub-calls.
20C10 takes about 2 minutes here.

#!/usr/bin/perl
use strict; use warnings;
use integer;

die "usage: $0 <num> <width>" if 2 != @ARGV;

use constant
{ _sym => '-0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ'
, R => +1 # Right
, L => -1 # Left
};
use constant N_sym => -1 + length _sym;
sub sym($) { substr _sym, $_[0], 1 }
sub ix_sym($) { index _sym, $_[0] }

my ($NUM, $WID) = @ARGV;
die "<num>=$NUM, should be (1 .. ". N_sym .")\a\t" if $NUM > N_sym;
die "<wid>=$WID, should be (1 .. $NUM)\a\t" if $WID > $NUM;

my $buf = sym(0) x $WID;
sub buf($) : lvalue { substr $buf, $_[0], 1 }

{ local ($,, $\) = ("\t", "\n");

my $r = 0; # row
my $b = 0; # position in buf
my $d = R; # direction
my $s = $NUM; # symbol-index

while (1)
{
while ($d == L) # backtracing
{
--$b;
$s = ix_sym( buf($b) ); # payload for non-recursivity
--$s;
$d = R if 0 <= $s;
}
last if 0 == $b and 0 == $s;

if ($b < $WID)
{
buf($b) = sym($s);
$b++;
if ($s > 0) { $s-- }; # special case for '-'
}
else
{
print ++$r.':', $buf;
$d = L;
}
}
}

I have a gut feeling that this can be optimized a lot, so maybe I will
come back on it. Try replacing strings by arrays/hashes, and get rid of
the index().
 
D

Dr.Ruud

Andrew schreef:
On my athlon 3200, 1GB RAM it produces 32C16 combinations at the rate
of 1 million per 3 minutes - which means I'd have to run it for about
30 hours to produce all the 601 million combinations.

This creates 616665 interesting combinations out of 20c10, in 32
seconds. The '0' means empty (no draw) again.

#!/usr/bin/perl
use strict; use warnings;
use integer;

die "usage: $0 <num> <width>" if 2 != @ARGV;

use constant N => 36; # length of alphabet

my ($Num, $Wid) = @ARGV;
die "<num>=$Num, should be (1 .. ". N .")\a\t" if $Num > N;
die "<wid>=$Wid, should be (1 .. $Num)\a\t" if $Wid > $Num;

my ($r, $b, $s, @buf) = ( 0, 0, $Num, map(0, 1 .. $Wid) );

while ($b != 0 or $s != 0)
{
if ($b < $Wid)
{
$buf[$b++] = $s--;
if ($s < 0) { $s = 0 }; # no draw
}
else
{
print ++$r . ":\t@buf\n"; # found one
do { $s = $buf[--$b] -1 } until $s >= 0;
}
}
 
I

Ilya Zakharevich

[A complimentary Cc of this posting was sent to
Andrew
On my athlon 3200, 1GB RAM it produces 32C16 combinations at the rate
of 1 million per 3 minutes - which means I'd have to run it for about
30 hours to produce all the 601 million combinations.

If I understood you question correctly, and you want to produce all
subsets of 16 element of 32-element set, then the following code
(untested for correctness, only for speed ;-) gives 500K/sec on
Athlon/850MHz.

Hope this helps,
Ilya

#!/usr/bin/perl -w
use strict;

my @by_1s;

sub count_1s ($) { # The stupidest algorithm, but the speed doesn't matter
my $n = shift;
die if $n >= 2**32 or $n < 0;
my $c = 0;
for my $bit ( 0..31 ) {
$c++ if $n & (1<<$bit)
}
return $c;
}

for my $n (0..(1<<16)-1) {
push @{$by_1s[count_1s $n]}, $n;
}

$| = 1; # Print '.' immediately
my $c = 0;
for my $in_first (0..16) {
my $in_second = 16 - $in_first;
for my $first (@{$by_1s[$in_first]}) {
for my $second (@{$by_1s[$in_second]}) {
my $out = ($first << 16) | $second;
print '.' unless ++$c & 0xfffff; # every 1M
}
}
}
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,768
Messages
2,569,574
Members
45,048
Latest member
verona

Latest Threads

Top