Writing huge Sets() to disk

  • Thread starter Martin MOKREJŠ
  • Start date

Martin MOKREJŠ

Hi,
I have sets.Set() objects having up to 20E20 items,
each composed of up to 20 characters. Keeping
them in memory on a 1 GB machine quickly puts me into swap.
I don't want to use a dictionary approach, as I see no sense
in storing None as a value. The items in a set are unique.

How can I write them efficiently to disk? To be more exact,
I have 20 sets. _set1 has 1E20 keys of size 1 character.

alphabet = ('G', 'A', 'V', 'L', 'I', 'P', 'S', 'T', 'C', 'M', 'A', 'Q', 'F', 'Y', 'W', 'K', 'R', 'H', 'D', 'E')
for aa1 in alphabet:
    # l = [aa1]
    # _set1.add(aa1)
    for aa2 in alphabet:
        # l.append(aa2)
        # _set2.add(''.join(l))
[cut]

The reason I went for sets instead of lists is the speed and the
availability of unique, common and other methods.
What would you propose as an elegant solution?
Actually, even those nested for loops take ages. :(
M.
 
Paul McGuire

Martin MOKREJŠ said:
Hi,
I have sets.Set() objects having up to 20E20 items,
each composed of up to 20 characters. Keeping
them in memory on a 1 GB machine quickly puts me into swap.
I don't want to use a dictionary approach, as I see no sense
in storing None as a value. The items in a set are unique.

How can I write them efficiently to disk? To be more exact,
I have 20 sets. _set1 has 1E20 keys of size 1 character.

alphabet = ('G', 'A', 'V', 'L', 'I', 'P', 'S', 'T', 'C', 'M', 'A', 'Q',
            'F', 'Y', 'W', 'K', 'R', 'H', 'D', 'E')
for aa1 in alphabet:
    # l = [aa1]
    # _set1.add(aa1)
    for aa2 in alphabet:
        # l.append(aa2)
        # _set2.add(''.join(l))
[cut]

The reason I went for sets instead of lists is the speed and the
availability of unique, common and other methods.
What would you propose as an elegant solution?
Actually, even those nested for loops take ages. :(
M.

_set1 only has 19 keys of size 1 character - 'A' is duplicated.

Assuming you replace 'A' with another character (say 'Z'), then here is what
you get:
_set1 - 20 elements (20**1)
_set2 - 400 elements (20**2)
_set3 - 8000 elements (20**3)
...
_set20 - 20**20 elements, i.e. 10**(1.301*20) or about 1E26

If you have no duplicates in your alphabet, then you won't need to use sets;
every combination will be unique. In this case, just use a generator.

Here's a recursive generator approach that may save you a bunch of nested
editing (set maxDepth as high as you dare):

alphabet = ('G', 'A', 'V', 'L', 'I', 'P', 'S', 'T', 'C', 'M', 'Z', 'Q', 'F',
            'Y', 'W', 'K', 'R', 'H', 'D', 'E')

maxDepth = 3

def genNextCombo(root=list(), depth=1):
    for a in alphabet:
        thisRoot = root + [a]
        yield "".join(thisRoot)
        if depth < maxDepth:
            for c in genNextCombo(thisRoot, depth+1):
                yield c

for c in genNextCombo():
    print c  # or write to file, or whatever
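(In more recent Python, the same family of strings can also be produced with itertools.product; a minimal sketch, assuming Python with the itertools module available. Note the enumeration order differs from the recursive version, which interleaves lengths:)

```python
from itertools import product

alphabet = ('G', 'A', 'V', 'L', 'I', 'P', 'S', 'T', 'C', 'M', 'Z', 'Q', 'F',
            'Y', 'W', 'K', 'R', 'H', 'D', 'E')
maxDepth = 3

def gen_combos():
    # Yield every string of length 1..maxDepth over the alphabet,
    # one at a time, without ever holding the full set in memory.
    for depth in range(1, maxDepth + 1):
        for combo in product(alphabet, repeat=depth):
            yield ''.join(combo)
```

Because the alphabet has no duplicates, every generated string is distinct, so no set is needed to deduplicate.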



-- Paul
 
Guest

Paul said:
Hi,
I have sets.Set() objects having up to 20E20 items,
each composed of up to 20 characters. Keeping
them in memory on a 1 GB machine quickly puts me into swap.
I don't want to use a dictionary approach, as I see no sense
in storing None as a value. The items in a set are unique.

How can I write them efficiently to disk? To be more exact,
I have 20 sets. _set1 has 1E20 keys of size 1 character.

alphabet = ('G', 'A', 'V', 'L', 'I', 'P', 'S', 'T', 'C', 'M', 'A', 'Q',
            'F', 'Y', 'W', 'K', 'R', 'H', 'D', 'E')
for aa1 in alphabet:
    # l = [aa1]
    # _set1.add(aa1)
    for aa2 in alphabet:
        # l.append(aa2)
        # _set2.add(''.join(l))
[cut]

The reason I went for sets instead of lists is the speed and the
availability of unique, common and other methods.
What would you propose as an elegant solution?
Actually, even those nested for loops take ages. :(
M.


_set1 only has 19 keys of size 1 character - 'A' is duplicated.

Ahh, you are right. That's a typo; I still have to figure out which character
to use. But 'Z' will do for the tests anyway.
Assuming you replace 'A' with another character (say 'Z'), then here is what
you get:
_set1 - 20 elements (20**1)
_set2 - 400 elements (20**2)
_set3 - 8000 elements (20**3)
...
_set20 - 20**20 elements, i.e. 10**(1.301*20) or about 1E26

If you have no duplicates in your alphabet, then you won't need to use sets;
every combination will be unique. In this case, just use a generator.

As I noted in a later response, I will actually compare these ideal sets to some
real-world examples. I have no expectation of how many entries will be common
and how many unique. The only thing I know is that even in the real world I
might be in size ranges of 2E6 or perhaps 2E8.
Here's a recursive generator approach that may save you a bunch of nested
editing (set maxDepth as high as you dare):

alphabet = ('G', 'A', 'V', 'L', 'I', 'P', 'S', 'T', 'C', 'M', 'Z', 'Q', 'F',
            'Y', 'W', 'K', 'R', 'H', 'D', 'E')

maxDepth = 3

def genNextCombo(root=list(), depth=1):
    for a in alphabet:
        thisRoot = root + [a]
        yield "".join(thisRoot)
        if depth < maxDepth:
            for c in genNextCombo(thisRoot, depth+1):
                yield c

for c in genNextCombo():
    print c  # or write to file, or whatever

Works nicely, but http://www.python.org/doc/2.3.4/ref/yield.html
doesn't really explain what happens here. :( Still, it is quite fast
and definitely saves those stupid recursively hand-written for
loops.
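(For what it's worth, a minimal sketch of what yield does: a generator function returns an iterator, and each yield hands one value to the caller and suspends the function until the next value is requested, so the full sequence never sits in memory at once. The count_up name is just for illustration:)

```python
def count_up(limit):
    # Execution pauses at each 'yield' and resumes from that exact
    # point the next time a value is requested from the iterator.
    n = 0
    while n < limit:
        yield n
        n += 1

gen = count_up(3)
first = next(gen)   # 0
rest = list(gen)    # [1, 2] -- picks up where it left off
```

This is why the recursive genNextCombo above is fast: it produces one combination at a time instead of building all of them first.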

Thanks Paul!
M.
 
John Lenton

you probably want to look into building set-like objects on top of
tries, given the homogeneity of your language. You should see
improvements both in size and speed.
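(A minimal sketch of the idea, not John's code: a trie stores each shared prefix only once, so a set of many overlapping strings takes far less space than storing every string whole. The TrieSet name and dict-of-dicts node layout are illustrative assumptions:)

```python
class TrieSet:
    """Set of strings backed by a trie of nested dicts."""

    _END = '$'  # marker key: the path down to this node spells a member

    def __init__(self):
        self.root = {}

    def add(self, word):
        node = self.root
        for ch in word:
            # Walk down, creating a child dict per character as needed;
            # a prefix shared with an earlier word reuses existing nodes.
            node = node.setdefault(ch, {})
        node[self._END] = True

    def __contains__(self, word):
        node = self.root
        for ch in word:
            if ch not in node:
                return False
            node = node[ch]
        return self._END in node
```

With a 20-letter alphabet, the strings 'GAV' and 'GA' share the nodes for 'G' and 'A', which is where the space saving comes from.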
 
Simo Melenius

John Lenton said:
you probably want to look into building set-like objects on top of
tries, given the homogeneity of your language. You should see
improvements both in size and speed.

Ternary search trees give _much_ better space-efficiency compared to
tries, at the expense of only slightly slower build and search time.
This is especially essential as the OP mentioned he could have huge
sets of data.
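(A minimal sketch of a ternary search tree, under the usual textbook formulation rather than any particular library: each node holds one character plus three links — less-than, equal, greater-than — so there is no per-node dict, which is where the space saving over a plain trie comes from. The tst_* names are illustrative:)

```python
class TSTNode:
    __slots__ = ('ch', 'lo', 'eq', 'hi', 'end')

    def __init__(self, ch):
        self.ch = ch
        self.lo = self.eq = self.hi = None
        self.end = False  # True if a stored string ends at this node

def tst_insert(node, word, i=0):
    # Insert word[i:] below node, returning the (possibly new) subtree root.
    ch = word[i]
    if node is None:
        node = TSTNode(ch)
    if ch < node.ch:
        node.lo = tst_insert(node.lo, word, i)
    elif ch > node.ch:
        node.hi = tst_insert(node.hi, word, i)
    elif i + 1 < len(word):
        node.eq = tst_insert(node.eq, word, i + 1)
    else:
        node.end = True
    return node

def tst_contains(node, word, i=0):
    if node is None:
        return False
    ch = word[i]
    if ch < node.ch:
        return tst_contains(node.lo, word, i)
    if ch > node.ch:
        return tst_contains(node.hi, word, i)
    if i + 1 < len(word):
        return tst_contains(node.eq, word, i + 1)
    return node.end
```

Lookup follows lo/hi links without consuming a character and the eq link while consuming one, so search cost is roughly proportional to the string length plus a few comparisons per character.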


br,
S
 
Martin MOKREJŠ

Simo said:
Ternary search trees give _much_ better space-efficiency compared to
tries, at the expense of only slightly slower build and search time.
This is especially essential as the OP mentioned he could have huge
sets of data.

Hi Simo and John,
would you please point me to some docs so I can learn what you are talking about? ;)
Many thanks!
Martin
 
John Lenton

Ternary search trees give _much_ better space-efficiency compared to
tries, at the expense of only slightly slower build and search time.
This is especially essential as the OP mentioned he could have huge
sets of data.

hmm! sounds like *some*body needs to go read up on ternary trees,
then.

Ok, ok, I'm going...

--
John Lenton ([email protected]) -- Random fortune:
Fortune finishes the great quotations, #6

"But, soft! What light through yonder window breaks?"
It's nothing, honey. Go back to sleep.

 
Bengt Richter

Hi,
I have sets.Set() objects having up to 20E20 items,
What notation are you using when you write 20E20?
IOW, ISTM 1E9 is a billion. So 20E20 would be 2000 billion billion.
Please clarify ;-)
each composed of up to 20 characters. Keeping
them in memory on a 1 GB machine quickly puts me into swap.
I don't want to use a dictionary approach, as I see no sense
in storing None as a value. The items in a set are unique.

How can I write them efficiently to disk? To be more exact,
I have 20 sets. _set1 has 1E20 keys of size 1 character.

alphabet = ('G', 'A', 'V', 'L', 'I', 'P', 'S', 'T', 'C', 'M', 'A', 'Q', 'F', 'Y', 'W', 'K', 'R', 'H', 'D', 'E')
for aa1 in alphabet:
    # l = [aa1]
    # _set1.add(aa1)
    for aa2 in alphabet:
        # l.append(aa2)
        # _set2.add(''.join(l))
[cut]

The reason I went for sets instead of lists is the speed and the
availability of unique, common and other methods.
What would you propose as an elegant solution?
Actually, even those nested for loops take ages. :(

If you will explain a little what you are doing with these set "items"
perhaps someone will think of another way to represent and use your data.

Regards,
Bengt Richter
 
