Writing huge Sets() to disk

Discussion in 'Python' started by Martin MOKREJŠ, Jan 10, 2005.

  1. Hi,
    I have sets.Set() objects having up to 20E20 items,
    each composed of up to 20 characters. Keeping
    them in memory on a 1 GB machine puts me quickly into swap.
    I don't want to use the dictionary approach, as I see no sense
    in storing None as a value. The items in a set are unique.

    How can I write them efficiently to disk? To be more exact,
    I have 20 sets. _set1 has 1E20 keys of size 1 character.

    alphabet = ('G', 'A', 'V', 'L', 'I', 'P', 'S', 'T', 'C', 'M', 'A', 'Q', 'F', 'Y', 'W', 'K', 'R', 'H', 'D', 'E')
    for aa1 in alphabet:
        # l = [aa1]
        # _set1.add(aa1)
        for aa2 in alphabet:
            # l.append(aa2)
            # _set2.add(''.join(l))
    [cut]

    The reason I went for sets instead of lists is the speed and the
    availability of the unique, common and other methods.
    What would you propose as an elegant solution?
    Actually, even those nested for loops take ages. :(
    M.
    Martin MOKREJŠ, Jan 10, 2005
    #1
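One simple way to get set items onto disk without holding everything in memory is to write them one per line as they are produced. A minimal sketch in current Python syntax (the function names and filename are illustrative, not from the thread):

```python
# Sketch: stream set items to disk one per line instead of keeping
# a huge in-memory structure; rebuild only what fits back in RAM.
def dump_items(items, path):
    with open(path, "w") as f:
        for item in items:
            f.write(item + "\n")

def load_items(path):
    # Read the dump back into a set (only sensible if it fits in memory).
    with open(path) as f:
        return set(line.rstrip("\n") for line in f)

dump_items(set(["GA", "AV", "VL"]), "set2.txt")
print(sorted(load_items("set2.txt")))  # ['AV', 'GA', 'VL']
```

Because the dump is a plain sorted-able text file, very large sets can also be compared on disk with external tools rather than in memory.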

  2. Paul McGuire

    Guest

    "Martin MOKREJŠ" <> wrote in message
    news:...
    > Hi,
    > I have sets.Set() objects having up to 20E20 items,
    > each composed of up to 20 characters. Keeping
    > them in memory on a 1 GB machine puts me quickly into swap.
    > I don't want to use the dictionary approach, as I see no sense
    > in storing None as a value. The items in a set are unique.
    >
    > How can I write them efficiently to disk? To be more exact,
    > I have 20 sets. _set1 has 1E20 keys of size 1 character.
    >
    > alphabet = ('G', 'A', 'V', 'L', 'I', 'P', 'S', 'T', 'C', 'M', 'A', 'Q',
    > 'F', 'Y', 'W', 'K', 'R', 'H', 'D', 'E')
    > for aa1 in alphabet:
    >     # l = [aa1]
    >     # _set1.add(aa1)
    >     for aa2 in alphabet:
    >         # l.append(aa2)
    >         # _set2.add(''.join(l))
    > [cut]
    >
    > The reason I went for sets instead of lists is the speed and the
    > availability of the unique, common and other methods.
    > What would you propose as an elegant solution?
    > Actually, even those nested for loops take ages. :(
    > M.


    _set1 only has 19 keys of size 1 character - 'A' is duplicated.

    Assuming you replace 'A' with another character (say 'Z'), then here is what
    you get:
    _set1 - 20 elements (20**1)
    _set2 - 400 elements (20**2)
    _set3 - 8000 elements (20**3)
    ...
    _set20 - 20**20 elements ~ 10^(20 * 1.301), or about 1E26

    If you have no duplicates in your alphabet, then you won't need to use sets;
    every combination will be unique. In this case, just use a generator.
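The counts above are easy to confirm directly (a quick check in current Python syntax):

```python
# Number of length-k strings over a 20-letter alphabet is 20**k.
n = 20
print(n ** 1, n ** 2, n ** 3)  # 20 400 8000
print(n ** 20)                 # 104857600000000000000000000, about 1.05E26
```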

    Here's a recursive generator approach that may save you a bunch of nested
    editing (set maxDepth as high as you dare):

    alphabet = ('G', 'A', 'V', 'L', 'I', 'P', 'S', 'T', 'C', 'M', 'Z', 'Q', 'F',
    'Y', 'W', 'K', 'R', 'H', 'D', 'E')

    maxDepth = 3

    def genNextCombo(root=list(), depth=1):
        for a in alphabet:
            thisRoot = root + [a]
            yield "".join(thisRoot)
            if depth < maxDepth:
                for c in genNextCombo(thisRoot, depth+1):
                    yield c

    for c in genNextCombo():
        print c  # or write to file, or whatever



    -- Paul
    Paul McGuire, Jan 10, 2005
    #2

  3. Paul McGuire wrote:
    > "Martin MOKREJŠ" <> wrote in message
    > news:...
    >
    >> Hi,
    >> I have sets.Set() objects having up to 20E20 items,
    >> each composed of up to 20 characters. Keeping
    >> them in memory on a 1 GB machine puts me quickly into swap.
    >> I don't want to use the dictionary approach, as I see no sense
    >> in storing None as a value. The items in a set are unique.
    >>
    >> How can I write them efficiently to disk? To be more exact,
    >> I have 20 sets. _set1 has 1E20 keys of size 1 character.
    >>
    >> alphabet = ('G', 'A', 'V', 'L', 'I', 'P', 'S', 'T', 'C', 'M', 'A', 'Q',
    >> 'F', 'Y', 'W', 'K', 'R', 'H', 'D', 'E')
    >> for aa1 in alphabet:
    >>     # l = [aa1]
    >>     # _set1.add(aa1)
    >>     for aa2 in alphabet:
    >>         # l.append(aa2)
    >>         # _set2.add(''.join(l))
    >> [cut]
    >>
    >> The reason I went for sets instead of lists is the speed and the
    >> availability of the unique, common and other methods.
    >> What would you propose as an elegant solution?
    >> Actually, even those nested for loops take ages. :(
    >> M.

    >
    >
    > _set1 only has 19 keys of size 1 character - 'A' is duplicated.


    Ahh, you are right. That's a typo; I still have to figure out which character
    I should use instead, but 'Z' will do for the tests anyway.

    >
    > Assuming you replace 'A' with another character (say 'Z'), then here is what
    > you get:
    > _set1 - 20 elements (20**1)
    > _set2 - 400 elements (20**2)
    > _set3 - 8000 elements (20**3)
    > ...
    > _set20 - 20**20 elements ~ 10^(20 * 1.301), or about 1E26
    >
    > If you have no duplicates in your alphabet, then you won't need to use sets;
    > every combination will be unique. In this case, just use a generator.


    As I noted in a later response, I will actually compare these ideal sets to some
    real-world examples. I have no expectation of how many entries will be common
    and how many unique. The only thing I know is that even in the real world I might
    be in size ranges of 2E6 or perhaps 2E8.

    > Here's a recursive generator approach that may save you a bunch of nested
    > editing (set maxDepth as high as you dare):
    >
    > alphabet = ('G', 'A', 'V', 'L', 'I', 'P', 'S', 'T', 'C', 'M', 'Z', 'Q', 'F',
    > 'Y', 'W', 'K', 'R', 'H', 'D', 'E')
    >
    > maxDepth = 3
    >
    > def genNextCombo(root=list(), depth=1):
    >     for a in alphabet:
    >         thisRoot = root + [a]
    >         yield "".join(thisRoot)
    >         if depth < maxDepth:
    >             for c in genNextCombo(thisRoot, depth+1):
    >                 yield c
    >
    > for c in genNextCombo():
    >     print c  # or write to file, or whatever


    Works nicely, but http://www.python.org/doc/2.3.4/ref/yield.html
    doesn't really explain what happens here. :( Still, it's quite fast and
    definitely saves writing all those stupid nested for loops by hand.
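For what it's worth, the yield order becomes visible with a trimmed-down version of Paul's generator (a 2-letter alphabet at depth 2): each call yields its own prefix, then re-yields whatever the deeper call produces.

```python
# Trimmed-down version of the recursive generator, small enough
# to trace by hand: prefixes come out in depth-first order.
alphabet = ('A', 'B')
maxDepth = 2

def genNextCombo(root=(), depth=1):
    for a in alphabet:
        thisRoot = root + (a,)
        yield "".join(thisRoot)          # emit the current prefix
        if depth < maxDepth:
            for c in genNextCombo(thisRoot, depth + 1):
                yield c                  # pass deeper results upward

print(list(genNextCombo()))  # ['A', 'AA', 'AB', 'B', 'BA', 'BB']
```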

    Thanks Paul!
    M.
    Martin MOKREJŠ, Jan 10, 2005
    #3
  4. John Lenton

    Guest

    You probably want to look into building set-like objects on top of
    tries, given the homogeneity of your language. You should see
    improvements in both size and speed.
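A minimal sketch of the idea, using nested dicts as trie nodes (illustrative only; `TrieSet` is a made-up name, not a drop-in replacement for sets.Set). Shared prefixes are stored once, which is the whole point for strings drawn from a small alphabet:

```python
# Minimal trie-backed set sketch: each node is a dict mapping the
# next character to a child node; a sentinel key marks word ends.
_END = object()

class TrieSet:
    def __init__(self):
        self.root = {}

    def add(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[_END] = True

    def __contains__(self, word):
        node = self.root
        for ch in word:
            if ch not in node:
                return False
            node = node[ch]
        return _END in node

t = TrieSet()
t.add("GAV")
t.add("GAL")
print("GAV" in t, "GA" in t)  # True False
```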
    John Lenton, Jan 10, 2005
    #4
  5. "John Lenton" <> writes:

    > You probably want to look into building set-like objects on top of
    > tries, given the homogeneity of your language. You should see
    > improvements in both size and speed.


    Ternary search trees give _much_ better space-efficiency than
    tries, at the expense of only slightly slower build and search times.
    That matters especially here, as the OP mentioned he could have huge
    sets of data.
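A bare-bones sketch of a ternary search tree (illustrative; `tst_insert` and `tst_contains` are made-up names): each node stores a single character plus lo/eq/hi links, so it avoids the 20-way branching of a plain trie node.

```python
# Ternary search tree sketch: traverse lo/hi for characters that
# compare below/above the node's character, eq to advance a position.
class TSTNode:
    __slots__ = ("ch", "lo", "eq", "hi", "end")
    def __init__(self, ch):
        self.ch = ch
        self.lo = self.eq = self.hi = None
        self.end = False  # True if a stored word ends at this node

def tst_insert(node, word, i=0):
    ch = word[i]
    if node is None:
        node = TSTNode(ch)
    if ch < node.ch:
        node.lo = tst_insert(node.lo, word, i)
    elif ch > node.ch:
        node.hi = tst_insert(node.hi, word, i)
    elif i + 1 < len(word):
        node.eq = tst_insert(node.eq, word, i + 1)
    else:
        node.end = True
    return node

def tst_contains(node, word, i=0):
    if node is None:
        return False
    ch = word[i]
    if ch < node.ch:
        return tst_contains(node.lo, word, i)
    if ch > node.ch:
        return tst_contains(node.hi, word, i)
    if i + 1 < len(word):
        return tst_contains(node.eq, word, i + 1)
    return node.end

root = None
for w in ("GAV", "GAL", "AVL"):
    root = tst_insert(root, w)
print(tst_contains(root, "GAL"), tst_contains(root, "GA"))  # True False
```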


    br,
    S
    Simo Melenius, Jan 10, 2005
    #5
  6. Simo Melenius wrote:
    > "John Lenton" <> writes:
    >
    >
    >> You probably want to look into building set-like objects on top of
    >> tries, given the homogeneity of your language. You should see
    >> improvements in both size and speed.

    >
    >
    > Ternary search trees give _much_ better space-efficiency than
    > tries, at the expense of only slightly slower build and search times.
    > That matters especially here, as the OP mentioned he could have huge
    > sets of data.


    Hi Simo and John,
    would you please point me to some docs so I can learn what you are talking about? ;)
    Many thanks!
    Martin
    Martin MOKREJŠ, Jan 10, 2005
    #6
  7. John Lenton

    Guest

    On Tue, Jan 11, 2005 at 12:33:42AM +0200, Simo Melenius wrote:
    > "John Lenton" <> writes:
    >
    > > You probably want to look into building set-like objects on top of
    > > tries, given the homogeneity of your language. You should see
    > > improvements in both size and speed.

    >
    > Ternary search trees give _much_ better space-efficiency than
    > tries, at the expense of only slightly slower build and search times.
    > That matters especially here, as the OP mentioned he could have huge
    > sets of data.


    hmm! sounds like *some*body needs to go read up on ternary trees,
    then.

    Ok, ok, I'm going...

    --
    John Lenton () -- Random fortune:
    Fortune finishes the great quotations, #6

    "But, soft! What light through yonder window breaks?"
    It's nothing, honey. Go back to sleep.

    John Lenton, Jan 11, 2005
    #7
  8. On Mon, 10 Jan 2005 17:11:09 +0100, Martin MOKREJŠ <> wrote:

    >Hi,
    > I have sets.Set() objects having up to 20E20 items,

    What notation are you using when you write 20E20?
    IOW, ISTM 1E9 is a billion. So 20E20 would be 2000 billion billion.
    Please clarify ;-)

    >each composed of up to 20 characters. Keeping
    >them in memory on a 1 GB machine puts me quickly into swap.
    >I don't want to use the dictionary approach, as I see no sense
    >in storing None as a value. The items in a set are unique.
    >
    > How can I write them efficiently to disk? To be more exact,
    >I have 20 sets. _set1 has 1E20 keys of size 1 character.
    >
    >alphabet = ('G', 'A', 'V', 'L', 'I', 'P', 'S', 'T', 'C', 'M', 'A', 'Q', 'F', 'Y', 'W', 'K', 'R', 'H', 'D', 'E')
    >for aa1 in alphabet:
    >    # l = [aa1]
    >    # _set1.add(aa1)
    >    for aa2 in alphabet:
    >        # l.append(aa2)
    >        # _set2.add(''.join(l))
    >[cut]
    >
    > The reason I went for sets instead of lists is the speed and the
    >availability of the unique, common and other methods.
    >What would you propose as an elegant solution?
    >Actually, even those nested for loops take ages. :(


    If you will explain a little what you are doing with these set "items"
    perhaps someone will think of another way to represent and use your data.

    Regards,
    Bengt Richter
    Bengt Richter, Jan 11, 2005
    #8
