matching strings in a large set of strings

Discussion in 'Python' started by Karin Lagesen, Apr 29, 2010.

  1. Hello.

    I have approx 83 million strings, all 14 characters long. I need to be
    able to take another string and find out whether this one is present
    within the 83 million strings.

    Now, I have tried storing these strings as a list, a set and a dictionary.
    I know that finding things in a set and a dictionary is very much faster
    than working with a list, so I tried those first. However, I run out of
    memory building both the set and the dictionary, so what I seem to be left
    with is the list, and using the in method.

    I imagine that there should be a faster/better way than this?

    TIA,

    Karin
    Karin Lagesen, Apr 29, 2010
    #1

  2. Peter Otten (Guest)

    Karin Lagesen wrote:

    > I have approx 83 million strings, all 14 characters long. I need to be
    > able to take another string and find out whether this one is present
    > within the 83 million strings.
    >
    > Now, I have tried storing these strings as a list, a set and a dictionary.
    > I know that finding things in a set and a dictionary is very much faster
    > than working with a list, so I tried those first. However, I run out of
    > memory building both the set and the dictionary, so what I seem to be left
    > with is the list, and using the in method.
    >
    > I imagine that there should be a faster/better way than this?


    Do you need all matches or do you just have to know whether there are any?
    Can the search string be shorter than 14 characters?

    One simple improvement over the list may be using one big string instead of
    the 83 million short ones and then searching it using string methods.
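
    A minimal sketch of that idea, assuming every string is exactly 14
    characters and that one big string (roughly 1.2 GB for 83 million
    entries) still fits in memory:

    # Concatenate the fixed-width strings and search with str.find(),
    # accepting only hits that start on a 14-character boundary.
    WIDTH = 14
    strings = ["ABCDEFGHIJKLMN", "OPQRSTUVWXYZ01"]   # stand-ins for the 83M strings
    blob = "".join(strings)

    def present(blob, target, width=WIDTH):
        assert len(target) == width
        pos = blob.find(target)
        while pos != -1:
            if pos % width == 0:              # aligned hit: a real member
                return True
            pos = blob.find(target, pos + 1)
        return False

    print present(blob, "OPQRSTUVWXYZ01")   # True
    print present(blob, "XXXXXXXXXXXXXX")   # False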

    Peter
    Peter Otten, Apr 29, 2010
    #2

  3. Miki (Guest)


    > I have approx 83 million strings, all 14 characters long. I need to be
    > able to take another string and find out whether this one is present
    > within the 83 million strings.

    Have a look at the shelve module.
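
    A rough sketch of the shelve approach; the file names and the
    one-string-per-line input format are only assumptions for illustration:

    import shelve

    # Build once; the keys live on disk rather than in RAM.
    db = shelve.open('strings.shelf')
    with open('strings.txt') as infile:
        for line in infile:
            db[line.strip()] = 1             # only the keys matter
    db.close()

    # Query later.
    db = shelve.open('strings.shelf', flag='r')
    print 'ABCDEFGHIJKLMN' in db
    db.close()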

    If you want to write the algorithm yourself, I suggest
    http://en.wikipedia.org/wiki/Trie

    HTH,
    --
    Miki
    http://pythonwise.blogspot.com
    Miki, Apr 29, 2010
    #3
  4. Paul Rudin (Guest)

    "Karin Lagesen" <> writes:

    > Hello.
    >
    > I have approx 83 million strings, all 14 characters long. I need to be
    > able to take another string and find out whether this one is present
    > within the 83 million strings.
    >
    > Now, I have tried storing these strings as a list, a set and a dictionary.
    > I know that finding things in a set and a dictionary is very much faster
    > than working with a list, so I tried those first. However, I run out of
    > memory building both the set and the dictionary, so what I seem to be left
    > with is the list, and using the in method.
    >
    > I imagine that there should be a faster/better way than this?
    >


    Shouldn't a set with 83 million 14 character strings be fine in memory
    on a stock PC these days? I suppose if it's low on ram you might start
    swapping which will kill performance. Perhaps the method you're using to
    build the data structures creates lots of garbage? How much ram do you
    have and how much memory does the python process use as it builds your
    data structures?

    A set should give good performance if the target string is also 14
    characters.

    If you do end up with the list then try using bisect
    <http://docs.python.org/library/bisect.html> which should be quicker
    than just using "in" (which I think just scans linearly through the list
    testing for equality).
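
    Something along these lines (a sketch only; it still assumes the sorted
    list itself fits in memory):

    import bisect

    sorted_strings = sorted(["OPQRSTUVWXYZ01", "ABCDEFGHIJKLMN"])   # stand-in data

    def contains(sorted_strings, target):
        # bisect_left finds the insertion point in O(log n) comparisons
        i = bisect.bisect_left(sorted_strings, target)
        return i < len(sorted_strings) and sorted_strings[i] == target

    print contains(sorted_strings, "ABCDEFGHIJKLMN")   # True
    print contains(sorted_strings, "ZZZZZZZZZZZZZZ")   # False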

    There are other algorithms with better theoretical performance than bisect
    for this particular problem, but you get into trade-offs between executing
    things in Python as opposed to C once you start hand-coding things in
    Python.
    Paul Rudin, Apr 30, 2010
    #4
  5. On Fri, 30 Apr 2010 08:23:39 +0100, Paul Rudin wrote:

    > "Karin Lagesen" <> writes:
    >
    >> Hello.
    >>
    >> I have approx 83 million strings, all 14 characters long. I need to be
    >> able to take another string and find out whether this one is present
    >> within the 83 million strings.
    >>
    >> Now, I have tried storing these strings as a list, a set and a
    >> dictionary. I know that finding things in a set and a dictionary is
    >> very much faster than working with a list, so I tried those first.
    >> However, I run out of memory building both the set and the dictionary,
    >> so what I seem to be left with is the list, and using the in method.
    >>
    >> I imagine that there should be a faster/better way than this?
    >>
    >>

    > Shouldn't a set with 83 million 14 character strings be fine in memory
    > on a stock PC these days?


    Not even close. Using Python 2.6:

    >>> s = "12345678901234"
    >>> assert len(s) == 14
    >>> import sys
    >>> sys.getsizeof(s)

    38

    So a single 14 char string takes 38 bytes.


    >>> import random, string
    >>> chars = list(string.letters + string.digits)*4
    >>> def rnd_str():
    ...     random.shuffle(chars)
    ...     return ''.join(chars[:14])
    ...
    >>> s = set()
    >>> while len(s) < 83000:
    ...     s.add(rnd_str())
    ...
    >>> sys.getsizeof(s)
    1048688


    So a set with 83000 such strings takes approximately 1 MB. So far fairly
    trivial. But that's just the memory used by the container (the set), not
    the contents. 38 bytes * 83,000 strings = another 3 MB. Which of course
    is trivial for a modern PC, but the OP is using 83 million such strings,
    not 83 thousand, which gives us a grand total of at least 3 gigabytes. An
    entry level desktop PC these days is generally 2GB, and entry level
    notebooks might have half a gig.
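
    Spelling that out with the figures measured above (a rough lower bound
    that ignores the set's own hash table and the per-string pointers):

    bytes_per_string = 38                  # getsizeof() of one 14-char str, above
    print bytes_per_string * 83 * 10**6    # 3154000000 bytes, roughly 2.9 GiB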

    If the OP is on a 64 bit system, every pointer will be twice the size,
    leading to even larger memory requirements. And if the strings are
    unicode, then potentially they could be anything up to four times larger
    each. Worst case, the OP could need something of the order of 24 GB to
    store the strings all in memory.


    --
    Steven
    Steven D'Aprano, Apr 30, 2010
    #5
  6. Paul Rudin (Guest)

    Duncan Booth <> writes:

    > Paul Rudin <> wrote:
    >
    >> Shouldn't a set with 83 million 14 character strings be fine in memory
    >> on a stock PC these days? I suppose if it's low on ram you might start
    >> swapping which will kill performance. Perhaps the method you're using
    >> to build the data structures creates lots of garbage? How much ram do
    >> you have and how much memory does the python process use as it builds
    >> your data structures?

    >
    > Some simple experiments should show you that a stock PC running a 32 bit
    > Python will struggle:
    >
    >>>> s = "12345678901234"
    >>>> sys.getsizeof(s)

    > 38
    >>>> 83*38

    > 3154
    >
    > So more than 3GB just for the strings (and that's for Python 2.x; on
    > Python 3.x you'll need nearly 5GB).
    >
    > Running on a 64 bit version of Python should be fine, but for a 32 bit
    > system a naive approach just isn't going to work.


    It depends - a few gig of RAM can be cheap compared with programmer
    time. If you know you can solve a problem by spending a few euros on
    some extra RAM it can be a good solution! It depends of course where the
    code is being deployed - if it's something that's intended to be
    deployed widely then you can't expect thousands of people to go out and
    buy more RAM - but if it's a one-off deployment for a particular
    environment then it can be the best way to go.
    Paul Rudin, Apr 30, 2010
    #6
  7. News123 (Guest)

    Dennis Lee Bieber wrote:
    > On Thu, 29 Apr 2010 11:38:28 +0200, "Karin Lagesen"
    > <> declaimed the following in comp.lang.python:
    >
    >> Hello.
    >>
    >> I have approx 83 million strings, all 14 characters long. I need to be
    >> able to take another string and find out whether this one is present
    >> within the 83 million strings.
    >>


    >>

    > So don't load them into memory... First use a file-based (not memory
    >
    >
    > That lets you do a binary search on the file. Much faster than a
    > linear search (linear search will average out to 41.5M read operations;
    > binary should be around 10000 reads)


    Don't you mean 27 reads instead of 41.5 M reads?

    >>> from math import log
    >>> log(83e6)/log(2)
    26.306608000671101
    >>>
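
    For reference, a rough sketch of the file-based binary search being
    discussed, assuming the strings are sorted in a file of fixed 15-byte
    records (14 characters plus a newline); the file name is made up:

    import os

    RECLEN = 15   # 14 chars + '\n'

    def file_contains(path, target):
        nrecords = os.path.getsize(path) // RECLEN
        lo, hi = 0, nrecords
        with open(path, 'rb') as f:
            while lo < hi:                    # about 27 probes for 83M records
                mid = (lo + hi) // 2
                f.seek(mid * RECLEN)
                rec = f.read(RECLEN - 1)
                if rec == target:
                    return True
                elif rec < target:
                    lo = mid + 1
                else:
                    hi = mid
        return False

    # e.g. file_contains('sorted_strings.dat', 'ABCDEFGHIJKLMN')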



    N
    News123, May 1, 2010
    #7
  8. Duncan Booth, 30.04.2010 10:20:
    > So more than 3GB just for the strings (and that's for Python 2.x; on
    > Python 3.x you'll need nearly 5GB).
    >
    > Running on a 64 bit version of Python should be fine, but for a 32 bit
    > system a naive approach just isn't going to work.
    >
    > Option 1: use a trie. That should reduce storage, maybe it will reduce
    > it enough, maybe not. It depends on the data.


    Depending on the implementation and the data, a trie can also use a lot
    /more/ space than the set of strings that it contains. The 83 million 14
    character strings can well include a relatively large subset of the
    possible permutations (imagine purely numeric, hex-numeric or lower-case
    alpha-numeric strings, for example), so the trie will likely need to branch
    very often with very low intermediate run length. If you use pointers for
    trie branches, that's at least 8 bytes per branch on a 64bit system, versus
    1 byte per character in a byte string list. Depending on the ratio of
    branches to characters, one or the other may win. So a "naive approach"
    likely won't work for tries either.
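
    For concreteness, the kind of trie being discussed, in its most naive
    dict-per-node form (illustrative only; this is exactly the variant whose
    per-branch overhead is at issue here):

    def trie_add(root, s):
        node = root
        for ch in s:
            node = node.setdefault(ch, {})   # one dict per branch point
        node[None] = True                    # mark the end of a member string

    def trie_contains(root, s):
        node = root
        for ch in s:
            if ch not in node:
                return False
            node = node[ch]
        return None in node

    root = {}
    trie_add(root, "ABCDEFGHIJKLMN")
    print trie_contains(root, "ABCDEFGHIJKLMN")   # True
    print trie_contains(root, "ABCDEFGHIJKLMX")   # False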

    Stefan
    Stefan Behnel, May 1, 2010
    #8
  9. On Sat, 01 May 2010 13:48:02 +0200, News123 <> declaimed
    the following in gmane.comp.python.general:

    > Dennis Lee Bieber wrote:
    > > That lets you do a binary search on the file. Much faster than a
    > > linear search (linear search will average out to 41.5M read operations;
    > > binary should be around 10000 reads)

    >
    > Don't you mean 27 reads instead of 41.5 M reads?
    >

    Don't you mean my 10000 reads? The 41.5M is the average for the
    linear search.

    > >>> from math import log
    > >>> log(83e6)/log(2)
    > 26.306608000671101
    > >>>

    >

    Probably -- it was late at night and I was working things out in my
    head...
    --
    Wulfraed Dennis Lee Bieber AF6VN
    HTTP://wlfraed.home.netcom.com/
    Dennis Lee Bieber, May 1, 2010
    #9
  10. News123 (Guest)

    Dennis Lee Bieber wrote:
    > On Sat, 01 May 2010 13:48:02 +0200, News123 <> declaimed
    > the following in gmane.comp.python.general:
    >
    >> Dennis Lee Bieber wrote:
    >>> That lets you do a binary search on the file. Much faster than a
    >>> linear search (linear search will average out to 41.5M read operations;
    >>> binary should be around 10000 reads)

    >> Don't you mean 27 reads instead of 41.5 M reads?
    >>

    > Don't you mean my 10000 reads? The 41.5M is the average for the
    > linear search.
    >

    Indeed, this is what I meant. Or, phrased differently:
    "about 27 reads with a binary search instead of 41.5M reads average with
    a linear search."




    >>>>> from math import log
    >>>>> log(83e6)/log(2)
    >> 26.306608000671101

    > Probably -- it was late at night and I was working things out in my
    > head...


    I know about late nights. I just wondered whether I overlooked something.
    News123, May 1, 2010
    #10
  12. On Sat, 01 May 2010 23:46:33 +0200, News123 <> declaimed
    the following in gmane.comp.python.general:

    > Indeed, this is what I meant. or phrased differently:
    > "about 27 reads with a binary search instead of 41.5M reads average with
    > a linear search."
    >

    Ah... I'd interpreted your response to have meant 27 reads vs my
    (top of head) 10000 -- both for the binary search... <G>

    --
    Wulfraed Dennis Lee Bieber AF6VN
    HTTP://wlfraed.home.netcom.com/
    Dennis Lee Bieber, May 2, 2010
    #12
  13. In article <>,
    Paul Rudin <> wrote:
    >"Karin Lagesen" <> writes:
    >
    >> Hello.
    >>
    >> I have approx 83 million strings, all 14 characters long. I need to be
    >> able to take another string and find out whether this one is present
    >> within the 83 million strings.
    >>
    >> Now, I have tried storing these strings as a list, a set and a dictionary.
    >> I know that finding things in a set and a dictionary is very much faster
    >> than working with a list, so I tried those first. However, I run out of
    >> memory building both the set and the dictionary, so what I seem to be left
    >> with is the list, and using the in method.
    >>
    >> I imagine that there should be a faster/better way than this?
    >>

    >
    >Shouldn't a set with 83 million 14 character strings be fine in memory
    >on a stock PC these days? I suppose if it's low on ram you might start
    >swapping which will kill performance. Perhaps the method you're using to
    >build the data structures creates lots of garbage? How much ram do you
    >have and how much memory does the python process use as it builds your
    >data structures?


    And if this is practical there should be no swapping problems,
    as the working set will be a small fraction of the data used.

    <SNIP>
    >
    >There are other algorithms you can use that have better theoretical
    >performance than using bisect for this particular problem, but you get
    >into trade offs between executing things in python as opposed to C if
    >you start to hand code things in python.


    There are a lot of possibilities, but they depend a great deal on
    secondary conditions, like how often the 83 M dataset changes.

    Groetjes Albert

    --
    --
    Albert van der Horst, UTRECHT,THE NETHERLANDS
    Economic growth -- being exponential -- ultimately falters.
    albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst
    Albert van der Horst, May 2, 2010
    #13
  14. Bryan (Guest)

    Karin Lagesen wrote:
    > I have approx 83 million strings, all 14 characters long. I need to be
    > able to take another string and find out whether this one is present
    > within the 83 million strings.

    [...]
    > I run out of memory building both the set and the dictionary, so
    > what I seem to be left with is the list


    I agree with the recommendations to try modules that maintain the set
    on disk (sqlite, dbm, shelve), but here's an in-memory solution: Hash
    to about 8 million buckets, and use a compact representation for the
    several strings in each bucket. That gives you most of the speed and
    avoids most of the memory overhead. Python makes it easy:


    class ConstLenStrSet:

        """ Keep a set of strings all of the same length.
        Support set.add() and Python's 'in' test.
        """

        def __init__(self, strlen=14, nbuckets=8000000):
            assert strlen > 0 and nbuckets > 0
            self.strlen = strlen
            self.nbuckets = nbuckets | 1
            self.table = [''] * self.nbuckets

        def _hashstr(self, s):
            return hash(s) % self.nbuckets

        def add(self, s):
            assert len(s) == self.strlen
            if s not in self:
                self.table[self._hashstr(s)] += s

        def __contains__(self, s):
            data = self.table[self._hashstr(s)]
            for i in range(0, len(data), self.strlen):
                if data[i : i + self.strlen] == s:
                    return True
            return False


    # Rudimentary demo-test:

    from random import choice
    from string import letters

    def rnd_str(slen=14):
        return ''.join(choice(letters) for _ in range(slen))

    # Test against Python's built-in set
    sref = set()
    for i in range(830000):
        sref.add(rnd_str())

    print 'Strings generated.'

    sset = ConstLenStrSet(14, 80000)
    for s in sref:
        sset.add(s)

    print 'ConstLenStrSet loaded.'

    for s in sref:
        assert s in sset

    for i in range(1000):
        s = rnd_str()
        if s in sref:
            print 'That was unlucky.'
        else:
            assert s not in sset



    If building the set is too slow, and you know you don't have a lot of
    duplicate strings, you can use a faster insert method that doesn't
    check whether the string is already in the set:


    def add_quick(self, s):
        assert len(s) == self.strlen
        self.table[self._hashstr(s)] += s


    --
    --Bryan
    Bryan, May 3, 2010
    #14