matching strings in a large set of strings


Karin Lagesen

Hello.

I have approx 83 million strings, all 14 characters long. I need to be
able to take another string and find out whether this one is present
within the 83 million strings.

Now, I have tried storing these strings as a list, a set and a dictionary.
I know that finding things in a set and a dictionary is very much faster
than working with a list, so I tried those first. However, I run out of
memory building both the set and the dictionary, so what I seem to be left
with is the list, and using the in method.

I imagine that there should be a faster/better way than this?

TIA,

Karin
 

Peter Otten

Karin said:
I have approx 83 million strings, all 14 characters long. I need to be
able to take another string and find out whether this one is present
within the 83 million strings.

Now, I have tried storing these strings as a list, a set and a dictionary.
I know that finding things in a set and a dictionary is very much faster
than working with a list, so I tried those first. However, I run out of
memory building both the set and the dictionary, so what I seem to be left
with is the list, and using the in method.

I imagine that there should be a faster/better way than this?

Do you need all matches or do you just have to know whether there are any?
Can the search string be shorter than 14 characters?

One simple improvement over the list may be using one big string instead of
the 83 million short ones and then searching it using string methods.
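For example, a minimal sketch of that big-string idea (illustrative only, not
from the original post; it assumes the query is also exactly 14 characters and
that a match must start on a 14-character boundary):

# Sketch: concatenate the 14-char strings into one big str and test
# membership with str.find(), accepting only aligned matches.
STRLEN = 14

def build_big_string(strings):
    return ''.join(strings)          # 83e6 * 14 bytes ~= 1.2 GB as one str

def contains(big, needle):
    assert len(needle) == STRLEN
    pos = big.find(needle)
    while pos != -1:
        if pos % STRLEN == 0:        # match starts on a 14-char boundary
            return True
        pos = big.find(needle, pos + 1)
    return False

big = build_big_string(['ABCDEFGHIJKLMN', 'OPQRSTUVWXYZ01'])
print(contains(big, 'OPQRSTUVWXYZ01'))   # True
print(contains(big, 'AAAAAAAAAAAAAA'))   # False

The whole corpus then costs about 14 bytes per string instead of 38-plus,
although each lookup is still a linear scan of the big string.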

Peter
 

Paul Rudin

Karin Lagesen said:
Hello.

I have approx 83 million strings, all 14 characters long. I need to be
able to take another string and find out whether this one is present
within the 83 million strings.

Now, I have tried storing these strings as a list, a set and a dictionary.
I know that finding things in a set and a dictionary is very much faster
than working with a list, so I tried those first. However, I run out of
memory building both the set and the dictionary, so what I seem to be left
with is the list, and using the in method.

I imagine that there should be a faster/better way than this?

Shouldn't a set with 83 million 14 character strings be fine in memory
on a stock PC these days? I suppose if it's low on ram you might start
swapping which will kill performance. Perhaps the method you're using to
build the data structures creates lots of garbage? How much ram do you
have and how much memory does the python process use as it builds your
data structures?

A set should give good performance if the target string is also 14
characters.

If you do end up with the list then try using bisect
<http://docs.python.org/library/bisect.html> which should be quicker
than just using "in" (which I think just scans linearly through the list
testing for equality).

There are other algorithms you can use that have better theoretical
performance than using bisect for this particular problem, but you get
into trade-offs between executing things in Python as opposed to C if
you start to hand-code things in Python.
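A minimal bisect sketch (illustrative; assumes the list fits in memory and has
been sorted once up front):

import bisect

def in_sorted(sorted_list, s):
    # binary search: O(log n) comparisons instead of a linear scan
    i = bisect.bisect_left(sorted_list, s)
    return i < len(sorted_list) and sorted_list[i] == s

words = sorted(['AAAAAAAAAAAAAA', 'CCCCCCCCCCCCCC', 'BBBBBBBBBBBBBB'])
print(in_sorted(words, 'BBBBBBBBBBBBBB'))   # True
print(in_sorted(words, 'ZZZZZZZZZZZZZZ'))   # False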
 

Steven D'Aprano

Karin Lagesen said:
Hello.

I have approx 83 million strings, all 14 characters long. I need to be
able to take another string and find out whether this one is present
within the 83 million strings.

Now, I have tried storing these strings as a list, a set and a
dictionary. I know that finding things in a set and a dictionary is
very much faster than working with a list, so I tried those first.
However, I run out of memory building both the set and the dictionary,
so what I seem to be left with is the list, and using the in method.

I imagine that there should be a faster/better way than this?
Shouldn't a set with 83 million 14 character strings be fine in memory
on a stock PC these days?

>>> import sys
>>> sys.getsizeof('x' * 14)
38

So a single 14 char string takes 38 bytes.

>>> import random, string
>>> def rnd_str():
...     chars = list(string.ascii_letters)
...     random.shuffle(chars)
...     return ''.join(chars[:14])
...
>>> s = set()
>>> while len(s) < 83000:
...     s.add(rnd_str())
...
>>> sys.getsizeof(s)
1048688


So a set with 83000 such strings takes approximately 1 MB. So far fairly
trivial. But that's just the memory used by the container (the set), not
the contents. 38 bytes * 83,000 strings = another 3 MB. Which of course
is trivial for a modern PC, but the OP is using 83 million such strings,
not 83 thousand, which gives us a grand total of at least 3 gigabytes. An
entry level desktop PC these days is generally 2GB, and entry level
notebooks might have half a gig.

If the OP is on a 64 bit system, every pointer will be twice the size,
leading to even larger memory requirements. And if the strings are
unicode, then potentially they could be anything up to four times larger
each. Worst case, the OP could need something of the order of 24 GB to
store the strings all in memory.
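A quick back-of-envelope check of those figures, using the 38 bytes per string
measured above:

n = 83 * 10**6          # 83 million strings
per_string = 38         # bytes per 14-char str on 32-bit Python 2.x
print(per_string * n)   # 3154000000 -> a bit over 3 GB, before set overhead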
 

Paul Rudin

Duncan Booth said:
Some simple experiments should show you that a stock PC running a 32 bit
Python will struggle:

>>> 83 * 38   # 83 million strings at 38 bytes each, in millions of bytes
3154

So more than 3GB just for the strings (and that's for Python 2.x; on
Python 3.x you'll need nearly 5GB).

Running on a 64 bit version of Python should be fine, but for a 32 bit
system a naive approach just isn't going to work.

It depends - a few gig of RAM can be cheap compared with programmer
time. If you know you can solve a problem by spending a few euros on
some extra RAM it can be a good solution! It depends of course where the
code is being deployed - if it's something that's intended to be
deployed widely then you can't expect thousands of people to go out and
buy more RAM - but if it's a one off deployment for a particular
environment then it can be the best way to go.
 

News123

Dennis said:
So don't load them into memory... First use a file-based (not memory
based) [...]

That lets you do a binary search on the file. Much faster than a
linear search (linear search will average out to 41.5M read operations;
binary should be around 10000 reads).
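A minimal sketch of such a file-based binary search (illustrative; assumes the
strings were written out sorted, one per line, each record exactly 14
characters plus a newline):

STRLEN = 14
RECLEN = STRLEN + 1                       # 14 chars + '\n' per record

def in_sorted_file(path, needle):
    assert len(needle) == STRLEN
    with open(path, 'rb') as f:
        f.seek(0, 2)                      # seek to the end to get the file size
        nrecords = f.tell() // RECLEN
        lo, hi = 0, nrecords
        while lo < hi:                    # about log2(83e6) ~= 27 probes
            mid = (lo + hi) // 2
            f.seek(mid * RECLEN)
            rec = f.read(STRLEN).decode('ascii')
            if rec == needle:
                return True
            elif rec < needle:
                lo = mid + 1
            else:
                hi = mid
    return False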

Don't you mean 27 reads instead of 41.5M reads?


N
 

Stefan Behnel

Duncan Booth, 30.04.2010 10:20:
So more than 3GB just for the strings (and that's for Python 2.x on
Python 3.x you'll need nearly 5GB).

Running on a 64 bit version of Python should be fine, but for a 32 bit
system a naive approach just isn't going to work.

Option 1: use a trie. That should reduce storage, maybe it will reduce
it enough, maybe not. It depends on the data.

Depending on the implementation and the data, a trie can also use a lot
/more/ space than the set of strings that it contains. The 83 million 14
character strings can well include a relatively large subset of the
possible permutations (imagine purely numeric, hex-numeric or lower-case
alpha-numeric strings, for example), so the trie will likely need to branch
very often with very low intermediate run length. If you use pointers for
trie branches, that's at least 8 bytes per branch on a 64bit system, versus
1 byte per character in a byte string list. Depending on the ratio of
branches to characters, one or the other may win. So a "naive approach"
likely won't work for tries either.
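For concreteness, a minimal dict-of-dicts trie sketch (illustrative only;
nested dicts demonstrate the per-branch overhead described above rather than
avoiding it):

END = object()                      # sentinel: a stored string ends here

def trie_add(root, s):
    node = root
    for ch in s:
        node = node.setdefault(ch, {})
    node[END] = True

def trie_contains(root, s):
    node = root
    for ch in s:
        node = node.get(ch)
        if node is None:
            return False
    return END in node

root = {}
trie_add(root, 'ABCDEFGHIJKLMN')
print(trie_contains(root, 'ABCDEFGHIJKLMN'))   # True
print(trie_contains(root, 'ABCDEFGHIJKLMX'))   # False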

Stefan
 

Dennis Lee Bieber

Don't you mean 27 reads instead of 41.5M reads?
Don't you mean my 10000 reads? The 41.5M is the average for the
linear search.
Probably -- it was late at night and I was working things out in my
head...
 

News123

Dennis said:
Don't you mean my 10000 reads? The 41.5M is the average for the
linear search.
Indeed, this is what I meant. Or phrased differently:
"about 27 reads with a binary search instead of 41.5M reads average with
a linear search."
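For reference, the arithmetic behind both figures:

import math
n = 83 * 10**6
print(math.ceil(math.log(n, 2)))   # ~27 comparisons for a binary search
print(n / 2.0)                     # 41500000.0 reads on average for a linear scan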



Probably -- it was late at night and I was working things out in my
head...

I know about late nights. I just wondered whether I overlooked something.
 

Dennis Lee Bieber

Indeed, this is what I meant. Or phrased differently:
"about 27 reads with a binary search instead of 41.5M reads average with
a linear search."
Ah... I'd interpreted your response to have meant 27 reads vs my
(top of head) 10000 -- both for the binary search... <G>
 

Albert van der Horst

Shouldn't a set with 83 million 14 character strings be fine in memory
on a stock PC these days? I suppose if it's low on ram you might start
swapping which will kill performance. Perhaps the method you're using to
build the data structures creates lots of garbage? How much ram do you
have and how much memory does the python process use as it builds your
data structures?

And if this is practical there should be no swapping problems,
as the working set will be a small fraction of the data used.

There are other algorithms you can use that have better theoretical
performance than using bisect for this particular problem, but you get
into trade-offs between executing things in Python as opposed to C if
you start to hand-code things in Python.

There are a lot of possibilities, but they depend a great deal on
secondary conditions, like how often the 83M dataset changes.

Groetjes Albert
 

Bryan

Karin said:
I have approx 83 million strings, all 14 characters long. I need to be
able to take another string and find out whether this one is present
within the 83 million strings. [...]
I run out of memory building both the set and the dictionary, so
what I seem to be left with is the list

I agree with the recommendations to try modules that maintain the set
on disk (sqlite, dbm, shelve), but here's an in-memory solution: Hash
to about 8 million buckets, and use a compact representation for the
several strings in each bucket. That gives you most of the speed and
avoids most of the memory overhead. Python makes it easy:


class ConstLenStrSet:

    """ Keep a set of strings all of the same length.
    Support set.add() and Python's 'in' test.
    """

    def __init__(self, strlen=14, nbuckets=8000000):
        assert strlen > 0 and nbuckets > 0
        self.strlen = strlen
        self.nbuckets = nbuckets | 1
        self.table = [''] * self.nbuckets

    def _hashstr(self, s):
        return hash(s) % self.nbuckets

    def add(self, s):
        assert len(s) == self.strlen
        if s not in self:
            self.table[self._hashstr(s)] += s

    def __contains__(self, s):
        data = self.table[self._hashstr(s)]
        for i in range(0, len(data), self.strlen):
            if data[i : i + self.strlen] == s:
                return True
        return False


# Rudimentary demo-test (Python 2: uses string.letters and print statements):

from random import choice
from string import letters

def rnd_str(slen=14):
    return ''.join(choice(letters) for _ in range(slen))

# Test against Python's built-in set
sref = set()
for i in range(830000):
    sref.add(rnd_str())

print 'Strings generated.'

sset = ConstLenStrSet(14, 80000)
for s in sref:
    sset.add(s)

print 'ConstLenStrSet loaded.'

for s in sref:
    assert s in sset

for i in range(1000):
    s = rnd_str()
    if s in sref:
        print 'That was unlucky.'
    else:
        assert s not in sset



If building the set is too slow, and you know you don't have a lot of
duplicate strings, you can use a faster insert method that doesn't
check whether the string is already in the set:


def add_quick(self, s):
    assert len(s) == self.strlen
    self.table[self._hashstr(s)] += s
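For completeness, a minimal sketch of the on-disk route mentioned at the top
of this post, using sqlite3 from the standard library ('strings.db' is just a
placeholder file name):

import sqlite3

conn = sqlite3.connect('strings.db')
conn.execute('CREATE TABLE IF NOT EXISTS strings (s TEXT PRIMARY KEY)')

def add(s):
    conn.execute('INSERT OR IGNORE INTO strings VALUES (?)', (s,))

def contains(s):
    cur = conn.execute('SELECT 1 FROM strings WHERE s = ? LIMIT 1', (s,))
    return cur.fetchone() is not None

add('ABCDEFGHIJKLMN')
conn.commit()
print(contains('ABCDEFGHIJKLMN'))   # True
print(contains('ZZZZZZZZZZZZZZ'))   # False

The PRIMARY KEY gives SQLite an index, so each lookup is a few page reads
rather than a scan, and nothing needs to fit in RAM.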
 
