Using STL map and set on large amount of data


E

Eric

I need an effective way (time is my main concern here) to generate 10
000 000 unique alphanumeric strings of 16 characters each.

I used STL set and map but after about 5 000 000 entries, it becomes
very slow, even if I still have enough RAM available on my computer

Do you have any advice?

Here a code sample that I used to ensure uniqueness.

typedef pair<string, bool> StringBool_pair;
map<string, bool> MapValues;
pair<map<string,bool>::iterator,bool> pr;

for (int iii=0; iii<10000000; iii++) {
strCode = random16CharacterCode();
pr = MapValues.insert(StringBool_pair(strCode, true));
if(pr.second == true) {
// Accepted...
} else {
// Rejected...
}
}

Thank you
 
Ad

Advertisements

A

Alan Johnson

Eric said:
I need an effective way (time is my main concern here) to generate 10
000 000 unique alphanumeric strings of 16 characters each.

I used STL set and map but after about 5 000 000 entries, it becomes
very slow, even if I still have enough RAM available on my computer

Do you have any advice?

Here a code sample that I used to ensure uniqueness.

typedef pair<string, bool> StringBool_pair;
map<string, bool> MapValues;
pair<map<string,bool>::iterator,bool> pr;

for (int iii=0; iii<10000000; iii++) {
strCode = random16CharacterCode();
pr = MapValues.insert(StringBool_pair(strCode, true));
if(pr.second == true) {
// Accepted...
} else {
// Rejected...
}
}

Thank you

See if your compiler has an implementation of hash_set. It is
nonstandard, but many compilers have it as an extension. For example,
in g++ you access it in the namespace __gnu_cxx:

#include <ext/hash_set>
#include <string>

int main()
{
__gnu_cxx::hash_set<std::string> h ;
// Do interesting things here.
return 0 ;
}


This should have the same interface as std::set, but uses a hash table
instead of a balanced binary tree (the usual implementation of
std::set), therefore giving you expected O(1) insertion and lookup,
instead of O(lg n).

Here is some good documentation for hash_set:
http://www.sgi.com/tech/stl/hash_set.html

-Alan
 
M

Mark P

Eric said:
I need an effective way (time is my main concern here) to generate 10
000 000 unique alphanumeric strings of 16 characters each.

I used STL set and map but after about 5 000 000 entries, it becomes
very slow, even if I still have enough RAM available on my computer

Do you have any advice?

Can you be more specific about any other constraints on your strings?
If the sole requirement is that they be unique, then you can just list
the first 10^7 in alphabetical order, starting with aaa...aaa.

If they need to be random then you can do various sneaky things that
won't affect the randomness too much (for most purposes). For example,
you can fix the first character of the string and generate 15 subsequent
random characters. Do this 10^7 / (number of distinct characters) times
for each distinct character. That will let you work with substantially
smaller data sets individually.

Other possibilities exist too, it just depends upon how much of your
system's built-in randomness you're willing to sacrifice.
 
M

Michael Olea

Eric said:
I need an effective way (time is my main concern here) to generate 10
000 000 unique alphanumeric strings of 16 characters each.

I used STL set and map but after about 5 000 000 entries, it becomes
very slow, even if I still have enough RAM available on my computer

Do you have any advice?

Here a code sample that I used to ensure uniqueness.

typedef pair<string, bool> StringBool_pair;
map<string, bool> MapValues;
pair<map<string,bool>::iterator,bool> pr;

for (int iii=0; iii<10000000; iii++) {
strCode = random16CharacterCode();
pr = MapValues.insert(StringBool_pair(strCode, true));
if(pr.second == true) {
// Accepted...
} else {
// Rejected...
}
}

Thank you

In addition to the ideas others have mentioned (e.g. generating strings
deterministicaly using some one or another listing algorithm, or using a
random string generator that does not generate duplicates over a period
large enough for your purposes) you might want to try a trie. This is a
data structure for storing sets of strings. It has the following
properties: insertion of a new string takes time proportional to the length
of the string (and not the number of strings in the set); testing whether
or not a given string is already a member of the set takes time
proportional to the length of the string (and not the number of strings in
the set).

So what's a "trie"? The name supposedly derives from "reTRIEval". Suppose
you have the set of strings {"HIS", "HERS", "HIM", "HE"} Then the trie
would look like (ascii art coming, hope you have a monospaced font):

+
|
+-H--+
|
+-E
| |
| +-$
| |
| +-R-S-$
|
+-I
|
+-M-$
|
+-S-$

Here "$" denotes end of string. Any path from the root to a leaf spells out
a unique string in the stored set. See how it works? See how you could use
it?

- Michael
 
P

Panjandrum

Eric said:
I need an effective way (time is my main concern here) to generate 10
000 000 unique alphanumeric strings of 16 characters each.

I used STL set and map but after about 5 000 000 entries, it becomes
very slow, even if I still have enough RAM available on my computer

Which compiler (version), Standard library, operating system?
Do you have any advice?

Here a code sample that I used to ensure uniqueness.

typedef pair<string, bool> StringBool_pair;
map<string, bool> MapValues;
pair<map<string,bool>::iterator,bool> pr;

for (int iii=0; iii<10000000; iii++) {
strCode = random16CharacterCode();
pr = MapValues.insert(StringBool_pair(strCode, true));
if(pr.second == true) {
// Accepted...
} else {
// Rejected...
}
}

Maybe your string template uses dynamic memory allocation. Try a key
class like this:

class key {
public:
key (const char* s) { ... }
inline friend bool operator< (const key& left, const key& right){...}
private:
char buf[17];
//...
};
 
Ad

Advertisements

V

velthuijsen

Eric said:
I need an effective way (time is my main concern here) to generate 10
000 000 unique alphanumeric strings of 16 characters each.

I used STL set and map but after about 5 000 000 entries, it becomes
very slow, even if I still have enough RAM available on my computer

Do you have any advice?

Here a code sample that I used to ensure uniqueness.

typedef pair<string, bool> StringBool_pair;
map<string, bool> MapValues;
pair<map<string,bool>::iterator,bool> pr;

for (int iii=0; iii<10000000; iii++) {
strCode = random16CharacterCode();
pr = MapValues.insert(StringBool_pair(strCode, true));
if(pr.second == true) {
// Accepted...
} else {
// Rejected...
}
}

If the template library provided with your compiler has a hash_set or a
hash_map you might want to use that.

If not you might want to ditch the map if you are only using it to
ensure uniqueness of the inserted strings and go for a vector.
The risk that a few of the values already exist is small (but is
there).
You want only 1.0E+7 values, the total number of available values is
2.23007E+43 (or 4.52313E+74 if case sensitive). The chance of getting
the same string (if you use a decent pRNG) is only 1 divided by
2.23E+36 (or 1 divided 4.52E+67 if case sensitive).

If this is to risky since you must have a guarantee for uniqueness
another trick is to generate 36 (or 62) different maps and assign 10
million divided by 36 (or 62) values to each map, where each map stores
only string starting with one specific alpha numerical value. Then do a
random access to one of the submaps when you need a value. You'll want
to generate more values for each map to decrease the chance that one of
the submaps will run out of values when you access it.
This trick reduces the number of values you need to insert into one map
thereby replacing the time it takes to go through the tree for several
million values by a constant time during creation of the strings and a
constant time increase when retrieving a value.
 
M

Mark P

Eric said:
The strings are randomly generated, so I can't start with a...a, then
a...b, etc.

Well, how about the other idea I suggested then? In fact, you can do
even better on it and not sacrifice any randomness. Begin by assigning
10^7 balls into 26 bins (or however many chars you choose from). You
can make an array count[26] and increment a random element 10^7 times.
Then for each bin i, construct count random, distinct 15 char strings
and prepend to each char. This makes the individual problems
substantially smaller and still retains randomness.

More generally you can use on the order of sqrt(10^7) bins and compute
for each on the order of sqrt(10^7) strings. For example, a bin for
every possible first two characters.
 
M

Mark P

You want only 1.0E+7 values, the total number of available values is
2.23007E+43 (or 4.52313E+74 if case sensitive). The chance of getting
the same string (if you use a decent pRNG) is only 1 divided by
2.23E+36 (or 1 divided 4.52E+67 if case sensitive).

No, it's more than that though still very small. Look up the birthday
problem.
 
V

velthuijsen

True, I should have used P(k)= 1- n!/(n-k)!*n^k for accuracy.
Now to just to write a program that can handle the numbers involved :)
 
M

Me

Eric said:
I need an effective way (time is my main concern here) to generate 10
000 000 unique alphanumeric strings of 16 characters each.

I used STL set and map but after about 5 000 000 entries, it becomes
very slow, even if I still have enough RAM available on my computer

Do you have any advice?

Here a code sample that I used to ensure uniqueness.

typedef pair<string, bool> StringBool_pair;
map<string, bool> MapValues;
pair<map<string,bool>::iterator,bool> pr;

for (int iii=0; iii<10000000; iii++) {
strCode = random16CharacterCode();
pr = MapValues.insert(StringBool_pair(strCode, true));
if(pr.second == true) {
// Accepted...
} else {
// Rejected...
}
}

You're thinking way too hard on this one. All you need to do is make a
function to generate a non-repeating random number in the range [0,
some number in the range 10**7-1 and 16**6-1] and then convert it to a
fixed length hex string of length 6 and call that function 10**7 times.
"Pseudo-Random Traversal of a Set" on
http://www.developer.com/tech/article.php/10923_3496381_3 shows one
easy way of doing this.
 
Ad

Advertisements

V

verec

I need an effective way (time is my main concern here) to generate 10
000 000 unique alphanumeric strings of 16 characters each.

Not too sure why people should help you generate 10 millions fake
credit card numbers ...
 
P

Paul Groke

verec said:
Not too sure why people should help you generate 10 millions fake
credit card numbers ...

Alphanumeric credit-card numbers?
Looks more like TAN codes to me, but whatever :)
 
Ad

Advertisements

V

verec

Alphanumeric credit-card numbers?
Looks more like TAN codes to me, but whatever :)

Hmmm. How would _you_ go about asking for help without
revealing too much of your private agenda?

In case it didn't occur to you, alphanumeric is a proper
superset of numeric ... and thus, any help he can get on
the first, immediately applies to the second.

Call me paranoid :)
 

Top