Generating random strings without duplicates

E

eric.gagnon

In a program randomly generating 10 000 000 alphanumeric codes of 16
characters in length (Ex.: "ZAZAZAZAZAZAZ156"), what would be an
efficient way to ensure that I do not generate duplicates?

STL set, map?
Could you give me a little code example?

Thank you.
 
B

Bob Hairgrove

In a program randomly generating 10 000 000 alphanumeric codes of 16
characters in length (Ex.: "ZAZAZAZAZAZAZ156"), what would be an
efficient way to ensure that I do not generate duplicates?

STL set, map?
Could you give me a little code example?

160 MB with 1-byte characters, or at least twice that for wchar_t ...
hmmm?

Store them in a database, and put a unique index on that column. True
randomness does not exclude duplicate values.
 
N

niklasb

Bob said:
160 MB with 1-byte characters, or at least twice that for wchar_t ...
hmmm?

Store them in a database, and put a unique index on that column. True
randomness does not exclude duplicate values.

Depending on what sort of "randomness" you require, you might be
able to avoid any kind of storage/lookup at all. Certain pseudo-
random number generators have the property that the same number
never occurs more than once for the duration of the randomizer's
"period" (after which the entire sequence of numbers repeats).
The linear-congruential algorith used by many implementations of
rand() has this property -- but unfortunately, from what I've
read most tend to have periods of only 65,000 or so.

If you could find a version with a period of 10,000,000 or more
then your problem is 99% solved. All you then have to do is use
each random number to generate a 16-character string in such a
way that no two numbers yield the same string. (This should be
easy enough to guarantee. If you limit yourselves to characters
A-Z and 0-9 this gives you up to 36^16 possible strings which
is more than enough.)

This will give you an algorithm for generating up to N unique
strings (N being the period of the randomizer), where uniqueness
is guaranteed by the algorithm with no lookup required.
 
B

bart.kowalski

Depending on what sort of "randomness" you require, you might be
able to avoid any kind of storage/lookup at all. Certain pseudo-
random number generators have the property that the same number
never occurs more than once for the duration of the randomizer's
"period" (after which the entire sequence of numbers repeats).
The linear-congruential algorith used by many implementations of
rand() has this property -- but unfortunately, from what I've
read most tend to have periods of only 65,000 or so.

If you could find a version with a period of 10,000,000 or more
then your problem is 99% solved.

Google: mersenne twister


Bart.
 
R

rossum

In a program randomly generating 10 000 000 alphanumeric codes of 16
characters in length (Ex.: "ZAZAZAZAZAZAZ156"), what would be an
efficient way to ensure that I do not generate duplicates?

STL set, map?
Could you give me a little code example?

Thank you.

An alternative:

1 Generate a random string.
2 Hash the string to a value in a range > 0 - 9 999 999.
3 If this hash value has been used before then goto 1.
4 If the hash value is a new one then the string is also new.

Keep track of which hash values have appeared by using a bit array,
one bit per value in the hash range.

If you pick the hash range to be 25% larger than the number of strings
then on average you will only have to generate 4 strings to find a new
one at the end of the run.


Another alternative:

1 Generate a file of N > 10 000 000 unique strings in advance.
2 Pick a random number R in the range 1 .. N, both inclusive..
3 If R is already picked then pick another.
4 If R is not already picked then output the Rth string.

Again a bit array will keep track of the numbers that you have already
picked.

By generating the strings in advance you can take as long as you want
to get them right. You can load the file from disk for the actual run.


Yet another alternative:

1 Generate a file of N >= 10 000 000 unique strings in advance.
2 For IX = N downto 1
3 Pick a random number R in the range 1 .. IX, both inclusive.
4 Output string R.
5 Copy string IX over string R
6 Endfor



rossum


The ultimate truth is that there is no ultimate truth
 
G

Greg

Bob said:
160 MB with 1-byte characters, or at least twice that for wchar_t ...
hmmm?

Store them in a database, and put a unique index on that column. True
randomness does not exclude duplicate values.

Try "uuidgen"

Greg
 
E

E. Mark Ping

In a program randomly generating 10 000 000 alphanumeric codes of 16
characters in length (Ex.: "ZAZAZAZAZAZAZ156"), what would be an
efficient way to ensure that I do not generate duplicates?

More information would make it possible to answer in something other
than guesses.

Why are you generating all these strings (how will they be used)? Why
is duplication an issue? etc.
 
J

Jim Langston

E. Mark Ping said:
More information would make it possible to answer in something other
than guesses.

Why are you generating all these strings (how will they be used)? Why
is duplication an issue? etc.

And why can't you just use a UUID generator which is guaranteed to be
unique?
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,769
Messages
2,569,581
Members
45,057
Latest member
KetoBeezACVGummies

Latest Threads

Top