Comparing strings from the back?


Johannes Bauer

The memcpy patch was controversial as it broke Adobe Flash which
assumed memcpy was safe like memmove.

Adobe Flash was broken before: it made an assumption that is not
guaranteed by the standard. The patch only exposed the problem.

Regards,
Johannes

 

Peter Otten

[sorry for seriously broken implementation]
This function will return 1 if the first character differs. It does not
count the number of common characters but rather the more relevant
quantity which is the number of comparisons required to decide if the
strings are equal.

I remember stumbling over the non-zero frequency for len(word) but somehow
plodded on with beautifying the broken results :(

def count_common(a, b):
    i = 0
    for x, y in zip(a, b):
        if x != y:
            break
        i += 1
    return i


$ python3 reverse_compare2.py compare /usr/share/dict/words
499500 combinations
average common chars (head) 0.0535
average common chars (tail) 0.322
comparing every pair in a sample of 1000 8-char words
taken from '/usr/share/dict/words'

head
0: 477371 ************************************************************
1: 18310 **
2: 3191
3: 523
4: 79
5: 15
6: 10
7: 1

tail
0: 385069 ************************************************************
1: 78742 ************
2: 26734 ****
3: 7462 *
4: 1297
5: 168
6: 22
7: 6
 

Steven D'Aprano

Actually, I'm not. I'm stating exactly what assumptions I'm making to
get my calculation. I'm comparing *random* character strings or
bitstrings.

Excuse me, you are not. You are comparing English words which are highly
non-random.

You, on the other hand, are making vague assumptions which you do not
care to formalize, and yet you claim that "the number of comparisons is
equally likely to be 1, 2, 3, ..., N. The average then is". Without any
explanation for this. At all.

I will accept that my explanation was not good enough, but I strongly
disagree that I gave no explanation at all.

Is your assumption that we're comparing words that have the common prefix
"Herr Professor Frederick "?

No, I am pointing out that *your* assumption that most string comparisons
will halt close to the beginning of the string is an invalid assumption.
Your assumption only holds for some non-random strings.


[...]
You're wrong. It's not counting the number of string comparisons,

I didn't say it was.

it's counting the number of character comparisons.

Correct. But only out of an extremely limited subset of all possible
string comparisons, namely those very few that happen when sorting lists
of English words using a very specific algorithm, namely timsort.

Which is exactly what
we're talking about in this thread, are we not? The sorting of the list
is just to show some example where lots of strings are compared.

It is not good enough to extract a non-random subset of strings (only
English words) and then decrease the data set even further by only
comparing an extremely non-uniform subset of those strings: specifically
those few string comparisons that happen to occur using timsort.

Whatever figure you have calculated by taking this non-random selection,
it is irrelevant to the question of the average performance of string
equality given random strings.


Again, it's not counting the number of string comparisons. It's counting
the average number of character comparisons per string comparison. The
total number of string comparisons has nothing to do with it.

I didn't say anything about counting string comparisons. But you are
wrong -- the numbers you get are *strongly* influenced by the number of
string comparisons performed. That is just one of the reasons why the
results you get are biased.

Let me spell it out for you: given a list of five letter words:

['fruit', 'apple', 'claim', 'pears', ... , 'toast']

sorting the list may compare:

'fruit' with 'apple'
'apple' with 'claim'

but almost certainly will not compare:

'apple' with 'toast'

and it definitely won't compare:

'zzzax' with 'zzzaq'

because strings like those never get into the list in the first place.

For the sake of simple calculations, let's pretend that there are only
1000 five-letter strings possible. We want to know how many character-
comparisons are done on average when testing two random five-letter
strings for equality. To calculate this average, you must compare every
string to every other string, *including* itself.

This gives 1000*1000 = one million equality tests to be performed. For
each equality test, we count the number of character comparisons
performed, sum all those counts, and divide by 1e6. That is the average
number of char comparisons for random strings of five letters.

But that's not what you do. First you eliminate 900 out of the 1000
possible strings by only sampling English words -- strings like "xxoij"
cannot possibly be selected. So you are left with the highly non-random
subset of 10000 equality tests.

But you haven't finished. Instead of checking all 10000 equality tests,
you then decide to only check the 2000 tests that happen to randomly
occur when using the timsort algorithm to sort a list.

So from the population of one million possible equality tests, you throw
away the vast majority, then average the *strongly non-random* few that
are remaining. Naturally the result you get is strongly biased, and it
has nothing to do with the average Big Oh behaviour of equality on random
strings.
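
For illustration, here is a minimal sketch of the two averages being
contrasted above, using a small stand-in population (every length-5 string
over a four-letter alphabet); the helper names are mine, not from the thread:

import itertools
import random
from functools import cmp_to_key

def chars_examined(a, b):
    # characters a short-circuiting comparison looks at before it can stop
    n = 0
    for x, y in zip(a, b):
        n += 1
        if x != y:
            break
    return n

population = ["".join(p) for p in itertools.product("abcd", repeat=5)]

# (a) the average over *every* ordered pair, including each string with itself
full = sum(chars_examined(a, b) for a in population
           for b in population) / len(population) ** 2

# (b) the average over only those comparisons timsort happens to perform
#     while sorting a random sample of the same population
recorded = []
def record(a, b):
    recorded.append(chars_examined(a, b))
    return (a > b) - (a < b)

sorted(random.sample(population, 200), key=cmp_to_key(record))

print(round(full, 3), round(sum(recorded) / len(recorded), 3))
# the two averages typically differ noticeably; the sort mostly compares
# strings that end up near each other, which is the sampling bias at issue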
 

Oscar Benjamin

Actually, I'm not. I'm stating exactly what assumptions I'm making to
get my calculation. I'm comparing *random* character strings or
bitstrings.

Excuse me, you are not. You are comparing English words which are highly
non-random.

Evidently we have different understandings of what 'random' means. I don't
think it's unreasonable to say that strings drawn uniformly from the set of
all strings in the English language (having a given number of characters) are
random. The distribution is not uniform over the set of all possible character
strings but it is still random. I think Johannes deliberately chose these
strings to emulate a particular kind of 'real' distribution of strings that
might occur in practice.
I will accept that my explanation was not good enough, but I strongly
disagree that I gave no explanation at all.



No, I am pointing out that *your* assumption that most string comparisons
will halt close to the beginning of the string is an invalid assumption.
Your assumption only holds for some non-random strings.

I think you have this backwards. The case where this assumption is provably
true is precisely for random strings. To be clear, when I say 'random' in this
context I mean that each character is chosen independently from the same
probability distribution over the possible characters regardless of which
index it has in the string and regardless of what the other characters are
(IID). In this case the probability that comparison terminates at the jth
character decreases exponentially with j. This means that for large strings
the expected number of character comparisons is independent of the number of
characters in the string as the probability of reaching the later parts of the
string is too small for them to have any significant effect. This is provable
and applies regardless of how many possible characters there are and whether
or not each character is equally likely (except for the pathological case
where one character has a probability of 1).
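
A rough simulation of that argument (an illustrative sketch, not from the
thread; the lowercase ASCII alphabet is an arbitrary choice): the measured
average number of character comparisons levels off as the strings get longer.

import random
import string

def comparisons(a, b):
    # characters examined before a short-circuiting == could stop
    n = 0
    for x, y in zip(a, b):
        n += 1
        if x != y:
            break
    return n

def rand_str(length, alphabet=string.ascii_lowercase):
    return "".join(random.choice(alphabet) for _ in range(length))

for length in (1, 2, 5, 10, 100, 1000):
    trials = 10000
    avg = sum(comparisons(rand_str(length), rand_str(length))
              for _ in range(trials)) / trials
    print(length, round(avg, 3))
# beyond the first couple of lengths the averages hover around
# 26/25 = 1.04, independent of the string length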

For strings from 'real' distributions it is harder to make statements about
the 'average case' and it is possible to construct situations where the
comparison would regularly need to compare a common prefix. However, to get
asymptotic performance worse than O(1) it is not sufficient to say that there
may be a common prefix such as 'Herr' in the distribution of strings. It is
necessary that, somehow, the common prefix is likely to grow as the size of
the strings grows.

For example, the set of all strings of length N whose first N//2 characters
are always 'a' and whose remaining characters are chosen IID would lead to
O(N) performance. This is why the file paths example chosen at the start of
this thread is a good one. If a program is dealing with a number of large
strings representing file paths then it is not uncommon that many of those
paths would refer to files in the same deeply nested directory and hence
compare equal for a significant number of characters. This could lead to
average case O(N) performance.
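
A small sketch of that pathological case, following the "first N//2
characters are always the same" construction above (illustrative only, not
code from the thread):

import random

def comparisons(a, b):
    n = 0
    for x, y in zip(a, b):
        n += 1
        if x != y:
            break
    return n

def prefixed(n):
    # first n//2 characters are fixed, the rest are IID over a two-letter alphabet
    return "a" * (n // 2) + "".join(random.choice("ab") for _ in range(n - n // 2))

for n in (10, 100, 1000, 10000):
    avg = sum(comparisons(prefixed(n), prefixed(n)) for _ in range(1000)) / 1000
    print(n, round(avg, 1))
# the average grows roughly like n//2 + 2, i.e. linearly in n, because the
# shared prefix has to be walked on every comparison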

I think it's appropriate to compare string comparison with dict insertion:
Best case O(1) (no hash collisions)
Worst case O(N) (collides with every key)
Average case O(1) (as long as you don't use pathological data)

The only difference with string comparison is that there are some conceivable,
non-malicious cases where the pathological data can occur (such as with file
paths). However, I suspect that out of all the different uses of python
strings these cases are a small minority.

In saying that, it's not inconceivable that someone could exploit string
comparison by providing pathological data to make normally O(1) operations
behave as O(N). If I understand correctly it was precisely this kind of
problem with dict insertion/lookup that led to the recent hash-seed security
update.

Oscar
 

Steven D'Aprano

Evidently we have different understandings of what 'random' means. I
don't think it's unreasonable to say that strings drawn uniformly from
the set of all strings in the English language (having a given number of
characters) are random. The distribution is not uniform over the set of
all possible character strings but it is still random. I think Johannes
deliberately chose these strings to emulate a particular kind of 'real'
distribution of strings that might occur in practice.

Of course for some "real" applications, your strings being compared will
be English words. And for other applications, they will be strings of
numeric digits uniformly distributed between 20000 and 30000. Or taken
from a Gaussian distribution centered around 7000. Or strings made up of
deeply nested sets of parentheses (((((( ... )))))). Whichever special
distribution of strings you pick, I don't think calling them "random" is
terribly useful in context, even if they are generated randomly from some
non-uniform distribution.

But you know, it really doesn't make a difference. Equality testing will
*still* be O(N) since the asymptotic behaviour for sufficiently large
string comparisons will be bounded by O(N). Multiplicative constants
("half the string" versus "0.001 of the string") do not matter.

I may have been overly-conservative earlier when I said that on average
string equality has to compare half the characters. I thought I had
remembered that from a computer science textbook, but I can't find that
reference now, so possibly I was thinking of something else. (String
searching perhaps?). In any case, the *worst* case for string equality
testing is certainly O(N) (every character must be looked at), and the
*best* case is O(1) obviously (the first character fails to match). But
I'm not so sure about the average case. Further thought is required.
 

Dave Angel

<snip>

I may have been overly-conservative earlier when I said that on
average string equality has to compare half the characters. I thought
I had remembered that from a computer science textbook, but I can't
find that reference now, so possibly I was thinking of something else.
(String searching perhaps?). In any case, the *worst* case for string
equality testing is certainly O(N) (every character must be looked
at), and the *best* case is O(1) obviously (the first character fails
to match). But I'm not so sure about the average case. Further thought
is required.

For random strings (as defined below), the average compare time is
effectively unrelated to the size of the string, once the size passes
some point.

Define random string as being a selection from a set of characters, with
replacement. So if we pick some set of characters, say 10 (or 256, it
doesn't really matter), the number of possible strings is 10**N.

The likelihood of not finding a mismatch within k characters is
(1/10)**k. The likelihood of actually reaching the end of the random
string is (1/10)**N (which is the reciprocal of the number of strings,
naturally).

If we wanted an average number of comparisons, we'd have to calculate a
series, where each term is a probability times a value for k.
sum((k * 9*10**-k) for k in range(1, 10))


Those terms very rapidly approach 0, so it's safe to stop after a few.
Looking at the first 9 items, I see a value of 1.1111111

This may not be quite right, but the value is certainly well under 2 for
a population of 10 characters, chosen randomly. And notice that N
doesn't really come into it.
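
Evaluating that series directly and comparing it against the limit of the
full geometric series (a quick illustrative check, not part of the original
post):

approx = sum(k * 9 * 10**-k for k in range(1, 10))
print(approx)   # about 1.1111111
print(10 / 9)   # limit of the full series: l / (l - 1) with l = 10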
 

Oscar Benjamin

For random strings (as defined below), the average compare time is
effectively unrelated to the size of the string, once the size passes
some point.

Define random string as being a selection from a set of characters, with
replacement. So if we pick some set of characters, say 10 (or 256, it
doesn't really matter), the number of possible strings is 10**N.

The likelihood of not finding a mismatch within k characters is
(1/10)**k. The likelihood of actually reaching the end of the random
string is (1/10)**N (which is the reciprocal of the number of strings,
naturally).


If we wanted an average number of comparisons, we'd have to calculate a
series, where each term is a probability times a value for k.
sum((k * 9*10**-k) for k in range(1, 10))



Those terms very rapidly approach 0, so it's safe to stop after a few.
Looking at the first 9 items, I see a value of 1.1111111

This may not be quite right, but the value is certainly well under 2 for
a population of 10 characters, chosen randomly. And notice that N
doesn't really come into it.

It's exactly right. You can obtain this result analytically from
Johannes' formula above. Just replace 256 with 10 to get that the
expected number of comparisons is

(10/9)*(1 - 10**(-N))

The last term shows the dependence on N and is tiny even for N=9.

Oscar
 

Roy Smith

Steven D'Aprano said:
In any case, the *worst* case for string equality
testing is certainly O(N) (every character must be looked at), and the
*best* case is O(1) obviously (the first character fails to match).

The best case is O(0), if either string is empty (ducking and running).
 

Chris Angelico

The best case is O(0), if either string is empty (ducking and running).

No, O(0) would be when the application decides not to compare at all.

ChrisA (also ducking)
 

Johannes Bauer

Actually, I'm not. I'm stating exactly what assumptions I'm making to
get my calculation. I'm comparing *random* character strings or
bitstrings.

Excuse me, you are not. You are comparing English words which are highly
non-random.

Not in my original post. If you read it again, you will clearly see that
I was talking about purely random strings. And since you like to
nitpick, I'll clarify further: I'm talking about bitstrings in which
every bit of every character has the same probability of occurrence, 50%.

You then replied by mentioning probability distributions of characters
in different languages/encodings, to which I tried giving an example as
it might (and does) occur in the real world, like sorting a dictionary.

But the original point is still valid: Sorting of randomized bitstrings
is definitely not O(N), you got that wrong.
I will accept that my explanation was not good enough, but I strongly
disagree that I gave no explanation at all.

What possible explanation could there be for the claim that comparing
purely random bitstrings is equally likely to require any number of
comparisons up to their length?
No, I am pointing out that *your* assumption that most string comparisons
will halt close to the beginning of the string is an invalid assumption.
Your assumption only holds for some non-random strings.

Definitely not. It *especially* holds true for random strings. For
random strings with an alphabet of size n, the probability that the
first character needs to be compared is n^0, i.e. 1. The probability
that the second character needs to be compared is n^(-1). For the xth
character, it is n^(-(x-1)). This is due to the lazy abort in comparison,
and it results in the average number of comparisons being extremely small
no matter the bitlength, or O(log n).
Correct. But only out of an extremely limited subset of all possible
string comparisons, namely those very few that happen when sorting lists
of English words using a very specific algorithm, namely timsort.

Yes, but this was to look at a real-world example (in which way more
comparisons are needed than in the random case). You first were talking
about randomized bitstrings (and so was I), then you brought up
character sets and languages (which, for the original problem, are
entirely irrelevant).
Whatever figure you have calculated by taking this non-random selection,
it is irrelevant to the question of the average performance of string
equality given random strings.

Correct. I gave the estimate for random strings in my very first post.
For the sake of simple calculations, let's pretend that there are only
1000 five-letter strings possible. We want to know how many character-
comparisons are done on average when testing two random five-letter
strings for equality. To calculate this average, you must compare every
string to every other string, *including* itself.

This gives 1000*1000 = one million equality tests to be performed. For
each equality test, we count the number of character comparisons
performed, sum all those counts, and divide by 1e6. That is the average
number of char comparisons for random strings of five letters.

Easy enough. Since you didn't look at my original equation, here are
some numbers and the program attached:

64 combinations for length 6
Char comparisons: 8064
Comparison cnt : 4096
Avg comparison : 1.97

128 combinations for length 7
Char comparisons: 32512
Comparison cnt : 16384
Avg comparison : 1.98

256 combinations for length 8
Char comparisons: 130560
Comparison cnt : 65536
Avg comparison : 1.99

512 combinations for length 9
Char comparisons: 523264
Comparison cnt : 262144
Avg comparison : 2.00

1024 combinations for length 10
Char comparisons: 2095104
Comparison cnt : 1048576
Avg comparison : 2.00

2048 combinations for length 11
Char comparisons: 8384512
Comparison cnt : 4194304
Avg comparison : 2.00

4096 combinations for length 12
Char comparisons: 33546240
Comparison cnt : 16777216
Avg comparison : 2.00

Hopefully now you'll see that your assumption O(n) is dead wrong.
But that's not what you do. First you eliminate 900 out of the 1000
possible strings by only sampling English words -- strings like "xxoij"
cannot possibly be selected. So you are left with the highly non-random
subset of 10000 equality tests.

This is *exactly* what my original calculation did. Enumerate all
possibilities and divide by the total number.

Regards,
Johannes

#!/usr/bin/python3
import itertools

alphabet = [ c for c in "01" ]

def cmpstr(s1, s2):
    # count the character comparisons a short-circuiting equality
    # test needs before it can stop
    c = 0
    for i in range(len(s1)):
        c += 1
        if s1[i] != s2[i]:
            break
    return c


for strlength in range(1, 16):
    combinations = list(itertools.product(*([ alphabet ] * strlength)))

    assert(len(combinations) == len(alphabet) ** strlength)
    print("%d combinations for length %d" % (len(combinations), strlength))

    compcnt = 0
    comparisons = 0
    for c1 in combinations:
        for c2 in combinations:
            comparisons += cmpstr(c1, c2)
            compcnt += 1
    print("Char comparisons: %d" % (comparisons))
    print("Comparison cnt : %d" % (compcnt))
    print("Avg comparison : %.2f" % (comparisons / compcnt))

 

Johannes Bauer

But you know, it really doesn't make a difference. Equality testing will
*still* be O(N) since the asymptotic behaviour for sufficiently large
string comparisons will be bounded by O(N). Multiplicative constants
("half the string" versus "0.001 of the string") do not matter.

Wrong, at least for randomized strings (i.e. every character with the
same probability). O(N) is worst-case, O(log N) is correct for
randomized strings.

Then again, since you were nitpicking about Landau notation earlier in this
thread, every function bound by O(log N) is also bound by O(N), since,
as you correctly stated, it's only an upper bound. We should be talking
about the asymptotic sharp bound, i.e. capital Theta.
I may have been overly-conservative earlier when I said that on average
string equality has to compare half the characters. I thought I had
remembered that from a computer science textbook, but I can't find that
reference now, so possibly I was thinking of something else. (String
searching perhaps?). In any case, the *worst* case for string equality
testing is certainly O(N) (every character must be looked at), and the
*best* case is O(1) obviously (the first character fails to match). But
I'm not so sure about the average case. Further thought is required.

Yes, worst-case is O(N), best case O(1). Average is O(n log n).

Best regards,
Johannes

 

Dave Angel

<snip>
Yes, worst-case is O(N), best case O(1). Average is O(n log n).

Can't see how you came up with an average of n log(n). Fourteen minutes
before you made this post, you demonstrated it was less than 2 for any N.

And I previously posted that for a character set of size 10, the average
was 1.1111
 

Johannes Bauer

Can't see how you came up with an average of n log(n). Fourteen minutes
before you made this post, you demonstrated it was less than 2 for any N.

O(log n) is what I meant, sorry! Damnit.
And I previously posted that for a character set of size 10, the average
was 1.1111

For any given string length n and alphabet size l, the average is:

sum(i = 0..n-1) (1 / (l ^ i))

So with larger word length, the total complexity constantly increases.
The amount by which it increases, however, shrinks exponentially with
the word length. Therefore O(log n).

Sorry about the n log n mistake, argh.

Best regards & thanks for the correction,
Johannes


 

Johannes Bauer

Wrong, at least for randomized strings (i.e. every character with the
same probability). O(N) is worst-case, O(log N) is correct for
randomized strings.
^^
Here I write the right thing. Then further below...
Yes, worst-case is O(N), best case O(1). Average is O(n log n).

....I write the wrong thing. O(log n) is what I meant, as Dave correctly
noticed.

Best regards,
Johannes

 

Chris Angelico

Not in my original post. If you read it again, you will clearly see that
I was talking about purely random strings. And since you like to
nitpick, I'll clarify further: I'm talking about bitstrings in which
every bit of every character has the same probability of occurrence, 50%.

That sort of string isn't a normal thing to be comparing, though.

Here's an idea. Someone who's doing a lot of arguing in this thread
should take Python, find the string comparison routine, and hack in
some statistics-gathering. Then run *real code* on it. Maybe use this
with one of those web frameworks and run your web site on it for an
hour or two, or fire off some real scripts you really use. Then dump
out the stats at the end. My guess: The bulk of string comparisons
that get to actually comparing byte-for-byte will end up returning
True. Most of the false comparisons will be proven earlier; if I
understand correctly, Python will check for identity (easy true) and
different lengths (easy false). But my guess could turn out to be flat
wrong. In any case, it'll be far FAR more useful than arguing from
totally random, or random word selection, or anything.

Who's game?

ChrisA
 

Johannes Bauer

Can't see how you came up with an average of n log(n). Fourteen minutes
before you made this post, you demonstrated it was less than 2 for any N.

And I previously posted that for a character set of size 10, the average
was 1.1111

Again playing with the equations and thinking about it again. And I
completely missed your point. It wasn't about n log n vs. log n. Both
are wrong. This surprises me. I was somehow thinking about the limit of
sum (1/n), n -> infty, whose partial sums grow like log n. But since the
summands here are shrinking exponentially, the limit is different.

I think the limit of the average number of comparisons for word length n
going to infinity, with alphabet size l, is

l / (l - 1)

i.e. for bitstrings it's 2 and for bytestrings it's 256/255.

This would mean string comparison of randomized strings is O(1). Can
that really be true? It looks like it.
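
A quick sanity check of that limit for a three-letter alphabet (alphabet and
lengths picked arbitrarily for illustration), reusing the brute-force
approach of the script posted earlier:

import itertools

def cmpstr(s1, s2):
    c = 0
    for i in range(len(s1)):
        c += 1
        if s1[i] != s2[i]:
            break
    return c

for strlength in range(1, 8):
    combinations = list(itertools.product("abc", repeat=strlength))
    total = sum(cmpstr(c1, c2) for c1 in combinations for c2 in combinations)
    print(strlength, round(total / len(combinations) ** 2, 3))
# the averages climb towards 3 / (3 - 1) = 1.5 and never exceed it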

Best regards,
Johannes

 

Johannes Bauer

In any case, it'll be far FAR more useful than arguing from
totally random, or random word selection, or anything.

Who's game?

Okay, patched against Python 3.2.3: http://pastebin.com/PRRN53P6

To invoke display of the stats, compare the string "pleasedumpstats" as
the left-hand operand, i.e.:

"pleasedumpstats" < ""

Just ran it against a big script of mine which takes the stringified
IMDb text files, parses them and dumps them into a sqlite3 database.
Surprisingly few string comparisons, however (the sqlite module
without doubt uses its own routines internally). Although the database
in the end has about 2 million records, these were the stats:

strCmpEq 1017
strCmpLt 2802
strCmpGt 1633
strCmpTc 16219
strCmpCc 8570

which means 5452 comparisons of which 19% were equal and the rest unequal.

Average string length is about 2.97 characters, and comparison aborted on
average after 1.57 characters.

Maybe someone has a test which makes heavier use of string comparison. I
don't want to make up something, however, since this is (by definition)
going to be artificial.

Best regards,
Johannes

 

Johannes Bauer

"pleasedumpstats" < ""

And against an XML-reading C code generator that uses mako:

strCmpEq 39670
strCmpLt 2766215
strCmpGt 2744002
strCmpTc 37430821
strCmpCc 14048656

Compared strings: 5549887
Equal: 0.7%
Average wordlength: 6.74 chars
Average comparelength: 2.53 chars

Best regards,
Johannes

 

Dave Angel

Again playing with the equations and thinking about it again. And I
completely missed your point. It wasn't about n log n vs. log n. Both
are wrong. This surprises me. I was somehow thinking about the limit of
sum (1/n), n -> infty. But since the summands are shrinking
exponentially, the limit is different.

I think the limit of the average comparisons for a given wordlength n
against infinity with alphabet size l is

l / (l - 1)

i.e. for bitstrings it's 2 and for bytestrings it's 256/255.

This would mean string comparison of randomized strings is O(1). Can
that really be true? It looks like it.
(Just lost the internet in a storm, so I'm not sure how long it'll be
before this sends.)

Thanks, that was exactly my point. Since el is at least 2, the average
number of comparisons is no larger than 2, for any value of N. That's
why, when I'm visually comparing MD5 values, I usually only look at the
first few and last few hex digits.

However, Chris Angelico (at 10:39) pointed out again that totally random
strings aren't real-world equivalent.

He has now proposed that somebody measure real-world cases here, by
patching the string comparison to gather statistics. I think that's
beyond me at this point. I only joined this thread when the cases under
study were well-defined: random, and therefore predictable. <g>
 

Steven D'Aprano

For random strings (as defined below), the average compare time is
effectively unrelated to the size of the string, once the size passes
some point.

Okay, I've spent some time looking for the reference, and can't find it,
so I'm now sure I must have been conflating it with something else.

After further thought, and giving consideration to the arguments given by
people here, I'm now satisfied to say that for equal-length strings,
string equality is best described as O(N).

1) If the strings are equal, a == b will always compare all N
characters in each string.

2) If the strings are unequal, a == b will *at worst* compare
all N characters.

3) Following usual practice in this field, worst case is the
one which conventionally is meant when discussing Big Oh
behaviour. See, for example, "Introduction To Algorithms"
by Cormen, Leiserson and Rivest.

Also of some interest is the best case: O(1) for unequal strings (they
differ at the first character) and O(N) for equal strings.

Also of interest is the case that has caused the majority of the
discussion, the average case. I am now satisfied that the average number
of comparisons for unequal strings is O(1). To be precise, it is bounded
below by 1 comparison (you always have to compare at least one pair of
characters) and bounded above by 2 comparisons.

(I'm talking about the average here -- the actual number of comparisons
can range all the way up to N, but the average is <= 2.)

If I've done the maths right, the exact value for the average is:

((M-1)*sum( (N-i)*M**i for i in range(0, N) ) + N)/(M**N)

for random strings of length N taken from an alphabet of size M.

For M = 2, that average approaches but never exceeds 2 as N increases;
for M = 3, the average approaches 1.5, for M = 4 it approaches 1.333...
and so forth.
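
As a cross-check (an illustrative verification sketch, not from the post),
the expression can be compared against a brute-force enumeration over every
ordered pair of strings for small M and N:

import itertools

def compare_cost(a, b):
    # characters examined before == can stop (all N when the strings are equal)
    cost = 0
    for x, y in zip(a, b):
        cost += 1
        if x != y:
            break
    return cost

def exact_average(M, N):
    return ((M - 1) * sum((N - i) * M**i for i in range(0, N)) + N) / M**N

for M, N in [(2, 3), (2, 6), (3, 4), (4, 5)]:
    strings = list(itertools.product(range(M), repeat=N))
    brute = sum(compare_cost(a, b) for a in strings
                for b in strings) / len(strings) ** 2
    print(M, N, round(brute, 6), round(exact_average(M, N), 6))
# the two columns agree, and the averages stay below M / (M - 1)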



Thanks to all who contributed.
 
