
Actually, I'm not. I'm stating exactly what assumptions I'm making to

get my calculation. I'm comparing *random* character strings or

bitstrings.

Excuse me, you are not. You are comparing English words which are highly

non-random.

You, on the other hand, are making vague assumptions which you do not care to formalize, and yet you claim that "the number of comparisons is equally likely to be 1, 2, 3, ..., N. The average then is". Without any explanation for this. At all.

I will accept that my explanation was not good enough, but I strongly

disagree that I gave no explanation at all.

Is your assumption that we're comparing words that have the common prefix
"Herr Professor Frederick "?

No, I am pointing out that *your* assumption that most string comparisons

will halt close to the beginning of the string is an invalid assumption.

Your assumption only holds for some non-random strings.

[...]

You're wrong. It's not counting the number of string comparisons,

I didn't say it was.

it's counting the number of character comparisons.

Correct. But only out of an extremely limited subset of all possible

string comparisons, namely those very few that happen when sorting lists

of English words using a very specific algorithm, namely timsort.

Which is exactly what we're talking about in this thread, are we not? Sorting the list is just an example where lots of strings get compared.

It is not good enough to extract a non-random subset of strings (only

English words) and then decrease the data set even further by only

comparing an extremely non-uniform subset of those strings: specifically

those few string comparisons that happen to occur using timsort.

Whatever figure you have calculated by taking this non-random selection,

it is irrelevant to the question of the average performance of string

equality given random strings.

Again, it's not counting the number of string comparisons. It's counting the average number of character comparisons per string comparison. The total number of string comparisons has nothing to do with it.

I didn't say anything about counting string comparisons. But you are

wrong -- the numbers you get are *strongly* influenced by the number of

string comparisons performed. That is just one of the reasons why the

results you get are biased.

Let me spell it out for you: given a list of five letter words:

['fruit', 'apple', 'claim', 'pears', ... , 'toast']

sorting the list may compare:

'fruit' with 'apple'

'apple' with 'claim'

but almost certainly will not compare:

'apple' with 'toast'

and it definitely won't compare:

'zzzax' with 'zzzaq'

because strings like those never get into the list in the first place.

For the sake of simple calculations, let's pretend that there are only

1000 five-letter strings possible. We want to know how many character-

comparisons are done on average when testing two random five-letter

strings for equality. To calculate this average, you must compare every

string to every other string, *including* itself.

This gives 1000*1000 = one million equality tests to be performed. For

each equality test, we count the number of character comparisons

performed, sum all those counts, and divide by 1e6. That is the average

number of char comparisons for random strings of five letters.
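The averaging procedure just described can be sketched in a few lines of Python. To keep it tractable I use every five-character string over a two-letter alphabet (32 strings, so 32*32 = 1024 equality tests) as a stand-in for the hypothetical 1000 strings, and `char_comparisons` is my assumed model of equality testing: scan left to right, stop at the first mismatch.

```python
from itertools import product

def char_comparisons(s, t):
    """Count character comparisons made when testing s == t,
    scanning left to right and stopping at the first mismatch."""
    count = 0
    for a, b in zip(s, t):
        count += 1
        if a != b:
            break
    return count

# The full population: every string compared with every string,
# *including* itself, exactly as described above.
strings = [''.join(p) for p in product('ab', repeat=5)]
total = sum(char_comparisons(s, t) for s in strings for t in strings)
average = total / len(strings) ** 2
print(average)
```

For this alphabet the average works out to sum(2**-i for i in range(5)) = 1.9375, matching the closed-form expectation for uniform random strings.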

But that's not what you do. First you eliminate 900 of the 1000 possible strings by only sampling English words -- strings like "xxoij" cannot possibly be selected. That leaves 100 words, so you are left with a highly non-random subset of just 100*100 = 10000 equality tests.

But you haven't finished. Instead of checking all 10000 equality tests,

you then decide to only check the 2000 tests that happen to randomly

occur when using the timsort algorithm to sort a list.

So from the population of one million possible equality tests, you throw away the vast majority, then average the *strongly non-random* few that remain. Naturally the result you get is strongly biased, and it has nothing to do with the average Big Oh behaviour of equality on random strings.
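The selection bias can be made concrete with a quick experiment (the `Tracked` wrapper, seed, and two-letter alphabet are my own illustrative choices, not anything from the thread): count character comparisons per string comparison once over every possible pair, and once over only the comparisons that timsort happens to perform while sorting. Because sorting mostly compares strings that are close in sort order, and therefore share prefixes, the timsort sample inflates the average.

```python
import random

comparisons = []  # char-comparison count of each string comparison

def chars(s, t):
    """Left-to-right, stop-at-first-mismatch comparison model."""
    n = 0
    for a, b in zip(s, t):
        n += 1
        if a != b:
            break
    return n

class Tracked:
    """Wrap a string so sorting records the cost of each comparison."""
    def __init__(self, s):
        self.s = s
    def __lt__(self, other):
        comparisons.append(chars(self.s, other.s))
        return self.s < other.s

random.seed(0)
strings = [''.join(random.choice('ab') for _ in range(5))
           for _ in range(200)]

# Biased sample: only the comparisons timsort chooses to make.
sorted(Tracked(s) for s in strings)
timsort_avg = sum(comparisons) / len(comparisons)

# Full population: every string against every string, including itself.
full_avg = (sum(chars(s, t) for s in strings for t in strings)
            / len(strings) ** 2)
print(timsort_avg, full_avg)
```

On this data the sorting-derived average comes out higher than the full-population average, which is the bias being described: the pairs timsort compares are not a random sample of all possible pairs.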