This shows some significant confusion about Big Oh complexity analysis.
When you say that sorting the list of words is O(N log N), the N refers
to the number of words in the dictionary. But when you speak about
comparisons being O(N), the N doesn't refer to the size of the
dictionary, but to the size of the individual strings being compared.
There is no reasonable comparison algorithm where the effort required
to compare two strings depends on the size of the dictionary they come
from.

Since these are *completely different Ns*, you can't combine them to
get O(N**2 log N) as the overall algorithmic complexity. Doing so is
sheer nonsense. If you state it like this:

    sorting the list of words is O(N log N) where N = length of the list
    comparing the words is O(M) where M = the average length of the words

then the overall complexity is O(N log N)*O(M), but since M is a
constant for any particular dictionary (it's just the average length of
the words) that reduces down to O(N log N).
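To put rough, purely illustrative numbers on that: for a word list with
N = 100,000 and average word length M = 8, a comparison sort performs
on the order of N*log2(N), i.e. about 1.7 million comparisons, each
touching at most about M characters, so roughly 13 million character
operations in total. Changing M changes the constant factor, not the
N log N shape of the curve.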

To say that sorting a dictionary is O(N log N) does *not* imply that
string comparisons are O(1). It only implies that string comparisons
don't depend on the size of the dictionary. Comparing:

    s1 = 'a'*30002
    s2 = 'a'*30001 + 'b'

takes the same amount of time whether they are in a dictionary of 100
words or one of 100000 words. For the purposes of calculating the
algorithmic complexity of sorting a large list of words, we can treat
comparisons as an elementary operation even though they actually
aren't.
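Here is a minimal sketch (mine, not part of the original discussion;
exact timings will of course vary by machine) showing that the cost of
comparing those two strings is unaffected by how many other words sit
alongside them:

    from timeit import timeit

    s1 = 'a'*30002
    s2 = 'a'*30001 + 'b'

    # A single comparison only ever touches s1 and s2; the rest of the
    # "dictionary" is irrelevant to its cost.
    for size in (100, 100000):
        words = [s1, s2] + ['word%d' % i for i in range(size - 2)]
        t = timeit('a < b', globals={'a': words[0], 'b': words[1]},
                   number=10000)
        print('%6d words: %.4f seconds' % (size, t))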

You are making unjustified assumptions about the distribution of
letters in the words. This might be a list of long chemical compound
names where the words typically differ only in their suffix. It might
be a list of people with titles:

    Herr Professor Frederick Schmidt
    Herr Professor Frederick Wagner
    ....

But either way, it doesn't matter for the Big Oh behaviour of sorting
the words.

> I've just written a program to demonstrate this (attached below).

I have no idea why you think this program demonstrates anything about
string equality comparisons. What you are counting is not the average
number of comparisons for string equality, but the number of
comparisons made while sorting. These are *completely different*,
because sorting a list does not require testing each string against
every other string.

Whatever you have measured, it has nothing to do with the Big Oh
complexity of string comparisons.
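If you want to count the comparisons sorting actually performs, a
wrapper along these lines will do it (a rough sketch of my own, not the
attached program); the counts grow like N log N, nowhere near the N**2
that all-pairs testing would need:

    import random

    comparisons = 0

    class CountedStr(str):
        # list.sort() only needs __lt__, so counting here catches every
        # comparison the sort makes.
        def __lt__(self, other):
            global comparisons
            comparisons += 1
            return str.__lt__(self, other)

    for n in (1000, 2000, 4000, 8000):
        comparisons = 0
        words = [CountedStr('word%06d' % random.randrange(10**6))
                 for _ in range(n)]
        words.sort()
        print('%5d words: %6d comparisons' % (n, comparisons))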

It seems to me that you have misunderstood the purpose of Big Oh
notation. It gives an asymptotic upper bound on the amount of work
done, ignoring all lower-order terms and multiplicative factors, not an
exact formula.

If something is O(N), then this tells you:

1) the number of algorithmic steps is proportional to N, plus or
   minus terms which grow more slowly than N (e.g. sqrt(N), log(N),
   1/N, or some constant);

2) if you ignore those other terms, and keep all other influences
   equal, then doubling N will *at worst* double the number of
   algorithmic steps;

3) if all those algorithmic steps take exactly the same amount of
   time, then doubling N will take twice as much time (the sketch
   below makes points 2 and 3 concrete).
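Here is that sketch (a toy example of my own; the measured times will
naturally vary): the step count of a plainly O(N) loop doubles exactly
as N doubles, while the wall-clock time only roughly follows:

    from timeit import timeit

    def linear_scan(items):
        # One step per item: O(N) by construction.
        steps = 0
        for item in items:
            steps += 1
        return steps

    for n in (100000, 200000, 400000):
        items = list(range(n))
        steps = linear_scan(items)
        t = timeit(lambda: linear_scan(items), number=10)
        print('N=%6d  steps=%6d  time=%.4fs' % (n, steps, t))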

But of course *none* of those assumptions hold for measurements of
actual elapsed time. In real life, operations rarely take constant
time, you don't always see worst-case behaviour, and the lower-order
terms are not necessarily insignificant enough to ignore.

[...]

> So, interestingly, in this real-world case(tm), the complexity does
> not scale with O(n). It actually goes down (which is probably due to
> the limited number of words of greater length).

I would guess that it probably has more to do with the ability of the
timsort algorithm to exploit local order within a list and avoid
performing comparisons.
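That guess is easy to test with the same kind of counting wrapper as
above (again a sketch of my own): timsort needs only about N-1
comparisons on input that is already in order, versus roughly
N*log2(N) on shuffled input:

    import random

    comparisons = 0

    class CountedStr(str):
        def __lt__(self, other):
            global comparisons
            comparisons += 1
            return str.__lt__(self, other)

    ordered = [CountedStr('word%06d' % i) for i in range(10000)]
    shuffled = ordered[:]
    random.shuffle(shuffled)

    for label, words in (('ordered', ordered[:]), ('shuffled', shuffled)):
        comparisons = 0
        words.sort()
        print('%-8s %6d comparisons' % (label, comparisons))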

> In summary, let's please not debate "feelings" or such.

I have no idea what feelings you are referring to.