Comparing strings from the back?

Discussion in 'Python' started by Roy Smith, Sep 4, 2012.

  1. Roy Smith

    There's been a bunch of threads lately about string implementations, and
    that got me thinking (which is often a dangerous thing).

    Let's assume you're testing two strings for equality. You've already
    done the obvious quick tests (i.e., they're the same length), and you're
    down to the O(n) part of comparing every character.

    I'm wondering if it might be faster to start at the ends of the strings
    instead of at the beginning? If the strings are indeed equal, it's the
    same amount of work starting from either end. But, if it turns out that
    for real-life situations, the ends of strings have more entropy than the
    beginnings, the odds are you'll discover that they're unequal quicker by
    starting at the end.

    It doesn't seem implausible that this is the case. For example, most
    of the filenames I work with begin with "/home/roy/". Most of the
    strings I use as memcache keys have one of a small number of prefixes.
    Most of the strings I use as logger names have common leading
    substrings. Things like credit card and telephone numbers tend to have
    much more entropy in the trailing digits.
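
    To make the idea concrete, here's a toy pure-Python sketch (the
    function name eq_from_back is made up for illustration; a real
    implementation would of course live in C inside the interpreter):

    def eq_from_back(a, b):
        # Same quick test as above: unequal lengths can never be equal.
        if len(a) != len(b):
            return False
        # Walk from the last character towards the first.
        for i in range(len(a) - 1, -1, -1):
            if a[i] != b[i]:
                return False
        return True

    # "/home/roy/a.txt" vs "/home/roy/b.txt": scanning from the back hits
    # the differing character after 5 steps instead of 11 from the front.
    print(eq_from_back("/home/roy/a.txt", "/home/roy/b.txt"))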

    On the other hand, hostnames (and thus email addresses) exhibit the
    opposite pattern.

    Anyway, it's just a thought. Has anybody studied this for real-life
    usage patterns?

    I'm also not sure how this works with all the possible UCS/UTF encodings.
    With some of them, you may get the encoding semantics wrong if you don't
    start from the front.
     
    Roy Smith, Sep 4, 2012
    #1

  2. On Tue, Sep 4, 2012 at 11:54 AM, Roy Smith <> wrote:
    > I'm wondering if it might be faster to start at the ends of the strings
    > instead of at the beginning?


    > I'm also not sure how this works with all the possible UCS/UTF encodings.
    > With some of them, you may get the encoding semantics wrong if you don't
    > start from the front.


    No problem there; Python uses only fixed-width encodings. Also, any
    canonical encoding can be safely compared byte-for-byte; two identical
    Unicode strings will be bit-wise identical in (say) UTF-8.
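
    For instance, a quick Python-level sanity check of that last point
    (this is only an illustration, not what the interpreter does
    internally):

    # Two equal str objects produce bit-for-bit identical UTF-8, so a
    # memcmp-style comparison of their canonical encodings is safe.
    a = "caf\u00e9"   # 'café' spelled with an escape
    b = "café"        # the same text typed directly
    assert a == b
    assert a.encode("utf-8") == b.encode("utf-8")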

    There are issues of cache locality and such that quite possibly mean
    it's not going to be faster overall, but it wouldn't be difficult to
    tweak the Python sources, recompile, and run some tests. I'm sure
    python-dev or python-list will be most happy to discuss some benchmark
    figures!

    ChrisA
     
    Chris Angelico, Sep 4, 2012
    #2

  3. On Mon, 03 Sep 2012 21:54:01 -0400, Roy Smith wrote:

    > Let's assume you're testing two strings for equality. You've already
    > done the obvious quick tests (i.e they're the same length), and you're
    > down to the O(n) part of comparing every character.
    >
    > I'm wondering if it might be faster to start at the ends of the strings
    > instead of at the beginning? If the strings are indeed equal, it's the
    > same amount of work starting from either end. But, if it turns out that
    > for real-life situations, the ends of strings have more entropy than the
    > beginnings, the odds are you'll discover that they're unequal quicker by
    > starting at the end.


    And if the strings have the same end and different beginnings:

    running
    jumping
    cooking

    then you'll find out quicker by starting at the beginning.

    No general-purpose string comparison operator can make assumptions about
    the content of the strings, like "they are nearly always pathnames on
    Unix-like systems like /home/user/x and /home/user/y".

    I suppose you could have two different operators, == for "test from the
    front" and === for "test from the back", and push that decision onto the
    user, who will then spend hours angsting about the "most efficient"
    equality test and days painstakingly profiling his data to save four
    nanoseconds at runtime.

    Besides, then somebody will say "Yes, but what about the cases where the
    prefix and the suffix are both equal, but the middle will be different?"
    and propose a third string-equality operator ==== and then there will be
    bloodshed.

    On average, string equality needs to check half the characters in the
    string. Any individual comparisons may require fewer, or more, than that,
    but whichever way you scan the string, some comparisons will take more
    and some will take fewer. With no general way of telling ahead of time
    which will be which, on average you can't do better than O(N/2), and
    there is no complexity justification for picking "start from the end"
    over "start from the start".



    --
    Steven
     
    Steven D'Aprano, Sep 4, 2012
    #3
  4. Dan Sommers

    On 2012-09-04 at 02:17:30 +0000,
    Steven D'Aprano <> wrote:

    > Besides, then somebody will say "Yes, but what about the cases where
    > the prefix and the suffix are both equal, but the middle will be
    > different?" and propose a third string-equality operator ==== and
    > then there will be bloodshed.


    Lisp has several equality forms, although they're for different kinds or
    levels of equality rather than for different algorithms for determining
    that equality. I imagine that a discussion similar to this one spawned
    many of those forms. I agree with Steven: decisions such as this
    belong to the application developer, not the standard library. Don't
    forget the fourth string-equality operator for case-insensitive
    comparisons and the fifth string-equality operator for strings that are
    likely to be palindromes.

    That said, if I really wanted bloodshed, I would propose = for the third
    string-equality operator! ;-)

    Dan
     
    Dan Sommers, Sep 4, 2012
    #4
  5. Terry Reedy

    On 9/3/2012 9:54 PM, Roy Smith wrote:
    > There's been a bunch of threads lately about string implementations, and
    > that got me thinking (which is often a dangerous thing).
    >
    > Let's assume you're testing two strings for equality. You've already
    > done the obvious quick tests (i.e they're the same length), and you're
    > down to the O(n) part of comparing every character.
    >
    > I'm wondering if it might be faster to start at the ends of the strings
    > instead of at the beginning?


    I posted the 3.3 str (in)equality compare a few days ago. It first
    compares length and then kind, then the actual bytes in whatever chunks.
    If the latter uses the C/system memcmp, that is faster than anything
    hand-coded in Python or even plain C.
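
    A rough Python-level sketch of that order of checks (only an
    approximation of the C code: "kind" stands for the 1-, 2- or
    4-byte-per-character storage width, guessed here from the widest code
    point, and the final bulk compare stands in for memcmp):

    def _kind(s):
        # Approximate the PEP 393 "kind": bytes per stored character.
        widest = max(map(ord, s)) if s else 0
        return 1 if widest < 256 else 2 if widest < 65536 else 4

    def str_eq_sketch(a, b):
        if len(a) != len(b):
            return False
        if _kind(a) != _kind(b):
            # Different storage widths imply different contents.
            return False
        # The real code hands the raw buffers to memcmp(); comparing a
        # fixed-width encoding is the closest pure-Python stand-in.
        return a.encode("utf-32-le") == b.encode("utf-32-le")

    print(str_eq_sketch("spam", "spam"), str_eq_sketch("spam", "späm"))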

    --
    Terry Jan Reedy
     
    Terry Reedy, Sep 4, 2012
    #5
  6. On 04/09/2012 05:56, Dan Sommers wrote:
    >
    > That said, if I really wanted bloodshed, I would propose = for the third
    > string-equality operator! ;-)
    >
    > Dan
    >


    Dan "agent provocateur" Sommers? :)

    --
    Cheers.

    Mark Lawrence.
     
    Mark Lawrence, Sep 4, 2012
    #6
  7. On 04/09/2012 02:54, Roy Smith wrote:
    > There's been a bunch of threads lately about string implementations, and
    > that got me thinking (which is often a dangerous thing).
    >
    > Let's assume you're testing two strings for equality. You've already
    > done the obvious quick tests (i.e they're the same length), and you're
    > down to the O(n) part of comparing every character.
    >
    > I'm wondering if it might be faster to start at the ends of the strings
    > instead of at the beginning? If the strings are indeed equal, it's the
    > same amount of work starting from either end. But, if it turns out that
    > for real-life situations, the ends of strings have more entropy than the
    > beginnings, the odds are you'll discover that they're unequal quicker by
    > starting at the end.
    >
    > It doesn't seem implausible that this is the case. For example, most
    > of the filenames I work with begin with "/home/roy/". Most of the
    > strings I use as memcache keys have one of a small number of prefixes.
    > Most of the strings I use as logger names have common leading
    > substrings. Things like credit card and telephone numbers tend to have
    > much more entropy in the trailing digits.
    >
    > On the other hand, hostnames (and thus email addresses) exhibit the
    > opposite pattern.
    >
    > Anyway, it's just a thought. Has anybody studied this for real-life
    > usage patterns?
    >
    > I'm also not sure how this works with all the possible UCS/UTF encodings.
    > With some of them, you may get the encoding semantics wrong if you don't
    > start from the front.
    >


    Good grief, what timing!!! The dust has yet to settle over the
    "Flexible string representation, unicode, typography, ..." thread and
    you ask this. I'll just cross my fingers and hope that people stick
    with facts, realistic benchmarks etc., and not good old-fashioned FUD.

    --
    Cheers.

    Mark Lawrence.
     
    Mark Lawrence, Sep 4, 2012
    #7
  8. Roy Smith <> writes:

    > There's been a bunch of threads lately about string implementations, and
    > that got me thinking (which is often a dangerous thing).
    >
    > Let's assume you're testing two strings for equality. You've already
    > done the obvious quick tests (i.e they're the same length), and you're
    > down to the O(n) part of comparing every character.
    >
    > I'm wondering if it might be faster to start at the ends of the strings
    > instead of at the beginning? If the strings are indeed equal, it's the
    > same amount of work starting from either end. But, if it turns out that
    > for real-life situations, the ends of strings have more entropy than the
    > beginnings, the odds are you'll discover that they're unequal quicker by
    > starting at the end.


    I guess we all have examples with specific entropy distribution. Going
    backwards on long strings can be profitable with file paths. If that is
    the case, and if you spend a lot of time comparing strings, why not
    store them in reverse?

    > It doesn't seem implausible that this is the case. For example, most
    > of the filenames I work with begin with "/home/roy/". Most of the
    > strings I use as memcache keys have one of a small number of prefixes.
    > Most of the strings I use as logger names have common leading
    > substrings.


    In that case I would split the paths, and use some sort of string-id, or
    use intern() and then compare components with "is" instead of "==".
    (Basically, turning the string of chars into a string of
    ids/ints/pointers.)
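
    Something along these lines, as a made-up example (sys.intern is the
    Python 3 spelling of the old built-in intern(); the path is invented):

    import sys

    # Interned equal strings are the same object, so "is" degenerates to a
    # pointer comparison instead of a character-by-character scan.
    parts_a = [sys.intern(p) for p in "/home/roy/project/notes.txt".split("/")]
    parts_b = [sys.intern(p) for p in "/home/roy/project/notes.txt".split("/")]
    print(all(x is y for x, y in zip(parts_a, parts_b)))   # True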

    > Things like credit card and telephone numbers tend to have much more
    > entropy in the trailing digits. On the other hand, hostnames (and thus
    > email addresses) exhibit the opposite pattern.


    Yes, real-world is a mess.

    > Anyway, it's just a thought. Has anybody studied this for real-life
    > usage patterns?


    I've tried the "intern()" trick several times with lists of strings, and
    was satisfied, but this was in specific cases (with a finite set of
    strings). For "short" strings of chars (up to, say a hundred of
    characters), I've never found anything significantly faster than the
    default sweep (that was in C/C++, not python). For longer and/or more
    structured strings/lists, making use of the structure is a good idea,
    but I see no general pattern here (as you note).

    > I'm also not sure how this works with all the possible UCS/UTF encodings.
    > With some of them, you may get the encoding semantics wrong if you don't
    > start from the front.


    If you're testing for equality, I can't see how this could matter, even
    with variable-length encodings.

    If you're comparing different encodings, then you need different access
    methods to random characters (but this doesn't affect the algorithm). If
    you're using a variable-length encoding, e.g. UTF-8, accessing a
    character at a specific position is not possible in constant time.

    -- Alain.
     
    Alain Ketterlin, Sep 4, 2012
    #8
  9. On 04.09.2012 04:17, Steven D'Aprano wrote:

    > On average, string equality needs to check half the characters in the
    > string.


    How do you arrive at that conclusion? When comparing two random strings,
    I just derived

    n = (256 / 255) * (1 - 256 ^ (-c))

    where n is the average number of character comparisons and c is the
    string length. The rationale is as follows: the first character has to
    be compared in any case, the second with a probability of 1/256, the
    third with 1/(256^2), and so on.
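
    A quick numerical check of that closed form against the sum it comes
    from (reading c as the string length and 256 as the alphabet size):

    # Position i (0-based) is only reached if all i earlier positions
    # matched, which for uniform random bytes has probability (1/256)**i.
    def expected_direct(c, alphabet=256):
        return sum((1.0 / alphabet) ** i for i in range(c))

    def expected_closed(c, alphabet=256):
        return (alphabet / (alphabet - 1.0)) * (1.0 - alphabet ** (-c))

    for c in (1, 2, 8, 64):
        print(c, expected_direct(c), expected_closed(c))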

    Best regards,
    Johannes

     
    Johannes Bauer, Sep 4, 2012
    #9
  10. On Tue, 04 Sep 2012 18:32:57 +0200, Johannes Bauer wrote:

    > On 04.09.2012 04:17, Steven D'Aprano wrote:
    >
    >> On average, string equality needs to check half the characters in the
    >> string.

    >
    > How do you arrive at that conclusion?


    Take two non-empty strings of the same length, N. If the strings are
    equal, you have to make N comparisons to be sure they are equal (you
    check all N pairs of characters). If they are unequal, you have to check
    each pair of characters up to the first pair that are different:

    def string_equality(astr, bstr):
        # Assumes lengths are equal.
        for achr, bchr in zip(astr, bstr):
            if achr != bchr:
                return False
        return True

    For the unequal case, how many comparisons do you do? Ahead of time, we
    know very little about the strings. We don't even know how many possible
    characters there are (are they English alphanumeric, ASCII, Greek, Thai,
    or from the full range of 1114111 Unicode code points?) or what their
    probability distribution is.

    A reasonable, conservative assumption is to calculate the largest
    possible value of the average for random strings. That largest value
    occurs when the alphabet is as small as possible, namely two characters.
    In practice, strings come from a larger alphabet, up to 1114111 different
    characters for full Unicode strings, so the average for them will be less
    than the average we calculate now.

    So for unequal strings, the number of comparisons is equally likely to be
    1, 2, 3, ..., N. The average then is:

    sum([1, 2, 3, ..., N])/N

    which by a bit of simple algebra works out to be (N+1)/2, or half the
    characters as I said.
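
    Spelled out in code, under that uniformity assumption:

    # If the first mismatch is equally likely at each of the N positions,
    # the mean number of character comparisons is (N+1)/2.
    N = 10
    print(sum(range(1, N + 1)) / N, (N + 1) / 2)   # 5.5 5.5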

    (Note that this average assumes the strings are completely random. In
    practice, strings tend to come from strongly non-uniform distributions of
    characters. For instance, in English texts, 'e' is much more likely than
    'q' and most characters are vanishingly rare, and in practice, strings
    are likely to be highly non-random.)



    --
    Steven
     
    Steven D'Aprano, Sep 4, 2012
    #10
  11. On Wed, Sep 5, 2012 at 2:32 AM, Johannes Bauer <> wrote:
    > How do you arrive at that conclusion? When comparing two random strings,
    > I just derived
    >
    > n = (256 / 255) * (1 - 256 ^ (-c))
    >
    > where n is the average number of character comparisons and c. The
    > rationale as follows: The first character has to be compared in any
    > case. The second with a probability of 1/256, the third with 1/(256^2)
    > and so on.


    That would be for comparing two random areas of memory. Python strings
    don't have just 256 options per character, and in terms of actual text
    there are far more possibilities. The strings that a program is going to
    compare for equality are going to use a vastly restricted alphabet;
    for a lot of cases, there might be only a few dozen plausible
    characters.

    But even so, it's going to scale approximately linearly with the
    string length. If they're really random, then yes, there's little
    chance that either a 1MB string or a 2MB string will be the same, but
    with real data, they might very well have a long common prefix. So
    it's still going to be more or less O(n).

    ChrisA
     
    Chris Angelico, Sep 4, 2012
    #11
  12. Roy Smith

    MRAB Guest

    On 05/09/2012 03:18, Neil Hodgson wrote:
    > Roy Smith:
    >
    >> I'm wondering if it might be faster to start at the ends of the strings
    >> instead of at the beginning? If the strings are indeed equal, it's the
    >> same amount of work starting from either end.

    >
    > Most people write loops that go forwards. This leads to the
    > processor designers prioritizing detection mechanisms like cache
    > prefetching for that case.
    >
    > However, it's not always the case: a couple of years ago Intel
    > contributed a memcpy implementation to glibc that went backwards for a
    > performance improvement. An explanation from a related patch involves
    > speculative and committed operations and address bits on some processors
    > and quickly lost me:
    > paragraph 3 of
    > http://lists.freedesktop.org/archives/pixman/2010-August/000423.html
    >
    > The memcpy patch was controversial as it broke Adobe Flash which
    > assumed memcpy was safe like memmove.
    >

    I don't think I've ever used memcpy, only memmove.
     
    MRAB, Sep 5, 2012
    #12
  13. Roy Smith

    In article <>,
    Neil Hodgson <> wrote:

    > The memcpy patch was controversial as it broke Adobe Flash


    An added benefit!
     
    Roy Smith, Sep 5, 2012
    #13
  14. Peter Otten

    Roy Smith wrote:

    > There's been a bunch of threads lately about string implementations, and
    > that got me thinking (which is often a dangerous thing).
    >
    > Let's assume you're testing two strings for equality. You've already
    > done the obvious quick tests (i.e they're the same length), and you're
    > down to the O(n) part of comparing every character.
    >
    > I'm wondering if it might be faster to start at the ends of the strings
    > instead of at the beginning? If the strings are indeed equal, it's the
    > same amount of work starting from either end. But, if it turns out that
    > for real-life situations, the ends of strings have more entropy than the
    > beginnings, the odds are you'll discover that they're unequal quicker by
    > starting at the end.
    >
    > It doesn't seem implausible that this is the case. For example, most
    > of the filenames I work with begin with "/home/roy/". Most of the
    > strings I use as memcache keys have one of a small number of prefixes.
    > Most of the strings I use as logger names have common leading
    > substrings. Things like credit card and telephone numbers tend to have
    > much more entropy in the trailing digits.


    But do you actually compare them with each other? Even if you do, I'd guess
    that Python looks at the hash values most of the time.

    > On the other hand, hostnames (and thus email addresses) exhibit the
    > opposite pattern.


    Nor do English words:

    $ python3 reverse_compare2.py compare /usr/share/dict/words --word-len 8

    comparing every pair in a sample of 1000 8-char words
    taken from '/usr/share/dict/words'

    head
    1: 477222 ************************************************************
    2: 18870 **
    3: 2870
    4: 435
    5: 74
    6: 17
    7: 12

    tail
    1: 386633 ************************************************************
    2: 74966 ***********
    3: 29698 ****
    4: 6536 *
    5: 1475
    6: 154
    7: 28
    8: 10

    > Anyway, it's just a thought. Has anybody studied this for real-life
    > usage patterns?
    >
    > I'm also not sure how this works with all the possible UCS/UTF encodings.
    > With some of them, you may get the encoding semantics wrong if you don't
    > start from the front.
     
    Peter Otten, Sep 5, 2012
    #14
  15. On Wed, Sep 5, 2012 at 6:29 PM, Peter Otten <> wrote:
    > comparing every pair in a sample of 1000 8-char words
    > taken from '/usr/share/dict/words'
    >
    > head
    > 1: 477222 ************************************************************
    > 2: 18870 **
    > ...


    Not understanding this. What are the statistics, and what (if it's not
    obvious from the previous answer) do they prove?

    ChrisA
     
    Chris Angelico, Sep 5, 2012
    #15
  16. On 04.09.2012 20:07, Steven D'Aprano wrote:
    > A reasonable, conservative assumption is to calculate the largest
    > possible value of the average for random strings. That largest value
    > occurs when the alphabet is as small as possible, namely two characters.
    > In practice, strings come from a larger alphabet, up to 1114111 different
    > characters for full Unicode strings, so the average for them will be less
    > than the average we calculate now.
    >
    > So for unequal strings, the number of comparisons is equally likely to be
    > 1, 2, 3, ..., N. The average then is:


    That is exactly what I don't agree with. For unequal random strings, the
    number of comparisons is *not* uniformly distributed. For demonstration,
    let's look at bitstrings and the string "1010". There are 15 bitstrings of
    the same length which are unequal to it, and I'll put the number of
    comparisons needed (going left to right) next to each:

    0000 1
    0001 1
    0010 1
    0011 1
    0100 1
    0101 1
    0110 1
    0111 1
    1100 2
    1101 2
    1110 2
    1111 2
    1000 3
    1001 3
    1011 4

    i.e. about 1.73 (26/15) comparisons on average with random strings,
    compared to the 2.5 comparisons you end up with. My formula does respect
    lazy termination (because the probability of needing more comparisons
    drops off exponentially).
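
    The enumeration is easy to check by brute force:

    # Compare "1010" against every other 4-bit string, counting characters
    # examined up to and including the first mismatch.
    from itertools import product

    target = "1010"
    counts = []
    for bits in product("01", repeat=4):
        s = "".join(bits)
        if s == target:
            continue
        counts.append(next(i + 1 for i, (x, y) in enumerate(zip(target, s))
                           if x != y))

    print(sum(counts), len(counts), sum(counts) / len(counts))   # 26 15 1.733...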

    > sum([1, 2, 3, ..., N])/N
    >
    > which by a bit of simple algebra works out to be (N+1)/2, or half the
    > characters as I said.


    Yes, but it's a flawed assumption you're making since you're neglecting
    lazy evaluation and early abort of comparison.

    Best regards,
    Johannes

     
    Johannes Bauer, Sep 5, 2012
    #16
  17. On 04.09.2012 23:59, Chris Angelico wrote:

    >> n = (256 / 255) * (1 - 256 ^ (-c))
    >>
    >> where n is the average number of character comparisons and c. The
    >> rationale as follows: The first character has to be compared in any
    >> case. The second with a probability of 1/256, the third with 1/(256^2)
    >> and so on.

    >
    > That would be for comparing two random areas of memory. Python strings
    > don't have 256 options per character; and in terms of actual strings,
    > there's so many possibilities.


    The unit of comparison is completely arbitrary; the formula can equally
    be used to compare bitstrings if the alphabet size is changed accordingly.

    > The strings that a program is going to
    > compare for equality are going to use a vastly restricted alphabet;
    > for a lot of cases, there might be only a few dozen plausible
    > characters.


    This is very true. But if so, I would like to see the assumption that
    you're making (which might very well be plausible) in order to arrive at
    an average of N/2 comparisons (which just seems *way* off).

    > But even so, it's going to scale approximately linearly with the
    > string length. If they're really random, then yes, there's little
    > chance that either a 1MB string or a 2MB string will be the same, but
    > with real data, they might very well have a long common prefix. So
    > it's still going to be more or less O(n).


    I understand what you're saying, and surely this is true to some extent
    (someone mentioned filenames, which is an excellent example). However, in
    order to derive a *formula* you need to formalize your assumptions.
    With random data, it's definitely not O(n). And even in reality (with a
    more limited alphabet) it will usually abort early:

    Consider sorting a large dictionary. Sorting is in O(n log(n)), but this
    assumes O(1) comparisons! If the comparison operation itself were in
    O(n), the total sorting complexity would be O(n^2 log(n)), which is
    definitely false. Most comparisons will abort *very* early (after the
    first character).

    Best regards,
    Johannes

     
    Johannes Bauer, Sep 5, 2012
    #17
  18. On 05.09.2012 11:24, Johannes Bauer wrote:

    > Consider sorting a large dictionary. Sorting is in O(n log(n)), but this
    > assumes O(1) comparisons! If the comparison operation itself were in
    > O(n), the total sorting complexity would be O(n^2 log(n)), which is
    > definitely false. Most comparisons will abort *very* early (after the
    > first character).


    I've just written a program to demonstrate this (attached below). I
    used my aspell dictionary, which contains about 100k words. To keep the
    runtime down I went for only the word length which occurs most often (in
    this case 9 characters, or ~13k words). Here are the results (note it's
    randomized, so they'll always differ a little, but are pretty stable):

    Max Cmp: 100
    Sum Cmp: 32111
    Avg Cmp: 2.393842254361115
    Num words : 13414
    Min wordlen: 9
    Max wordlen: 9
    Avg wordlen: 9.0

    Going up in word length (only showing part of the output now):

    Avg Cmp: 2.2374929735806632
    Num words : 10674
    Avg wordlen: 10.0

    Avg Cmp: 2.261727078891258
    Num words : 7504
    Avg wordlen: 11.0

    Avg Cmp: 2.2335647202939977
    Num words : 4898
    Avg wordlen: 12.0

    Avg Cmp: 2.1743341404358354
    Num words : 2891
    Avg wordlen: 13.0

    Avg Cmp: 2.156782549420586
    Num words : 1467
    Avg wordlen: 14.0

    Avg Cmp: 2.112449799196787
    Num words : 747
    Avg wordlen: 15.0

    So, interestingly, in this real-world case(tm), the complexity does not
    scale with O(n). It actually goes down (which is probably due to the
    limited number of words of greater length).

    In summary, let's please not debate "feelings" or such. Either
    measurements (with clear indication of what data they're based on and
    how the test was conducted) or derivations (with an explanation of the
    assumptions). Feelings can be very misleading in cases like this.

    Best regards,
    Johannes



    #!/usr/bin/python3
    import random

    class Word():
        def __init__(self, s):
            self._s = s
            self._c = 0

        def __lt__(self, other):
            # Shorter strings sort first; equal lengths fall through to a
            # character-by-character scan that counts comparisons in _c.
            if len(self) < len(other):
                return True
            elif len(self) > len(other):
                return False

            for i in range(len(self)):
                self._c += 1
                if self._s[i] < other._s[i]:
                    return True
            return False

        def __repr__(self):
            return "%s<%d>" % (self._s, self._c)

        def __len__(self):
            return len(self._s)

        def getcmp(self):
            return self._c

    wordlist = [ ]
    for line in open("dict", "r"):
        word = Word(line[:-1])
        if len(word) == 9:
            wordlist.append(word)

    random.shuffle(wordlist)
    wordlist.sort()

    comparisons = [ word.getcmp() for word in wordlist ]
    print("Max Cmp:", max(comparisons))
    print("Sum Cmp:", sum(comparisons))
    print("Avg Cmp:", sum(comparisons) / len(comparisons))

    wordlengths = [ len(word) for word in wordlist ]
    print("Min wordlen:", min(wordlengths))
    print("Max wordlen:", max(wordlengths))
    print("Avg wordlen:", sum(wordlengths) / len(wordlengths))


     
    Johannes Bauer, Sep 5, 2012
    #18
  19. Roy Smith

    Peter Otten Guest

    Chris Angelico wrote:

    > On Wed, Sep 5, 2012 at 6:29 PM, Peter Otten <> wrote:
    >> comparing every pair in a sample of 1000 8-char words
    >> taken from '/usr/share/dict/words'
    >>
    >> head
    >> 1: 477222 ************************************************************
    >> 2: 18870 **
    >> ...


    >> tail
    >> 1: 386633 ************************************************************
    >> 2: 74966 ***********


    > Not understanding this. What are the statistics,


    I measured how many chars they have in common for all combinations of 1000
    words taken from /usr/share/dict/words.

    477222 pairs have one char in common, 18870 pairs have two chars in common
    when compared from the *beginning* of the string.

    386633 pairs have one char in common, 74966 pairs have two chars in common
    when compared from the *end* of the string.

    > and what (if it's not obvious from the previous answer) do they prove?

    They demonstrate that for two words from that particular corpus it is likely
    that a[::-1] == b[::-1] has to take more (1.34 versus 1.05 on average)
    characters into account than a == b, i.e. comparing from the back should be
    slower rather than faster.

    If that doesn't help, here's the code ;)

    import random
    import itertools
    from contextlib import contextmanager
    from collections import defaultdict

    @contextmanager
    def open_corpus(args):
        with open(args.name) as instream:
            words = (line.strip() for line in instream)
            #words = (word for word in words if not word.endswith("'s"))
            yield words

    def pick(items, count):
        items = list(items)
        return random.sample(items, count)

    def count_common(a, b):
        i = 0
        for i, (x, y) in enumerate(zip(a, b), 1):
            if x != y:
                break
        return i

    def corpus(args):
        with open_corpus(args) as words:
            wordlens = (len(line.strip()) for line in words)
            freq = defaultdict(int)
            for wordlen in wordlens:
                freq[wordlen] += 1
            show_histogram(freq)

    def show_histogram(freq):
        peak = max(freq.values())
        freq_width = len(str(peak))
        max_freq = max(freq)
        len_width = len(str(max_freq))
        for i in range(min(freq), max_freq+1):
            value = freq[i]
            print("{:{}}: {:{}} {}".format(
                i, len_width, value,
                freq_width, "*" * (60 * value // peak)))

    def compare(args):
        with open_corpus(args) as words:
            words = pick(
                (word for word in words if len(word) == args.word_len),
                args.sample_size)
            n = 0
            head_common = defaultdict(int)
            tail_common = defaultdict(int)
            n_tail = n_head = 0
            for n, (a, b) in enumerate(itertools.combinations(words, 2), 1):
                cc = count_common(a, b)
                n_head += cc
                head_common[cc] += 1
                ccr = count_common(a[::-1], b[::-1])
                n_tail += ccr
                tail_common[ccr] += 1

            print(n, "combinations")
            print("average common chars (head) {:.3}".format(n_head/n))
            print("average common chars (tail) {:.3}".format(n_tail/n))

            print("comparing every pair in a "
                  "sample of {sample} {len}-char words\n"
                  "taken from {name!r}".format(
                      sample=args.sample_size,
                      len=args.word_len,
                      name=args.name))
            print()
            print("head")
            show_histogram(head_common)
            print()
            print("tail")
            show_histogram(tail_common)

    def main():
        import argparse
        parser = argparse.ArgumentParser()
        parsers = parser.add_subparsers()

        pcorpus = parsers.add_parser("corpus")
        pcorpus.add_argument("name")
        pcorpus.set_defaults(func=corpus)

        pcompare = parsers.add_parser("compare")
        pcompare.add_argument("name")
        pcompare.add_argument("--sample-size", type=int, default=1000)
        pcompare.add_argument("--word-len", type=int, default=8)
        pcompare.set_defaults(func=compare)
        args = parser.parse_args()
        args.func(args)

    if __name__ == "__main__":
        main()
     
    Peter Otten, Sep 5, 2012
    #19
  20. On Wed, 05 Sep 2012 11:43:08 +0200, Johannes Bauer wrote:

    > On 05.09.2012 11:24, Johannes Bauer wrote:
    >
    >> Consider sorting a large dictionary. Sorting is in O(n log(n)), but
    >> this assumes O(1) comparisons! If the comparison operation itself were
    >> in O(n), the total sorting complexity would be O(n^2 log(n)),


    This shows some significant confusion about Big Oh complexity analysis
    here. When you say that sorting the list of words is O(N log N), the N
    refers to the number of words in the dictionary. But when you speak about
    comparisons being O(N), the N doesn't refer to the size of the dict, but
    the size of the individual strings being compared. There is no reasonable
    comparison algorithm where the effort required to compare two strings
    depends on the size of the dictionary they have come from.

    Since these are *completely different Ns*, you can't combine them to get
    O(N**2 log N) as the overall algorithmic complexity. Doing so is sheer
    nonsense. If you state it like this:

    sorting the list of words is O(N log N) where N = length of the list
    comparing the words is O(M) where M = the average length of the words

    then the overall complexity is O(N log N)*O(M), but since M is a constant
    for any particular dictionary (it's just the average length of the words)
    that reduces down to O(N log N).

    To say that sorting a dictionary is O(N log N) does *not* imply that
    string comparisons are O(1). It only implies that string comparisons
    don't depend on the size of the dictionary. Comparing:

    s1 = 'a'*30002
    s2 = 'a'*30001 + 'b'

    takes the same amount of time whether they are in a dictionary of 100
    words or one of 100000 words. For the purposes of calculating the
    algorithmic complexity of sorting a large list of words, we can treat
    comparisons as an elementary operation even though they actually aren't.
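
    A rough timeit sketch of that point (the numbers will vary from
    machine to machine; what matters is that the surrounding list plays no
    role in the cost of a single comparison):

    import timeit

    s1 = 'a' * 30002
    s2 = 'a' * 30001 + 'b'
    u1 = 'a' * 302
    u2 = 'a' * 301 + 'b'

    # Cost per comparison tracks the strings being compared...
    print(timeit.timeit(lambda: s1 == s2, number=20000))
    print(timeit.timeit(lambda: u1 == u2, number=20000))
    # ...and neither number changes if s1 and s2 happen to be members of a
    # 100-element or a 100000-element list.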


    >> which is
    >> definitely false. Most comparisons will abort *very* early (after the
    >> first character).


    You are making unjustified assumptions about the distribution of letters
    in the words. This might be a list of long chemical compounds where the
    words typically differ only in their suffix. It might be a list of people
    with titles:

    Herr Professor Frederick Schmidt
    Herr Professor Frederick Wagner
    ....

    But either way, it doesn't matter for the Big Oh behaviour of sorting the
    words.


    > I've just written a program to demonstrate this (below it's attached).


    I have no idea why you think this program demonstrates anything about
    string equality comparisons. What you are counting is not the average
    number of comparisons for string equality, but the number of comparisons
    when sorting. These are *completely different*, because sorting a list
    does not require testing each string against every other string.

    Whatever you have measured, it has nothing to do with the Big Oh
    complexity of string comparisons.

    It seems to me that you have misunderstood the purpose of Big Oh
    notation. It gives an asymptotic upper bound on the amount of work done,
    ignoring all lower terms and multiplicative factors, not an exact formula.

    If something is O(N), then this tells you:

    1) the number of algorithmic steps is proportional to N, plus or
    minus some terms which are less than N (e.g. sqrt(N), or log(N),
    or 1/N, or some constant, etc.);

    2) if you ignore those other terms, and keep all other influences
    equal, then doubling N will *at worst* double the number of
    algorithmic steps;

    3) if all those algorithmic steps take exactly the same amount of
    time, then doubling N will take twice as much time.

    But of course *none* of those assumptions hold for measurements of actual
    elapsed time. In real life, operations rarely take constant time, you
    don't always see worst case behaviour, and the lower terms are not
    necessarily insignificant enough to ignore.


    [...]
    > So, interestingly, in this real-world case(tm), the complexity does not
    > scale with O(n). It actually goes down (which is probably due to the
    > limit amount of words of higher length).


    I would guess that it probably has more to do with the ability of the
    timsort algorithm to exploit local order within a list and avoid
    performing comparisons.


    > In summary, let's please not debate "feelings" or such.


    I have no idea what feelings you are referring to.



    --
    Steven
     
    Steven D'Aprano, Sep 5, 2012
    #20
