Comparing strings from the back?

Discussion in 'Python' started by Roy Smith, Sep 4, 2012.

  1. Roy Smith

    There's been a bunch of threads lately about string implementations, and
    that got me thinking (which is often a dangerous thing).

    Let's assume you're testing two strings for equality. You've already
    done the obvious quick tests (i.e., they're the same length), and you're
    down to the O(n) part of comparing every character.

    I'm wondering if it might be faster to start at the ends of the strings
    instead of at the beginning? If the strings are indeed equal, it's the
    same amount of work starting from either end. But, if it turns out that
    for real-life situations, the ends of strings have more entropy than the
    beginnings, the odds are you'll discover that they're unequal quicker by
    starting at the end.

    It doesn't seem implausible that this is the case. For example, most
    of the filenames I work with begin with "/home/roy/". Most of the
    strings I use as memcache keys have one of a small number of prefixes.
    Most of the strings I use as logger names have common leading
    substrings. Things like credit card and telephone numbers tend to have
    much more entropy in the trailing digits.
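
    To make the idea concrete, here's a toy pure-Python sketch (the
    function name eq_from_back is made up for illustration; a real
    implementation would of course live in C inside the interpreter):

    def eq_from_back(a, b):
        # Same quick test as above: unequal lengths can never be equal.
        if len(a) != len(b):
            return False
        # Walk from the last character towards the first.
        for i in range(len(a) - 1, -1, -1):
            if a[i] != b[i]:
                return False
        return True

    # "/home/roy/a.txt" vs "/home/roy/b.txt": scanning from the back hits
    # the differing character after 5 steps instead of 11 from the front.
    print(eq_from_back("/home/roy/a.txt", "/home/roy/b.txt"))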

    On the other hand, hostnames (and thus email addresses) exhibit the
    opposite pattern.

    Anyway, it's just a thought. Has anybody studied this for real-life
    usage patterns?

    I'm also not sure how this works with all the possible UCS/UTF encodings.
    With some of them, you may get the encoding semantics wrong if you don't
    start from the front.
     
    Roy Smith, Sep 4, 2012
    #1

  2. On Tue, Sep 4, 2012 at 11:54 AM, Roy Smith <> wrote:
    > I'm wondering if it might be faster to start at the ends of the strings
    > instead of at the beginning?


    > I'm also not sure how this works with all the possible UCS/UTF encodings.
    > With some of them, you may get the encoding semantics wrong if you don't
    > start from the front.


    No problem there; Python uses only fixed-width encodings. Also, any
    canonical encoding can be safely compared byte-for-byte; two identical
    Unicode strings will be bit-wise identical in (say) UTF-8.
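
    For instance, a quick Python-level sanity check of that last point
    (this is only an illustration, not what the interpreter does
    internally):

    # Two equal str objects produce bit-for-bit identical UTF-8, so a
    # memcmp-style comparison of their canonical encodings is safe.
    a = "caf\u00e9"   # 'café' spelled with an escape
    b = "café"        # the same text typed directly
    assert a == b
    assert a.encode("utf-8") == b.encode("utf-8")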

    There are issues of cache locality and such that quite possibly mean
    it's not going to be faster overall, but it wouldn't be difficult to
    tweak the Python sources, recompile, and run some tests. I'm sure
    python-dev or python-list will be most happy to discuss some benchmark
    figures!

    ChrisA
     
    Chris Angelico, Sep 4, 2012
    #2

  3. On Mon, 03 Sep 2012 21:54:01 -0400, Roy Smith wrote:

    > Let's assume you're testing two strings for equality. You've already
    > done the obvious quick tests (i.e they're the same length), and you're
    > down to the O(n) part of comparing every character.
    >
    > I'm wondering if it might be faster to start at the ends of the strings
    > instead of at the beginning? If the strings are indeed equal, it's the
    > same amount of work starting from either end. But, if it turns out that
    > for real-life situations, the ends of strings have more entropy than the
    > beginnings, the odds are you'll discover that they're unequal quicker by
    > starting at the end.


    And if the strings have the same end and different beginnings:

    running
    jumping
    cooking

    then you'll find out quicker by starting at the beginning.

    No general-purpose string comparison operator can make assumptions about
    the content of the strings, like "they are nearly always pathnames on
    Unix-like systems like /home/user/x and /home/user/y".

    I suppose you could have two different operators, == for "test from the
    front" and === for "test from the back", and push that decision onto the
    user, who will then spend hours angsting about the "most efficient"
    equality test and days painstakingly profiling his data to save four
    nanoseconds at runtime.

    Besides, then somebody will say "Yes, but what about the cases where the
    prefix and the suffix are both equal, but the middle will be different?"
    and propose a third string-equality operator ==== and then there will be
    bloodshed.

    On average, string equality needs to check half the characters in the
    string. Any individual comparisons may require fewer, or more, than that,
    but whichever way you scan the string, some comparisons will take more
    and some will take fewer. With no general way of telling ahead of time
    which will be which, on average you can't do better than O(N/2), and
    there is no complexity justification for picking "start from the end"
    over "start from the start".



    --
    Steven
     
    Steven D'Aprano, Sep 4, 2012
    #3
  4. Dan Sommers

    On 2012-09-04 at 02:17:30 +0000,
    Steven D'Aprano <> wrote:

    > Besides, then somebody will say "Yes, but what about the cases where
    > the prefix and the suffix are both equal, but the middle will be
    > different?" and propose a third string-equality operator ==== and
    > then there will be bloodshed.


    Lisp has several equality forms, although they're for different kinds or
    levels of equality rather than for different algorithms for determining
    that equality. I imagine that a discussion similar to this one spawned
    many of those forms. I agree with Steven: decisions such as this
    belong to the application developer, not the standard library. Don't
    forget the fourth string-equality operator for case-insensitive
    comparisons and the fifth string-equality operator for strings that are
    likely to be palindromes.

    That said, if I really wanted bloodshed, I would propose = for the third
    string-equality operator! ;-)

    Dan
     
    Dan Sommers, Sep 4, 2012
    #4
  5. Terry Reedy

    On 9/3/2012 9:54 PM, Roy Smith wrote:
    > There's been a bunch of threads lately about string implementations, and
    > that got me thinking (which is often a dangerous thing).
    >
    > Let's assume you're testing two strings for equality. You've already
    > done the obvious quick tests (i.e they're the same length), and you're
    > down to the O(n) part of comparing every character.
    >
    > I'm wondering if it might be faster to start at the ends of the strings
    > instead of at the beginning?


    I posted the 3.3 str (in)equality compare a few days ago. It first
    compares length and then kind, then the actual bytes in whatever chunks.
    If the latter uses the C/system memcmp, that is faster than anything
    hand-coded in Python or even plain C.
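
    A rough Python-level sketch of that order of checks (only an
    approximation of the C code: "kind" stands for the 1-, 2- or
    4-byte-per-character storage width, guessed here from the widest code
    point, and the final bulk compare stands in for memcmp):

    def _kind(s):
        # Approximate the PEP 393 "kind": bytes per stored character.
        widest = max(map(ord, s)) if s else 0
        return 1 if widest < 256 else 2 if widest < 65536 else 4

    def str_eq_sketch(a, b):
        if len(a) != len(b):
            return False
        if _kind(a) != _kind(b):
            # Different storage widths imply different contents.
            return False
        # The real code hands the raw buffers to memcmp(); comparing a
        # fixed-width encoding is the closest pure-Python stand-in.
        return a.encode("utf-32-le") == b.encode("utf-32-le")

    print(str_eq_sketch("spam", "spam"), str_eq_sketch("spam", "späm"))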

    --
    Terry Jan Reedy
     
    Terry Reedy, Sep 4, 2012
    #5
  6. On 04/09/2012 05:56, Dan Sommers wrote:
    >
    > That said, if I really wanted bloodshed, I would propose = for the third
    > string-equality operator! ;-)
    >
    > Dan
    >


    Dan "agent provocateur" Sommers? :)

    --
    Cheers.

    Mark Lawrence.
     
    Mark Lawrence, Sep 4, 2012
    #6
  7. On 04/09/2012 02:54, Roy Smith wrote:
    > There's been a bunch of threads lately about string implementations, and
    > that got me thinking (which is often a dangerous thing).
    >
    > Let's assume you're testing two strings for equality. You've already
    > done the obvious quick tests (i.e they're the same length), and you're
    > down to the O(n) part of comparing every character.
    >
    > I'm wondering if it might be faster to start at the ends of the strings
    > instead of at the beginning? If the strings are indeed equal, it's the
    > same amount of work starting from either end. But, if it turns out that
    > for real-life situations, the ends of strings have more entropy than the
    > beginnings, the odds are you'll discover that they're unequal quicker by
    > starting at the end.
    >
    > It doesn't seem implausible that this is the case. For example, most
    > of the filenames I work with begin with "/home/roy/". Most of the
    > strings I use as memcache keys have one of a small number of prefixes.
    > Most of the strings I use as logger names have common leading
    > substrings. Things like credit card and telephone numbers tend to have
    > much more entropy in the trailing digits.
    >
    > On the other hand, hostnames (and thus email addresses) exhibit the
    > opposite pattern.
    >
    > Anyway, it's just a thought. Has anybody studied this for real-life
    > usage patterns?
    >
    > I'm also not sure how this works with all the possible UCS/UTF encodings.
    > With some of them, you may get the encoding semantics wrong if you don't
    > start from the front.
    >


    Good grief, what timing!!! The dust has yet to settle over the
    "Flexible string representation, unicode, typography, ..." thread and
    you ask this. I'll just cross my fingers and hope that people stick
    with facts, realistic benchmarks etc., and not good old-fashioned FUD.

    --
    Cheers.

    Mark Lawrence.
     
    Mark Lawrence, Sep 4, 2012
    #7
  8. Roy Smith <> writes:

    > There's been a bunch of threads lately about string implementations, and
    > that got me thinking (which is often a dangerous thing).
    >
    > Let's assume you're testing two strings for equality. You've already
    > done the obvious quick tests (i.e they're the same length), and you're
    > down to the O(n) part of comparing every character.
    >
    > I'm wondering if it might be faster to start at the ends of the strings
    > instead of at the beginning? If the strings are indeed equal, it's the
    > same amount of work starting from either end. But, if it turns out that
    > for real-life situations, the ends of strings have more entropy than the
    > beginnings, the odds are you'll discover that they're unequal quicker by
    > starting at the end.


    I guess we all have examples with specific entropy distribution. Going
    backwards on long strings can be profitable with file paths. If that is
    the case, and if you spend a lot of time comparing strings, why not
    store them in reverse?

    > It doesn't seem implausible that this is the case. For example, most
    > of the filenames I work with begin with "/home/roy/". Most of the
    > strings I use as memcache keys have one of a small number of prefixes.
    > Most of the strings I use as logger names have common leading
    > substrings.


    In that case I would split the paths, and use some sort of string-id, or
    use intern() and then compare components with "is" instead of "==".
    (Basically, turning the string of chars into a string of
    ids/ints/pointers.)
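
    Something along these lines, as a made-up example (sys.intern is the
    Python 3 spelling of the old built-in intern(); the path is invented):

    import sys

    # Interned equal strings are the same object, so "is" degenerates to a
    # pointer comparison instead of a character-by-character scan.
    parts_a = [sys.intern(p) for p in "/home/roy/project/notes.txt".split("/")]
    parts_b = [sys.intern(p) for p in "/home/roy/project/notes.txt".split("/")]
    print(all(x is y for x, y in zip(parts_a, parts_b)))   # True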

    > Things like credit card and telephone numbers tend to have much more
    > entropy in the trailing digits. On the other hand, hostnames (and thus
    > email addresses) exhibit the opposite pattern.


    Yes, real-world is a mess.

    > Anyway, it's just a thought. Has anybody studied this for real-life
    > usage patterns?


    I've tried the "intern()" trick several times with lists of strings, and
    was satisfied, but this was in specific cases (with a finite set of
    strings). For "short" strings of chars (up to, say a hundred of
    characters), I've never found anything significantly faster than the
    default sweep (that was in C/C++, not python). For longer and/or more
    structured strings/lists, making use of the structure is a good idea,
    but I see no general pattern here (as you note).

    > I'm also not sure how this works with all the possible UCS/UTF encodings.
    > With some of them, you may get the encoding semantics wrong if you don't
    > start from the front.


    If you're testing for equality, I can't see how this could matter, even
    with variable-length encodings.

    If you're comparing different encodings, then you need different access
    methods to random characters (but this doesn't affect the algorithm). If
    you're using a variable-length encoding, e.g. UTF-8, accessing a
    character at a specific position is not possible in constant time.

    -- Alain.
     
    Alain Ketterlin, Sep 4, 2012
    #8
  9. On 04.09.2012 04:17, Steven D'Aprano wrote:

    > On average, string equality needs to check half the characters in the
    > string.


    How do you arrive at that conclusion? When comparing two random strings,
    I just derived

    n = (256 / 255) * (1 - 256 ^ (-c))

    where n is the average number of character comparisons and c is the
    string length. The rationale is as follows: the first character has to
    be compared in any case, the second with a probability of 1/256, the
    third with 1/(256^2), and so on.
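
    A quick numerical check of that closed form against the sum it comes
    from (reading c as the string length and 256 as the alphabet size):

    # Position i (0-based) is only reached if all i earlier positions
    # matched, which for uniform random bytes has probability (1/256)**i.
    def expected_direct(c, alphabet=256):
        return sum((1.0 / alphabet) ** i for i in range(c))

    def expected_closed(c, alphabet=256):
        return (alphabet / (alphabet - 1.0)) * (1.0 - alphabet ** (-c))

    for c in (1, 2, 8, 64):
        print(c, expected_direct(c), expected_closed(c))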

    Best regards,
    Johannes

     
    Johannes Bauer, Sep 4, 2012
    #9
  10. On Tue, 04 Sep 2012 18:32:57 +0200, Johannes Bauer wrote:

    > On 04.09.2012 04:17, Steven D'Aprano wrote:
    >
    >> On average, string equality needs to check half the characters in the
    >> string.

    >
    > How do you arrive at that conclusion?


    Take two non-empty strings of the same length, N. If the strings are
    equal, you have to make N comparisons to be sure they are equal (you
    check all N pairs of characters). If they are unequal, you have to check
    each pair of characters up to the first pair that are different:

    def string_equality(astr, bstr):
        # Assumes lengths are equal.
        for achr, bchr in zip(astr, bstr):
            if achr != bchr:
                return False
        return True

    For the unequal case, how many comparisons do you do? Ahead of time, we
    know very little about the strings. We don't even know how many possible
    characters there are (are they English alphanumeric, ASCII, Greek, Thai,
    or from the full range of 1114111 Unicode code points?) or what their
    probability distribution is.

    A reasonable, conservative assumption is to calculate the largest
    possible value of the average for random strings. That largest value
    occurs when the alphabet is as small as possible, namely two characters.
    In practice, strings come from a larger alphabet, up to 1114111 different
    characters for full Unicode strings, so the average for them will be less
    than the average we calculate now.

    So for unequal strings, the number of comparisons is equally likely to be
    1, 2, 3, ..., N. The average then is:

    sum([1, 2, 3, ..., N])/N

    which by a bit of simple algebra works out to be (N+1)/2, or half the
    characters as I said.
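
    Spelled out in code, under that uniformity assumption:

    # If the first mismatch is equally likely at each of the N positions,
    # the mean number of character comparisons is (N+1)/2.
    N = 10
    print(sum(range(1, N + 1)) / N, (N + 1) / 2)   # 5.5 5.5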

    (Note that this average assumes the strings are completely random. In
    practice, strings tend to come from strongly non-uniform distributions of
    characters. For instance, in English texts, 'e' is much more likely than
    'q' and most characters are vanishingly rare, and in practice, strings
    are likely to be highly non-random.)



    --
    Steven
     
    Steven D'Aprano, Sep 4, 2012
    #10
  11. On Wed, Sep 5, 2012 at 2:32 AM, Johannes Bauer <> wrote:
    > How do you arrive at that conclusion? When comparing two random strings,
    > I just derived
    >
    > n = (256 / 255) * (1 - 256 ^ (-c))
    >
    > where n is the average number of character comparisons and c. The
    > rationale as follows: The first character has to be compared in any
    > case. The second with a probability of 1/256, the third with 1/(256^2)
    > and so on.


    That would be for comparing two random areas of memory. Python strings
    don't have just 256 options per character, and in terms of actual text
    there are far more possibilities. The strings that a program is going to
    compare for equality are going to use a vastly restricted alphabet;
    for a lot of cases, there might be only a few dozen plausible
    characters.

    But even so, it's going to scale approximately linearly with the
    string length. If they're really random, then yes, there's little
    chance that either a 1MB string or a 2MB string will be the same, but
    with real data, they might very well have a long common prefix. So
    it's still going to be more or less O(n).

    ChrisA
     
    Chris Angelico, Sep 4, 2012
    #11
  12. Roy Smith

    MRAB Guest

    On 05/09/2012 03:18, Neil Hodgson wrote:
    > Roy Smith:
    >
    >> I'm wondering if it might be faster to start at the ends of the strings
    >> instead of at the beginning? If the strings are indeed equal, it's the
    >> same amount of work starting from either end.

    >
    > Most people write loops that go forwards. This leads to the
    > processor designers prioritizing detection mechanisms like cache
    > prefetching for that case.
    >
    > However, it's not always the case: a couple of years ago Intel
    > contributed a memcpy implementation to glibc that went backwards for a
    > performance improvement. An explanation from a related patch involves
    > speculative and committed operations and address bits on some processors
    > and quickly lost me:
    > paragraph 3 of
    > http://lists.freedesktop.org/archives/pixman/2010-August/000423.html
    >
    > The memcpy patch was controversial as it broke Adobe Flash which
    > assumed memcpy was safe like memmove.
    >

    I don't think I've ever used memcpy, only memmove.
     
    MRAB, Sep 5, 2012
    #12
  13. Roy Smith

    In article <>,
    Neil Hodgson <> wrote:

    > The memcpy patch was controversial as it broke Adobe Flash


    An added benefit!
     
    Roy Smith, Sep 5, 2012
    #13
  14. Peter Otten

    Roy Smith wrote:

    > There's been a bunch of threads lately about string implementations, and
    > that got me thinking (which is often a dangerous thing).
    >
    > Let's assume you're testing two strings for equality. You've already
    > done the obvious quick tests (i.e they're the same length), and you're
    > down to the O(n) part of comparing every character.
    >
    > I'm wondering if it might be faster to start at the ends of the strings
    > instead of at the beginning? If the strings are indeed equal, it's the
    > same amount of work starting from either end. But, if it turns out that
    > for real-life situations, the ends of strings have more entropy than the
    > beginnings, the odds are you'll discover that they're unequal quicker by
    > starting at the end.
    >
    > It doesn't seem implausible that this is the case. For example, most
    > of the filenames I work with begin with "/home/roy/". Most of the
    > strings I use as memcache keys have one of a small number of prefixes.
    > Most of the strings I use as logger names have common leading
    > substrings. Things like credit card and telephone numbers tend to have
    > much more entropy in the trailing digits.


    But do you actually compare them with each other? Even if you do, I'd guess
    that Python looks at the hash values most of the time.

    > On the other hand, hostnames (and thus email addresses) exhibit the
    > opposite pattern.


    Nor do English words:

    $ python3 reverse_compare2.py compare /usr/share/dict/words --word-len 8

    comparing every pair in a sample of 1000 8-char words
    taken from '/usr/share/dict/words'

    head
    1: 477222 ************************************************************
    2: 18870 **
    3: 2870
    4: 435
    5: 74
    6: 17
    7: 12

    tail
    1: 386633 ************************************************************
    2: 74966 ***********
    3: 29698 ****
    4: 6536 *
    5: 1475
    6: 154
    7: 28
    8: 10

    > Anyway, it's just a thought. Has anybody studied this for real-life
    > usage patterns?
    >
    > I'm also not sure how this works with all the possible UCS/UTF encodings.
    > With some of them, you may get the encoding semantics wrong if you don't
    > start from the front.
     
    Peter Otten, Sep 5, 2012
    #14
  15. On Wed, Sep 5, 2012 at 6:29 PM, Peter Otten <> wrote:
    > comparing every pair in a sample of 1000 8-char words
    > taken from '/usr/share/dict/words'
    >
    > head
    > 1: 477222 ************************************************************
    > 2: 18870 **
    > ...


    Not understanding this. What are the statistics, and what (if it's not
    obvious from the previous answer) do they prove?

    ChrisA
     
    Chris Angelico, Sep 5, 2012
    #15
  16. On 04.09.2012 20:07, Steven D'Aprano wrote:
    > A reasonable, conservative assumption is to calculate the largest
    > possible value of the average for random strings. That largest value
    > occurs when the alphabet is as small as possible, namely two characters.
    > In practice, strings come from a larger alphabet, up to 1114111 different
    > characters for full Unicode strings, so the average for them will be less
    > than the average we calculate now.
    >
    > So for unequal strings, the number of comparisons is equally likely to be
    > 1, 2, 3, ..., N. The average then is:


    That is exactly what I don't agree with. For unequal random strings, the
    number of comparisons is *not* uniformly distributed. For demonstration,
    let's look at bitstrings and the string "1010". There are 15 bitstrings of
    the same length which are unequal to it, and I'll put the number of
    comparisons needed (going left to right) next to each:

    0000 1
    0001 1
    0010 1
    0011 1
    0100 1
    0101 1
    0110 1
    0111 1
    1100 2
    1101 2
    1110 2
    1111 2
    1000 3
    1001 3
    1011 4

    i.e. about 1.73 (26/15) comparisons on average with random strings,
    compared to the 2.5 comparisons you end up with. My formula does respect
    lazy termination (because the probability of needing more comparisons
    drops off exponentially).
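
    The enumeration is easy to check by brute force:

    # Compare "1010" against every other 4-bit string, counting characters
    # examined up to and including the first mismatch.
    from itertools import product

    target = "1010"
    counts = []
    for bits in product("01", repeat=4):
        s = "".join(bits)
        if s == target:
            continue
        counts.append(next(i + 1 for i, (x, y) in enumerate(zip(target, s))
                           if x != y))

    print(sum(counts), len(counts), sum(counts) / len(counts))   # 26 15 1.733...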

    > sum([1, 2, 3, ..., N])/N
    >
    > which by a bit of simple algebra works out to be (N+1)/2, or half the
    > characters as I said.


    Yes, but it's a flawed assumption you're making since you're neglecting
    lazy evaluation and early abort of comparison.

    Best regards,
    Johannes

     
    Johannes Bauer, Sep 5, 2012
    #16
  17. On 04.09.2012 23:59, Chris Angelico wrote:

    >> n = (256 / 255) * (1 - 256 ^ (-c))
    >>
    >> where n is the average number of character comparisons and c. The
    >> rationale as follows: The first character has to be compared in any
    >> case. The second with a probability of 1/256, the third with 1/(256^2)
    >> and so on.

    >
    > That would be for comparing two random areas of memory. Python strings
    > don't have 256 options per character; and in terms of actual strings,
    > there's so many possibilities.


    The unit of comparison is completely arbitrary; the formula can equally
    be used to compare bitstrings if the alphabet size is changed accordingly.

    > The strings that a program is going to
    > compare for equality are going to use a vastly restricted alphabet;
    > for a lot of cases, there might be only a few dozen plausible
    > characters.


    This is very true. But if so, I would like to see the assumption that
    you're making (which might very well be plausible) in order to arrive at
    an average of N/2 comparisons (which just seems *way* off).

    > But even so, it's going to scale approximately linearly with the
    > string length. If they're really random, then yes, there's little
    > chance that either a 1MB string or a 2MB string will be the same, but
    > with real data, they might very well have a long common prefix. So
    > it's still going to be more or less O(n).


    I understand what you're saying, and surely this is true to some extent
    (someone mentioned filenames, which is an excellent example). However, in
    order to derive a *formula* you need to formalize your assumptions.
    With random data, it's definitely not O(n). And even in reality (with a
    more limited alphabet) it will usually abort early:

    Consider sorting a large dictionary. Sorting is in O(n log(n)), but this
    assumes O(1) comparisons! If the comparison operation itself were in
    O(n), the total sorting complexity would be O(n^2 log(n)), which is
    definitely false. Most comparisons will abort *very* early (after the
    first character).

    Best regards,
    Johannes

     
    Johannes Bauer, Sep 5, 2012
    #17
  18. On 05.09.2012 11:24, Johannes Bauer wrote:

    > Consider sorting a large dictionary. Sorting is in O(n log(n)), but this
    > assumes O(1) comparisons! If the comparison operation itself were in
    > O(n), the total sorting complexity would be O(n^2 log(n)), which is
    > definitely false. Most comparisons will abort *very* early (after the
    > first character).


    I've just written a program to demonstrate this (attached below). I
    used my aspell dictionary, which contains about 100k words. To keep the
    runtime down I went for only the word length which occurs most often (in
    this case 9 characters, or ~13k words). Here are the results (note it's
    randomized, so they'll always differ a little, but are pretty stable):

    Max Cmp: 100
    Sum Cmp: 32111
    Avg Cmp: 2.393842254361115
    Num words : 13414
    Min wordlen: 9
    Max wordlen: 9
    Avg wordlen: 9.0

    Going up in word length (only showing part of the output now):

    Avg Cmp: 2.2374929735806632
    Num words : 10674
    Avg wordlen: 10.0

    Avg Cmp: 2.261727078891258
    Num words : 7504
    Avg wordlen: 11.0

    Avg Cmp: 2.2335647202939977
    Num words : 4898
    Avg wordlen: 12.0

    Avg Cmp: 2.1743341404358354
    Num words : 2891
    Avg wordlen: 13.0

    Avg Cmp: 2.156782549420586
    Num words : 1467
    Avg wordlen: 14.0

    Avg Cmp: 2.112449799196787
    Num words : 747
    Avg wordlen: 15.0

    So, interestingly, in this real-world case(tm), the complexity does not
    scale with O(n). It actually goes down (which is probably due to the
    limited number of words of greater length).

    In summary, let's please not debate "feelings" or such. Either
    measurements (with clear indication of what data they're based on and
    how the test was conducted) or derivations (with an explanation of the
    assumptions). Feelings can be very misleading in cases like this.

    Best regards,
    Johannes



    #!/usr/bin/python3
    import random

    class Word():
        def __init__(self, s):
            self._s = s
            self._c = 0

        def __lt__(self, other):
            # Shorter strings sort first; equal lengths fall through to a
            # character-by-character scan that counts comparisons in _c.
            if len(self) < len(other):
                return True
            elif len(self) > len(other):
                return False

            for i in range(len(self)):
                self._c += 1
                if self._s[i] < other._s[i]:
                    return True
            return False

        def __repr__(self):
            return "%s<%d>" % (self._s, self._c)

        def __len__(self):
            return len(self._s)

        def getcmp(self):
            return self._c

    wordlist = [ ]
    for line in open("dict", "r"):
        word = Word(line[:-1])
        if len(word) == 9:
            wordlist.append(word)

    random.shuffle(wordlist)
    wordlist.sort()

    comparisons = [ word.getcmp() for word in wordlist ]
    print("Max Cmp:", max(comparisons))
    print("Sum Cmp:", sum(comparisons))
    print("Avg Cmp:", sum(comparisons) / len(comparisons))

    wordlengths = [ len(word) for word in wordlist ]
    print("Min wordlen:", min(wordlengths))
    print("Max wordlen:", max(wordlengths))
    print("Avg wordlen:", sum(wordlengths) / len(wordlengths))


     
    Johannes Bauer, Sep 5, 2012
    #18
  19. Roy Smith

    Peter Otten Guest

    Chris Angelico wrote:

    > On Wed, Sep 5, 2012 at 6:29 PM, Peter Otten <> wrote:
    >> comparing every pair in a sample of 1000 8-char words
    >> taken from '/usr/share/dict/words'
    >>
    >> head
    >> 1: 477222 ************************************************************
    >> 2: 18870 **
    >> ...


    >> tail
    >> 1: 386633 ************************************************************
    >> 2: 74966 ***********


    > Not understanding this. What are the statistics,


    I measured how many chars they have in common for all combinations of 1000
    words taken from /usr/share/dict/words.

    477222 pairs have one char in common, 18870 pairs have two chars in common
    when compared from the *beginning* of the string.

    386633 pairs have one char in common, 74966 pairs have two chars in common
    when compared from the *end* of the string.

    > and what (if it's not obvious from the previous answer) do they prove?

    They demonstrate that for two words from that particular corpus it is likely
    that a[::-1] == b[::-1] has to take more (1.34 versus 1.05 on average)
    characters into account than a == b, i.e. comparing from the back should be
    slower rather than faster.

    If that doesn't help, here's the code ;)

    import random
    import itertools
    from contextlib import contextmanager
    from collections import defaultdict

    @contextmanager
    def open_corpus(args):
        with open(args.name) as instream:
            words = (line.strip() for line in instream)
            #words = (word for word in words if not word.endswith("'s"))
            yield words

    def pick(items, count):
        items = list(items)
        return random.sample(items, count)

    def count_common(a, b):
        i = 0
        for i, (x, y) in enumerate(zip(a, b), 1):
            if x != y:
                break
        return i

    def corpus(args):
        with open_corpus(args) as words:
            wordlens = (len(line.strip()) for line in words)
            freq = defaultdict(int)
            for wordlen in wordlens:
                freq[wordlen] += 1
            show_histogram(freq)

    def show_histogram(freq):
        peak = max(freq.values())
        freq_width = len(str(peak))
        max_freq = max(freq)
        len_width = len(str(max_freq))
        for i in range(min(freq), max_freq+1):
            value = freq[i]
            print("{:{}}: {:{}} {}".format(
                i, len_width, value,
                freq_width, "*" * (60 * value // peak)))

    def compare(args):
        with open_corpus(args) as words:
            words = pick(
                (word for word in words if len(word) == args.word_len),
                args.sample_size)
            n = 0
            head_common = defaultdict(int)
            tail_common = defaultdict(int)
            n_tail = n_head = 0
            for n, (a, b) in enumerate(itertools.combinations(words, 2), 1):
                cc = count_common(a, b)
                n_head += cc
                head_common[cc] += 1
                ccr = count_common(a[::-1], b[::-1])
                n_tail += ccr
                tail_common[ccr] += 1

            print(n, "combinations")
            print("average common chars (head) {:.3}".format(n_head/n))
            print("average common chars (tail) {:.3}".format(n_tail/n))

            print("comparing every pair in a "
                  "sample of {sample} {len}-char words\n"
                  "taken from {name!r}".format(
                      sample=args.sample_size,
                      len=args.word_len,
                      name=args.name))
            print()
            print("head")
            show_histogram(head_common)
            print()
            print("tail")
            show_histogram(tail_common)

    def main():
        import argparse
        parser = argparse.ArgumentParser()
        parsers = parser.add_subparsers()

        pcorpus = parsers.add_parser("corpus")
        pcorpus.add_argument("name")
        pcorpus.set_defaults(func=corpus)

        pcompare = parsers.add_parser("compare")
        pcompare.add_argument("name")
        pcompare.add_argument("--sample-size", type=int, default=1000)
        pcompare.add_argument("--word-len", type=int, default=8)
        pcompare.set_defaults(func=compare)
        args = parser.parse_args()
        args.func(args)

    if __name__ == "__main__":
        main()
     
    Peter Otten, Sep 5, 2012
    #19
  20. On Wed, 05 Sep 2012 11:43:08 +0200, Johannes Bauer wrote:

    > On 05.09.2012 11:24, Johannes Bauer wrote:
    >
    >> Consider sorting a large dictionary. Sorting is in O(n log(n)), but
    >> this assumes O(1) comparisons! If the comparison operation itself were
    >> in O(n), the total sorting complexity would be O(n^2 log(n)),


    This shows some significant confusion about Big Oh complexity analysis
    here. When you say that sorting the list of words is O(N log N), the N
    refers to the number of words in the dictionary. But when you speak about
    comparisons being O(N), the N doesn't refer to the size of the dict, but
    the size of the individual strings being compared. There is no reasonable
    comparison algorithm where the effort required to compare two strings
    depends on the size of the dictionary they have come from.

    Since these are *completely different Ns*, you can't combine them to get
    O(N**2 log N) as the overall algorithmic complexity. Doing so is sheer
    nonsense. If you state it like this:

    sorting the list of words is O(N log N) where N = length of the list
    comparing the words is O(M) where M = the average length of the words

    then the overall complexity is O(N log N)*O(M), but since M is a constant
    for any particular dictionary (it's just the average length of the words)
    that reduces down to O(N log N).

    To say that sorting a dictionary is O(N log N) does *not* imply that
    string comparisons are O(1). It only implies that string comparisons
    don't depend on the size of the dictionary. Comparing:

    s1 = 'a'*30002
    s2 = 'a'*30001 + 'b'

    takes the same amount of time whether they are in a dictionary of 100
    words or one of 100000 words. For the purposes of calculating the
    algorithmic complexity of sorting a large list of words, we can treat
    comparisons as an elementary operation even though they actually aren't.
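
    A rough timeit sketch of that point (the numbers will vary from
    machine to machine; what matters is that the surrounding list plays no
    role in the cost of a single comparison):

    import timeit

    s1 = 'a' * 30002
    s2 = 'a' * 30001 + 'b'
    u1 = 'a' * 302
    u2 = 'a' * 301 + 'b'

    # Cost per comparison tracks the strings being compared...
    print(timeit.timeit(lambda: s1 == s2, number=20000))
    print(timeit.timeit(lambda: u1 == u2, number=20000))
    # ...and neither number changes if s1 and s2 happen to be members of a
    # 100-element or a 100000-element list.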


    >> which is
    >> definitely false. Most comparisons will abort *very* early (after the
    >> first character).


    You are making unjustified assumptions about the distribution of letters
    in the words. This might be a list of long chemical compounds where the
    words typically differ only in their suffix. It might be a list of people
    with titles:

    Herr Professor Frederick Schmidt
    Herr Professor Frederick Wagner
    ....

    But either way, it doesn't matter for the Big Oh behaviour of sorting the
    words.


    > I've just written a program to demonstrate this (below it's attached).


    I have no idea why you think this program demonstrates anything about
    string equality comparisons. What you are counting is not the average
    number of comparisons for string equality, but the number of comparisons
    when sorting. These are *completely different*, because sorting a list
    does not require testing each string against every other string.

    Whatever you have measured, it has nothing to do with the Big Oh
    complexity of string comparisons.

    It seems to me that you have misunderstood the purpose of Big Oh
    notation. It gives an asymptotic upper bound on the amount of work done,
    ignoring all lower terms and multiplicative factors, not an exact formula.

    If something is O(N), then this tells you:

    1) the number of algorithmic steps is proportional to N, plus or
    minus some terms which are less than N (e.g. sqrt(N), or log(N),
    or 1/N, or some constant, etc.);

    2) if you ignore those other terms, and keep all other influences
    equal, then doubling N will *at worst* double the number of
    algorithmic steps;

    3) if all those algorithmic steps take exactly the same amount of
    time, then doubling N will take twice as much time.

    But of course *none* of those assumptions hold for measurements of actual
    elapsed time. In real life, operations rarely take constant time, you
    don't always see worst case behaviour, and the lower terms are not
    necessarily insignificant enough to ignore.


    [...]
    > So, interestingly, in this real-world case(tm), the complexity does not
    > scale with O(n). It actually goes down (which is probably due to the
    > limit amount of words of higher length).


    I would guess that it probably has more to do with the ability of the
    timsort algorithm to exploit local order within a list and avoid
    performing comparisons.


    > In summary, let's please not debate "feelings" or such.


    I have no idea what feelings you are referring to.



    --
    Steven
     
    Steven D'Aprano, Sep 5, 2012
    #20
