Py 3.3, unicode / upper()

Discussion in 'Python' started by wxjmfauth@gmail.com, Dec 19, 2012.

  1. Guest

    I was using the German word "Straße" (Strasse), the German
    translation of "street", to illustrate the catastrophic and
    completely wrong-by-design Unicode handling in Py3.3, this
    time from a memory point of view (not speed):

    >>> sys.getsizeof('Straße')
    43
    >>> sys.getsizeof('STRAẞE')
    50

    instead of a sane (Py3.2)

    >>> sys.getsizeof('Straße')
    42
    >>> sys.getsizeof('STRAẞE')
    42


    But, this is not the problem.
    I was surprised to discover this:

    >>> 'Straße'.upper()
    'STRASSE'

    I really, really do not know what I should think about that.
    (It is a complex subject.) And the real question is why?

    jmf
    , Dec 19, 2012
    #1

  2. Thomas Bach Guest

    On Wed, Dec 19, 2012 at 06:23:00AM -0800, wrote:
    > I was surprised to discover this:
    >
    > >>> 'Straße'.upper()
    > 'STRASSE'
    >
    > I really, really do not know what I should think about that.
    > (It is a complex subject.) And the real question is why?


    Because there is no definition for upper-case 'ß'. 'SS' is used as the
    common replacement in this case. I think it's pretty smart! :)

    Regards,
    Thomas.
    Thomas Bach, Dec 19, 2012
    #2

  3. Stefan Krah Guest

    <> wrote:
    > But, this is not the problem.
    > I was surprised to discover this:
    >
    > >>> 'Straße'.upper()
    > 'STRASSE'
    >
    > I really, really do not know what I should think about that.
    > (It is a complex subject.) And the real question is why?


    http://de.wikipedia.org/wiki/Großes_ß#Versalsatz_ohne_gro.C3.9Fes_.C3.9F

    "The current official rules[6] of the new German orthography do not
    recognise a capital letter for ß: every letter exists as a lowercase
    and as an uppercase letter (exception: ß). For all-caps typesetting
    the rules recommend replacing ß with SS: when writing in capital
    letters, one writes SS, for example: Straße -- STRASSE."


    According to the new official spelling rules the uppercase ß does not exist.
    The recommendation is to use "SS" when writing in all-caps.


    As to why: It has always been acceptable to replace ß with "ss" when ß
    wasn't part of a character set. In the new spelling rules, ß has been
    officially replaced with "ss" in some cases:

    http://en.wiktionary.org/wiki/daß


    The uppercase ß isn't really needed, since ß does not occur at the beginning
    of a word. As far as I know, most Germans wouldn't even know that it has
    existed at some point or how to write it.



    Stefan Krah
    Stefan Krah, Dec 19, 2012
    #3
  4. On Thu, Dec 20, 2012 at 1:23 AM, <> wrote:
    > But, this is not the problem.
    > I was surprised to discover this:
    >
    > >>> 'Straße'.upper()
    > 'STRASSE'
    >
    > I really, really do not know what I should think about that.
    > (It is a complex subject.) And the real question is why?


    Not all strings can be uppercased and lowercased cleanly. Please stop
    trotting out the old Box Hill-to-Camberwell arguments[1] yet again.

    For comparison, try this string:

    '𝐇𝐞𝐥𝐥𝐨, 𝐰𝐨𝐫𝐥𝐝!'.upper()

    And while you're at it, check out sys.getsizeof() on that sort of
    string, compare your beloved 3.2 on that. Oh, and also check out len()
    on it.

    [1] Melbourne's current ticketing system is based on zones, and
    Camberwell is in zone 1, and Box Hill in zone 2. Detractors of public
    transport point out that it costs far more to take the train from Box
    Hill to Camberwell than it does to drive a car the same distance. It's
    the same contrived example that keeps on getting trotted out time and
    time again.

    ChrisA
    Chris Angelico, Dec 19, 2012
    #4
  5. On 19.12.2012 15:23, wrote:
    > I was using the German word "Straße" (Strasse) — German
    > translation from "street" — to illustrate the catastrophic and
    > completely wrong-by-design Unicode handling in Py3.3, this
    > time from a memory point of view (not speed):
    >
    > >>> sys.getsizeof('Straße')
    > 43
    > >>> sys.getsizeof('STRAẞE')
    > 50
    >
    > instead of a sane (Py3.2)
    >
    > >>> sys.getsizeof('Straße')
    > 42
    > >>> sys.getsizeof('STRAẞE')
    > 42


    How do those arbitrary numbers prove anything at all? Why do you draw
    the conclusion that it's broken by design? What do you expect? You're
    very vague here. Just to show how ridiculously pointless your numbers
    are, your example gives 84 on Python 3.2 for any input of yours.

    > But, this is not the problem.
    > I was surprised to discover this:
    >
    > >>> 'Straße'.upper()
    > 'STRASSE'
    >
    > I really, really do not know what I should think about that.
    > (It is a complex subject.) And the real question is why?


    Because in the German language the uppercase "ß" is virtually dead.

    Regards,
    Johannes

    --
    >> Where EXACTLY did you predict the quake again?
    > At least not publicly!

    Ah, the latest and so far most brilliant stunt of our great
    cosmologists: the secret prediction.
    - Karl Kaos on Rüdiger Thomas in dsa <hidbv3$om2$>
    Johannes Bauer, Dec 19, 2012
    #5
  6. On 19.12.2012 16:18, Johannes Bauer wrote:

    > How do those arbitrary numbers prove anything at all? Why do you draw
    > the conclusion that it's broken by design? What do you expect? You're
    > very vague here. Just to show how ridiculously pointless your numbers
    > are, your example gives 84 on Python 3.2 for any input of yours.


    ...on Python 3.2 on MY system (x86_64 Linux) is what I meant to say. Sorry.

    Also, further reading:

    http://de.wikipedia.org/wiki/Großes_ß
    http://en.wikipedia.org/wiki/Capital_ẞ

    Regards,
    Johannes

    Johannes Bauer, Dec 19, 2012
    #6
  7. On Thu, Dec 20, 2012 at 2:18 AM, Johannes Bauer <> wrote:
    > On 19.12.2012 15:23, wrote:
    >> I was using the German word "Straße" (Strasse) — German
    >> translation from "street" — to illustrate the catastrophic and
    >> completely wrong-by-design Unicode handling in Py3.3, this
    >> time from a memory point of view (not speed):
    >>
    >> >>> sys.getsizeof('Straße')
    >> 43
    >> >>> sys.getsizeof('STRAẞE')
    >> 50
    >>
    >> instead of a sane (Py3.2)
    >>
    >> >>> sys.getsizeof('Straße')
    >> 42
    >> >>> sys.getsizeof('STRAẞE')
    >> 42

    >
    > How do those arbitrary numbers prove anything at all? Why do you draw
    > the conclusion that it's broken by design? What do you expect? You're
    > very vague here. Just to show how ridiculously pointless your numbers
    > are, your example gives 84 on Python 3.2 for any input of yours.


    You may not be familiar with jmf. He's one of our resident trolls, and
    he has a bee in his bonnet about PEP 393 strings, on the basis that
    they take up more space in memory than a narrow build of Python 3.2
    would, for a string with lots of BMP characters and one non-BMP. In
    3.2 narrow builds, strings were stored in UTF-16, with *surrogate
    pairs* for non-BMP characters. This means that len() counts them
    twice, as does string indexing/slicing. That's a major bug, especially
    as your Python code will do different things on different platforms -
    most Linux builds of 3.2 are "wide" builds, storing characters in four
    bytes each.

    PEP 393 brings wide build semantics to all Pythons, while achieving
    memory savings better than a narrow build can (with PEP 393 strings,
    any all-ASCII or all-Latin-1 strings will be stored one byte per
    character). Every now and then, though, jmf points out *yet again*
    that his beloved and buggy narrow build consumes less memory and runs
    faster than the oh so terrible 3.3 on some contrived example. It gets
    rather tiresome.
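
The surrogate-pair point above can be checked without a 3.2 narrow build. The following is a minimal sketch, assuming Python 3.3 or later; `utf16_code_units` is a hypothetical helper that counts the 16-bit units a UTF-16 representation would need, which is what `len()` reported on narrow builds.

```python
# The bold "Hello, world!" string used elsewhere in this thread.
s = ("\U0001d407\U0001d41e\U0001d425\U0001d425\U0001d428, "
     "\U0001d430\U0001d428\U0001d42b\U0001d425\U0001d41d!")


def utf16_code_units(text):
    """Number of 16-bit units needed to store text as UTF-16 (hypothetical helper)."""
    return len(text.encode("utf-16-le")) // 2


code_points = len(s)                               # what 3.3+ reports
code_units = utf16_code_units(s)                   # what a narrow build would count
non_bmp = sum(1 for ch in s if ord(ch) > 0xFFFF)   # characters needing a surrogate pair

print(code_points, code_units, non_bmp)
```

Each non-BMP character accounts for one extra code unit, so the gap between the two counts equals the number of non-BMP characters in the string.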

    Interestingly, IDLE on my Windows box can't handle the bolded
    characters very well...

    >>> s="\U0001d407\U0001d41e\U0001d425\U0001d425\U0001d428, \U0001d430\U0001d428\U0001d42b\U0001d425\U0001d41d!"
    >>> print(s)
    Traceback (most recent call last):
      File "<pyshell#2>", line 1, in <module>
        print(s)
    UnicodeEncodeError: 'UCS-2' codec can't encode character '\U0001d407' in position 0: Non-BMP character not supported in Tk

    I think this is most likely a case of "yeah, Windows XP just sucks".
    But I have no reason or inclination to get myself a newer Windows to
    find out if it's any different.
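
One possible way around a display layer that rejects non-BMP characters is to escape them before printing. This is only a sketch: `escape_non_bmp` is a hypothetical helper, not something IDLE or Tk provides, and it assumes Python 3.3 or later so that iterating a string yields whole code points.

```python
def escape_non_bmp(text):
    """Replace every character above U+FFFF with a \\UXXXXXXXX escape."""
    return "".join(
        ch if ord(ch) <= 0xFFFF else "\\U%08x" % ord(ch)
        for ch in text
    )


s = "\U0001d407\U0001d41e\U0001d425\U0001d425\U0001d428, world!"
print(escape_non_bmp(s))
```

BMP characters, including ß, pass through unchanged, so only the characters the display cannot handle are rewritten.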

    ChrisA
    Chris Angelico, Dec 19, 2012
    #7
  8. Ian Kelly Guest

    On Wed, Dec 19, 2012 at 8:40 AM, Chris Angelico <> wrote:
    > You may not be familiar with jmf. He's one of our resident trolls, and
    > he has a bee in his bonnet about PEP 393 strings, on the basis that
    > they take up more space in memory than a narrow build of Python 3.2
    > would, for a string with lots of BMP characters and one non-BMP. In
    > 3.2 narrow builds, strings were stored in UTF-16, with *surrogate
    > pairs* for non-BMP characters. This means that len() counts them
    > twice, as does string indexing/slicing. That's a major bug, especially
    > as your Python code will do different things on different platforms -
    > most Linux builds of 3.2 are "wide" builds, storing characters in four
    > bytes each.


    From what I've been able to discern, his actual complaint about PEP
    393 stems from misguided moral concerns. With PEP-393, strings that
    can be fully represented in Latin-1 can be stored in half the space
    (ignoring fixed overhead) compared to strings containing at least one
    non-Latin-1 character. jmf thinks this optimization is unfair to
    non-English users and immoral; he wants Latin-1 strings to be treated
    exactly like non-Latin-1 strings (I don't think he actually cares
    about non-BMP strings at all; if narrow-build Unicode is good enough
    for him, then it must be good enough for everybody). Unfortunately
    for him, the Latin-1 optimization is rather trivial in the wider
    context of PEP-393, and simply removing that part alone clearly
    wouldn't be doing anybody any favors. So for him to get what he
    wants, the entire PEP has to go.

    It's rather like trying to solve the problem of wealth disparity by
    forcing everyone to dump their excess wealth into the ocean.
    Ian Kelly, Dec 19, 2012
    #8
  9. <wxjmfauth <at> gmail.com> writes:
    > I really, really do not know what I should think about that.
    > (It is a complex subject.) And the real question is why?


    Because that's what the Unicode spec says to do.
    Benjamin Peterson, Dec 19, 2012
    #9
  10. Guest

    On Wednesday, 19 December 2012 15:52:23 UTC+1, Christian Heimes wrote:
    > On 19.12.2012 15:23, wrote:
    > > But, this is not the problem.
    > > I was surprised to discover this:
    > >
    > > >>> 'Straße'.upper()
    > > 'STRASSE'
    > >
    > > I really, really do not know what I should think about that.
    > > (It is a complex subject.) And the real question is why?
    >
    > It's correct. LATIN SMALL LETTER SHARP S doesn't have an upper case
    > form. However the unicode database specifies an upper case mapping from
    > ß to SS. http://codepoints.net/U+00DF
    >
    > Christian


    -----

    Yes, it is correct (or can be considered correct).
    I do not wish to discuss the typographical problem
    of "Das Grosse Eszett". The web is full of pages on the
    subject. However, I never managed to find an "official
    position" from Unicode. The best information I found seems
    to indicate (to converge on) U+1E9E now being the "supported"
    uppercase form of U+00DF (see DIN).

    What bothers me more is the implementation. The Unicode
    documentation says roughly this: if something cannot be
    honoured, there is no harm, but do not implement a workaround.
    In that case, I'm not sure Python is doing the best thing.

    If "wrong", this can be considered as programmatically correct
    or logically acceptable (Py3.2)

    >>> 'Straße'.upper().lower().capitalize() == 'Straße'
    True

    while this will *always* be problematic (Py3.3)

    >>> 'Straße'.upper().lower().capitalize() == 'Straße'
    False

    jmf
    , Dec 19, 2012
    #10
  12. Guest

    On Wednesday, 19 December 2012 19:27:38 UTC+1, Ian wrote:
    > On Wed, Dec 19, 2012 at 8:40 AM, Chris Angelico <> wrote:
    > > You may not be familiar with jmf. He's one of our resident trolls, and
    > > he has a bee in his bonnet about PEP 393 strings, on the basis that
    > > they take up more space in memory than a narrow build of Python 3.2
    > > would, for a string with lots of BMP characters and one non-BMP. In
    > > 3.2 narrow builds, strings were stored in UTF-16, with *surrogate
    > > pairs* for non-BMP characters. This means that len() counts them
    > > twice, as does string indexing/slicing. That's a major bug, especially
    > > as your Python code will do different things on different platforms -
    > > most Linux builds of 3.2 are "wide" builds, storing characters in four
    > > bytes each.
    >
    > From what I've been able to discern, his actual complaint about PEP
    > 393 stems from misguided moral concerns. With PEP-393, strings that
    > can be fully represented in Latin-1 can be stored in half the space
    > (ignoring fixed overhead) compared to strings containing at least one
    > non-Latin-1 character. jmf thinks this optimization is unfair to
    > non-English users and immoral; he wants Latin-1 strings to be treated
    > exactly like non-Latin-1 strings (I don't think he actually cares
    > about non-BMP strings at all; if narrow-build Unicode is good enough
    > for him, then it must be good enough for everybody). Unfortunately
    > for him, the Latin-1 optimization is rather trivial in the wider
    > context of PEP-393, and simply removing that part alone clearly
    > wouldn't be doing anybody any favors. So for him to get what he
    > wants, the entire PEP has to go.
    >
    > It's rather like trying to solve the problem of wealth disparity by
    > forcing everyone to dump their excess wealth into the ocean.


    ----

    Latin-1 (ISO-8859-1)? Are you sure?

    >>> sys.getsizeof('a')
    26
    >>> sys.getsizeof('ab')
    27
    >>> sys.getsizeof('aé')
    39

    Time to go to bed. More complete answer tomorrow.

    jmf
    , Dec 19, 2012
    #12
  14. Ian Kelly Guest

    On Wed, Dec 19, 2012 at 1:55 PM, <> wrote:
    > Yes, it is correct (or can be considered as correct).
    > I do not wish to discuss the typographical problematic
    > of "Das Grosse Eszett". The web is full of pages on the
    > subject. However, I never succeeded to find an "official
    > position" from Unicode. The best information I found seem
    > to indicate (to converge), U+1E9E is now the "supported"
    > uppercase form of U+00DF. (see DIN).


    Is this link not official?

    http://unicode.org/cldr/utility/character.jsp?a=00DF

    That defines a full uppercase mapping to SS and a simple uppercase
    mapping to U+00DF itself, not U+1E9E. My understanding of the simple
    mapping is that it is not allowed to map to multiple characters,
    whereas the full mapping is so allowed.
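
The effect of the full mapping can be observed from Python itself. A minimal sketch, assuming Python 3.3 or later (where `str.upper()` applies the full case mappings); `describe` is a hypothetical helper introduced only for this illustration.

```python
import unicodedata

sharp_s = "\u00df"          # LATIN SMALL LETTER SHARP S
capital_sharp_s = "\u1e9e"  # LATIN CAPITAL LETTER SHARP S


def describe(ch):
    """Return (Unicode name, uppercased form, lowercased form) for one character."""
    return unicodedata.name(ch), ch.upper(), ch.lower()


name, upper, lower = describe(sharp_s)
print(name, repr(upper), repr(lower))

# The full uppercase mapping expands one character into two.
assert upper == "SS" and len(upper) == 2
# U+1E9E lowercases to U+00DF, but U+00DF does not uppercase to U+1E9E.
assert capital_sharp_s.lower() == sharp_s
assert sharp_s.upper() != capital_sharp_s
```

Because the result of the full mapping can be longer than the input, code that assumes `len(s.upper()) == len(s)` is not safe for arbitrary text.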

    > What is bothering me, is more the implementation. The Unicode
    > documentation says roughly this: if something can not be
    > honoured, there is no harm, but do not implement a workaround.
    > In that case, I'm not sure Python is doing the best.


    But this behavior is per the specification, not a workaround. I think
    the worst thing we could do in this regard would be to start diverging
    from the specification because we think we know better than the
    Unicode Consortium.


    > If "wrong", this can be considered as programmatically correct
    > or logically acceptable (Py3.2)
    >
    > >>> 'Straße'.upper().lower().capitalize() == 'Straße'
    > True
    >
    > while this will *always* be problematic (Py3.3)
    >
    > >>> 'Straße'.upper().lower().capitalize() == 'Straße'
    > False


    On the other hand (Py3.2):

    >>> 'Straße'.upper().isupper()
    False

    vs. Py3.3:

    >>> 'Straße'.upper().isupper()
    True

    There is probably no one clearly correct way to handle the problem,
    but personally this contradiction bothers me more than the example
    that you posted.
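
Since upper() and lower() are not inverses of each other, a round trip is not a reliable equality test. One common alternative for caseless comparison is `str.casefold()`, which was added in Python 3.3. The sketch below assumes 3.3 or later; `same_ignoring_case` is a hypothetical helper name.

```python
def same_ignoring_case(a, b):
    """Caseless comparison using full case folding (str.casefold, Python 3.3+)."""
    return a.casefold() == b.casefold()


word = "Stra\u00dfe"  # 'Straße'
round_trip = word.upper().lower().capitalize()

print(round_trip)                           # Strasse
print(round_trip == word)                   # False
print(same_ignoring_case(word, "STRASSE"))  # True
```

Case folding maps ß to "ss", so both spellings compare equal without either one having to survive a round trip unchanged.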
    Ian Kelly, Dec 19, 2012
    #14
  15. Ian Kelly Guest

    On Wed, Dec 19, 2012 at 2:18 PM, <> wrote:
    > latin-1 (iso-8859-1) ? are you sure ?


    Yes.

    > >>> sys.getsizeof('a')
    > 26
    > >>> sys.getsizeof('ab')
    > 27
    > >>> sys.getsizeof('aé')
    > 39


    Compare to:

    >>> sys.getsizeof('a\u0100')
    42

    The reason for the difference you posted is that pure ASCII strings
    have a further optimization, which I glossed over and which is purely
    a savings in overhead:

    >>> sys.getsizeof('abcde') - sys.getsizeof('a')
    4
    >>> sys.getsizeof('ábçdê') - sys.getsizeof('á')
    4
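
The same subtraction trick separates the per-character cost from the fixed overhead for each storage width. A minimal sketch, assuming CPython 3.3 or later (other implementations may report different sizes, and the absolute numbers vary by platform); `bytes_per_char` is a hypothetical helper.

```python
import sys


def bytes_per_char(ch, n=100):
    """Extra bytes sys.getsizeof reports when one more copy of ch is appended."""
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)


samples = [
    ("ASCII 'a'", "a"),
    ("Latin-1 U+00E9", "\u00e9"),
    ("BMP U+0100", "\u0100"),
    ("non-BMP U+1D407", "\U0001d407"),
]

for label, ch in samples:
    print(label, bytes_per_char(ch))
```

On CPython with PEP 393 strings, the increments should come out as 1, 1, 2 and 4 bytes: the widest character in a string determines the width used for all of them, while the header overhead is paid once per string.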
    Ian Kelly, Dec 19, 2012
    #15
  16. Terry Reedy Guest

    On 12/19/2012 10:40 AM, Chris Angelico wrote:

    > Interestingly, IDLE on my Windows box can't handle the bolded
    > characters very well...
    >
    > >>> s="\U0001d407\U0001d41e\U0001d425\U0001d425\U0001d428, \U0001d430\U0001d428\U0001d42b\U0001d425\U0001d41d!"
    > >>> print(s)
    > Traceback (most recent call last):
    >   File "<pyshell#2>", line 1, in <module>
    >     print(s)
    > UnicodeEncodeError: 'UCS-2' codec can't encode character '\U0001d407' in position 0: Non-BMP character not supported in Tk


    On 3.3.0 on Win7, the expressions 's', 'repr(s)', and 'str(s)' (without
    the quotes) echo the input as entered (with \U escapes), while 'print(s)'
    gets the same traceback you did.



    --
    Terry Jan Reedy
    Terry Reedy, Dec 20, 2012
    #16
  17. On Thu, Dec 20, 2012 at 8:23 AM, Ian Kelly <> wrote:
    > On Wed, Dec 19, 2012 at 1:55 PM, <> wrote:
    >> Yes, it is correct (or can be considered as correct).
    >> I do not wish to discuss the typographical problematic
    >> of "Das Grosse Eszett". The web is full of pages on the
    >> subject. However, I never succeeded to find an "official
    >> position" from Unicode. The best information I found seem
    >> to indicate (to converge), U+1E9E is now the "supported"
    >> uppercase form of U+00DF. (see DIN).

    >
    > Is this link not official?
    >
    > http://unicode.org/cldr/utility/character.jsp?a=00DF
    >
    > That defines a full uppercase mapping to SS and a simple uppercase
    > mapping to U+00DF itself, not U+1E9E. My understanding of the simple
    > mapping is that it is not allowed to map to multiple characters,
    > whereas the full mapping is so allowed.


    Ahh, thanks, that explains why the other Unicode-aware language I
    tried behaved differently.

    Pike v7.9 release 5 running Hilfe v3.5 (Incremental Pike Frontend)
    > string s="Stra\u00dfe";
    > upper_case(s);
    (1) Result: "STRA\337E"
    > lower_case(upper_case(s));
    (2) Result: "stra\337e"
    > String.capitalize(lower_case(s));
    (3) Result: "Stra\337e"

    The output is the equivalent of repr(), and it uses octal escapes
    where possible (for brevity), so \337 is its representation of U+00DF
    (decimal 223, octal 337). Upper-casing and lower-casing this character
    result in the same thing.

    > write("Original: %s\nLower: %s\nUpper: %s\n",s,lower_case(s),upper_case(s));

    Original: Straße
    Lower: straße
    Upper: STRAßE

    It's worth noting, incidentally, that the unusual upper-case form of
    the letter (U+1E9E) does lower-case to U+00DF in both Python 3.3 and
    Pike 7.9.5:

    > lower_case("Stra\u1E9Ee");
    (9) Result: "stra\337e"

    >>> ord("\u1e9e".lower())
    223

    So both of them are behaving in a compliant manner, even though
    they're not quite identical.

    ChrisA
    Chris Angelico, Dec 20, 2012
    #17
  18. On Thu, Dec 20, 2012 at 5:27 AM, Ian Kelly <> wrote:
    > From what I've been able to discern, [jmf's] actual complaint about PEP
    > 393 stems from misguided moral concerns. With PEP-393, strings that
    > can be fully represented in Latin-1 can be stored in half the space
    > (ignoring fixed overhead) compared to strings containing at least one
    > non-Latin-1 character. jmf thinks this optimization is unfair to
    > non-English users and immoral; he wants Latin-1 strings to be treated
    > exactly like non-Latin-1 strings (I don't think he actually cares
    > about non-BMP strings at all; if narrow-build Unicode is good enough
    > for him, then it must be good enough for everybody).


    Not entirely; most of his complaints are based on performance (speed
    and/or memory) of 3.3 compared to a narrow build of 3.2, using silly
    edge cases to prove how much worse 3.3 is, while utterly ignoring the
    fact that, in those self-same edge cases, 3.2 is buggy.

    ChrisA
    Chris Angelico, Dec 20, 2012
    #18
  19. Terry Reedy Guest

    On 12/19/2012 9:03 PM, Chris Angelico wrote:
    > On Thu, Dec 20, 2012 at 5:27 AM, Ian Kelly <> wrote:
    >> From what I've been able to discern, [jmf's] actual complaint about PEP
    >> 393 stems from misguided moral concerns. With PEP-393, strings that
    >> can be fully represented in Latin-1 can be stored in half the space
    >> (ignoring fixed overhead) compared to strings containing at least one
    >> non-Latin-1 character. jmf thinks this optimization is unfair to
    >> non-English users and immoral; he wants Latin-1 strings to be treated
    >> exactly like non-Latin-1 strings (I don't think he actually cares
    >> about non-BMP strings at all; if narrow-build Unicode is good enough
    >> for him, then it must be good enough for everybody).

    >
    > Not entirely; most of his complaints are based on performance (speed
    > and/or memory) of 3.3 compared to a narrow build of 3.2, using silly
    > edge cases to prove how much worse 3.3 is, while utterly ignoring the
    > fact that, in those self-same edge cases, 3.2 is buggy.


    And the fact that stringbench.py is overall about as fast with 3.3 as
    with 3.2 *on the same Windows 7 machine* (which uses narrow build in
    3.2), and that unicode operations are not far from bytes operations when
    the same thing can be done with both.

    --
    Terry Jan Reedy
    Terry Reedy, Dec 20, 2012
    #19
  20. On Wed, Dec 19, 2012 at 09:54:20PM -0500, Terry Reedy wrote:
    > On 12/19/2012 9:03 PM, Chris Angelico wrote:
    > >On Thu, Dec 20, 2012 at 5:27 AM, Ian Kelly <> wrote:
    > >> From what I've been able to discern, [jmf's] actual complaint about PEP
    > >>393 stems from misguided moral concerns. With PEP-393, strings that
    > >>can be fully represented in Latin-1 can be stored in half the space
    > >>(ignoring fixed overhead) compared to strings containing at least one
    > >>non-Latin-1 character. jmf thinks this optimization is unfair to
    > >>non-English users and immoral; he wants Latin-1 strings to be treated
    > >>exactly like non-Latin-1 strings (I don't think he actually cares
    > >>about non-BMP strings at all; if narrow-build Unicode is good enough
    > >>for him, then it must be good enough for everybody).

    > >
    > >Not entirely; most of his complaints are based on performance (speed
    > >and/or memory) of 3.3 compared to a narrow build of 3.2, using silly
    > >edge cases to prove how much worse 3.3 is, while utterly ignoring the
    > >fact that, in those self-same edge cases, 3.2 is buggy.

    >
    > And the fact that stringbench.py is overall about as fast with 3.3
    > as with 3.2 *on the same Windows 7 machine* (which uses narrow build
    > in 3.2), and that unicode operations are not far from bytes
    > operations when the same thing can be done with both.
    >
    > --
    > Terry Jan Reedy


    Really, why should we be so obsessed with speed anyways? Isn't
    improving the language and fixing bugs far more important?
    Westley Martínez, Dec 20, 2012
    #20