Unicode 7

Discussion in 'Python' started by wxjmfauth@gmail.com, Apr 29, 2014.

  1. Guest

    Let's see how ready Python is for the next Unicode version
    (Unicode 7.0.0 Beta).


    >>> timeit.repeat("(x*1000 + y)[:-1]", setup="x = 'abc'; y = 'z'")
    [1.4027834829454946, 1.38714224331963, 1.3822586635296261]
    >>> timeit.repeat("(x*1000 + y)[:-1]", setup="x = 'abc'; y = '\u0fce'")
    [5.462776291480395, 5.4479432055423445, 5.447874284053398]
    >>>
    >>>
    >>> # more interesting
    >>> timeit.repeat("(x*1000 + y)[:-1]",\
    ... setup="x = 'abc'.encode('utf-8'); y = '\u0fce'.encode('utf-8')")
    [1.3496489533188765, 1.328654286266783, 1.3300913977710707]
    >>>


    Note 1: "lookup" is not the problem.

    Note 2: From Unicode.org : "[...] We strongly encourage [...] and test
    them with their programs [...]"

    -> Done.

    jmf
    , Apr 29, 2014
    #1

  2. Tim Chase Guest

    On 2014-04-29 10:37, wrote:
    > >>> timeit.repeat("(x*1000 + y)[:-1]", setup="x = 'abc'; y = 'z'")
    > [1.4027834829454946, 1.38714224331963, 1.3822586635296261]
    > >>> timeit.repeat("(x*1000 + y)[:-1]", setup="x = 'abc'; y = '\u0fce'")
    > [5.462776291480395, 5.4479432055423445, 5.447874284053398]
    > >>>
    > >>>
    > >>> # more interesting
    > >>> timeit.repeat("(x*1000 + y)[:-1]",\
    > ... setup="x = 'abc'.encode('utf-8'); y = '\u0fce'.encode('utf-8')")
    > [1.3496489533188765, 1.328654286266783, 1.3300913977710707]
    > >>>


    While I dislike feeding the troll, what I see here is: on your
    machine, all unicode manipulations in the test should take ~5.4
    seconds. But Python notices that some of your strings *don't*
    require a full 32-bits and thus optimizes those operations, cutting
    about 75% of the processing time (wow...4-bytes-per-char to
    1-byte-per-char, I wonder where that 75% savings comes from).

    So rather than highlight any *problem* with Python, your [mostly
    worthless microbenchmark non-realworld] tests show that Python's
    unicode implementation is awesome.
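
    The adaptation described here can be observed directly. A minimal
    sketch, assuming CPython 3.3+ with PEP 393 (the `per_char_size`
    helper is just for illustration; absolute sizes vary by build):

    ```python
    import sys

    # Per-character storage under PEP 393 (the "flexible string
    # representation"), estimated by growing a string by one character
    # and measuring how much its memory footprint grows.
    def per_char_size(ch):
        return sys.getsizeof(ch * 1001) - sys.getsizeof(ch * 1000)

    print(per_char_size('z'))           # Latin-1 range: 1 byte per char
    print(per_char_size('\u0fce'))      # BMP, non-Latin-1: 2 bytes per char
    print(per_char_size('\U0001f600'))  # astral plane: 4 bytes per char
    ```

    Note that U+0FCE actually lands in the 2-bytes-per-char
    representation, not the 4-byte one.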

    Still waiting to see an actual bug-report as mentioned on the other
    thread.

    -tkc
    Tim Chase, Apr 29, 2014
    #2

  3. MRAB Guest

    On 2014-04-29 18:37, wrote:
    > Let see how Python is ready for the next Unicode version
    > (Unicode 7.0.0.Beta).
    >
    >
    >>>> timeit.repeat("(x*1000 + y)[:-1]", setup="x = 'abc'; y = 'z'")
    > [1.4027834829454946, 1.38714224331963, 1.3822586635296261]
    >>>> timeit.repeat("(x*1000 + y)[:-1]", setup="x = 'abc'; y = '\u0fce'")
    > [5.462776291480395, 5.4479432055423445, 5.447874284053398]
    >>>>
    >>>>
    >>>> # more interesting
    >>>> timeit.repeat("(x*1000 + y)[:-1]",\
    > ... setup="x = 'abc'.encode('utf-8'); y = '\u0fce'.encode('utf-8')")
    > [1.3496489533188765, 1.328654286266783, 1.3300913977710707]
    >>>>

    Although the third example is the fastest, it's also the wrong way to
    handle Unicode:

    >>> x = 'abc'.encode('utf-8'); y = '\u0fce'.encode('utf-8')
    >>> t = (x*1000 + y)[:-1].decode('utf-8')

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'utf-8' codec can't decode bytes in position
    3000-3001: unexpected end of data
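
    A minimal sketch of why the bytes version goes wrong: slicing `str`
    removes whole characters, while slicing `bytes` can cut a multi-byte
    UTF-8 sequence in half:

    ```python
    x, y = 'abc', '\u0fce'            # U+0FCE needs 3 bytes in UTF-8

    # str slicing removes a whole character -- always valid text
    assert (x * 1000 + y)[:-1] == x * 1000

    # bytes slicing removes one byte, leaving a truncated UTF-8 sequence
    b = (x.encode('utf-8') * 1000 + y.encode('utf-8'))[:-1]
    try:
        b.decode('utf-8')
    except UnicodeDecodeError as e:
        print(e.reason)               # unexpected end of data
    ```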

    > Note 1: "lookup" is not the problem.
    >
    > Note 2: From Unicode.org : "[...] We strongly encourage [...] and test
    > them with their programs [...]"
    >
    > -> Done.
    >
    > jmf
    >
    MRAB, Apr 29, 2014
    #3
  4. Rustom Mody Guest

    On Tuesday, April 29, 2014 11:29:23 PM UTC+5:30, Tim Chase wrote:
    > While I dislike feeding the troll, what I see here is:


    <snipped>

    Since it's Unicode-troll time, here's my contribution
    http://blog.languager.org/2014/04/unicode-and-unix-assumption.html

    :)

    More seriously, since I've quoted some esteemed members of this list
    explicitly (Steven) and the list in general, please let me know if
    anything is inaccurate or inappropriate.
    Rustom Mody, Apr 30, 2014
    #4
  5. Guest

    @ Time Chase

    I'm perfectly aware about what I'm doing.


    @ MRAB

    "...Although the third example is the fastest, it's also the wrong
    way to handle Unicode: ..."

    Maybe it's exactly the opposite: it illustrates very well
    the quality of the coding schemes endorsed by Unicode.org.
    I deliberately chose utf-8.


    >>> sys.getsizeof('\u0fce')
    40
    >>> sys.getsizeof('\u0fce'.encode('utf-8'))
    20
    >>> sys.getsizeof('\u0fce'.encode('utf-16-be'))
    19
    >>> sys.getsizeof('\u0fce'.encode('utf-32-be'))
    21
    >>>


    Q. How do you save memory without wasting time on encoding?
    By using products that natively use the unicode coding schemes?

    Do you understand unicode? Or do you understand
    unicode via Python?
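
    A side note on reading these `getsizeof` numbers: they include a
    fixed per-object header, so one-character comparisons mostly measure
    overhead rather than encoding efficiency. A sketch (payload sizes
    assume CPython; header sizes vary by build):

    ```python
    import sys

    # Each bytes object carries a fixed header; subtracting it out
    # exposes the actual payload cost of the encoded text.
    overhead = sys.getsizeof(b'')
    for enc in ('utf-8', 'utf-16-be', 'utf-32-be'):
        data = ('\u0fce' * 1000).encode(enc)
        print(enc, sys.getsizeof(data) - overhead, 'payload bytes')
    ```

    With 1000 characters the header washes out, and UTF-16 is the
    compact encoding for this code point, not UTF-8.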

    ---

    A Tibetan monk [*] using Py32:

    >>> timeit.repeat("(x*1000 + y)[:-1]", setup="x = 'abc'; y = 'z'")
    [2.3394840182882186, 2.3145832750782653, 2.3207231951529685]
    >>> timeit.repeat("(x*1000 + y)[:-1]", setup="x = 'abc'; y = '\u0fce'")
    [2.328517624800078, 2.3169403900011076, 2.317586282812048]
    >>>


    [*] Your curiosity has certainly shown you what this code point means.
    For the others:
    U+0FCE TIBETAN SIGN RDEL NAG RDEL DKAR
    signifies good luck earlier, bad luck later


    (My comment: Good luck with Python or bad luck with Python)

    jmf
    , Apr 30, 2014
    #5
  6. Tim Chase Guest

    On 2014-04-30 00:06, wrote:
    > @ Time Chase
    >
    > I'm perfectly aware about what I'm doing.


    Apparently, you're quite adept at appending superfluous characters to
    sensible strings...did you benchmark your email composition, too? ;-)

    -tkc (aka "Tim", not "Time")
    Tim Chase, Apr 30, 2014
    #6
  7. On Tue, 29 Apr 2014 21:53:22 -0700, Rustom Mody wrote:

    > On Tuesday, April 29, 2014 11:29:23 PM UTC+5:30, Tim Chase wrote:
    >> While I dislike feeding the troll, what I see here is:

    >
    > <snipped>
    >
    > Since its Unicode-troll time, here's my contribution
    > http://blog.languager.org/2014/04/unicode-and-unix-assumption.html



    I disagree with much of your characterisation of the Unix assumption, and
    I point out that out of the two most widespread flavours of OS today,
    Linux/Unix and Windows, it is *Windows* and not Unix which still
    regularly uses legacy encodings.

    Also your link to Joel On Software mistakenly links to me instead of Joel.

    There's a missing apostrophe in "Ive" [sic] in Acknowledgment #2.

    I didn't notice any other typos.


    --
    Steven
    Steven D'Aprano, May 1, 2014
    #7
  8. Guest

    On Wednesday, 30 April 2014 20:48:48 UTC+2, Tim Chase wrote:
    > On 2014-04-30 00:06, wrote:
    > > @ Time Chase
    > >
    > > I'm perfectly aware about what I'm doing.
    >
    > Apparently, you're quite adept at appending superfluous characters to
    > sensible strings...did you benchmark your email composition, too? ;-)
    >
    > -tkc (aka "Tim", not "Time")

    Mea culpa, ...
    , May 1, 2014
    #8
  9. Rustom Mody Guest

    On Thursday, May 1, 2014 10:30:43 AM UTC+5:30, Steven D'Aprano wrote:
    > On Tue, 29 Apr 2014 21:53:22 -0700, Rustom Mody wrote:


    > > On Tuesday, April 29, 2014 11:29:23 PM UTC+5:30, Tim Chase wrote:
    > >> While I dislike feeding the troll, what I see here is:

    > > Since its Unicode-troll time, here's my contribution
    > > http://blog.languager.org/2014/04/unicode-and-unix-assumption.html


    > Also your link to Joel On Software mistakenly links to me instead of Joel.
    > There's a missing apostrophe in "Ive" [sic] in Acknowledgment #2.


    Done, Done.

    > I didn't notice any other typos.


    Thank you sir!

    > I point out that out of the two most widespread flavours of OS today,
    > Linux/Unix and Windows, it is *Windows* and not Unix which still
    > regularly uses legacy encodings.


    Not sure what you are suggesting...
    That (I am suggesting that) 8859 is legacy and 1252 is not?

    > I disagree with much of your characterisation of the Unix assumption,


    I'd be interested to know the details -- Contents? Details? Tone? Tenor? Blaspheming the sacred scripture?
    (if you are so inclined of course)
    Rustom Mody, May 1, 2014
    #9
  10. Terry Reedy Guest

    On 5/1/2014 2:04 PM, Rustom Mody wrote:

    >>> Since its Unicode-troll time, here's my contribution
    >>> http://blog.languager.org/2014/04/unicode-and-unix-assumption.html


    I will not comment on the Unix-assumption part, but I think you go wrong
    with this: "Unicode is a Headache". The major headache is that unicode
    and its very few encodings are not universally used. The headache is all
    the non-unicode legacy encodings still being used. So you better title
    this section 'Non-Unicode is a Headache'.

    The first sentence is this misleading tautology: "With ASCII, data is
    ASCII whether its file, core, terminal, or network; ie "ABC" is
    65,66,67." Let me translate: "If all text is ASCII encoded, then text
    data is ASCII, whether ..." But it was never the case that all text was
    ASCII encoded. IBM used 6-bit BCDIC and then 8-bit EBCDIC and I believe
    still uses the latter. Other mainframe makers used other encodings of
    A-Z + 0-9 + symbols + control codes. The all-ASCII paradise was never
    universal. You could have just as well said "With EBCDIC, data is
    EBCDIC, whether ..."

    https://en.wikipedia.org/wiki/Ascii
    https://en.wikipedia.org/wiki/EBCDIC

    A crucial step in the spread of Ascii was its use for microcomputers,
    including the IBM PC. The latter was considered a toy by the mainframe
    guys. If they had known that PCs would partly take over the computing
    world, they might have suggested or insisted that it use EBCDIC.

    "With unicode there are:
    encodings"
    where 'encodings' is linked to
    https://en.wikipedia.org/wiki/Character_encodings_in_HTML

    If html 'always' used utf-8 (like xml), as has become common but not
    universal, all of the problems with *non-unicode* character sets and
    encodings would disappear. The pre-unicode declarations could then
    disappear. More truthful: "without unicode there are 100s of encodings
    and with unicode only 3 that we should worry about."

    "in-memory formats"

    These are not the concern of the using programmer as long as they do not
    introduce bugs or limitations (as do all the languages stuck on UCS-2
    and many using UTF-16, including old Python narrow builds). Using what
    should generally be the universal transmission format, UTF-8, as the
    internal format means either losing indexing and slicing, having those
    operations slow from O(1) to O(len(string)), or adding an index table
    that is not part of the unicode standard. Using UTF-32 avoids the above
    but usually wastes space -- up to 75%.

    "strange beasties like python's FSR"

    Have you really let yourself be poisoned by JMF's bizarre rants? The FSR
    is an *internal optimization* that benefits most unicode operations that
    people actually perform. It uses UTF-32 by default but adapts to the
    strings users create by compressing the internal format. The compression
    is trivial -- simply dropping leading null bytes common to all
    characters -- so each character is still readable as is. The string
    header records how many bytes are left. Is the idea of algorithms that
    adapt to inputs really strange to you?

    Like good adaptive algorithms, the FSR is invisible to the user except
    for reducing space or time or maybe both. Unicode operations are
    otherwise the same as with previous wide builds. People who used to use
    narrow-builds also benefit from bug elimination. The only 'headaches'
    involved might have been those of the developers who optimized previous
    wide builds.

    CPython has many other functions with special-case optimizations and
    'fast paths' for common, simple cases. For instance, (some? all?) number
    operations are optimized for pairs of integers. Do you call these
    'strange beasties'?

    PyPy is faster than CPython, when it is, because it is even more
    adaptable to particular computations by creating new fast paths. The
    mechanism to create these 'strange beasties' might have been a headache
    for the writers, but when it works, which it now seems to, it is not for
    the users.

    --
    Terry Jan Reedy
    Terry Reedy, May 1, 2014
    #10
  11. MRAB Guest

    On 2014-05-01 23:38, Terry Reedy wrote:
    > On 5/1/2014 2:04 PM, Rustom Mody wrote:
    >
    >>>> Since its Unicode-troll time, here's my contribution
    >>>> http://blog.languager.org/2014/04/unicode-and-unix-assumption.html

    >
    > I will not comment on the Unix-assumption part, but I think you go wrong
    > with this: "Unicode is a Headache". The major headache is that unicode
    > and its very few encodings are not universally used. The headache is all
    > the non-unicode legacy encodings still being used. So you better title
    > this section 'Non-Unicode is a Headache'.
    >

    [snip]
    I think he's right when he says "Unicode is a headache", but only
    because it's being used to handle languages which are, themselves, a
    "headache": left-to-right versus right-to-left, sometimes on the same
    line; diacritics, possibly several on a glyph; etc.
    MRAB, May 2, 2014
    #11
  12. Rustom Mody Guest

    On Friday, May 2, 2014 5:03:21 AM UTC+5:30, MRAB wrote:
    > On 2014-05-01 23:38, Terry Reedy wrote:
    > > On 5/1/2014 2:04 PM, Rustom Mody wrote:
    > >>>> Since its Unicode-troll time, here's my contribution
    > >>>> http://blog.languager.org/2014/04/unicode-and-unix-assumption.html

    > > I will not comment on the Unix-assumption part, but I think you go wrong
    > > with this: "Unicode is a Headache". The major headache is that unicode
    > > and its very few encodings are not universally used. The headache is all
    > > the non-unicode legacy encodings still being used. So you better title
    > > this section 'Non-Unicode is a Headache'.

    > [snip]
    > I think he's right when he says "Unicode is a headache", but only
    > because it's being used to handle languages which are, themselves, a
    > "headache": left-to-right versus right-to-left, sometimes on the same
    > line; diacritics, possibly several on a glyph; etc.


    Yes, the headaches go a little further back than Unicode.
    There is a certain large old book...
    In which is described the building of a 'tower that reached up to heaven'....

    At which point 'it was decided'¶ to do something to prevent that.

    And our headaches started.

    I don't know how one causally connects the 'headaches', but I've seen
    - mojibake
    - unicode 'number-boxes' (what are these called?)
    - Worst of all, what we *don't* see -- how many others don't see what we see?

    I never knew of any of this in the good ol' days of ASCII.

    ¶ Passive voice is often the best choice in the interests of political correctness

    It would be a pleasant surprise if everyone sees a pilcrow at the start of the line above
    Rustom Mody, May 2, 2014
    #12
  13. Rustom Mody Guest

    On Friday, May 2, 2014 4:08:35 AM UTC+5:30, Terry Reedy wrote:
    > On 5/1/2014 2:04 PM, Rustom Mody wrote:


    > >>> Since its Unicode-troll time, here's my contribution
    > >>> http://blog.languager.org/2014/04/unicode-and-unix-assumption.html

    <snipped>

    > CPython has many other functions with special-case optimizations and
    > 'fast paths' for common, simple cases. For instance, (some? all?) number
    > operations are optimized for pairs of integers. Do you call these
    > 'strange beasties'?


    Here is an instance of someone who would like a certain optimization to be
    dis-able-able

    https://mail.python.org/pipermail/python-list/2014-February/667169.html

    To the best of my knowledge it's nothing to do with unicode or with jmf.

    Why, if optimizations are always desirable, do C compilers have
    -O0, -O1, -O2, -O3 and zillions of more specific flags?

    JFTR I have no issue with FSR. What we have to hand to jmf - willingly
    or otherwise - is that many more people have heard of FSR thanks to him.
    [I am one of them]

    I don't even know whether jmf has a real technical (as he calls it
    'mathematical') issue or whether it's entirely political:

    "Why should I pay more for a EURO sign than a $ sign?"

    Well, perhaps that is more related to the exchange rate than to python!
    Rustom Mody, May 2, 2014
    #13
  14. Rustom Mody Guest

    On Friday, May 2, 2014 7:59:55 AM UTC+5:30, Rustom Mody wrote:
    > "Why should I pay more for a EURO sign than a $ sign?"


    A unicode 'headache' there:
    I typed the Euro sign (trying again € ) not EURO

    Somebody -- I guess it's GG in overhelpful mode -- converted it
    And made my post:
    Content-Type: text/plain; charset=ISO-8859-1

    Will some devanagari vowels help it stop being helpful?
    अ आ इ ई उ ऊ ऋ
    Rustom Mody, May 2, 2014
    #14
  15. Rustom Mody Guest

    On Friday, May 2, 2014 8:09:44 AM UTC+5:30, Ben Finney wrote:
    > Rustom Mody writes:


    > > Yes, the headaches go a little further back than Unicode.


    > Okay, so can you change your article to reflect the fact that the
    > headaches both pre-date Unicode, and are made much easier by Unicode?


    Predate: Yes
    Made easier: No

    > > There is a certain large old book...


    > Ah yes, the neo-Sumerian story "Enmerkar_and_the_Lord_of_Aratta"
    > <URL:https://en.wikipedia.org/wiki/Enmerkar_and_the_Lord_of_Aratta>.
    > Probably inspired by stories older than that, of course.


    Thanks for that link

    > > In which is described the building of a 'tower that reached up to heaven'...
    > > At which point 'it was decided'¶ to do something to prevent that.
    > > And our headaches started.


    > And other myths with fantastic reasons for the diversity of language
    > <URL:https://en.wikipedia.org/wiki/Mythical_origins_of_language>.


    This one takes the cake - see 1st para
    http://hilgart.org/enformy/BronsonRekindling.pdf


    > > I never knew of any of this in the good ol days of ASCII


    > Yes, by ignoring all other writing systems except one's own - and
    > thereby excluding most of the world's people - the system can be made
    > simpler.


    > Hopefully the proportion of programmers who still feel they can make
    > such a parochial choice is rapidly shrinking.


    See link above: Ethnic differences and chauvinism are invariably linked
    Rustom Mody, May 2, 2014
    #15
  16. On Fri, May 2, 2014 at 12:29 PM, Rustom Mody <> wrote:
    > Here is an instance of someone who would like a certain optimization to be
    > dis-able-able
    >
    > https://mail.python.org/pipermail/python-list/2014-February/667169.html
    >
    > To the best of my knowledge its nothing to do with unicode or with jmf.


    It doesn't, and it has only to do with testing. I've had similar
    issues at times; for instance, trying to benchmark one language or
    language construct against another often means fighting against an
    optimizer. (How, for instance, do you figure out what loop overhead
    is, when an empty loop is completely optimized out?) This is nothing
    whatsoever to do with Unicode, nor to do with the optimization that
    Python and Pike (and maybe other languages) do with the storage of
    Unicode strings.
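
    A sketch of the benchmarking problem described here: CPython's
    interpreter does not eliminate the empty loop, which is exactly why
    the subtraction works there, while under a JIT or optimizing
    compiler the empty loop may vanish and the estimate becomes useless:

    ```python
    import timeit

    # Estimate loop overhead by timing an empty loop, then subtract it
    # from a loop that does work per iteration.
    empty = timeit.timeit('for _ in range(1000): pass', number=2000)
    work = timeit.timeit('s = 0\nfor _ in range(1000): s += 1', number=2000)
    print('loop overhead ~', empty, 'body cost ~', work - empty)
    ```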

    ChrisA
    Chris Angelico, May 2, 2014
    #16
  17. On Thu, 01 May 2014 18:38:35 -0400, Terry Reedy wrote:

    > "strange beasties like python's FSR"
    >
    > Have you really let yourself be poisoned by JMF's bizarre rants? The FSR
    > is an *internal optimization* that benefits most unicode operations that
    > people actually perform. It uses UTF-32 by default but adapts to the
    > strings users create by compressing the internal format. The compression
    > is trivial -- simple dropping leading null bytes common to all
    > characters -- so each character is still readable as is.


    For anyone who, like me, wasn't convinced that Unicode worked that way,
    you can see for yourself that it does. You don't need Python 3.3, any
    version of 3.x will work. In Python 2.7, it should work if you just
    change the calls from "chr()" to "unichr()":

    py> for i in range(256):
    ...     c = chr(i)
    ...     u = c.encode('utf-32-be')
    ...     assert u[:3] == b'\0\0\0'
    ...     assert u[3:] == c.encode('latin-1')
    ...
    py> for i in range(256, 0xFFFF+1):
    ...     if 0xD800 <= i <= 0xDFFF:
    ...         continue  # lone surrogates can't be encoded
    ...     c = chr(i)
    ...     u = c.encode('utf-32-be')
    ...     assert u[:2] == b'\0\0'
    ...     assert u[2:] == c.encode('utf-16-be')
    ...
    py>


    So Terry is correct: dropping leading zeroes, and treating the remainder
    as either Latin-1 or UTF-16, works fine, and potentially saves a lot of
    memory.


    --
    Steven D'Aprano
    http://import-that.dreamwidth.org/
    Steven D'Aprano, May 2, 2014
    #17
  18. Rustom Mody Guest

    On Friday, May 2, 2014 8:31:56 AM UTC+5:30, Chris Angelico wrote:
    > On Fri, May 2, 2014 at 12:29 PM, Rustom Mody wrote:
    > > Here is an instance of someone who would like a certain optimization to be
    > > dis-able-able
    > > https://mail.python.org/pipermail/python-list/2014-February/667169.html
    > > To the best of my knowledge its nothing to do with unicode or with jmf.


    > It doesn't, and it has only to do with testing. I've had similar
    > issues at times; for instance, trying to benchmark one language or
    > language construct against another often means fighting against an
    > optimizer. (How, for instance, do you figure out what loop overhead
    > is, when an empty loop is completely optimized out?) This is nothing
    > whatsoever to do with Unicode, nor to do with the optimization that
    > Python and Pike (and maybe other languages) do with the storage of
    > Unicode strings.


    This was said in response to Terry's

    > CPython has many other functions with special-case optimizations and
    > 'fast paths' for common, simple cases. For instance, (some? all?) number
    > operations are optimized for pairs of integers. Do you call these
    > 'strange beasties'?


    which evidently vanished -- optimized out :D -- in multiple levels of quoting
    Rustom Mody, May 2, 2014
    #18
  19. Terry Reedy Guest

    On 5/1/2014 7:33 PM, MRAB wrote:
    > On 2014-05-01 23:38, Terry Reedy wrote:
    >> On 5/1/2014 2:04 PM, Rustom Mody wrote:
    >>
    >>>>> Since its Unicode-troll time, here's my contribution
    >>>>> http://blog.languager.org/2014/04/unicode-and-unix-assumption.html

    >>
    >> I will not comment on the Unix-assumption part, but I think you go wrong
    >> with this: "Unicode is a Headache". The major headache is that unicode
    >> and its very few encodings are not universally used. The headache is all
    >> the non-unicode legacy encodings still being used. So you better title
    >> this section 'Non-Unicode is a Headache'.
    >>

    > [snip]
    > I think he's right when he says "Unicode is a headache", but only
    > because it's being used to handle languages which are, themselves, a
    > "headache": left-to-right versus right-to-left, sometimes on the same
    > line;


    Handling that without unicode is even worse.

    > diacritics, possibly several on a glyph; etc.


    Ditto.

    --
    Terry Jan Reedy
    Terry Reedy, May 2, 2014
    #19
  20. Rustom Mody Guest

    On Friday, May 2, 2014 9:46:36 AM UTC+5:30, Terry Reedy wrote:
    > On 5/1/2014 7:33 PM, MRAB wrote:
    > > On 2014-05-01 23:38, Terry Reedy wrote:
    > >> On 5/1/2014 2:04 PM, Rustom Mody wrote:
    > >>>>> Since its Unicode-troll time, here's my contribution
    > >>>>> http://blog.languager.org/2014/04/unicode-and-unix-assumption.html
    > >> I will not comment on the Unix-assumption part, but I think you go wrong
    > >> with this: "Unicode is a Headache". The major headache is that unicode
    > >> and its very few encodings are not universally used. The headache is all
    > >> the non-unicode legacy encodings still being used. So you better title
    > >> this section 'Non-Unicode is a Headache'.

    > > [snip]
    > > I think he's right when he says "Unicode is a headache", but only
    > > because it's being used to handle languages which are, themselves, a
    > > "headache": left-to-right versus right-to-left, sometimes on the same
    > > line;


    > Handling that without unicode is even worse.


    > > diacritics, possibly several on a glyph; etc.


    > Ditto.


    What's the best cure for a headache?

    Cut off the head.

    What's the best cure for Unicode?

    Ascii

    Saying, however, that there is no headache in unicode does not make the
    headache go away:

    http://lucumr.pocoo.org/2014/1/5/unicode-in-2-and-3/

    No I am not saying that the contents/style/tone are right.
    However people are evidently suffering the transition.
    Denying it is not a help.

    And unicode consortium's ways are not exactly helpful to its own cause:
    Imagine the C standard committee deciding that adding mandatory garbage collection
    to C is a neat idea

    Unicode consortium's going from old BMP to current (6.0) SMPs to who-knows-what
    in the future is similar.
    Rustom Mody, May 2, 2014
    #20
