break unichr instead of fix ord?

Discussion in 'Python' started by rurpy@yahoo.com, Aug 25, 2009.

  1. Guest

    In Python 2.5 on Windows I could do [*1]:

    # Create a unicode character outside of the BMP.
    >>> a = u'\U00010040'


    # On Windows it is represented as a surrogate pair.
    >>> len(a)

    2
    >>> a[0],a[1]

    (u'\ud800', u'\udc40')

    # Create the same character with the unichr() function.
    >>> a = unichr (65600)
    >>> a[0],a[1]

    (u'\ud800', u'\udc40')

    # Although the unichr() function works fine, its
    # inverse, ord(), doesn't.
    >>> ord (a)

    TypeError: ord() expected a character, but string of length 2 found

    On Python 2.6, unichr() was "fixed" (using the word
    loosely) so that it too now fails with characters outside
    the BMP.

    >>> a = unichr (65600)

    ValueError: unichr() arg not in range(0x10000) (narrow Python build)

    Why was this done rather than changing ord() to accept a
    surrogate pair?

    Does not this effectively make unichr() and ord() useless
    on Windows for all but a subset of unicode characters?
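
    For reference, the pair shown above is the standard UTF-16 split of a
    supplementary code point; a minimal sketch of the arithmetic (written in
    Python 3 syntax so it runs on any build; the helper name is illustrative):

```python
def to_surrogate_pair(cp):
    # Split a supplementary code point (>= 0x10000) into a UTF-16
    # high/low surrogate pair.
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low

print([hex(u) for u in to_surrogate_pair(0x10040)])  # ['0xd800', '0xdc40']
```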
     
    , Aug 25, 2009
    #1

  2. On 25-08-2009 at 21:45:49 <> wrote:

    > In Python 2.5 on Windows I could do [*1]:
    >
    > # Create a unicode character outside of the BMP.
    > >>> a = u'\U00010040'

    >
    > # On Windows it is represented as a surrogate pair.

    [snip]
    > On Python 2.6, unichr() was "fixed" (using the word
    > loosely) so that it too now fails with characters outside
    > the BMP.

    [snip]
    > Does not this effectively make unichr() and ord() useless
    > on Windows for all but a subset of unicode characters?


    Are you sure you couldn't have a UCS-4-compiled Python distro
    for Windows? :-O

    *j

    --
    Jan Kaliszewski (zuo) <>
     
    Jan Kaliszewski, Aug 26, 2009
    #2

  3. Mark Tolonen Guest

    <> wrote in message
    news:...
    > In Python 2.5 on Windows I could do [*1]:
    >
    > # Create a unicode character outside of the BMP.
    > >>> a = u'\U00010040'

    >
    > # On Windows it is represented as a surrogate pair.
    > >>> len(a)

    > 2
    > >>> a[0],a[1]

    > (u'\ud800', u'\udc40')
    >
    > # Create the same character with the unichr() function.
    > >>> a = unichr (65600)
    > >>> a[0],a[1]

    > (u'\ud800', u'\udc40')
    >
    > # Although the unichr() function works fine, its
    > # inverse, ord(), doesn't.
    > >>> ord (a)

    > TypeError: ord() expected a character, but string of length 2 found
    >
    > On Python 2.6, unichr() was "fixed" (using the word
    > loosely) so that it too now fails with characters outside
    > the BMP.
    >
    > >>> a = unichr (65600)

    > ValueError: unichr() arg not in range(0x10000) (narrow Python build)
    >
    > Why was this done rather than changing ord() to accept a
    > surrogate pair?
    >
    > Does not this effectively make unichr() and ord() useless
    > on Windows for all but a subset of unicode characters?


    Switch to Python 3?

    >>> x='\U00010040'
    >>> import unicodedata
    >>> unicodedata.name(x)

    'LINEAR B SYLLABLE B025 A2'
    >>> ord(x)

    65600
    >>> hex(ord(x))

    '0x10040'
    >>> unicodedata.name(chr(0x10040))

    'LINEAR B SYLLABLE B025 A2'
    >>> ord(chr(0x10040))

    65600
    >>> print(ascii(chr(0x10040)))

    '\ud800\udc40'

    -Mark
     
    Mark Tolonen, Aug 26, 2009
    #3
  4. 2009/8/25 <>:
    > In Python 2.5 on Windows I could do [*1]:
    >
    >  # Create a unicode character outside of the BMP.
    >  >>> a = u'\U00010040'
    >
    >  # On Windows it is represented as a surrogate pair.
    >  >>> len(a)
    >  2
    >  >>> a[0],a[1]
    >  (u'\ud800', u'\udc40')
    >
    >  # Create the same character with the unichr() function.
    >  >>> a = unichr (65600)
    >  >>> a[0],a[1]
    >  (u'\ud800', u'\udc40')
    >
    >  # Although the unichr() function works fine, its
    >  # inverse, ord(), doesn't.
    >  >>> ord (a)
    >  TypeError: ord() expected a character, but string of length 2 found
    >
    > On Python 2.6, unichr() was "fixed" (using the word
    > loosely) so that it too now fails with characters outside
    > the BMP.
    >
    >  >>> a = unichr (65600)
    >  ValueError: unichr() arg not in range(0x10000) (narrow Python build)
    >
    > Why was this done rather than changing ord() to accept a
    > surrogate pair?
    >
    > Does not this effectively make unichr() and ord() useless
    > on Windows for all but a subset of unicode characters?
    > --
    > http://mail.python.org/mailman/listinfo/python-list
    >


    Hi,
    I'm not sure about the exact reasons for this behaviour on narrow
    builds either (maybe to keep the input/output data consistently at
    exactly one character?).

    However, if I need these functions for higher unicode planes, the
    following rather hackish replacements seem to work. I presume, there
    might be smarter ways of dealing with this, but anyway...

    hth,
    vbr

    #### not (systematically) tested #####################################

    import sys

    def wide_ord(char):
        try:
            return ord(char)
        except TypeError:
            if (len(char) == 2 and 0xD800 <= ord(char[0]) <= 0xDBFF
                    and 0xDC00 <= ord(char[1]) <= 0xDFFF):
                return ((ord(char[0]) - 0xD800) * 0x400
                        + (ord(char[1]) - 0xDC00) + 0x10000)
            else:
                raise TypeError("invalid character input")


    def wide_unichr(i):
        if i <= sys.maxunicode:
            return unichr(i)
        else:
            return ("\U" + hex(i)[2:].zfill(8)).decode("unicode-escape")
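
    A quick cross-check of the pairing arithmetic used in wide_ord() above,
    written for Python 3, where chr()/ord() cover the full range on every
    build (the function name is just for illustration):

```python
def from_surrogate_pair(high, low):
    # Combine a UTF-16 high/low surrogate pair back into a code point,
    # using the same arithmetic as wide_ord() above.
    assert 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF
    return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000

print(hex(from_surrogate_pair(0xD800, 0xDC40)))  # 0x10040
```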
     
    Vlastimil Brom, Aug 26, 2009
    #4
  5. > In Python 2.5 on Windows I could do [*1]:
    >
    > >>> a = unichr (65600)
    > >>> a[0],a[1]

    > (u'\ud800', u'\udc40')


    I can't reproduce that. My copy of Python on Windows gives

    Traceback (most recent call last):
      File "<pyshell#0>", line 1, in <module>
        unichr(65600)
    ValueError: unichr() arg not in range(0x10000) (narrow Python build)

    This is

    Python 2.5.4 (r254:67916, Dec 23 2008, 15:10:54) [MSC v.1310 32 bit
    (Intel)] on win32

    Regards,
    Martin
     
    Martin v. Löwis, Aug 26, 2009
    #5
  6. Guest

    On 08/26/2009 03:10 PM, "Martin v. Löwis" wrote:
    >> >> In Python 2.5 on Windows I could do [*1]:
    >> >>
    >> >> >>> a = unichr (65600)
    >> >> >>> a[0],a[1]
    >> >> (u'\ud800', u'\udc40')

    > >
    > > I can't reproduce that. My copy of Python on Windows gives
    > >
    > > Traceback (most recent call last):
    > >   File "<pyshell#0>", line 1, in <module>
    > >     unichr(65600)
    > > ValueError: unichr() arg not in range(0x10000) (narrow Python build)
    > >
    > > This is
    > >
    > > Python 2.5.4 (r254:67916, Dec 23 2008, 15:10:54) [MSC v.1310 32 bit
    > > (Intel)] on win32


    My apologies for the red herring. I was working from
    a comment in my replacement ord() function. I dug up
    an old copy of Python 2.4.3 and could not reproduce it
    there either so I have no explanation for the comment
    (which I wrote). Python 2.3 maybe?

    But regardless, the significant question is, what is
    the reason for having ord() (and unichr) not work for
    surrogate pairs and thus not usable with a large number
    of unicode characters that Python otherwise supports?
     
    , Aug 27, 2009
    #6
  7. Guest

    On Aug 25, 9:53 pm, "Mark Tolonen" <> wrote:
    > <> wrote in message
    >
    > news:...
    >
    >
    >
    > > In Python 2.5 on Windows I could do [*1]:

    >
    > >  # Create a unicode character outside of the BMP.
    > >  >>> a = u'\U00010040'

    >
    > >  # On Windows it is represented as a surrogate pair.
    > >  >>> len(a)
    > >  2
    > >  >>> a[0],a[1]
    > >  (u'\ud800', u'\udc40')

    >
    > >  # Create the same character with the unichr() function.
    > >  >>> a = unichr (65600)
    > >  >>> a[0],a[1]
    > >  (u'\ud800', u'\udc40')

    >
    > >  # Although the unichr() function works fine, its
    > >  # inverse, ord(), doesn't.
    > >  >>> ord (a)
    > >  TypeError: ord() expected a character, but string of length 2 found

    >
    > > On Python 2.6, unichr() was "fixed" (using the word
    > > loosely) so that it too now fails with characters outside
    > > the BMP.

    >
    > >  >>> a = unichr (65600)
    > >  ValueError: unichr() arg not in range(0x10000) (narrow Python build)

    >
    > > Why was this done rather than changing ord() to accept a
    > > surrogate pair?

    >
    > > Does not this effectively make unichr() and ord() useless
    > > on Windows for all but a subset of unicode characters?

    >
    > Switch to Python 3?
    >
    > >>> x='\U00010040'
    > >>> import unicodedata
    > >>> unicodedata.name(x)

    >
    > 'LINEAR B SYLLABLE B025 A2'
    > >>> ord(x)
    > 65600
    > >>> hex(ord(x))

    > '0x10040'
    > >>> unicodedata.name(chr(0x10040))

    >
    > 'LINEAR B SYLLABLE B025 A2'
    > >>> ord(chr(0x10040))
    > 65600
    > >>> print(ascii(chr(0x10040)))

    >
    > '\ud800\udc40'
    >
    > -Mark


    I am still a long way away from moving to Python 3
    but I am looking forward to hopefully more rational
    unicode handling there. Thanks for the info.
     
    , Aug 27, 2009
    #7
  8. Guest

    On Aug 26, 2:05 am, Vlastimil Brom <> wrote:
    >[...]
    > Hi,
    > I'm not sure about the exact reasons for this behaviour on narrow
    > builds either (maybe to keep the input/output data consistently at
    > exactly one character?).
    >
    > However, if I need these functions for higher unicode planes, the
    > following rather hackish replacements seem to work. I presume, there
    > might be smarter ways of dealing with this, but anyway...
    >
    > hth,
    >    vbr
    >
    >[...code snipped...]


    Thanks, I wrote a replacement ord function nearly identical
    to yours but will steal your unichr function if that's ok. :)

    But I still wonder why all this is necessary.
     
    , Aug 27, 2009
    #8
  9. On Wed, 26 Aug 2009 16:27:33 -0700, rurpy wrote:

    > But regardless, the significant question is, what is the reason for
    > having ord() (and unichr) not work for surrogate pairs and thus not
    > usable with a large number of unicode characters that Python otherwise
    > supports?



    I'm no expert on Unicode, but my guess is that the reason is out of a
    desire for simplicity: unichr() should always return a single char, not a
    pair of chars, and similarly ord() should take as input a single char,
    not two, and return a single number.

    Otherwise it would be ambiguous whether ord(surrogate_pair) should return
    a pair of ints representing the codes for each item in the pair, or a
    single int representing the code point for the whole pair.

    E.g. given your earlier example:

    >>> a = u'\U00010040'
    >>> len(a)

    2
    >>> a[0]

    u'\ud800'
    >>> a[1]

    u'\udc40'

    would you expect ord(a) to return (0xd800, 0xdc40) or 0x10040? If the
    latter, what about ord(u'ab')?

    Remember that a unicode string can contain code points that aren't valid
    characters:

    >>> ord(u'\ud800') # reserved for surrogates, not a character

    55296

    so if ord() sees a surrogate pair, it can't assume it's meant to be
    treated as a surrogate pair rather than a pair of code points that just
    happens to match a surrogate pair.

    None of this means you can't deal with surrogate pairs, it just means you
    can't deal with them using ord() and unichr().

    The above is just my guess, I'd be interested to hear what others say.


    --
    Steven
     
    Steven D'Aprano, Aug 27, 2009
    #9
  10. Guest

    On 08/26/2009 08:52 PM, Steven D'Aprano wrote:
    > On Wed, 26 Aug 2009 16:27:33 -0700, rurpy wrote:
    >
    >> But regardless, the significant question is, what is the reason for
    >> having ord() (and unichr) not work for surrogate pairs and thus not
    >> usable with a large number of unicode characters that Python otherwise
    >> supports?

    >
    >
    > I'm no expert on Unicode, but my guess is that the reason is out of a
    > desire for simplicity: unichr() should always return a single char, not a
    > pair of chars, and similarly ord() should take as input a single char,
    > not two, and return a single number.
    >
    > Otherwise it would be ambiguous whether ord(surrogate_pair) should return
    > a pair of ints representing the codes for each item in the pair, or a
    > single int representing the code point for the whole pair.
    >
    > E.g. given your earlier example:
    >
    >>>> a = u'\U00010040'
    >>>> len(a)

    > 2
    >>>> a[0]

    > u'\ud800'
    >>>> a[1]

    > u'\udc40'
    >
    > would you expect ord(a) to return (0xd800, 0xdc40) or 0x10040?


    The latter.

    > If the
    > latter, what about ord(u'ab')?


    I would expect a TypeError* (as ord() currently raises) because
    the string length is not 1 and 'ab' is not a surrogate pair.

    *Actually I would have expected ValueError but I'm not going
    to lose sleep over it.

    > Remember that a unicode string can contain code points that aren't valid
    > characters:
    >
    >>>> ord(u'\ud800') # reserved for surrogates, not a character

    > 55296
    >
    > so if ord() sees a surrogate pair, it can't assume it's meant to be
    > treated as a surrogate pair rather than a pair of code points that just
    > happens to match a surrogate pair.


    Well, actually, yes it can. :)

    Python has already made a strong statement that such a pair
    is the representation of a character:

    >>> a = ''.join([u'\ud800',u'\udc40'])
    >>> a

    u'\U00010040'

    That is, Python prints, and treats in nearly all other contexts,
    that combination as a character.

    This is related to the practicality argument: what is the ratio
    of the need to treat a surrogate pair as a character, consistent
    with the rest of Python, vs the need to treat it as a string
    of two separate (and, in the unicode sense, invalid?) characters?

    And if you want to treat each half of the pair separately
    it's not exactly hard: ord(a[0]), ord(a[1]).

    > None of this means you can't deal with surrogate pairs, it just means you
    > can't deal with them using ord() and unichr().


    Kind of like saying, it doesn't mean you can't deal
    with integers larger than 2**32, you just can't multiply
    and divide them.

    > The above is just my guess, I'd be interested to hear what others say.
     
    , Aug 27, 2009
    #10
  11. > My apologies for the red herring. I was working from
    > a comment in my replacement ord() function. I dug up
    > an old copy of Python 2.4.3 and could not reproduce it
    > there either so I have no explanation for the comment
    > (which I wrote). Python 2.3 maybe?


    No. The behavior you observed would only happen on
    a wide Unicode build (e.g. on Unix).

    > But regardless, the significant question is, what is
    > the reason for having ord() (and unichr) not work for
    > surrogate pairs and thus not usable with a large number
    > of unicode characters that Python otherwise supports?


    See PEP 261, http://www.python.org/dev/peps/pep-0261/
    It specifies all this.

    Regards,
    Martin
     
    Martin v. Löwis, Aug 27, 2009
    #11
  12. Guest

    On 08/26/2009 11:51 PM, "Martin v. Löwis" wrote:
    >[...]
    >> But regardless, the significant question is, what is
    >> the reason for having ord() (and unichr) not work for
    >> surrogate pairs and thus not usable with a large number
    >> of unicode characters that Python otherwise supports?

    >
    > See PEP 261, http://www.python.org/dev/peps/pep-0261/
    > It specifies all this.


    The PEP (AFAICT) says only what we already know... that
    on narrow builds unichr() will raise an exception with
    an argument >= 0x10000, and ord() is unichr()'s inverse.

    I have read the PEP twice now and still see no justification
    for that decision, it appears to have been made by fiat.[*1]

    Could you or someone please point me to specific justification
    for having unichr and ord work only for a subset of unicode
    characters on narrow builds, as opposed to the more general
    and IMO useful behavior proposed earlier in this thread?

    ----------------------------------------------------------
    [*1]
    The PEP says:
    * unichr(i) for 0 <= i < 2**16 (0x10000) always returns a
    length-one string.

    * unichr(i) for 2**16 <= i <= TOPCHAR will return a
    length-one string on wide Python builds. On narrow
    builds it will raise ValueError.
    and
    * ord() is always the inverse of unichr()

    which of course we know; that is the current behavior. But
    there is no reason given for that behavior.

    Under the second *unicode bullet point, there are two issues
    raised:
    1) Should surrogate pairs be disallowed on narrow builds?
    That appears to have been answered in the negative and is
    not relevant to my question.
    2) Should access to code points above TOPCHAR be allowed?
    Not relevant to my question.

    * every Python Unicode character represents exactly
    one Unicode code point (i.e. Python Unicode
    Character = Abstract Unicode character)

    I'm not sure what this means (what's an abstract unicode
    character?). If it mandates that u'\ud800\udc40' be
    treated as a length-2 string, that is the current case
    but says nothing about how unichr and ord
    should behave. If it mandates that that string must
    always be treated as two separate code points then
    Python itself violates it by printing that string as
    u'\U00010040' rather than u'\ud800\udc40'.

    Finally we read:

    * There is a convention in the Unicode world for
    encoding a 32-bit code point in terms of two
    16-bit code points. These are known as
    "surrogate pairs". Python's codecs will adopt
    this convention.

    Is a distinction made between Python and Python
    codecs with only the latter having any knowledge of
    surrogate pairs? I guess that would explain why
    Python prints a surrogate pair as a single character.
    But this seems arbitrary and counter-useful if
    applied to ord() and unichr(). What possible
    use-case is there for *not* recognizing surrogate
    pairs in those two functions?
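
    (The codec convention the PEP describes is easy to observe directly;
    shown here with Python 3's str, where the same UTF-16 codec applies:)

```python
# The UTF-16 codec spells a supplementary character as a surrogate pair:
# U+10040 encodes as the big-endian bytes of 0xD800 followed by 0xDC40.
data = '\U00010040'.encode('utf-16-be')
print(data.hex())  # d800dc40
```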

    Nothing else in the PEP seems remotely relevant.
     
    , Aug 28, 2009
    #12
  13. > The PEP says:
    > * unichr(i) for 0 <= i < 2**16 (0x10000) always returns a
    > length-one string.
    >
    > * unichr(i) for 2**16 <= i <= TOPCHAR will return a
    > length-one string on wide Python builds. On narrow
    > builds it will raise ValueError.
    > and
    > * ord() is always the inverse of unichr()
    >
    > which of course we know; that is the current behavior. But
    > there is no reason given for that behavior.


    Sure there is, right above the list:

    "Most things will behave identically in the wide and narrow worlds."

    That's the reason: scripts should work the same as much as possible
    in wide and narrow builds.

    What you propose would break the property "unichr(i) always returns
    a string of length one, if it returns anything at all".

    > 1) Should surrogate pairs be disallowed on narrow builds?
    > That appears to have been answered in the negative and is
    > not relevant to my question.


    It is, as it does lead to inconsistencies between wide and narrow
    builds. OTOH, it also allows the same source code to work on both
    versions, so it also preserves the uniformity in a different way.

    > * every Python Unicode character represents exactly
    > one Unicode code point (i.e. Python Unicode
    > Character = Abstract Unicode character)
    >
    > I'm not sure what this means (what's an abstract unicode
    > character?).


    I don't think this is actually the case, but I may be confusing
    Unicode terminology here - "abstract character" is a term from
    the Unicode standard.

    > Finally we read:
    >
    > * There is a convention in the Unicode world for
    > encoding a 32-bit code point in terms of two
    > 16-bit code points. These are known as
    > "surrogate pairs". Python's codecs will adopt
    > this convention.
    >
    > Is a distinction made between Python and Python
    > codecs with only the latter having any knowledge of
    > surrogate pairs?


    No. In the end, the Unicode type represents code units,
    not code points, i.e. half surrogates are individually
    addressable. Codecs need to adjust to that; in particular
    the UTF-8 and the UTF-32 codec in narrow builds, and the
    UTF-16 codec in wide builds (which didn't exist when the
    PEP was written).

    > Nothing else in the PEP seems remotely relevant.


    Except for the motivation, of course :)

    In addition: your original question was "why has this
    been changed", to which the answer is "it hasn't".
    Then, the next question is "why is it implemented that
    way", to which the answer is "because the PEP says so".
    Only *then* the question is "what is the rationale for
    the PEP specifying things the way it does". The PEP is
    relevant so that we can both agree that Python behaves
    correctly (in the sense of behaving as specified).

    Regards,
    Martin
     
    Martin v. Löwis, Aug 28, 2009
    #13
  14. Guest

    On 08/28/2009 02:12 AM, "Martin v. Löwis" wrote:

    [I reordered the quotes from your previous post to try
    and get the responses in a more coherent order. No
    intent to take anything out of context...]

    >> Nothing else in the PEP seems remotely relevant.

    [to providing justification for the behavior of
    unichr/ord]
    >
    > Except for the motivation, of course :)
    >
    > In addition: your original question was "why has this
    > been changed", to which the answer is "it hasn't".


    My original interest was two-fold: can unichr/ord be
    changed to work in a more general and helpful way? That
    seemed remotely possible until it was pointed out that
    the two behave consistently, and that behavior is accurately
    documented. Second, why would they work the way they do
    when they could have been generalized to cover the full
    unicode space? An inadequate answer to this would have
    provided support for the first point but remains interesting
    to me for the reason below.

    > Then, the next question is "why is it implemented that
    > way", to which the answer is "because the PEP says so".


    Not at all a satisfying answer unless one believes
    in PEPal infallibility. :)

    > Only *then* the question is "what is the rationale for
    > the PEP specifying things the way it does". The PEP is
    > relevant so that we can both agree that Python behaves
    > correctly (in the sense of behaving as specified).


    But my question had become: why that behavior, when a
    slightly different behavior would be more general with
    little apparent downside?

    To clarify, my interest in the justification for the
    current behavior is this:

    I think the best feature of python is not, as commonly
    stated, the clean syntax, but rather the pretty complete
    and orthogonal libraries. I often find, after I have
    written some code, that due to the right library functions
    being available, it turns out much shorter and concise
    than I expected.

    Nevertheless, every now and then, perhaps more than in some
    other languages (I'm not sure), I run into something that
    requires what seems to be excessive coding -- I have to
    do something that it seems to me a library function should
    have done for me. Sometimes this is because I don't
    understand the reason the library function needs to work the
    way it does. Other times it is one of the countless trade-offs
    made in the design of the language, which didn't happen
    to go the way that would have been beneficial to me in a
    particular coding situation.

    But sometimes (and it feels too often) it seems as though,
    zen notwithstanding, purity -- adherence to some
    philosophical ideal -- beat practicality.
    unichr/ord seems such a case to me, but I want to be
    sure I am not missing something.

    The reasons for the current behavior so far:

    1.
    > What you propose would break the property "unichr(i) always returns
    > a string of length one, if it returns anything at all".


    Yes. And I don't see the problem with that. Why is
    that property more desirable than the non-existent
    property that a Unicode literal always produces one
    python character? It would only occur on a narrow
    build with a unicode character outside of the BMP,
    exactly the condition under which a unicode literal can "behave
    differently" by producing two python characters.

    2.
    > > But there is no reason given [in the PEP] for that behavior.

    > Sure there is, right above the list:
    > "Most things will behave identically in the wide and narrow worlds."
    > That's the reason: scripts should work the same as much as possible
    > in wide and narrow builds.


    So what else would work "differently"? My point was
    that extending unichr/ord to work with all unicode
    characters reduces differences far more often than
    it increases them.

    3.
    >> * There is a convention in the Unicode world for
    >> encoding a 32-bit code point in terms of two
    >> 16-bit code points. These are known as
    >> "surrogate pairs". Python's codecs will adopt
    >> this convention.
    >>
    >> Is a distinction made between Python and Python
    >> codecs with only the latter having any knowledge of
    >> surrogate pairs?

    >
    > No. In the end, the Unicode type represents code units,
    > not code points, i.e. half surrogates are individually
    > addressable. Codecs need to adjust to that; in particular
    > the UTF-8 and the UTF-32 codec in narrow builds, and the
    > UTF-16 codec in wide builds (which didn't exist when the
    > PEP was written).


    OK, so that is not a reason either.

    4.
    I'll speculate a little.
    If surrogate handling was added to ord/unichr, it would
    be the top of a slippery slope leading to demands that
    other string functions also handle surrogates.

    But this is not true -- there is a strong distinction
    between ord/unichr and other string methods. The latter
    deal with strings of multiple characters. But the former
    deals only with single characters (taking a surrogate
    pair as a single unicode character.)

    The behavior of ord/unichr is independent of the other
    string methods -- if the string methods were changed with regard to
    surrogate handling they would all have to be changed together to
    maintain consistent behavior. unichr/ord affect only
    each other.

    The functions of ord/unichr -- to map characters to
    numbers -- are fundamental string operations, akin to
    indexing or extracting a substring. So why would
    one want to limit them to a subset of characters if
    not absolutely necessary?

    To reiterate, I am not advocating for any change. I
    simply want to understand if there is a good reason
    for limiting the use of unichr/ord on narrow builds to
    a subset of the unicode characters that Python otherwise
    supports. So far, it seems not, and unichr/ord
    seem to be a poster child for "purity beats practicality".
     
    , Aug 29, 2009
    #14
  15. On Sat, 29 Aug 2009 07:38:51 -0700, rurpy wrote:

    > > Then, the next question is "why is it implemented that way", to which
    > > the answer is "because the PEP says so".

    >
    > Not at all a satisfying answer unless one believes in PEPal
    > infallibility. :)


    Not at all. You don't have to believe that PEPs are infallible to accept
    the answer, you just have to understand that major changes to Python
    aren't made arbitrarily, they have to go through a PEP first. Even Guido
    himself has to write a PEP before making any major changes to the
    language. But PEPs aren't infallible, they can be challenged, rejected,
    withdrawn or made obsolete by new PEPs.


    > The reasons for the current behavior so far:
    >
    > 1.
    >> What you propose would break the property "unichr(i) always returns a
    >> string of length one, if it returns anything at all".

    >
    > Yes. And i don't see the problem with that. Why is that property more
    > desirable than the non-existent property that a Unicode literal always
    > produces one python character?


    What do you mean? Unicode literals don't always produce one character,
    e.g. u'abcd' is a Unicode literal with four characters.

    I think it's fairly self-evident that a function called uniCHR [emphasis
    added] should return a single character (technically a single code
    point). But even if you can come up with a reason for unichr() to return
    two or more characters, this would break code that relies on the
    documented promise that the length of the output of unichr() is always
    one.

    > It would only occur on a narrow build
    > with a unicode character outside of the bmp, exactly the condition a
    > unicode literal can "behave differently" by producing two python
    > characters.



    > 2.
    >> > But there is no reason given [in the PEP] for that behavior.

    >> Sure there is, right above the list:
    >> "Most things will behave identically in the wide and narrow worlds."
    >> That's the reason: scripts should work the same as much as possible in
    >> wide and narrow builds.

    >
    > So what else would work "differently"?


    unichr(n) sometimes would return one character and sometimes two; ord(c)
    would sometimes accept two characters and sometimes raise an exception.
    That's a fairly major difference.


    > My point was that extending
    > unichr/ord to work with all unicode characters reduces differences far
    > more often than it increase them.


    I don't see that at all. What differences do you think it would reduce?


    > 3.
    >>> * There is a convention in the Unicode world for
    >>> encoding a 32-bit code point in terms of two 16-bit code
    >>> points. These are known as "surrogate pairs". Python's codecs
    >>> will adopt this convention.
    >>>
    >>> Is a distinction made between Python and Python codecs with only the
    >>> latter having any knowledge of surrogate pairs?

    >>
    >> No. In the end, the Unicode type represents code units, not code
    >> points, i.e. half surrogates are individually addressable. Codecs need
    >> to adjust to that; in particular the UTF-8 and the UTF-32 codec in
    >> narrow builds, and the UTF-16 codec in wide builds (which didn't exist
    >> when the PEP was written).

    >
    > OK, so that is not a reason either.


    I think it is a very important reason. Python supports code points, so it
    has to support surrogate codes individually. Python can't tell if the
    pair of code points u'\ud800\udc40' represents the single character
    \U00010040 or a pair of code points \ud800 and \udc40.


    > 4.
    > I'll speculate a little.
    > If surrogate handling was added to ord/unichr, it would be the top of a
    > slippery slope leading to demands that other string functions also
    > handle surrogates.
    >
    > But this is not true -- there is a strong distinction between ord/unichr
    > and other string methods. The latter deal with strings of multiple
    > characters. But the former deals only with single characters (taking a
    > surrogate pair as a single unicode character.)


    Strictly speaking, unichr() deals with code points, not characters,
    although the distinction is very fine.

    >>> c = unichr(56384)
    >>> len(c)

    1
    >>> import unicodedata
    >>> unicodedata.category(c)

    'Cs'

    Cs is the general category for "Other, Surrogate", so \udc40 is not
    strictly speaking a character. Nevertheless, Python treats it as one.


    > To reiterate, I am not advocating for any change. I simply want to
    > understand if there is a good reason for limiting the use of unchr/ord
    > on narrow builds to a subset of the unicode characters that Python
    > otherwise supports. So far, it seems not and that unichr/ord is a
    > poster child for "purity beats practicality".


    On the contrary, it seems pretty impractical to me for ord() to sometimes
    successfully accept strings of length two and sometimes to raise an
    exception. I would much rather see a pair of new functions, wideord() and
    widechr() used for converting between surrogate pairs and numbers.
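    A minimal sketch of what such a pair could look like (Python 3 syntax;
    the names wideord()/widechr() are the suggestion above, while the
    bodies are this sketch's assumption, not an existing API), using the
    standard UTF-16 surrogate arithmetic:

```python
def widechr(i):
    """Like chr()/unichr(), but return a surrogate pair above the BMP."""
    if i > 0xFFFF:
        i -= 0x10000
        return chr(0xD800 + (i >> 10)) + chr(0xDC00 + (i & 0x3FF))
    return chr(i)

def wideord(s):
    """Like ord(), but also accept a two-character surrogate pair."""
    if (len(s) == 2 and '\ud800' <= s[0] <= '\udbff'
            and '\udc00' <= s[1] <= '\udfff'):
        return 0x10000 + ((ord(s[0]) - 0xD800) << 10) + (ord(s[1]) - 0xDC00)
    return ord(s)

print(widechr(0x10040) == '\ud800\udc40')   # True
print(hex(wideord('\ud800\udc40')))         # 0x10040
```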



    --
    Steven
     
    Steven D'Aprano, Aug 29, 2009
    #15
  16. 2009/8/29 <>:
    > On 08/28/2009 02:12 AM, "Martin v. Löwis" wrote:
    >
    > So far, it seems not and that unichr/ord
    > is a poster child for "purity beats practicality".
    > --
    > http://mail.python.org/mailman/listinfo/python-list
    >


    As Mark Tolonen pointed out earlier in this thread, in Python 3 the
    practicality apparently beat purity in this aspect:

    Python 3.1.1 (r311:74483, Aug 17 2009, 17:02:12) [MSC v.1500 32 bit
    (Intel)] on win32
    Type "copyright", "credits" or "license()" for more information.

    >>> goth_urus_1 = '\U0001033f'
    >>> list(goth_urus_1)
    ['\ud800', '\udf3f']
    >>> len(goth_urus_1)
    2
    >>> ord(goth_urus_1)
    66367
    >>> goth_urus_2 = chr(66367)
    >>> len(goth_urus_2)
    2
    >>> import unicodedata
    >>> unicodedata.name(goth_urus_1)
    'GOTHIC LETTER URUS'
    >>> goth_urus_3 = unicodedata.lookup("GOTHIC LETTER URUS")
    >>> goth_urus_4 = "\N{GOTHIC LETTER URUS}"
    >>> goth_urus_1 == goth_urus_2 == goth_urus_3 == goth_urus_4
    True
    >>>


    As for the behaviour in python 2.x, it's probably good enough, that
    the surrogates aren't prohibited and the eventually needed behaviour
    can be easily added via custom functions.
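    One such custom function might look like this (sketched in Python 3
    syntax; the name as_surrogate_pair() is illustrative only, not an
    existing API), letting the utf-16 codec do the splitting:

```python
def as_surrogate_pair(codepoint):
    # Split a supplementary code point into the UTF-16 code units a
    # narrow build would store for it (a single unit for BMP characters).
    data = chr(codepoint).encode('utf-16-le')
    return [chr(int.from_bytes(data[i:i + 2], 'little'))
            for i in range(0, len(data), 2)]

print(as_surrogate_pair(66367) == ['\ud800', '\udf3f'])  # True
```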

    vbr
     
    Vlastimil Brom, Aug 29, 2009
    #16
  17. Guest

    On 08/29/2009 12:06 PM, Steven D'Aprano wrote:
    [...]
    >> The reasons for the current behavior so far:
    >>
    >> 1.
    >>> What you propose would break the property "unichr(i) always returns a
    >>> string of length one, if it returns anything at all".

    >>
    >> Yes. And I don't see the problem with that. Why is that property more
    >> desirable than the non-existent property that a Unicode literal always
    >> produces one python character?

    >
    > What do you mean? Unicode literals don't always produce one character,
    > e.g. u'abcd' is a Unicode literal with four characters.


    I'm sorry, I should have been clearer. I meant the literal
    representation of a *single* unicode character: u'\u4000',
    which results in a string of length 1, vs u'\U00010040', which
    results in a string of length 2. In both cases the literal
    represents a single unicode code point.

    > I think it's fairly self-evident that a function called uniCHR [emphasis
    > added] should return a single character (technically a single code
    > point).


    There are two concepts of characters here: the 16-bit things
    that encode a character in a Python unicode string (in a
    narrow-build Python), and a character in the sense of one
    of the ~2**20 unicode code points. Python has chosen to
    represent the latter (when outside the BMP) as a pair of
    surrogate characters from the former. I don't see why one
    would assume that CHR would mean the python 16-bit
    character concept rather than the full unicode character
    concept. In fact, rather the opposite.

    > But even if you can come up with a reason for unichr() to return
    > two or more characters,


    I've given a number of reasons why it should return a two
    character representation of a non-BMP character, one of
    which is that that is how Python has chosen to represent
    such characters internally. I won't repeat the other
    reasons again.

    I'm not sure why you think more than two characters
    would ever be possible.

    > this would break code that relies on the
    > documented promise that the length of the output of unichr() is always
    > one.


    Ah, OK. This is the good reason I was looking for.
    I did not realize (until prompted by your remark
    to go back and look at the early docs) that unichr
    had been documented to return a single character
    since 2.0 and that wide character support was added
    in 2.2. Martin v. Loewis also implied that, I now
    see, although the implication was too deep for me
    to pick up.

    So although it leads to a suboptimal situation, I
    agree that maintaining the documented behavior was
    necessary.

    [...]
    > I would much rather see a pair of new functions, wideord() and
    > widechr() used for converting between surrogate pairs and numbers.


    I guess if it were still 2001 and Python 2.2 was
    coming out I would be in favor of this too. :)
     
    , Aug 30, 2009
    #17
  18. Guest

    On 08/29/2009 01:43 PM, Vlastimil Brom wrote:
    > 2009/8/29 <>:
    >> On 08/28/2009 02:12 AM, "Martin v. Löwis" wrote:
    >>
    >> So far, it seems not and that unichr/ord
    >> is a poster child for "purity beats practicality".
    >> --
    >> http://mail.python.org/mailman/listinfo/python-list
    >>
    >
    > As Mark Tolonen pointed out earlier in this thread, in Python 3 the
    > practicality apparently beat purity in this aspect:
    >
    > Python 3.1.1 (r311:74483, Aug 17 2009, 17:02:12) [MSC v.1500 32 bit
    > (Intel)] on win32
    > Type "copyright", "credits" or "license()" for more information.
    >
    > >>> goth_urus_1 = '\U0001033f'
    > >>> list(goth_urus_1)
    > ['\ud800', '\udf3f']
    > >>> len(goth_urus_1)
    > 2
    > >>> ord(goth_urus_1)
    > 66367
    > >>> goth_urus_2 = chr(66367)
    > >>> len(goth_urus_2)
    > 2
    > >>> import unicodedata
    > >>> unicodedata.name(goth_urus_1)
    > 'GOTHIC LETTER URUS'
    > >>> goth_urus_3 = unicodedata.lookup("GOTHIC LETTER URUS")
    > >>> goth_urus_4 = "\N{GOTHIC LETTER URUS}"
    > >>> goth_urus_1 == goth_urus_2 == goth_urus_3 == goth_urus_4
    > True
    > >>>


    Yes, that certainly seems like much more sensible behavior.

    > > As for the behaviour in python 2.x, it's probably good enough, that
    > > the surrogates aren't prohibited and the eventually needed behaviour
    > > can be easily added via custom functions.


    Yes, I agree that given the current behavior is well documented
    and further, is fixed in python 3, it can't be changed.

    I would pick a nit though with "can be easily added via custom
    functions."
    I don't think that is a good criterion for rejecting functionality
    from the library, because it is not sufficient; there are many
    functions in the library that fail that test. I think the criterion
    should be more like a ratio: (how often needed) / (ease of writing).
    [where "ease" is not just the line count but also the obviousness
    to someone who is not a python expert yet.]
    And I would also dispute that the generalized unichr/ord functions
    are "easily" added. When I ran into the TypeError in ord(), I
    thought "surrogate pairs" were something used in sex therapy. :)
    It took a lot of reading and research before I was able to write
    a generalized ord() function.
     
    , Aug 30, 2009
    #18
  19. "Martin v. Löwis" <> writes on Fri, 28 Aug 2009 10:12:34 +0200:
    >> The PEP says:
    >> * unichr(i) for 0 <= i < 2**16 (0x10000) always returns a
    >> length-one string.
    >>
    >> * unichr(i) for 2**16 <= i <= TOPCHAR will return a
    >> length-one string on wide Python builds. On narrow
    >> builds it will raise ValueError.
    >> and
    >> * ord() is always the inverse of unichr()
    >>
    >> which of course we know; that is the current behavior. But
    >> there is no reason given for that behavior.

    >
    > Sure there is, right above the list:
    >
    > "Most things will behave identically in the wide and narrow worlds."
    >
    > That's the reason: scripts should work the same as much as possible
    > in wide and narrow builds.
    >
    > What you propose would break the property "unichr(i) always returns
    > a string of length one, if it returns anything at all".


    But getting a "ValueError" in some builds (and not in others)
    is rather worse than getting unicode strings of different length....

    >> 1) Should surrogate pairs be disallowed on narrow builds?
    >> That appears to have been answered in the negative and is
    >> not relevant to my question.

    >
    > It is, as it does lead to inconsistencies between wide and narrow
    > builds. OTOH, it also allows the same source code to work on both
    > versions, so it also preserves the uniformity in a different way.


    Do you not have the inconsistencies in any case?
    .... "ValueError" in some builds and not in others ...
     
    Dieter Maurer, Aug 30, 2009
    #19
  20. > To reiterate, I am not advocating for any change. I
    > simply want to understand if there is a good reason
    > for limiting the use of unchr/ord on narrow builds to
    > a subset of the unicode characters that Python otherwise
    > supports. So far, it seems not and that unichr/ord
    > is a poster child for "purity beats practicality".


    I think that's actually the case. I went back to the discussions,
    and found that early 2.2 alpha releases did return two-character
    strings from unichr, and that this was changed because Marc-Andre
    Lemburg insisted. Here are a few relevant messages from the
    archives (search for unichr)

    http://mail.python.org/pipermail/python-dev/2001-June/015649.html
    http://mail.python.org/pipermail/python-dev/2001-July/015662.html
    http://mail.python.org/pipermail/python-dev/2001-July/016110.html
    http://mail.python.org/pipermail/python-dev/2001-July/016153.html
    http://mail.python.org/pipermail/python-dev/2001-July/016155.html
    http://mail.python.org/pipermail/python-dev/2001-July/016186.html

    Eventually, in r28142, MAL changed it to give it its current
    state.

    Regards,
    Martin
     
    Martin v. Löwis, Aug 30, 2009
    #20