break unichr instead of fix ord?

Discussion in 'Python' started by rurpy@yahoo.com, Aug 25, 2009.

  1. Guest

    In Python 2.5 on Windows I could do [*1]:

    # Create a unicode character outside of the BMP.
    >>> a = u'\U00010040'


    # On Windows it is represented as a surrogate pair.
    >>> len(a)

    2
    >>> a[0],a[1]

    (u'\ud800', u'\udc40')

    # Create the same character with the unichr() function.
    >>> a = unichr (65600)
    >>> a[0],a[1]

    (u'\ud800', u'\udc40')

    # Although the unichr() function works fine, its
    # inverse, ord(), doesn't.
    >>> ord (a)

    TypeError: ord() expected a character, but string of length 2 found

    On Python 2.6, unichr() was "fixed" (using the word
    loosely) so that it too now fails with characters outside
    the BMP.

    >>> a = unichr (65600)

    ValueError: unichr() arg not in range(0x10000) (narrow Python build)

    Why was this done rather than changing ord() to accept a
    surrogate pair?

    Does not this effectively make unichr() and ord() useless
    on Windows for all but a subset of unicode characters?
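
    For reference, the pair shown above is the standard UTF-16 split of a
    supplementary code point; a minimal sketch of the arithmetic (written in
    Python 3 syntax so it runs on any build; the helper name is illustrative):

```python
def to_surrogate_pair(cp):
    # Split a supplementary code point (>= 0x10000) into a UTF-16
    # high/low surrogate pair.
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low

print([hex(u) for u in to_surrogate_pair(0x10040)])  # ['0xd800', '0xdc40']
```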
     
    , Aug 25, 2009
    #1

  2. On 25-08-2009 at 21:45:49 <> wrote:

    > In Python 2.5 on Windows I could do [*1]:
    >
    > # Create a unicode character outside of the BMP.
    > >>> a = u'\U00010040'

    >
    > # On Windows it is represented as a surrogate pair.

    [snip]
    > On Python 2.6, unichr() was "fixed" (using the word
    > loosely) so that it too now fails with characters outside
    > the BMP.

    [snip]
    > Does not this effectively make unichr() and ord() useless
    > on Windows for all but a subset of unicode characters?


    Are you sure you couldn't have a UCS-4-compiled Python distro
    for Windows? :-O

    *j

    --
    Jan Kaliszewski (zuo) <>
     
    Jan Kaliszewski, Aug 26, 2009
    #2

  3. Mark Tolonen Guest

    <> wrote in message
    news:...
    > In Python 2.5 on Windows I could do [*1]:
    >
    > # Create a unicode character outside of the BMP.
    > >>> a = u'\U00010040'

    >
    > # On Windows it is represented as a surrogate pair.
    > >>> len(a)

    > 2
    > >>> a[0],a[1]

    > (u'\ud800', u'\udc40')
    >
    > # Create the same character with the unichr() function.
    > >>> a = unichr (65600)
    > >>> a[0],a[1]

    > (u'\ud800', u'\udc40')
    >
    > # Although the unichr() function works fine, its
    > # inverse, ord(), doesn't.
    > >>> ord (a)

    > TypeError: ord() expected a character, but string of length 2 found
    >
    > On Python 2.6, unichr() was "fixed" (using the word
    > loosely) so that it too now fails with characters outside
    > the BMP.
    >
    > >>> a = unichr (65600)

    > ValueError: unichr() arg not in range(0x10000) (narrow Python build)
    >
    > Why was this done rather than changing ord() to accept a
    > surrogate pair?
    >
    > Does not this effectively make unichr() and ord() useless
    > on Windows for all but a subset of unicode characters?


    Switch to Python 3?

    >>> x='\U00010040'
    >>> import unicodedata
    >>> unicodedata.name(x)

    'LINEAR B SYLLABLE B025 A2'
    >>> ord(x)

    65600
    >>> hex(ord(x))

    '0x10040'
    >>> unicodedata.name(chr(0x10040))

    'LINEAR B SYLLABLE B025 A2'
    >>> ord(chr(0x10040))

    65600
    >>> print(ascii(chr(0x10040)))

    '\ud800\udc40'

    -Mark
     
    Mark Tolonen, Aug 26, 2009
    #3
  4. 2009/8/25 <>:
    > In Python 2.5 on Windows I could do [*1]:
    >
    >  # Create a unicode character outside of the BMP.
    >  >>> a = u'\U00010040'
    >
    >  # On Windows it is represented as a surrogate pair.
    >  >>> len(a)
    >  2
    >  >>> a[0],a[1]
    >  (u'\ud800', u'\udc40')
    >
    >  # Create the same character with the unichr() function.
    >  >>> a = unichr (65600)
    >  >>> a[0],a[1]
    >  (u'\ud800', u'\udc40')
    >
    >  # Although the unichr() function works fine, its
    >  # inverse, ord(), doesn't.
    >  >>> ord (a)
    >  TypeError: ord() expected a character, but string of length 2 found
    >
    > On Python 2.6, unichr() was "fixed" (using the word
    > loosely) so that it too now fails with characters outside
    > the BMP.
    >
    >  >>> a = unichr (65600)
    >  ValueError: unichr() arg not in range(0x10000) (narrow Python build)
    >
    > Why was this done rather than changing ord() to accept a
    > surrogate pair?
    >
    > Does not this effectively make unichr() and ord() useless
    > on Windows for all but a subset of unicode characters?
    > --
    > http://mail.python.org/mailman/listinfo/python-list
    >


    Hi,
    I'm not sure about the exact reasons for this behaviour on narrow
    builds either (maybe to keep the input/output data consistently at
    exactly one character?).

    However, if I need these functions for higher unicode planes, the
    following rather hackish replacements seem to work. I presume, there
    might be smarter ways of dealing with this, but anyway...

    hth,
    vbr

    #### not (systematically) tested #####################################

    import sys

    def wide_ord(char):
        try:
            return ord(char)
        except TypeError:
            if (len(char) == 2 and 0xD800 <= ord(char[0]) <= 0xDBFF
                    and 0xDC00 <= ord(char[1]) <= 0xDFFF):
                return ((ord(char[0]) - 0xD800) * 0x400
                        + (ord(char[1]) - 0xDC00) + 0x10000)
            else:
                raise TypeError("invalid character input")


    def wide_unichr(i):
        if i <= sys.maxunicode:
            return unichr(i)
        else:
            return ("\U" + hex(i)[2:].zfill(8)).decode("unicode-escape")
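
    A quick cross-check of the pairing arithmetic used in wide_ord() above,
    written for Python 3, where chr()/ord() cover the full range on every
    build (the function name is just for illustration):

```python
def from_surrogate_pair(high, low):
    # Combine a UTF-16 high/low surrogate pair back into a code point,
    # using the same arithmetic as wide_ord() above.
    assert 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF
    return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000

print(hex(from_surrogate_pair(0xD800, 0xDC40)))  # 0x10040
```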
     
    Vlastimil Brom, Aug 26, 2009
    #4
  5. > In Python 2.5 on Windows I could do [*1]:
    >
    > >>> a = unichr (65600)
    > >>> a[0],a[1]

    > (u'\ud800', u'\udc40')


    I can't reproduce that. My copy of Python on Windows gives

    Traceback (most recent call last):
      File "<pyshell#0>", line 1, in <module>
        unichr(65600)
    ValueError: unichr() arg not in range(0x10000) (narrow Python build)

    This is

    Python 2.5.4 (r254:67916, Dec 23 2008, 15:10:54) [MSC v.1310 32 bit
    (Intel)] on win32

    Regards,
    Martin
     
    Martin v. Löwis, Aug 26, 2009
    #5
  6. Guest

    On 08/26/2009 03:10 PM, "Martin v. Löwis" wrote:
    >> >> In Python 2.5 on Windows I could do [*1]:
    >> >>
    >> >> >>> a = unichr (65600)
    >> >> >>> a[0],a[1]
    >> >> (u'\ud800', u'\udc40')

    > >
    > > I can't reproduce that. My copy of Python on Windows gives
    > >
    > > Traceback (most recent call last):
    > >   File "<pyshell#0>", line 1, in <module>
    > >     unichr(65600)
    > > ValueError: unichr() arg not in range(0x10000) (narrow Python build)
    > >
    > > This is
    > >
    > > Python 2.5.4 (r254:67916, Dec 23 2008, 15:10:54) [MSC v.1310 32 bit
    > > (Intel)] on win32


    My apologies for the red herring. I was working from
    a comment in my replacement ord() function. I dug up
    an old copy of Python 2.4.3 and could not reproduce it
    there either so I have no explanation for the comment
    (which I wrote). Python 2.3 maybe?

    But regardless, the significant question is, what is
    the reason for having ord() (and unichr) not work for
    surrogate pairs and thus not usable with a large number
    of unicode characters that Python otherwise supports?
     
    , Aug 27, 2009
    #6
  7. Guest

    On Aug 25, 9:53 pm, "Mark Tolonen" <> wrote:
    > <> wrote in message
    >
    > news:...
    >
    >
    >
    > > In Python 2.5 on Windows I could do [*1]:

    >
    > >  # Create a unicode character outside of the BMP.
    > >  >>> a = u'\U00010040'

    >
    > >  # On Windows it is represented as a surrogate pair.
    > >  >>> len(a)
    > >  2
    > >  >>> a[0],a[1]
    > >  (u'\ud800', u'\udc40')

    >
    > >  # Create the same character with the unichr() function.
    > >  >>> a = unichr (65600)
    > >  >>> a[0],a[1]
    > >  (u'\ud800', u'\udc40')

    >
    > >  # Although the unichr() function works fine, its
    > >  # inverse, ord(), doesn't.
    > >  >>> ord (a)
    > >  TypeError: ord() expected a character, but string of length 2 found

    >
    > > On Python 2.6, unichr() was "fixed" (using the word
    > > loosely) so that it too now fails with characters outside
    > > the BMP.

    >
    > >  >>> a = unichr (65600)
    > >  ValueError: unichr() arg not in range(0x10000) (narrow Python build)

    >
    > > Why was this done rather than changing ord() to accept a
    > > surrogate pair?

    >
    > > Does not this effectively make unichr() and ord() useless
    > > on Windows for all but a subset of unicode characters?

    >
    > Switch to Python 3?
    >
    > >>> x='\U00010040'
    > >>> import unicodedata
    > >>> unicodedata.name(x)

    >
    > 'LINEAR B SYLLABLE B025 A2'
    > >>> ord(x)
    > 65600
    > >>> hex(ord(x))

    > '0x10040'
    > >>> unicodedata.name(chr(0x10040))

    >
    > 'LINEAR B SYLLABLE B025 A2'
    > >>> ord(chr(0x10040))
    > 65600
    > >>> print(ascii(chr(0x10040)))

    >
    > '\ud800\udc40'
    >
    > -Mark


    I am still a long way away from moving to Python 3
    but I am looking forward to hopefully more rational
    unicode handling there. Thanks for the info.
     
    , Aug 27, 2009
    #7
  8. Guest

    On Aug 26, 2:05 am, Vlastimil Brom <> wrote:
    >[...]
    > Hi,
    > I'm not sure about the exact reasons for this behaviour on narrow
    > builds either (maybe to keep the input/output data consistently at
    > exactly one character?).
    >
    > However, if I need these functions for higher unicode planes, the
    > following rather hackish replacements seem to work. I presume, there
    > might be smarter ways of dealing with this, but anyway...
    >
    > hth,
    >    vbr
    >
    >[...code snipped...]


    Thanks, I wrote a replacement ord function nearly identical
    to yours but will steal your unichr function if that's ok. :)

    But I still wonder why all this is necessary.
     
    , Aug 27, 2009
    #8
  9. On Wed, 26 Aug 2009 16:27:33 -0700, rurpy wrote:

    > But regardless, the significant question is, what is the reason for
    > having ord() (and unichr) not work for surrogate pairs and thus not
    > usable with a large number of unicode characters that Python otherwise
    > supports?



    I'm no expert on Unicode, but my guess is that the reason is out of a
    desire for simplicity: unichr() should always return a single char, not a
    pair of chars, and similarly ord() should take as input a single char,
    not two, and return a single number.

    Otherwise it would be ambiguous whether ord(surrogate_pair) should return
    a pair of ints representing the codes for each item in the pair, or a
    single int representing the code point for the whole pair.

    E.g. given your earlier example:

    >>> a = u'\U00010040'
    >>> len(a)

    2
    >>> a[0]

    u'\ud800'
    >>> a[1]

    u'\udc40'

    would you expect ord(a) to return (0xd800, 0xdc40) or 0x10040? If the
    latter, what about ord(u'ab')?

    Remember that a unicode string can contain code points that aren't valid
    characters:

    >>> ord(u'\ud800') # reserved for surrogates, not a character

    55296

    so if ord() sees a surrogate pair, it can't assume it's meant to be
    treated as a surrogate pair rather than a pair of code points that just
    happens to match a surrogate pair.

    None of this means you can't deal with surrogate pairs, it just means you
    can't deal with them using ord() and unichr().

    The above is just my guess, I'd be interested to hear what others say.


    --
    Steven
     
    Steven D'Aprano, Aug 27, 2009
    #9
  10. Guest

    On 08/26/2009 08:52 PM, Steven D'Aprano wrote:
    > On Wed, 26 Aug 2009 16:27:33 -0700, rurpy wrote:
    >
    >> But regardless, the significant question is, what is the reason for
    >> having ord() (and unichr) not work for surrogate pairs and thus not
    >> usable with a large number of unicode characters that Python otherwise
    >> supports?

    >
    >
    > I'm no expert on Unicode, but my guess is that the reason is out of a
    > desire for simplicity: unichr() should always return a single char, not a
    > pair of chars, and similarly ord() should take as input a single char,
    > not two, and return a single number.
    >
    > Otherwise it would be ambiguous whether ord(surrogate_pair) should return
    > a pair of ints representing the codes for each item in the pair, or a
    > single int representing the code point for the whole pair.
    >
    > E.g. given your earlier example:
    >
    >>>> a = u'\U00010040'
    >>>> len(a)

    > 2
    >>>> a[0]

    > u'\ud800'
    >>>> a[1]

    > u'\udc40'
    >
    > would you expect ord(a) to return (0xd800, 0xdc40) or 0x10040?


    The latter.

    > If the
    > latter, what about ord(u'ab')?


    I would expect a TypeError* (as ord() currently raises) because
    the string length is not 1 and 'ab' is not a surrogate pair.

    *Actually I would have expected ValueError but I'm not going
    to lose sleep over it.

    > Remember that a unicode string can contain code points that aren't valid
    > characters:
    >
    >>>> ord(u'\ud800') # reserved for surrogates, not a character

    > 55296
    >
    > so if ord() sees a surrogate pair, it can't assume it's meant to be
    > treated as a surrogate pair rather than a pair of code points that just
    > happens to match a surrogate pair.


    Well, actually, yes it can. :)

    Python has already made a strong statement that such a pair
    is the representation of a character:

    >>> a = ''.join([u'\ud800',u'\udc40'])
    >>> a

    u'\U00010040'

    That is, Python prints, and treats in nearly all other contexts,
    that combination as a character.

    This is related to the practicality argument: what is the ratio
    of the need to treat a surrogate pair as a character, consistent
    with the rest of Python, vs the need to treat it as a string
    of two separate (and, in the unicode sense, invalid?) characters?

    And if you want to treat each half of the pair separately
    it's not exactly hard: ord(a[0]), ord(a[1]).

    > None of this means you can't deal with surrogate pairs, it just means you
    > can't deal with them using ord() and unichr().


    Kind of like saying, it doesn't mean you can't deal
    with integers larger than 2**32, you just can't multiply
    and divide them.

    > The above is just my guess, I'd be interested to hear what others say.
     
    , Aug 27, 2009
    #10
  11. > My apologies for the red herring. I was working from
    > a comment in my replacement ord() function. I dug up
    > an old copy of Python 2.4.3 and could not reproduce it
    > there either so I have no explanation for the comment
    > (which I wrote). Python 2.3 maybe?


    No. The behavior you observed would only happen on
    a wide Unicode build (e.g. on Unix).

    > But regardless, the significant question is, what is
    > the reason for having ord() (and unichr) not work for
    > surrogate pairs and thus not usable with a large number
    > of unicode characters that Python otherwise supports?


    See PEP 261, http://www.python.org/dev/peps/pep-0261/
    It specifies all this.

    Regards,
    Martin
     
    Martin v. Löwis, Aug 27, 2009
    #11
  12. Guest

    On 08/26/2009 11:51 PM, "Martin v. Löwis" wrote:
    >[...]
    >> But regardless, the significant question is, what is
    >> the reason for having ord() (and unichr) not work for
    >> surrogate pairs and thus not usable with a large number
    >> of unicode characters that Python otherwise supports?

    >
    > See PEP 261, http://www.python.org/dev/peps/pep-0261/
    > It specifies all this.


    The PEP (AFAICT) says only what we already know... that
    on narrow builds unichr() will raise an exception with
    an argument >= 0x10000, and ord() is unichr()'s inverse.

    I have read the PEP twice now and still see no justification
    for that decision, it appears to have been made by fiat.[*1]

    Could you or someone please point me to specific justification
    for having unichr and ord work only for a subset of unicode
    characters on narrow builds, as opposed to the more general
    and IMO useful behavior proposed earlier in this thread?

    ----------------------------------------------------------
    [*1]
    The PEP says:
    * unichr(i) for 0 <= i < 2**16 (0x10000) always returns a
    length-one string.

    * unichr(i) for 2**16 <= i <= TOPCHAR will return a
    length-one string on wide Python builds. On narrow
    builds it will raise ValueError.
    and
    * ord() is always the inverse of unichr()

    which of course we know; that is the current behavior. But
    there is no reason given for that behavior.

    Under the second *unicode bullet point, there are two issues
    raised:
    1) Should surrogate pairs be disallowed on narrow builds?
    That appears to have been answered in the negative and is
    not relevant to my question.
    2) Should access to code points above TOPCHAR be allowed?
    Not relevant to my question.

    * every Python Unicode character represents exactly
    one Unicode code point (i.e. Python Unicode
    Character = Abstract Unicode character)

    I'm not sure what this means (what's an abstract unicode
    character?). If it mandates that u'\ud800\udc40' be
    treated as a length-2 string, that is the current case
    but says nothing about how unichr and ord
    should behave. If it mandates that that string must
    always be treated as two separate code points then
    Python itself violates it by printing that string as
    u'\U00010040' rather than u'\ud800\udc40'.

    Finally we read:

    * There is a convention in the Unicode world for
    encoding a 32-bit code point in terms of two
    16-bit code points. These are known as
    "surrogate pairs". Python's codecs will adopt
    this convention.

    Is a distinction made between Python and Python
    codecs with only the latter having any knowledge of
    surrogate pairs? I guess that would explain why
    Python prints a surrogate pair as a single character.
    But this seems arbitrary and counter-useful if
    applied to ord() and unichr(). What possible
    use-case is there for *not* recognizing surrogate
    pairs in those two functions?
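
    (The codec convention the PEP describes is easy to observe directly;
    shown here with Python 3's str, where the same UTF-16 codec applies:)

```python
# The UTF-16 codec spells a supplementary character as a surrogate pair:
# U+10040 encodes as the big-endian bytes of 0xD800 followed by 0xDC40.
data = '\U00010040'.encode('utf-16-be')
print(data.hex())  # d800dc40
```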

    Nothing else in the PEP seems remotely relevant.
     
    , Aug 28, 2009
    #12
  13. > The PEP says:
    > * unichr(i) for 0 <= i < 2**16 (0x10000) always returns a
    > length-one string.
    >
    > * unichr(i) for 2**16 <= i <= TOPCHAR will return a
    > length-one string on wide Python builds. On narrow
    > builds it will raise ValueError.
    > and
    > * ord() is always the inverse of unichr()
    >
    > which of course we know; that is the current behavior. But
    > there is no reason given for that behavior.


    Sure there is, right above the list:

    "Most things will behave identically in the wide and narrow worlds."

    That's the reason: scripts should work the same as much as possible
    in wide and narrow builds.

    What you propose would break the property "unichr(i) always returns
    a string of length one, if it returns anything at all".

    > 1) Should surrogate pairs be disallowed on narrow builds?
    > That appears to have been answered in the negative and is
    > not relevant to my question.


    It is, as it does lead to inconsistencies between wide and narrow
    builds. OTOH, it also allows the same source code to work on both
    versions, so it also preserves the uniformity in a different way.

    > * every Python Unicode character represents exactly
    > one Unicode code point (i.e. Python Unicode
    > Character = Abstract Unicode character)
    >
    > I'm not sure what this means (what's an abstract unicode
    > character?).


    I don't think this is actually the case, but I may be confusing
    Unicode terminology here - "abstract character" is a term from
    the Unicode standard.

    > Finally we read:
    >
    > * There is a convention in the Unicode world for
    > encoding a 32-bit code point in terms of two
    > 16-bit code points. These are known as
    > "surrogate pairs". Python's codecs will adopt
    > this convention.
    >
    > Is a distinction made between Python and Python
    > codecs with only the latter having any knowledge of
    > surrogate pairs?


    No. In the end, the Unicode type represents code units,
    not code points, i.e. half surrogates are individually
    addressable. Codecs need to adjust to that; in particular
    the UTF-8 and the UTF-32 codec in narrow builds, and the
    UTF-16 codec in wide builds (which didn't exist when the
    PEP was written).

    > Nothing else in the PEP seems remotely relevant.


    Except for the motivation, of course :)

    In addition: your original question was "why has this
    been changed", to which the answer is "it hasn't".
    Then, the next question is "why is it implemented that
    way", to which the answer is "because the PEP says so".
    Only *then* the question is "what is the rationale for
    the PEP specifying things the way it does". The PEP is
    relevant so that we can both agree that Python behaves
    correctly (in the sense of behaving as specified).

    Regards,
    Martin
     
    Martin v. Löwis, Aug 28, 2009
    #13
  14. Guest

    On 08/28/2009 02:12 AM, "Martin v. Löwis" wrote:

    [I reordered the quotes from your previous post to try
    and get the responses in a more coherent order. No
    intent to take anything out of context...]

    >> Nothing else in the PEP seems remotely relevant.

    [to providing justification for the behavior of
    unichr/ord]
    >
    > Except for the motivation, of course :)
    >
    > In addition: your original question was "why has this
    > been changed", to which the answer is "it hasn't".


    My original interest was two-fold: can unichr/ord be
    changed to work in a more general and helpful way? That
    seemed remotely possible until it was pointed out that
    the two behave consistently, and that behavior is accurately
    documented. Second, why would they work the way they do
    when they could have been generalized to cover the full
    unicode space? An inadequate answer to this would have
    provided support for the first point but remains interesting
    to me for the reason below.

    > Then, the next question is "why is it implemented that
    > way", to which the answer is "because the PEP says so".


    Not at all a satisfying answer unless one believes
    in PEPal infallibility. :)

    > Only *then* the question is "what is the rationale for
    > the PEP specifying things the way it does". The PEP is
    > relevant so that we can both agree that Python behaves
    > correctly (in the sense of behaving as specified).


    But my question had become: why that behavior, when a
    slightly different behavior would be more general with
    little apparent downside?

    To clarify, my interest in the justification for the
    current behavior is this:

    I think the best feature of python is not, as commonly
    stated, the clean syntax, but rather the pretty complete
    and orthogonal libraries. I often find, after I have
    written some code, that due to the right library functions
    being available, it turns out much shorter and concise
    than I expected.

    Nevertheless, every now and then, perhaps more than in some
    other languages (I'm not sure), I run into something that
    requires what seems to be excessive coding -- I have to
    do something that it seems to me a library function should
    have done for me. Sometimes this is because I don't
    understand the reason the library function needs to work the
    way it does. Other times it is one of the countless trade-offs
    made in the design of the language, which didn't happen
    to go the way that would have been beneficial to me in a
    particular coding situation.

    But sometimes (and it feels too often) it seems as though,
    zen notwithstanding, purity -- adherence to some
    philosophical ideal -- beat practicality.
    unichr/ord seems such a case to me, but I want to be
    sure I am not missing something.

    The reasons for the current behavior so far:

    1.
    > What you propose would break the property "unichr(i) always returns
    > a string of length one, if it returns anything at all".


    Yes. And I don't see the problem with that. Why is
    that property more desirable than the non-existent
    property that a Unicode literal always produces one
    python character? It would only occur on a narrow
    build with a unicode character outside of the BMP,
    exactly the condition under which a unicode literal can "behave
    differently" by producing two python characters.

    2.
    > > But there is no reason given [in the PEP] for that behavior.

    > Sure there is, right above the list:
    > "Most things will behave identically in the wide and narrow worlds."
    > That's the reason: scripts should work the same as much as possible
    > in wide and narrow builds.


    So what else would work "differently"? My point was
    that extending unichr/ord to work with all unicode
    characters reduces differences far more often than
    it increases them.

    3.
    >> * There is a convention in the Unicode world for
    >> encoding a 32-bit code point in terms of two
    >> 16-bit code points. These are known as
    >> "surrogate pairs". Python's codecs will adopt
    >> this convention.
    >>
    >> Is a distinction made between Python and Python
    >> codecs with only the latter having any knowledge of
    >> surrogate pairs?

    >
    > No. In the end, the Unicode type represents code units,
    > not code points, i.e. half surrogates are individually
    > addressable. Codecs need to adjust to that; in particular
    > the UTF-8 and the UTF-32 codec in narrow builds, and the
    > UTF-16 codec in wide builds (which didn't exist when the
    > PEP was written).


    OK, so that is not a reason either.

    4.
    I'll speculate a little.
    If surrogate handling was added to ord/unichr, it would
    be the top of a slippery slope leading to demands that
    other string functions also handle surrogates.

    But this is not true -- there is a strong distinction
    between ord/unichr and other string methods. The latter
    deal with strings of multiple characters. But the former
    deals only with single characters (taking a surrogate
    pair as a single unicode character.)

    The behavior of ord/unichr is independent of the other
    string methods -- if the string methods were changed with regard to
    surrogate handling they would all have to be changed together to
    maintain consistent behavior. unichr/ord affect only
    each other.

    The functions of ord/unichr -- to map characters to
    numbers -- are fundamental string operations, akin to
    indexing or extracting a substring. So why would
    one want to limit them to a subset of characters if
    not absolutely necessary?

    To reiterate, I am not advocating for any change. I
    simply want to understand if there is a good reason
    for limiting the use of unichr/ord on narrow builds to
    a subset of the unicode characters that Python otherwise
    supports. So far, it seems not, and unichr/ord
    seem to be a poster child for "purity beats practicality".
     
    , Aug 29, 2009
    #14
  15. On Sat, 29 Aug 2009 07:38:51 -0700, rurpy wrote:

    > > Then, the next question is "why is it implemented that way", to which
    > > the answer is "because the PEP says so".

    >
    > Not at all a satisfying answer unless one believes in PEPal
    > infallibility. :)


    Not at all. You don't have to believe that PEPs are infallible to accept
    the answer, you just have to understand that major changes to Python
    aren't made arbitrarily, they have to go through a PEP first. Even Guido
    himself has to write a PEP before making any major changes to the
    language. But PEPs aren't infallible, they can be challenged, rejected,
    withdrawn or made obsolete by new PEPs.


    > The reasons for the current behavior so far:
    >
    > 1.
    >> What you propose would break the property "unichr(i) always returns a
    >> string of length one, if it returns anything at all".

    >
    > Yes. And i don't see the problem with that. Why is that property more
    > desirable than the non-existent property that a Unicode literal always
    > produces one python character?


    What do you mean? Unicode literals don't always produce one character,
    e.g. u'abcd' is a Unicode literal with four characters.

    I think it's fairly self-evident that a function called uniCHR [emphasis
    added] should return a single character (technically a single code
    point). But even if you can come up with a reason for unichr() to return
    two or more characters, this would break code that relies on the
    documented promise that the length of the output of unichr() is always
    one.

    > It would only occur on a narrow build
    > with a unicode character outside of the bmp, exactly the condition a
    > unicode literal can "behave differently" by producing two python
    > characters.



    > 2.
    >> > But there is no reason given [in the PEP] for that behavior.

    >> Sure there is, right above the list:
    >> "Most things will behave identically in the wide and narrow worlds."
    >> That's the reason: scripts should work the same as much as possible in
    >> wide and narrow builds.

    >
    > So what else would work "differently"?


    unichr(n) sometimes would return one character and sometimes two; ord(c)
    would sometimes accept two characters and sometimes raise an exception.
    That's a fairly major difference.


    > My point was that extending
    > unichr/ord to work with all unicode characters reduces differences far
    > more often than it increase them.


    I don't see that at all. What differences do you think it would reduce?


    > 3.
    >>> * There is a convention in the Unicode world for
    >>> encoding a 32-bit code point in terms of two 16-bit code
    >>> points. These are known as "surrogate pairs". Python's codecs
    >>> will adopt this convention.
    >>>
    >>> Is a distinction made between Python and Python codecs with only the
    >>> latter having any knowledge of surrogate pairs?

    >>
    >> No. In the end, the Unicode type represents code units, not code
    >> points, i.e. half surrogates are individually addressable. Codecs need
    >> to adjust to that; in particular the UTF-8 and the UTF-32 codec in
    >> narrow builds, and the UTF-16 codec in wide builds (which didn't exist
    >> when the PEP was written).

    >
    > OK, so that is not a reason either.


    I think it is a very important reason. Python supports code points, so it
    has to support surrogate codes individually. Python can't tell if the
    pair of code points u'\ud800\udc40' represents the single character
    \U00010040 or a pair of code points \ud800 and \udc40.


    > 4.
    > I'll speculate a little.
    > If surrogate handling was added to ord/unichr, it would be the top of a
    > slippery slope leading to demands that other string functions also
    > handle surrogates.
    >
    > But this is not true -- there is a strong distinction between ord/unichr
    > and other string methods. The latter deal with strings of multiple
    > characters. But the former deals only with single characters (taking a
    > surrogate pair as a single unicode character.)


    Strictly speaking, unichr() deals with code points, not characters,
    although the distinction is very fine.

    >>> c = unichr(56384)
    >>> len(c)

    1
    >>> import unicodedata
    >>> unicodedata.category(c)

    'Cs'

    Cs is the general category for "Other, Surrogate", so \udc40 is not
    strictly speaking a character. Nevertheless, Python treats it as one.


    > To reiterate, I am not advocating for any change. I simply want to
    > understand if there is a good reason for limiting the use of unchr/ord
    > on narrow builds to a subset of the unicode characters that Python
    > otherwise supports. So far, it seems not and that unichr/ord is a
    > poster child for "purity beats practicality".


    On the contrary, it seems pretty impractical to me for ord() to sometimes
    successfully accept strings of length two and sometimes to raise an
    exception. I would much rather see a pair of new functions, wideord() and
    widechr() used for converting between surrogate pairs and numbers.
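    A minimal sketch of what such a pair could look like (Python 3 syntax;
    the names wideord()/widechr() are the suggestion above, while the
    bodies are this sketch's assumption, not an existing API), using the
    standard UTF-16 surrogate arithmetic:

```python
def widechr(i):
    """Like chr()/unichr(), but return a surrogate pair above the BMP."""
    if i > 0xFFFF:
        i -= 0x10000
        return chr(0xD800 + (i >> 10)) + chr(0xDC00 + (i & 0x3FF))
    return chr(i)

def wideord(s):
    """Like ord(), but also accept a two-character surrogate pair."""
    if (len(s) == 2 and '\ud800' <= s[0] <= '\udbff'
            and '\udc00' <= s[1] <= '\udfff'):
        return 0x10000 + ((ord(s[0]) - 0xD800) << 10) + (ord(s[1]) - 0xDC00)
    return ord(s)

print(widechr(0x10040) == '\ud800\udc40')   # True
print(hex(wideord('\ud800\udc40')))         # 0x10040
```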



    --
    Steven
     
    Steven D'Aprano, Aug 29, 2009
    #15
  16. 2009/8/29 <>:
    > On 08/28/2009 02:12 AM, "Martin v. Löwis" wrote:
    >
    > So far, it seems not and that unichr/ord
    > is a poster child for "purity beats practicality".
    > --
    > http://mail.python.org/mailman/listinfo/python-list
    >


    As Mark Tolonen pointed out earlier in this thread, in Python 3 the
    practicality apparently beat purity in this aspect:

    Python 3.1.1 (r311:74483, Aug 17 2009, 17:02:12) [MSC v.1500 32 bit
    (Intel)] on win32
    Type "copyright", "credits" or "license()" for more information.

    >>> goth_urus_1 = '\U0001033f'
    >>> list(goth_urus_1)
    ['\ud800', '\udf3f']
    >>> len(goth_urus_1)
    2
    >>> ord(goth_urus_1)
    66367
    >>> goth_urus_2 = chr(66367)
    >>> len(goth_urus_2)
    2
    >>> import unicodedata
    >>> unicodedata.name(goth_urus_1)
    'GOTHIC LETTER URUS'
    >>> goth_urus_3 = unicodedata.lookup("GOTHIC LETTER URUS")
    >>> goth_urus_4 = "\N{GOTHIC LETTER URUS}"
    >>> goth_urus_1 == goth_urus_2 == goth_urus_3 == goth_urus_4
    True
    >>>


    As for the behaviour in python 2.x, it's probably good enough, that
    the surrogates aren't prohibited and the eventually needed behaviour
    can be easily added via custom functions.
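    One such custom function might look like this (sketched in Python 3
    syntax; the name as_surrogate_pair() is illustrative only, not an
    existing API), letting the utf-16 codec do the splitting:

```python
def as_surrogate_pair(codepoint):
    # Split a supplementary code point into the UTF-16 code units a
    # narrow build would store for it (a single unit for BMP characters).
    data = chr(codepoint).encode('utf-16-le')
    return [chr(int.from_bytes(data[i:i + 2], 'little'))
            for i in range(0, len(data), 2)]

print(as_surrogate_pair(66367) == ['\ud800', '\udf3f'])  # True
```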

    vbr
     
    Vlastimil Brom, Aug 29, 2009
    #16
  17. Guest

    On 08/29/2009 12:06 PM, Steven D'Aprano wrote:
    [...]
    >> The reasons for the current behavior so far:
    >>
    >> 1.
    >>> What you propose would break the property "unichr(i) always returns a
    >>> string of length one, if it returns anything at all".

    >>
    >> Yes. And I don't see the problem with that. Why is that property more
    >> desirable than the non-existent property that a Unicode literal always
    >> produces one python character?

    >
    > What do you mean? Unicode literals don't always produce one character,
    > e.g. u'abcd' is a Unicode literal with four characters.


    I'm sorry, I should have been clearer. I meant the literal
    representation of a *single* unicode character: u'\u4000',
    which results in a string of length 1, vs u'\U00010040', which
    results in a string of length 2. In both cases the literal
    represents a single unicode code point.

    > I think it's fairly self-evident that a function called uniCHR [emphasis
    > added] should return a single character (technically a single code
    > point).


    There are two concepts of characters here: the 16-bit things
    that encode a character in a Python unicode string (in a
    narrow-build Python), and a character in the sense of one
    of the ~2**20 unicode code points. Python has chosen to
    represent the latter (when outside the BMP) as a pair of
    surrogate characters from the former. I don't see why one
    would assume that CHR would mean the python 16-bit
    character concept rather than the full unicode character
    concept. In fact, rather the opposite.

    > But even if you can come up with a reason for unichr() to return
    > two or more characters,


    I've given a number of reasons why it should return a two
    character representation of a non-BMP character, one of
    which is that that is how Python has chosen to represent
    such characters internally. I won't repeat the other
    reasons again.

    I'm not sure why you think more than two characters
    would ever be possible.

    > this would break code that relies on the
    > documented promise that the length of the output of unichr() is always
    > one.


    Ah, OK. This is the good reason I was looking for.
    I did not realize (until prompted by your remark
    to go back and look at the early docs) that unichr
    had been documented to return a single character
    since 2.0 and that wide character support was added
    in 2.2. Martin v. Loewis also implied that, I now
    see, although the implication was too deep for me
    to pick up.

    So although it leads to a suboptimal situation, I
    agree that maintaining the documented behavior was
    necessary.

    [...]
    > I would much rather see a pair of new functions, wideord() and
    > widechr() used for converting between surrogate pairs and numbers.


    I guess if it were still 2001 and Python 2.2 was
    coming out I would be in favor of this too. :)
     
    , Aug 30, 2009
    #17
  18. Guest

    On 08/29/2009 01:43 PM, Vlastimil Brom wrote:
    > 2009/8/29 <>:
    >> On 08/28/2009 02:12 AM, "Martin v. Löwis" wrote:
    >>
    >> So far, it seems not and that unichr/ord
    >> is a poster child for "purity beats practicality".
    >> --
    >> http://mail.python.org/mailman/listinfo/python-list
    >>
    >
    > As Mark Tolonen pointed out earlier in this thread, in Python 3 the
    > practicality apparently beat purity in this aspect:
    >
    > Python 3.1.1 (r311:74483, Aug 17 2009, 17:02:12) [MSC v.1500 32 bit
    > (Intel)] on win32
    > Type "copyright", "credits" or "license()" for more information.
    >
    > >>> goth_urus_1 = '\U0001033f'
    > >>> list(goth_urus_1)
    > ['\ud800', '\udf3f']
    > >>> len(goth_urus_1)
    > 2
    > >>> ord(goth_urus_1)
    > 66367
    > >>> goth_urus_2 = chr(66367)
    > >>> len(goth_urus_2)
    > 2
    > >>> import unicodedata
    > >>> unicodedata.name(goth_urus_1)
    > 'GOTHIC LETTER URUS'
    > >>> goth_urus_3 = unicodedata.lookup("GOTHIC LETTER URUS")
    > >>> goth_urus_4 = "\N{GOTHIC LETTER URUS}"
    > >>> goth_urus_1 == goth_urus_2 == goth_urus_3 == goth_urus_4
    > True
    > >>>


    Yes, that certainly seems like much more sensible behavior.

    > > As for the behaviour in python 2.x, it's probably good enough, that
    > > the surrogates aren't prohibited and the eventually needed behaviour
    > > can be easily added via custom functions.


    Yes, I agree that given the current behavior is well documented
    and further, is fixed in python 3, it can't be changed.

    I would pick a nit though with "can be easily added via custom
    functions."
    I don't think that is a good criterion for rejecting functionality
    from the library, because it is not sufficient; there are many
    functions in the library that fail that test. I think the criterion
    should be more like a ratio: (how often needed) / (ease of writing).
    [where "ease" is not just the line count but also the obviousness
    to someone who is not a python expert yet.]
    And I would also dispute that the generalized unichr/ord functions
    are "easily" added. When I ran into the TypeError in ord(), I
    thought "surrogate pairs" were something used in sex therapy. :)
    It took a lot of reading and research before I was able to write
    a generalized ord() function.
     
    , Aug 30, 2009
    #18
  19. "Martin v. Löwis" <> writes on Fri, 28 Aug 2009 10:12:34 +0200:
    >> The PEP says:
    >> * unichr(i) for 0 <= i < 2**16 (0x10000) always returns a
    >> length-one string.
    >>
    >> * unichr(i) for 2**16 <= i <= TOPCHAR will return a
    >> length-one string on wide Python builds. On narrow
    >> builds it will raise ValueError.
    >> and
    >> * ord() is always the inverse of unichr()
    >>
    >> which of course we know; that is the current behavior. But
    >> there is no reason given for that behavior.

    >
    > Sure there is, right above the list:
    >
    > "Most things will behave identically in the wide and narrow worlds."
    >
    > That's the reason: scripts should work the same as much as possible
    > in wide and narrow builds.
    >
    > What you propose would break the property "unichr(i) always returns
    > a string of length one, if it returns anything at all".


    But getting a "ValueError" in some builds (and not in others)
    is rather worse than getting unicode strings of different length....

    >> 1) Should surrogate pairs be disallowed on narrow builds?
    >> That appears to have been answered in the negative and is
    >> not relevant to my question.

    >
    > It is, as it does lead to inconsistencies between wide and narrow
    > builds. OTOH, it also allows the same source code to work on both
    > versions, so it also preserves the uniformity in a different way.


    Do you not have the inconsistencies in any case?
    .... "ValueError" in some builds and not in others ...
     
    Dieter Maurer, Aug 30, 2009
    #19
  20. > To reiterate, I am not advocating for any change. I
    > simply want to understand if there is a good reason
    > for limiting the use of unchr/ord on narrow builds to
    > a subset of the unicode characters that Python otherwise
    > supports. So far, it seems not and that unichr/ord
    > is a poster child for "purity beats practicality".


    I think that's actually the case. I went back to the discussions,
    and found that early 2.2 alpha releases did return two-character
    strings from unichr, and that this was changed because Marc-Andre
    Lemburg insisted. Here are a few relevant messages from the
    archives (search for unichr)

    http://mail.python.org/pipermail/python-dev/2001-June/015649.html
    http://mail.python.org/pipermail/python-dev/2001-July/015662.html
    http://mail.python.org/pipermail/python-dev/2001-July/016110.html
    http://mail.python.org/pipermail/python-dev/2001-July/016153.html
    http://mail.python.org/pipermail/python-dev/2001-July/016155.html
    http://mail.python.org/pipermail/python-dev/2001-July/016186.html

    Eventually, in r28142, MAL changed it to give it its current
    state.

    Regards,
    Martin
     
    Martin v. Löwis, Aug 30, 2009
    #20