Python Unicode handling wins again -- mostly


Steven D'Aprano

There's a recent blog post complaining about the lousy support for
Unicode text in most programming languages:

http://mortoray.com/2013/11/27/the-string-type-is-broken/

The author, Mortoray, gives nine basic tests to understand how well the
string type in a language works. The first four involve "user-perceived
characters", also known as grapheme clusters.


(1) Does the decomposed string "noe\u0308l" print correctly? Notice that
the accented letter ë has been decomposed into a pair of code points,
U+0065 (LATIN SMALL LETTER E) and U+0308 (COMBINING DIAERESIS).

Python 3.3 passes this test:

py> print("noe\u0308l")
noël

although I expect that depends on the terminal you are running in.


(2) If you reverse that string, does it give "lëon"? The implication of
this question is that strings should operate on grapheme clusters rather
than code points. Python fails this test:

py> print("noe\u0308l"[::-1])
leon

Some terminals may display the umlaut over the l, or following the l.

I'm not completely sure it is fair to expect a string type to operate on
grapheme clusters (collections of decomposed characters) as the author
expects. I think that is going above and beyond what a basic string type
should be expected to do. I would expect a solid Unicode implementation
to include support for grapheme clusters, and in that regard Python is
lacking functionality.


(3) What are the first three characters? The author suggests that the
answer should be "noë", in which case Python fails again:

py> print("noe\u0308l"[:3])
noe

but again I'm not convinced that slicing should operate across decomposed
strings in this way. Surely the point of decomposing the string like that
is in order to count the base character e and the accent "\u0308"
separately?


(4) Likewise, what is the length of the decomposed string? The author
expects 4, but Python gives 5:

py> len("noe\u0308l")
5

So far, Python passes only one of the four tests, but I'm not convinced
that the three failed tests are fair for a string type. If strings
operated on grapheme clusters, these would be good tests, but it is not a
given that strings should.
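
For those who want to experiment, a rough approximation of grapheme
clustering can be built on unicodedata.combining(). This is only a sketch
of my own, ignoring the full rules in UAX #29 (which also cover things
like Hangul jamo):

py> import unicodedata
py> def graphemes(s):
...     clusters = []
...     for ch in s:
...         # attach combining marks to the preceding base character
...         if clusters and unicodedata.combining(ch):
...             clusters[-1] += ch
...         else:
...             clusters.append(ch)
...     return clusters
...
py> graphemes("noe\u0308l")
['n', 'o', 'ë', 'l']
py> print("".join(reversed(graphemes("noe\u0308l"))))
lëon
py> len(graphemes("noe\u0308l"))
4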

The next few tests have to do with characters in the Supplementary
Multilingual Planes, and this is where Python 3.3 shines. (In older
versions, wide builds would also pass, but narrow builds would fail.)

(5) What is the length of "😸😾"?

Both characters, U+1F638 (GRINNING CAT FACE WITH SMILING EYES) and U+1F63E
(POUTING CAT FACE), are outside the Basic Multilingual Plane, which means
they cannot fit in a single 16-bit code unit. Most programming languages
that use UTF-16 internally (including JavaScript and Java) fail this
test. Python 3.3 passes:

py> s = '😸😾'
py> len(s)
2

(Older versions of Python distinguished between *narrow builds*, which
used UTF-16 internally and *wide builds*, which used UTF-32. Narrow
builds would also fail this test.)

This makes Python one of a very few programming languages which can
easily handle so-called "astral characters" from the Supplementary
Multilingual Planes while still having O(1) indexing operations.
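
You can see what a UTF-16 based language would report by counting 16-bit
code units explicitly (a quick sketch):

py> len(s.encode('utf-16-le')) // 2  # number of 16-bit code units
4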


(6) What is the substring after the first character? The right answer is
a single character POUTING CAT FACE, and Python gets that correct:

py> import unicodedata
py> unicodedata.name(s[1:])
'POUTING CAT FACE'

UTF-16 languages invariably end up with broken, invalid strings
containing half of a surrogate pair.


(7) What is the reverse of the string?

Python passes this test too:

py> print(s[::-1])
😾😸
py> for c in s[::-1]:
...     unicodedata.name(c)
...
'POUTING CAT FACE'
'GRINNING CAT FACE WITH SMILING EYES'

UTF-16 based languages typically break, again getting invalid strings
containing surrogate pairs in the wrong order.
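
You can even reproduce that failure mode from Python by reversing the
16-bit code units by hand (a contrived sketch, purely for illustration):

py> units = s.encode('utf-16-le')
py> swapped = b''.join(units[i:i+2] for i in range(len(units) - 2, -1, -2))
py> swapped.decode('utf-16-le', errors='replace')
'�😸�'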


The next test involves ligatures. Ligatures are pairs, or triples, of
characters which have been moved closer together in order to look better.
Normally you would expect the type-setter to handle ligatures by
adjusting the spacing between characters, but there are a few pairs (such
as "fi" <=> "ﬁ") where type designers provided them as custom-designed
single characters, and Unicode includes them as legacy characters.

(8) What's the uppercase of "baffle" spelled with an ffl ligature?

Like most other languages, Python 3.2 fails:

py> 'baﬄe'.upper()
'BAﬄE'

but Python 3.3 passes:

py> 'baﬄe'.upper()
'BAFFLE'


Lastly, Mortoray returns to noël, and compares the composed and
decomposed versions of the string:

(9) Does "noël" equal "noe\u0308l"?

Python (correctly, in my opinion) reports that they do not:

py> "noël" == "noe\u0308l"
False

Again, one might argue whether a string type should report these as equal
or not; I believe Python is doing the right thing here. As the author
points out, any decent Unicode-aware language should at least offer the
ability to convert between normalisation forms, and Python passes this
test:

py> unicodedata.normalize("NFD", "noël") == "noe\u0308l"
True
py> "noël" == unicodedata.normalize("NFC", "noe\u0308l")
True


Out of the nine tests, Python 3.3 passes six, with three tests being
failures or dubious. If you believe that the native string type should
operate on code-points, then you'll think that Python does the right
thing. If you think it should operate on grapheme clusters, as the author
of the blog post does, then you'll think Python fails those three tests.


A call to arms
==============

As the Unicode Consortium itself acknowledges, sometimes you want to
operate on an array of code points, and sometimes on an array of
graphemes ("user-perceived characters"). Python 3.3 is now halfway there,
having excellent support for code-points across the entire Unicode
character set, not just the BMP.

The next step is to provide either a data type, or a library, for working
on grapheme clusters. The Unicode Consortium provides a detailed
discussion of this issue here:

http://www.unicode.org/reports/tr29/

If anyone is looking for a meaty project to work on, providing support
for grapheme clusters could be it. And if not, hopefully you've learned
something about Unicode and the limitations of Python's Unicode support.
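
For a taste of what such a library might offer, the third-party regex
module (a drop-in replacement for the standard re module, installed
separately) already understands the grapheme-cluster pattern \X, if I
remember correctly:

py> import regex  # pip install regex
py> regex.findall(r'\X', "noe\u0308l")
['n', 'o', 'ë', 'l']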
 

Roy Smith

Steven D'Aprano said:
(8) What's the uppercase of "baffle" spelled with an ffl ligature?

Like most other languages, Python 3.2 fails:

py> 'baffle'.upper()
'BAfflE'

but Python 3.3 passes:

py> 'baffle'.upper()
'BAFFLE'

I disagree.

The whole idea of ligatures like fi is purely typographic. The crossbar
on the "f" (at least in some fonts) runs into the dot on the "i".
Likewise, the top curl on an "f" runs into the serif on top of the "l"
(and similarly for ffl).

There is no such thing as a "FFL" ligature, because the upper case
letterforms don't run into each other like the lower case ones do.
Thus, I would argue that it's wrong to say that calling upper() on an
ffl ligature should yield FFL.

I would certainly expect x.lower() == x.upper().lower() to be True for
all values of x over the set of valid Unicode code points. Having
u"\uFB04".upper() ==> "FFL" breaks that. I would also expect len(x) ==
len(x.upper()) to be True.
 

Chris Angelico

I would certainly expect x.lower() == x.upper().lower() to be True for
all values of x over the set of valid Unicode code points. Having
u"\uFB04".upper() ==> "FFL" breaks that. I would also expect len(x) ==
len(x.upper()) to be True.

That's a nice theory, but the Unicode consortium disagrees with you on
both points.

ChrisA
 

Dave Angel

And they were already false long before Unicode. I don't know the
specifics, but there are many cases where there is no uppercase
equivalent for a particular lowercase character, and others where
the uppercase equivalent takes multiple characters.
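
A quick scan of the BMP at the interpreter bears this out (a sketch; the
variable name is mine):

py> violations = [chr(cp) for cp in range(0x10000)
...               if chr(cp).lower() != chr(cp).upper().lower()
...               or len(chr(cp)) != len(chr(cp).upper())]
py> 'ß' in violations, 'ς' in violations, '\ufb04' in violations
(True, True, True)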
 

Steven D'Aprano

You edited my text to remove the ligature? That's... unfortunate.


I disagree.

The whole idea of ligatures like fi is purely typographic.

In English, that's correct. I'm not sure if we can generalise that to all
languages that have ligatures. It also partly depends on how you define
ligatures. For example, would you consider the ampersand & to be a
ligature? These days, I would consider & to be a distinct character, but
originally it began as a ligature for "et" (Latin for "and").

But let's skip such corner cases, as they provide much heat but no
illumination, and I'll agree that when it comes to ligatures like fl, fi
and ffl, they are purely typographic.

The crossbar
on the "f" (at least in some fonts) runs into the dot on the "i".
Likewise, the top curl on an "f" runs into the serif on top of the "l"
(and similarly for ffl).

There is no such thing as a "FFL" ligature, because the upper case
letterforms don't run into each other like the lower case ones do. Thus,
I would argue that it's wrong to say that calling upper() on an ffl
ligature should yield FFL.

Your conclusion doesn't follow from the argument you are making. Since
the ffl ligature ﬄ is purely a typographical feature, it should
uppercase to FFL (there being no custom-designed uppercase FFL
ligature).

Consider the examples shown above, where you or your software
unfortunately edited out the ligature and replaced it with ASCII "ffl".
Or perhaps I should say *fortunately*, since it demonstrates the problem.

Since we agree that the ffl ligature is merely a typographic artifact of
some type-designer's whimsy, we can expect that the word "baﬄe" is
semantically exactly the same as the word "baffle". How foolish Python
would look if it did this:

py> 'baffle'.upper()
'BAfflE'


Replace the 'ffl' with the ligature, and the conclusion remains:

py> 'baﬄe'.upper()
'BAﬄE'

would be equally wrong.

Now, I accept that this picture isn't entirely black and white. For
example, we might argue that if ﬄ is purely typographical in nature,
surely we would also want 'baﬄe' == 'baffle' too? Or maybe not. This
indicates that capturing *all* the rules for text across the many
languages, writing systems and conventions is impossible.

There are some circumstances where we would want 'baﬄe' and 'baffle' to
compare equal, and others where we would want them to compare unequal.
Python gives us both:

py> "bapy> "baffle" == "baffle"
False
ffle" == unicodedata.normalize("NFKC", "baffle")
True


but frankly I'm baffled *wink* that you think there are any circumstances
where you would want the uppercase of ﬄ to be anything but FFL.

I would certainly expect x.lower() == x.upper().lower() to be True for
all values of x over the set of valid Unicode code points.

You would expect wrongly. You are over-generalising from English, and if
you include ligatures and other special cases, not even all of English.

See, for example:

http://www.unicode.org/faq/casemap_charprop.html#7a

Apart from ligatures, some examples of troublesome characters with regard
to case are:

* The German Eszett (sharp S) ß can be uppercased to SS, SZ or ẞ depending
on context, particularly when dealing with placenames and family names.

(That last character, LATIN CAPITAL LETTER SHARP S, goes back to at
least the 1930s, although the official rules of German orthography
still insist on uppercasing ß to SS.)

* The English long s ſ is uppercased to regular S.

* Turkish dotted and dotless I (İ and i, I and ı) uses the same Latin
letters I and i but the case conversion rules are different.

* Both the Greek sigma σ and final sigma ς uppercase to Σ.


That last one is especially interesting: Python 3.3 gets it right, while
older Pythons do not. In Python 3.2:

py> 'Ὀδυσσεύς (Odysseus)'.upper().title()
'Ὀδυσσεύσ (Odysseus)'

while in 3.3 it roundtrips correctly:

py> 'Ὀδυσσεύς (Odysseus)'.upper().title()
'Ὀδυσσεύς (Odysseus)'


So... case conversions are not as simple as they appear at first glance.
They aren't always reversible, nor do they always roundtrip. Titlecase is
not necessarily the same as "uppercase the first letter and lowercase the
rest". Case conversions can be context or locale sensitive.

Anyway... even if you disagree with everything I have said, it is a fact
that Python has committed to following the Unicode standard, and the
Unicode standard requires that certain ligatures, including FFL, FL and
FI, are decomposed when converted to uppercase.
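
That requirement is easy to confirm in 3.3:

py> '\N{LATIN SMALL LIGATURE FFL}'.upper()
'FFL'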
 

Zero Piraeus


The whole idea of ligatures like fi is purely typographic.

In English, that's correct. I'm not sure if we can generalise that to
all languages that have ligatures. It also partly depends on how you
define ligatures. For example, would you consider the ampersand & to
be a ligature? These days, I would consider & to be a distinct
character, but originally it began as a ligature for "et" (Latin for
"and").

But let's skip such corner cases, as they provide much heat but no
illumination, [...]

In the interest of warmth (I know it's winter in some parts of the
world) ...

As I understand it, "&" has always been used to replace the word "et"
specifically, rather than the letter-pair e,t (no-one has ever written
"k&tle" other than ironically), which makes it a logogram rather than a
ligature (like "@").

(I happen to think the presence of ligatures in Unicode is insane, but
my dictator-of-the-world certificate appears to have gotten lost in the
post, so fixing that will have to wait).

-[]z.
 

Gene Heskett

The whole idea of ligatures like fi is purely typographic.

In English, that's correct. I'm not sure if we can generalise that to
all languages that have ligatures. It also partly depends on how you
define ligatures. For example, would you consider the ampersand & to
be a ligature? These days, I would consider & to be a distinct
character, but originally it began as a ligature for "et" (Latin for
"and").

But let's skip such corner cases, as they provide much heat but no
illumination, [...]

In the interest of warmth (I know it's winter in some parts of the
world) ...

As I understand it, "&" has always been used to replace the word "et"
specifically, rather than the letter-pair e,t (no-one has ever written
"k&tle" other than ironically), which makes it a logogram rather than a
ligature (like "@").

Whereas in these here parts, the "&" has always been read as a single
character shortcut for the word "and".
(I happen to think the presence of ligatures in Unicode is insane, but
my dictator-of-the-world certificate appears to have gotten lost in the
post, so fixing that will have to wait).

-[]z.


Cheers, Gene
--
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Genes Web page <http://geneslinuxbox.net:6309/gene>

"I remember when I was a kid I used to come home from Sunday School and
my mother would get drunk and try to make pancakes."
-- George Carlin
A pen in the hand of this president is far more
dangerous than 200 million guns in the hands of
law-abiding citizens.
 

Roy Smith

The whole idea of ligatures like fi is purely typographic.

In English, that's correct. I'm not sure if we can generalise that to all
languages that have ligatures. It also partly depends on how you define
ligatures.

I was speaking specifically of "ligatures like fi" (or, if you prefer,
"ligatures like ό"). By which I mean those things printers invented
because some letter combinations look funny when typeset as two distinct
letters.

There are other kinds of ligatures. For example, œ is a diphthong. It
makes sense (well, to me, anyway) that upper case œ is Έ.

Well, anyway, that's the truth according to me. Apparently the Unicode
Consortium disagrees. So, who am I to argue with the people who decided
that I needed to be able to type a "PILE OF POO" character. Which, by
the way, I can find in my "Character Viewer" input helper, but which MT
Newswatcher doesn't appear to be willing to insert into text. I guess
Basic Multilingual Poo would have been OK but Astral Poo is too much for
it.
 

Ian Kelly

I was speaking specifically of "ligatures like fi" (or, if you prefer,
"ligatures like ό"). By which I mean those things printers invented
because some letter combinations look funny when typeset as two distinct
letters.

I think the encoding of your email is incorrect, because GREEK SMALL
LETTER OMICRON WITH TONOS is not a ligature.

There are other kinds of ligatures. For example, œ is a diphthong. It
makes sense (well, to me, anyway) that upper case œ is Έ.

As above. I can't fathom why it would make sense for the upper case of
LATIN SMALL LIGATURE OE to be GREEK CAPITAL LETTER EPSILON WITH TONOS.
 

Steven D'Aprano

(I happen to think the presence of ligatures in Unicode is insane, but
my dictator-of-the-world certificate appears to have gotten lost in the
post, so fixing that will have to wait).

You're probably right, but we live in an insane world of dozens of insane
legacy encodings, and part of the mission of Unicode is to include every
single character that those legacy encodings did. Since some of them
included ligatures, so must Unicode. Sad but true.

(Unicode is intended as a replacement for the insanity of dozens of
multiply incompatible character sets. It cannot hope to replace them if
it cannot represent every distinct character they represent.)
 

Steven D'Aprano

I think the encoding of your email is incorrect, because GREEK SMALL
LETTER OMICRON WITH TONOS is not a ligature.

Roy's post, which is sent via Usenet not email, doesn't have an encoding
set. Since he's sending from a Mac, his software may believe that the
entire universe understands the Mac Roman encoding, which makes a certain
amount of sense since if I recall correctly the fi and fl ligatures
originally appeared in early Mac fonts.

I'm going to give Roy the benefit of the doubt and assume he actually
entered the fi ligature at his end. If his software was using Mac Roman,
it would insert a single byte DE into the message:

py> '\N{LATIN SMALL LIGATURE FI}'.encode('macroman')
b'\xde'


But that's not what his post includes. The message actually includes two
bytes CF8C, in other words:

'\N{LATIN SMALL LIGATURE FI}'.encode('who the hell knows')
=> b'\xCF\x8C'


Since nearly all of his post is in single bytes, it's some variable-width
encoding, but not UTF-8.

With no encoding set, our newsreader software starts off assuming that
the post uses UTF-8 ('cos that's the only sensible default), and those
two bytes happen to decode to ό GREEK SMALL LETTER OMICRON WITH TONOS.
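
It's easy to verify the decoding:

py> import unicodedata
py> unicodedata.name(b'\xCF\x8C'.decode('utf-8'))
'GREEK SMALL LETTER OMICRON WITH TONOS'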

I'm not surprised that Roy has a somewhat jaundiced view of Unicode, when
the tools he uses are apparently so broken. But it isn't Unicode's fault,
it's the tools.

The really bizarre thing is that apparently Roy's software,
MT-NewsWatcher, knows enough Unicode to normalise ﬄ LATIN SMALL LIGATURE
FFL (sent in UTF-8 and therefore appearing as bytes b'\xef\xac\x84') to
the ASCII letters "ffl". That's astonishingly weird.
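
For comparison, that is exactly what NFKC compatibility normalization
does:

py> import unicodedata
py> unicodedata.normalize('NFKC', '\N{LATIN SMALL LIGATURE FFL}')
'ffl'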

That is really a bizarre error. I suppose it is not entirely impossible
that the software is actually being clever rather than dumb. Having
correctly decoded the UTF-8 bytes, perhaps it realised that there was no
glyph for the ligature, and rather than display a MISSING CHAR glyph
(usually one of those empty boxes you sometimes see), it normalized it to
ASCII. But if it's that clever, why the hell doesn't it set an encoding
line in posts?????
 

Steven D'Aprano

So, who am I to argue with the people who decided that I needed to be
able to type a "PILE OF POO" character.

Blame the Japanese for that. Apparently some of the biggest users of
Unicode are the various Japanese mobile phone manufacturers, TV stations,
map makers and similar. So there's a large number of symbols and emoji
(emoticons) specifically added for them, presumably because they pay big
dollars to the Unicode Consortium and therefore get a lot of influence in
what gets added.
 

wxjmfauth

On Saturday 30 November 2013 03:08:49 UTC+1, Roy Smith wrote:
The whole idea of ligatures like fi is purely typographic. The crossbar
on the "f" (at least in some fonts) runs into the dot on the "i".
Likewise, the top curl on an "f" runs into the serif on top of the "l"
(and similarly for ffl).


And do you know the origin of this typographical feature?
Because, mechanically, the dot of the "i" broke too often.

I can't prove that's the truth, but I have read it many times in the
literature about typography and about Unicode.

In my opinion, a very plausible explanation.

jmf
 

Gregory Ewing

And do you know the origin of this typographical feature?
Because, mechanically, the dot of the "i" broke too often.

In my opinion, a very plausible explanation.

It doesn't sound very plausible to me, because there
are a lot more stand-alone 'i's in English text than
there are ones following an f. What is there to stop
them from breaking?

It's more likely to be simply a kerning issue. You
want to get the stems of the f and the i close together,
and the only practical way to do that with mechanical
type is to merge them into one piece of metal.

Which makes it even sillier to have an 'ffi' character
in this day and age, when you can simply space the
characters so that they overlap.
 

Gregory Ewing

Steven said:
Blame the Japanese for that. Apparently some of the biggest users of
Unicode are the various Japanese mobile phone manufacturers, TV stations,
map makers and similar.

Also there's apparently a pun in Japanese involving the
words for 'poo' and 'luck'. So putting a poo symbol in
your text message means 'good luck'. Given that, it's
not *quite* as silly as it seems.
 

Ned Batchelder

It doesn't sound very plausible to me, because there
are a lot more stand-alone 'i's in English text than
there are ones following an f. What is there to stop
them from breaking?

It's more likely to be simply a kerning issue. You
want to get the stems of the f and the i close together,
and the only practical way to do that with mechanical
type is to merge them into one piece of metal.

Which makes it even sillier to have an 'ffi' character
in this day and age, when you can simply space the
characters so that they overlap.

The fi ligature was created because visually, an f and i wouldn't work
well together: the crossbar of the f was near, but not connected to, the
serif of the i, and the terminal bulb of the f was close to, but not
coincident with, the dot of the i.

This article goes into great detail, and has a good illustration of how
an f and i can clash, and how an fi ligature can fix the problem:
http://opentype.info/blog/2012/11/20/whats-a-ligature/ . Note the second
fi illustration, which demonstrates using a ligature to make the letters
appear *less* connected than they would individually!

This is also why "simply spacing the characters" isn't a solution: a
specially designed ligature looks better than a separate f and i, no
matter how minutely kerned.

It's unfortunate that Unicode includes presentation alternatives like
the fi (and ff, fl, ffi, and ffl) ligatures. It was done to be a
superset of existing encodings.

Many typefaces have other non-encoded ligatures as well, especially
display faces, which also have alternate glyphs. Unicode is a funny mix
in that it includes some forms of alternates but can't include all of
them, so we have to put up with an ad-hoc Unicode that includes some
presentational variants, plus some other mechanism to specify the
variants Unicode leaves out.

--Ned.
 

Steven D'Aprano

Which makes it even sillier to have an 'ffi' character in this day and
age, when you can simply space the characters so that they overlap.

It's in Unicode to support legacy character sets that included it[1].
There are a bunch of similar cases:

* LATIN CAPITAL LETTER A WITH RING ABOVE versus ANGSTROM SIGN
* LATIN CAPITAL LETTER K versus KELVIN SIGN
* DEGREE CELSIUS and DEGREE FAHRENHEIT
* the whole set of full-width and half-width forms

On the other hand, there are cases which to a naive reader might look
like needless duplication but actually aren't. For example, there are a
bunch of visually indistinguishable characters[2] in European languages,
like AΑА and BΒВ (Latin, Greek and Cyrillic). The reason for this becomes
more obvious[3] when you lowercase them:

py> 'AΑА BΒВ'.lower()
'aαа bβв'

Sorting and case-conversion rules would become insanely complicated, and
context-sensitive, if Unicode only included a single code point per thing-
that-looks-the-same.
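
unicodedata makes the distinction visible:

py> import unicodedata
py> [unicodedata.name(c) for c in 'AΑА']
['LATIN CAPITAL LETTER A', 'GREEK CAPITAL LETTER ALPHA', 'CYRILLIC CAPITAL LETTER A']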

The rules for deciding what is and what isn't a distinct character can be
quite complex, and often politically charged. There's a lot of opposition
to Unicode in East Asian countries because it unifies Han ideograms that
look and behave the same in Chinese, Japanese and Korean. The reason they
do this is for the same reason that Unicode doesn't distinguish between
(say) English A, German A and French A. One reason some East Asian users
want them distinguished is the same reason you or I might wish to flag a
section of text as English and another section as German, and have them
displayed in slightly different typefaces and spell-checked with
different dictionaries. The Unicode Consortium's answer to that is that
this is beyond the remit of the character set, and is best handled by
markup or higher-level formatting.

(Another reason for opposing Han unification is, let's be frank, pure
nationalism.)



[1] As far as I can tell, the only character supported by legacy
character sets which is not included in Unicode is the Apple logo from
Mac charsets.

[2] The actual glyphs depends on the typeface used.

[3] Again, modulo the typeface you're using to view them.
 
