Unicode Normalization Form C?

emf · Apr 4, 2013

My webpage:

https://files.nyu.edu/emf202/public/fr/limericks.html

checks OK with the W3C Validation Service as HTML5, but it triggers 8
warnings that have to do with the use of the following Greek characters:

· U+0387 Greek Ano Teleia
ς U+03C2 Greek Small Letter Final Sigma
ά U+03AC Greek Small Letter Alpha With Tonos
ί U+03AF Greek Small Letter Iota With Tonos

These are all basic characters of the Greek alphabet and cannot be
replaced; they trigger, however, the warning

"Text run is not in Unicode Normalization Form C."

Can somebody explain to me what this means?

Thanks,

emf

Jukka K. Korpela · Apr 4, 2013

My webpage:

https://files.nyu.edu/emf202/public/fr/limericks.html

checks OK with the W3C Validation Service as HTML5, but it triggers 8
warnings that have to do with the use of the following Greek characters:

· U+0387 Greek Ano Teleia
ς U+03C2 Greek Small Letter Final Sigma
ά U+03AC Greek Small Letter Alpha With Tonos
ί U+03AF Greek Small Letter Iota With Tonos

These are all basic characters of the Greek alphabet and cannot be
replaced; they trigger, however, the warning

"Text run is not in Unicode Normalization Form C."

This is explained fairly well at
http://stackoverflow.com/questions/5465170/text-run-is-not-in-unicode-normalization-form-c
As I remark in a comment that I added there now, the message used to be
an error, but it was changed to a warning after the discussion started by
http://lists.w3.org/Archives/Public/www-validator/2011May/0031.html

See also
http://stackoverflow.com/questions/8766675/normalizing-unicode-according-to-the-w3c-in-php

So it is not about conformance to HTML5 (which is a vague concept as
such, since HTML5 is mutable) but about general opinions of the W3C on
normalization.

In this case, the warning about GREEK ANO TELEIA is understandable,
since that character has canonical decomposition to U+00B7 MIDDLE DOT,
so in normalization to Normalization Form C (NFC), GREEK ANO TELEIA is
replaced by MIDDLE DOT. This is an example of inadequacy of NFC in many
situations: these characters have different glyphs in many fonts, and
GREEK ANO TELEIA is, not surprisingly, usually a much better choice for
the Greek punctuation mark (it usually sits around the x-height, whereas
the middle dot tends to be considerably lower).
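The replacement described above can be checked with a minimal sketch, assuming Python's standard `unicodedata` module (the variable names are only illustrative):

```python
import unicodedata

ANO_TELEIA = "\u0387"   # GREEK ANO TELEIA
MIDDLE_DOT = "\u00b7"   # MIDDLE DOT

# The canonical decomposition is a single code point (a "singleton"),
# so normalizing to NFC replaces U+0387 with U+00B7.
decomposition = unicodedata.decomposition(ANO_TELEIA)
normalized = unicodedata.normalize("NFC", ANO_TELEIA)

print(decomposition)                 # 00B7
print(normalized == MIDDLE_DOT)      # True
print(unicodedata.name(normalized))  # MIDDLE DOT
```

Once text has been normalized this way, the distinction is gone: nothing in the result tells you that the dot was originally an ano teleia.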

Regarding the other warnings, I really don't understand. The characters
seem to be in NFC. And for example, on line 409, there are two
occurrences of GREEK SMALL LETTER ALPHA WITH TONOS, but only the latter
has been flagged. I thought it might relate to the position (at the end
of the line), but the next line contains the same character at the end
of the line, with no warning issued.

So this seems to be a bug in the validator.

The bad thing is that we cannot tell which of the warnings are real, in
the sense that the text is actually not NFC. The Greek alpha with tonos
*could* be written as decomposed, and in general it would be best to
avoid that and especially using precomposed and decomposed form in the
same document, as they *might* get rendered differently (and the
precomposed one then probably renders better).
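To illustrate the precomposed and decomposed forms mentioned above, here is a small sketch, again assuming Python's `unicodedata`; `is_nfc` is a hypothetical helper written for this example, not something the validator uses:

```python
import unicodedata

precomposed = "\u03ac"        # GREEK SMALL LETTER ALPHA WITH TONOS
decomposed = "\u03b1\u0301"   # GREEK SMALL LETTER ALPHA + COMBINING ACUTE ACCENT

def is_nfc(text):
    """Return True if normalizing to NFC would leave the text unchanged."""
    return unicodedata.normalize("NFC", text) == text

nfd = unicodedata.normalize("NFD", precomposed)  # splits into letter + accent
nfc = unicodedata.normalize("NFC", decomposed)   # recomposes into one code point

print(nfd == decomposed)      # True
print(nfc == precomposed)     # True
print(is_nfc(precomposed))    # True
print(is_nfc(decomposed))     # False
```

So the precomposed alpha with tonos is already in NFC, and only the decomposed spelling would deserve a warning.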

Jukka K. Korpela · Apr 4, 2013

2013-04-04 9:44 said:
Regarding the other warnings, I really don't understand.

Now I do.

The characters seem to be in NFC.

They are. All the warnings about something not being NFC are caused by
GREEK ANO TELEIA. The validator just misrepresents the location by
highlighting, in red, the last letter of a line, no matter where the
issue is on that line.

So this seems to be a bug in the validator.

Well, it's a bug in highlighting - and in counting, since it says
"Validation Output: 8 Warnings" but shows only 6 warnings (and there are
6 occurrences of GREEK ANO TELEIA on the page).

So, ignore the warnings.
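Since the validator's highlighting cannot be trusted, one way to find the real positions is a rough per-character scan like the following. This is a sketch under assumptions: `find_non_nfc` is a hypothetical helper, and the sample text is made up rather than taken from the page.

```python
import unicodedata

def find_non_nfc(text):
    """Yield (line_no, column, char) for each character that NFC would change.

    Checking characters one at a time catches singleton cases such as
    U+0387; it is not a full NFC check for combining sequences.
    """
    for line_no, line in enumerate(text.splitlines(), start=1):
        for column, char in enumerate(line, start=1):
            if unicodedata.normalize("NFC", char) != char:
                yield line_no, column, char

# Made-up sample: alpha with tonos, final sigma, ano teleia, space, alpha
# with tonos on line 1; iota with tonos and ano teleia on line 2.
sample = "\u03ac\u03c2\u0387 \u03ac\n\u03af\u0387"
hits = list(find_non_nfc(sample))

for line_no, column, char in hits:
    print(f"line {line_no}, column {column}: "
          f"U+{ord(char):04X} {unicodedata.name(char)}")
```

On this sample only the two GREEK ANO TELEIA characters are reported; the accented letters, including the one at the end of the first line, are not.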

Jukka K. Korpela · Apr 4, 2013

2013-04-04 10:07 said:
Well, it's a bug in highlighting - and in counting, since it says
"Validation Output: 8 Warnings" but shows only 6 warnings (and there are
6 occurrences of GREEK ANO TELEIA on the page).

I have submitted a bug report:
https://www.w3.org/Bugs/Public/show_bug.cgi?id=21577

emf · Apr 5, 2013

I have submitted a bug report:
https://www.w3.org/Bugs/Public/show_bug.cgi?id=21577

Thanks for your explanations. BTW, the Greek Ano Teleia is not
equivalent to the Middle Dot, not in reality, but somehow it was
considered so and it has been impossible to change it with the
authorities, though it has been tried.

In Greek grade school you learn to put the ano teleia at the same height
as the upper dot of the colon or the dot of the Greek question mark,
which is like the Latin semicolon. Some old fonts still place it there,
though newer ones misplace it lower, following the wrong official
guidelines. See discussion at
https://bugs.freedesktop.org/show_bug.cgi?id=31285 by a Greek university
professor.

This is not the only misadventure of ano teleia: When they decided on
the Greek computer keyboard, they forgot (!) to include it, and so it
still is not included. Eventually I found and installed a small program
that permits me to use it with a key combination; it may look like the
middle dot, but it's better than nothing.

Unfortunately, once things get established, it's difficult to change
them, though I imagine that at some point it will happen, unless ano
teleia is deprecated in Greek grammar after long years of limited use
because of its problematic use in computers, despite the insistence of
some like me on continuing to use it when appropriate.

emf

Jukka K. Korpela · Apr 5, 2013

BTW, the Greek Ano Teleia is not
equivalent to the Middle Dot, not in reality, but somehow it was
considered so and it has been impossible to change it with the
authorities, though it has been tried.

Yes, that’s what I meant. It’s a different character, but it was unified
(in terms of canonical equivalence). There has been a lot of criticism
of Unicode unification, and this is a particularly striking example. But
it’s too late to change that. NFC has been carved into stone. There is a
large amount of software that relies on NFC as currently defined. Or at
least that’s what the Unicode Consortium thinks.

In Greek grade school you learn to put the ano teleia at the same height
as the upper dot of the colon or the dot of the Greek question mark,
which is like the Latin semicolon. Some old fonts still place it there,
though newer ones misplace it lower, following the wrong official
guidelines.

Well yes, the problem is that once MIDDLE DOT has been defined as a
strongly polysemic symbol, its design in fonts needs to be tolerable for
many uses, implying that it won’t be really *good* for anything. It’s
rather similar to HYPHEN-MINUS in this respect, except that instead of
HYPHEN-MINUS we can use, between consenting adults at least,
semantically much more accurate characters like HYPHEN, NON-BREAKING
HYPHEN, EN DASH, MINUS SIGN, etc.

This is not the only misadventure of ano teleia: When they decided on
the Greek computer keyboard, they forgot (!) to include it, and so it
still is not included.

Tragicomically, MIDDLE DOT cannot be conveniently typed on most
keyboards either, and it is not used much. But when used, it might be
used in the original meaning (as in Catalan), or as raised decimal point
(as in British usage), or as multiplication dot (instead of the more
correct DOT OPERATOR), etc. etc.

Eventually I found and installed a small program
that permits me to use it with a key combination; it may look like the
middle dot, but it's better than nothing.

You can use GREEK ANO TELEIA in HTML. Browsers won’t punish you. It’s
just a W3C opinion that it should not be used. Even though canonical
equivalence is supposed to mean identity of rendering, the reality is
different. Canonically equivalent characters may have different glyphs.

In theory, you could use MIDDLE DOT and some CSS to suggest that it be
rendered using a suitable glyph variant. Modern browsers generally
support OpenType features and let you specify such things, though IE 9
and older don’t get such things. But the main problem is that most fonts
commonly available on people’s computers, as well as most free fonts
that you could use as downloadable fonts, have limited or no OpenType
features.
