Unicode Normalization Form C?

Discussion in 'HTML' started by emf, Apr 4, 2013.

  1. emf

    emf Guest

    My webpage:

    https://files.nyu.edu/emf202/public/fr/limericks.html

    checks OK with the W3C Validation Service as HTML5, but it triggers 8
    warnings that have to do with the use of the following Greek characters:

    · U+0387 Greek Ano Teleia
    ς U+03C2 Greek Small Letter Final Sigma
    ά U+03AC Greek Small Letter Alpha With Tonos
    ί U+03AF Greek Small Letter Iota With Tonos

    These are all basic characters of the Greek alphabet and cannot be
    replaced; they trigger, however, the warning

    "Text run is not in Unicode Normalization Form C."

    Can somebody explain to me what this means?

    Thanks,

    emf

    --
    It ain't THAT, babe! - A radical reinterpretation
    https://files.nyu.edu/emf202/public/bd/itaintmebabe.html
    emf, Apr 4, 2013
    #1

  2. 2013-04-04 8:37, emf wrote:

    > My webpage:
    >
    > https://files.nyu.edu/emf202/public/fr/limericks.html
    >
    > checks OK with the W3C Validation Service as HTML5, but it triggers 8
    > warnings that have to do with the use of the following Greek characters:
    >
    > · U+0387 Greek Ano Teleia
    > ς U+03C2 Greek Small Letter Final Sigma
    > ά U+03AC Greek Small Letter Alpha With Tonos
    > ί U+03AF Greek Small Letter Iota With Tonos
    >
    > These are all basic characters of the Greek alphabet and cannot be
    > replaced; they trigger, however, the warning
    >
    > "Text run is not in Unicode Normalization Form C."


    This is explained fairly well at
    http://stackoverflow.com/questions/5465170/text-run-is-not-in-unicode-normalization-form-c
    As I remark in a comment that I added there now, the message used to be
    an error, but it was changed to a warning after the discussion started by
    http://lists.w3.org/Archives/Public/www-validator/2011May/0031.html

    See also
    http://stackoverflow.com/questions/8766675/normalizing-unicode-according-to-the-w3c-in-php

    So it is not about conformance to HTML5 (which is a vague concept as
    such, since HTML5 is mutable) but about general opinions of the W3C on
    normalization.

    In this case, the warning about GREEK ANO TELEIA is understandable,
    since that character has a canonical decomposition to U+00B7 MIDDLE DOT,
    so in normalization to Normalization Form C (NFC), GREEK ANO TELEIA is
    replaced by MIDDLE DOT. This is an example of the inadequacy of NFC in many
    situations: these characters have different glyphs in many fonts, and
    GREEK ANO TELEIA is, not surprisingly, usually a much better choice for
    the Greek punctuation mark (it usually sits around the x-height, whereas
    the middle dot tends to be considerably lower).
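
    The replacement described above can be checked directly. This is only
    a minimal sketch, assuming Python 3 and its standard unicodedata
    module (my choice for illustration; nothing in this thread or in the
    validator depends on it). The names ANO_TELEIA and MIDDLE_DOT are
    just labels I made up for the two code points.

```python
import unicodedata

ANO_TELEIA = "\u0387"  # GREEK ANO TELEIA
MIDDLE_DOT = "\u00b7"  # MIDDLE DOT

# Canonical decomposition mapping as recorded in the Unicode Character
# Database: a single code point, i.e. a "singleton" decomposition.
decomposition = unicodedata.decomposition(ANO_TELEIA)

# Normalizing to NFC replaces the ano teleia with the middle dot.
nfc = unicodedata.normalize("NFC", ANO_TELEIA)

print(decomposition)
print(unicodedata.name(nfc))
```

    Because the mapping is a singleton, no normalization form keeps
    U+0387: once text has been normalized, the distinction is lost.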

    Regarding the other warnings, I really don't understand. The characters
    seem to be in NFC. And for example, on line 409, there are two
    occurrences of GREEK SMALL LETTER ALPHA WITH TONOS, but only the latter
    has been flagged. I thought it might relate to the position (at the end
    of the line), but the next line contains the same character at the end
    of the line, with no warning issued.

    So this seems to be a bug in the validator.

    The bad thing is that we cannot tell which of the warnings are real, in
    the sense that the text is actually not NFC. The Greek alpha with tonos
    *could* be written in decomposed form, but in general it is best to
    avoid that, and especially to avoid mixing precomposed and decomposed
    forms in the same document, as they *might* get rendered differently
    (and the precomposed one then probably renders better).
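
    To make the precomposed/decomposed distinction concrete, here is a
    small sketch, again assuming Python 3 with the standard unicodedata
    module. The helper is_nfc is a hypothetical name of mine, not part of
    any library.

```python
import unicodedata

precomposed = "\u03ac"       # GREEK SMALL LETTER ALPHA WITH TONOS
decomposed = "\u03b1\u0301"  # GREEK SMALL LETTER ALPHA + COMBINING ACUTE ACCENT


def is_nfc(text):
    """Return True if normalizing to NFC leaves the text unchanged."""
    return unicodedata.normalize("NFC", text) == text


# The two spellings are canonically equivalent but differ as strings.
print(precomposed == decomposed)
print(is_nfc(precomposed), is_nfc(decomposed))

# NFC composes the sequence back into the single precomposed character.
recomposed = unicodedata.normalize("NFC", decomposed)
print(recomposed == precomposed)
```

    Running the page text through such a check is one way to find out
    whether a letter with tonos was entered as a single character or as a
    base letter followed by a combining accent.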

    --
    Yucca, http://www.cs.tut.fi/~jkorpela/
    Jukka K. Korpela, Apr 4, 2013
    #2

  3. 2013-04-04 9:44, Jukka K. Korpela wrote:

    > Regarding the other warnings, I really don't understand.


    Now I do.

    > The characters seem to be in NFC.


    They are. All the warnings about something not being NFC are caused by
    GREEK ANO TELEIA. The validator just misrepresents the location by
    highlighting, in red, the last letter of a line, no matter where the
    issue is on that line.
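
    Since the highlighting cannot be trusted, one could locate the
    offending characters independently. The following is a rough sketch
    in Python 3 (standard unicodedata module only); find_non_nfc and the
    sample text are hypothetical, made up for illustration. It checks one
    character at a time, so it finds singletons such as GREEK ANO TELEIA
    but would not flag a base letter followed by a combining accent.

```python
import unicodedata


def find_non_nfc(text):
    """List (line, column, name) for each character that NFC would change.

    Lines and columns are 1-based; columns count code points.
    """
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if unicodedata.normalize("NFC", ch) != ch:
                hits.append((line_no, col, unicodedata.name(ch, "UNKNOWN")))
    return hits


# Two lines of Greek; only the second contains U+0387.
sample = "\u03ba\u03b1\u03bb\u03ac\n\u03bd\u03b1\u03b9\u0387 \u03cc\u03c7\u03b9"

for line_no, col, name in find_non_nfc(sample):
    print(f"line {line_no}, column {col}: {name}")
```

    On the sample, the precomposed letters with tonos are left alone and
    only the ano teleia is reported, which matches what the validator
    should have highlighted.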

    > So this seems to be a bug in the validator.


    Well, it's a bug in highlighting - and in counting, since it says
    "Validation Output: 8 Warnings" but shows only 6 warnings (and there are
    6 occurrences of GREEK ANO TELEIA on the page).

    So, ignore the warnings.

    --
    Yucca, http://www.cs.tut.fi/~jkorpela/
    Jukka K. Korpela, Apr 4, 2013
    #3
  4. Jukka K. Korpela, Apr 4, 2013
    #4
  5. emf

    emf Guest

    On 2013-04-04 04:28 Jukka K. Korpela wrote:
    > 2013-04-04 10:07, Jukka K. Korpela wrote:
    >
    >>> So this seems to be a bug in the validator.

    >>
    >> Well, it's a bug in highlighting - and in counting, since it says
    >> "Validation Output: 8 Warnings" but shows only 6 warnings (and there are
    >> 6 occurrences of GREEK ANO TELEIA on the page).

    >
    > I have submitted a bug report:
    > https://www.w3.org/Bugs/Public/show_bug.cgi?id=21577


    Thanks for your explanations. BTW, the Greek Ano Teleia is not
    equivalent to the Middle Dot, not in reality, but somehow it was
    considered so and it has been impossible to change it with the
    authorities, though it has been tried.

    In Greek grade school you learn to put the ano teleia at the same height
    as the upper dot of the colon or the dot of the Greek question mark,
    which is like the Latin semicolon. Some old fonts still place it there,
    though newer ones misplace it lower, following the wrong official
    guidelines. See discussion at
    https://bugs.freedesktop.org/show_bug.cgi?id=31285 by a Greek university
    professor.

    This is not the only misadventure of ano teleia: When they decided on
    the Greek computer keyboard, they forgot (!) to include it, and so it
    still is not included. Eventually I found and installed a small program
    that permits me to use it with a key combination; it may look like the
    middle dot, but it's better than nothing.

    Unfortunately, once things get established, it's difficult to change
    them, though I imagine that at one point it happens, unless ano teleia
    is deprecated in the Greek grammar after long years of limited use
    because of its problematic use in computers, despite the insistence of
    some like me to keep using it when appropriate.

    emf

    --
    Date Calculator with all-purpose JS code
    https://files.nyu.edu/emf202/public/js/dateCalculator.html
    emf, Apr 5, 2013
    #5
  6. 2013-04-05 10:52, emf wrote:

    > BTW, the Greek Ano Teleia is not
    > equivalent to the Middle Dot, not in reality, but somehow it was
    > considered so and it has been impossible to change it with the
    > authorities, though it has been tried.


    Yes, that’s what I meant. It’s a different character, but it was unified
    (in terms of canonical equivalence). There has been a lot of criticism
    of Unicode unification, and this is a particularly striking example. But
    it’s too late to change that. NFC has been carved into stone. There is a
    large amount of software that relies on NFC as currently defined. Or at
    least that’s what the Unicode Consortium thinks.

    > In Greek grade school you learn to put the ano teleia at the same height
    > as the upper dot of the colon or the dot of the Greek question mark,
    > which is like the Latin semicolon. Some old fonts still place it there,
    > though newer ones misplace it lower, following the wrong official
    > guidelines.


    Well yes, the problem is that once MIDDLE DOT has been defined as a
    strongly polysemic symbol, its design in fonts needs to be tolerable for
    many uses, implying that it won’t be really *good* for anything. It’s
    rather similar to HYPHEN-MINUS in this respect, except that instead of
    HYPHEN-MINUS we can use, between consenting adults at least,
    semantically much more accurate characters like HYPHEN, NON-BREAKING
    HYPHEN, EN DASH, MINUS SIGN, etc.

    > This is not the only misadventure of ano teleia: When they decided on
    > the Greek computer keyboard, they forgot (!) to include it, and so it
    > still is not included.


    Tragicomically, MIDDLE DOT cannot be conveniently typed on most
    keyboards either, and it is not used much. But when used, it might be
    used in the original meaning (as in Catalan), or as raised decimal point
    (as in British usage), or as multiplication dot (instead of the more
    correct DOT OPERATOR), etc. etc.

    > Eventually I found and installed a small program
    > that permits me to use it with a key combination; it may look like the
    > middle dot, but it's better than nothing.


    You can use GREEK ANO TELEIA in HTML. Browsers won’t punish you. It’s
    just a W3C opinion that it should not be used. Even though canonical
    equivalence is supposed to mean identity of rendering, the reality is
    different. Canonically equivalent characters may have different glyphs.

    In theory, you could use MIDDLE DOT and some CSS to suggest that it be
    rendered using a suitable glyph variant. Modern browsers generally
    support OpenType features and let you specify such things, though IE 9
    and older don’t get such things. But the main problem is that most fonts
    commonly available on people’s computers, as well as most free fonts
    that you could use as downloadable fonts, have limited or no OpenType
    features.


    --
    Yucca, http://www.cs.tut.fi/~jkorpela/
    Jukka K. Korpela, Apr 5, 2013
    #6
