Unicode Normalization of Text Streams

Discussion in 'C Programming' started by William Ahern, Sep 14, 2006.

  1. Has it ever been proposed or posited within any C committee to define or
    discuss (in a standard's document) the transformation of Unicode text I/O
    according to a Unicode Normalization Form (assuming a locale which employs
    a Unicode representation)? Is such a capability implicit?

    The notion exists and seems to work well for line/record delimiters
    (e.g. "\r\n" -> "\n").

    - Bill
    William Ahern, Sep 14, 2006
    #1
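The line-delimiter translation mentioned in post #1 is the kind of transformation a text stream may perform: the C standard allows an implementation to add, alter, or delete characters in text mode to match the host's conventions. Below is a minimal sketch of the "\r\n" -> "\n" direction as an in-place buffer filter. The name `crlf_to_lf` is hypothetical and not part of any standard library; a real implementation would do this inside its stdio layer.

```c
#include <stddef.h>

/* Hypothetical helper: collapse each "\r\n" pair in buf[0..len) to "\n",
   in place. A '\r' not followed by '\n' is kept unchanged.
   Returns the new length of the data. */
size_t crlf_to_lf(char *buf, size_t len)
{
    size_t r, w = 0;

    for (r = 0; r < len; r++) {
        /* Skip a CR only when the very next byte is an LF. */
        if (buf[r] == '\r' && r + 1 < len && buf[r + 1] == '\n')
            continue;
        buf[w++] = buf[r];
    }
    return w;
}
```

This filter only ever needs one byte of lookahead and no tables. Unicode normalization, by contrast, needs per-character decomposition and combining-class data, which is where the size concerns raised later in the thread come from.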

  2. William Ahern

    pete Guest

    William Ahern wrote:
    >
    > Has it ever been proposed or posited within any
    > C committee to define or
    > discuss (in a standard's document)
    > the transformation of Unicode text I/O
    > according to a Unicode Normalization Form
    > (assuming a locale which employs
    > a Unicode representation)? Is such a capability implicit?
    >
    > The notion exists and seems to work well for line/record delimiters
    > (e.g. "\r\n" -> "\n").


    That seems more like a question for
    news:comp.std.c
    to me.

    --
    pete
    pete, Sep 14, 2006
    #2

  3. William Ahern

    Jack Klein Guest

    On Thu, 14 Sep 2006 14:54:22 -0700, William Ahern
    <> wrote in comp.lang.c:

    > Has it ever been proposed or posited within any C committee to define or
    > discuss (in a standard's document) the transformation of Unicode text I/O
    > according to a Unicode Normalization Form (assuming a locale which employs
    > a Unicode representation)? Is such a capability implicit?


    I hardly think this is likely to ever be placed into the C standard. C
    does not define or support Unicode at all, any more than it defines
    ASCII or any other character set. The exact format and/or
    representation of C's wide characters is completely
    implementation-defined, and completely unspecified by the standard.

    There is no requirement in the language that an implementation
    supports or understands Unicode, nor any locales defined besides the C
    locale.

    >
    > The notion exists and seems to work well for line/record delimiters
    > (e.g. "\r\n" -> "\n").


    But I do agree with Pete, you certainly should ask in comp.std.c if
    you want to discuss this.

    --
    Jack Klein
    Home: http://JK-Technology.Com
    FAQs for
    comp.lang.c http://c-faq.com/
    comp.lang.c++ http://www.parashift.com/c++-faq-lite/
    alt.comp.lang.learn.c-c++
    http://www.contrib.andrew.cmu.edu/~ajo/docs/FAQ-acllc.html
    Jack Klein, Sep 15, 2006
    #3
  4. William Ahern

    Guest

    William Ahern wrote:
    > Has it ever been proposed or posited within any C committee to define or
    > discuss (in a standard's document) the transformation of Unicode text I/O
    > according to a Unicode Normalization Form (assuming a locale which employs
    > a Unicode representation)? Is such a capability implicit?


    This is a complicated algorithm. In a demo normalizer I wrote, my
    implementation took 107K of object code. It could likely be optimized,
    but I would be surprised if you could get it under 30K or so. (It's
    mostly because you need to encode the UniData table, of course.)
    Besides, C does not support Unicode. So there is no reason to expect
    such functionality from the language.

    In a bizarre twist, however, someone has actually proposed adding some
    kind of UTF codecs to the next C standard. I think, of course, this is
    probably meant to be a joke to see if the ANSI C people are utterly and
    completely incompetent or not -- since their actions to date do not
    make it clear whether or not they are.

    UTF encoders are, of course, trivial pieces of code anyone can write in
    a few hours at most. But the key point is that they don't achieve any
    useful functionality if you don't have other Unicode support, such as a
    normalizer, as you suggest above. But a Unicode normalizer is
    expensive (as I mentioned above) to implement. So either they go whole
    hog and do a complete Unicode implementation (some implementations, of
    course, map wchar_t to Unicode -- so this is plausible), or they should
    do nothing. There is no point in half measures that can't really be
    used in practice.

    This "depth" of reasoning may very well be beyond the capabilities of
    the ANSI C committee, so it's not clear to me at all what they will end
    up doing.

    --
    Paul Hsieh
    http://www.pobox.com/~qed/
    http://bstring.sf.net/
    , Sep 15, 2006
    #4
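To give a sense of how small a UTF encoder is, as post #4 claims, here is a minimal sketch of a UTF-8 encoder for a single code point. The name `utf8_encode` and its signature are assumptions made for illustration; they are not taken from any committee proposal or existing library.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8.
   Writes 1 to 4 bytes into out and returns the count, or returns 0
   for values that are not valid scalar values (surrogates, > U+10FFFF). */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                       /* UTF-16 surrogate range */
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                           /* beyond the Unicode range */
}
```

The encoder also shows why normalization matters. Precomposed U+00E9 encodes as the bytes C3 A9, while the canonically equivalent sequence U+0065 U+0301 encodes as 65 CC 81. A byte-wise comparison treats the two as different strings unless something normalizes them first, and nothing in the encoder can do that.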
  5. William Ahern

    Simon Biber Guest

    Jack Klein wrote:
    > On Thu, 14 Sep 2006 14:54:22 -0700, William Ahern
    > <> wrote in comp.lang.c:
    >
    >> Has it ever been proposed or posited within any C committee to define or
    >> discuss (in a standard's document) the transformation of Unicode text I/O
    >> according to a Unicode Normalization Form (assuming a locale which employs
    >> a Unicode representation)? Is such a capability implicit?

    >
    > I hardly think this is likely to ever be placed into the C standard. C
    > does not define or support Unicode at all, any more than it defines
    > ASCII or any other character set. The exact format and/or
    > representation of C's wide characters is completely
    > implementation-defined, and completely unspecified by the standard.


    C supports ISO/IEC 10646 (Unicode) characters through the 'universal
    character name' system.

    The standard intends a C implementation to define __STDC_ISO_10646__ if
    values of type wchar_t correspond to Unicode characters.

    --
    Simon.
    Simon Biber, Sep 19, 2006
    #5
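The two mechanisms mentioned in post #5 can be checked from code. The sketch below assumes a C99 or later compiler; the function names `iso10646_version` and `e_acute` are hypothetical. When `__STDC_ISO_10646__` is defined, it expands to an integer constant of the form yyyymmL.

```c
#include <wchar.h>

/* Hypothetical helper: returns the value of __STDC_ISO_10646__
   (form yyyymmL) if the implementation defines it, or 0 if not. */
long iso10646_version(void)
{
#ifdef __STDC_ISO_10646__
    return __STDC_ISO_10646__;
#else
    return 0L;
#endif
}

/* Hypothetical helper: a universal character name used in a wide
   character constant. If __STDC_ISO_10646__ is defined, the stored
   value is the character's ISO/IEC 10646 code, here 0xE9. */
wchar_t e_acute(void)
{
    return L'\u00E9';
}
```

A program could call `iso10646_version()` at startup and fall back to an explicit conversion layer when it returns 0. The macro says nothing about normalization: it only describes how wchar_t values map to code points.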