How to convert characters to upper case in a UTF-8 environment

Discussion in 'C Programming' started by csanjith@gmail.com, Mar 16, 2006.

  1. Guest

    Hi, I have a situation where I need to convert the characters entered in
    a text field to upper case using C. The configuration is a UTF-8
    environment in which the user can enter any character (single, double,
    triple byte, etc.). I need to convert to upper case only those characters
    that have an upper-case form; i.e., if a user enters both English and
    Japanese characters in the text field, then I should convert only the
    English characters, not the Japanese.

    I have seen that the C functions toupper() and tolower() handle
    multibyte characters from Solaris 8 onward. I am not sure about other
    platforms.

    Can anyone suggest the best approach for the above scenario?
     
    , Mar 16, 2006
    #1

  2. Micah Cowan Guest

    writes:

    > Hi, I have a situation where I need to convert the characters entered in


    Hi. Do you know you triple-posted this?

    > a text field to upper case using C. The configuration is a UTF-8
    > environment in which the user can enter any character (single, double,
    > triple byte, etc.). I need to convert to upper case only those characters
    > that have an upper-case form; i.e., if a user enters both English and
    > Japanese characters in the text field, then I should convert only the
    > English characters, not the Japanese.
    >
    > I have seen that the C functions toupper() and tolower() handle
    > multibyte characters from Solaris 8 onward. I am not sure about other
    > platforms.
    >
    > Can anyone suggest the best approach for the above scenario?


    The encodings supported by your C implementation for operations in
    toupper() and tolower() are implementation-defined: you'll need to
    look it up in your documentation.

    You might need to use setlocale() with something like "en_US.UTF8",
    and restore the locale afterwards. This may or may not work. And you
    still won't be able to work with multibyte strings directly: you'll
    have to convert to and from wchar_t's.
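    The wchar_t round trip described above can be sketched as follows. This
    is only an illustration: mb_toupper() is a hypothetical helper name, and
    it assumes the caller has already selected a suitable LC_CTYPE locale
    with setlocale() (the locale name, e.g. "en_US.UTF-8", varies by
    platform). Whether towupper() knows about non-ASCII letters is still
    up to the implementation.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: upper-case the multibyte string `in` into `out`
 * using the current LC_CTYPE locale.
 * Returns 0 on success, -1 on an invalid sequence, allocation failure,
 * or if `out` (of `outsize` bytes) is too small. */
int mb_toupper(const char *in, char *out, size_t outsize)
{
    assert(in != NULL && out != NULL);

    /* A string never has more wide characters than it has bytes. */
    size_t len = strlen(in);
    wchar_t *w = malloc((len + 1) * sizeof *w);
    if (w == NULL)
        return -1;

    size_t n = mbstowcs(w, in, len + 1);
    if (n == (size_t)-1) {              /* not valid in this locale */
        free(w);
        return -1;
    }

    /* One-to-one mapping; characters without an upper case pass through. */
    for (size_t i = 0; i < n; i++)
        w[i] = (wchar_t)towupper((wint_t)w[i]);

    /* Measure the multibyte result first: its length may differ. */
    mbstate_t st;
    memset(&st, 0, sizeof st);
    const wchar_t *wp = w;
    size_t need = wcsrtombs(NULL, &wp, 0, &st);

    int rc = -1;
    if (need != (size_t)-1 && need < outsize) {
        wcstombs(out, w, outsize);      /* fits, terminator included */
        rc = 0;
    }
    free(w);
    return rc;
}
```

    To restore the locale afterwards, copy the string returned by
    setlocale(LC_CTYPE, NULL) before switching, and pass the copy back to
    setlocale() once the conversion is done.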

    Your best bet is to use an (off-topic) specialized library devoted to
    manipulating UTF8 strings. IBM has one:
    http://www-306.ibm.com/software/globalization/icu/index.jsp

    This provides u_strToUpper() and u_strToLower() functions in ustring.h.
    Please don't post here for questions regarding this library, however,
    as it is off-topic for this NG.

    HTH,
    -Micah
     
    Micah Cowan, Mar 16, 2006
    #2

  3. loufoque Guest

    Micah Cowan wrote:

    > Your best bet is to use an (off-topic) specialized library devoted to
    > manipulating UTF8 strings. IBM has one:


    And glib can do it too.
     
    loufoque, Mar 17, 2006
    #3
  4. On Thursday 16 March 2006 20:26, Micah Cowan opined:
    > writes:
    >>
    >> Hi, I have a situation where I need to convert the characters entered

    >
    > Hi. Do you know you triple-posted this?


    It's the blinkin' Google. It does it sometimes.

    --
    BR, Vladimir

    "There is hopeful symbolism in the fact that flags do not wave in a
    vacuum."
    -- Arthur C. Clarke
     
    Vladimir S. Oka, Mar 17, 2006
    #4
  5. Guest

    wrote:
    > Hi, I have a situation where I need to convert the characters entered in
    > a text field to upper case using C. The configuration is a UTF-8
    > environment in which the user can enter any character (single, double,
    > triple byte, etc.). I need to convert to upper case only those characters


    Latin-based uppercasing is easy: just convert those characters that fall
    exactly in the lower-case or upper-case ASCII range. This is one of
    the properties of UTF-8. However, to perform a correct case change over
    the whole Unicode range, you simply need to know which characters have
    either an upper-case or a capitalization-case alternative character (as
    well as the reverse). This information is available from the standard
    Unicode data table.
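    A minimal sketch of the ASCII-only part, assuming the input is valid
    UTF-8. It relies on the UTF-8 property that every byte of a multibyte
    sequence is 0x80 or above, so a byte in the range 0x61..0x7A can only
    be a real ASCII letter. The name utf8_ascii_toupper() is made up for
    this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: upper-case only the ASCII letters a-z of a
 * NUL-terminated UTF-8 string, in place. Bytes belonging to multibyte
 * sequences (all >= 0x80) are never touched. */
void utf8_ascii_toupper(char *s)
{
    assert(s != NULL);
    for (unsigned char *p = (unsigned char *)s; *p != '\0'; p++) {
        if (*p >= 0x61 && *p <= 0x7A)         /* ASCII 'a'..'z' */
            *p = (unsigned char)(*p - 0x20);  /* -> 'A'..'Z'    */
    }
}
```

    Japanese text mixed into the string passes through unchanged, because
    none of its bytes fall in the ASCII range.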

    Oh yeah, and this is off topic here in comp.lang.c. ANSI C does not have
    a notion of portable internationalization, let alone Unicode (though
    some compilers implement wchar_t as Unicode, this cannot be relied
    upon).

    --
    Paul Hsieh
    http://www.pobox.com/~qed/
    http://bstring.sf.net/
     
    , Mar 17, 2006
    #5
  6. On Thu, 16 Mar 2006 10:17:13 -0800, csanjith wrote:

    > Hi, I have a situation where I need to convert the characters entered in
    > a text field to upper case using C. The configuration is a UTF-8
    > environment in which the user can enter any character (single, double,
    > triple byte, etc.). I need to convert to upper case only those characters
    > that have an upper-case form; i.e., if a user enters both English and
    > Japanese characters in the text field, then I should convert only the
    > English characters, not the Japanese.
    >
    > I have seen that the C functions toupper() and tolower() handle
    > multibyte characters from Solaris 8 onward. I am not sure about other
    > platforms.
    >


    It would seem improbable that toupper(), on Solaris or elsewhere, could
    give the correct output for all valid input when using UTF-8, UTF-16,
    UTF-32 or any other Unicode encoding variant.

    For all Unicode encoding variants there are some "characters" (or
    "graphemes" in Unicode terminology) which can be encoded
    equivalently as different sequences of integer values. Assume UTF-32, and
    our Unicode string is an array of uint32_t integers. The Latin-1 character
    A+acute could be described equivalently by a single 32-bit
    integer with a value of 0x000000C1 or by combining two integers,
    0x00000041-0x00000301. The latter representation could not be passed to
    toupper() or tolower(), considering that neither can take an array.

    [http://www.unicode.org/faq/char_combmark.html#8]

    Hmmmm. Do any Unicode gurus know if upper-casing 0x0061-0301 would yield
    a capital A + acute (0x0041-0301)? Regardless, I strongly suspect that
    there are many graphemes in many scripts where such a trick could never
    work, but where notions like uppercase or lowercase are still meaningful.
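    A small sketch of the decomposed case, assuming the UTF-32 string is a
    uint32_t array as described above. The mapping function simple_upper()
    is hypothetical and knows only ASCII a-z; the point is that a
    per-code-point pass changes the base letter and leaves the combining
    mark 0x0301 alone.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical one-to-one mapping that covers only ASCII a-z.
 * Everything else, combining marks included, is returned unchanged. */
static uint32_t simple_upper(uint32_t cp)
{
    return (cp >= 0x61 && cp <= 0x7A) ? cp - 0x20 : cp;
}

/* Upper-case a UTF-32 string of n code points, one code point at a time. */
void utf32_toupper(uint32_t *s, size_t n)
{
    assert(s != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        s[i] = simple_upper(s[i]);
}
```

    Applied to { 0x0061, 0x0301 } this produces { 0x0041, 0x0301 }. The
    precomposed lower-case form 0x00E1 would need its own table entry,
    which this sketch does not have.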
     
    William Ahern, Mar 17, 2006
    #6
  7. On Thu, 16 Mar 2006 17:29:38 -0800, websnarf wrote:

    > wrote:
    >> Hi, I have a situation where I need to convert the characters entered in
    >> a text field to upper case using C. The configuration is a UTF-8
    >> environment in which the user can enter any character (single, double,
    >> triple byte, etc.). I need to convert to upper case only those characters

    >
    > Latin-based uppercasing is easy: just convert those characters that fall
    > exactly in the lower-case or upper-case ASCII range. This is one of
    > the properties of UTF-8. However, to perform a correct case change over
    > the whole Unicode range, you simply need to know which characters have
    > either an upper-case or a capitalization-case alternative character (as
    > well as the reverse). This information is available from the standard
    > Unicode data table.


    Latin != ASCII. ASCII is 7-bit, ISO Latin encodings are 8-bit. Non-ASCII
    code points, in UTF-8, are multibyte. How do you pass an array to toupper()?

    > Oh yeah, and this is off topic here in comp.lang.c. ANSI C does not have
    > a notion of portable internationalization, let alone Unicode (though
    > some compilers implement wchar_t as Unicode, this cannot be relied
    > upon).


    The wide-character API is not sufficient to support Unicode.
    But you're right, this is off-topic. Anybody know where this would be
    on-topic, though?
     
    William Ahern, Mar 17, 2006
    #7
  8. Richard Bos Guest

    William Ahern wrote:

    > On Thu, 16 Mar 2006 10:17:13 -0800, csanjith wrote:
    >
    > > Hi, I have a situation where I need to convert the characters entered in
    > > a text field to upper case using C. The configuration is a UTF-8
    > > environment in which the user can enter any character (single, double,
    > > triple byte, etc.). I need to convert to upper case only those characters
    > > that have an upper-case form; i.e., if a user enters both English and
    > > Japanese characters in the text field, then I should convert only the
    > > English characters, not the Japanese.


    > For all Unicode encoding variants there are some "characters" (or
    > "graphemes" in Unicode terminology) which can be encoded
    > equivalently as different sequences of integer values. Assume UTF-32, and
    > our Unicode string is an array of uint32_t integers. The Latin-1 character
    > A+acute could be described equivalently by a single 32-bit
    > integer with a value of 0x000000C1 or by combining two integers,
    > 0x00000041-0x00000301. The latter representation could not be passed to
    > toupper() or tolower(), considering that neither can take an array.
    >
    > [http://www.unicode.org/faq/char_combmark.html#8]
    >
    > Hmmmm. Do any Unicode gurus know if upper-casing 0x0061-0301 would yield
    > a capital A + acute (0x0041-0301)?


    If I read the Unicode Standard correctly, yes, it would. However, the
    right question is: do you really _want_ to capitalise an accented
    lower-case letter to an accented upper-case letter? In Dutch you wouldn't.

    There are (at least) two reasonable C solutions:
    - trust that your implementation handles this correctly, for example by
      letting the sysadmin of the system the program runs on install
      language-specific libraries for the <ctype.h> functions, and just use
      tolower() and toupper(), as you would otherwise;
    - assume that you know better than J. Random Sysadmin which characters
      you want to capitalise, and write your own case-changing functions
      with knowledge about the Unicode tables.
    Something is to be said for either solution; the first is simpler and
    more flexible, in the second case the results are more strictly known.
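    The second solution might look roughly like this. It is a sketch only:
    the table is a tiny hand-picked excerpt of simple upper-case mappings,
    where a real program would generate the full table from the Unicode
    data files, and the names case_pair, upper_table and table_toupper()
    are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct case_pair { uint32_t lower, upper; };

/* Hand-picked excerpt, sorted by `lower` so bsearch() can be used. */
static const struct case_pair upper_table[] = {
    { 0x00E9, 0x00C9 },   /* é -> É            */
    { 0x00FF, 0x0178 },   /* ÿ -> Ÿ            */
    { 0x0153, 0x0152 },   /* œ -> Œ            */
    { 0x03B1, 0x0391 },   /* α -> Α (Greek)    */
    { 0x0430, 0x0410 },   /* а -> А (Cyrillic) */
};

/* bsearch() comparator: key is a code point, elem is a table entry. */
static int cmp_pair(const void *key, const void *elem)
{
    uint32_t k = *(const uint32_t *)key;
    uint32_t e = ((const struct case_pair *)elem)->lower;
    return (k > e) - (k < e);
}

/* Return the upper-case form of cp, or cp itself if the table has none. */
uint32_t table_toupper(uint32_t cp)
{
    assert(cp <= 0x10FFFF);               /* must be inside the Unicode range */
    if (cp >= 0x61 && cp <= 0x7A)         /* ASCII a-z */
        return cp - 0x20;
    const struct case_pair *hit =
        bsearch(&cp, upper_table,
                sizeof upper_table / sizeof upper_table[0],
                sizeof upper_table[0], cmp_pair);
    return hit != NULL ? hit->upper : cp;
}
```

    Because the program owns the table, it can also leave out mappings it
    does not want, which is the "more strictly known" part.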

    Richard
     
    Richard Bos, Mar 17, 2006
    #8
  9. Eric Sosman Guest

    wrote On 03/16/06 20:29,:
    > wrote:
    >
    >>Hi, I have a situation where I need to convert the characters entered in
    >>a text field to upper case using C. The configuration is a UTF-8
    >>environment in which the user can enter any character (single, double,
    >>triple byte, etc.). I need to convert to upper case only those characters

    >
    >
    > Latin-based uppercasing is easy: just convert those characters that fall
    > exactly in the lower-case or upper-case ASCII range. [...]


    Note that all of "àáâãäåæçèéêëìíîïðñòóôõöøùúûüý" appear
    in the alphabets of Latinic (Latinous?) languages, are
    outside the ASCII lower-case range, and yet have upper-
    case equivalents.
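    For the letters listed above, a sketch that extends the ASCII-only
    approach to the Latin-1 Supplement block, assuming valid UTF-8 input.
    U+00E0..U+00FE encode as the bytes 0xC3 0xA0..0xBE, and their capitals
    sit 0x20 lower, except U+00F7, which is the division sign and has no
    case. The function name is made up, and ÿ (whose capital lies outside
    this block) is deliberately left alone.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: upper-case ASCII a-z and the Latin-1 letters
 * U+00E0..U+00FE (except U+00F7) of a UTF-8 string, in place.
 * All other sequences are skipped byte by byte; that is safe because
 * neither a-z nor 0xC3 can occur as a UTF-8 continuation byte. */
void utf8_latin1_toupper(char *s)
{
    assert(s != NULL);
    unsigned char *p = (unsigned char *)s;
    while (*p != '\0') {
        if (*p >= 0x61 && *p <= 0x7A) {            /* ASCII a-z */
            *p = (unsigned char)(*p - 0x20);
            p++;
        } else if (p[0] == 0xC3 &&
                   p[1] >= 0xA0 && p[1] <= 0xBE && /* U+00E0..U+00FE */
                   p[1] != 0xB7) {                 /* not U+00F7     */
            p[1] = (unsigned char)(p[1] - 0x20);   /* U+00C0..U+00DE */
            p += 2;
        } else {
            p++;
        }
    }
}
```

    Both the old and the new sequence are two bytes long, so the
    conversion can be done in place without reallocating.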

    --
     
    Eric Sosman, Mar 17, 2006
    #9