Using wchar_t instead of char

Discussion in 'C Programming' started by Michael Brennan, Jul 8, 2008.

  1. I guess this question only applies to programming applications for UNIX,
    Windows and similar. If one develops something for an embedded system
    I can understand that wchar_t would be unnecessary.

    I wonder if there is any point in using char over wchar_t? I don't see
    much code using wchar_t when reading other people's code (but then I
    haven't really looked much) or when following this newsgroup. To me it
    sounds reasonable to make sure your program can handle multibyte
    characters so that it can be used in as many places as possible.
    Is there any reason I should not use wchar_t for all my future programs?

    I am aware that on UNIX at least, if you use UTF-8, char works pretty
    well. But if you use wchar_t you don't need to rely on UTF-8, which
    makes it more portable, correct?

    (I of course do not mean just the type wchar_t, but all of the things
    in wide character land)

    Thanks

    --
    Michael Brennan
    Michael Brennan, Jul 8, 2008
    #1

  2. Michael Brennan wrote:
    >
    > I guess this question only applies to programming applications for
    > UNIX, Windows and similiar. If one develops something for an
    > embedded system I can understand that wchar_t would be unnecessary.
    >
    > I wonder if there is any point in using char over wchar_t? I don't
    > see much code using wchar_t when reading other people's code (but
    > then I haven't really looked much) or when following this newsgroup.
    > To me it sounds reasonable to make sure your program can handle
    > multibyte characters so that it can be used at as many places as
    > possible. Is there any reason I should not use wchar_t for all my
    > future programs?
    >
    > I am aware that on UNIX at least, if you use UTF-8, char works
    > pretty well. But if you use wchar_t you don't need to rely on UTF-8
    > and thus makes it more portable, correct?


    I believe that wchar etc. are only available in C99. Using them
    may seriously reduce your code portability.

    --
    [mail]: Chuck F (cbfalconer at maineline dot net)
    [page]: <http://cbfalconer.home.att.net>
    Try the download section.
    CBFalconer, Jul 8, 2008
    #2

  3. On Tue, 08 Jul 2008 21:12:54 +0000, Michael Brennan wrote:

    > I wonder if there is any point in using char over wchar_t? I don't see
    > much code using wchar_t when reading other people's code (but then I
    > haven't really looked much) or when following this newsgroup. To me it
    > sounds reasonable to make sure your program can handle multibyte
    > characters so that it can be used at as many places as possible. Is
    > there any reason I should not use wchar_t for all my future programs?
    >
    > I am aware that on UNIX at least, if you use UTF-8, char works pretty
    > well. But if you use wchar_t you don't need to rely on UTF-8 and thus
    > makes it more portable, correct?


    wchar_t is 32 bits on my system. That's a lot of space to use when I
    only need 7 bits. Also, not many widely distributed applications use
    wchar_t; text editors are just one example.

    More fundamentally, all sorts of I/O is done specifically in 8-bit bytes.
    IP is 8-bit based, as are files under Linux and most other operating
    systems. The problem is that it is very difficult to do a partial
    changeover. Every application would spend half of its time and code
    converting back and forth, and then what do you do when the conversion
    fails? How long, in wchar_t units, is a seven-byte file? One, perhaps,
    but then you have to add a whole load of error-handling code to every
    part of the program that interfaces with the char-based world.

    In C, memory is always dealt with in sizeof(char) units. Life might be
    made easier for the C programmer in a UTF16/24/32 world by increasing
    CHAR_BIT, but you still have the problems when you interface with the
    rest of the world.
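
    To make the error-handling point concrete, here is roughly the kind of
    boilerplate needed, using C95's mbrtowc. This is an untested sketch,
    not anyone's production code; it assumes setlocale(LC_CTYPE, "") has
    already been called, and the function name and parameters are made up:

        #include <string.h>
        #include <wchar.h>

        /* Count the characters in a multibyte buffer, coping with bad
           or truncated sequences. */
        size_t count_chars(const char *buf, size_t len)
        {
            mbstate_t st;
            wchar_t wc;
            size_t n = 0, used;

            memset(&st, 0, sizeof st);
            while (len > 0) {
                used = mbrtowc(&wc, buf, len, &st);
                if (used == (size_t)-1)        /* invalid sequence */
                    return (size_t)-1;
                if (used == (size_t)-2)        /* sequence cut short */
                    return (size_t)-1;         /* or wait for more input */
                if (used == 0)                 /* embedded null character */
                    used = 1;
                buf += used;
                len -= used;
                n++;
            }
            return n;
        }
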
    viza, Jul 9, 2008
    #3
  4. CBFalconer <> writes:

    > Michael Brennan wrote:
    >>
    >> I guess this question only applies to programming applications for
    >> UNIX, Windows and similiar. If one develops something for an
    >> embedded system I can understand that wchar_t would be unnecessary.
    >>
    >> I wonder if there is any point in using char over wchar_t? I don't
    >> see much code using wchar_t when reading other people's code (but
    >> then I haven't really looked much) or when following this newsgroup.
    >> To me it sounds reasonable to make sure your program can handle
    >> multibyte characters so that it can be used at as many places as
    >> possible. Is there any reason I should not use wchar_t for all my
    >> future programs?
    >>
    >> I am aware that on UNIX at least, if you use UTF-8, char works
    >> pretty well. But if you use wchar_t you don't need to rely on UTF-8
    >> and thus makes it more portable, correct?

    >
    > I believe that wchar etc. are only available in C99. Using them
    > may seriously reduce your code portability.


    I don't have a real copy of ISO C90 (ANSI C 89) so I am winging it a
    bit, but I am pretty sure that wchar_t was in there. C95 added some
    more related things (all of which ended up in C99) but using wchar_t
    should be very portable indeed[1]. Do you have a reference to C90
    without wchar_t? All I can cite are online versions of the ANSI
    standard as a .txt file and the C90 rationale at:

    http://www.lysator.liu.se/c/rat/title.html

    As soon as anyone with a copy to hand tells me otherwise, I will
    withdraw, but then again maybe someone will back me up.

    --
    Ben.
    Ben Bacarisse, Jul 9, 2008
    #4
  5. Michael Brennan <> writes:

    > I guess this question only applies to programming applications for UNIX,
    > Windows and similiar. If one develops something for an embedded system
    > I can understand that wchar_t would be unnecessary.


    I'd be very surprised if this were true, but I do not know much about
    embedded systems. My audio player seems to support all sorts of
    characters.

    > I wonder if there is any point in using char over wchar_t? I don't see
    > much code using wchar_t when reading other people's code (but then I
    > haven't really looked much) or when following this newsgroup. To me it
    > sounds reasonable to make sure your program can handle multibyte
    > characters so that it can be used at as many places as possible.
    > Is there any reason I should not use wchar_t for all my future
    > programs?


    It is not a simple "use one or the other".

    > I am aware that on UNIX at least, if you use UTF-8, char works pretty
    > well.


    Yes, but a truly portable program won't assume UTF-8. Even if you can
    assume it, converting to wide characters helps when you are doing lots
    of character counting operations. For example, finding the longest
    match of a pattern is complex if you keep everything in a multi-byte
    encoding like UTF-8.
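
    As a toy illustration of the byte/character distinction (assumptions:
    a UTF-8 locale is available, and the hard-coded string and buffer size
    are just for the example):

        #include <locale.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>

        int main(void)
        {
            const char *s = "na\xc3\xafve";  /* "naive" with a diaeresis, in UTF-8 */
            wchar_t wide[16];
            size_t nchars;

            setlocale(LC_CTYPE, "");
            nchars = mbstowcs(wide, s, sizeof wide / sizeof wide[0]);

            printf("strlen says %lu bytes\n", (unsigned long)strlen(s));
            if (nchars != (size_t)-1)
                printf("mbstowcs says %lu characters\n", (unsigned long)nchars);
            return 0;
        }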

    > But if you use wchar_t you don't need to rely on UTF-8 and thus
    > makes it more portable, correct?


    It is one of the components you need. Another is to use C's locale
    support. How portable you can be depends on what systems you are
    targeting since not all of the features of C99's wide character
    support are available on all compiler/library combinations. In fact,
    the maximally portable set of things you can do with a wchar_t (or an
    array of them) is very small. Here I hope an expert steps in and gives
    you real experience-based wisdom about portable use of wide-character
    support.
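
    In the meantime, here is a minimal sketch of those two pieces together:
    select the user's locale, then use the C95 <wctype.h> classification
    functions. How much of this actually works varies between
    compiler/library combinations, and the example string is arbitrary:

        #include <locale.h>
        #include <stdio.h>
        #include <wchar.h>
        #include <wctype.h>

        int main(void)
        {
            wchar_t name[] = L"Moli\x00e8re";
            size_t i, letters = 0;

            if (setlocale(LC_CTYPE, "") == NULL)
                fputs("no locale support\n", stderr);

            /* iswalpha classifies wide characters according to the
               selected locale, including letters beyond ASCII */
            for (i = 0; name[i] != L'\0'; i++)
                if (iswalpha(name[i]))
                    letters++;

            printf("%lu letters\n", (unsigned long)letters);
            return 0;
        }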

    > (I of course do not mean just the type wchar_t, but all of the things
    > in wide character land)


    --
    Ben.
    Ben Bacarisse, Jul 9, 2008
    #5
  6. Ben Bacarisse wrote:
    > CBFalconer <> writes:
    >> Michael Brennan wrote:
    >>>
    >>> I guess this question only applies to programming applications for
    >>> UNIX, Windows and similiar. If one develops something for an
    >>> embedded system I can understand that wchar_t would be unnecessary.
    >>>
    >>> I wonder if there is any point in using char over wchar_t? I don't
    >>> see much code using wchar_t when reading other people's code (but
    >>> then I haven't really looked much) or when following this newsgroup.
    >>> To me it sounds reasonable to make sure your program can handle
    >>> multibyte characters so that it can be used at as many places as
    >>> possible. Is there any reason I should not use wchar_t for all my
    >>> future programs?
    >>>
    >>> I am aware that on UNIX at least, if you use UTF-8, char works
    >>> pretty well. But if you use wchar_t you don't need to rely on UTF-8
    >>> and thus makes it more portable, correct?

    >>
    >> I believe that wchar etc. are only available in C99. Using them
    >> may seriously reduce your code portability.

    >
    > I don't have a real copy of ISO C90 (ANSI C 89) so I am winging it a
    > bit, but I am pretty sure that wchar_t was in there. C95 added some
    > more related things (all of which ended up in C99) but using wchar_t
    > should be very portable indeed[1]. Do you have a reference to C90
    > without wchar_t? All I can site is online versions of the ANSI
    > standard as a .txt file and the C90 rationale at:


    I am basing it on this excerpt from the C99 standard (N869):

    [#5] This edition replaces the previous edition, ISO/IEC
    9899:1990, as amended and corrected by ISO/IEC
    9899/COR1:1994, ISO/IEC 9899/COR2:1995, and ISO/IEC
    9899/AMD1:1995. Major changes from the previous edition
    include:

    -- restricted character set support in <iso646.h>
    (originally specified in AMD1)

    -- wide-character library support in <wchar.h> and
    <wctype.h> (originally specified in AMD1)

    --
    [mail]: Chuck F (cbfalconer at maineline dot net)
    [page]: <http://cbfalconer.home.att.net>
    Try the download section.
    CBFalconer, Jul 9, 2008
    #6
  7. On Tue, 08 Jul 2008 21:02:34 -0400, CBFalconer wrote:

    > Ben Bacarisse wrote:
    >> CBFalconer <> writes:
    >>> Michael Brennan wrote:
    >>> I believe that wchar etc. are only available in C99. Using them may
    >>> seriously reduce your code portability.

    >>
    >> I don't have a real copy of ISO C90 (ANSI C 89) so I am winging it a
    >> bit, but I am pretty sure that wchar_t was in there. C95 added some
    >> more related things (all of which ended up in C99) but using wchar_t
    >> should be very portable indeed[1]. Do you have a reference to C90
    >> without wchar_t? All I can site is online versions of the ANSI
    >> standard as a .txt file and the C90 rationale at:

    >
    > I am basing it on this excerpt from the C99 standard (N869):
    >
    > [#5] This edition replaces the previous edition, ISO/IEC
    > 9899:1990, as amended and corrected by ISO/IEC
    > 9899/COR1:1994, ISO/IEC 9899/COR2:1995, and ISO/IEC
    > 9899/AMD1:1995. Major changes from the previous edition include:
    >
    > -- restricted character set support in <iso646.h>
    > (originally specified in AMD1)
    >
    > -- wide-character library support in <wchar.h> and
    > <wctype.h> (originally specified in AMD1)


    The headers specified in that excerpt and all functions declared within
    are indeed new in AMD1/C99.

    The type wchar_t (from <stddef.h>) was present in C90. Additionally, the
    library functions mblen, mbtowc, wctomb, mbstowcs and wcstombs are
    available from <stdlib.h>.

    AMD1 is fairly widely implemented, anyway.
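
    So even a strict C90 target lets you walk a multibyte string one
    character at a time. A rough sketch (it assumes setlocale(LC_CTYPE, "")
    has been called and that the string is valid in that locale; the
    function name is made up):

        #include <stdlib.h>

        /* Count the characters in a multibyte string using only C90's
           <stdlib.h> interface. */
        size_t mb_count(const char *s)
        {
            size_t n = 0;
            int len;

            mblen(NULL, 0);                     /* reset any shift state */
            while ((len = mblen(s, MB_CUR_MAX)) > 0) {
                s += len;
                n++;
            }
            return len == 0 ? n : (size_t)-1;   /* -1 on an invalid sequence */
        }
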
    Nick Bowler, Jul 9, 2008
    #7
  8. On 2008-07-09, Ben Bacarisse <> wrote:
    > Michael Brennan <> writes:
    >
    >> I guess this question only applies to programming applications for UNIX,
    >> Windows and similiar. If one develops something for an embedded system
    >> I can understand that wchar_t would be unnecessary.

    >
    > I'd be very surprised if this were true, but I do not know much about
    > embedded systems. My audio player seems to support all sorts of
    > characters.


    My mistake, please ignore what I said about that.

    >> I wonder if there is any point in using char over wchar_t? I don't see
    >> much code using wchar_t when reading other people's code (but then I
    >> haven't really looked much) or when following this newsgroup. To me it
    >> sounds reasonable to make sure your program can handle multibyte
    >> characters so that it can be used at as many places as possible.
    >> Is there any reason I should not use wchar_t for all my future
    >> programs?

    >
    > It is not a simple "use one or the other".


    No, I understand now that it's more complicated, unfortunately.

    >> I am aware that on UNIX at least, if you use UTF-8, char works pretty
    >> well.

    >
    > Yes, but a truly portable program won't assume UTF-8. Even if you can
    > assume it, converting to wide characters helps when you are doing lots
    > of character counting operations. For example, finding the longest
    > match of a pattern is complex if you keep everything in a multi-byte
    > encoding like UTF-8.
    >
    >> But if you use wchar_t you don't need to rely on UTF-8 and thus
    >> makes it more portable, correct?

    >
    > It is one of the components you need. Another is to use C's locale
    > support. How portable you can be depends on what systems you are
    > targeting since not all of the features of C99's wide character
    > support are available on all compiler/library combinations. In fact,
    > the maximally portable set of things you can do with a wchar_t (or and
    > array of them) is very small. Here I hope an expert steps in a gives
    > you real experience-based wisdom about portable use of wide-character
    > support.
    >


    This isn't easy: I need to rely on C99 stuff, and according to viza
    programs will be inefficient. I always aim to write portable
    programs, but I also need to be able to use CJK characters, so I'm not
    really sure what to do here.

    I currently have a program that reads names and birthdates from a file
    and then does some calculations to show how many days left until their
    birthday and so on. It works well, but I also need to have names in
    Japanese in the file. My options are UTF-8 or wchar_t. I have to give up
    a lot of portability by choosing either of them. Any recommendation on
    which to choose?

    --
    Michael Brennan
    Michael Brennan, Jul 9, 2008
    #8
  9. On Wed, 09 Jul 2008 11:19:57 +0000, Michael Brennan wrote:

    > On 2008-07-09, Ben Bacarisse <> wrote:
    >> Michael Brennan <> writes:


    > I currently have a program that reads names and birthdates from a file
    > and then does some calculations to show how many days left until their
    > birthday and so on. It works well, but I also need to have names in
    > Japanese in the file. My options are UTF-8 or wchar_t. I have to give up
    > a lot of portability by choosing either of them. Any recommendation on
    > which to choose?


    What about UTF16 (probably as unsigned short)? It has the simplicity of
    programming with fixed width characters and you will be able to find text
    editors that can read and write the file more easily.

    Just a thought. As you've realised there isn't a perfect solution.

    viza
    viza, Jul 9, 2008
    #9
  10. On Wed, 09 Jul 2008 11:39:08 +0000, viza wrote:

    > What about UTF16 (probably as unsigned short)? It has the simplicity of
    > programming with fixed width characters and you will be able to find
    > text editors that can read and write the file more easily.


    Isn't UTF16 a variable-length format?


    Rui Maciel
    Rui Maciel, Jul 9, 2008
    #10
  11. Michael Brennan <> writes:

    <snip>
    > I currently have a program that reads names and birthdates from a file
    > and then does some calculations to show how many days left until their
    > birthday and so on. It works well, but I also need to have names in
    > Japanese in the file. My options are UTF-8 or wchar_t. I have to give up
    > a lot of portability by choosing either of them. Any recommendation on
    > which to choose?


    First, C does not assume UTF-8, though it is clearly the most likely
    multi-byte string encoding you will come across. When talking about
    standard, portable C, the choice is about if, and when, to convert
    between wide and multi-byte sequences.

    Secondly, do you have a choice about the input? You suggest that it
    is in a file, so you may have no choice, but the problem sounds like an
    assignment, so maybe you get to choose the input encoding.

    Either way, it does not sound as if either the wasted space of always
    using wide characters or the extra complexity of having multi-byte
    strings really matters for your application. If you get to choose,
    pick one and be happy. If you don't get to choose, go with what is
    mandated and don't convert.

    When I say "pick one" I don't mean at random. Different environments
    will favour different encodings. If your input will be prepared by an
    editor that makes entering Japanese as wide characters easy, then that
    would be a reason to choose wide character input.

    In general, if your input is in multi-byte strings, keep it that way.
    A typical reason to convert to wchar_t would be if you need to match it
    against other data that is already wchar_t or if your processing
    requires frequent access to single characters.

    It is much rarer to convert data that is already wide to multi-byte
    strings. You may save some space, you might not. You will end up with
    slightly more complex character processing.
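
    For your birthday program the overall shape might be something like the
    following sketch: keep each line as the multibyte text read from the
    file, and convert a copy to wchar_t only where per-character work is
    actually needed. The file name, line handling and buffer sizes here are
    invented for the example:

        #include <locale.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>

        int main(void)
        {
            char line[256];
            wchar_t wname[256];
            FILE *f;

            setlocale(LC_CTYPE, "");            /* use the user's locale */

            f = fopen("birthdays.txt", "r");
            if (f == NULL)
                return EXIT_FAILURE;

            while (fgets(line, sizeof line, f) != NULL) {
                line[strcspn(line, "\n")] = '\0';
                /* date fields can stay as plain char; they are ASCII */

                if (mbstowcs(wname, line, 256) == (size_t)-1)
                    continue;                   /* locale rejects this line */
                /* ... per-character processing of wname goes here ... */
            }
            fclose(f);
            return 0;
        }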

    --
    Ben.
    Ben Bacarisse, Jul 9, 2008
    #11
  12. In article <4874c3e7$0$29146$>,
    Rui Maciel <> wrote:
    >Isn't UTF16 a variable-length format?


    Yes, though if you don't need to interpret the characters above 0xFFFF
    you can pretend it isn't.

    -- Richard

    --
    Please remember to mention me / in tapes you leave behind.
    Richard Tobin, Jul 9, 2008
    #12
  13. On Jul 9, 1:13 am, Ben Bacarisse <> wrote:
    > Michael Brennan <> writes:
    > It is not a simple "use one or the other".
    >
    > > I am aware that on UNIX at least, if you use UTF-8, char works pretty
    > > well.

    >
    > Yes, but a truly portable program won't assume UTF-8.  Even if you can
    > assume it, converting to wide characters helps when you are doing lots
    > of character counting operations.  For example, finding the longest
    > match of a pattern is complex if you keep everything in a multi-byte
    > encoding like UTF-8.


    Indeed. A while ago I worked on code for index creation and scanning,
    porting it from an 8-bit character set to Unicode. In that case the
    context required the storage to be in UTF-8. In memory we would do
    on-the-fly conversion to UTF-32 to do pattern matching, counting,
    normalization (that's a veritable Pandora's box) and whatever else was
    required. For this we used ICU (International Components for Unicode),
    an IBM-developed library with a very permissive license that still
    seems to be actively maintained.

    Developing for Unicode does seem to require putting a lot of thought
    into how the application interacts with the environment, and the fewer
    assumptions you make about the environment, the hairier it gets.

    Stijn
    , Jul 9, 2008
    #13
  14. Rui Maciel <> writes:
    > On Wed, 09 Jul 2008 11:39:08 +0000, viza wrote:
    >
    >> What about UTF16 (probably as unsigned short)? It has the simplicity of
    >> programming with fixed width characters and you will be able to find
    >> text editors that can read and write the file more easily.

    >
    > Isn't UTF16 a variable-length format?


    Yes, but it's effectively fixed-length if you only use characters
    within the "Basic Multilingual Plane".

    With UTF16, you also have to consider byte order and the presence or
    absence of a Byte Order Mark.

    The Wikipedia article <http://en.wikipedia.org/wiki/UTF16> appears to
    be a good overview, with links to articles about other encodings.
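
    The surrogate mechanism itself is only a little arithmetic, which is
    why staying inside the BMP feels fixed-width. A sketch (the function
    names are mine; hi and lo are assumed to form a valid surrogate pair):

        /* A UTF-16 code unit outside 0xD800..0xDFFF is a complete BMP
           character by itself. */
        static int is_surrogate(unsigned u)
        {
            return u >= 0xD800 && u <= 0xDFFF;
        }

        /* Combine a high/low surrogate pair into a code point. */
        static unsigned long utf16_pair(unsigned hi, unsigned lo)
        {
            return 0x10000UL + (((unsigned long)hi - 0xD800UL) << 10)
                             + ((unsigned long)lo - 0xDC00UL);
        }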

    --
    Keith Thompson (The_Other_Keith) <http://www.ghoti.net/~kst>
    Nokia
    "We must do something. This is something. Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"
    Keith Thompson, Jul 9, 2008
    #14
  15. In article <fmifk5-ufj.ln1@wilbur.25thandClement.com>,
    William Ahern <william@wilbur.25thandClement.com> wrote:

    >There's no such thing as fixed-width Unicode characters. 8-bit, 16-bits,
    >32-bits, 128-bits or 1024-bits is insufficient.


    32 bits is plenty for Unicode.

    A more accurate claim would be about the sufficiency of Unicode.

    -- Richard
    --
    Please remember to mention me / in tapes you leave behind.
    Richard Tobin, Jul 9, 2008
    #15
  16. On 2008-07-09, Ben Bacarisse <> wrote:
    > Michael Brennan <> writes:
    >
    ><snip>
    >> I currently have a program that reads names and birthdates from a file
    >> and then does some calculations to show how many days left until their
    >> birthday and so on. It works well, but I also need to have names in
    >> Japanese in the file. My options are UTF-8 or wchar_t. I have to give up
    >> a lot of portability by choosing either of them. Any recommendation on
    >> which to choose?

    >
    > First, C does not assume UTF-8 though it is clearly the most likely
    > multi-byte string encoding you will come across. When talking about
    > standard, portable, C the choice is about if, and when, to convert
    > between wide and multi-byte sequences.
    >
    > Secondly, do you have a choice about the input? You suggest that it
    > is in a file, so you may have no choice about the input, but the
    > problem sounds like an assignment so maybe you get to choose the input
    > encoding.
    >
    > Either way, it does not sound as if either the wasted space of always
    > using wide characters nor the extra complexity of having multi-byte
    > strings really matters for your application. If you get to choose,
    > pick one and be happy. If you don't get to choose, go with what is
    > mandated and don't convert.
    >
    > When I say "pick one" I don't mean at random. Different environments
    > will favour different encodings. If your input will be prepared by an
    > editor that makes entering Japanese as wide characters easy, then that
    > would be a reason to choose wide character input.
    >
    > In general, if your input is as muti-byte strings, keep it that way.
    > A typical reason to convert to wchar_t would be if you need to match it
    > against other data that is already wchar_t or if your processing
    > requires frequent access to single characters.
    >
    > It is much more rare to convert data that is already wide to
    > multi-byte strings. You may save some space, you might not. You will
    > end up with slightly more complex character processing.
    >


    Thank you, and everyone else!

    --
    Michael Brennan
    Michael Brennan, Jul 10, 2008
    #16