Using wchar_t instead of char

  • Thread starter Michael Brennan

Michael Brennan

I guess this question only applies to programming applications for UNIX,
Windows and similar systems. If one develops something for an embedded
system, I can understand that wchar_t would be unnecessary.

I wonder if there is any point in using char over wchar_t? I don't see
much code using wchar_t when reading other people's code (but then I
haven't really looked much) or when following this newsgroup. To me it
sounds reasonable to make sure your program can handle multibyte
characters so that it can be used in as many places as possible.
Is there any reason I should not use wchar_t for all my future programs?

I am aware that on UNIX at least, if you use UTF-8, char works pretty
well. But if you use wchar_t you don't need to rely on UTF-8, which
makes the program more portable, correct?

(I of course do not mean just the type wchar_t, but all of the things
in wide character land)

Thanks
 
CBFalconer

Michael said:
I guess this question only applies to programming applications for
UNIX, Windows and similar systems. If one develops something for an
embedded system I can understand that wchar_t would be unnecessary.

I wonder if there is any point in using char over wchar_t? I don't
see much code using wchar_t when reading other people's code (but
then I haven't really looked much) or when following this newsgroup.
To me it sounds reasonable to make sure your program can handle
multibyte characters so that it can be used in as many places as
possible. Is there any reason I should not use wchar_t for all my
future programs?

I am aware that on UNIX at least, if you use UTF-8, char works
pretty well. But if you use wchar_t you don't need to rely on UTF-8,
which makes the program more portable, correct?

I believe that wchar etc. are only available in C99. Using them
may seriously reduce your code portability.
 
viza

I wonder if there is any point in using char over wchar_t? I don't see
much code using wchar_t when reading other people's code (but then I
haven't really looked much) or when following this newsgroup. To me it
sounds reasonable to make sure your program can handle multibyte
characters so that it can be used in as many places as possible. Is
there any reason I should not use wchar_t for all my future programs?

I am aware that on UNIX at least, if you use UTF-8, char works pretty
well. But if you use wchar_t you don't need to rely on UTF-8, which
makes the program more portable, correct?

wchar_t is 32 bits on my system. That's a lot of space to use when I
only need 7. Also, few widely distributed applications use wchar_t;
editors, for example, mostly do not.

More fundamentally all sorts of I/O is done specifically in 8 bit bytes.
IP is 8 bit based, as are files under Linux and most other operating
systems. The problem is that it is very difficult to do a partial
changeover. Every application would spend half of its time and code
converting back and forth, and then what do you do when it doesn't go?
How long in wchar_t is a seven byte file? One, perhaps, but then you
have to add a whole load of error handling code to every part of the
program that interfaces with the char based world.

In C, memory is always dealt with in sizeof(char) units. Life might be
made easier for the C programmer in a UTF16/24/32 world by increasing
CHAR_BIT, but you still have the problems when you interface with the
rest of the world.
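To put a rough number on the space argument, here is a minimal sketch that compares the storage needed for the same text as a char string and as a wchar_t string. The helper names narrow_bytes and wide_bytes are made up for illustration, and the result depends on sizeof(wchar_t), which differs between systems.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Bytes needed to store a char string, including the terminator. */
size_t narrow_bytes(const char *s)
{
    assert(s != NULL);
    return strlen(s) + 1;
}

/* Bytes needed to store a wchar_t string, including the terminator. */
size_t wide_bytes(const wchar_t *s)
{
    assert(s != NULL);
    return (wcslen(s) + 1) * sizeof(wchar_t);
}
```

On a system where wchar_t is 4 bytes, wide_bytes(L"hello") is 24 while narrow_bytes("hello") is 6.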
 
Ben Bacarisse

CBFalconer said:
I believe that wchar etc. are only available in C99. Using them
may seriously reduce your code portability.

I don't have a real copy of ISO C90 (ANSI C 89) so I am winging it a
bit, but I am pretty sure that wchar_t was in there. C95 added some
more related things (all of which ended up in C99) but using wchar_t
should be very portable indeed[1]. Do you have a reference to C90
without wchar_t? All I can cite is online versions of the ANSI
standard as a .txt file and the C90 rationale at:

http://www.lysator.liu.se/c/rat/title.html

As soon as anyone with a copy to hand tells me otherwise, I will
withdraw, but then again maybe someone will back me up.
 
Ben Bacarisse

Michael Brennan said:
I guess this question only applies to programming applications for UNIX,
Windows and similar systems. If one develops something for an embedded system
I can understand that wchar_t would be unnecessary.

I'd be very surprised if this were true, but I do not know much about
embedded systems. My audio player seems to support all sorts of
characters.

I wonder if there is any point in using char over wchar_t? I don't see
much code using wchar_t when reading other people's code (but then I
haven't really looked much) or when following this newsgroup. To me it
sounds reasonable to make sure your program can handle multibyte
characters so that it can be used in as many places as possible.
Is there any reason I should not use wchar_t for all my future
programs?

It is not a simple "use one or the other".

I am aware that on UNIX at least, if you use UTF-8, char works pretty
well.

Yes, but a truly portable program won't assume UTF-8. Even if you can
assume it, converting to wide characters helps when you are doing lots
of character counting operations. For example, finding the longest
match of a pattern is complex if you keep everything in a multi-byte
encoding like UTF-8.

But if you use wchar_t you don't need to rely on UTF-8, which
makes the program more portable, correct?

It is one of the components you need. Another is to use C's locale
support. How portable you can be depends on what systems you are
targeting since not all of the features of C99's wide character
support are available on all compiler/library combinations. In fact,
the maximally portable set of things you can do with a wchar_t (or an
array of them) is very small. Here I hope an expert steps in and gives
you real experience-based wisdom about portable use of wide-character
support.
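As an illustration of why counting is awkward in a multi-byte encoding, here is a minimal sketch that counts characters (code points) rather than bytes. It assumes the input is valid UTF-8, which is exactly the assumption a truly portable program would not make, and utf8_count is a made-up name, not a library function.

```c
#include <assert.h>
#include <stddef.h>

/* Count code points in a NUL-terminated string that is ASSUMED to be
   valid UTF-8. No validation is done. */
size_t utf8_count(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        /* Continuation bytes look like 10xxxxxx; every other byte
           starts a new code point. */
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For a six-byte UTF-8 string holding two Japanese characters, strlen reports 6 while utf8_count reports 2. With an array of wide characters the count would simply be the array length.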
 
CBFalconer

Ben said:
CBFalconer said:
I believe that wchar etc. are only available in C99. Using them
may seriously reduce your code portability.

I don't have a real copy of ISO C90 (ANSI C 89) so I am winging it a
bit, but I am pretty sure that wchar_t was in there. C95 added some
more related things (all of which ended up in C99) but using wchar_t
should be very portable indeed[1]. Do you have a reference to C90
without wchar_t? All I can cite is online versions of the ANSI
standard as a .txt file and the C90 rationale at:

I am basing it on this excerpt from the C99 standard (N869):

[#5] This edition replaces the previous edition, ISO/IEC
9899:1990, as amended and corrected by ISO/IEC
9899/COR1:1994, ISO/IEC 9899/COR2:1995, and ISO/IEC
9899/AMD1:1995. Major changes from the previous edition
include:

-- restricted character set support in <iso646.h>
(originally specified in AMD1)

-- wide-character library support in <wchar.h> and
<wctype.h> (originally specified in AMD1)
 
Nick Bowler

Ben said:
CBFalconer said:
I believe that wchar etc. are only available in C99. Using them may
seriously reduce your code portability.

I don't have a real copy of ISO C90 (ANSI C 89) so I am winging it a
bit, but I am pretty sure that wchar_t was in there. C95 added some
more related things (all of which ended up in C99) but using wchar_t
should be very portable indeed[1]. Do you have a reference to C90
without wchar_t? All I can cite is online versions of the ANSI
standard as a .txt file and the C90 rationale at:

I am basing it on this excerpt from the C99 standard (N869):

[#5] This edition replaces the previous edition, ISO/IEC
9899:1990, as amended and corrected by ISO/IEC
9899/COR1:1994, ISO/IEC 9899/COR2:1995, and ISO/IEC
9899/AMD1:1995. Major changes from the previous edition include:

-- restricted character set support in <iso646.h>
(originally specified in AMD1)

-- wide-character library support in <wchar.h> and
<wctype.h> (originally specified in AMD1)

The headers specified in that excerpt and all functions declared within
are indeed new in AMD1/C99.

The type wchar_t (from <stddef.h>) was present in C90. Additionally, the
library functions mblen, mbtowc, wctomb, mbstowcs and wcstombs are
available from <stdlib.h>.

AMD1 is fairly widely implemented, anyway.
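A minimal sketch of using those C90 functions: the wrapper below (to_wide is a hypothetical name) converts a multibyte string to wchar_t with mbstowcs and always terminates the result. What counts as a valid multibyte sequence depends on the current locale, so a real program would normally call setlocale(LC_ALL, "") first; in the default "C" locale only the basic characters are sure to convert.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Convert the multibyte string src into dst, which has room for dst_len
   wide characters including the terminator. Returns the number of wide
   characters stored (terminator not counted), or (size_t)-1 if src
   contains a sequence that is invalid in the current locale. */
size_t to_wide(wchar_t *dst, size_t dst_len, const char *src)
{
    size_t n;

    assert(dst != NULL && src != NULL && dst_len > 0);
    n = mbstowcs(dst, src, dst_len - 1);
    if (n == (size_t)-1)
        return n;
    dst[n] = L'\0';   /* mbstowcs does not terminate when it fills the buffer */
    return n;
}
```

Input longer than the buffer is silently truncated here; whether that is acceptable depends on the program.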
 
Michael Brennan

I'd be very surprised if this were true, but I do not know much about
embedded systems. My audio player seems to support all sorts of
characters.

My mistake, please ignore what I said about that.

It is not a simple "use one or the other".

No, I understand now that it's more complicated, unfortunately.

Yes, but a truly portable program won't assume UTF-8. Even if you can
assume it, converting to wide characters helps when you are doing lots
of character counting operations. For example, finding the longest
match of a pattern is complex if you keep everything in a multi-byte
encoding like UTF-8.


It is one of the components you need. Another is to use C's locale
support. How portable you can be depends on what systems you are
targeting since not all of the features of C99's wide character
support are available on all compiler/library combinations. In fact,
the maximally portable set of things you can do with a wchar_t (or an
array of them) is very small. Here I hope an expert steps in and gives
you real experience-based wisdom about portable use of wide-character
support.

This isn't easy: I need to rely on C99 features and, according to viza,
programs will be inefficient. I always aim to write portable
programs but I also need to be able to use CJK characters, so I'm not
really sure what to do here.

I currently have a program that reads names and birthdates from a file
and then does some calculations to show how many days left until their
birthday and so on. It works well, but I also need to have names in
Japanese in the file. My options are UTF-8 or wchar_t. I have to give up
a lot of portability by choosing either of them. Any recommendation on
which to choose?
 
viza

I currently have a program that reads names and birthdates from a file
and then does some calculations to show how many days left until their
birthday and so on. It works well, but I also need to have names in
Japanese in the file. My options are UTF-8 or wchar_t. I have to give up
a lot of portability by choosing either of them. Any recommendation on
which to choose?

What about UTF16 (probably as unsigned short)? It has the simplicity of
programming with fixed width characters and you will be able to find text
editors that can read and write the file more easily.

Just a thought. As you've realised there isn't a perfect solution.

viza
 
Rui Maciel

What about UTF16 (probably as unsigned short)? It has the simplicity of
programming with fixed width characters and you will be able to find
text editors that can read and write the file more easily.

Isn't UTF16 a variable-length format?


Rui Maciel
 
Ben Bacarisse

I currently have a program that reads names and birthdates from a file
and then does some calculations to show how many days left until their
birthday and so on. It works well, but I also need to have names in
Japanese in the file. My options are UTF-8 or wchar_t. I have to give up
a lot of portability by choosing either of them. Any recommendation on
which to choose?

First, C does not assume UTF-8 though it is clearly the most likely
multi-byte string encoding you will come across. When talking about
standard, portable, C the choice is about if, and when, to convert
between wide and multi-byte sequences.

Secondly, do you have a choice about the input? You suggest that it
is in a file, so you may have no choice about the input, but the
problem sounds like an assignment so maybe you get to choose the input
encoding.

Either way, it does not sound as if either the wasted space of always
using wide characters or the extra complexity of having multi-byte
strings really matters for your application. If you get to choose,
pick one and be happy. If you don't get to choose, go with what is
mandated and don't convert.

When I say "pick one" I don't mean at random. Different environments
will favour different encodings. If your input will be prepared by an
editor that makes entering Japanese as wide characters easy, then that
would be a reason to choose wide character input.

In general, if your input is in multi-byte strings, keep it that way.
A typical reason to convert to wchar_t would be if you need to match it
against other data that is already wchar_t or if your processing
requires frequent access to single characters.

It is much more rare to convert data that is already wide to
multi-byte strings. You may save some space, you might not. You will
end up with slightly more complex character processing.
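For the birthday program, keeping the data as multi-byte strings could look roughly like the sketch below. It assumes the file is UTF-8 and that each line has the made-up layout name;YYYY-MM-DD (the real file format was not given), and split_record is a hypothetical helper. In UTF-8 every byte of a multi-byte character has its high bit set, so searching for an ASCII separator with strchr cannot land inside a Japanese name and no conversion to wchar_t is needed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Split "name;date" in place. The name is never decoded, only pointed
   at, so it may hold any UTF-8 text.
   Returns 1 on success, 0 if the separator is missing. */
int split_record(char *line, char **name, char **date)
{
    char *sep;

    assert(line != NULL && name != NULL && date != NULL);
    sep = strchr(line, ';');
    if (sep == NULL)
        return 0;
    *sep = '\0';        /* terminate the name where the separator was */
    *name = line;
    *date = sep + 1;
    return 1;
}
```

The date arithmetic then works on plain ASCII digits, and the name is printed back exactly as it was read.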
 
micans

Yes, but a truly portable program won't assume UTF-8.  Even if you can
assume it, converting to wide characters helps when you are doing lots
of character counting operations.  For example, finding the longest
match of a pattern is complex if you keep everything in a multi-byte
encoding like UTF-8.

Indeed. A while ago I worked on code for index creation and scanning,
porting it from an 8-bit character set to Unicode. In that case, the
context required the storage to be in UTF-8. In memory we would do
on-the-fly conversion to UTF-32 to do pattern matching, counting,
normalization (that's a veritable Pandora's box) and whatever else was
required. For this we used ICU (International Components for Unicode),
an IBM-developed library with a very permissive license that still
seems to be actively maintained.

Developing for Unicode does seem to require putting a lot of thought
into how the application interacts with the environment, and the fewer
assumptions you make about the environment, the hairier it gets.

Stijn
 
Keith Thompson

Rui Maciel said:
Isn't UTF16 a variable-length format?

Yes, but it's effectively fixed-length if you only use characters
within the "Basic Multilingual Plane".

With UTF16, you also have to consider byte order and the presence or
absence of a Byte Order Mark.

The Wikipedia article <http://en.wikipedia.org/wiki/UTF16> appears to
be a good overview, with links to articles about other encodings.
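To make the variable-length point concrete, here is a minimal sketch of decoding one character from UTF-16 code units held in unsigned short, as suggested earlier in the thread. The function name utf16_decode is made up, each array element is assumed to hold a value no larger than 0xFFFF, and byte order is assumed to have been dealt with already when the units were read.

```c
#include <assert.h>
#include <stddef.h>

/* Decode one code point from the first units of u (len units available).
   Returns the number of units consumed: 1 for a character in the Basic
   Multilingual Plane, 2 for a surrogate pair, 0 for malformed input. */
size_t utf16_decode(const unsigned short *u, size_t len, unsigned long *cp)
{
    assert(u != NULL && cp != NULL);
    if (len == 0)
        return 0;
    if (u[0] < 0xD800 || u[0] > 0xDFFF) {
        *cp = u[0];                      /* ordinary BMP character */
        return 1;
    }
    if (u[0] <= 0xDBFF && len >= 2 && u[1] >= 0xDC00 && u[1] <= 0xDFFF) {
        /* High surrogate followed by low surrogate. */
        *cp = 0x10000UL
            + (((unsigned long)(u[0] - 0xD800) << 10)
               | (unsigned long)(u[1] - 0xDC00));
        return 2;
    }
    return 0;                            /* lone or misordered surrogate */
}
```

Some CJK characters, for example U+20BB7, lie outside the Basic Multilingual Plane and need the two-unit path, so indexing the array directly only works if such characters are ruled out.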
 
Richard Tobin

William Ahern said:
There's no such thing as fixed-width Unicode characters. 8-bit, 16-bits,
32-bits, 128-bits or 1024-bits is insufficient.

32 bits is plenty for Unicode.

A more accurate claim would be about the sufficiency of Unicode.

-- Richard
 
Michael Brennan

First, C does not assume UTF-8 though it is clearly the most likely
multi-byte string encoding you will come across. When talking about
standard, portable, C the choice is about if, and when, to convert
between wide and multi-byte sequences.

Secondly, do you have a choice about the input? You suggest that it
is in a file, so you may have no choice about the input, but the
problem sounds like an assignment so maybe you get to choose the input
encoding.

Either way, it does not sound as if either the wasted space of always
using wide characters or the extra complexity of having multi-byte
strings really matters for your application. If you get to choose,
pick one and be happy. If you don't get to choose, go with what is
mandated and don't convert.

When I say "pick one" I don't mean at random. Different environments
will favour different encodings. If your input will be prepared by an
editor that makes entering Japanese as wide characters easy, then that
would be a reason to choose wide character input.

In general, if your input is in multi-byte strings, keep it that way.
A typical reason to convert to wchar_t would be if you need to match it
against other data that is already wchar_t or if your processing
requires frequent access to single characters.

It is much more rare to convert data that is already wide to
multi-byte strings. You may save some space, you might not. You will
end up with slightly more complex character processing.

Thank you, and everyone else!
 
