How to handle Unicode?

Rui Maciel · Feb 27, 2007

I want to support Unicode in a pet project of mine (a small markup language
parser). I've read a bit about Unicode (I haven't delved beyond the basics)
and searched for information on how to support Unicode in C programs.
Unfortunately, I couldn't find anything more than loose ends, short blog
entries and side remarks, none of which goes into specifics.

From what I gathered, the two main methods (based on the standard library,
not third-party libraries) for working with Unicode are using wchar_t
extensively and interpreting UTF-8 in "regular" C strings.

As far as I know (correct me if I'm wrong), the wchar_t approach is not
portable across platforms or even compilers, so it is automatically off
the table. That leaves UTF-8 in "regular" C strings as the only solution.
Yet, since some characters extend beyond a single byte, I believe a program
needs some extra work to handle them.
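
To make the "extra work" concrete, here is a minimal sketch of decoding one UTF-8 sequence by hand. The function name utf8_decode is hypothetical (it is not part of any standard library), and the sketch assumes a NUL-terminated input string.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one UTF-8 sequence starting at s.
 * Stores the code point in *cp and returns the number of bytes consumed,
 * or 0 if the sequence is malformed. A '\0' byte decodes as U+0000 with
 * length 1, so callers should stop at the terminator themselves. */
size_t utf8_decode(const char *s, uint32_t *cp)
{
    /* smallest code point that legitimately needs 2, 3 or 4 bytes */
    static const uint32_t min_for_len[] = { 0, 0, 0x80, 0x800, 0x10000 };
    const unsigned char *p = (const unsigned char *)s;
    size_t len, i;
    uint32_t c;

    if (p[0] < 0x80) {                    /* plain ASCII, one byte */
        *cp = p[0];
        return 1;
    } else if ((p[0] & 0xE0) == 0xC0) {   /* 110xxxxx: 2-byte sequence */
        len = 2; c = p[0] & 0x1F;
    } else if ((p[0] & 0xF0) == 0xE0) {   /* 1110xxxx: 3-byte sequence */
        len = 3; c = p[0] & 0x0F;
    } else if ((p[0] & 0xF8) == 0xF0) {   /* 11110xxx: 4-byte sequence */
        len = 4; c = p[0] & 0x07;
    } else {
        return 0;                         /* stray continuation or invalid lead byte */
    }

    for (i = 1; i < len; i++) {
        if ((p[i] & 0xC0) != 0x80)        /* must be 10xxxxxx; also catches early '\0' */
            return 0;
        c = (c << 6) | (p[i] & 0x3F);
    }

    /* reject overlong forms, UTF-16 surrogates and values above U+10FFFF */
    if (c < min_for_len[len] || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;

    *cp = c;
    return len;
}
```

A parser would call this in a loop, advancing its pointer by the returned length and treating a return value of 0 as an encoding error.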

So, what are your views on this subject? How do you handle Unicode in your C
code, and what precautions do you take?

Thanks in advance
Rui Maciel

Richard Tobin · Feb 27, 2007

Rui Maciel said:
As far as I know (correct me if I'm wrong), the wchar_t approach is not
portable across platforms or even compilers, so it is automatically off
the table. That leaves UTF-8 in "regular" C strings as the only solution.
Yet, since some characters extend beyond a single byte, I believe a program
needs some extra work to handle them.

So, what are your views on this subject? How do you handle Unicode in your C
code, and what precautions do you take?

In my XML parser, I use 16-bit integers internally to represent data
in UTF-16, converting from external encodings "by hand". I chose this
because my applications needed to do things like regular expression
matching, and when I started, wide character support was even less
portable than it is now. (I rejected UTF-32 for its space
inefficiency.)

With hindsight, this was a mistake. It made it harder for others to
write applications based on the library, I had to provide numerous
support functions, and I still had to deal with multi-word characters
(using surrogates) in several places.
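
For readers unfamiliar with surrogates, the following is a rough sketch of how a code point above U+FFFF ends up as two 16-bit units in UTF-16. The name utf16_encode is hypothetical and is not taken from the parser described above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF-16.
 * Writes 1 or 2 code units to out and returns how many were written,
 * or 0 if cp is not a valid Unicode scalar value. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;                          /* out of range, or a bare surrogate */

    if (cp < 0x10000) {                    /* fits in a single 16-bit unit */
        out[0] = (uint16_t)cp;
        return 1;
    }

    cp -= 0x10000;                         /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate: bottom 10 bits */
    return 2;
}
```

Any code that indexes or splits a UTF-16 buffer has to keep such pairs together, which is the multi-word handling mentioned above.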

UTF-8 has the advantage that many single-byte string functions will
still work unchanged: strlen() (when interpreted as meaning length in
bytes), strcpy(), and all the others that give no byte except '\0' a
special meaning. You can use ordinary string literals. You can even use
strchr() etc. when searching for ASCII characters, because every byte of
a multi-byte sequence has its high bit set and so never matches an ASCII
character. Writing a regular expression matcher would of course have been
slightly more tedious, but not enormously so.
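
A small sketch of the distinction, assuming the input is valid UTF-8. The function utf8_count is a hypothetical helper; strlen() and strchr() are the standard ones.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count code points in a valid UTF-8 string by
 * counting every byte that is not a continuation byte (10xxxxxx). */
size_t utf8_count(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}
```

For the string "caf\xC3\xA9=1" (the word "café" followed by "=1"), strlen() reports 7 bytes while utf8_count() reports 6 characters, and strchr(s, '=') still finds the '=' at byte offset 5.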

-- Richard

Stephen Sprunk · Feb 27, 2007

Rui Maciel said:
I want to support Unicode in a pet project of mine (a small markup language
parser). I've read a bit about Unicode (I haven't delved beyond the basics)
and searched for information on how to support Unicode in C programs.
Unfortunately, I couldn't find anything more than loose ends, short blog
entries and side remarks, none of which goes into specifics.

Sadly, that's my experience too. It's hard to find public examples of code
that properly handles i18n.

From what I gathered, the two main methods (based on the standard library,
not third-party libraries) for working with Unicode are using wchar_t
extensively and interpreting UTF-8 in "regular" C strings.

Not quite. Those things actually work together. Internally, your data
should be wchar_t. However, one needs to pick a representation for
communicating with the outside world (via networks, files, etc.), and UTF-8
is a good choice for that since it's reasonably compact and (this is the
important part) it's compatible with programs that are not i18n-aware.

You do _not_ want to write wchar_t's directly to a file in binary mode,
since different systems may have different ideas of what a wchar_t is. This
is not so different from problems where systems disagree on what a char is,
but that's less common these days (but still a problem if you want to write
truly portable code).
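
One way to avoid dumping raw wchar_t values is to serialize each character into an explicitly defined encoding before writing. The sketch below encodes a single code point as UTF-8 bytes; utf8_encode is a hypothetical name, and it assumes the internal value is a Unicode code point, which the C standard does not guarantee for wchar_t.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF-8.
 * Writes 1 to 4 bytes to out and returns the count, or 0 if cp is not a
 * valid Unicode scalar value. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;

    if (cp < 0x80) {                       /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

The resulting bytes can be written with fwrite(), and the file then has the same contents regardless of how wide wchar_t is on the writing system.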

As far as I know (correct me if I'm wrong), the wchar_t approach is not
portable across platforms or even compilers, so it is automatically off
the table.

That's not quite correct. Any implementation has to provide you a working
wchar_t and the various functions to manipulate them. Unfortunately, there
is no guarantee it will provide the locales needed to input/output those
wchar_t's in a particular representation (such as UTF-8). This is why
there's not much good portable code floating around, since you can't count
on the implementation being usable in practice; if you want your code to be
portable, you have to rely on a third-party library and hope that _it_ is
portable.
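
A minimal sketch of that limitation, using only standard calls (setlocale() and mbstowcs()). The wrapper name to_wide is hypothetical. The point is that the setlocale() call can fail, and the program has to check for it.

```c
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical wrapper: convert a multibyte string to wide characters
 * under the named locale. Returns the number of wide characters written
 * (not counting the terminator), or (size_t)-1 if the locale is not
 * available or the input is invalid in that locale.
 * Note: this changes the program's global LC_CTYPE setting, and dst is
 * not terminated if the result fills all cap elements. */
size_t to_wide(const char *locale_name, const char *src,
               wchar_t *dst, size_t cap)
{
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return (size_t)-1;     /* the implementation does not provide it */

    return mbstowcs(dst, src, cap);
}
```

Calling to_wide("C", "abc", buf, 8) works everywhere, but converting UTF-8 input needs a locale name such as "en_US.UTF-8". That name is an assumption about the host system: it may be missing or spelled differently, which is exactly the portability problem described above.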

S

user923005 · Feb 27, 2007

I like ICU:
http://www-306.ibm.com/software/globalization/icu/index.jsp

SM Ryan · Feb 28, 2007

# I want to support Unicode in a pet project of mine (a small markup language
# parser). I've read a bit about Unicode (I haven't delved beyond the basics)
# and searched for information on how to support Unicode in C programs.
# Unfortunately, I couldn't find anything more than loose ends, short blog
# entries and side remarks, none of which goes into specifics.

There are libraries, ported to many systems, that handle the UTF
encodings, Unicode, and other encodings. The Tcl library, for example,
can probably do just about anything you want, and it has been ported
to probably any system you want to run on.
