How to handle Unicode?


Rui Maciel

I want to support Unicode in a pet project of mine (a small markup language
parser). I've read a bit about Unicode (I haven't delved beyond the basics)
and I searched for some info on how to support Unicode in C programs.
Unfortunately I wasn't able to find articles that could be considered more
than loose ends, small blog entries and side remarks, never delving too
much into specifics.

From what I gathered, the two main methods (based on the standard library,
not 3rd party libraries) used when working with Unicode are the extensive use
of wchar_t and interpreting UTF-8 stored in "regular" C strings.

As far as I know (correct me if I'm wrong) the wchar_t approach is not
portable across platforms or even compilers. Therefore it is automatically
off the table. So that leaves UTF-8 and the "regular" C strings as the
only solution. Yet, as some characters extend beyond a single byte, I
believe a program must do some sort of extra work to handle those.

So, what are your views on this subject? How do you handle Unicode in your C
code and what precautions do you take to be able to handle it?


Thanks in advance
Rui Maciel
 

Richard Tobin

Rui Maciel said:
As far as I know (correct me if I'm wrong) the wchar_t approach is not
portable across platforms or even compilers. Therefore it is automatically
out of the table. So that leaves UTF-8 and the "regular" c strings as the
only solution. Yet, as some characters extend themselves beyond a single
byte, I believe a program must need some sort of work to handle those.
So, what are your views on this subject? How do you handle Unicode on your C
code and what precautions do you take to be able to handle it?

In my XML parser, I use 16-bit integers internally to represent data
in UTF-16, converting from external encodings "by hand". I chose this
because my applications needed to do things like regular expression
matching, and when I started, wide character support was even less
portable than it is now. (I rejected UTF-32 for its space
inefficiency.)

With hindsight, this was a mistake. It made it harder for others to
write applications based on the library, I had to provide numerous
support functions, and I still had to deal with multi-word characters
(using surrogates) in several places.
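To illustrate the "multi-word characters" mentioned above, here is a minimal sketch of decoding one code point from UTF-16, including surrogate pairs. The function name utf16_decode and its return convention are hypothetical, chosen for this example; it is not taken from the XML parser described in this post.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one code point from a sequence of UTF-16 code units.
 * Stores the code point in *out and returns the number of units
 * consumed (1 or 2), or 0 for empty input or an unpaired surrogate. */
size_t utf16_decode(const uint16_t *s, size_t len, uint32_t *out)
{
    uint16_t hi, lo;

    if (len == 0)
        return 0;
    hi = s[0];
    if (hi < 0xD800 || hi > 0xDFFF) {   /* ordinary BMP character */
        *out = hi;
        return 1;
    }
    if (hi >= 0xDC00 || len < 2)        /* lone low surrogate, or truncated pair */
        return 0;
    lo = s[1];
    if (lo < 0xDC00 || lo > 0xDFFF)     /* high surrogate not followed by a low one */
        return 0;
    /* Each surrogate carries 10 bits; the pair encodes cp - 0x10000. */
    *out = 0x10000 + (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
    return 2;
}
```

For example, the units 0xD834 0xDD1E decode to U+1D11E and consume two units, so any code that indexes or truncates a UTF-16 buffer has to take care not to split such a pair.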

UTF-8 has the advantage that many single-byte string functions will
still work unchanged: strlen() (when interpreted as meaning length in
bytes), strcpy(), all the ones that don't interpret bytes except '\0'.
You can use ordinary string literals. You can even use strchr() etc
when searching for ASCII characters. Writing a regular expression
matcher would of course have been slightly more tedious, but not
enormously so.
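As a rough sketch of the point about byte-oriented functions: strlen() reports bytes, and counting characters only requires skipping UTF-8 continuation bytes (those of the form 10xxxxxx). The helper name utf8_count is hypothetical, and the sketch assumes the input is already valid UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* Count code points in a NUL-terminated UTF-8 string by counting every
 * byte that is not a continuation byte (10xxxxxx).
 * Assumes the input is valid UTF-8; no validation is done here. */
size_t utf8_count(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

With the string "na\xC3\xAFve=\xE2\x82\xAC" (the UTF-8 bytes for "naïve=€"), strlen() gives 10 while utf8_count() gives 7, and strchr(s, '=') still finds the '=' at byte offset 6, because no byte of a multi-byte UTF-8 sequence can be mistaken for an ASCII character.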

-- Richard
 

Stephen Sprunk

Rui Maciel said:
I want to support Unicode on a pet project of mine (small markup language
parser). I've read a bit about Unicode (didn't delved beyond the basics)
and I searched for some info on how to support Unicode on C programs.
Unfortunately I wasn't able to find articles that could be considered more
than loose ends, small blog entries and side remarks, never delving too
much into specifics.

Sadly, that's my experience too. It's hard to find public examples of code
that properly handles i18n.

From what I gathered, the two main methods (based on standard, not 3rd party
libraries) which are used when working with Unicode are the extensive use
of wchar_t and interpret UTF-8 from "regular" c strings.

Not quite. Those things actually work together. Internally, your data
should be wchar_t. However, one needs to pick a representation for
communicating with the outside world (via networks, files, etc.), and UTF-8
is a good choice for that since it's reasonably compact and (this is the
important part) it's compatible with programs that are not i18n-aware.

You do _not_ want to write wchar_t's directly to a file in binary mode,
since different systems may have different ideas of what a wchar_t is. This
is not so different from problems where systems disagree on what a char is,
but that's less common these days (but still a problem if you want to write
truly portable code).
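One way to avoid writing raw wchar_t values is to serialize each code point as UTF-8 bytes yourself. The following is a minimal sketch of such an encoder; the name utf8_encode is hypothetical, and it assumes the value passed in is a Unicode code point (which a wchar_t is not guaranteed to hold on every implementation).

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8 into buf (room for 4 bytes).
 * Returns the number of bytes written, or 0 if cp is a surrogate
 * or lies beyond U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char *buf)
{
    if (cp < 0x80) {                        /* 0xxxxxxx */
        buf[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                       /* 110xxxxx 10xxxxxx */
        buf[0] = (unsigned char)(0xC0 | (cp >> 6));
        buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                     /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                       /* surrogates are not characters */
        buf[0] = (unsigned char)(0xE0 | (cp >> 12));
        buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        buf[0] = (unsigned char)(0xF0 | (cp >> 18));
        buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

The resulting bytes can be written with fwrite() and mean the same thing on any system, regardless of how wide wchar_t is or what byte order the machine uses.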

As far as I know (correct me if I'm wrong) the wchar_t approach is not
portable across platforms or even compilers. Therefore it is automatically
out of the table.

That's not quite correct. Any implementation has to provide you a working
wchar_t and the various functions to manipulate them. Unfortunately, there
is no guarantee it will provide the locales needed to input/output those
wchar_t's in a particular representation (such as UTF-8). This is why
there's not much good portable code floating around, since you can't count
on the implementation being usable in practice; if you want your code to be
portable, you have to rely on a third-party library and hope that _it_ is
portable.
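The locale dependence described above can be sketched as follows. The standard functions setlocale() and mbstowcs() exist everywhere, but the locale names are implementation-defined: "C.UTF-8" and "en_US.UTF-8" are common guesses, not guaranteed to be available. The helper name select_utf8_locale is hypothetical, and the check assumes wchar_t values are Unicode code points (as when __STDC_ISO_10646__ is defined).

```c
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

/* Try to select a locale whose multibyte encoding is UTF-8.
 * Returns 1 on success, 0 if no candidate works (the LC_CTYPE
 * locale is then reset to "C"). The candidate names are guesses. */
int select_utf8_locale(void)
{
    static const char *const candidates[] = { "", "C.UTF-8", "en_US.UTF-8" };
    size_t i;

    for (i = 0; i < sizeof candidates / sizeof candidates[0]; i++) {
        wchar_t w[4];

        if (setlocale(LC_CTYPE, candidates[i]) == NULL)
            continue;                       /* locale not installed */
        /* Probe: the UTF-8 bytes C3 A9 should decode to the single
         * character U+00E9 if this locale really uses UTF-8. */
        if (mbstowcs(w, "\xC3\xA9", 4) == 1 && w[0] == 0xE9)
            return 1;
    }
    setlocale(LC_CTYPE, "C");
    return 0;
}
```

When this returns 0, the program has a working wchar_t but no standard way to convert UTF-8 input into it, which is the situation where people either fall back to a third-party library or decode UTF-8 by hand.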

S
 

SM Ryan

# I want to support Unicode on a pet project of mine (small markup language
# parser). I've read a bit about Unicode (didn't delved beyond the basics)
# and I searched for some info on how to support Unicode on C programs.
# Unfortunately I wasn't able to find articles that could be considered more
# than loose ends, small blog entries and side remarks, never delving too
# much into specifics.

There are libraries ported to many systems that can do UTF,
Unicode, and other encodings. The Tcl library, for example, can
probably do just about anything you want, and it has been
ported to probably any system you want to run on.
 
