Using Unicode in C programs

  • Thread starter Marco Iannaccone

Marco Iannaccone

I'd like to start using Unicode (especially UTF-8) in my C programs, and
would like some information on how to start.
Can you point me to some documents (possibly online) explaining Unicode
and UTF-8, and how I can use them in my programs (writing to and reading
from files and the console, processing Unicode strings and chars
inside the program, etc.)?

Thanx
 

Simon Biber


C provides a concept of wide characters (arrays of wchar_t) and
multibyte characters (arrays of char where each character may take up
more than one byte). The C standard defines functions for converting
between wide and multibyte representations. The standard does not
specify what encoding these two representational forms take.
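As a minimal sketch of those conversion functions: the helper below turns a multibyte string in the current locale's encoding into a freshly allocated wide string. The name mb_to_wide is made up for illustration; mbsrtowcs is the restartable conversion function from <wchar.h>, and calling it with a null destination only counts the wide characters needed.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Convert a multibyte string (in the current locale's encoding) to a
   newly allocated wide string. Returns NULL on an invalid sequence or
   on allocation failure; the caller frees the result.
   mb_to_wide is a hypothetical helper, not a standard function. */
wchar_t *mb_to_wide(const char *mbs)
{
    mbstate_t state;
    const char *src = mbs;
    size_t len;
    wchar_t *wide;

    memset(&state, 0, sizeof state);          /* initial conversion state */
    len = mbsrtowcs(NULL, &src, 0, &state);   /* count only, store nothing */
    if (len == (size_t)-1)
        return NULL;                          /* invalid multibyte sequence */

    wide = malloc((len + 1) * sizeof *wide);
    if (wide == NULL)
        return NULL;

    src = mbs;
    memset(&state, 0, sizeof state);
    mbsrtowcs(wide, &src, len + 1, &state);   /* converts and null-terminates */
    return wide;
}
```

With plain ASCII input this behaves the same in any locale; what happens to bytes above 0x7F depends entirely on the locale in effect when it is called.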

On at least one platform, depending on the current locale setting, the
wide characters built in to C represent Unicode characters, and the
multibyte characters represent the UTF-8 form.
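Whether wchar_t holds Unicode code points can be checked at compile time: C99 says an implementation defines the macro __STDC_ISO_10646__ when wchar_t values are ISO/IEC 10646 code points. A small sketch, with a function name invented for this example:

```c
/* Returns 1 if this implementation promises that wchar_t values are
   ISO/IEC 10646 (Unicode) code points, 0 if it makes no such promise.
   wchar_is_unicode is a name invented for this sketch. */
int wchar_is_unicode(void)
{
#ifdef __STDC_ISO_10646__
    return 1;   /* the macro expands to a date of the form yyyymmL */
#else
    return 0;   /* wchar_t encoding is left unspecified here */
#endif
}
```

This says nothing about the multibyte side; whether char strings are treated as UTF-8 still depends on the locale that was set.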

The following program attempts to set the locale to en_AU.UTF-8, which
means Australian English in UTF-8 encoding. The language portion doesn't
matter, just the encoding does. It then takes a UTF-8 string (which
happens to contain Simplified Chinese characters), and converts it to
the wide character representation, which on my platform is equivalent to
Unicode.

#include <locale.h>
#include <stdlib.h>
#include <stdio.h>

int main(void)
{
    wchar_t ucs2[5];   /* four characters plus the terminating null */

    if (!setlocale(LC_ALL, "en_AU.UTF-8"))
    {
        fprintf(stderr, "Unable to set locale to Australian English in UTF-8\n");
        return EXIT_FAILURE;
    }

    /* The UTF-8 representation of string "水调歌头"
       (four Chinese characters pronounced shui3 diao4 ge1 tou2) */
    const char *utf8 = "\xE6\xB0\xB4\xE8\xB0\x83\xE6\xAD\x8C\xE5\xA4\xB4";

    /* mbstowcs returns (size_t)-1 if the input is invalid in this locale */
    if (mbstowcs(ucs2, utf8, sizeof ucs2 / sizeof *ucs2) == (size_t)-1)
    {
        fprintf(stderr, "Invalid multibyte sequence\n");
        return EXIT_FAILURE;
    }

    /* Dump the original bytes */
    printf("UTF-8: ");
    for (const char *p = utf8; *p; p++)
        printf("%02X ", (unsigned)(unsigned char)*p);
    printf("\n");

    /* Dump the converted wide characters */
    printf("Unicode: ");
    for (const wchar_t *p = ucs2; *p; p++)
        printf("U+%04lX ", (unsigned long)*p);
    printf("\n");

    return 0;
}

[sbiber@eagle c]$ c99 -Wall utf8ucs2.c -o utf8ucs2
[sbiber@eagle c]$ ./utf8ucs2
UTF-8: E6 B0 B4 E8 B0 83 E6 AD 8C E5 A4 B4
Unicode: U+6C34 U+8C03 U+6B4C U+5934

I'd be interested to know how widely this technique works. Is it
portable?
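On the portability question: locale names such as en_AU.UTF-8 are implementation-defined, so a program that must behave the same everywhere cannot count on one being available. One alternative, offered as a sketch and not as anything proposed in this thread, is to decode UTF-8 by hand. The helper name utf8_decode and its interface are made up for illustration.

```c
#include <stddef.h>

/* Decode one UTF-8 sequence starting at s, with at most n bytes available.
   Stores the code point in *cp and returns the number of bytes consumed,
   or 0 if the sequence is truncated or malformed.
   utf8_decode is a hypothetical helper, not a standard library function. */
size_t utf8_decode(const char *s, size_t n, unsigned long *cp)
{
    const unsigned char *u = (const unsigned char *)s;
    unsigned long c, min;
    size_t len, i;

    if (n == 0)
        return 0;

    /* The lead byte determines the sequence length and the smallest
       code point that is allowed to use that length. */
    if (u[0] < 0x80)                { c = u[0];        len = 1; min = 0; }
    else if ((u[0] & 0xE0) == 0xC0) { c = u[0] & 0x1F; len = 2; min = 0x80; }
    else if ((u[0] & 0xF0) == 0xE0) { c = u[0] & 0x0F; len = 3; min = 0x800; }
    else if ((u[0] & 0xF8) == 0xF0) { c = u[0] & 0x07; len = 4; min = 0x10000; }
    else
        return 0;                   /* stray continuation or invalid lead byte */

    if (len > n)
        return 0;                   /* truncated sequence */

    for (i = 1; i < len; i++) {
        if ((u[i] & 0xC0) != 0x80)
            return 0;               /* not a continuation byte */
        c = (c << 6) | (u[i] & 0x3F);
    }

    /* Reject overlong forms, surrogates, and values beyond U+10FFFF. */
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;

    *cp = c;
    return len;
}
```

Called on the bytes E6 B0 B4 it yields 0x6C34 with a length of 3, matching the first character in the output above. Unlike mbstowcs, it does not depend on setlocale or on how wide wchar_t is.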
 

Alexei A. Frounze


The best and most authoritative source of information on all aspects of Unicode
is www.unicode.org.
At a minimum, read the Unicode FAQ and the article "To the BMP and
beyond!" by Eric Muller of Adobe Systems (it should be linked somewhere
at unicode.org -- or just google for it). Read it carefully.
Unlike ASCII, Unicode isn't guaranteed to be supported by every
compiler on every system. But, to the best of my knowledge,
recent Linux distros support UTF-8 in functions like printf() and fopen().
Once again, make use of www.unicode.org.
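Since UTF-8 text is just a sequence of bytes, the byte-oriented stdio functions (fputs, fwrite, fgets) carry it unchanged, so storing it in a file needs no special support. As a rough sketch, the code below encodes a code point as UTF-8 and writes it to a stream. utf8_encode and put_utf8 are hypothetical helpers, not library functions.

```c
#include <stddef.h>
#include <stdio.h>

/* Encode code point cp as UTF-8 into out (room for 4 bytes).
   Returns the number of bytes written, or 0 if cp is a surrogate
   or lies beyond U+10FFFF. Hypothetical helper for illustration. */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                       /* surrogates are not characters */
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                           /* outside the Unicode range */
}

/* Write one code point to fp as UTF-8.
   Returns 0 on success, -1 on an invalid code point or write error. */
int put_utf8(unsigned long cp, FILE *fp)
{
    unsigned char buf[4];
    size_t len = utf8_encode(cp, buf);

    if (len == 0)
        return -1;
    return fwrite(buf, 1, len, fp) == len ? 0 : -1;
}
```

For example, put_utf8(0x6C34, stdout) writes the three bytes E6 B0 B4; whether they show up as one character depends on the terminal being set to UTF-8.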

Alex
 

Michael B Allen


If you would like a quick intro to what you're up against, see this link:

http://www.ioplex.com/~miallen/libmba/dl/docs/ref/text_details.html

It describes an API used to improve portability of code across different
platforms, which you may or may not be concerned with, but it does describe
the basics of working with Unicode in C.

Mike
 

Marco Iannaccone

Thanx a lot! :) (and thanx to everyone for helping! I'll start
studying...! :p)
 
