Using Unicode in C programs

Marco Iannaccone · Sep 1, 2005

I'd like to start using Unicode (especially UTF-8) in my C programs, and
would like some information on how to get started.
Can you point me to some documents (possibly online) explaining Unicode
and UTF-8, and how I can use them in my programs (writing to and reading
from files and the console, processing Unicode strings and characters
inside the program, etc.)?

Thanks

Simon Biber · Sep 1, 2005

Marco said:
I'd like to start using Unicode (especially UTF-8) in my C programs, and
would like some information on how to get started.
Can you point me to some documents (possibly online) explaining Unicode
and UTF-8, and how I can use them in my programs (writing to and reading
from files and the console, processing Unicode strings and characters
inside the program, etc.)?

C provides a concept of wide characters (arrays of wchar_t) and
multibyte characters (arrays of char where each character may take up
more than one byte). The C standard defines functions for converting
between wide and multibyte representations. The standard does not
specify what encoding these two representational forms take.

On at least one platform, depending on the current locale setting, the
wide characters built in to C represent Unicode characters, and the
multibyte characters represent the UTF-8 form.

The following program attempts to set the locale to en_AU.UTF-8, which
means Australian English in UTF-8 encoding. The language portion doesn't
matter, just the encoding does. It then takes a UTF-8 string (which
happens to contain Simplified Chinese characters), and converts it to
the wide character representation, which on my platform is equivalent to
Unicode.

#include <locale.h>
#include <stdlib.h>
#include <stdio.h>

int main(void)
{
    wchar_t ucs2[5];   /* four characters plus the terminating null */

    if(!setlocale(LC_ALL, "en_AU.UTF-8"))
    {
        printf("Unable to set locale to Australian English in UTF-8\n");
        return EXIT_FAILURE;
    }

    /* The UTF-8 representation of string "水调歌头"
       (four Chinese characters pronounced shui3 diao4 ge1 tou2) */
    char *utf8 = "\xE6\xB0\xB4\xE8\xB0\x83\xE6\xAD\x8C\xE5\xA4\xB4";

    /* mbstowcs returns (size_t)-1 if the input is not valid in this locale */
    if(mbstowcs(ucs2, utf8, sizeof ucs2 / sizeof *ucs2) == (size_t)-1)
    {
        printf("Unable to convert the UTF-8 string\n");
        return EXIT_FAILURE;
    }

    /* Dump the raw UTF-8 bytes */
    printf("UTF-8: ");
    for(char *p = utf8; *p; p++)
        printf("%02X ", (unsigned)(unsigned char)*p);
    printf("\n");

    /* Dump the wide characters produced by the conversion */
    printf("Unicode: ");
    for(wchar_t *p = ucs2; *p; p++)
        printf("U+%04lX ", (unsigned long) *p);
    printf("\n");

    return 0;
}

[sbiber@eagle c]$ c99 -Wall utf8ucs2.c -o utf8ucs2
[sbiber@eagle c]$ ./utf8ucs2
UTF-8: E6 B0 B4 E8 B0 83 E6 AD 8C E5 A4 B4
Unicode: U+6C34 U+8C03 U+6B4C U+5934

I'd be interested to know how widely this technique works. Is it
portable?

Alexei A. Frounze · Sep 1, 2005

Marco Iannaccone said:
I'd like to start using Unicode (especially UTF-8) in my C programs, and
would like some information on how to get started.
Can you point me to some documents (possibly online) explaining Unicode
and UTF-8, and how I can use them in my programs (writing to and reading
from files and the console, processing Unicode strings and characters
inside the program, etc.)?

The best and most authoritative source of information on all aspects of Unicode
is www.unicode.org.
At the very least, read the Unicode FAQ and the article "To the BMP and
beyond!" by Eric Muller of Adobe Systems (it should be linked somewhere
at unicode.org, or just google for it). Read it carefully.
Unlike ASCII, Unicode isn't guaranteed to be supported by every
compiler on every system. But to the best of my knowledge,
recent Linux distros support UTF-8 in functions like printf() and fopen().
Once again, make use of www.unicode.org.

Alex

Michael B Allen · Sep 2, 2005

I'd like to start using Unicode (especially UTF-8) in my C programs, and
would like some information on how to get started.
Can you point me to some documents (possibly online) explaining Unicode
and UTF-8, and how I can use them in my programs (writing to and reading
from files and the console, processing Unicode strings and characters
inside the program, etc.)?

If you would like a quick intro to what you're up against, see this link:

http://www.ioplex.com/~miallen/libmba/dl/docs/ref/text_details.html

It describes an API for improving the portability of code across different
platforms, which you may or may not be concerned with, but it also covers
the basics of working with Unicode in C.

Mike

Marco Iannaccone · Sep 2, 2005

Thanks a lot!

(And thanks to everyone for helping! I'll start studying!)
