Unicode (UTF-8) in C


Chicken McNuggets

Hopefully this is a question that is related to standard C.

I have (rather shamefully) ignored Unicode when programming in C up to
now. But given its prevalence and its ease of use in other languages
(Python, for example), I thought I'd revisit the subject in C.

I'm confused about how one is supposed to encode a string into UTF-8. I
don't think I understand the concept at all and must be missing
something really simple. If I take input from something like getchar()
and the user inputs a UTF-8 character - what happens? I would assume
that if the character was a multi-byte character then only the first 8
bits will be stored in the corresponding char that getchar() returns. Is
that correct?

Where do wide characters (wchar_t etc) fit in with this?

Any help is very much appreciated.
 
Xavier Roche

On 16/03/2014 05:49, Chicken McNuggets wrote:
If I take input from something like getchar()
and the user inputs a UTF-8 character - what happens?

First of all, if it is pure ASCII (7 bit), things will remain exactly
the same -- ASCII is a subset of UTF-8.

If a character outside the 7-bit range is entered, you will get several
bytes through getchar(). For example, "é" (Latin small letter e with
acute) will cause getchar() to return 0xc3 and then 0xa9.

The naming of getchar() is a bit misleading - it should be called
getbyte() actually, because this is exactly that: you get one 8-bit byte
each time, not a character (yes, "char" is the returned type after error
check in fact).
Where do wide characters (wchar_t etc) fit in with this?

As far as I know, the most portable way is to... decode the UTF-8 stream
by yourself.

Fortunately, UTF-8 is a very smart encoding, and decoding it is not
really hard.

Basically,
- 7-bit ASCII (0..0x7f) remains the same.
- 0x80..0xBF are intermediate sequence characters
- 0xC0..0xFF are leading sequence characters

When you have a leading byte, the number of leading 1 bits gives the
full length of the UTF-8 sequence (0 leading 1s means ASCII, and a
single leading 1 means an intermediate sequence byte). After the
leading 1s, the remaining bits are data (big-endian style).

See also http://en.wikipedia.org/wiki/UTF-8

Here's my try (code taken from httrack code); it could probably be
improved (loops could be unrolled):

#include <stdio.h>
#include <stdlib.h>

#define UTF8_ERROR ( (int) (-2) )

/* Hacker's Delight number of leading zeros. */
static unsigned int nlz8(unsigned char x) {
  unsigned int b = 0;

  if (x & 0xf0) {
    x >>= 4;
  } else {
    b += 4;
  }

  if (x & 0x0c) {
    x >>= 2;
  } else {
    b += 2;
  }

  if (! (x & 0x02) ) {
    b++;
  }

  return b;
}

/* Length of an utf-8 sequence. */
static size_t utf8_length(const char lead) {
  const unsigned char f = (unsigned char) lead;
  return nlz8(~f);
}

/* Replacement for getchar(), utf-8 compatible. */
static int utf8_getchar(void) {
  const int c = getchar();
  const size_t len = utf8_length(c);
  if (c < 0) { /* EOF */
    return EOF;
  } else if (len == 0) { /* ASCII */
    return c;
  } else if (len == 1) { /* Error (unexpected in-sequence byte) */
    return UTF8_ERROR;
  } else { /* utf-8 multi-byte sequence */
    unsigned int uc = c & ( (1 << (7 - len)) - 1 );
    size_t i;
    for(i = 0 ; i + 1 < len ; i++) {
      const int c2 = getchar();
      if (c2 != EOF && ( c2 >> 6 ) == 0x2) {
        uc <<= 6;
        uc |= (c2 & 0x3f);
      } else if (c2 == EOF) {
        return EOF;
      } else {
        return UTF8_ERROR;
      }
    }
    return (int) uc;
  }
}

int main(int argc, char* argv[]) {
  for(;;) {
    const int c = utf8_getchar();
    if (c == EOF) {
      break;
    } else if (c == UTF8_ERROR) {
      fprintf(stderr, "* utf-8 error\n");
    } else {
      printf("unicode character 0x%04x\n", c);
    }
  }
  return EXIT_SUCCESS;
}
 
Keith Thompson

Xavier Roche said:
On 16/03/2014 05:49, Chicken McNuggets wrote:

First of all, if it is pure ASCII (7 bit), things will remain exactly
the same -- ASCII is a subset of UTF-8.

If a character outside the 7-bit range is entered, you will get several
characters through getchar(). For example, the "é" (latin letter e with
acute accent) letter will cause the getchar() to return 0xc3 and then 0xa9.

The naming of getchar() is a bit misleading - it should be called
getbyte() actually, because this is exactly that: you get one 8-bit byte
each time, not a character (yes, "char" is the returned type after error
check in fact).

No, the value returned by getchar() is an int; the value is either EOF
(a negative value) or an *unsigned char* value, converted to int,
representing the next byte read.

It's an important distinction. Plain char is often signed, but if
you're reading UTF-8 data, you'll be reading values that don't fit in a
signed char.

(This all assumes CHAR_BIT==8, which is almost certain to be the case on
any system where UTF-8 makes sense.)

[...]
#define UTF8_ERROR ( (int) (-2) )

The cast and the outer parentheses are superfluous:

#define UTF8_ERROR (-2)

since -2 is already of type int.

[...]
 
Stephen Sprunk

I'm confused about how one is supposed to encode a string into UTF-8. I
don't think I understand the concept at all and must be missing
something really simple. If I take input from something like getchar()
and the user inputs a UTF-8 character - what happens? I would assume
that if the character was a multi-byte character then only the first 8
bits will be stored in the corresponding char that getchar() returns. Is
that correct?

Where do wide characters (wchar_t etc) fit in with this?

The first thing you need to decide is whether you actually _care_ that
the strings you're working with are UTF-8, ASCII, etc.; in many cases,
you do not. Particularly in the POSIX world, you can pass UTF-8 to the
API transparently; that was a major factor in the design of UTF-8, in fact.

When working with UTF-8, you do need to understand that strlen() et al
are counting bytes ("code units") rather than characters ("code
points"*), but in most cases that too is transparent. Splitting UTF-8
strings can be dangerous, but just add some logic to make sure the byte
("code unit") after the split isn't 10XXXXXX (a "trailing" byte); if
it's 0XXXXXXX (a "single" byte) or 11XXXXXX (a "leading" byte), you're
fine. Other encodings are nowhere near as easy to deal with and should
be avoided when possible.

wchar_t is only required when you _do_ have to care about the encoding
of your strings and, in particular, have to convert from one to another.
In many cases, you just need to convert from encoding A to wchar_t on
input and then from wchar_t to encoding B on output. FWIW, wchar_t
itself is almost certainly UTF-16 (Windows) or UTF-32 (most others).

If you're forced to work on Windows, you can't avoid it because the
native API works in UTF-16 but it's unlikely that your data is in that
form, so you have to convert back and forth all over the place.
Stupidly, they still do not allow UTF-8 in the "ANSI" (narrow char) API.
You'd probably know by now if that nightmare applied to you, though.
Ditto if you're interfacing with Java or other languages that use UTF-16
internally.

(* Actually, "grapheme cluster" is probably a better match to what most
people think of as a "character", but unless you're writing a rendering
engine, don't worry about it. Trying to count "characters" in Unicode
is usually more trouble than it's worth anyway; what you care about in
nearly all cases is either "code units" or "code points", not "characters".)

S
 
Chicken McNuggets

wchar_t is only required when you _do_ have to care about the encoding
of your strings and, in particular, have to convert from one to another.
In many cases, you just need to convert from encoding A to wchar_t on
input and then from wchar_t to encoding B on output. FWIW, wchar_t
itself is almost certainly UTF-16 (Windows) or UTF-32 (most others).

I think this is my big area of misunderstanding. How does one go about
converting between different encodings? I assume one has to use a
library to do so but which one?

If my text is already ASCII then no changes to the encoding need to be
done due to the fact that ASCII is the same in the different Unicode
implementations but the problems come when working with characters in
different Unicode implementations such as UTF-8, UTF-16 and UTF-32.

How would one go about converting a UTF-8 string to a UTF-32 string for
instance in C?

Thanks for all the responses by the way. Some very useful information.
 
Stephen Sprunk

I think this is my big area of misunderstanding. How does one go about
converting between different encodings? I assume one has to use a
library to do so but which one?

The C Standard Library can do it for you--if you can guess the correct
strings to pass to setlocale(); they vary from one implementation to the
next, so good luck with that.

There are several third-party libraries too; it's hard to guess which is
best for your particular needs, but Iconv and ICU seem popular.

If you're only dealing with UTF-8/16/32, though, you might find it
simpler to just roll your own.
If my text is already ASCII then no changes to the encoding need to be
done due to the fact that ASCII is the same in the different Unicode
implementations

ASCII is a subset of several encodings, most notably UTF-8, but
certainly not all of them.
but the problems come when working with characters in different Unicode
implementations such as UTF-8, UTF-16 and UTF-32.

Those are encodings, not implementations, and the easiest three to work
with at that; there are dozens of others that are quite painful and best
left to specialized libraries written by experts.
How would one go about converting a UTF-8 string to a UTF-32 string for
instance in C?

Googling for "UTF-8 C" turns up many promising links. IMHO, though, it
is worth reading the spec and learning how to do it yourself; UTF-8 is
fairly straightforward, and UTF-32 is downright trivial.

S
 
Noob

Dr said:
I've handled UTF-8 perfectly successfully with the traditional C string
functions. In your example the first getchar() returns the first byte
of a multibyte character, the second the next and so on.

What happens when a valid "NUL" occurs within an UTF-8 string?
(Consider e.g. the "GRINNING FACE" emoticon, 0x1F600)
I don't think strcpy would be too happy...
 
Siri Cruz

Chicken McNuggets said:
I think this is my big area of misunderstanding. How does one go about
converting between different encodings? I assume one has to use a
library to do so but which one?

Your C library might also have iconv(3), which can convert a variety of
character set encodings. UTF to/from Unicode is simple by design. Other
character encodings to Unicode can be much harder.
If my text is already ASCII then no changes to the encoding need to be
done due to the fact that ASCII is the same in the different Unicode
implementations but the problems come when working with characters in
different Unicode implementations such as UTF-8, UTF-16 and UTF-32.

If you handle 0x01 to 0x7F as ASCII and 0x80 to 0xFF as valid but
unknown characters, you should be able to handle UTF-8 with little or no
modification. All str* functions work on UTF-8, although calls like
strchr won't be able to find non-ASCII characters. However, strstr will
be able to search for non-ASCII characters in UTF-8 strings.
How would one go about converting a UTF-8 string to a UTF-32 string for
instance in C?

http://en.wikipedia.org/wiki/UTF-8 describes the conversion to/from UTF-8 and
Unicode. It might sound frightening, but it's really easy and straightforward.

My functions to extract the first Unicode character from a UTF-8 C-string and
the remaining characters look something like

unsigned firstunicode(char *string) {
  if ((string[0]&0x80)==0)
    return string[0];
  else if ((string[0]&0xE0)==0xC0
        && (string[1]&0xC0)==0x80)
    return ((string[0]&0x1F)<< 6)
         |  (string[1]&0x3F);
  else if ((string[0]&0xF0)==0xE0
        && (string[1]&0xC0)==0x80
        && (string[2]&0xC0)==0x80)
    return ((string[0]&0x0F)<<12)
         | ((string[1]&0x3F)<< 6)
         |  (string[2]&0x3F);
  else if ((string[0]&0xF8)==0xF0
        && (string[1]&0xC0)==0x80
        && (string[2]&0xC0)==0x80
        && (string[3]&0xC0)==0x80)
    return ((string[0]&0x07)<<18)
         | ((string[1]&0x3F)<<12)
         | ((string[2]&0x3F)<< 6)
         |  (string[3]&0x3F);
  else if ((string[0]&0xFC)==0xF8
        && (string[1]&0xC0)==0x80
        && (string[2]&0xC0)==0x80
        && (string[3]&0xC0)==0x80
        && (string[4]&0xC0)==0x80)
    return ((string[0]&0x03)<<24)
         | ((string[1]&0x3F)<<18)
         | ((string[2]&0x3F)<<12)
         | ((string[3]&0x3F)<< 6)
         |  (string[4]&0x3F);
  else if ((string[0]&0xFE)==0xFC
        && (string[1]&0xC0)==0x80
        && (string[2]&0xC0)==0x80
        && (string[3]&0xC0)==0x80
        && (string[4]&0xC0)==0x80
        && (string[5]&0xC0)==0x80)
    return ((string[0]&0x01)<<30)
         | ((string[1]&0x3F)<<24)
         | ((string[2]&0x3F)<<18)
         | ((string[3]&0x3F)<<12)
         | ((string[4]&0x3F)<< 6)
         |  (string[5]&0x3F);
  else
    return 0;
}

char *restunicode(char *string) {
  string++;
  while ((*string&0xC0)==0x80) string++;
  return string;
}

...
unsigned u[strlen(string)]; int l = 0;
for (char *s=string; *s; s=restunicode(s))
  u[l++] = firstunicode(s);
...

and to convert one Unicode character back to UTF-8:

char *addunicode(unsigned code, char *string, char *endplusone) {
  if (code==0) {
    return 0;
  }else if (code<=0x7F) {
    if (string+2>endplusone) return 0;
    *string++ = code;
    *string = 0;
    return string;
  }else if (code<=0x7FF) {
    if (string+3>endplusone) return 0;
    *string++ = ((code>>6) & 0x1F) | 0xC0;
    *string++ = ((code   ) & 0x3F) | 0x80;
    *string = 0;
    return string;
  }else if (code<=0xFFFF) {
    if (string+4>endplusone) return 0;
    *string++ = ((code>>12) & 0x0F) | 0xE0;
    *string++ = ((code>> 6) & 0x3F) | 0x80;
    *string++ = ((code    ) & 0x3F) | 0x80;
    *string = 0;
    return string;
  }else if (code<=0x1FFFFF) {
    if (string+5>endplusone) return 0;
    *string++ = ((code>>18) & 0x07) | 0xF0;
    *string++ = ((code>>12) & 0x3F) | 0x80;
    *string++ = ((code>> 6) & 0x3F) | 0x80;
    *string++ = ((code    ) & 0x3F) | 0x80;
    *string = 0;
    return string;
  }else if (code<=0x3FFFFFF) {
    if (string+6>endplusone) return 0;
    *string++ = ((code>>24) & 0x03) | 0xF8;
    *string++ = ((code>>18) & 0x3F) | 0x80;
    *string++ = ((code>>12) & 0x3F) | 0x80;
    *string++ = ((code>> 6) & 0x3F) | 0x80;
    *string++ = ((code    ) & 0x3F) | 0x80;
    *string = 0;
    return string;
  }else {
    if (string+7>endplusone) return 0;
    *string++ = ((code>>30) & 0x01) | 0xFC;
    *string++ = ((code>>24) & 0x3F) | 0x80;
    *string++ = ((code>>18) & 0x3F) | 0x80;
    *string++ = ((code>>12) & 0x3F) | 0x80;
    *string++ = ((code>> 6) & 0x3F) | 0x80;
    *string++ = ((code    ) & 0x3F) | 0x80;
    *string = 0;
    return string;
  }
}

...
unsigned u[strlen(string)]; int l = 0;
for (char *s=string; *s; s=restunicode(s))
  u[l++] = firstunicode(s);
char t[6*l+1]; char *p = t;  /* 6 bytes max per character */
for (int j=0; j<l; j++)
  p = addunicode(u[j], p, t+6*l+1);
...
 
Mikko Rauhala

What happens when a valid "NUL" occurs within an UTF-8 string?

It never does. All octets encoding a non-ASCII character have
their high bit set in UTF-8.
(Consider e.g. the "GRINNING FACE" emoticon, 0x1F600)
I don't think strcpy would be too happy...

It is encoded as f0 9f 98 80 in UTF-8.
 
Stephen Sprunk

What happens when a valid "NUL" occurs within an UTF-8 string?

0x00 only occurs in a UTF-8 string when encoding a real NUL, i.e. U+0000.
(Consider e.g. the "GRINNING FACE" emoticon, 0x1F600)

U+1F600 is 0xF0 0x9F 0x98 0x80; there is no 0x00 byte (see above).
I don't think strcpy would be too happy...

UTF-8 was _designed_ to be passed through existing APIs, including
string functions that are not Unicode-aware. You may be thinking of
UTF-16 or UTF-32, which _do_ require new APIs.

S
 
Kaz Kylheku

What happens when a valid "NUL" occurs within an UTF-8 string?
(Consider e.g. the "GRINNING FACE" emoticon, 0x1F600)
I don't think strcpy would be too happy...

The only way that a valid NUL can occur in UTF-8 is if it in fact
represents the code point U+0000, corresponding to USASCII NUL.

The program faces the problem that the input data contains nulls,
and it is using null-terminated strings.

That problem is not related to Unicode or internationalization.

* * *

However, I will tell you something useful.

If you use wide character null-terminated strings in your program, there is a
way to handle null quite transparently.

It hit me in a small flash of insight several months ago and I implemented
it in the TXR language, which uses null-terminated wchar_t C strings
internally.

The trick is this: when you see a NUL on input, then have your UTF-8 decoder
produce the invalid code point U+DC00.

The rest of your code can be written to understand this convention.
To deal with NUL-delimited data, just write the code to look for U+DC00,
and everything is fine.

Furthermore, on output, have your UTF-8 encoder turn U+DC00 into NUL.

This idea came to me because I was already using the DCxx convention:
the convention is to map all invalid UTF-8 bytes into the range U+DCxx,
and on encoding, recover them. Furthermore, if the UTF-8 data contains an
actual valid multi-byte encoding of a code point in the U+DCxx range, that
encoding is considered invalid: all of its bytes are individually represented
as invalid bytes in U+DCxx, so they are then properly recovered on output.

So with that convention in place, you can transparently record invalid UTF-8
bytes in a wchar_t string, and recover exactly the same data back in UTF-8.
Great. I had that going for several years and didn't give it another
thought.

Then it hit me: hey, since NUL characters wreck C strings, they are *de facto*
invalid bytes---to a program which relies completely on C strings, right? So,
since they are invalid bytes, then let us treat them as such: map the suckers
to U+DC00.

And, bingo. Transparent nul character support in a whole language, in one
simple line of code!

Suddenly, I'm easily writing TXR scripts that can read the output of, say, GNU
find -print0, or process /proc/<pid>/environ on Linux.

So there is a beauty to UTF-8: it has a benefit for programs that is
independent of internationalization. Even if you only handle ASCII data, by
treating it as UTF-8 and using wide strings, you can represent the null
character using a code that is outside of the byte range.

UTF-8 and wide character strings together solve an ages-old problem in the
C string representation.
 
Siri Cruz

This depends on normalization form, among other things. The tables are
not hard to derive automatically from UnicodeData.txt:

#!/usr/bin/tclsh

set unicodedata [open UnicodeData.txt r]
array set first {}; array set last {}
array set CMP {}; set CMP1 {}; set DCM {}; set CMB {}
while {[gets $unicodedata line]>=0} {
    lassign [split $line ";"] \
        code name (category) \
        combining (bidi) decomposition \
        numeric1 numeric2 numeric3 \
        (bidimirrored) (oldname) (oldcomment) \
        (uppercase) (lowercase) (titlecase)
    set code [scan $code %x]; set max $code
    if {[regexp {[^0-9a-fA-F]+;([^;]*), First([^;]*)} $line - a b]} {
        set name $a$b
        if {[info exists last($name)]} {
            set max $last($name)
            array unset last $name
        } else {
            set first($name) $code
            continue
        }
    } elseif {[regexp {[^0-9a-fA-F]+;([^;]*), Last([^;]*)} $line - a b]} {
        set name $a$b
        if {[info exists first($name)]} {
            set code $first($name)
            array unset first $name
        } else {
            set last($name) $max
            continue
        }
    } elseif {![regexp {[^0-9a-fA-F]+;} $line]} {
        continue
    }
    for {} {$code<=$max} {incr code} {
        while {[llength $CMP1]<=$code} {lappend CMP1 "0,\n"}
        while {[llength $DCM]<=$code} {lappend DCM "{0, [llength $DCM]},\n"}
        if {[regexp {^([0-9a-fA-F]+)\s+([0-9a-fA-F]+)$} $decomposition - a b]} {
            scan $a %x a; scan $b %x b
            lappend CMP($a) [list $b $code]
            lset DCM $code "{$a, $b}, //$name\n"
        } elseif {[regexp {^([0-9a-fA-F]+)$} $decomposition - b]} {
            scan $b %x b
            lset DCM $code "{0, $b}, //$name\n"
        }
        while {[llength $CMB]<=$code} {lappend CMB "0,\n"}
        lset CMB $code "$combining,\n"
    }
}
close $unicodedata
set CMP2 {"{0, 0},\n"}
foreach {a B} [array get CMP] {
    set B [lsort -integer -index 0 $B]
    lset CMP1 $a [llength $CMP2]
    foreach b $B {
        lappend CMP2 "{[lindex $b 0], [lindex $b 1]}, "
    }
    lappend CMP2 "{0,0},\n"
}

puts "
static unsigned DCM\[\]\[2\] = {[join $DCM {}]};
static unsigned char CMB\[\] = {[join $CMB {}]};
static unsigned CMP1\[\] = {[join $CMP1 {}]};
static unsigned CMP2\[\]\[2\] = {[join $CMP2 {}]};
"

puts {
static unsigned *unicodedecompose(
    unsigned unicode, unsigned *dcm, unsigned *limit
) {
    if (DCM[unicode][0])
        dcm = unicodedecompose(DCM[unicode][0], dcm, limit);
    if (DCM[unicode][1]) {
        if (!dcm || dcm>=limit) return 0;
        *dcm++ = DCM[unicode][1];
    }
    return dcm;
}

static void unicodeorderring(unsigned *str, unsigned *limit) {
    while (str<limit) {
        if (CMB[*str]==0) {str++; continue;}
        int n = 1;
        while (str+n<limit && CMB[str[n]]>0) n++;
        for (int j=n-1; j>0; j--)
            for (int k=0; k<j; k++)
                if (CMB[str[k]]>CMB[str[j]]) {
                    unsigned t = str[k];
                    str[k] = str[j]; str[j] = t;
                }
        str += n;
    }
}

static unsigned *unicodecompose(unsigned *str, unsigned *limit) {
    unsigned *cmp = str;
    while (str<limit) {
        if (CMP1[*str]==0) {*cmp++ = *str++; continue;}
        int cmb = CMB[*str]; *cmp = *str++;
        int x = CMP1[*str];
        for (;;)
            if (CMP2[x][0]==0) break;
            else if (CMP2[x][0]==*str && cmb<CMB[*str]) {
                *cmp = CMP2[x][1]; cmb = CMB[*str];
                str++; x = CMP1[*cmp];
            }else
                x++;
        cmp++;
    }
    return cmp;
}

int NFD(unsigned *inp, int len, unsigned *out, unsigned *limit) {
    unsigned *p = out;
    for (; len>0; inp++, len--)
        if ((p=unicodedecompose(*inp, p, limit))==0) return -1;
    unicodeorderring(out, p);
    return p-out;
}

int NFC(unsigned *inp, int len, unsigned *out, unsigned *limit) {
    if ((len=NFD(inp, len, out, limit))<0) return -1;
    limit = unicodecompose(out, out+len);
    return limit-out;
}
}
 
Stephen Sprunk

However, I will tell you something useful.

If you use wide character null-terminated strings in your program,
there is a way to handle null quite transparently.

It hit me in a small flash of insight several months ago and I
implemented it in the TXR language, which uses null-terminated
wchar_t C strings internally.

The trick is this: when you see a NUL on input, then have your UTF-8
decoder produce the invalid code point U+DC00.

The rest of your code can be written to understand this convention.
To deal with NUL-delimited data, just write the code to look for
U+DC00, and everything is fine.

Furthermore, on output, have your UTF-8 encoder turn U+DC00 into NUL.

There is already a convention for this: Java's "Modified UTF-8" uses the
overlong sequence 0xC0 0x80 for embedded NULs when serializing data.

S
 
