Bjoern Hoehrmann
Hi,
For a free software project, I had to write a routine that, given a
Unicode scalar value U+0000 - U+10FFFF, returns an integer that holds
the UTF-8 encoded form of it, for example, U+00F6 becomes 0x0000C3B6.
I came up with the following. I am looking for a more elegant solution,
that is, roughly, faster, shorter, more readable, ... while producing
the same output for the cited range.
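unsigned int
utf8toint(unsigned int c) {
    unsigned int len, res, i;

    /* ASCII is encoded as itself */
    if (c < 0x80) return c;

    /* len is the number of continuation bytes */
    len = c < 0x800 ? 1 : c < 0x10000 ? 2 : 3;

    /* lead byte marker (0xC0, 0xE0 or 0xF0) shifted into place;
       this could be replaced with an array lookup */
    res = ((2 << len) - 1) << (7 - len) << (len * 8);

    /* continuation bytes, lowest six bits first */
    for (i = len; i > 0; --i, c >>= 6)
        res |= ((c & 0x3f) | 0x80) << ((len - i) * 8);

    /* while unusual, the desired result is an int */
    return res | (c << (len * 8));
}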
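For comparison, here is a sketch of a fully unrolled variant with one branch per encoded length and no loop. It is not the routine above: the name utf8toint_unrolled is hypothetical, and it assumes that unsigned int is at least 32 bits wide and that the input is within U+0000 - U+10FFFF. Each mask picks out the bits that belong to one output byte, and the shift moves them past the marker bits of the bytes below.

```c
/* Hypothetical alternative, not the routine from the post.
   Assumes unsigned int has at least 32 bits and c <= 0x10FFFF. */
unsigned int
utf8toint_unrolled(unsigned int c) {
    if (c < 0x80)                       /* 0xxxxxxx */
        return c;

    if (c < 0x800)                      /* 110xxxxx 10xxxxxx */
        return 0xC080u
             | (c & 0x07C0u) << 2
             | (c & 0x003Fu);

    if (c < 0x10000)                    /* 1110xxxx 10xxxxxx 10xxxxxx */
        return 0xE08080u
             | (c & 0xF000u) << 4
             | (c & 0x0FC0u) << 2
             | (c & 0x003Fu);

    /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
    return 0xF0808080u
         | (c & 0x1C0000u) << 6
         | (c & 0x03F000u) << 4
         | (c & 0x000FC0u) << 2
         | (c & 0x00003Fu);
}
```

With this sketch, utf8toint_unrolled(0xF6) yields 0xC3B6, matching the example given for U+00F6. Whether it counts as more readable or faster than the loop is a judgment call, and it would need measuring.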
Any ideas? Thanks,