More elegant UTF-8 encoder


Bjoern Hoehrmann

Hi,

For a free software project, I had to write a routine that, given a
Unicode scalar value U+0000 - U+10FFFF, returns an integer that holds
the UTF-8 encoded form of it, for example, U+00F6 becomes 0x0000C3B6.
I came up with the following. I am looking for a more elegant solution,
that is, roughly, faster, shorter, more readable, ... while producing
the same output for the cited range.

unsigned int
utf8toint(unsigned int c) {
    unsigned int len, res, i;

    if (c < 0x80) return c;

    len = c < 0x800 ? 1 : c < 0x10000 ? 2 : 3;

    /* this could be replaced with an array lookup */
    res = (2 << len) - 1 << (7 - len) << len * 8;

    for (i = len; i > 0; --i, c >>= 6)
        res |= ((c & 0x3f) | 0x80) << (len - i) * 8;

    /* while unusual, the desired result is an int */
    return res | c << len * 8;
}

Any ideas? Thanks,
 

christian.bau

Bjoern Hoehrmann said:
For a free software project, I had to write a routine that, given a
Unicode scalar value U+0000 - U+10FFFF, returns an integer that holds
the UTF-8 encoded form of it. [...] I am looking for a more elegant
solution [...]

What you are trying to do seems rather bizarre. If you want to encode
Unicode in a 32 bit number, leave it unchanged. If you want to encode
Unicode as a sequence of bytes, store it into a sequence of bytes.

And I would absolutely refuse reviewing code containing an expression
like "res | c << len * 8" without parentheses.
 

Richard Tobin

Bjoern Hoehrmann said:
I am looking for a more elegant solution,
that is, roughly, faster, shorter, more readable

"Choose any two".

To be honest, I don't see the point. It looks fast enough: after all,
you must be reading the data from somewhere, which is likely to be
much slower. Unless you have profiling data showing that it's a
significant overhead, forget it. As for clearer, it depends where
you're starting from. If you want to match a typical textual
description of UTF-8, I think something like this is much clearer:

unsigned char b[4] = {0, 0, 0, 0};

if(c < 0x80)
    b[0] = c;
else if(c < 0x800)
{
    b[1] = 0xc0 + (c >> 6);
    b[0] = 0x80 + (c & 0x3f);
}
else if(c < 0x10000)
{
    b[2] = 0xe0 + (c >> 12);
    b[1] = 0x80 + ((c >> 6) & 0x3f);
    b[0] = 0x80 + (c & 0x3f);
}
else
{
    b[3] = 0xf0 + (c >> 18);
    b[2] = 0x80 + ((c >> 12) & 0x3f);
    b[1] = 0x80 + ((c >> 6) & 0x3f);
    b[0] = 0x80 + (c & 0x3f);
}

return b[0] + ((unsigned)b[1] << 8) + ((unsigned)b[2] << 16)
            + ((unsigned)b[3] << 24);

That's untested and derived from code intended to output bytes in
sequence. Of course you could replace the array assignments with
returns of expressions composing the parts.

-- Richard
 

Richard Tobin

christian.bau said:
And I would absolutely refuse reviewing code containing an expression
like "res | c << len * 8" without parentheses.

I agree!

-- Richard
 

J. J. Farrell

Bjoern Hoehrmann said:
For a free software project, I had to write a routine that, given a
Unicode scalar value U+0000 - U+10FFFF, returns an integer that holds
the UTF-8 encoded form of it. [...] I am looking for a more elegant
solution [...]

I'd have a look at (or just use) the free code to do such conversions
which is available on the Unicode web site. That does the obvious
thing of creating an array of bytes holding the UTF-8 encoding, but
you could easily convert that result or modify the code. You seem to
have a bizarre requirement though - what if the UTF-8 encoding
requires more bytes than any available integer type?
 

Richard Tobin

J. J. Farrell said:
You seem to
have a bizarre requirement though - what if the UTF-8 encoding
requires more bytes than any available integer type?

4 bytes is sufficient to cover all values up to 0x10ffff. I don't
think there's any prospect of codes being allocated outside that range
in the foreseeable future.

-- Richard
 

Bjoern Hoehrmann

* christian.bau wrote in comp.lang.c:
What you are trying to do seems rather bizarre. If you want to encode
Unicode in a 32 bit number, leave it unchanged. If you want to encode
Unicode as a sequence of bytes, store it into a sequence of bytes.

Well, I have what you can consider a regular expression engine based on
Janusz Brzozowski's notion of derivatives of regular expressions, meaning
that, given a regular expression and a character, it computes a regular
expression matching the rest of the string. Currently it stores ranges
of characters using the Unicode scalar value and transcodes from UTF-8
to UTF-32.

For several reasons, I want to avoid transcoding to UTF-32, so I want to
change it so that, given a regex and an octet, it computes a new regex. I
am experimenting with possible solutions, one is to exploit

utf8toint(c1) < utf8toint(c2) <=> c1 < c2

which allows me to store the character ranges in their utf8toint encoded
form. The derivative of a range with respect to an octet can then easily
be computed by computing the intersection of the range and a new range
consisting of the minimal and maximal utf8toint value given the octet(s)
seen up to that point (they consist of the current byte followed by n-1
0x80 and 0xBF octets respectively, where n is the required length).

So a range [ U+0000 - U+00FF ] would be stored as [ 0x0000 - 0xc3bf ]
and if it sees e.g. a 0xc2 it would create a range [ 0xc280 - 0xc2bf ],
compute the intersection which is [ 0xc280 - 0xc2bf ] and drop the seen
byte, resulting in [ 0x80 - 0xbf ]; I can always tell, due to how UTF-8
byte patterns are organized, whether a given range is a partial range
and how many bytes are still needed to make a full character, though I
will be storing the remaining byte count for performance reasons.

Obviously I could do something similar by partially decoding the UTF-8
octets and storing Unicode scalar value ranges in the derivative instead
or mix these approaches in some way, but that seemed more difficult to
me. Similarly, rewriting the regular expression upfront so it matches
on bytes rather than characters would be more difficult. So, while it
might be unusual, I don't think this is particularly bizarre.
 

Bjoern Hoehrmann

* Richard Tobin wrote in comp.lang.c:
To be honest, I don't see the point.

I am asking because I hope to learn something; I currently do not see a
way to improve the code in some way, if someone else manages to provide
an improved version, I could learn from that. I can't give any hard and
fast rules what would constitute an improvement, but if the alternative
has many more non-whitespace characters, compiles to slower code on my
system, or introduces undefined behavior or platform-specific code, it
is unlikely an improvement, while eliminating a variable without
negatively affecting performance might well be.
 

Clark Cox

J. J. Farrell said:
I'd have a look at (or just use) the free code to do such conversions
which is available on the Unicode web site. That does the obvious
thing of creating an array of bytes holding the UTF-8 encoding, but
you could easily convert that result or modify the code. You seem to
have a bizarre requirement though - what if the UTF-8 encoding
requires more bytes than any available integer type?

4 bytes is sufficient to contain any legal Unicode codepoint in UTF-8
representation.
 

Stephen Sprunk

J. J. Farrell said:
what if the UTF-8 encoding requires more bytes than any
available integer type?

That's only a risk in C89. C99 requires "long long", which is at least 64
bits, and the longest valid UTF-8 sequence is 7 octets (56 bits).

Using "int" is just plain broken, since that isn't guaranteed to hold any
more than two octets. "long" is less broken, since it's capable of holding
at least four octets and that's enough for all currently-assigned
codepoints.

S
 

Clark Cox

Stephen Sprunk said:
That's only a risk in C89.

It's not even a risk there, as long must be at least 32 bits.

Stephen Sprunk said:
C99 requires "long long", which is at least 64 bits, and the longest
valid UTF-8 sequence is 7 octets (56 bits).

No. There are no legal UTF-8 sequences that are longer than 4 octets.

Stephen Sprunk said:
Using "int" is just plain broken, since that isn't guaranteed to hold
any more than two octets.

Agreed.

Stephen Sprunk said:
"long" is less broken, since it's capable of holding at least four
octets and that's enough for all currently-assigned codepoints.

UTF-8 is officially capped at 4 octets. There is no way to make it
longer without breaking Unicode (consider round-tripping with UTF-16).
 

Richard Tobin

Clark Cox said:
UTF-8 is officially capped at 4 octets. There is no way to make it
longer without breaking Unicode (consider round-tripping with UTF-16).

It's UTF-16 that's broken.

But we'll have 10-bit bytes before we need more than 0x10ffff code points
in Unicode.

-- Richard
 

christian.bau

Bjoern Hoehrmann said:
I am asking because I hope to learn something; I currently do not see a
way to improve the code in some way, if someone else manages to provide
an improved version, I could learn from that. [...] eliminating a
variable without negatively affecting performance might well be
[an improvement].

if (c < 0x80)
    return c;
else if (c < 0x800)
    return ((c << 2) & 0x1f00) | (c & 0x003f) | 0xc080;
else if (c < 0x10000)
    return ((c << 4) & 0x0f0000) | ((c << 2) & 0x3f00)
         | (c & 0x003f) | 0xe08080;
else
    return ((c << 6) & 0x07000000) | ((c << 4) & 0x3f0000)
         | ((c << 2) & 0x3f00) | (c & 0x003f) | 0xf0808080;

So I assume that you have lots of UTF-8 encoded text, and every time
you extract the next character, you don't extract the Unicode
codepoint, but this strange UTF-8 encoded version of the codepoint,
because it would be faster to calculate from UTF-8?
 

Stephen Sprunk

Clark Cox said:
No. There are no legal UTF-8 sequences that are longer than 4 octets.
[...]
UTF-8 is officially capped at 4 octets. There is no way to make it longer
without breaking Unicode (consider round-tripping with UTF-16).

The Unicode folks and IETF agree with you, but the ISO standard doesn't
limit UTF-8 to four octets or codepoints to U+10FFFF.

While I'll grant it's unlikely, it's indeed _possible_ that the limit will
be lifted in the future. Since UTF-8 follows a consistent pattern up to
seven octets, there's no reason not to allow for encoding or decoding it as
long as it's well-formed. The UCS-2 folks all got burned when UTF-16 came
out with its surrogates, remember, and it didn't even take that long; I
don't plan on repeating their mistakes. Just like I never thought 640kB RAM
(or 4GB) was enough for everybody and allowed for more if/when it became
possible...

S
 

Stephen Sprunk

Richard Tobin said:
It's UTF-16 that's broken.

But we'll have 10-bit bytes before we need more than 0x10ffff code points
in Unicode.

Thankfully, the IETF has already made a first step in that direction:
http://www.ietf.org/rfc/rfc4042.txt

Yes, I know the publication date*, but it's still somewhat relevant...

S

* For those that aren't aware, the IETF publishes spoof standards most years
on April Fools' Day (1 Apr). One, RFC 1149, was actually implemented.
 

websnarf

Bjoern Hoehrmann said:
For a free software project, I had to write a routine that, given a
Unicode scalar value U+0000 - U+10FFFF, returns an integer that holds
the UTF-8 encoded form of it. [...]

UCS-4 (or UTF-32) is a 31-bit code space whose valid range is a subset
of [0x0,0x10FFFF]. UTF-8 is a variable length encoding of code points
into 1 to 4 octets. So in C the output data type you are looking for
is probably an unsigned long, not an unsigned int (though a
struct { int len; unsigned char v[4]; } seems more appropriate if you
don't want to worry about speed).

On a modern processor you are getting your ass kicked on the control
flow. Let's try this again:

#include "pstdint.h" /* http://www.pobox.com/~qed/pstdint.h */

uint32_t utf32ToUtf8 (uint32_t cp) {
    uint32_t ret, c;
    static const uint32_t encodingmode[4] =
        { 0x0, 0xc080, 0xe08080, 0xf0808080 };

    /* ASCII must pass through whole: the spread below keeps only the
       low six bits of each output byte, so it would drop bit 6 */
    if (cp < 0x80) return cp;

    /* Spread the bits to their target locations */
    ret = (cp & UINT32_C(0x3f)) |
          ((cp << 2) & UINT32_C(0x3f00)) |
          ((cp << 4) & UINT32_C(0x3f0000)) |
          ((cp << 6) & UINT32_C(0x3f000000));

    /* Count the continuation bytes */
    c  = (-(cp & UINT32_C(0xffff0000))) >> 31;
    c += (-(cp & UINT32_C(0xfffff800))) >> 31;
    c += (-(cp & UINT32_C(0xffffff80))) >> 31;

    /* Merge the spread bits with the mode bits */
    return ret | encodingmode[c];
}

I haven't tested this, but it seems ok upon visual inspection.
 

Dik T. Winter

Clark Cox said:
UTF-8 is officially capped at 4 octets. There is no way to make it
longer without breaking Unicode (consider round-tripping with UTF-16).

That is false. It is capped at six octets. There is round-tripping
with UTF-16, but that is a bit elaborate. In UTF-8 the surrogates
should *not* be encoded, but the actual code-point. (Encoding U+D800
to U+DFFF is not permitted in UTF-8.)

Richard Tobin said:
It's UTF-16 that's broken.

Indeed, and that becomes visible when we get beyond plane 16.

Richard Tobin said:
But we'll have 10-bit bytes before we need more than 0x10ffff code points
in Unicode.

With the current rate of increase that would be in 210 years. But in
the current (eh, 4.1) coding the largest serious code point was U+2FA1D,
and the largest *defined* code point was U+10FFFF. For five bytes of
UTF-8 we need at least U+200000. But one of these days I should look
at the differences between 4.1 and 5.0.
 

websnarf

Stephen Sprunk said:
The Unicode folks and IETF agree with you, but the ISO standard doesn't
limit UTF-8 to four or codepoints to U+10FFFF.

The *OLDER* ISO 10646 standard allowed for larger encodings. However,
ISO 10646 has since been merged with Unicode (version 3.0, I think) and
thus obsoleted/abandoned its old expanded range.

Stephen Sprunk said:
While I'll grant it's unlikely, it's indeed _possible_ that the limit will
be lifted in the future.

We would probably have to encounter an extra-terrestrial life form
that used sequential symbolic communications like we do, and who
decided that an alphabet 30 times larger than the Chinese one was part
of their communications systems. It's not going to happen here on
earth.

Stephen Sprunk said:
Since UTF-8 follows a consistent pattern up to
seven octets, there's no reason not to allow for encoding or decoding it as
long as it's well-formed. The UCS-2 folks all got burned when UTF-16 came
out with its surrogates, remember, and it didn't even take that long; I
don't plan on repeating their mistakes.

You are in charge of the Unicode Standards? The original Unicode
people were idiots and could not properly count the number of Chinese
characters. Perhaps the "offset printing lobby" tricked them into
choosing too few bits to throw a monkey wrench into the system.

Stephen Sprunk said:
[...] Just like I never thought 640kB RAM
(or 4GB) was enough for everybody and allowed for more if/when it became
possible...

Just like huh? RAM requirements are clearly tied to Moore's Law. But
alphabet sizes? I don't know how old writing is, but if it's about
5000 years old, and we assume a constant growth rate, then Unicode
will still be good for about 1000 years in its current form.

However, now that so much of language and human activity is tied up in
the current incumbent communications systems, I would claim that in
fact growth of alphabets will be severely curtailed, except for
certain marginal applications (alphabets for learning-disabled people,
Indigenous peoples' languages when/if they decide to convert them to
written form, etc.)
 

Clark Cox

Dik T. Winter said:
That is false. It is capped at six octets. There is round-tripping
with UTF-16, but that is a bit elaborate.

It's not that elaborate; it's dead simple in fact:

UTF-16 -> Unicode scalar value -> UTF-8
UTF-8 -> Unicode scalar value -> UTF-16

This is not possible if UTF-8 is extended beyond 4 bytes.

Dik T. Winter said:
In UTF-8 the surrogates should *not* be encoded, but the actual
code-point. (Encoding U+D800 to U+DFFF is not permitted in UTF-8.)

I never claimed that they should.

Dik T. Winter said:
Indeed, and that becomes visible when we get beyond plane 16.

UTF-16 is perfectly suited to represent all of the possible Unicode
values (as is 4-byte UTF-8).
 

websnarf

Dik T. Winter said:
That is false. It is capped at six octets.

No, that's the old mechanistic definition from the ISO 10646
standard. The 5 and 6 byte encodings have been retired and are not
considered correct anymore.

Dik T. Winter said:
[...] There is round-tripping
with UTF-16, but that is a bit elaborate. In UTF-8 the surrogates
should *not* be encoded, but the actual code-point. (Encoding U+D800
to U+DFFF is not permitted in UTF-8.)

Ok, this is a bit of a contorted way of thinking about it. Unicode
defines text streams as a sequence of code points which are each from
a certain subset of the range [0x0,0x10FFFF]. UTF-16 and UTF-8 can
each encode the entire legal range. The surrogates, the endian-
reversed BOM character and various other characters are illegal mostly
because UTF-16 cannot encode them independently.

Richard Tobin said:
It's UTF-16 that's broken.

Well, it's broken in the sense that it is the real limiter for the total
range, and it consumes higher than average storage space for western
text. Otherwise, from the Unicode standard's point of view, it's fine.

Dik T. Winter said:
Indeed, and that becomes visible when we get beyond plane 16.

Hm?

Dik T. Winter said:
With the current rate of increase that would be in 210 years.

Care to explain this prediction? (I assumed continuous exponential
growth over 5000 years, leading to a result of 1000 years of
additional use; what's your model?)

Dik T. Winter said:
[...] But in
the current (eh, 4.1) coding the largest serious code point was U+2FA1D,
and the largest *defined* code point was U+10FFFF. For five bytes of
UTF-8 we need at least U+200000.

More likely, some undefined/reserved Unicode space would be used to
make another layer of surrogate pairs. This would allow existing
UTF-16 encoders to continue to be used.

Dik T. Winter said:
[...] But one of these days I should look at the differences between 4.1
and 5.0.

I highly doubt there are any differences related to encodings.
 
