More elegant UTF-8 encoder


Bjoern Hoehrmann

* (e-mail address removed) wrote in comp.lang.c:
    /* Spread the bits to their target locations */
    ret = (cp & UINT32_C(0x3f)) |
          ((cp << 2) & UINT32_C(0x3f00)) |
          ((cp << 4) & UINT32_C(0x3f0000)) |
          ((cp << 6) & UINT32_C(0x3f000000));

    /* Count the length */
    c  = (-(cp & 0xffff0000)) >> UINT32_C(31);
    c += (-(cp & 0xfffff800)) >> UINT32_C(31);
    c += (-(cp & 0xffffff80)) >> UINT32_C(31);

    /* Merge the spread bits with the mode bits */
    return ret | encodingmode[c];
}

Thanks, this is quite nice, although it does not work for code points in
the range U+0040 to U+007F; for those, the seventh bit should end up in
the least significant byte, while the code above shifts it into the next
one. So it seems a little bit of control flow is needed; at least I don't
see a way around that.
 

Bjoern Hoehrmann

* Bjoern Hoehrmann wrote in comp.lang.c:
unsigned int
utf8toint(unsigned int c) {
[...]
}

For the reverse, I came up with this:

uint32_t
utf8dec(uint32_t c) {
/* drop leading bits in 3 byte seqs */
if ((c & 0x00C00000) == 0xC00000)
c &= 0xFF1FFFFF;

return ((c >> 6) & 0x1C0000)
+ ((c >> 4) & 0x03F000)
+ ((c >> 2) & 0x000FC0)
+ ((c >> 0) & 0x00007f);
}
 

websnarf

* (e-mail address removed) wrote in comp.lang.c:
    /* Spread the bits to their target locations */
    ret = (cp & UINT32_C(0x3f)) |
          ((cp << 2) & UINT32_C(0x3f00)) |
          ((cp << 4) & UINT32_C(0x3f0000)) |
          ((cp << 6) & UINT32_C(0x3f000000));
    /* Count the length */
    c  = (-(cp & 0xffff0000)) >> UINT32_C(31);
    c += (-(cp & 0xfffff800)) >> UINT32_C(31);
    c += (-(cp & 0xffffff80)) >> UINT32_C(31);
    /* Merge the spread bits with the mode bits */
    return ret | encodingmode[c];
}

Thanks, this is quite nice, although it does not work for code points in
the range U+0040 to U+007F; for those, the seventh bit should end up in
the least significant byte, while the code above shifts it into the next
one. So it seems a little bit of control flow is needed; at least I don't
see a way around that.

Crap, yeah. That's what I get for trying to eyeball it. So let's give
this another go:

#include "pstdint.h" /* http://www.pobox.com/~qed/pstdint.h */

uint32_t utf32ToUtf8 (uint32_t cp) {
    uint32_t c;
    static uint32_t emode[4] = { 0x0, 0xc080, 0xe08080, 0xf0808080 };
    static uint32_t mmode[4] = { 0x7f, 0x3f, 0x3f, 0x3f };

    /* Count the length */
    c  = (-(cp & 0xffff0000)) >> UINT32_C(31);
    c += (-(cp & 0xfffff800)) >> UINT32_C(31);
    c += (-(cp & 0xffffff80)) >> UINT32_C(31);

    return (cp & mmode[c]) |
           ((cp << 2) & UINT32_C(0x3f00)) |
           ((cp << 4) & UINT32_C(0x3f0000)) |
           ((cp << 6) & UINT32_C(0x3f000000)) |
           emode[c];
}
 

Dik T. Winter

>
> Care to explain this prediction? (I assumed continuous exponential
> growth over 5000 years, leading to a result of 1000 years of
> additional use; what's your model?)

Well, starting with 29929 code points in 1.0 (in 1991), which rose to
97786 code points in 4.1 (2005), that means an increase by a factor
of 3.27 in 14 years. Increasing it to 10FFFF code points (1114111)
would mean an increase by a factor of 37.23. But we can do better.
Assuming linear growth, I come to:
(1114111 - 29929) / (97786 - 29929) * 14 = 224.
And as 224 - 14 = 210...

Care to explain how you came to 1000 years with exponential growth?
With the figures above, with exponential growth I would expect
about 154 years.

But of course all models are flawed because the change from 2.1 to
3.0 saw an increase by 10307 code points (many more new scripts),
and 3.0 to 3.1 saw an increase by 44978 code points (Chinese).
 

Bjoern Hoehrmann

* (e-mail address removed) wrote in comp.lang.c:
Crap, yeah. That's what I get for trying to eyeball it. So let's give
this another go:
static uint32_t mmode[4] = { 0x7f, 0x3f, 0x3f, 0x3f };
return (cp & mmode[c]) |
((cp << 2) & UINT32_C(0x3f00)) |

But the mask needs to be 0x00 if c is zero, otherwise you would put
the bit into two places. So this would need to be:

#include "pstdint.h" /* http://www.pobox.com/~qed/pstdint.h */

uint32_t utf32ToUtf8 (uint32_t cp) {
    uint32_t c;
    static uint32_t emode[4] = { 0x00, 0xc080, 0xe08080, 0xf0808080 };
    static uint32_t mmode[4] = { 0x7f, 0x003f, 0x00003f, 0x0000003f };
    static uint32_t nmode[4] = { 0x00, 0x3f00, 0x003f00, 0x00003f00 };

    /* Count the length */
    c  = (-(cp & 0xffff0000)) >> UINT32_C(31);
    c += (-(cp & 0xfffff800)) >> UINT32_C(31);
    c += (-(cp & 0xffffff80)) >> UINT32_C(31);

    return ((cp << 0) & mmode[c]) |
           ((cp << 2) & nmode[c]) |
           ((cp << 4) & UINT32_C(0x003f0000)) |
           ((cp << 6) & UINT32_C(0x3f000000)) |
           emode[c];
}
 

Richard Tobin

Well, starting with 29929 code points in 1.0 (in 1991), which rose to
97786 code points in 4.1 (2005), that means an increase by a factor
of 3.27 in 14 years. Increasing it to 10FFFF code points (1114111)
would mean an increase by a factor of 37.23.

Very interesting, but completely detached from reality. The increase
in Unicode code points is not some natural process that might be
expected to be linear or exponential. It results from an effort first
to include all the characters in living languages (which is, I
believe, more or less complete), and then to add characters from dead
languages once they are satisfactorily enumerated. Barring some
remarkable discoveries, 0x10ffff should be quite enough.

-- Richard
 

Richard Tobin

On a modern processor you are getting your ass kicked on the control
flow.

That depends. You also need to take into account the distribution of
your data. If it consists only of English text, then 99.9% of the
characters will be ASCII, so an immediate test

if (c < 0x80) return c;

is a big win. If you include western European languages, it will
still get about 90% of characters. Obviously if you have a lot of
Chinese texts the situation will be different.

Also, if I understand correctly, modern processors can speculatively
execute multiple branches, so the control flow may not be such a
problem and the branches may even be computed in parallel. There's
no substitute for benchmarking.

-- Richard
 

websnarf

That depends. You also need to take into account the distribution of
your data. If it consists only of English text, then 99.9% of the
characters will be ASCII, so an immediate test

if (c < 0x80) return c;

is a big win. If you include western European languages, it will
still get about 90% of characters.

Tell that to the Greeks, French or Russians. The above is a good
idea, basically for English, and may be ok for Spanish and German.
[...] Obviously if you have a lot of
Chinese texts the situation will be different.

No, Chinese would at least have a consistent branching pattern and
therefore not pay those sorts of penalties.
Also, if I understand correctly, modern processors can speculatively
execute multiple branches,

I am not aware of any processor which does that. It's a bit harder to
do than you might think. (If there are multiple outstanding
speculations, do you try to do two-sided branch flow for all of them?)
[...] so the control flow may not be such a
problem and the branches may even be computed in parallel. There's
no substitute for benchmarking.

Knowledge is sometimes useful. That's why you don't need to benchmark
a bubble sort against a heap sort.
 

Keith Thompson

Tell that to the Greeks, French or Russians. The above is a good
idea, basically for English, and may be ok for Spanish and German.

Greek and Russian are not western European languages. French uses
accented characters, but the majority of typical French text is plain
ASCII, n'est-ce pas?

[...]
 

Richard Tobin

is a big win. If you include western European languages, it will
still get about 90% of characters.
Tell that to the Greeks, French or Russians.

Russian is not western European. French has accented characters, but
less than 10% (yes, I checked some examples). Overall, I believe that
about 90% of characters in western European languages will be from
the Unicode range below 0x80.

-- Richard
 

Dik T. Winter

> In article <[email protected]>,

>
> Russian is not western European.

How do you define western European?
> Russian is not western European. French has accented characters, but
> less than 10% (yes, I checked some examples). Overall, I believe that
> about 90% of characters in western European languages will be from
> the Unicode range below 0x80.

That may be the case. But there is only one of the western European
languages that will fit completely in that range, and I do not know
whether it is indeed 10% accented characters in all other languages
(and there are more than you think). I would think that figure is
exceeded in Frisian, one of the official languages of the Netherlands.
 

Joachim Schmitz

Dik T. Winter said:
How do you define western European?
Does it matter? Russia is eastern Europe and even goes into Asia; there's no
European country that is further east, so it can't be western, can it?

Eastern Europe used to be divided from western Europe by the Iron Curtain.
Even now this still is the line to draw, with the minor exception of eastern
Germany, maybe :cool:

Also, Russia uses a completely different character set, and so do the Greeks.
The French only have a couple of accented characters, the Dutch one extra
character (ij), etc., in addition to the Latin alphabet.

Bye, Jojo
 

Richard Bos

Dik T. Winter said:
How do you define western European?

From the West of Europe, obviously. Russia is about as far East as you
can go while still being in Europe.
That may be the case. But there is only one of the western European
languages that will fit completely in that range,

Two, if you count dead languages. Since English is completely identical
to Latin in all other regards (hence the ban on split infinitives), this
is but proper.
and I do not know whether it is indeed 10% accented characters in all
other languages (and there are more than you think). I would think
that figure is exceeded in Frisian, one of the official languages of
the Netherlands.

Is not! It's a speech defect. But no, you'd be surprised how few accents
there are in a typical Frisian text. Odd ones, such as a circumflex on
the 'y', but not that many. Too bloody many unaccented 'y's and 'j's
inserted any which way, but not that many accents.


All this, and nobody has yet mentioned that categorical statements about
what kind of code is more time-efficient are usually the sign of a very
poor programmer? Measure, people, measure! And don't be surprised to
find a difference of only 1% either way.

Richard
 

Dik T. Winter

>
> Does it matter? Russia is eastern Europe and even goes into Asia; there's no
> European country that is further east, so it can't be western, can it?

But there are many people who would not call German western European either.
Rather central European.
> Eastern Europe used to be divided from western Europe by the Iron Curtain.
> Even now this still is the line to draw, with the minor exception of eastern
> Germany, maybe :cool:

Ah. But because Greece and in fact also Yugoslavia were not behind the
Iron Curtain they belong to Western Europe? And Cyrillic is also used
in some of the former Yugoslavian Republics (it was actually invented
in Croatia, but not used there). And we have now also Bulgaria in the
EU, using a Cyrillic script. It will not be long before the banknotes
are adapted to include that script.
> Also, Russia uses a completely different character set, and so do the Greeks.
> The French only have a couple of accented characters, the Dutch one extra
> character (ij), etc., in addition to the Latin alphabet.

Take care. Linguists do not agree that "ij" is an extra character in
Dutch; it is a bit controversial. And you are missing the accented letters
that are used in Dutch quite a lot (the diaeresis amongst others, with a
function different from the Umlaut in German; our neighbouring country is
called België in Dutch).
 

Dik T. Winter

>
> From the West of Europe, obviously. Russia is about as far East as you
> can go while still being in Europe.

But that script is used further west than the westernmost part of (former)
Russia.
>
> Is not! It's a speech defect.

Perhaps, but in that case it is an official speech defect. And there are
two more such in the Netherlands (called regional languages).
> All this, and nobody has yet mentioned that categorical statements about
> what kind of code is more time-efficient are usually the sign of a very
> poor programmer? Measure, people, measure! And don't be surprised to
> find a difference of only 1% either way.

Right. And: make it correct first, bother about optimisation only when
time is a problem.
 
