UTF-8 and strings

  • Thread starter John M. Dlugosz

John M. Dlugosz

I may soon be working on a code base that is undergoing
"internationalization", and what is already done on low levels of
operating system interaction is using UTF-8 in char-based strings.

Clearly, new code needs to use Unicode, or _some_ way of representing
a lot more than 256 characters. UTF-8 and Unicode have merit as an
encoding and transmission format, and I'm certainly not against them.

But std::string models a sequence of *characters*, and that doesn't
fit with multi-byte variable-length character encodings of any kind.
Furthermore, one character won't fit in a char.

Before I get in too deep with my thoughts, I want to see what others
are doing. Is there any existing information on this topic? Is there
a better place to discuss it?

My initial thought is that most of the time the code simply hangs
onto the string data and doesn't manipulate it, so "funny" contents
won't matter much. Furthermore, UTF-8 avoids (by design) many of the
problems found with multi-byte encodings, so a naive handling might
work "well enough" for common tasks. However, I should catalog in
detail just what works and what the problems are. For more general
manipulation of the string, we need functions like the C library's
mblen etc. and replacements for standard library routines whose
implementation doesn't handle UTF-8 as multi-byte character strings.
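For instance, the "naive handling" point can be made concrete: counting code points rather than bytes is nearly a one-liner, assuming well-formed UTF-8 (a hypothetical helper, not part of any standard library):

```cpp
#include <cstddef>
#include <string>

// Count Unicode code points in a UTF-8 encoded std::string by skipping
// continuation bytes (those of the form 10xxxxxx).  Assumes well-formed
// UTF-8 input; s.size() would instead report the byte length.
std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        if ((c & 0xC0) != 0x80)  // not a continuation byte
            ++count;
    }
    return count;
}

// "héllo" is 6 bytes in UTF-8 (é = 0xC3 0xA9) but 5 code points.
```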

Or, is there some paradigm shift I should be aware of?

—John
 

Asger-P

Hi John

I am by no means a C++ expert; I use C++Builder, which is fully
Unicode now, but I can see that there is a wstring in the STL, so
my suggestion would be to translate all incoming and outgoing
data and then internally use wchar_t and wstring.

You didn't mention whether you are on a specific operating system,
but Windows works with wchar_t internally, and I suppose Unix, Linux,
and Mac do the same, though I don't know.


Best regards
Asger-P
 

Balog Pal

John M. Dlugosz said:
I may soon be working on a code base that is undergoing
"internationalization", and what is already done on low levels of
operating system interaction is using UTF-8 in char-based strings.

Check out std::wstring and Glib::ustring for discussion. I'd bet Boost
also has i18n-related stuff.
 

Jeff Flinn

John said:
I may soon be working on a code base that is undergoing
"internationalization", and what is already done on low levels of
operating system interaction is using UTF-8 in char-based strings.


Check the Boost mailing list. There's been some discussion recently, as
well as a presentation at BoostCon last month.

Jeff
 

diamondback

I may soon be working on a code base that is undergoing
"internationalization", and what is already done on low levels of
operating system interaction is using UTF-8 in char-based strings.


John,

I actually had the pleasure(?) of working on some I18n projects in the
past. Most of my work was done in C, so the standard C++ libraries
were not a consideration. However, I think you are correct that a
"naive" implementation will go a long way to accomplishing what you
need. UTF-8 is explicitly designed to be backwards compatible with
ASCII for the first 128 characters.

The challenge, of course, comes when you need to count, manipulate, or
render the text with complex or multibyte characters. I am not aware
of any ubiquitous standard or library that makes this easier. There
are, of course, numerous proprietary libraries available, but I cannot
speak to their quality. Balog probably had the best suggestion to look
into the Glib::ustring library. Frankly, if the UTF-8 structure is
understood, it is fairly trivial to roll your own small library to do
the manipulation. But if, as you said, you are not doing a lot of
manipulation, that consideration is minimal.

All the best,
DiamondBack
 

Joshua Maurice

I may soon be working on a code base that is undergoing
"internationalization", and what is already done on low levels of
operating system interaction is using UTF-8 in char-based strings.


If you want to work on Win32 and Unix-like systems, avoid wstring and
wchar_t like the plague: wchar_t is 16 bits on Win32 and 32 bits on most
Unix-like systems.

String data is itself just data. You could pass it around in a
std::vector<char> if you wanted. The interesting parts come when you
need to inspect and transform the data.

Printing glyphs is well beyond my expertise, and probably beyond
portable C++. That's GUI territory.

Sorting and collation: the same string can be encoded in different
ways, so you need to be able to go to a normalized form to do string
comparison. Sorting is also non-trivial. It is language- and
situation-dependent, and a lot of the sort orders are /not/
lexicographic; that is, you cannot simply do a memcmp. (At least not
on the strings themselves. There are libraries that can put strings
into another "normalized" form, which differs depending on the sort
order, in which you can do comparisons with memcmp.) For example, the
normal German sort order is slightly different from the German
phonebook sort order. IIRC, in French, you first sort the strings left
to right ignoring accents, and for strings differing only in accents,
you sort /right to left/ based on accents. It's non-trivial if you
want to meet user expectations.
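For what it's worth, portable C++ does expose a hook for locale-aware comparison via std::collate, though whether the result honors, say, German phonebook order depends entirely on the platform's locale tables (a minimal sketch; `collate_compare` is a hypothetical helper name):

```cpp
#include <locale>
#include <string>

// Locale-aware, potentially non-lexicographic string comparison using the
// std::collate facet.  Returns <0, 0, or >0 like strcmp.  How "correct"
// the ordering is depends on the platform's locale data.
int collate_compare(const std::string& a, const std::string& b,
                    const std::locale& loc = std::locale()) {
    const std::collate<char>& coll =
        std::use_facet<std::collate<char> >(loc);
    return coll.compare(a.data(), a.data() + a.size(),
                        b.data(), b.data() + b.size());
}
```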

String concatenation and substringing: encoding units are not Unicode
code points, which in turn are not "user-perceived characters", aka
"grapheme clusters". Latin small letter e with acute accent, é, can be
encoded in two different ways: as one Unicode code point (the
precomposed character é), or as two code points (Latin small letter e
followed by the combining acute accent). Thus, if your program wants
to manipulate data based on user requests, such as "remove the last
character", you need to be careful about what "character" means. Does
it mean Unicode code point, or grapheme cluster?
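The two encodings of é mentioned above can be shown in raw UTF-8 bytes (a minimal demo; the byte values come from the Unicode tables):

```cpp
#include <string>

// Two canonically equivalent UTF-8 encodings of the same user-perceived
// character "é":
const std::string precomposed = "\xC3\xA9";   // U+00E9, precomposed form
const std::string decomposed  = "e\xCC\x81";  // U+0065 + U+0301 combining acute

// They render identically but compare unequal byte-for-byte, so a naive
// std::string comparison treats them as different strings.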

And then there's transformations between encodings. Any good Unicode
library should cover that for you.

That's about all I remember off the top of my head.
 

Nobody

But std::string models a sequence of *characters*, and that doesn't
fit with multi-byte variable-length character encodings of any kind.
Furthermore, one character won't fit in a char.

Before I get in too deep with my thoughts, I want to see what others
are doing. Is there any existing information on this topic? Is there
a better place to discuss it?

If you want to manipulate "characters", then use std::wstring internally.
For converting between std::string and std::wstring, possible approaches
include:

1. Writing the code yourself (simple if you only use UTF-8)
2. mbstowcs and wcstombs (C99, <cstdlib>).
3. iconv (Unix)
4. MultiByteToWideChar and WideCharToMultiByte (Windows)

Each has at least one drawback. Approach 1 requires that you write the
encoding and decoding functions yourself. Approach 2 requires setting
the locale to a UTF-8 locale before any conversion and changing it back
afterwards. Approaches 3 and 4 aren't portable.
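A sketch of approach 2 (mbstowcs with a temporary locale switch). The locale names tried here are assumptions; availability and spelling vary by platform, and error handling is kept minimal:

```cpp
#include <clocale>
#include <cstdlib>
#include <string>
#include <vector>

// Convert a UTF-8 std::string to std::wstring via mbstowcs, switching
// LC_CTYPE to a UTF-8 locale around the call and restoring it afterwards.
std::wstring widen_utf8(const std::string& s) {
    std::string saved = std::setlocale(LC_CTYPE, NULL);  // remember current locale
    if (!std::setlocale(LC_CTYPE, "en_US.UTF-8"))
        std::setlocale(LC_CTYPE, "C.UTF-8");             // fallback name; may also fail
    std::vector<wchar_t> buf(s.size() + 1);              // wide length <= byte length
    std::size_t n = std::mbstowcs(&buf[0], s.c_str(), buf.size());
    std::setlocale(LC_CTYPE, saved.c_str());             // restore previous locale
    if (n == static_cast<std::size_t>(-1))
        return std::wstring();                           // sequence invalid for locale
    return std::wstring(&buf[0], n);
}
```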
 

Jorgen Grahn

....

The challenge, of course, comes when you need to count, manipulate, or
render the text with complex or multibyte characters. I am not aware
of any ubiquitous standard or library that makes this easier. There
are, of course, numerous proprietary libraries available, but I cannot
speak to their quality. Balog probably had the best suggestion to look
into the Glib::ustring library.

'libiconv' is popular and apparently standardized on Unix (i.e. pretty
much everything except Windows). http://en.wikipedia.org/wiki/Iconv

/Jorgen
 

Jorgen Grahn

If you want to work on Win32 and Unix-like systems, avoid wstring and
wchar_t like the plague: wchar_t is 16 bits on Win32 and 32 bits on most
Unix-like systems.

If you mean the 16/32 difference is the reason to avoid wchar_t, why
exactly? I don't always need to know the exact size of a long; why
should wchar_t be any different?

Of course, if the trend is "let the normal representation of strings
be UTF-8" then the answer doesn't matter.

/Jorgen
 

Nobody

If you want to work on Win32 and Unix-like systems, avoid wstring and
wchar_t like the plague: wchar_t is 16 bits on Win32 and 32 bits on most
Unix-like systems.

That's a silly reason to avoid std::wstring and wchar_t. You may as well
suggest avoiding "int" or "long" for similar reasons.

If you need to deal with individual "characters", wchar_t and std::wstring
are the most appropriate solution. Multi-byte strings are only sensible if
you're mainly treating strings as opaque values, passing them around
without paying too much attention to what is inside them.

Most of the awkward issues which affect internationalisation (e.g.
pre-composed characters versus composing characters, lexicographic
ordering, equivalence, etc) apply equally to multi-byte or wide strings,
so avoiding wide strings doesn't gain you anything here.
 

Marc

Jorgen said:
If you mean the 16/32 difference is the reason to avoid wchar_t, why
exactly? I don't always need to know the exact size of a long; why
should wchar_t be any different?

32 bits are enough that any Unicode character fits in a single
wchar_t, so you can work on those almost (OK, that's a big "almost")
as easily as with plain old ASCII. 16 bits force you to use some
variable-length encoding like UTF-16, so this is just as complicated
as UTF-8.
 

Joshua Maurice

If you mean the 16/32 difference is the reason to avoid wchar_t, why
exactly? I don't always need to know the exact size of a long; why
should wchar_t be any different?

Of course, if the trend is "let the normal representation of strings
be UTF-8" then the answer doesn't matter.

If the goal is UTF-8 encoding, then obviously wstring is a bad choice.
If the goal is constant width encoding, then that's UTF-32, which is
not wstring. You ought to use std::basic_string specialized on
uint32_t or whatever the name is for a guaranteed 32 bit character
type in C++0x. (I'd rather roll my own honestly, but meh.)

If you're not going to inspect or manipulate the string data, might as
well use vector<char>. If you are going to do string-y things, you
probably want to rely on a particular encoding, though I suppose it's
possible to do the manipulations agnostic to whether it's UTF-8 or
UTF-16. If it's UTF-32, I suspect there'd be too much reliance on the
assumption that it's a constant width encoding of Unicode code
points.
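As it happens, C++0x standardized exactly such a type: char32_t, with std::u32string as the ready-made basic_string specialization, so there is no need to roll your own (a minimal illustration):

```cpp
#include <string>

// C++11 (then C++0x) provides char32_t and std::u32string, i.e.
// std::basic_string<char32_t>: a fixed-width 32-bit code unit.
std::u32string s = U"caf\u00E9";  // "café": four code points

// Indexing is O(1) per code point -- s[3] is the 'é' directly,
// unlike a UTF-8 std::string, where it would span two bytes.
```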

Really though, the main reason I said that is that I don't see the use
case for having UTF-16 on some platforms and UTF-32 on others. There's
no requirement to do that, and it makes no sense. This is different
from the question of how big a long is: there you're not using arrays
of longs to store variable-width (or, for UTF-32, constant-width)
encodings of data. Here you are, so the exact size does matter.
 

Joshua Maurice

If you want to manipulate "characters", then use std::wstring internally.
For converting between std::string and std::wstring, possible approaches
include:

1. Writing the code yourself (simple if you only use UTF-8)
2. mbstowcs and wcstombs (C99, <cstdlib>).
3. iconv (Unix)
4. MultiByteToWideChar and WideCharToMultiByte (Windows)

Each has at least one drawback. 1 requires that you write the encoding
and decoding functions. 2 requires setting the locale to a UTF-8 locale
before any conversion then changing it back afterwards. 3 and 4 aren't
portable.

I don't see ICU listed, which to my limited knowledge is the default
solution for serious Unicode string processing and manipulation.
 

Miles Bader

Marc said:
32 bits are enough that any Unicode character fits in a single
wchar_t, so you can work on those almost (OK, that's a big "almost")
as easily as with plain old ASCII. 16 bits force you to use some
variable-length encoding like UTF-16, so this is just as complicated
as UTF-8.

Exactly. UTF-16 offers no simplicity advantage over UTF-8, and suffers
from some significant disadvantages.

In practice, I suppose that many Windows apps probably just ignore
anything outside the BMP, pretend "it's all 16-bit!", and as a result
suffer from mysterious and bizarre bugs when such characters crop up...

-Miles
 

Joshua Maurice

Jorgen Grahn  wrote:


32 bits are enough that any Unicode character fits in a single
wchar_t, so you can work on those almost (OK, that's a big "almost")
as easily as with plain old ASCII. 16 bits force you to use some
variable-length encoding like UTF-16, so this is just as complicated
as UTF-8.

You can work with it almost like ASCII, except for combining character
and multi-code point grapheme clusters.
 

Joshua Maurice

Exactly. UTF-16 offers no simplicity advantage over UTF-8, and suffers
from some significant disadvantages.

I wouldn't say that. That's a very Europe-centric view. UTF-16
results in smaller memory footprints for a variety of Asian scripts
compared to UTF-8.
 

Joshua Maurice

You can work with it almost like ASCII, except for combining character
and multi-code point grapheme clusters.

(Hit send too soon.) And of course sorting, equivalence comparisons,
and... never mind.
 

Miles Bader

Joshua Maurice said:
I wouldn't say that. That's a very Europe-centric view. UTF-16
results in smaller memory footprints for a variety of Asian scripts
compared to UTF-8.

That's why I said "no simplicity advantage" instead of "no
advantage"... :]

In practice I don't think the space savings are worth the pain for most
apps, even if dealing with lots of CJK or something.

-Miles
 

tm

Jorgen Grahn  wrote:


32 bits are enough that any Unicode character fits in a single
wchar_t, so you can work on those almost (OK, that's a big "almost")
as easily as with plain old ASCII. 16 bits force you to use some
variable-length encoding like UTF-16, so this is just as complicated
as UTF-8.

That's the reason I used UTF-32 as the encoding in my
libraries. As you said, 32 bits are enough that any Unicode
character fits. Libraries with UTF-8 encoding need to
distinguish between byte index and character index: to get
character number 12345 from a UTF-8 string, the string
must be processed from the beginning. OK, UTF-32 wastes
some space, but many 64-bit integers also contain values
that would fit into a byte. I think that most programs deal
only with strings of several thousand up to a million
characters, which is no problem for today's gigabyte machines.
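The linear scan described above can be sketched as follows (a hypothetical helper, assuming well-formed UTF-8):

```cpp
#include <cstddef>
#include <string>

// Why UTF-8 indexing is O(n): to locate code point number `idx`, the
// string must be walked from the start, skipping continuation bytes
// (bytes of the form 10xxxxxx).  A UTF-32 string would just use s[idx].
std::size_t utf8_byte_offset(const std::string& s, std::size_t idx) {
    std::size_t i = 0;
    while (i < s.size() && idx > 0) {
        ++i;  // step past the lead byte of the current character
        while (i < s.size() &&
               (static_cast<unsigned char>(s[i]) & 0xC0) == 0x80)
            ++i;  // step past its continuation bytes
        --idx;
    }
    return i;  // byte offset where code point `idx` begins
}
```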

The hard part of UTF-32 strings is that they need to be converted
every time you talk to the OS. This needs to be encapsulated, since
operating systems have different ideas about what their native string
type should be. For my libraries this work is already done. Note that
the conversions to and from operating-system strings turned out to be
NO performance problem.

BTW, the description of my libraries is online; see:

http://seed7.sourceforge.net/libraries

Sorry to say, these are not C++ libraries; the basic functions are
written in C and are licensed under the GPL. So they might be of use
for C++ as well.


Greetings Thomas Mertes

--
Seed7 Homepage: http://seed7.sourceforge.net
Seed7 - The extensible programming language: User defined statements
and operators, abstract data types, templates without special
syntax, OO with interfaces and multiple dispatch, statically typed,
interpreted or compiled, portable, runs under linux/unix/windows.
 
