UTF-8 and strings

  • Thread starter John M. Dlugosz

MikeP

Jorgen said:
Of course, if the trend is "let the normal representation of strings
be UTF-8" then the answer doesn't matter.

What is this enamourment with UTF-8 anyway?
 

MikeP

Miles said:
Exactly. utf-16 offers no simplicity advantage over utf-8, and
suffers from some significant disadvantages.

"Given" (is it a given?) that "UTF-16" is likely to in reality be UCS-2
(?), then much simpler. And what are the consequences of making that
assumption?
In practice, I suppose that many Windows apps probably just ignore
anything outside the BMP, pretend "it's all 16-bit!",

Yes, yes!
and as a result
suffer from mysterious and bizarre bugs when such characters crop
up...

Do tell, please.
 

none

I may soon be working on a code base that is undergoing
"internationalization", and what is already done on low levels of
operating system interaction is using UTF-8 in char-based strings.

Clearly, new code needs to use Unicode, or _some_ way of representing
a lot more than 256 characters. UTF-8 and Unicode have merit as an
encoding and transmission format, and I'm certainly not against it.

You need to get your terminology right and understand clearly the
differences between Unicode, UTF-8, UTF-16, UTF-32 and UCS-2. You
don't use "Unicode" in an application. You might use what Windows
refers to as "Unicode", but it will likely be UTF-16 or UCS-2 in
reality.
But std::string models a sequence of *characters*, and that doesn't
fit with multi-byte variable-length character encodings of any kind.
Furthermore, one character won't fit in a char.

Before I get in too deep with my thoughts, I want to see what others
are doing. Is there any existing information on this topic? Is there
a better place to discuss it?

My initial thought is that most of the time the code simply hangs
onto the string data and doesn't manipulate it, so "funny" contents
won't matter much. Furthermore, UTF-8 avoids (by design) many of the
problems found with multi-byte encodings, so a naive handling might
work "well enough" for common tasks. However, I should catalog in
detail just what works and what the problems are. For more general
manipulation of the string, we need functions like the C library's
mblen etc. and replacements for standard library routines whose
implementation doesn't handle UTF-8 as multi-byte character strings.
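
For illustration only, a rough sketch of the kind of mblen-style helpers
meant here; the names utf8_seqlen and utf8_codepoints are invented for this
sketch and are not from any particular library:

#include <cstddef>
#include <string>

// Byte length of the UTF-8 sequence starting at p, 0 at end of string,
// or -1 if p does not point at a valid lead byte.  Analogous to mblen().
int utf8_seqlen(const char* p, std::size_t remaining)
{
    if (remaining == 0 || *p == '\0') return 0;
    unsigned char b = static_cast<unsigned char>(*p);
    if (b < 0x80) return 1;      // plain ASCII
    if (b < 0xC0) return -1;     // continuation byte, not a lead byte
    if (b < 0xE0) return 2;      // 110xxxxx
    if (b < 0xF0) return 3;      // 1110xxxx
    if (b < 0xF8) return 4;      // 11110xxx
    return -1;                   // not valid in modern UTF-8
}

// Count code points by skipping continuation bytes (10xxxxxx).
std::size_t utf8_codepoints(const std::string& s)
{
    std::size_t n = 0;
    for (std::string::size_type i = 0; i < s.size(); ++i)
        if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80) ++n;
    return n;
}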

Or, is there some paradigm shift I should be aware of?


What will you do with your data?
What will you mostly be interfacing with when handling your
internationalised data?
What's your OS?

I can't see the point in converting most of the time. If your main
purpose is to manipulate, search, transform, or reorder strings, then
the convenience of UTF-32 might win out. If you are sending data over
the network, you probably should use UTF-8. If you are interfacing
with an OS often, you probably would be better sticking with the
common Unicode representation used by that OS. If you are often
interfacing with a GUI library, you may as well keep the data in the
same format.
 

MikeP

none said:
Unless one craves the simplicity of constant-width characters.

That pretty much forces you to UTF-32 then, which is a massive waste.

Or use UTF-16 as UCS-2, which sounds pretty reasonable still on Windows
(?).
 

John M. Dlugosz

You need to get your terminology right and understand clearly the
differences between Unicode, UTF-8, UTF-16, UTF-32 and UCS-2.  You
don't use "Unicode" in an application.  You might use what Windows
refers to as "Unicode", but it will likely be UTF-16 or UCS-2 in
reality.  
I'm quite aware of the precise terminology. To summarize, Unicode is
a mapping of cataloged characters to ordinal values, and unlike older
catalogs does not imply any specific means of representing a list of
integers in files or memory. UTF-8, OTOH, is a specific encoding of
such a list as a sequence of bytes.

What in my message makes it seem that I don't know the difference? I
don't see any sloppy use of the terminology, and I certainly used my
terms rigorously and precisely in the parts you quoted.
What will you do with your data?

Anything and everything that applications do with strings. It's not a
publishing system or word processor, so strings are mainly incidental:
labels, file names, and values used in the GUI.
What will you mostly be interfacing with when handling your
internationalised data?

Just the OS primitives and other like-minded modules.
What's your OS?

It's portable code that needs to work on a variety of OSs.

If you are interfacing
with an OS often, you probably would be better sticking with the
common Unicode representation used by that OS.  If you are often
interfacing with a GUI library, you may as well keep the data in the
same format.  

The problem is that "common code" needs to be applicable to all
operating systems. Anything facing the OS will be abstracted, using
standard library classes or modules made for the purpose. The code
base already uses UTF-8 internally and I'm exploring the ramifications
of that, and how to do it correctly.

—John
 

John M. Dlugosz

Hi John

I am by no means a C++ expert; I use C++Builder, which is fully
Unicode now, but I can see that there is a wstring in the STL, so
my suggestion would be to translate all incoming and outgoing
data and then internally use wchar_t and wstring.

You didn't mention whether you are on a specific operating system,
but Windows works with wchar_t internally, and I suppose Unix, Linux
and Mac do the same, though I don't know.

Best regards
Asger-P

The question isn't "are there better ways to handle things than using
UTF-8 internally."
My question is, "*given* that this project uses UTF-8 internally, and
will do more of that as development continues, and the C++ standard
library doesn't have a class that models that concept cleanly, what
are good ways to deal with it?"
I'd love to solve the problem differently, and not care about memory
consumption either (just use 32-bit characters), but that's not where
I am on this.

—John
 

John M. Dlugosz

Check out std::wstring and Glib::ustring for discussion. I'd bet Boost
also has i18n-related stuff.

Thanks for the pointer to Glib::ustring. The description and
discussion sound like exactly my concerns, and it is used by other
libraries already, so it's probably nicely adoptable.

—John
 

John M. Dlugosz

Jorgen Grahn wrote:
32 bits are enough that any Unicode character fits in a single
wchar_t, so you can work on those almost (ok, that's a big "almost")
as easily as with plain old ascii. 16 bits force you to use some
variable length encoding like utf-16, so this is just as complicated
as utf-8.

That's my feeling exactly. UTF-16 is "neither here nor there" in that
you still can't assume one cell per code point but have to deal with
surrogate pairs, and it takes up more memory for ASCII and Western
languages, and opens up byte-ordering issues.
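
For a concrete sense of that trade-off, a small illustration (it assumes a
C++11 compiler, pre-C++20, where u8 string literals are still char-based;
the sample text is arbitrary):

#include <iostream>
#include <string>

int main()
{
    // "café " plus one code point outside the BMP (U+1F600)
    std::string    s8  = u8"caf\u00E9 \U0001F600";  // UTF-8 code units (bytes)
    std::u16string s16 = u"caf\u00E9 \U0001F600";   // UTF-16 code units
    std::u32string s32 = U"caf\u00E9 \U0001F600";   // UTF-32 code units

    std::cout << s8.size()  << '\n';  // 10: "caf"=3, e-acute=2, space=1, emoji=4
    std::cout << s16.size() << '\n';  // 7:  the emoji needs a surrogate pair
    std::cout << s32.size() << '\n';  // 6:  one unit per code point
}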

OTOH, manipulating "characters" doesn't necessarily mean 1:1 with
"code points" anyway, as someone else pointed out. I'm not worried
about that now, since the project isn't a word processor and any use
of Backspace will be in the platform-specific GUI components.

—John
 

Joshua Maurice

The question isn't "are there better ways to handle things than using
UTF-8 internally."
My question is, "*given* that this project uses UTF-8 internally, and
will do more of that as development continues, and the C++ standard
library doesn't have a class that models that concept cleanly, what
are good ways to deal with it?"
I'd love to solve the problem differently, and not care about memory
consumption either (just use 32-bit characters), but that's not where
I am on this.

Let me know if you figure that out. My own company uses ICU,
specifically a modified version of ICU to get rid of all of that
stupid virtual function call per character nonsense. It also
unfortunately uses basically UCS-2 strings. A lot of it was done
before UTF-32 and the newer Unicode standard came out.

If I were on a new project, I would like to use a pre-existing
library, but AFAIK no library handles it correctly out of the box.
Thus, I would personally want to write my own string class and divorce
it from the ill-specified and non-portable C++ locale stuff, though
I'd probably give it a std::stringstream equivalent.

What I would want from a good Unicode string class is:
- Well, 3 classes actually. I'd have utf8string, utf16string,
utf32string. I'd code it in such a way to ensure little code
duplication.
- Each one would have three different iterator types:
encoding_unit_iter, code_point_iter, and grapheme_cluster_iter.
- "begin" "end" et. al. get removed, and instead you have one begin/
end pair for each of the 3 iterator types (ex: begin_encoding_unit(),
end_encoding_unit()). encoding_unit_iter is a random access iterator,
and the rest are not. (Well, except for utf32 where code_point_iter is
also random access iterator. In the interests of easily changing
encodings in the program with a recompilation, I might have a separate
utf32string class for the expressed purpose of having its
code_point_iter be /not/ random access.) Of course, there would be
constructors to go from grapheme_cluster_iter to code_point_iter and
encoding_unit_iter, and code_point_iter to encoding_unit_iter. There
would be named conversion functions to force conversions the other way
(as they may not be valid). Note also that grapheme_cluster_iter may
be language and culture dependent (I forget?), and so the begin and
end functions for that may require a locale object (or whatever passes
for a locale object in my new crazy scheme).
- I would remove operator[], but have an explicit subscript function
named something like
encoding_unit_t& utfstring::get_encoding_unit(size_t i);
encoding_unit_t utfstring::get_encoding_unit(size_t i) const;
- I would remove all other functions that use integer offsets. They
would be replaced with functions that work in terms of code-point
iterators, and others that work in terms of iterators over this
string's encoding-unit type. The functions would have obvious names
as such.
- I'm not sure if I would want to have a fully mutable interface, or
go read-only and make it thread-safe ala Java. It makes little sense
IMHO to allow mutable utfstrings in the general case. You can't
replace one character with another (except maybe in utf32 strings) as
the code points (or grapheme clusters!) may have different encoding
lengths.
- Stand alone functions that translate from one encoding to another.
Or actually stand alone objects for translation between encodings in
order to support stateful encodings (of which the UTFs are not, thank
god, but some encodings are).
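
To make the shape of that interface concrete, here is a declaration-only
sketch; everything in it is hypothetical and simply restates the list above,
with no implementation behind it:

#include <cstddef>

class encoding_unit_iter;      // random access over raw code units
class code_point_iter;         // forward iteration over code points
class grapheme_cluster_iter;   // forward iteration over grapheme clusters
class locale_object;           // stand-in for whatever the locale becomes
class utf16string;

class utf8string
{
public:
    typedef char encoding_unit_t;   // one UTF-8 code unit

    // No plain begin()/end(): callers must pick a granularity.
    encoding_unit_iter    begin_encoding_unit() const;
    encoding_unit_iter    end_encoding_unit() const;
    code_point_iter       begin_code_point() const;
    code_point_iter       end_code_point() const;
    grapheme_cluster_iter begin_grapheme_cluster(const locale_object&) const;
    grapheme_cluster_iter end_grapheme_cluster(const locale_object&) const;

    // Named subscript instead of operator[]:
    encoding_unit_t  get_encoding_unit(std::size_t i) const;
    encoding_unit_t& get_encoding_unit(std::size_t i);

    // Positional operations take iterators, never integer offsets:
    utf8string substr(code_point_iter first, code_point_iter last) const;
};

// utf16string and utf32string would mirror this, sharing code via a common
// template; stand-alone converter functions or objects go between encodings.
utf8string to_utf8(const utf16string& s);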

My crazy scheme would probably make a lot of use of existing ICU
classes, such as for collation and translation. Especially it would
make use of their character and locale data files. (Assuming the
license would allow that. I don't know offhand. Gotta get the up to
date and machine readable character and locale data files from
somewhere.)

WARNING! Pardon my ranting. I've done little actual work with Unicode,
but what little work I have done has led me to these conclusions.
Take at your own risk.
 

Nobody

I don't see ICU listed, which to my limited knowledge is the default
solution for serious Unicode string processing and manipulation.

Too much overhead if you don't need "serious" Unicode support. Which most
applications don't; they just want to support it to the level that the OS
and external libraries do.
 

none

Or use UTF-16 as UCS-2, which sounds pretty reasonable still on Windows
(?).

Use real UCS-2: probably acceptable.

Use UTF-16 as if it were UCS-2: er, this is how bugs happen.
- I want the simplicity of constant-width characters
- I choose UTF-16 but ignore the existence of surrogate pairs
- Oops, bang!

It's even on Wikipedia (not a good reference, but still):
"Because the most commonly used characters are all in the Basic
Multilingual Plane, handling of surrogate pairs is often not tested
thoroughly. This leads to persistent bugs, and potential security
holes, even in popular and well-reviewed application software."
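
A short demonstration of the failure mode (U+1F600 is just an arbitrary
non-BMP code point; assumes a C++11 compiler):

#include <cassert>
#include <string>

int main()
{
    std::u16string s = u"A\U0001F600";       // 'A' plus one non-BMP character

    // The UCS-2 assumption "one unit = one character" is already wrong:
    assert(s.size() == 3);                   // 1 unit for 'A', 2 for the pair

    // "Truncate to two characters" splits the pair and corrupts the text:
    std::u16string broken = s.substr(0, 2);
    assert(broken[1] >= 0xD800 && broken[1] <= 0xDBFF);  // unpaired high surrogate
}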

Yannick
 

none

I'm quite aware of the precise terminology. To summarize, Unicode is
a mapping of cataloged characters to ordinal values, and unlike older
catalogs does not imply any specific means of representing a list of
integers in files or memory. UTF-8, OTOH, is a specific encoding of
such a list as a sequence of bytes.

What in my message makes it seem that I don't know the difference? I
don't see any sloppy use of the terminology, and I certainly used my
terms rigorously and precisely in the parts you quoted.

Sorry then. I read the statements "already ... is using UTF-8",
"...new code needs to use Unicode" and "UTF-8 and Unicode have merit
..." as if you meant that they were mutually exclusive
alternatives: "Unicode" or "UTF-8". Apologies.

Drifting here, but using Unicode is a no-brainer. Don't even think
about fussing around with Windows code pages, ISO-8859-?, Shift-JIS,
EUC-JP, EUC-CN, ISO-2022-??, etc. These are enough to give you
nightmares. So this leaves you only with the question of which
representation of Unicode you should use. Three possible candidates:
UTF-8, UTF-32 and UTF-16.
Anything and everything that applications do with strings. It's not a
publishing system or word processor, so strings are mainly incidental:
labels, file names, and values used in the GUI.

Then the benefit of constant-width characters is not very important
for you.
Just the OS primitives and other like-minded modules.


It's portable code that needs to work on a variety of OSs.

OK, so you will have to convert anyway. The Windows API tends to prefer
UTF-16 and POSIX tends to lean toward UTF-8 or UTF-32.
The problem is that "common code" needs to be applicable to all
operating systems. Anything facing the OS will be abstracted, using
standard library classes or modules made for the purpose. The code
base already uses UTF-8 internally and I'm exploring the ramifications
of that, and how to do it correctly.

This sounds like sound design.
I can't see any reason why choosing any of the UTF-* would cause
problems. You choose your internal representation and stay fully
consistent with it internally. Any external input is converted
immediately to the internal representation if needed. Any output is
converted if needed by the external function wrapper, but your
internal API uses your internal representation. Check iconv as a
conversion library.
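
A minimal sketch of that convert-at-the-boundary idea using iconv (POSIX);
the encoding names are just examples and the error handling is stripped down:

#include <iconv.h>
#include <cstddef>
#include <stdexcept>
#include <string>

// Convert a UTF-16LE byte buffer to a UTF-8 std::string.
std::string utf16le_to_utf8(const char* in, std::size_t in_bytes)
{
    iconv_t cd = iconv_open("UTF-8", "UTF-16LE");
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");

    // 2x the input size is enough here: a 2-byte UTF-16 unit becomes at
    // most 3 bytes of UTF-8, and a 4-byte surrogate pair becomes 4 bytes.
    std::string out(in_bytes * 2, '\0');
    char*  src      = const_cast<char*>(in);   // glibc's iconv takes char**
    char*  dst      = &out[0];
    size_t src_left = in_bytes;
    size_t dst_left = out.size();

    size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);
    if (rc == (size_t)-1)
        throw std::runtime_error("conversion failed");

    out.resize(out.size() - dst_left);         // trim the unused tail
    return out;
}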

So with this design, there should be no problem of using any of the
UTF-* for internal representation. What you are left with is what do
you do with the data internally:

You don't really need constant-width characters, so one of the main
benefits of UTF-32 is not needed.

You already have some UTF-8 code implemented, so maybe there's no
point in changing just for the sake of changing.

It sounds like you will be converting an existing codebase that was
i18n-unaware. Quite often, such a codebase is not compatible with
wide characters, and the presence of embedded null bytes in the byte
stream could break it. IMO, it is much easier to gradually switch an
existing codebase from ASCII (or ISO-8859-x) to UTF-8 than it is to
change to UTF-16 or UTF-32.
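
A tiny illustration of that embedded-null point (the byte values are just
"Hi" spelled out by hand):

#include <cassert>
#include <cstring>

int main()
{
    // "Hi" as UTF-16LE, viewed as raw bytes: 'H', 0, 'i', 0
    const char utf16_bytes[] = { 'H', '\0', 'i', '\0', '\0', '\0' };
    assert(std::strlen(utf16_bytes) == 1);   // byte-oriented code stops at the 0

    const char utf8_bytes[] = "Hi";          // ASCII text is unchanged in UTF-8
    assert(std::strlen(utf8_bytes) == 2);    // existing char-based code still works
}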

Does storage space for large amounts of string data matter to you? If
so, do you expect to store a lot of Asian characters?

Regards
Yannick
 

John M. Dlugosz

What I would want from a good Unicode string class is: …

Thanks for your extensive notes. I see your ideas are sometimes
different from what I've seen thus far of Glib::ustring, which is
designed more as a drop-in replacement.

I would also consider how such a class plays with other classes that
use strings, such as the Boost (and now TR2) filesystem path class.
 

John M. Dlugosz

Sorry then.  I read the statements "already ... is using UTF-8",
"...new code needs to use Unicode" and "UTF-8 and Unicode have merit
..." as if you meant that they were mutually exclusive
alternatives: "Unicode" or "UTF-8".  Apologies.

I see. I meant them separately, "Unicode has merit" and "UTF-8 has
merit", not as synonyms, and not as mutual exclusive either. Using
Unicode (as opposed to GB18030, local code pages, ISO-2022 code page
switching) is good, and using UTF-8 (as opposed to UTF-16 etc.) for
internal representation and API passing has its points. In Windows,
I've moved towards using UTF-16 and wide strings. This project is
different.

—John
 

John M. Dlugosz

Check the boost mailing list. There's been some discussion recently, as
well as a presentation at BoostCon last month.

I found a Boost.Locale library proposed but not in the current release
of Boost. I did not find any BoostCon slides or videos concerning
it. Can you be more specific, or give a URL?

Thanks,
—John
 

John M. Dlugosz

What is this enamourment with UTF-8 anyway?

There are a few:

(1) Old code is written using byte-size character strings of *some*
kind. For languages with larger character sets, the existing practice
is to use a "multi-byte" code page with variable-length characters,
where ASCII is still a single byte and certain ranges of byte values
indicate prefixes or paired use.

So lots of code uses byte strings, with "internationalization" meaning
allowing the target system to specify the mapping of which character
has what value, and allowing multi-byte characters mixed with
single-byte characters.

So, using UTF-8 fits the existing data types and code. It's just
another "code page" to such code.

(2) It is efficient for non-ideographic languages and for
data-processing data that uses mostly ASCII identifiers.

(3) Being defined as a series of bytes, it is byte-order neutral.

(4) Contrast with UTF-16, which _still_ requires awareness of
surrogate pairs to allow more than 64K code values.

So, it's attractive for data storage and transmission.
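
To make (3) and (4) concrete, here is a minimal (unvalidated) sketch of
encoding one code point as UTF-8; because the output is defined byte by
byte, it is identical on any machine regardless of byte order, and anything
below 0x80 comes out as plain ASCII:

#include <string>

std::string encode_utf8(char32_t cp)   // assumes cp is a valid code point
{
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);                      // 1 byte: ASCII
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));        // 2 bytes
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));       // 3 bytes
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));       // 4 bytes
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}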
 

John M. Dlugosz

But isn't Windows (XP)'s "UTF-16" really UCS-2 and hence not a risk at
all then?

Originally, yes. At some point it went to real UTF-16 awareness, at
least for rendering strings using fonts into windows, and converting
between code pages correctly. This was around Windows 2000 or XP,
certainly well before Windows 7.

I think the file system is technically still UCS-2, in that it doesn't
care what the code points mean and doesn't reject improper use of
surrogate pairs as errors.
 

MikeP

John said:
There are a few:

(1) Old code is written using byte-size character strings of *some*
kind. For languages with larger character sets, the existing practice
is to use a "multi-byte" code page with variable-length characters,
where ASCII is still a single byte and certain ranges of byte values
indicate prefixes or paired use.

So lots of code uses byte strings, with "internationalization" meaning
allowing the target system to specify the mapping of which character
has what value, and allowing multi-byte characters mixed with
single-byte characters.

So, using UTF-8 fits the existing data types and code. It's just
another "code page" to such code.

(2) It is efficient for non-ideographic languages and for
data-processing data that uses mostly ASCII identifiers.

(3) Being defined as a series of bytes, it is byte-order neutral.

(4) Contrast with UTF-16, which _still_ requires awareness of
surrogate pairs to allow more than 64K code values.

So, it's attractive for data storage and transmission.

I guess I have a hard time seeing how anything multi-byte is a boon. But,
and it's a big but (not to be confused with a phat azz!), if one doesn't
need "internationalization" (I mean other than English), it's a waste of
effort. Yes?
 

Ruben Safir

John M. Dlugosz said:
Originally, yes. At some point it went to real UTF-16 awareness, at
least for rendering strings using fonts into windows, and converting
between code pages correctly. This was around Windows-2000 or XP,
certainly well before Windows 7.

I don't need my text editor to absorb all the complexity of a word
processor.
 
