std::string vs. Unicode UTF-8


Wolfgang Draxinger

I understand that it is perfectly possible to store UTF-8 strings
in a std::string, however doing so can cause some implications.
E.g. you can't count the number of characters with length() or
size(). Instead one has to iterate through the string, parse the
UTF-8 multi-byte sequences and count each sequence as one character.
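
For example, a minimal sketch of that loop (assuming the string
already holds well-formed UTF-8) just skips the continuation bytes:

#include <string>
#include <cstddef>

// Count code points in a well-formed UTF-8 string by counting every
// byte that is not a continuation byte (continuation bytes have the
// bit pattern 10xxxxxx).
std::size_t utf8_length(const std::string& s)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i)
        if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80)
            ++count;
    return count;
}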

To address this problem the GTKmm bindings for the GTK+ toolkit
have implemented their own string class, Glib::ustring
<http://tinyurl.com/bxpu4>, which takes care of UTF-8 in strings.

The question is, wouldn't it be logical to make std::string
Unicode aware in the next STL version? I18N is an important
topic nowadays and I simply see no logical reason to keep
std::string as limited as it is. Of course there is
also the wchar_t variant, but actually I don't like that.

Wolfgang Draxinger
 

Niels Dybdahl

Wolfgang Draxinger said:
The question is, wouldn't it be logical to make std::string
Unicode aware in the next STL version? I18N is an important
topic nowadays and I simply see no logical reason to keep
std::string as limited as it is. Of course there is
also the wchar_t variant, but actually I don't like that.

It is much easier to handle Unicode strings with wchar_t internally, and
there is much less confusion about whether the string is ANSI or UTF-8
encoded. So I have started using wchar_t wherever I can, and I only use
UTF-8 for external communication.

Niels Dybdahl
 

John Harrison

Wolfgang said:
I understand that it is perfectly possible to store UTF-8 strings
in a std::string, however doing so can cause some implications.
E.g. you can't count the number of characters with length() or
size(). Instead one has to iterate through the string, parse the
UTF-8 multi-byte sequences and count each sequence as one character.

To address this problem the GTKmm bindings for the GTK+ toolkit
have implemented their own string class, Glib::ustring
<http://tinyurl.com/bxpu4>, which takes care of UTF-8 in strings.

The question is, wouldn't it be logical to make std::string
Unicode aware in the next STL version? I18N is an important
topic nowadays and I simply see no logical reason to keep
std::string as limited as it is. Of course there is
also the wchar_t variant, but actually I don't like that.

Wolfgang Draxinger

UTF-8 is only an encoding. Why do you think strings internal to the
program should be represented as UTF-8? It makes more sense to me to
translate to or from UTF-8 when you input or output strings from your
program. C++ already has the framework in place for that.
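
For instance, a rough sketch (the locale name "en_US.UTF-8" is an
assumption; the names available vary by platform) that lets the
stream's codecvt facet convert UTF-8 bytes to wchar_t on input:

#include <fstream>
#include <locale>
#include <string>

// Read one line from a UTF-8 encoded file into a wide string,
// converting at the I/O boundary via the imbued locale's codecvt.
std::wstring read_wide_line(const char* path)
{
    std::wifstream in;
    in.imbue(std::locale("en_US.UTF-8")); // assumed locale name
    in.open(path);
    std::wstring line;
    std::getline(in, line);
    return line;
}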

john
 

Bob Hairgrove

Wolfgang Draxinger said:
I understand that it is perfectly possible to store UTF-8 strings
in a std::string, however doing so can cause some implications.
E.g. you can't count the number of characters with length() or
size(). Instead one has to iterate through the string, parse the
UTF-8 multi-byte sequences and count each sequence as one character.

Not only that, but substr(), operator[] etc. pose equally
"interesting" problems.
To address this problem the GTKmm bindings for the GTK+ toolkit
have implemented their own string class, Glib::ustring
<http://tinyurl.com/bxpu4>, which takes care of UTF-8 in strings.

The question is, wouldn't it be logical to make std::string
Unicode aware in the next STL version? I18N is an important
topic nowadays and I simply see no logical reason to keep
std::string as limited as it is. Of course there is
also the wchar_t variant, but actually I don't like that.

Wolfgang Draxinger

People use std::string in many different ways. You can even store
binary data with embedded null characters in it. I don't know for
sure, but I believe there are already proposals in front of the C++
standards committee for what you suggest. In the meantime, it might
make more sense to use a third-party UTF-8 string class if that is
what you mainly use it for. IBM has released the ICU library as open
source, for example, and it is widely used these days.
 

peter.koch.larsen

Wolfgang said:
I understand that it is perfectly possible to store UTF-8 strings
in a std::string, however doing so can cause some implications.
E.g. you can't count the number of characters with length() or
size(). Instead one has to iterate through the string, parse the
UTF-8 multi-byte sequences and count each sequence as one character.

Correct. Also you can't print it or anything else.
To address this problem the GTKmm bindings for the GTK+ toolkit
have implemented their own string class, Glib::ustring
<http://tinyurl.com/bxpu4>, which takes care of UTF-8 in strings.
Ok.


The question is, wouldn't it be logical to make std::string
Unicode aware in the next STL version?

It already is - using e.g. wchar_t.

I18N is an important topic nowadays and I simply see no logical
reason to keep std::string as limited as it is.

It is not limited.

Of course there is also the wchar_t variant, but actually I
don't like that.

So you'd like to have Unicode support. And you realize you already
have it. But you don't like it. Why?

/Peter
 

benben

Wolfgang Draxinger said:
I understand that it is perfectly possible to store UTF-8 strings
in a std::string, however doing so can cause some implications.
E.g. you can't count the number of characters with length() or
size(). Instead one has to iterate through the string, parse the
UTF-8 multi-byte sequences and count each sequence as one character.

To address this problem the GTKmm bindings for the GTK+ toolkit
have implemented their own string class, Glib::ustring
<http://tinyurl.com/bxpu4>, which takes care of UTF-8 in strings.

The question is, wouldn't it be logical to make std::string
Unicode aware in the next STL version? I18N is an important
topic nowadays and I simply see no logical reason to keep
std::string as limited as it is. Of course there is
also the wchar_t variant, but actually I don't like that.

Wolfgang Draxinger

That's why people have std::wstring :)

Ben
 

Pete Becker

Wolfgang said:
I understand that it is perfectly possible to store UTF-8 strings
in a std::string, however doing so can cause some implications.
E.g. you can't count the number of characters with length() or
size(). Instead one has to iterate through the string, parse the
UTF-8 multi-byte sequences and count each sequence as one character.

Yup. That's what happens when you use the wrong tool.
The question is, wouldn't it be logical to make std::string
Unicode aware in the next STL version? I18N is an important
topic nowadays and I simply see no logical reason to keep
std::string as limited as it is.

There's much more to internationalization than Unicode. Requiring
std::string to be Unicode aware (presumably that means UTF-8 aware)
would impose implementation overhead that's not needed for the kinds of
things it was designed for, like the various ISO 8859 code sets. In
general, neither string nor wstring knows anything about multibyte
encodings. That's for efficiency. Do the translation on input and output.
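
For example, a minimal sketch of that translation using the C library
conversion functions (the locale name below is an assumption, and in
real code you would set it once at startup rather than per call):

#include <clocale>
#include <cstdlib>
#include <string>
#include <vector>

// Decode an externally supplied UTF-8 byte string into a wstring for
// internal processing; the current C locale defines the multibyte
// encoding that mbstowcs expects.
std::wstring decode_utf8(const std::string& bytes)
{
    std::setlocale(LC_ALL, "en_US.UTF-8");   // assumed locale name
    std::vector<wchar_t> buf(bytes.size() + 1);
    std::size_t n = std::mbstowcs(&buf[0], bytes.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1))
        return std::wstring();               // invalid sequence
    return std::wstring(&buf[0], n);
}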

Of course there is
also the wchar_t variant, but actually I don't like that.

That's unfortunate, since it's exactly what wchar_t and wstring were
designed for. What is your objection to them?
 

msalters

Wolfgang Draxinger said:
I understand that it is perfectly possible to store UTF-8 strings
in a std::string, however doing so can cause some implications.
E.g. you can't count the number of characters with length() or
size(). Instead one has to iterate through the string, parse the
UTF-8 multi-byte sequences and count each sequence as one character.

Usually correct, but not always. A char is a byte in C++, but
a byte might not be an octet. UTF-8 is of course octet-based.
The question is, wouldn't it be logical to make std::string
Unicode aware in the next STL version? I18N is an important
topic nowadays and I simply see no logical reason to keep
std::string as limited as it is. Of course there is
also the wchar_t variant, but actually I don't like that.

wchar_t isn't always Unicode, either. There's a proposal to add an
extra Unicode char type, and that will probably include a std::ustring.

However, that is probably a 20+bit type. Unicode itself assigns
numbers to characters, and the numbers have exceeded 65536.
UTF-x means Unicode Transformation Format - x. These formats
map each number to one or more x-bit values. E.g. UTF-8 maps
the number of each Unicode character to an octet sequence,
with the additional property that the 0 byte isn't used for
anything but number 0.

Now, these formats are intended for data transfer and not data
processing. That in turn means UTF-8 should go somewhere in
<iostream>, if it's added.

HTH,
Michiel Salters
 

kanze

msalters said:
Wolfgang Draxinger said:

[...]
However, that is probably a 20+bit type. Unicode itself
assigns numbers to characters, and the numbers have exceeded
65536. UTF-x means Unicode Transformation Format - x. These
formats map each number to one or more x-bit values.
E.g. UTF-8 maps the number of each unicode character to an
octet sequence, with the additional property that the 0 byte
isn't used for anything but number 0.

It has a lot more additional properties than that. Like the
fact that you can immediately tell whether a byte is a single
byte character, the first byte of a multibyte sequence, or a
following byte in a multibyte sequence, without looking beyond
just that byte.
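
In code, that classification is a single mask test per byte, e.g.:

// Classify one byte of UTF-8 without looking at its neighbours.
enum Utf8ByteKind { Single, Lead2, Lead3, Lead4, Continuation, Invalid };

Utf8ByteKind classify(unsigned char b)
{
    if (b < 0x80)           return Single;        // 0xxxxxxx
    if ((b & 0xC0) == 0x80) return Continuation;  // 10xxxxxx
    if ((b & 0xE0) == 0xC0) return Lead2;         // 110xxxxx
    if ((b & 0xF0) == 0xE0) return Lead3;         // 1110xxxx
    if ((b & 0xF8) == 0xF0) return Lead4;         // 11110xxx
    return Invalid;                               // 0xF8..0xFF never appear in UTF-8
}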
Now, these formats are intended for data transfer and not data
processing. That in turn means UTF-8 should go somewhere in
<iostream>, if it's added.

I don't know where you find that these formats are intended just
for data transfer. Depending on what the code is doing (and the
text it has to deal with), the ideal solution may be UTF-8,
UTF-16 or UTF-32. For most of what I do, UTF-8 would be more
appropriate, including internally, than any of the other
formats. (It's also required in some cases.)
 

Dave Rahardja

kanze said:
I don't know where you find that these formats are intended just
for data transfer. Depending on what the code is doing (and the
text it has to deal with), the ideal solution may be UTF-8,
UTF-16 or UTF-32. For most of what I do, UTF-8 would be more
appropriate, including internally, than any of the other
formats. (It's also required in some cases.)

RFC 3629 says it this way:

"ISO/IEC 10646 and Unicode define several encoding forms of their
common repertoire: UTF-8, UCS-2, UTF-16, UCS-4 and UTF-32. In an
encoding form, each character is represented as one or more encoding
units. All standard UCS encoding forms except UTF-8 have an encoding
unit larger than one octet, making them hard to use in many current
applications and protocols that assume 8 or even 7 bit characters."

Note that UTF-8 is intended to _encode_ a larger space, its primary purpose
being the compatibility of the encoded format with "applications and protocols"
that assume 8- or 7-bit characters. This suggests to me that UTF-8 was devised
so that Unicode text can be _passed through_ older protocols that only
understand 8- or 7-bit characters by encoding it at the input, and later
decoding it at the output to recover the original data.

If you want to _manipulate_ Unicode characters, however, why not deal with
them in their native, unencoded space? wchar_t is guaranteed to be wide enough
to contain all characters in all supported locales in the implementation, and
each character will have an equal size in memory.

-dr
 

msalters

kanze said:
msalters said:
Wolfgang Draxinger said:
[...]
However, that is probably a 20+bit type. Unicode itself
assigns numbers to characters, and the numbers have exceeded
65536. UTF-x means Unicode Transformation Format - x. These
formats map each number to one or more x-bit values.
E.g. UTF-8 maps the number of each unicode character to an
octet sequence, with the additional property that the 0 byte
isn't used for anything but number 0.

It has a lot more additional properties than that. Like the
fact that you can immediately tell whether a byte is a single
byte character, the first byte of a multibyte sequence, or a
following byte in a multibyte sequence, without looking beyond
just that byte.

Yep, that makes scanning through a byte sequence a lot easier.
However, that's not very important for std::string: .substr()
can't do anything useful with it. For .c_str(), the non-null
property is important.
Of course, for a utf8string type, these additional properties
make implementations a lot easier. UTF-8 is quite a good encoding,
actually.
I don't know where you find that these formats are intended just
for data transfer. Depending on what the code is doing (and the
text it has to deal with), the ideal solution may be UTF-8,
UTF-16 or UTF-32. For most of what I do, UTF-8 would be more
appropriate, including internally, than any of the other
formats. (It's also required in some cases.)

Getting a substring, uppercasing, finding characters, replacing
characters: all common string operations, but non-trivial in UTF-8.
Saving to file, sending over TCP/IP, or sending to mobile devices: all
common I/O operations, and UTF-8 makes them easy.
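
(For instance, even just finding where the n-th code point starts
means walking the whole prefix; a sketch, assuming well-formed UTF-8:)

#include <string>
#include <cstddef>

// Byte offset of the n-th code point in a well-formed UTF-8 string,
// so a code-point-aware substring of [i, j) is
// s.substr(offset_of(s, i), offset_of(s, j) - offset_of(s, i)).
std::size_t offset_of(const std::string& s, std::size_t n)
{
    std::size_t i = 0;
    while (i < s.size() && n > 0) {
        ++i;                              // step past the lead byte
        while (i < s.size() &&
               (static_cast<unsigned char>(s[i]) & 0xC0) == 0x80)
            ++i;                          // and past its continuation bytes
        --n;
    }
    return i;
}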

Regards,
Michiel Salters
 

kanze

msalters said:
kanze said:

[...]
Getting a substring, uppercasing, finding characters,
replacing characters: all common string operations, but
non-trivial in UTF8.

I said "for most of what I do". Comparing for equality, using
as keys in std::set or an unordered_set, for example. UTF-8
works fine, and because it uses less memory, it will result in
better overall performance (fewer cache misses, less paging,
etc.).
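
For those uses the string is just an opaque sequence of bytes, e.g.:

#include <set>
#include <string>

// UTF-8 keys work unchanged: lookup, ordering and equality are plain
// byte comparisons, no decoding needed (assuming the input is
// consistently encoded and normalized the same way everywhere).
std::set<std::string> known_names;

bool is_known(const std::string& utf8_name)
{
    return known_names.count(utf8_name) != 0;
}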

In other cases, I've been dealing with binary input, with
embedded UTF-8 strings. Which means that I cannot translate
directly on input, only once I've parsed the binary structure
enough to know where the strings are located. In the last
application, the strings were just user names and passwords --
again, no processing which wouldn't work just fine in UTF-8.

Imagine a C++ compiler. The only place UTF-8 might cause some
added difficulty is when scanning a symbol -- and even there, I
can imagine some fairly simple solutions. For all of the
rest... the critical delimiters can all be easily recognized in
UTF-8, and of course, once past scanning, we're talking about
symbol table management, and perhaps concatenation (to generate
mangled names), but they're both easily done in UTF-8. All in
all, I think a C++ compiler would be a good example of an
application where using UTF-8 as the internal encoding would
make sense.
Saving to file, sending over TCP/IP, or to mobile devices: all
common I/O operations, and UTF8 makes it easy.

The external world is byte oriented. That's for sure. UTF-8
(or some other 8-bit format) is definitely required for external
use. But there are numerous cases where UTF-8 is a good
choice for internal use as well; why bother with the conversions
and the added memory overhead if it doesn't buy you anything?
 

lancediduck

UTF-8 is already in iostream. On just about any platform, when you set
your locale to something with "utf8" support, your library's codecvt
facet will likely convert the UTF-8 to whatever wide char type your
platform supports, which on some platforms is 16 bits, and on others 32.

But the main trouble that C++ programmers have with Unicode is that
they still want to use it just like arrays of ASCII-encoded characters
that you can send to a console command line. That won't work. At the
very least, Unicode assumes that it will be displayed on a graphical
terminal. And there is certainly no "one to one correspondence"
between the characters rendered by the device and what you see encoded
in your Unicode string.
And don't even ask about Unicode regular expressions or "equality
comparison". Consider JavaScript:
var a='Hello';
var b=' World!';
if ((a+b) == 'Hello World!')
The conditional expression really means "encode in UTF-16LE, normalize
each string using Unicode Normalization Form 3, and then do a byte by
byte comparison and return true if they match".

Just like ASCII is not a better way of doing Morse Code, Unicode is not
a better ASCII, but something way different.
 

Dietmar Kuehl

Pete said:
That's unfortunate, since it's exactly what wchar_t and wstring were
designed for. What is your objection to them?

Well, 'wchar_t' and 'wstring' were designed at a time when Unicode
was still pretending that it used 16-bit characters and that each
Unicode character consists of a single 16-bit character. Neither of
these two properties holds: Unicode is [currently] a 20-bit encoding
and a Unicode character can consist of multiple such 20-bit entities
for combining characters.
 

Mirek Fidler

Dietmar said:
Pete said:
That's unfortunate, since it's exactly what wchar_t and wstring were
designed for. What is your objection to them?


Well, 'wchar_t' and 'wstring' were designed at a time when Unicode
was still pretending that they use 16-bit characters and that each
Unicode character consists of a single 16-bit character. Neither of
these two properties holds: Unicode is [currently] a 20-bit encoding
and a Unicode character can consist of multiple such 20-bit entities
^^^^^^^^^^^^^^^^

16-bit?

Mirek
 

Dave Rahardja

Dietmar said:
Well, 'wchar_t' and 'wstring' were designed at a time when Unicode
was still pretending that they use 16-bit characters and that each
Unicode character consists of a single 16-bit character. Neither of
these two properties holds: Unicode is [currently] a 20-bit encoding
and a Unicode character can consist of multiple such 20-bit entities
^^^^^^^^^^^^^^^^

16-bit?

From the Unicode Technical Introduction:

"In all, the Unicode Standard, Version 4.0 provides codes for 96,447
characters from the world's alphabets, ideograph sets, and symbol
collections...The majority of common-use characters fit into the first 64K
code points, an area of the codespace that is called the basic multilingual
plane, or BMP for short. There are about 6,300 unused code points for future
expansion in the BMP, plus over 870,000 unused supplementary code points on
the other planes...The Unicode Standard also reserves code points for private
use. Vendors or end users can assign these internally for their own characters
and symbols, or use them with specialized fonts. There are 6,400 private use
code points on the BMP and another 131,068 supplementary private use code
points, should 6,400 be insufficient for particular applications."

Despite the indication that the code space for Unicode is potentially larger
than 16 bits, the following statement seems to suggest that a 32-bit integer
is more than enough to represent all Unicode characters:

"UTF-32 is popular where memory space is no concern, but fixed width, single
code unit access to characters is desired. Each Unicode character is encoded
in a single 32-bit code unit when using UTF-32."

http://www.unicode.org/standard/principles.html

-dr
 

Pete Becker

Dietmar said:
Pete said:
That's unfortunate, since it's exactly what wchar_t and wstring were
designed for. What is your objection to them?


Well, 'wchar_t' and 'wstring' were designed at a time when Unicode
was still pretending that they use 16-bit characters and that each
Unicode character consists of a single 16-bit character. Neither of
these two properties holds: Unicode is [currently] a 20-bit encoding
and a Unicode character can consist of multiple such 20-bit entities
for combining characters.

Well, true, but wchar_t can certainly be large enough to hold 20 bits.
And the claim from the Unicode folks is that that's all you need.
 

Jonathan Coxhead

Pete said:
Dietmar said:
Pete said:
That's unfortunate, since it's exactly what wchar_t and wstring were
designed for. What is your objection to them?



Well, 'wchar_t' and 'wstring' were designed at a time when Unicode
was still pretending that they use 16-bit characters and that each
Unicode character consists of a single 16-bit character. Neither of
these two properties holds: Unicode is [currently] a 20-bit encoding
and a Unicode character can consist of multiple such 20-bit entities
for combining characters.


Well, true, but wchar_t can certainly be large enough to hold 20 bits.
And the claim from the Unicode folks is that that's all you need.

Actually, you need 21 bits. There are 0x11 planes with 0x10000 characters in
each, so 0x110000 characters. This space is completely flat, though it has
holes. Or, you can use UTF-16, where a character is encoded as 1 or 2 16-bit
values, so in C it counts as neither a wide-character encoding nor a multibyte
encoding. (It might be a "multishort" encoding, if such a thing existed.) Or you
can use UTF-8, which is a true multibyte encoding. The translation between these
representations is purely algorithmic.

Anyway, 20 bits: not enough.
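
A quick compile-time sanity check of that arithmetic (static_assert is
a later, C++11 addition, used here only as an aside):

static_assert(0x11 * 0x10000 == 0x110000,
              "17 planes of 65536 code points each");
static_assert(0x110000 > (1 << 20) && 0x110000 <= (1 << 21),
              "so 20 bits are not enough, but 21 bits are");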
 

kanze

Pete said:
Dietmar said:
Pete Becker wrote:
That's unfortunate, since it's exactly what wchar_t and
wstring were designed for. What is your objection to them?
Well, 'wchar_t' and 'wstring' were designed at a time when
Unicode was still pretending that they use 16-bit characters
and that each Unicode character consists of a single 16-bit
character. Neither of these two properties holds: Unicode is
[currently] a 20-bit encoding and a Unicode character can
consist of multiple such 20-bit entities for combining
characters.

(If you have 20 or more bits, there's no need for the combining
characters; they're only present to allow representing character
codes larger than 0xFFFF as two 16-bit characters.)
Well, true, but wchar_t can certainly be large enough to hold
20 bits. And the claim from the Unicode folks is that that's
all you need.

I think the point is that when wchar_t was introduced, it wasn't
obvious that Unicode was the solution, and Unicode at the time
was only 16 bits anyway. Given that, vendors have defined
wchar_t in a variety of ways. And given that vendors want to
support their existing code bases, that really won't change,
regardless of what the standard says.

Given this, there is definite value in leaving wchar_t as it is
(which is pretty unusable in portable code), and defining a new
type which is guaranteed to be Unicode. (This is, I believe,
the route C is taking; there's probably some value in remaining
C compatible here as well.)
 

Dave Rahardja

kanze said:
I think the point is that when wchar_t was introduced, it wasn't
obvious that Unicode was the solution, and Unicode at the time
was only 16 bits anyway. Given that, vendors have defined
wchar_t in a variety of ways. And given that vendors want to
support their existing code bases, that really won't change,
regardless of what the standard says.

Given this, there is definite value in leaving wchar_t as it is
(which is pretty unusable in portable code), and defining a new
type which is guaranteed to be Unicode. (This is, I believe,
the route C is taking; there's probably some value in remaining
C compatible here as well.)

I think wchar_t is fine the way it is defined:

(3.9.1.5)
Type wchar_t is a distinct type whose values can represent distinct codes for
all members of the largest extended character set specified among the
supported locales (22.1.1). Type wchar_t shall have the same size, signedness,
and alignment requirements (3.9) as one of the other integral types, called
its underlying type.

What we need is a Unicode locale! ;-)

-dr
 
