John M. Dlugosz
I may soon be working on a code base that is undergoing
"internationalization", and what is already done on low levels of
operating system interaction is using UTF-8 in char-based strings.
Clearly, new code needs to use Unicode, or _some_ way of representing
a lot more than 256 characters. UTF-8 and Unicode have merit as an
encoding and transmission format, and I'm certainly not against it.
But std::string models a sequence of *characters*, and that doesn't
fit with multi-byte variable-length character encodings of any kind.
Furthermore, one character won't fit in a char.
Before I get in too deep with my thoughts, I want to see what others
are doing. Is there any existing information on this topic? Is there
a better place to discuss it?
My initial thought is that most of the time the code simply hangs
onto the string data and doesn't manipulate it, so "funny" contents
won't matter much. Furthermore, UTF-8 avoids (by design) many of the
problems found with multi-byte encodings, so naive handling might
work "well enough" for common tasks. However, I should catalog in
detail just what works and what the problems are. For more general
manipulation of the string, we need functions like the C library's
mblen etc. and replacements for standard library routines whose
implementations don't handle UTF-8 as a multi-byte character encoding.
Or, is there some paradigm shift I should be aware of?
—John