wide characters: "illusion of portability"?

  • Thread starter Jonathan Mcdougall

Jonathan Mcdougall

I started using boost's filesystem library a
couple of days ago. In its FAQ, it states

"Wide-character names would provide an illusion of
portability where portability does not in fact
exist. Behavior would be completely different on
operating systems (Windows, for example) that
support wide-character names, than on systems
which don't (POSIX). Providing functionality that
appears to provide portability but in fact
delivers only implementation-defined behavior is
highly undesirable. Programs would not even be
portable between library implementations on the
same operating system, let alone portable to
different operating systems.

The C++ standards committee Library Working Group
discussed this in some detail both on the
committee's library reflector and at the Spring,
2002, meeting, and feels that (1) names based on
types other than char are extremely non-portable,
(2) there are no agreed upon semantics for
conversion between wide-character and
narrow-character names for file systems which do
not support wide-character name, and (3) even the
committee members most interested in
wide-character names are unsure that they are a
good idea in the context of a portable library.
(boost/libs/filesystem/doc/faq.htm)"

This surprised me, since I thought wide characters
were mandatory in production code.

In fact, of all the libraries I am using right
now, very few of them are Unicode compatible.
Therefore, I am stuck with two choices: abandon
Unicode in favor of plain chars, or do
time-consuming and not necessarily valid conversions
between wide-character and narrow-character strings.

I have several questions about this.

1. Are C++ wide-character strings used in real
life, in production code?
2. Is it a good idea to let the user choose
between Unicode and ASCII in a library in a
transparent way (such as Microsoft's -A and -W
versions of all functions)?
3. What is the best way to convert wide strings to
and from narrow strings? System-dependent
functions? A simple loop converting char's to
wchar_t's?
4. Will C++0x provide more means for using wide
and narrow strings, such as conversions and
transparency (converting "strings" into L"strings"
automatically, for example, or providing standard
macros such as UNICODE)?
5. Are wide characters meant to be used with
Unicode or are they provided for an
implementation-defined use?

Thank you,

Jonathan
 

Niels Dybdahl

1. Are C++ wide-character strings used in real
life, in production code?

Yes. I do in my applications. I started using wide-chars when I had to
handle text from a newspaper, which had characters outside the ISO8859-1
range. Now I more and more often have to handle XML sources which are UTF8
encoded. The easiest way to handle this is using wide-chars.
2. Is it a good idea to let the user choose
between Unicode and ASCII in a library in a
transparent way (such as Microsoft's -A and -W
versions of all functions)?

I think that the programmer has to be aware of whether he is handling
wide-chars, UTF8, ISO8859-1 or ASCII. So there is no need to make it
invisible to the programmer whether he uses the A or the W version of a
function.
3. What is the best way to convert wide strings to
and from narrow strings? System-dependent
functions? A simple loop converting char's to
wchar_t's?

A simple loop probably does not handle the special cases. Windows's codepage
is not ISO8859-1 and some of the characters need special handling.
So I would use a specially written function for that purpose.
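
To illustrate the point about special handling, here is a minimal sketch of such a function for Windows-1252 input. The function names are made up for this example, the table is deliberately partial (only three of the 0x80..0x9F entries are filled in), and it assumes that wchar_t holds Unicode code points, which is implementation-defined.

```cpp
#include <string>

// Hypothetical helper: maps one Windows-1252 byte to a wide character.
// Only three of the special 0x80..0x9F entries are listed here; a real
// table would need all of them.
wchar_t cp1252_byte_to_wide(unsigned char byte) {
    switch (byte) {
        case 0x80: return 0x20AC;  // euro sign
        case 0x85: return 0x2026;  // horizontal ellipsis
        case 0x99: return 0x2122;  // trade mark sign
        default:   return byte;    // elsewhere the value is kept as-is
    }
}

// Hypothetical helper: converts a whole Windows-1252 string.
std::wstring cp1252_to_wide(const std::string& narrow) {
    std::wstring wide;
    wide.reserve(narrow.size());
    for (std::string::size_type i = 0; i < narrow.size(); ++i) {
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        wide += cp1252_byte_to_wide(static_cast<unsigned char>(narrow[i]));
    }
    return wide;
}
```

A plain `wchar_t w = c;` loop would turn the byte 0x80 into 0x80 (or a sign-extended value) instead of the euro sign, which is the kind of special case meant here.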

Best regards
Niels Dybdahl
 

Samuel Krempp

(e-mail address removed) wrote (02 May 2005 06:56):
I started using boost's filesystem library a
couple of days ago. In its FAQ, it states

"Wide-character names would provide an illusion of
portability where portability does not in fact
exist. Behavior would be completely different on
operating systems (Windows, for example) that
support wide-character names, than on systems

I think you overlooked a detail here : what made this portability an
"illusion" is how different file-systems define file-names rules.

It doesn't mean use of wide-character strings in C++ is not portable, only
that using unicode filenames with various native filesystems is not..
(but boost::filesystem started an "i18n" branch recently, and accessing
files by wide-char names seems to be on the menu, so someone probably took
some time to make that work for the most common platforms)
2. Is it a good idea to let the user choose
between Unicode and ASCII in a library in a
transparent way (such as Microsoft's -A and -W
versions of all functions)?

I think it is.
Posix systems let the user choose a locale (by setting $LANG, or various
sub-variables like LC_CTYPE ..).

You can (hope to) get the user's environment locale with portable C++ :
std::locale userLocale("");
but then, there's no easy portable way to know whether this locale uses
UTF-8 for charset encoding or what. (On posix systems you can try to detect
whether "UTF-8" occurs in the userLocale.name() string)
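
A minimal sketch of that check follows. The function names are hypothetical, the fallback to the classic locale when std::locale("") throws is a choice made for this example, and also matching the spelling "utf8" is an assumption about how some systems name their locales.

```cpp
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns the user's environment locale, or the
// classic "C" locale when the environment names a locale that the
// library cannot construct (std::locale("") throws in that case).
std::locale user_locale_or_classic() {
    try {
        return std::locale("");
    } catch (const std::runtime_error&) {
        return std::locale::classic();
    }
}

// Hypothetical helper: heuristic only. It looks for "UTF-8" in the
// locale name and, as an assumption, for the variant spelling "utf8".
bool name_mentions_utf8(const std::string& locale_name) {
    return locale_name.find("UTF-8") != std::string::npos
        || locale_name.find("utf8") != std::string::npos;
}
```

Usage would be something like name_mentions_utf8(user_locale_or_classic().name()). The format of name() is implementation-defined, so a negative answer proves nothing.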

in basic situations, you should *not* need to know, but just :
.. use wide-chars in your program, and wide streams
.. imbue the user's locale on all the wide streams you use and let them
handle conversions.
[ In fact wcout might not work as well as any other wide stream .. I found
that imbuing on wcout was ignored, and setting the global locale :
std::locale::global(userLocale);
prior to using wcout was the only way to get the locale to have any effect
on wcout ]
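
The "imbue and let the stream convert" approach can be sketched with a wide file stream. The function name and its parameters are hypothetical; the locale is imbued before the file is opened, which is the conservative order.

```cpp
#include <fstream>
#include <locale>
#include <string>

// Hypothetical helper: writes one wide line to a file and lets the
// stream's locale (its codecvt facet) produce the external bytes.
bool write_wide_line(const std::string& path,
                     const std::wstring& text,
                     const std::locale& loc) {
    std::wofstream out;
    out.imbue(loc);          // set the locale before opening the file
    out.open(path.c_str());
    if (!out) {
        return false;        // could not open the file
    }
    out << text << L'\n';
    out.flush();
    return out.good();       // false if a character could not be converted
}
```

Passing std::locale("") as loc gives the user's encoding. With the classic locale only plain ASCII text can be expected to convert.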

3. What is the best way to convert wide strings to
and from narrow strings? System-dependent
functions? A simple loop converting char's to
wchar_t's?

I think the expected way is to let the wide streams handle the conversions.
They use their locale's codecvt facet to convert the internal char_type
sequences to the external char encoding.

if you have to widen/narrow stuff yourself, you can use the widen and
narrow functions of a locale's ctype facet.

everything boils down to using the "right" locale for your situation.
(note boost - and other portable libraries - provide UTF-8 locales that
provide conversion from wchars holding unicode code-points to UTF-8 encoded
char sequences)
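
A minimal sketch of narrowing through a locale's ctype facet. The function name and the '?' fallback are choices made for this example. Note that ctype narrows one wide character to one char, so it cannot produce multi-byte output such as UTF-8.

```cpp
#include <locale>
#include <string>

// Hypothetical helper: narrows each wide character with the locale's
// ctype<wchar_t> facet. Characters that have no single-char equivalent
// in that locale come out as `fallback`.
std::string narrow_with_fallback(const std::wstring& wide,
                                 const std::locale& loc,
                                 char fallback = '?') {
    if (wide.empty()) {
        return std::string();
    }
    std::string result(wide.size(), fallback);
    const std::ctype<wchar_t>& ct =
        std::use_facet<std::ctype<wchar_t> >(loc);
    ct.narrow(wide.data(), wide.data() + wide.size(), fallback, &result[0]);
    return result;
}
```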
4. Will C++0x provide more means for using wide
and narrow strings, such as conversions and
transparency (converting "strings" into L"strings"
automatically, for example, providing standard
macros such as UNICODE)

that's already handled by current standard, but this conversion is not
canonical, different locales can mean different conversions, so this
depends on the locale.
A locale's codecvt<wchar_t, char, mbstate_t> facet serves that purpose.
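
For illustration, the facet can also be called directly. This is a sketch only: the function name is hypothetical, the buffer sizing relies on max_length(), and any result other than "ok" is simply reported as an error.

```cpp
#include <cstddef>
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: converts a wide string to the external (narrow)
// encoding of `loc` using its codecvt<wchar_t, char, mbstate_t> facet.
std::string to_external(const std::wstring& wide, const std::locale& loc) {
    typedef std::codecvt<wchar_t, char, std::mbstate_t> Facet;
    const Facet& cvt = std::use_facet<Facet>(loc);

    std::mbstate_t state = std::mbstate_t();  // initial shift state
    // max_length() bounds the number of chars one wchar_t can produce.
    std::vector<char> buf(
        wide.size() * static_cast<std::size_t>(cvt.max_length()) + 1);

    const wchar_t* from_next = 0;
    char* to_next = 0;
    Facet::result r = cvt.out(state,
                              wide.data(), wide.data() + wide.size(),
                              from_next,
                              &buf[0], &buf[0] + buf.size(), to_next);
    if (r != Facet::ok) {
        throw std::runtime_error("wide-to-narrow conversion failed");
    }
    return std::string(&buf[0], to_next);
}
```

Which characters convert depends entirely on the locale passed in, which is the "not canonical" part.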

For more details on locales, check Stroustrup's Appendix D :
http://www.research.att.com/~bs/3rd_loc0.html
5. Are wide characters meant to be used with
Unicode or are they provided for an
implementation-defined use?

mostly everything is implementation-defined when it comes to locales and
wide-chars..
the values in wchar_t are most of the time "unicode" (UTF-32) code-points,
but check your compiler's documentation if you have to rely on it .. For
instance, gcc-3.4 lets you modify that with the command-line option
-fwide-exec-charset, and defaults to UTF-32 (or UTF-16 where wchar_t is
16 bits wide).
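
Two things a program can check for itself are the width of wchar_t and whether the implementation defines the __STDC_ISO_10646__ macro, which documents that wchar_t values are ISO 10646 (Unicode) code points. A small sketch with hypothetical function names:

```cpp
#include <climits>
#include <cstddef>

// Number of bits in a wchar_t on this implementation.
std::size_t wchar_bits() {
    return sizeof(wchar_t) * CHAR_BIT;
}

// True when the implementation defines __STDC_ISO_10646__. When the
// macro is absent, nothing can be concluded either way.
bool wchar_documented_as_iso10646() {
#ifdef __STDC_ISO_10646__
    return true;
#else
    return false;
#endif
}
```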

The way I see it, you can either :
1. use the compiler's native encoding of wide characters, along with the
native locales, and let your compiler's library do its work. In this case,
you don't care what the values are in those wchar_t, as long as it matches
what the locales expect. (and it should !).

2. enforce your own wide-char encoding (on a 4+ bytes type), and your own
conversions (with a 3rd party facet, or set of functions), without ever
using the compiler's native locale and wide IO features.

3. if you want to mix native stuff with third-party tools : set-up the
proper native-to-UTF-32 conversion system (e.g. make a header which tests
compiler-specific and std::library-specific macros, and does the proper
conversion, or aborts, or whatever. In most of the cases, the proper
conversion is keeping the wchar_t values untouched) and apply that
conversion between native calls and third-party UTF-32 calls.
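
A rough sketch of the native-to-UTF-32 step from option 3, assuming a C++11 compiler. It assumes that a 32-bit (or wider) wchar_t already holds UTF-32 and that a 16-bit wchar_t holds UTF-16; neither is guaranteed by the standard, and the function name is made up for this example.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: converts the native wide encoding to UTF-32.
// With a wide wchar_t the values are copied untouched; with a 16-bit
// wchar_t, surrogate pairs are combined into one code point.
std::u32string wide_to_utf32(const std::wstring& wide) {
    const bool utf16 = (sizeof(wchar_t) == 2);
    std::u32string out;
    out.reserve(wide.size());
    for (std::size_t i = 0; i < wide.size(); ++i) {
        char32_t unit = static_cast<char32_t>(wide[i]);
        if (utf16) {
            unit &= 0xFFFF;
            if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < wide.size()) {
                char32_t low = static_cast<char32_t>(wide[i + 1]) & 0xFFFF;
                if (low >= 0xDC00 && low <= 0xDFFF) {
                    // Combine high and low surrogate into one code point.
                    unit = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00);
                    ++i;  // the low surrogate has been consumed
                }
            }
        }
        out += unit;
    }
    return out;
}
```

Unpaired surrogates are passed through unchanged here; a stricter version might reject them.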
 

Jonathan Mcdougall

Samuel said:
I think you overlooked a detail here : what made this portability an
"illusion" is how different file-systems define file-names rules.

It doesn't mean use of wide-character strings in C++ is not portable, only
that using unicode filenames with various native filesystems is not..
(but boost::filesystem started an "i18n" branch recently, and accessing
files by wide-char names seems to be on the menu, so someone probably took
some time to make that work for the most common platforms)

Oh. I thought it was a general statement.

I think it is.

Good.


I think the expected way is to let the wide streams handle the conversions.
They use their locale's codecvt facet to convert the internal char_type
sequences to the external char encoding.

if you have to widen/narrow stuff yourself, you can use the widen and
narrow functions of a locale's ctype facet.

I wasn't aware of these functions. Would
something like

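#include <locale>

char s[6] = {"hello"};
wchar_t w[6] = {0};

// "" selects the user's environment locale
std::locale locale("");
// widen the 6 chars in [s, s+6), terminating null included
std::use_facet<std::ctype<wchar_t> >(locale)
.widen(s, s+6, w);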

work?
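
For what it is worth, the same call can be wrapped so that it works on std::string and std::wstring. This is only a sketch with a hypothetical function name. ctype widens one char at a time, so a multi-byte sequence such as UTF-8 is not decoded by it.

```cpp
#include <locale>
#include <string>

// Hypothetical helper: widens each char of `narrow` with the
// ctype<wchar_t> facet of `loc`.
std::wstring widen_with_locale(const std::string& narrow,
                               const std::locale& loc) {
    if (narrow.empty()) {
        return std::wstring();
    }
    std::wstring wide(narrow.size(), L'\0');
    std::use_facet<std::ctype<wchar_t> >(loc)
        .widen(narrow.data(), narrow.data() + narrow.size(), &wide[0]);
    return wide;
}
```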
The way I see it, you can either :
1. use the compiler's native encoding of wide characters, along with the
native locales, and let your compiler's library do its work. In this case,
you don't care what the values are in those wchar_t, as long as it matches
what the locales expect. (and it should !).

I know we're getting off-topic, you may prefer not
to answer.

I am working on a kind of ID3 editor. ID3
information may be encoded in ASCII (ISO-8859-1)
or in Unicode (UTF-8 or UTF-16). On a compiler
using, for example, MBCS for wide characters, what
could standard C++ do for me?
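
One part of this needs no locale support at all: ISO-8859-1 byte values coincide with the first 256 Unicode code points, so that case can be decoded by hand. A sketch assuming a C++11 compiler, with a hypothetical function name; UTF-8 and UTF-16 input would need a real decoder, which is not shown.

```cpp
#include <string>

// Hypothetical helper: decodes ISO-8859-1 bytes into UTF-32 code points.
// Each byte value is the code point, so only the sign of char matters.
std::u32string latin1_to_utf32(const std::string& bytes) {
    std::u32string out;
    out.reserve(bytes.size());
    for (char c : bytes) {
        // Go through unsigned char so bytes >= 0x80 stay in 0x80..0xFF.
        out += static_cast<char32_t>(static_cast<unsigned char>(c));
    }
    return out;
}
```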
2. enforce your own wide-char encoding (on a 4+ bytes type), and your own
conversions (with a 3rd party facet, or set of functions), without ever
using the compiler's native locale and wide IO features.

That would definitely be bad for code reuse :)
3. if you want to mix native stuff with third-party tools : set-up the
proper native-to-UTF-32 conversion system (e.g. make a header which tests
compiler-specific and std::library-specific macros, and does the proper
conversion, or aborts, or whatever. In most of the cases, the proper
conversion is keeping the wchar_t values untouched) and apply that
conversion between native calls and third-party UTF-32 calls.

So adopt a single encoding in my code and provide
conversions (adapters) for other libraries?


Thank you,

Jonathan
 
