How do mbtowc() and wctomb() work?

Ross

Hi,

I have a question regarding how the mbtowc() and wctomb() functions
work. Given that some compilers (gcc, for example) allow the wide
execution character set to be specified at compile time, and that the
multibyte encoding depends on LC_CTYPE, this suggests that (at runtime)
the compiled program has the ability to convert character strings
between arbitrary character sets.

My question is, how is this conversion performed? As I understand it,
the C library does not have this facility. So what does...?

Thanks in advance
 
J. J. Farrell

Ross said:
I have a question regarding how the mbtowc() and wctomb() functions
work. Given that some compilers (gcc, for example) allow the wide
execution character set to be specified at compile time, and that the
multibyte encoding depends on LC_CTYPE, this suggests that (at runtime)
the compiled program has the ability to convert character strings
between arbitrary character sets.

My question is, how is this conversion performed? As I understand it,
the C library does not have this facility. So what does...?

The C library.
 
Simon Biber

Ross said:
If the C library has this functionality, is it available through the
API? I can't find any mention of it in the C spec.

The API essentially consists of the setlocale, mbtowc, wctomb, mbstowcs,
wcstombs, mbrtowc, wcrtomb, mbsrtowcs and wcsrtombs functions!

On a hosted C implementation, the C library is required to provide these
functions. They don't have to be particularly useful. For example, the
library may support only the "C" and "" locales, and the native locale
"" may be equivalent to the "C" locale. In this case there is not much
scope for converting character strings between arbitrary character sets.

If your C spec doesn't contain descriptions of those functions, you may
find that it does not conform to the latest C standard.
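As a rough illustration of how these functions fit together, here is a minimal sketch that decodes a multibyte string using whatever encoding the current LC_CTYPE locale selects. The helper name decode_in_current_locale is hypothetical, not part of any library; only setlocale and mbstowcs come from the standard.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: decode the multibyte string `mb` using the
   encoding of the current LC_CTYPE locale. Writes at most cap - 1 wide
   characters plus a terminator to `out`. Returns the number of wide
   characters written (terminator excluded), or (size_t)-1 if `mb`
   contains a sequence that is invalid in the current locale. */
size_t decode_in_current_locale(const char *mb, wchar_t *out, size_t cap)
{
    assert(mb != NULL && out != NULL && cap > 0);

    /* mbstowcs consults the LC_CTYPE category set by setlocale(). */
    size_t n = mbstowcs(out, mb, cap - 1);
    if (n == (size_t)-1)
        return n;

    /* mbstowcs does not terminate the output when it fills the buffer. */
    out[n] = L'\0';
    return n;
}
```

A caller would normally call setlocale(LC_CTYPE, "") first to pick up the native locale. On an implementation where "" is equivalent to "C", only the basic character set is likely to convert usefully.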
 
Ross

Simon said:
The API essentially consists of the setlocale, mbtowc, wctomb, mbstowcs,
wcstombs, mbrtowc, wcrtomb, mbsrtowcs and wcsrtombs functions!

On a hosted C implementation, the C library is required to provide these
functions. They don't have to be particularly useful. For example, the
library may support only the "C" and "" locales, and the native locale
"" may be equivalent to the "C" locale. In this case there is not much
scope for converting character strings between arbitrary character sets.

If your C spec doesn't contain descriptions of those functions, you may
find that it does not conform to the latest C standard.

Thanks, I'm aware of those functions. However, given that both the
execution and native character sets are flexible, the existence of
these functions seems to suggest that the C library *should* have the
ability to convert between truly arbitrary character sets, not just the
encoding of 'mb' and 'wc'. I guess the existence of such a facility is
implied, rather than required, hence the reason the API doesn't provide
an iconv-esque interface.
 
Simon Biber

Ross said:
Thanks, I'm aware of those functions. However, given that both the
execution and native character sets are flexible, the existence of
these functions seems to suggest that the C library *should* have the
ability to convert between truly arbitrary character sets, not just the
encoding of 'mb' and 'wc'. I guess the existence of such a facility is
implied, rather than required, hence the reason the API doesn't provide
an iconv-esque interface.

Yes. It's what I call "partial standardisation". The API is defined but
it's not useful in portable code since you can't tell whether there is
actually any useful functionality behind it. Some implementations
provide a useful implementation with many locales and many different
encodings (glibc) but some implementations don't bother (msvcrt).
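About the only portable way to find out what is behind the API is to ask setlocale() and check for a null return. The sketch below does that and then restores the previous LC_CTYPE setting. The function name locale_is_available is made up for this example, and note that some libraries accept names they do not really implement, so a successful return is a hint rather than a guarantee.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns 1 if setlocale() accepts `name` for
   LC_CTYPE, 0 otherwise. The previous LC_CTYPE locale is restored
   before returning. */
int locale_is_available(const char *name)
{
    assert(name != NULL);

    /* Query the current locale; the returned string may be overwritten
       by the next setlocale() call, so take a copy. */
    const char *cur = setlocale(LC_CTYPE, NULL);
    if (cur == NULL)
        return 0;
    char *saved = malloc(strlen(cur) + 1);
    if (saved == NULL)
        return 0;
    strcpy(saved, cur);

    /* A null return means the implementation cannot honour the request. */
    int ok = setlocale(LC_CTYPE, name) != NULL;

    setlocale(LC_CTYPE, saved);
    free(saved);
    return ok;
}
```

Locale names other than "C" and "" are implementation-defined, so any name passed in (for example "en_US.UTF-8") is an assumption about the target system.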

By the way, please snip out signatures (anything following -- on its own
line) unless you are specifically commenting on someone's signature.
 
P.J. Plauger

Ross said:
Thanks, I'm aware of those functions. However, given that both the
execution and native character sets are flexible, the existence of
these functions seems to suggest that the C library *should* have the
ability to convert between truly arbitrary character sets, not just the
encoding of 'mb' and 'wc'. I guess the existence of such a facility is
implied, rather than required, hence the reason the API doesn't provide
an iconv-esque interface.

Right. Support for multiple conversions can vary from the trivial,
as Biber described above, to the highly adaptive. See the essay
on multibyte encodings:

http://www.dinkumware.com/manuals/?manual=compleat&page=multibyte.html

for an overview of the issues that arise when an implementation
permits various encodings to change.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
 
Simon Biber

P.J. Plauger said:
Right. Support for multiple conversions can vary from the trivial,
as Biber described above, to the highly adaptive. See the essay
on multibyte encodings:

http://www.dinkumware.com/manuals/?manual=compleat&page=multibyte.html

for an overview of the issues that arise when an implementation
permits various encodings to change.

It's an interesting essay, and it introduced me to many facets (no pun
intended) of C++ that I never bothered learning.

Towards the end it says "some people are proposing the use of UTF-16 as
a wide-character encoding". This is no longer just a proposal; it was
introduced in Windows 2000 and now seems to be fully entrenched in
Microsoft products.

"UTF-16 is the native internal representation of text in the Microsoft
Windows NT/Windows 2000/Windows XP/Windows CE, Qualcomm BREW, and
Symbian operating systems; the Java and .NET bytecode environments; Mac
OS X's Cocoa and Core Foundation frameworks; and the Qt cross-platform
graphical widget toolkit."

It makes wchar_t handling much more tricky than it was intended to be.
Indeed, many programs don't bother considering or handling the surrogate
pairs.
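To show what handling surrogate pairs involves, here is a small sketch that decodes one code point from a sequence of UTF-16 code units. It uses uint16_t rather than wchar_t so that it does not depend on the platform's wchar_t width, and the function name utf16_decode is invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one code point from the UTF-16 code
   units at `s` (at most `len` units available). Stores the code point
   in *cp and returns the number of units consumed (1 or 2). Returns 0
   for empty input or an unpaired surrogate. */
size_t utf16_decode(const uint16_t *s, size_t len, uint32_t *cp)
{
    assert(s != NULL && cp != NULL);
    if (len == 0)
        return 0;

    uint16_t hi = s[0];
    if (hi < 0xD800 || hi > 0xDFFF) {
        *cp = hi;               /* ordinary BMP character, one unit */
        return 1;
    }
    if (hi >= 0xDC00 || len < 2)
        return 0;               /* lone low surrogate, or truncated pair */

    uint16_t lo = s[1];
    if (lo < 0xDC00 || lo > 0xDFFF)
        return 0;               /* high surrogate without a low one */

    /* Each surrogate carries 10 bits; the pair encodes cp - 0x10000. */
    *cp = 0x10000u + (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
    return 2;
}
```

Code that treats every wchar_t as one character skips this step, which is why characters outside the Basic Multilingual Plane tend to get split or miscounted on systems with a 16-bit wchar_t.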
 
Stephen Sprunk

Ross said:
Thanks, I'm aware of those functions. However, given that both the
execution and native character sets are flexible, the existence of
these functions seems to suggest that the C library *should* have the
ability to convert between truly arbitrary character sets, not just the
encoding of 'mb' and 'wc'. I guess the existence of such a facility is
implied, rather than required, hence the reason the API doesn't provide
an iconv-esque interface.

Well, if you have a decent implementation, you can convert from any
interesting charset to wide chars, then change the locale appropriately
and convert them to any other interesting charset.

Figuring out which locales are available (if any besides "C" and "")
is the stumbling block, since they vary from system to system.
Add-ons like iconv() tend to be more useful and more portable in
practice.
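The two-step conversion might look roughly like the sketch below. It assumes that both locale names are accepted by setlocale() and that the wide character encoding is the same in both locales; neither is guaranteed by the standard. The function name convert_via_wide is hypothetical.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: re-encode `src` from the multibyte encoding of
   `from_locale` to that of `to_locale`, going through wchar_t.
   Returns the number of bytes written to `dst` (terminator excluded),
   or (size_t)-1 on any failure. Changes the global LC_CTYPE locale and
   does not restore it. */
size_t convert_via_wide(const char *src, const char *from_locale,
                        const char *to_locale, char *dst, size_t dst_cap)
{
    assert(src && from_locale && to_locale && dst && dst_cap > 0);

    if (setlocale(LC_CTYPE, from_locale) == NULL)
        return (size_t)-1;

    /* A string never decodes to more wide characters than it has
       bytes, so strlen(src) + 1 elements always suffice. */
    size_t wcap = strlen(src) + 1;
    wchar_t *wide = malloc(wcap * sizeof *wide);
    if (wide == NULL)
        return (size_t)-1;

    size_t out = (size_t)-1;
    if (mbstowcs(wide, src, wcap) != (size_t)-1 &&
        setlocale(LC_CTYPE, to_locale) != NULL) {
        out = wcstombs(dst, wide, dst_cap);
        if (out == dst_cap)     /* buffer filled, no room for terminator */
            out = (size_t)-1;
    }

    free(wide);
    return out;
}
```

Because setlocale() changes process-wide state, this approach is awkward in multithreaded programs, which is one more reason an iconv-style interface tends to be the practical choice.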

S
 
lawrence.jones

Stephen Sprunk said:
Well, if you have a decent implementation, you can convert from any
interesting charset to wide chars, then change the locale appropriately
and convert them to any other interesting charset.

There's no guarantee that the wide character encoding isn't also
locale-specific, so that doesn't work in the general case. Of course, you're
free to define "decent implementation" as one where the wide character
encoding is independent of locale.

-Larry Jones

I'm getting disillusioned with these New Years. -- Calvin
 
