How do mbtowc() and wctomb() work?

Ross · Jul 24, 2006

Hi,

I have a question regarding how the mbtowc() and wctomb() functions
work. Given that some compilers (gcc, for example) allow the wide
execution character set to be specified at compile time, and that the
multibyte encoding depends on LC_CTYPE, this suggests that (at runtime)
the compiled program has the ability to convert character strings
between arbitrary character sets.

My question is, how is this conversion performed? As I understand it,
the C library does not have this facility. So what does...?

Thanks in advance

J. J. Farrell · Jul 25, 2006

Ross said:
I have a question regarding how the mbtowc() and wctomb() functions
work. Given that some compilers (gcc, for example) allow the wide
execution character set to be specified at compile time, and that the
multibyte encoding depends on LC_CTYPE, this suggests that (at runtime)
the compiled program has the ability to convert character strings
between arbitrary character sets.

My question is, how is this conversion performed? As I understand it,
the C library does not have this facility. So what does...?

The C library.

Ross · Jul 25, 2006

J. J. Farrell said:
The C library.

If the C library has this functionality, is it available through the
API? I can't find any mention of it in the C spec.

Simon Biber · Jul 25, 2006

Ross said:
If the C library has this functionality, is it available through the
API? I can't find any mention of it in the C spec.

The API essentially consists of the setlocale, mbtowc, wctomb, mbstowcs,
wcstombs, mbrtowc, wcrtomb, mbsrtowcs and wcsrtombs functions!

On a hosted C implementation, the C library is required to provide these
functions. They don't have to be particularly useful. For example, the
library may support only the "C" and "" locales, and the native locale
"" may be equivalent to the "C" locale. In this case there is not much
scope for converting character strings between arbitrary character sets.

If your C spec doesn't contain descriptions of those functions, you may
find that it does not conform to the latest C standard.
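As a rough illustration of how those functions fit together, here is a minimal sketch of a wrapper around mbstowcs(). The name to_wide is made up for this example; the multibyte encoding it decodes is whatever the current LC_CTYPE locale says it is, so the caller is expected to have called setlocale() first.

```c
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: convert a multibyte string to wide characters
   using the encoding of the current LC_CTYPE locale.
   Returns the number of wide characters stored (not counting the
   terminator), or (size_t)-1 if cap is 0 or the input contains an
   invalid multibyte sequence. Output is always terminated on success. */
size_t to_wide(const char *mb, wchar_t *out, size_t cap)
{
    size_t n;

    if (cap == 0)
        return (size_t)-1;

    /* Leave room for the terminator; mbstowcs does not add one
       when the output buffer fills up. */
    n = mbstowcs(out, mb, cap - 1);
    if (n == (size_t)-1)
        return n;

    out[n] = L'\0';
    return n;
}
```

A caller would typically run setlocale(LC_CTYPE, "") once at startup and then call to_wide() on its input. With only the "C" locale available, this does little more than widen each char.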

Ross · Jul 25, 2006

Simon Biber said:
The API essentially consists of the setlocale, mbtowc, wctomb, mbstowcs,
wcstombs, mbrtowc, wcrtomb, mbsrtowcs and wcsrtombs functions!

On a hosted C implementation, the C library is required to provide these
functions. They don't have to be particularly useful. For example, the
library may support only the "C" and "" locales, and the native locale
"" may be equivalent to the "C" locale. In this case there is not much
scope for converting character strings between arbitrary character sets.

If your C spec doesn't contain descriptions of those functions, you may
find that it does not conform to the latest C standard.

Thanks, I'm aware of those functions. However, given that both the
execution and native character sets are flexible, the existence of
these functions seems to suggest that the C library *should* have the
ability to convert between truly arbitrary character sets, not just the
encoding of 'mb' and 'wc'. I guess the existence of such a facility is
implied, rather than required, hence the reason the API doesn't provide
an iconv-esque interface.

Simon Biber · Jul 25, 2006

Ross said:
Thanks, I'm aware of those functions. However, given that both the
execution and native character sets are flexible, the existence of
these functions seems to suggest that the C library *should* have the
ability to convert between truly arbitrary character sets, not just the
encoding of 'mb' and 'wc'. I guess the existence of such a facility is
implied, rather than required, hence the reason the API doesn't provide
an iconv-esque interface.

Yes. It's what I call "partial standardisation". The API is defined but
it's not useful in portable code since you can't tell whether there is
actually any useful functionality behind it. Some implementations
provide a useful implementation with many locales and many different
encodings (glibc) but some implementations don't bother (msvcrt).
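About the only portable way to find out whether a given locale is really there is to ask setlocale() and check for a null return. A minimal sketch, assuming a made-up helper name locale_available; locale names themselves are implementation-defined, so any name other than "C" and "" is a guess on a given system.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return 1 if setlocale() accepts `name` for
   LC_CTYPE, 0 otherwise. The previous LC_CTYPE locale is restored
   before returning. */
int locale_available(const char *name)
{
    const char *cur = setlocale(LC_CTYPE, NULL);  /* query only */
    char *saved;
    int ok;

    if (cur == NULL)
        return 0;

    /* The string returned by setlocale may be overwritten by the
       next call, so keep a private copy. */
    saved = malloc(strlen(cur) + 1);
    if (saved == NULL)
        return 0;
    strcpy(saved, cur);

    ok = setlocale(LC_CTYPE, name) != NULL;

    setlocale(LC_CTYPE, saved);
    free(saved);
    return ok;
}
```

Note that a successful return only says the locale exists, not which multibyte encoding it uses.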

By the way, please snip out signatures (anything following -- on its own
line) unless you are specifically commenting on someone's signature.

P.J. Plauger · Jul 25, 2006

Ross said:
Thanks, I'm aware of those functions. However, given that both the
execution and native character sets are flexible, the existence of
these functions seems to suggest that the C library *should* have the
ability to convert between truly arbitrary character sets, not just the
encoding of 'mb' and 'wc'. I guess the existence of such a facility is
implied, rather than required, hence the reason the API doesn't provide
an iconv-esque interface.

Right. Support for multiple conversions can vary from the trivial,
as Biber described above, to the highly adaptive. See the essay
on multibyte encodings:

http://www.dinkumware.com/manuals/?manual=compleat&page=multibyte.html

for an overview of the issues that arise when an implementation
permits various encodings to change.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com

Simon Biber · Jul 25, 2006

P.J. Plauger said:
Right. Support for multiple conversions can vary from the trivial,
as Biber described above, to the highly adaptive. See the essay
on multibyte encodings:

http://www.dinkumware.com/manuals/?manual=compleat&page=multibyte.html

for an overview of the issues that arise when an implementation
permits various encodings to change.

It's an interesting essay, and it introduced me to many facets (no pun
intended) of C++ that I never bothered learning.

Towards the end it says "some people are proposing the use of UTF-16 as
a wide-character encoding". This is no longer just a proposal; it was
introduced in Windows 2000 and now seems to be fully entrenched in
Microsoft products.

"UTF-16 is the native internal representation of text in the Microsoft
Windows NT/Windows 2000/Windows XP/Windows CE, Qualcomm BREW, and
Symbian operating systems; the Java and .NET bytecode environments; Mac
OS X's Cocoa and Core Foundation frameworks; and the Qt cross-platform
graphical widget toolkit."

It makes wchar_t handling much more tricky than it was intended to be.
Indeed, many programs don't bother considering or handling the surrogate
pairs.
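For reference, handling a surrogate pair is not much code. The sketch below uses uint16_t rather than wchar_t so that it does not depend on how wide wchar_t is on a given platform; the function name utf16_decode is an assumption for this example, not a standard library function.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one code point from a UTF-16 sequence.
   Returns the number of 16-bit units consumed (1 or 2), or 0 if the
   input is empty, truncated, or contains an unpaired surrogate. */
size_t utf16_decode(const uint16_t *s, size_t len, uint32_t *cp)
{
    uint16_t hi, lo;

    if (len == 0)
        return 0;

    hi = s[0];
    if (hi < 0xD800 || hi > 0xDFFF) {
        *cp = hi;               /* ordinary BMP character */
        return 1;
    }
    if (hi >= 0xDC00 || len < 2)
        return 0;               /* lone low surrogate, or truncated pair */

    lo = s[1];
    if (lo < 0xDC00 || lo > 0xDFFF)
        return 0;               /* high surrogate not followed by a low one */

    /* Each surrogate carries 10 bits; the pair encodes cp - 0x10000. */
    *cp = 0x10000 + (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
    return 2;
}
```

Code that treats every 16-bit unit as one character skips exactly this step, which is why it breaks on anything outside the Basic Multilingual Plane.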

Stephen Sprunk · Jul 26, 2006

Ross said:
Thanks, I'm aware of those functions. However, given that both the
execution and native character sets are flexible, the existence of
these functions seems to suggest that the C library *should* have the
ability to convert between truly arbitrary character sets, not just the
encoding of 'mb' and 'wc'. I guess the existence of such a facility is
implied, rather than required, hence the reason the API doesn't provide
an iconv-esque interface.

Well, if you have a decent implementation, you can convert from any
interesting charset to wide chars, then change the locale appropriately
and convert them to any other interesting charset.

Figuring out which locales are available (if any besides "C" and "")
is the stumbling block, since they vary from system to system.
Add-ons like iconv() tend to be more useful and more portable in
practice.
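A minimal sketch of that two-step conversion, for what it's worth. The function name transcode and its interface are made up for this example, and it only gives meaningful results if both locale names exist on the system and the implementation uses the same wide character encoding in both locales, which the standard does not promise.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert `src` from the multibyte encoding of
   locale `from` to that of locale `to`, going through wchar_t.
   Returns the number of bytes written to dst (excluding the
   terminator), or (size_t)-1 on any failure, including truncation.
   The caller's LC_CTYPE locale is restored before returning. */
size_t transcode(const char *src, const char *from, const char *to,
                 char *dst, size_t cap)
{
    size_t result = (size_t)-1;
    size_t n;
    size_t srclen = strlen(src);
    wchar_t *wide = NULL;
    char *saved;
    const char *cur = setlocale(LC_CTYPE, NULL);

    if (cap == 0 || cur == NULL)
        return result;
    saved = malloc(strlen(cur) + 1);
    if (saved == NULL)
        return result;
    strcpy(saved, cur);

    /* Step 1: multibyte -> wide, under the source locale.
       There are never more wide characters than input bytes. */
    if (setlocale(LC_CTYPE, from) == NULL)
        goto done;
    wide = malloc((srclen + 1) * sizeof *wide);
    if (wide == NULL)
        goto done;
    n = mbstowcs(wide, src, srclen + 1);
    if (n == (size_t)-1)
        goto done;

    /* Step 2: wide -> multibyte, under the destination locale. */
    if (setlocale(LC_CTYPE, to) == NULL)
        goto done;
    n = wcstombs(dst, wide, cap);
    if (n == (size_t)-1 || n == cap)
        goto done;              /* unrepresentable character, or no room
                                   left for the terminator */
    result = n;

done:
    setlocale(LC_CTYPE, saved);
    free(wide);
    free(saved);
    return result;
}
```

Switching the global locale back and forth like this is also unsafe in a multithreaded program, which is one more reason an interface like iconv() tends to be the more practical choice.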

S

lawrence.jones · Jul 27, 2006

Stephen Sprunk said:
Well, if you have a decent implementation, you can convert from any
interesting charset to wide chars, then change the locale appropriately
and convert them to any other interesting charset.

There's no guarantee that the wide character encoding isn't also locale-
specific, so that doesn't work in the general case. Of course, you're
free to define "decent implementation" as one where the wide character
encoding is independent of locale.
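One way a program can test for such an implementation is the __STDC_ISO_10646__ macro from C99. When it is defined, wchar_t values are intended to be ISO/IEC 10646 (Unicode) code points, which in practice means the wide encoding does not change with the locale. A minimal sketch; the function name wchar_iso10646_version is made up for this example.

```c
/* Hypothetical helper: return the value of __STDC_ISO_10646__
   (an integer of the form yyyymmL) if the implementation defines it,
   or 0 if it makes no such promise about wchar_t. */
long wchar_iso10646_version(void)
{
#ifdef __STDC_ISO_10646__
    return __STDC_ISO_10646__;
#else
    return 0L;
#endif
}
```

A zero result does not prove that the wide encoding varies with the locale, only that the implementation does not claim otherwise.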

-Larry Jones

I'm getting disillusioned with these New Years. -- Calvin
