Multi-byte chars

Randy Howard · Jul 11, 2003

Sorry, your logic is too foolish for me to understand.

Can the two of you go off privately somewhere and beat each other to
a pulp? Watching it here doesn't seem very productive.

Jun Woong · Jul 11, 2003

Jun Woong said:
char foo[] = "\x70\x70\x01\x02";
char bar[MB_CUR_MAX];

Assuming that str[] contains a valid multibyte character sequence,
'\x70' is a shift character and redundant shift characters are
allowed,

mbtowc(&wc, str, sizeof(str)-1);

Sorry. Two occurrences of "str" should be replaced with "foo".

Dan Pop · Jul 11, 2003

In said:
Indeed you are, as am I.

My opinion is that your opinion is downright broken. ;-)

There were very good reasons for the restriction in C89.

This statement is worth zilch without an enumeration of the "very good
reasons". Unlike JW, I'm completely immune to the "magister dixit" style
of argumentation.

AFAICT, there was NO good reason for this restriction in C89. Due to the
shift state issue, it provided no help when dealing with mb character
strings.

Dan

lawrence.jones · Jul 11, 2003

Dan Pop said:
This statement is worth zilch without an enumeration of the "very good
reasons". Unlike JW, I'm completely immune to the "magister dixit" style
of argumentation.

Bully for you. This isn't my area of expertise, thus the appeal to
authority. P. J. Plauger alludes to the kinds of problems it was
intended to address in his discussion of the _Printf function in "The
Standard C Library".

The fundamental issue is how to recognize a "%" in the format string.
As you've said, it is necessary to convert the format string to a
sequence of wide characters and look for one corresponding to a percent
sign. But what is the wide character code for a percent sign? It's
tempting to say that it's L'%', but remember that the wide character
encoding is allowed to be locale-specific, and the user is allowed to
change the current locale at any time, so that doesn't work without
something like the restriction under discussion. (With the restriction,
of course, you don't even need to use a wide character constant, '%' is
sufficient).

Without it, you'd be forced to call mbtowc on "%" every time to get the
current encoding, but the implementation must behave as if no library
function calls mbtowc, so you'd also have to save and restore its state
around the call. That was considered to be unacceptable overhead to
require, thus the restriction. (Which, as I've said before, was
innocuous at the time since no one was even contemplating an
implementation where it did not hold.)
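
To make the format-string problem concrete, here is a minimal sketch (not from the original post) of user-level code that scans a multibyte string for percent signs. The name count_percent is hypothetical. The comparison against L'%' is exactly the step that relies on the restriction under discussion, and the calls to mbtowc disturb its internal shift state, which a library function is not allowed to do visibly.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: count '%' characters in a multibyte string by
 * converting it one character at a time with mbtowc().
 * Assumption: the wide encoding of '%' equals L'%' in the current
 * locale, which is what the restriction guarantees. */
size_t count_percent(const char *fmt)
{
    size_t remaining = strlen(fmt);
    size_t count = 0;
    wchar_t wc;

    mbtowc(NULL, NULL, 0);              /* reset to the initial shift state */
    while (remaining > 0) {
        int n = mbtowc(&wc, fmt, remaining);
        if (n <= 0)                     /* invalid or incomplete sequence */
            break;
        if (wc == L'%')                 /* valid only under the guarantee */
            count++;
        fmt += n;
        remaining -= (size_t)n;
    }
    return count;
}
```

Without the guarantee, the `wc == L'%'` test would have to be replaced by a comparison against a value obtained from the current locale at run time.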

-Larry Jones

I stand FIRM in my belief of what's right! I REFUSE to
compromise my principles! -- Calvin

lawrence.jones · Jul 11, 2003

Dan Pop said:
The work on Unicode started in 1986, which is a good three years before
the adoption of C89.

But it hadn't gotten very far by the time C89 was finished (which was,
remember, a year before it was published due to procedural snafus). The
16-bit camp and the 32-bit camp were both deeply entrenched and fighting
with each other, leading to the eventual schism between the ISO 10646
folks and the Unicode folks that wasn't reconciled until fairly
recently. There wasn't even consensus among the masses that a universal
character set was practical, achievable, or even desirable.

-Larry Jones

Everything's gotta have rules, rules, rules! -- Calvin

Kevin Easton · Jul 11, 2003

lawrence.jones said:
Bully for you. This isn't my area of expertise, thus the appeal to
authority. P. J. Plauger alludes to the kinds of problems it was
intended to address in his discussion of the _Printf function in "The
Standard C Library".

The fundamental issue is how to recognize a "%" in the format string.
As you've said, it is necessary to convert the format string to a
sequence of wide characters and look for one corresponding to a percent
sign. But what is the wide character code for a percent sign? It's
tempting to say that it's L'%', but remember that the wide character
encoding is allowed to be locale-specific, and the user is allowed to
change the current locale at any time, so that doesn't work without
something like the restriction under discussion. (With the restriction,
of course, you don't even need to use a wide character constant, '%' is
sufficient).

Without it, you'd be forced to call mbtowc on "%" every time to get the
current encoding, but the implementation must behave as if no library
function calls mbtowc, so you'd also have to save and restore its state
around the call. That was considered to be unacceptable overhead to
require, thus the restriction. (Which, as I've said before, was
innocuous at the time since no one was even contemplating an
implementation where it did not hold.)

Why can't the implementation provide, for its own use, a lookup table
of what_percent_looks_like_in_this_locale[] - after all, mbtowc clearly
has this information available.

- Kevin.

Jun Woong · Jul 11, 2003

Kevin Easton said:
Bully for you. This isn't my area of expertise, thus the appeal to
authority. P. J. Plauger alludes to the kinds of problems it was
intended to address in his discussion of the _Printf function in "The
Standard C Library".

The fundamental issue is how to recognize a "%" in the format string.
As you've said, it is necessary to convert the format string to a
sequence of wide characters and look for one corresponding to a percent
sign. But what is the wide character code for a percent sign? It's
tempting to say that it's L'%', but remember that the wide character
encoding is allowed to be locale-specific, and the user is allowed to
change the current locale at any time, so that doesn't work without
something like the restriction under discussion. (With the restriction,
of course, you don't even need to use a wide character constant, '%' is
sufficient).

Without it, you'd be forced to call mbtowc on "%" every time to get the
current encoding, but the implementation must behave as if no library
function calls mbtowc, so you'd also have to save and restore its state
around the call. That was considered to be unacceptable overhead to
require, thus the restriction. (Which, as I've said before, was
innocuous at the time since no one was even contemplating an
implementation where it did not hold.)

Why can't the implementation provide, for its own use, a lookup table
of what_percent_looks_like_in_this_locale[] - after all, mbtowc clearly
has this information available.

One reason I can think of is portability. An easier (but not portable)
way than the one you describe is to take advantage of internal access
to the conversion state.

Kevin Easton · Jul 11, 2003

Jun Woong said:
Kevin Easton said:

(e-mail address removed) wrote: [ ...implementing _Printf, and '%' == L'%'... ]

Without it, you'd be forced to call mbtowc on "%" every time to get the
current encoding, but the implementation must behave as if no library
function calls mbtowc, so you'd also have to save and restore its state
around the call. That was considered to be unacceptable overhead to
require, thus the restriction. (Which, as I've said before, was
innocuous at the time since no one was even contemplating an
implementation where it did not hold.)

Why can't the implementation provide, for its own use, a lookup table
of what_percent_looks_like_in_this_locale[] - after all, mbtowc clearly
has this information available.

One reason I can think of is portability. An easier (but not portable)
way than the one you describe is to take advantage of internal access
to the conversion state.

There are plenty of library functions that have unacceptable overheads
when implemented in a portable manner, but can usually be efficiently
implemented in a non-portable way. In particular, strcmp() comes to
mind - so I don't think the possibility of a portable implementation
suffering unacceptable overhead when a non-portable implementation
wouldn't is sufficient reason to add the restriction.

- Kevin.

Jun Woong · Jul 11, 2003

Dan Pop said:
UCS did exist when C99 was drafted, yet the broken text is still there.

I've already said that I agree with your position that C99 shouldn't
have had the text. I guess it was a mistake.

The work on Unicode started in 1986, which is a good three years before
the adoption of C89.

Its publication was certainly after C90's.

What *exactly* was it buying to C90?

The text in C90 didn't cause a major problem in practice at that time.

[...]

This doesn't explain anything at all about the necessity of having
'a' == L'a', does it?

Read in context, please.

char foo[] = "\x70\x70\x01\x02";
char bar[MB_CUR_MAX];

Assuming that foo[] contains a valid multibyte character sequence,
'\x70' is a shift character and redundant shift characters are
allowed,

mbtowc(&wc, foo, sizeof(foo)-1);
wctomb(bar, wc);

the sequence in bar[] can be "\x70\x01\x02". Is this wrong?

I can't see anything wrong with that. Where is the problem?

DP> Convert the format string to wide characters and use only wide character
DP> constants in the implementation of printf. Generate the output as wide
DP> characters and convert them to multibyte characters before actually
DP> outputting them. [...]

And what the hell is wrong with

if (wc == L'%') /* conversion specifier */

which does NOT depend on that guarantee and is what I have suggested as
the portable solution to your problem?

Nope, it still depends on the guarantee. If there is no guarantee like
that, wc can have a different value from L'%' depending on locales,
even if wc contains a wide percent character in that locale.

Then, why did you invoke *portability* arguments for the usefulness of
the guarantee under discussion?

See above. And the reason I mentioned the other way is to say that an
implementer can rely on the implementation details if he doesn't care
about portability.

Nope, the code was equally easy to write in pure C89, without relying on
the guarantee, as demonstrated above.

In an incorrect way.

Jun Woong · Jul 11, 2003

Kevin Easton said:
There are plenty of library functions that have unacceptable overheads
when implemented in a portable manner, but can usually be efficiently
implemented in a non-portable way. In particular, strcmp() comes to
mind - so I don't think the possibility of a portable implementation
suffering unacceptable overhead when a non-portable implementation
wouldn't is sufficient reason to add the restriction.

The story can change if the committee considered the possibility that
users would want to write similar code in a portable way. Without such
a guarantee, the only way you, as a user of an implementation who
doesn't know its implementation details, can write similar code is to
use a technique that's somewhat complicated and has overhead.

Jun Woong · Jul 11, 2003

Dan Pop said:
This statement is worth zilch without an enumeration of the "very good
reasons". Unlike JW, I'm completely immune to the "magister dixit" style
of argumentation.

The reason I didn't ask what they were is not that I'm not immune to
it. It's because I know what they are.

Kevin Easton · Jul 11, 2003

Jun Woong said:
The story can change if the committee considered the possibility that
users would want to write similar code in a portable way. Without such
a guarantee, the only way you, as a user of an implementation who
doesn't know its implementation details, can write similar code is to
use a technique that's somewhat complicated and has overhead.

That's already true - a completely portable implementation of ROT13 is
far more complicated and has more overhead than an implementation that
assumes ASCII.
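
A minimal sketch of that contrast, not taken from the original post. Both function names are hypothetical. The first version assumes the letters are encoded contiguously, as in ASCII; the second makes no assumption about the encoding and pays for it with a table search.

```c
#include <string.h>

/* Assumes ASCII, or any encoding where 'a'..'z' and 'A'..'Z' are
 * each contiguous. The C standard does not guarantee this. */
int rot13_ascii(int c)
{
    if (c >= 'a' && c <= 'z') return 'a' + (c - 'a' + 13) % 26;
    if (c >= 'A' && c <= 'Z') return 'A' + (c - 'A' + 13) % 26;
    return c;
}

/* Portable: looks each letter up in a table, so it works whatever
 * values the execution character set assigns to the letters. */
int rot13_portable(int c)
{
    static const char lower[] = "abcdefghijklmnopqrstuvwxyz";
    static const char upper[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    const char *p;

    if (c == '\0')                      /* strchr would match the terminator */
        return c;
    if ((p = strchr(lower, c)) != NULL)
        return lower[(p - lower + 13) % 26];
    if ((p = strchr(upper, c)) != NULL)
        return upper[(p - upper + 13) % 26];
    return c;
}
```

On an ASCII system the two functions agree for every character; only the second remains correct on an encoding such as EBCDIC, where the letters are not contiguous.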

- Kevin.

Dan Pop · Jul 14, 2003

lawrence.jones said:
Bully for you. This isn't my area of expertise, thus the appeal to
authority. P. J. Plauger alludes to the kinds of problems it was
intended to address in his discussion of the _Printf function in "The
Standard C Library".

The fundamental issue is how to recognize a "%" in the format string.

And the trivial solution is btowc(), rather than imposing even *more*
conditions on the encoding of the character sets used by a conforming
implementation.

It doesn't look like the design of btowc() was beyond the capabilities of
the X3J11 committee, and its necessity is obvious, given the restrictions
of use of mbtowc().
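
A rough sketch of the btowc() approach, assuming an implementation that provides <wchar.h> as added by Amendment 1 (so this is not plain C90). The helper names wide_percent and is_percent are hypothetical.

```c
#include <wchar.h>

/* Hypothetical helper: ask the current locale for the wide character
 * corresponding to the single byte '%', instead of assuming L'%'.
 * btowc() returns WEOF if the byte is not a valid one-byte character
 * in the initial shift state. */
wint_t wide_percent(void)
{
    return btowc('%');
}

/* Nonzero if wc is the current locale's wide percent sign. */
int is_percent(wchar_t wc)
{
    wint_t p = wide_percent();
    return p != WEOF && (wint_t)wc == p;
}
```

Because btowc() takes no string pointer and keeps no conversion state, calling it does not interfere with a caller's ongoing mbtowc() sequence.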

But even mbtowc() could be safely used by printf for this purpose, right
before calling it on the first character of the format string, which
already assumes the initial shift state: converting % is not going to
cause any change of shift state.

Dan

lawrence.jones · Jul 14, 2003

Dan Pop said:
And the trivial solution is btowc(), rather than imposing even *more*
conditions on the encoding of the character sets used by a conforming
implementation.

btowc() didn't exist in C90 (it was added in AM1), so it hardly
qualifies as a "trivial solution". (And I'm not sure what you mean by
"imposing even *more* conditions on the encoding", C imposes very few
conditions.)

It doesn't look like the design of btowc() was beyond the capabilities of
the X3J11 committee, and its necessity is obvious, given the restrictions
of use of mbtowc().

No, its necessity was *not* obvious -- the restriction served the
purpose just as well. The committee did not have sufficient expertise
in this area to be comfortable inventing a complete solution. The small
group of experts advising us recommended that we adopt just the minimum
set of basic capabilities and they would then go off and consider a more
complete solution to be adopted as an amendment later. They had no
problem with the restriction, nor did they advise removing it in the
amendment that they ultimately produced. As I've said numerous times
now to no avail, at the time, *NO ONE* had even contemplated an
environment where the restriction was a problem. Much like the
restriction on the encoding of the digits, it was viewed as recognizing
the way the world worked; no one expected anyone to seriously propose an
encoding that would run afoul of it.

But even mbtowc() could be safely used by printf for this purpose, right
before calling it on the first character of the format string, which
already assumes the initial shift state: converting % is not going to
cause any change of shift state.

That's true for printf() and friends, but that was just an *example* of
the kinds of problems the restriction was intended to address, it was
not the sole problem. User code (particularly third-party library code)
cannot so easily avoid the state problems.

-Larry Jones

I can feel my brain beginning to atrophy already. -- Calvin

Jun Woong · Jul 15, 2003

Dan Pop said:
But even mbtowc() could be safely used by printf for this purpose, right
before calling it on the first character of the format string, which
already assumes the initial shift state: converting % is not going to
cause any change of shift state.

How could it be safe without saving and restoring the state
information, if a user interleaves a call to printf() between
two calls to mbtowc(), the latter of which depends on the state
changed by the former?

Dan Pop · Jul 15, 2003

Jun Woong said:
How could it be safe without saving and restoring the state
information, if a user interleaves a call to printf() between
two calls to mbtowc(), the latter of which depends on the state
changed by the former?

We have already agreed that a portable implementation of printf *must*
use mbtowc to parse the format string, haven't we? Wasn't implementing
printf in a portable way *your* argument?

Dan

Dan Pop · Jul 15, 2003

lawrence.jones said:
btowc() didn't exist in C90 (it was added in AM1), so it hardly
qualifies as a "trivial solution".

I know that it didn't exist in C90, but this doesn't make it a less
trivial solution, as explained below. Once the problem was identified,
there were two solutions: the wrong one, which the committee chose, and
the correct one: provide the required conversion function.

(And I'm not sure what you mean by
"imposing even *more* conditions on the encoding", C imposes very few
conditions.)

It shouldn't impose *any*, because it claims that the issue is beyond its
scope. Yet, it imposes several conditions:

1. The encoding of any member of the base character set, when stored
in a char, has a non-negative value.

2. The digit characters have contiguous encodings.

3. The members of the base character set have the same value when encoded
as character constants, wide character constants and multibyte
characters in the initial shift state.

4. Whatever I can't remember or I'm not even aware of.

No, its necessity was *not* obvious -- the restriction served the
purpose just as well.

The problem was well understood and choosing the restriction as its
solution was obviously the WRONG thing. For the reason already explained.

Since mbtowc was already in C89, adding btowc wouldn't have required any
extra amount of expertise or put a significant load on the implementor.

Dan

lawrence.jones · Jul 15, 2003

Dan Pop said:
I know that it didn't exist in C90, but this doesn't make it a less
trivial solution, as explained below. Once the problem was identified,
there were two solutions: the wrong one, which the committee chose, and
the correct one: provide the required conversion function.

It must be nice to see everything in black and white and not have to
worry about those annoying shades of gray.

1. The encoding of any member of the base character set, when stored
in a char, has a non-negative value.

That's a restriction on the implementation of type char, not a
restriction on the character set.

2. The digit characters have contiguous encodings.

That is a restriction on the character set. It also happens to be a
very desirable characteristic of a coded character set; so desirable
that no one has ever reported meeting one that doesn't have it.
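
As a small illustration of why this guarantee is convenient (a sketch, not from the original post), the usual digit-to-value idiom depends on nothing else. The function name digit_value is hypothetical.

```c
/* Relies only on the guarantee that '0'..'9' have contiguous,
 * increasing encodings. Returns -1 for a non-digit. */
int digit_value(int c)
{
    return (c >= '0' && c <= '9') ? c - '0' : -1;
}
```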

3. The members of the base character set have the same value when encoded
as character constants, wide character constants and multibyte
characters in the initial shift state.

Twenty years ago that appeared to fall into the same category as the
previous restriction. Today, it does not.

-Larry Jones

Geez, I gotta have a REASON for everything? -- Calvin

Jun Woong · Jul 15, 2003

Dan Pop said:
We have already agreed that a portable implementation of printf *must*
use mbtowc to parse the format string, haven't we?

I've already said that an implementation is not allowed to use mbtowc
for that purpose. As I have said repeatedly, C89's support for the
extended character set was not enough.

Dan Pop · Jul 16, 2003

lawrence.jones said:
It must be nice to see everything in black and white and not have to
worry about those annoying shades of gray.

Especially when they don't exist. If the btowc solution had any
drawbacks, you'd have a point. But since it's only the committee
solution that has drawbacks, you don't.

Dan
