Neil Booth
What is the behaviour of mbtowc following an attempt to convert an
invalid character sequence? My belief is that, if the encoding
is state-independent, mbtowc should continue to work when given
a valid sequence in a subsequent call, and that, if the encoding
is state-dependent, we need to reset to the initial shift state by
passing a null pointer in order to get defined behaviour.
The libc on my machine behaves as shown below. Is this non-conforming,
as I tend to believe? If it is conforming, mbtowc would be pretty
useless in practice IMHO.
Neil.
#include <assert.h>
#include <locale.h>
#include <stdlib.h>

/* Valid 2-byte shift-JIS character, not valid UTF-8 sequence. */
const char sjis[] = "\x95\x5c";
/* Valid UTF-8, of course. */
const char space[] = " ";

int main (void)
{
  wchar_t wc;

  setlocale (LC_CTYPE, "ja_JP.UTF-8");

  /* Assert it is not state-dependent. */
  assert (mbtowc (&wc, 0, 1) == 0);

  /* Assert my charset beliefs. */
  assert (mbtowc (&wc, space, sizeof space) == 1);
  assert (mbtowc (&wc, sjis, sizeof sjis) == -1);

  /* Redundant assertion that we're not state-dependent, but
     just in case some state needs resetting. */
  assert (mbtowc (&wc, 0, 1) == 0);

  /* This assertion fails - is this a bug? */
  assert (mbtowc (&wc, space, sizeof space) == 1);

  return 0;
}
$ ./a.out
assertion "mbtowc (&wc, space, sizeof space) == 1" failed: file
"/tmp/test.c", line 28, function "main"
Abort trap
$
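
To make the recovery pattern described above concrete, here is a minimal
sketch of a scanning loop that resets the shift state with a null-pointer
call after every failed conversion and then skips one byte. The helper name
count_convertible and its bad-byte counter are hypothetical, made up for
illustration only. The sketch assumes that the reset call is enough to make
later calls well-defined, which is exactly the point in question; a libc
that behaves like the one above would not resume converting after the first
invalid sequence.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: walk the first n bytes of s in the current
   LC_CTYPE locale and return how many characters converted cleanly.
   *bad receives the number of bytes skipped after failed conversions.
   Scanning stops early at a null character. */
size_t count_convertible (const char *s, size_t n, size_t *bad)
{
  size_t good = 0;
  size_t i = 0;

  assert (s != NULL && bad != NULL);
  *bad = 0;

  /* Begin in the initial shift state. */
  mbtowc (NULL, NULL, 0);

  while (i < n)
    {
      wchar_t wc;
      int len = mbtowc (&wc, s + i, n - i);

      if (len < 0)
        {
          /* Invalid or incomplete sequence: reset the state (assumed
             to be sufficient), then skip a single byte and retry. */
          mbtowc (NULL, NULL, 0);
          ++*bad;
          ++i;
        }
      else if (len == 0)
        break;                  /* null character: stop scanning */
      else
        {
          ++good;
          i += (size_t) len;
        }
    }

  return good;
}
```

Skipping one byte at a time is only one possible resynchronisation policy;
it is used here because it needs no knowledge of the encoding.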