Validating multibyte strings

Simon Morgan

Hi,

The following code is meant to validate a string of multibyte characters
by using mbcheck() to call mblen() on each character in the string passed
to it. The problem is that it isn't working the way I expect. I've included
in the comments what I think mbcheck() should return for each string,
given my understanding of how the multibyte system works.

#include <stdio.h>
#include <stdlib.h>

int mbcheck(const char *);

int main(void) {
    char *a[] = {
        "\x05\x87\x80\x36\xed\xaa", /* 0 */
        "\x20\xe4\x50\x88\x3f", /* -1 */
        "\xde\xad\xbe\xef", /* -1 */
        "\x8a\x60\x92\x74\x41" /* 0 */
    };
    int i;

    for (i = 0; i < sizeof(a) / sizeof(a[0]); i++) {
        printf("%d\n", mbcheck(a[i]));
        puts("--");
    }

    return 0;
}

int mbcheck(const char *s) {
    int n;

    for (mblen(NULL, 0); ; s += n) {
        printf("checking %#.8x\n", *s);
        if ((n = mblen(s, MB_CUR_MAX)) <= 0)
            return n;
        printf("%d\n", n);
    }
}

Does mblen() rely on a locale being set? Reading the man page it doesn't
look like it. This code is for an exercise in the book "C Programming: A
Modern Approach". The strings are supposedly Shift-JIS encoded kanji and I
have no idea which locale that relates to if there is one.

Also could somebody please explain to me what's with all the hexadecimal
f's in the output? As you've probably realised I'm still learning C but
seeing as s points to a char shouldn't printf() only be reading 1 byte and
padding the output with 0?

Many thanks.
 
Simon Morgan

Does mblen() rely on a locale being set? Reading the man page it doesn't
look like it. This code is for an exercise in the book "C Programming: A
Modern Approach". The strings are supposedly Shift-JIS encoded kanji and I
have no idea which locale that relates to if there is one.

OK, so apparently it does rely on a locale being set, and the man page does
mention it (albeit in the NOTES section, which I was stupid enough not to
read). However, I'd still like to know what's with all the hexadecimal f's
in the output.
 
Skarmander

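Simon Morgan wrote:
seeing as s points to a char shouldn't printf() only be reading 1 byte and
padding the output with 0?
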
No. The char is converted to an integer before being printed in unsigned
hexadecimal format. I'm guessing plain char is signed on your system, so
it's a sign-extended integer you're seeing. Try

printf("checking %#x\n", (unsigned char) *s);

instead.

S.
 
Richard Bos

Simon Morgan said:
int mbcheck(const char *s) {
    int n;

    for (mblen(NULL, 0); ; s += n) {
        printf("checking %#.8x\n", *s);
        if ((n = mblen(s, MB_CUR_MAX)) <= 0)
            return n;
        printf("%d\n", n);
    }
}

Does mblen() rely on a locale being set?

Yes:
# The behavior of the multibyte character functions is affected by the
# LC_CTYPE category of the current locale.

The strings are supposedly Shift-JIS encoded kanji and I
have no idea which locale that relates to if there is one.

Neither do I.

Also could somebody please explain to me what's with all the hexadecimal
f's in the output?

Sign extension of a negative integer.

As you've probably realised I'm still learning C but seeing as s points
to a char shouldn't printf() only be reading 1 byte and
padding the output with 0?

No. Because of the default integer promotions, all chars passed to a
variadic function (such as printf()) are actually passed as ints.

Richard
 
