multibyte, wchar_t and mblen(), wcslen()

Marcel Ruff

Hi,

I have a question about how to determine the
length of a wide string and of a multibyte string:

1. Number of letters (one letter may use three bytes)
2. Number of bytes

In the code snippet, p points to one Chinese character, which
I copied and pasted from a Chinese web page in my browser,
followed by a German umlaut.

So I would expect '2 letters' but '5 bytes'.

The results are given in the code comments.

So how can I determine the number of bytes versus the number
of letters?

Thanks
Marcel



/*
gcc -Wall -pedantic -o Hello Hello.c -std=c99
*/
#include <stdio.h> /* printf() */
#include <wchar.h> /* wcslen() (use the -std=c99 compile flag) */
#include <locale.h> /* setlocale() */
#include <string.h> /* strlen() */
#include <stdlib.h> /* mbstowcs() mblen() */

int main()
{
    setlocale(LC_CTYPE, "en_US");
    char *p = "統ö";

    size_t len = mblen(p, 100);
    printf("multibyte string: '%s' strlen=%d mblen=%d\n", p, (int)strlen(p), (int)len);
    /* --> multibyte string: '統ö' strlen=5 mblen=1 */

    wchar_t pwcs[126];
    mbstowcs(pwcs, p, 124);
    // wcslen() computes the number of wide characters in the string??
    printf("wide string: '%ls' wcslen=%d\n", pwcs, (int)wcslen(pwcs));
    /* --> wide string: '統ö' wcslen=5 */
    return 0;
}
 
Ben Bacarisse

Marcel Ruff said:

/*
gcc -Wall -pedantic -o Hello Hello.c -std=c99
*/
#include <stdio.h> /* printf() */
#include <wchar.h> /* wcslen() (use the -std=c99 compile flag) */
#include <locale.h> /* setlocale() */
#include <string.h> /* strlen() */
#include <stdlib.h> /* mbstowcs() mblen() */

int main()
{
    setlocale(LC_CTYPE, "en_US");

You need to check that the return value is not NULL; otherwise you may
not be setting anything here. Unless you are sure that the default
encoding will be a multibyte one, I would specify it. In other words, I would use:

if (setlocale(LC_CTYPE, "en_US.UTF-8") == NULL) {
...
}

Also, if setlocale succeeds, you might like to print the returned
string. The standard does not say much about what the string will
mean, but in practice *you* will probably know what it means!
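A minimal sketch of that check, for illustration only. The helper name select_ctype_locale is made up here, and printing MB_CUR_MAX is an extra not mentioned above; it is simply a quick way to see whether the selected encoding can use more than one byte per character.

```c
#include <assert.h>
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>   /* MB_CUR_MAX */

/* Hypothetical helper: try to select locale_name for LC_CTYPE.
   Returns 1 on success and 0 if the locale is not available. */
int select_ctype_locale(const char *locale_name)
{
    assert(locale_name != NULL);

    const char *result = setlocale(LC_CTYPE, locale_name);
    if (result == NULL) {
        fprintf(stderr, "setlocale(LC_CTYPE, \"%s\") failed\n", locale_name);
        return 0;
    }

    /* The meaning of the returned string is implementation-defined,
       but it usually names the locale that is now in effect. */
    printf("LC_CTYPE is now '%s', MB_CUR_MAX=%d\n", result, (int)MB_CUR_MAX);
    return 1;
}
```

A caller would typically try select_ctype_locale("en_US.UTF-8") first and stop with an error message if it returns 0. Which locale names exist depends on the system.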
    char *p = "統ö";

It is unwise to use anything but 7-bit characters when posting
code. You should use something like:

char *p = "\xC3\xB6"; // the UTF-8 encoding of o with diaeresis

I am not at all sure what bytes are in your string, so I cannot be
sure about what follows.
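For the full two-character string, a sketch along the same lines might look like the following. It assumes the characters in the original post are U+7D71 and U+00F6 encoded as UTF-8, which is a guess since the bytes actually posted are unknown. The names sample and hex_dump are invented for this example; hex_dump just makes the bytes of a string visible so that there is no doubt about what the program is working with.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumed UTF-8 bytes: U+7D71 (E7 B5 B1) followed by U+00F6 (C3 B6). */
static const char sample[] = "\xE7\xB5\xB1\xC3\xB6";

/* Hypothetical helper: write the bytes of s into out as space-separated
   uppercase hex and return the number of bytes in s.
   out should hold at least 3 * strlen(s) + 1 chars; output is
   truncated if it does not. */
size_t hex_dump(const char *s, char *out, size_t outsz)
{
    assert(s != NULL && out != NULL && outsz > 0);

    size_t n = strlen(s);
    size_t pos = 0;
    out[0] = '\0';

    /* Each byte needs up to 3 characters (separator + 2 hex digits) plus the NUL. */
    for (size_t i = 0; i < n && pos + 3 < outsz; i++) {
        pos += (size_t)snprintf(out + pos, outsz - pos, "%s%02X",
                                i ? " " : "", (unsigned)(unsigned char)s[i]);
    }
    return n;
}
```

With a 32-byte buffer, hex_dump(sample, buf, sizeof buf) returns 5 and leaves "E7 B5 B1 C3 B6" in buf, independent of the locale.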
    size_t len = mblen(p, 100);
    printf("multibyte string: '%s' strlen=%d mblen=%d\n", p, (int)strlen(p), (int)len);
    /* --> multibyte string: '統ö' strlen=5 mblen=1 */

That looks to me as though the default encoding for en_US on your system
accepts the first byte of that string as a character on its own (i.e. it
is a plain 8-bit character set encoding). Note that mblen() reports the
number of bytes in the first character only, not the length of the string.
Post the code and output again once (a) you specify a locale setting that
includes an encoding (as above) and (b) your example string is written out
using hex escapes in the source.
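To see how the locale changes what mblen() reports, something like the sketch below could be used. The helper name first_char_bytes and its -2 return code are inventions for this example, and the string passed in is assumed to be non-empty.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>   /* mblen() */
#include <string.h>

/* Hypothetical helper: number of bytes in the FIRST multibyte character
   of the non-empty string s under locale_name.
   Returns -1 if the leading bytes are not valid in that locale,
   and -2 if the locale itself cannot be selected. */
int first_char_bytes(const char *locale_name, const char *s)
{
    assert(locale_name != NULL && s != NULL);

    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -2;

    (void)mblen(NULL, 0);          /* reset mblen's internal shift state */
    return mblen(s, strlen(s));    /* looks at one character only */
}
```

In a UTF-8 locale the call first_char_bytes("en_US.UTF-8", "\xE7\xB5\xB1\xC3\xB6") should report 3, assuming that locale is installed. In a single-byte locale the same bytes are either one character each or invalid, depending on the encoding.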
    wchar_t pwcs[126];
    mbstowcs(pwcs, p, 124);
    // wcslen() computes the number of wide characters in the string??
    printf("wide string: '%ls' wcslen=%d\n", pwcs, (int)wcslen(pwcs));
    /* --> wide string: '統ö' wcslen=5 */
    return 0;
}
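As for the original question of bytes versus letters, one possible approach, sketched under the assumption that a suitable multibyte locale has been selected successfully, is to keep strlen() for the byte count and to step through the string with mbrlen() for the character count. The names count_chars and report_lengths are made up for this example. Once mbstowcs() succeeds in such a locale, wcslen() on the converted string should give the same character count.

```c
#include <assert.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>    /* mbrlen(), mbstate_t */

/* Count the multibyte characters in s under the current LC_CTYPE locale.
   Returns (size_t)-1 if s contains an invalid or truncated sequence. */
size_t count_chars(const char *s)
{
    assert(s != NULL);

    mbstate_t state;
    memset(&state, 0, sizeof state);   /* initial conversion state */
    size_t remaining = strlen(s);      /* number of bytes */
    size_t count = 0;

    while (remaining > 0) {
        size_t n = mbrlen(s, remaining, &state);
        if (n == (size_t)-1 || n == (size_t)-2)
            return (size_t)-1;         /* invalid or incomplete sequence */
        if (n == 0)
            break;                     /* defensive: terminator reached */
        s += n;
        remaining -= n;
        count++;
    }
    return count;
}

/* Hypothetical helper: select locale_name, then print both lengths.
   Returns the character count, or -1 if the locale is unavailable
   or s is not valid in its encoding. */
long report_lengths(const char *locale_name, const char *s)
{
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;

    size_t chars = count_chars(s);
    if (chars == (size_t)-1)
        return -1;

    printf("bytes=%lu chars=%lu\n",
           (unsigned long)strlen(s), (unsigned long)chars);
    return (long)chars;
}
```

For the five bytes E7 B5 B1 C3 B6 in a UTF-8 locale this should report 5 bytes and 2 characters. If the locale cannot be selected, the helper returns -1 instead of silently counting bytes.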
 
