mblen and mbrlen

Paul King

I've been getting inconsistent results with mblen and mbrlen on
Solaris.

Although mblen accepts a multibyte string, mbrlen always rejects it,
reporting an encoding error. The mbstate_t variable is valid (it has
been zeroed with memset, and I have confirmed that it is in a valid
initial state with mbsinit directly before the call to mbrlen).

Can anyone shed any light on why this could be happening?
 
Andreas Kahari

Paul King said:
> I've been getting inconsistent results with mblen and mbrlen on
> Solaris.
>
> Although mblen accepts a multibyte string, mbrlen always rejects it,
> reporting an encoding error. The mbstate_t variable is valid (it has
> been zeroed with memset, and I have confirmed that it is in a valid
> initial state with mbsinit directly before the call to mbrlen).
>
> Can anyone shed any light on why this could be happening?

Maybe you could provide a trimmed down runnable program that
exhibits the behaviour that you describe?
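
For reference, a minimal sketch of what such a trimmed-down test case might look like. The names measure_first_char, report_first_char and struct first_char_lengths are hypothetical, not taken from the original poster's code, and the sketch assumes an implementation that provides mbrlen in <wchar.h>. Both functions are handed the same byte limit, so any difference in the results comes from the functions themselves rather than from their arguments.

```c
#include <stdio.h>
#include <stdlib.h>   /* mblen */
#include <string.h>   /* memset, strlen */
#include <wchar.h>    /* mbrlen, mbstate_t */

/* Hypothetical helper type: what each function reports about the first
   multibyte character of a string. */
struct first_char_lengths {
    int    from_mblen;    /* bytes used, 0 for "", -1 on error */
    size_t from_mbrlen;   /* bytes used, 0 for "", (size_t)-1 on error,
                             (size_t)-2 if the character is incomplete */
};

struct first_char_lengths measure_first_char(const char *s)
{
    struct first_char_lengths r;
    mbstate_t state;
    size_t n = strlen(s) + 1;         /* same limit for both calls */

    (void)mblen(NULL, 0);             /* reset mblen's internal shift state */
    r.from_mblen = mblen(s, n);

    memset(&state, 0, sizeof state);  /* initial conversion state */
    r.from_mbrlen = mbrlen(s, n, &state);
    return r;
}

/* Print both results side by side for one input string. */
void report_first_char(const char *s)
{
    struct first_char_lengths r = measure_first_char(s);
    printf("mblen: %d, mbrlen: %ld\n", r.from_mblen, (long)r.from_mbrlen);
}
```

Calling report_first_char on a failing string, after a successful setlocale(LC_CTYPE, "") in main, would show whether the two functions really disagree about the first character.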
 
Dan Pop

Andreas Kahari said:
> Maybe you could provide a trimmed down runnable program that
> exhibits the behaviour that you describe?

Even then, it would be problematic: mbrlen is not a C89 function and
the Solaris libraries do not claim C99 conformance. Posting to a Sun
newsgroup may be a better idea.

Dan
 
Paul King

Dan Pop said:
> Even then, it would be problematic: mbrlen is not a C89 function and
> the Solaris libraries do not claim C99 conformance. Posting to a Sun
> newsgroup may be a better idea.

Thanks. I could probably trim down a program to work, although I
suspect that the locale might be an issue. But I've already tried
every comparison I can think of and it seems to be entirely a
difference between which strings the two routines will accept.

A multibyte (UTF-8, I believe) string that fails on mbrlen but not
mblen is:

"\316\272\316\261\316\273\316\267\316\274\316\255\317\201\316\261\316\272\317\214\317\203\316\274\316\265!"
 
those who know me have no need of my name

in comp.lang.c i read:
> I could probably trim down a program to work, although I
> suspect that the locale might be an issue.

absolutely it is involved, LC_CTYPE affects its behavior, so if the
current setting isn't utf-8 and you feed mb(r)len a pointer to a utf-8
string there's no telling just what will be the result. i agree that
it's very odd that they don't return the same value, but merely odd, not
incorrect (since this would be undefined behavior anything is possible).

> A multibyte (UTF-8, I believe) string that fails on mbrlen but not
> mblen is:

mb(r)len doesn't determine the length of a string, only the number of bytes
involved in a single multi-byte character, so ...

> "\316\272\316\261\316\273\316\267\316\274\316\255\317\201\316\261\316\272\317\214\317\203\316\274\316\265!"

if this is indeed utf-8 then mblen and mbrlen are only determining the
length of the first character, which is "\316\272" (i.e., u+03ba -- greek
small letter kappa), so i would expect a return value of 2 from either so
long as LC_CTYPE is set correctly (i.e., you have first called setlocale
with appropriate arguments).
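
To make the one-character-at-a-time point concrete, here is a hedged sketch of walking a whole string with mbrlen. The function name count_multibyte_chars is made up for illustration, and the result depends entirely on the LC_CTYPE locale in effect when it is called.

```c
#include <string.h>   /* memset, strlen */
#include <wchar.h>    /* mbrlen, mbstate_t */

/* Count the multibyte characters in s under the current LC_CTYPE locale.
   Returns -1 if mbrlen reports an invalid or truncated character. */
long count_multibyte_chars(const char *s)
{
    mbstate_t state;
    size_t remaining = strlen(s);
    long count = 0;

    memset(&state, 0, sizeof state);   /* start in the initial state */
    while (remaining > 0) {
        /* mbrlen looks at ONE character and says how many bytes it uses */
        size_t len = mbrlen(s, remaining, &state);

        if (len == (size_t)-1 || len == (size_t)-2)
            return -1;                 /* encoding error or incomplete */
        if (len == 0)
            break;                     /* null character: stop defensively */
        s += len;                      /* step over that one character */
        remaining -= len;
        count++;
    }
    return count;
}
```

In a UTF-8 locale the string quoted above would be expected to give 14: thirteen two-byte Greek letters followed by '!'. A result of -1 means mbrlen rejected some character under the current locale.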
 
Dan Pop

those who know me have no need of my name said:
> mb(r)len doesn't determine the length of a string, only the number of bytes
> involved in a single multi-byte character, so ...
>
> if this is indeed utf-8 then mblen and mbrlen are only determining the
> length of the first character, which is "\316\272" (i.e., u+03ba -- greek
> small letter kappa), so i would expect a return value of 2 from either so
> long as LC_CTYPE is set correctly (i.e., you have first called setlocale
> with appropriate arguments).

Not necessarily: utf-8 may be the multibyte character encoding used in the
C locale. Only the implementation documentation can tell, but I doubt
that one and the same implementation supports more than one encoding
method for multibyte characters, depending on the locale setting
(although this is allowed by the standard).

In principle, UCS-4 should provide proper support for *any* locale.
That's why it was created in the first place.

Dan
 
Paul King

those who know me have no need of my name said:
> mb(r)len doesn't determine the length of a string, only the number of bytes
> involved in a single multi-byte character, so ...

Yes, I am aware of that - however, since we are dealing with a
variable-length coding system I thought it best to supply the whole string.

> if this is indeed utf-8 then mblen and mbrlen are only determining the
> length of the first character, which is "\316\272" (i.e., u+03ba -- greek
> small letter kappa), so i would expect a return value of 2 from either so
> long as LC_CTYPE is set correctly (i.e., you have first called setlocale
> with appropriate arguments).

The locale should have been set correctly - although it's not easy to
check and the setup relies on the environment rather than calling
setlocale in the program. I am getting the correct result (as you
say, 2) for mblen - and in fact it walks the whole string. mbrlen
gives up on the first character.
 
Dan Pop

Paul King said:
> The locale should have been set correctly - although it's not easy to
> check and the setup relies on the environment rather than calling
> setlocale in the program.

If you don't call setlocale, you're in the C locale. All you can control
from the environment is the "" locale, but this locale still has to be
made the current locale with a setlocale call.
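
A quick way to see this is to ask setlocale which LC_CTYPE locale is currently in effect by passing a null pointer as the locale name, which queries without changing anything. The helper name below is hypothetical, and the sketch assumes that the implementation reports the startup locale under the name "C", which is usual but worth confirming locally.

```c
#include <locale.h>
#include <string.h>

/* Returns 1 if the current LC_CTYPE locale is reported as "C", else 0.
   setlocale(category, NULL) only queries; it does not change the locale. */
int ctype_locale_is_c(void)
{
    const char *name = setlocale(LC_CTYPE, NULL);
    return name != NULL && strcmp(name, "C") == 0;
}
```

Called at the top of main, before any setlocale(LC_CTYPE, "") call, this is expected to return 1 regardless of what LC_CTYPE or LANG say in the environment.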

Dan
 
Paul King

Dan Pop said:
> If you don't call setlocale, you're in the C locale. All you can control
> from the environment is the "" locale, but this locale still has to be
> made the current locale with a setlocale call.

My mistake. I've just checked the code again and there is a
setlocale() call (using the locale from the environment variable
LC_CTYPE). The locale used is an alias which SHOULD be pointing at a
suitable locale but I can't remember how to verify that.
 
Dan Pop

Paul King said:
> My mistake. I've just checked the code again and there is a
> setlocale() call (using the locale from the environment variable
> LC_CTYPE). The locale used is an alias which SHOULD be pointing at a
> suitable locale but I can't remember how to verify that.

For starters, check the return value of setlocale(). If it's a null
pointer, you're still in the "C" locale.
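
A hedged sketch of that check follows. The function name select_ctype_locale is invented for illustration, and nl_langinfo(CODESET) comes from POSIX <langinfo.h> rather than ISO C, so this assumes a POSIX system such as Solaris. Printing the codeset is one way to see which encoding the selected locale (or alias) actually uses.

```c
#include <langinfo.h>  /* nl_langinfo, CODESET: POSIX, not ISO C */
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>    /* MB_CUR_MAX */

/* Try to make `name` the current LC_CTYPE locale ("" means "take it from
   the environment").  Returns the name reported by setlocale, or NULL if
   the request failed, in which case the locale is left unchanged. */
const char *select_ctype_locale(const char *name)
{
    const char *actual = setlocale(LC_CTYPE, name);

    if (actual == NULL) {
        fprintf(stderr, "setlocale(LC_CTYPE, \"%s\") failed\n", name);
        return NULL;
    }
    /* Report what was actually selected and how wide a character can be. */
    printf("locale: %s  codeset: %s  MB_CUR_MAX: %lu\n",
           actual, nl_langinfo(CODESET), (unsigned long)MB_CUR_MAX);
    return actual;
}
```

With select_ctype_locale(""), a UTF-8 locale would typically report a codeset such as UTF-8 and an MB_CUR_MAX greater than 1; a null return means the name from the environment was not accepted.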

Dan
 
