mbstowcs and wcsstr problems

Kelvin Moss

Hi all,

I am trying to search within wide strings (unicode characters) using
wcsstr (on Unix). My problem is that my src or dest strings may or may
not be wide strings. The code I have written seems to fail if I apply
mbstowcs to a wide string. It works correctly if both strings are
non-wide or if I don't apply mbstowcs on a wide string.

So my questions are
1) What's the behavior of applying mbstowcs on a wide string? I was
expecting it would have left it unaffected.
2) Is there an API to find if a given string is a unicode string and
doesn't require mbstowcs?

I haven't worked with wide strings, so please correct me.

Thanks ..
 
Thomas Lumley

Kelvin said:
I am trying to search within wide strings (unicode characters) using
wcsstr (on Unix). My problem is that my src or dest strings may or may
not be wide strings. The code I have written seems to fail if I apply
mbstowcs to a wide string.

Don't Do That. The definition of mbstowcs specifies that the input is
a (possibly multibyte) character string. If you pass it an argument
that is a wide string, or an array of doubles, or a picture of an
orang-utan, it won't be able to cope.

Sometimes it may be able to tell that you have lied to it (because the
argument contains something that isn't a valid multibyte character
sequence) and it will return -1. Otherwise, if your wide character
type includes zero bytes for many wide characters then it is likely to
see one of these and think it is a terminating \0. Or worse things
may happen.

It works correctly if both strings are
non-wide or if I don't apply mbstowcs on a wide string.

Good.
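The -1 return described above is the only signal mbstowcs gives, so it is worth checking every time. A minimal sketch, assuming a caller-supplied buffer; the helper name to_wide is made up for illustration and is not part of any library:

```c
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: convert the multibyte string `src` into `dst`,
 * which has room for `cap` wide characters including the terminator.
 * Returns 0 on success, -1 if `src` is not a valid multibyte sequence
 * in the current locale or if the result does not fit. */
int to_wide(wchar_t *dst, size_t cap, const char *src)
{
    if (cap == 0)
        return -1;

    size_t n = mbstowcs(dst, src, cap);
    if (n == (size_t)-1)
        return -1;      /* invalid multibyte sequence */
    if (n == cap)
        return -1;      /* buffer filled, no terminating L'\0' written */
    return 0;
}
```

Note that a successful return only means the bytes happened to form valid multibyte characters. It does not prove the input really was a multibyte string.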

So my questions are
1) What's the behavior of applying mbstowcs on a wide string? I was
expecting it would have left it unaffected.

See above.

2) Is there an API to find if a given string is a unicode string and
doesn't require mbstowcs?

No.

First note that "unicode string" is not sufficient identification.
Unicode represents any character you are likely to encounter as a
number. In addition you need to specify an "encoding" that says how
those numbers are stored.

Even if you know (as mbstowcs assumes it does) the encoding that you
use, you can't reliably tell from the contents of a piece of memory
whether it contains a multibyte character string or a wide character
string [or an array of doubles, or a picture of an orang-utan]

For example, if your multibyte encoding is UTF-8 and your wchar_t is a
little-endian 32-bit unsigned int then the sequence
0x48 0x49 0x00 0x00
could be a multibyte character string "HI" followed by a terminating
\0, followed coincidentally by another \0, or it could be a
single-character string in Chinese (U+4948). The computer can't tell,
so you have to keep track.
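To make the ambiguity concrete, here is a small sketch that reads those same four bytes both ways. It uses uint32_t as a stand-in for a 32-bit wchar_t, which is an assumption about the platform; the function names are made up for illustration.

```c
#include <stdint.h>
#include <string.h>

/* The four bytes from the example: 'H', 'I', 0, 0 in ASCII. */
static const unsigned char bytes[4] = { 0x48, 0x49, 0x00, 0x00 };

/* Reading 1: a NUL-terminated narrow string. Returns its length. */
size_t as_narrow_length(void)
{
    return strlen((const char *)bytes);
}

/* Reading 2: a single 32-bit code unit in the machine's byte order. */
uint32_t as_wide_unit(void)
{
    uint32_t u;
    memcpy(&u, bytes, sizeof u);   /* avoids alignment and aliasing problems */
    return u;
}
```

On a little-endian machine the second reading gives 0x4948, which is a CJK ideograph. Nothing in the bytes themselves says which reading was intended.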

String manipulation was certainly easier in the old days, at least for
English-speaking people with dollars as their currency unit.

-thomas
 
Stephen Sprunk

Kelvin Moss said:
I am trying to search within wide strings (unicode characters) using
wcsstr (on Unix). My problem is that my src or dest strings may or may
not be wide strings. The code I have written seems to fail if I apply
mbstowcs to a wide string. It works correctly if both strings are
non-wide or if I don't apply mbstowcs on a wide string.

That's what one should expect.

So my questions are
1) What's the behavior of applying mbstowcs on a wide string? I was
expecting it would have left it unaffected.

Passing wchar_t* to a multi-byte function is invalid since those
functions are defined to take char* parameters. It's unlikely they will
work correctly or leave your data unchanged because you're lying to the
function about what type you're giving it. Your compiler should issue
a diagnostic when you do that; are you ignoring it? Or are you using
the wrong types so pervasively that the compiler can't figure it out?

2) Is there an API to find if a given string is a unicode string and
doesn't require mbstowcs?

It's simple deduction on your part. If you have a char*, it's a narrow
string, possibly multi-byte. If you have a wchar_t*, it's a wide
string. By definition.

Note that there's no such thing as a "unicode string". There are
various encodings of characters, some narrow, some multibyte, and some
wide. "Unicode" may mean you have UCS-4, UCS-2, UTF-16, UTF-8, UTF-7,
etc. encoding; you need to think about the encoding your data is in, not
just whether it's "unicode".

I haven't worked with wide strings, so please correct me.

Don't pass wchar_t*'s to mb functions, and don't pass char*'s to wcs
functions.

Keep your strings in the appropriate type, and convert when you need the
other type. That's all there is to it.
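For the original problem (searching a wide string for something that arrived as a char*), "convert when you need the other type" might look like the sketch below. The helper name find_mb_in_wide is hypothetical, and the needle is assumed to be a valid multibyte string in the current locale.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: search the wide string `haystack` for the
 * multibyte string `needle`, converting the needle to wide first.
 * Returns a pointer into `haystack`, or NULL if there is no match,
 * the conversion fails, or memory runs out. */
const wchar_t *find_mb_in_wide(const wchar_t *haystack, const char *needle)
{
    /* A multibyte string never yields more wide characters than it
     * has bytes, so strlen + 1 is always a large enough buffer. */
    size_t cap = strlen(needle) + 1;
    wchar_t *wneedle = malloc(cap * sizeof *wneedle);
    if (wneedle == NULL)
        return NULL;

    const wchar_t *hit = NULL;
    if (mbstowcs(wneedle, needle, cap) != (size_t)-1)
        hit = wcsstr(haystack, wneedle);

    free(wneedle);
    return hit;
}
```

The types do the bookkeeping here: the haystack is already wchar_t* so it goes straight to wcsstr, and only the char* argument goes through mbstowcs.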

S
 
Kelvin Moss

Thomas Lumley wrote:

Even if you know (as mbstowcs assumes it does) the encoding that you
use, you can't reliably tell from the contents of a piece of memory
whether it contains a multibyte character string or a wide character
string [or an array of doubles, or a picture of an orang-utan]

Does mbstowcs assume it knows the encoding of the string?
Or does it try to work out the encoding of the characters on its own?

I think the latter.

Thanks ..
 
Stephen Sprunk

Kelvin Moss said:
Thomas said:
Even if you know (as mbstowcs assumes it does) the encoding that you
use, you can't reliably tell from the contents of a piece of memory
whether it contains a multibyte character string or a wide character
string [or an array of doubles, or a picture of an orang-utan]

Does mbstowcs assume it knows the encoding of the string?
Or does it try to work out the encoding of the characters on its own?

I think the latter.

It takes the encoding from the locale you set (specifically the
LC_CTYPE category); it does not inspect the string to guess.

Unfortunately, getting the locale set correctly is
implementation-dependent, though setlocale(LC_ALL, "") generally works
if the user's environment is set up correctly. If you want to use a
locale other than the user's default, you're on your own to figure out
how to do that.
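A minimal sketch of that setup step, assuming you want the user's default locale. A C program starts in the "C" locale, so without this call mbstowcs will not use the user's encoding. The helper name is made up for illustration.

```c
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: adopt the locale named by the user's
 * environment so that mbstowcs and friends use its multibyte
 * encoding. Returns the name of the LC_CTYPE locale now in effect. */
const char *use_environment_locale(void)
{
    if (setlocale(LC_ALL, "") == NULL) {
        /* The environment names a locale that is not available.
         * The previous locale stays in effect; a real program
         * would probably want to report this. */
    }
    /* A NULL second argument queries the current setting. */
    return setlocale(LC_CTYPE, NULL);
}
```

Call it once near the start of main, before any multibyte conversion. The returned name (for example something ending in "UTF-8") is implementation-defined, so treat it as diagnostic output rather than something to parse.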

S
 
