Multibyte string length

  • Thread starter Zygmunt Krynicki

Zygmunt Krynicki

Hello
I've browsed the FAQ but apparently it lacks any questions concerning wide
character strings. I'd like to calculate the length of a multibyte string
without converting the whole string.

Zygmunt

PS: The whole multibyte string vs wide character string concept is broken
IMHO since it allows wchar_t not to be large enough to contain a full
character (rendering both types virtually the same). What's the point of
standardizing wide characters if the standard makes portable usage of such
a mechanism a programming hell? Feel free to disagree.

PS2: On my implementation wchar_t is 'big enough', so I might overcome the
problem in some other way, but I'd like to see some fully portable approach.
 

Dan Pop

Zygmunt Krynicki said:
I've browsed the FAQ but apparently it lacks any questions concerning wide
character strings. I'd like to calculate the length of a multibyte string
without converting the whole string.

Use the mblen function from the standard C library in a loop, until it
returns 0. The number of mblen calls returning a positive value is the
number of multibyte characters in that string.
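
For illustration, a minimal sketch of that loop (not from the original
post; mbslen is a made-up name, and the code assumes the locale has
already been selected):

#include <stdlib.h>
#include <locale.h>
#include <stdio.h>

/* Count the multibyte characters in s without converting the whole
   string.  Returns (size_t)-1 on an invalid sequence. */
size_t mbslen(const char *s)
{
    size_t count = 0;
    int len;

    mblen(NULL, 0);                 /* reset mblen's internal shift state */
    while ((len = mblen(s, MB_CUR_MAX)) > 0) {
        s += len;                   /* step over one multibyte character */
        count++;
    }
    return len < 0 ? (size_t)-1 : count;
}

int main(void)
{
    setlocale(LC_CTYPE, "");        /* use the environment's locale */
    printf("%lu\n", (unsigned long)mbslen("hello"));
    return 0;
}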
PS: The whole multibyte string vs wide character string concept is broken
IMHO since it allows wchar_t not to be large enough to contain a full
character (rendering both types virtually the same). What's the point of
standardizing wide characters if the standard makes portable usage of such
a mechanism a programming hell? Feel free to disagree.

The bit you're missing is that the standard doesn't impose one character
set or another for wide characters. If the implementor decides to use
ASCII as the character set for wide characters, wchar_t need not be any
wider than char. But wchar_t is supposed to be wide enough for the
character set chosen by the implementor for wide characters.

Dan
 

Sheldon Simms

The bit you're missing is that the standard doesn't impose one character
set or another for wide characters. If the implementor decides to use
ASCII as the character set for wide characters, wchar_t need not be any
wider than char. But wchar_t is supposed to be wide enough for the
character set chosen by the implementor for wide characters.

I don't think he's missing that at all. He's simply pointing out that
the standard makes it pretty much impossible to use wide characters
portably (unless you only use wide characters with values between 0
and 127, of course).

Had the standard mandated, for instance, that wide characters be at
least 32 bits wide, then each wide character would be wide enough for
any character set and it would be possible to write portable code
using wide characters as long as the code had no character set
dependency.

The OP also seems to be griping about certain implementations that use
Unicode as the character set but have a 16-bit wchar_t. Since it is
impossible to represent every Unicode character in 16 bits, wide
character strings become 'multiwchar_t' encodings (UTF-16), which
defeats the whole purpose of wide characters and wide character strings.
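
For illustration, a sketch (not from the original post) of the surrogate
arithmetic that makes UTF-16 variable-width; the example code point
U+1D11E is MUSICAL SYMBOL G CLEF:

#include <stdio.h>

/* Split a code point above U+FFFF into a UTF-16 surrogate pair,
   showing why wide strings become variable-width with a 16-bit
   wchar_t. */
static void to_surrogates(unsigned long cp, unsigned *hi, unsigned *lo)
{
    cp -= 0x10000;                          /* 20 significant bits remain */
    *hi = 0xD800 + (unsigned)(cp >> 10);    /* high surrogate: top 10 bits */
    *lo = 0xDC00 + (unsigned)(cp & 0x3FF);  /* low surrogate: low 10 bits */
}

int main(void)
{
    unsigned hi, lo;
    to_surrogates(0x1D11EUL, &hi, &lo);     /* MUSICAL SYMBOL G CLEF */
    printf("U+1D11E -> 0x%04X 0x%04X\n", hi, lo);   /* 0xD834 0xDD1E */
    return 0;
}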

- Sheldon
 

NumLockOff

Sheldon Simms said:
I don't think he's missing that at all. He's simply pointing out that
the standard makes it pretty much impossible to use wide characters
portably (unless you only use wide characters with values between 0
and 127, of course).

Had the standard mandated, for instance, that wide characters be at
least 32 bits wide, then each wide character would be wide enough for
any character set and it would be possible to write portable code
using wide characters as long as the code had no character set
dependency.

The OP also seems to be griping about certain implementations that use
Unicode as the character set but have a 16-bit wchar_t. Since it is
impossible to represent every Unicode character in 16 bits, wide
character strings become 'multiwchar_t' encodings (UTF-16), which
defeats the whole purpose of wide characters and wide character strings.

- Sheldon
It is just the evolution of the Unicode standard. Surrogates were added at
U+D800 to include more Far Eastern characters. It has now become similar to
an MBCS mess. Could they have originally specified 32-bit characters? Maybe,
but in the early 1990s, 16-bit characters were considered a major waste and
were opposed. UTF-8 was pretty much invented so that older 8-bit character
systems would be able to read vanilla English text without code changes.
With memory and processing power costs plummeting, we now feel that 32 bits
is fine. At this point 32 bits seems to be enough! Who knows what will
happen once we make the "first contact" :)
 

Sheldon Simms

It is just the evolution of the Unicode standard. Surrogates were added at
U+D800 to include more Far Eastern characters. It has now become similar to
an MBCS mess.

Unicode is not the problem. A 16-bit wchar_t is the problem.
 

Dan Pop

Sheldon Simms said:
I don't think he's missing that at all. He's simply pointing out that
the standard makes it pretty much impossible to use wide characters
portably (unless you only use wide characters with values between 0
and 127, of course).

Had the standard mandated, for instance, that wide characters be at
least 32 bits wide, then each wide character would be wide enough for
any character set and it would be possible to write portable code
using wide characters as long as the code had no character set
dependency.

Nope, it wouldn't, as long as the standard doesn't specify a certain
character set for the wide characters. Imagine that you need to output
the character e with an acute accent. How do you do that *portably*, if
you have the additional guarantee that wchar_t is at least 32 bits wide?

Dan
 

Sheldon Simms

Nope, it wouldn't, as long as the standard doesn't specify a certain
character set for the wide characters. Imagine that you need to output
the character e with an acute accent. How do you do that *portably*, if
you have the additional guarantee that wchar_t is at least 32 bits wide?

I never meant to say that sort of thing could be done portably.

I was going on the assumption that the OP's assertion "it allows wchar_t
not to be large enough to contain a full character" was true, and thinking
about two implementations using the same execution character set where
one implementation used a wchar_t that was too small for the character
set.

It seems to me now, however, that an implementation in which wchar_t is
not "large enough to contain a full character" would be non-conforming,
since 7.17.2 states:

wchar_t which is an integer type whose range of values can represent
distinct codes for all members of the largest extended character set
specified among the supported locales;

In any case, my statement was based on the assumption of multiple
implementations using a common (but arbitrary) character set, and that
is an unportable assumption by itself, so I retract my assertion.

-Sheldon
 

Zygmunt Krynicki

Nope, it wouldn't, as long as the standard doesn't specify a certain
character set for the wide characters. Imagine that you need to output
the character e with an acute accent. How do you do that *portably*, if
you have the additional guarantee that wchar_t is at least 32 bits wide?

Dan

To clarify:

Not my problem really, and not a real one either, as any specific program
most probably knows its output encoding. However, imagine I wish to write
portable code for wide character regular expressions. Now the whole purpose
of wide characters is obvious: to be able to address all sorts of
characters and encodings, not just plain ASCII, in a portable way.

Not to name names, but it is common that the INTERNAL encoding used inside
program routines is often different from the EXTERNAL encoding used to
store/transfer text.

Now we know that many external encodings use multibyte sequences for
various reasons which are not important here. We also know how inefficient
or uncomfortable it is to develop algorithms for multibyte-sequence
character strings. It is much easier to assume that any single character
can fit into some data type. Whether it's wchar_t or foo_t is not
important.

Now if wchar_t is not forced to be able to contain a full character, then
again we are stuck with our multibyte (multi-some-unit) character
sequence with all of its inconveniences. This IMHO defeats the whole
purpose of wchar_t.

Of course it is not clear which character encoding is the best one (or
rather, since there is no perfect encoding, which one should be made the
standard). Unicode seems to help a lot, providing UTF-8 as the external
and 32-bit Unicode as the internal encoding. This has all sorts of
benefits and non-benefits that are not important here.

Also, hardware doesn't need to have 32-bit-wide data types, so it would
be problematic to create conforming implementations.

BTW: Thank you all for participating in this discussion :)

Regards
Zygmunt Krynicki
 

those who know me have no need of my name

in comp.lang.c i read:
Now if wchar_t is not forced to be able to contain a full character, then
again we are stuck with our multibyte (multi-some-unit) character
sequence with all of its inconveniences. This IMHO defeats the whole
purpose of wchar_t.

wchar_t is required to have a range that can handle all the code points
which can arise from the use of any locale supported by the implementation.
c99 takes this further: the implementation can indicate to the programmer
if iso-10646 is directly supported (though the encoding is *not* required
to be ucs-4), and it adds the \U and \u escapes so that iso-10646
code points can be used directly.
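
for illustration, a small c99 sketch (not from the original post) that
probes the macro and uses a \u escape:

#include <wchar.h>
#include <stdio.h>

int main(void)
{
#ifdef __STDC_ISO_10646__
    /* wchar_t values are iso/iec 10646 code points, current as of the
       year and month encoded in the macro (e.g. 199712L) */
    wchar_t eacute = L'\u00E9';   /* c99 universal character name */
    wprintf(L"U+%04lX, as of %ld\n",
            (unsigned long)eacute, (long)__STDC_ISO_10646__);
#else
    wprintf(L"wchar_t character set is implementation-defined\n");
#endif
    return 0;
}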
Also hardware doesn't need to have 32 bit wide data types so it
would be problematic to create conforming implementations

hardware may not necessarily have a 32 bit wide integer type, but the
standard mandates that long be at least 32 value bits wide (sign + 31 for
signed long). so, there *is* always a 32 bit type available.
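
and a tiny sketch of that guarantee (ucs4_char is a made-up name, used
purely for illustration):

#include <stdio.h>

/* unsigned long has at least 32 value bits, so any iso 10646 code
   point (at most U+10FFFF, 21 bits) fits even on hardware with no
   native 32-bit type. */
typedef unsigned long ucs4_char;

int main(void)
{
    ucs4_char cp = 0x10FFFFUL;    /* largest iso 10646 code point */
    printf("U+%06lX fits in unsigned long\n", cp);
    return 0;
}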
 

Sheldon Simms

in comp.lang.c i read:


wchar_t is required to have a range that can handle all the code points
which can arise from the use of any locale supported by the implementation.
c99 takes this further: the implementation can indicate to the programmer
if iso-10646 is directly supported (though the encoding is *not* required
to be ucs-4)

I guess you're saying the encoding is not required to be ucs-4 because
the standard doesn't explicitly say so:

6.10.8.2
...
__STDC_ISO_10646__ An integer constant of the form yyyymmL (for
example, 199712L), intended to indicate that values of type wchar_t
are the coded representations of the characters defined by ISO/IEC
10646, along with all amendments and technical corrigenda as of the
specified year and month.

But if the encoding is not ucs-4, then what could it possibly be?
7.17.2 says

wchar_t which is an integer type whose range of values can represent
distinct codes for all members of the largest extended character set
specified among the supported locales;

As I read this, it means that an implementation implementing ISO 10646
must have a wchar_t capable of representing over 1 million distinct
values. Given this requirement, ucs-4 seems to be the only reasonable
encoding to use for ISO 10646 wide character strings.

Would an implementation that used utf-8 encoding in wide character
strings composed of 32-bit wchar_t be conforming?

-Sheldon
 

Micah Cowan

Sheldon Simms said:
I guess you're saying the encoding is not required to be ucs-4 because
the standard doesn't explicitly say so:

6.10.8.2
...
__STDC_ISO_10646__ An integer constant of the form yyyymmL (for
example, 199712L), intended to indicate that values of type wchar_t
are the coded representations of the characters defined by ISO/IEC
10646, along with all amendments and technical corrigenda as of the
specified year and month.

But if the encoding is not ucs-4, then what could it possibly be?
7.17.2 says

wchar_t which is an integer type whose range of values can represent
distinct codes for all members of the largest extended character set
specified among the supported locales;

As I read this, it means that an implementation implementing ISO 10646
must have a wchar_t capable of representing over 1 million distinct
values. Given this requirement, ucs-4 seems to be the only reasonable
encoding to use for ISO 10646 wide character strings.

No; the ISO 10646 and Unicode standards are 16-bit
encodings. Some 16-bit codes work together (high/low surrogates)
to produce the effect of a "single" character from two encoded
characters; however, that does not change the fact that the
standards themselves claim to present 16-bit encodings (Actually,
for ISO 10646 I'm making some assumptions, as I've not read it;
only Unicode). Not only this, but while support is in place for
character codes 0x10000 and above, no character codes have
actually been defined for these values, and so UCS-2/UTF-16 can
safely be used to encode "all members of the largest extended
character set".
Would an implementation that used utf-8 encoding in wide character
strings composed of 32-bit wchar_t be conforming?

I don't think so, no.

-Micah
 

Sheldon Simms

No; the ISO 10646 and Unicode standards are 16-bit
encodings.

Unicode 4.0 p.1:
Unicode provides for three encoding forms: a 32-bit form (UTF-32),
a 16-bit form (UTF-16), and an 8-bit form (UTF-8).
Some 16-bit codes work together (high/low surrogates)
to produce the effect of a "single" character from two encoded
characters; however, that does not change the fact that the
standards themselves claim to present 16-bit encodings.

Unicode 4.0 p.1:
The Unicode Standard specifies a numeric value (code point) and a
name for each of its characters.
...
The Unicode Standard provides 1,114,112 code points,

Unicode 4.0 p.28:
UTF-32 is the simplest Unicode encoding form. Each Unicode code
point is represented directly by a single 32-bit code unit.
Because of this, UTF-32 has a one-to-one relationship between
encoded character and code unit;
...
In the UTF-16 encoding form, ... code points in the supplementary
planes, in the range U+10000..U+10FFFF, are instead represented
as pairs of 16-bit code units.
...
The distinction between characters represented with one versus
two 16-bit code units means that formally UTF-16 is a variable-
width encoding form.
Not only this, but while support is in place for
character codes 0x10000 and above, no character codes have
actually been defined for these values, and so UCS-2/UTF-16 can
safely be used to encode "all members of the largest extended
character set".

Unicode 4.0 p.1:
The Unicode Standard, Version 4.0, contains 96,382 characters
from the world's scripts.
...
The unified Han subset contains 70,207 ideographic characters

Examples of characters at code points greater than or equal to
0x10000 are "Musical Symbols", "Mathematical Alphanumeric Symbols",
and "CJK Unified Ideographs Extension B"

http://www.unicode.org/charts/

My conclusion is that 16-bit values can NOT in fact encode "all
members of the largest extended character set", if that character
set is Unicode. This means that a 16-bit wchar_t is NOT conforming
on implementations that claim to implement Unicode, and that
the only acceptable encoding for wide character strings in such
an implementation is UCS-4.

-Sheldon
 

Dan Pop

Sheldon Simms said:
I guess you're saying the encoding is not required to be ucs-4 because
the standard doesn't explicitly say so:

6.10.8.2
...
__STDC_ISO_10646__ An integer constant of the form yyyymmL (for
example, 199712L), intended to indicate that values of type wchar_t
are the coded representations of the characters defined by ISO/IEC
10646, along with all amendments and technical corrigenda *as of the
specified year and month*.
But if the encoding is not ucs-4, then what could it possibly be?
7.17.2 says

wchar_t which is an integer type whose range of values can represent
distinct codes for all members of the largest extended character set
specified among the supported locales;

Again, what part of the standard precludes ASCII, EBCDIC or ISO 8859-1
as being "the largest extended character set specified among the
supported locales" and, therefore, having wchar_t defined as char?
As I read this, it means that an implementation implementing ISO 10646
must have a wchar_t capable of representing over 1 million distinct
values.

It depends on the actual value of __STDC_ISO_10646__, which could
point to an earlier version of ISO 10646, or not be defined at all,
as in my ASCII example above.
Given this requirement, ucs-4 seems to be the only reasonable
encoding to use for ISO 10646 wide character strings.

If the implementation chooses to support a recent enough version of
ISO 10646. Which the standard allows but doesn't require. The first
incarnation of ISO 10646 only specified 34203 characters, so a 16-bit
wchar_t would be enough for an implementation defining __STDC_ISO_10646__.
Would an implementation that used utf-8 encoding in wide character
strings composed of 32-bit wchar_t be conforming?

No way. utf-8 encodings need not fit in a 32-bit wchar_t (they take one
to six octets). They are clearly intended to be used in multibyte
character strings, which are composed of plain char's (e.g. printf's
format string).
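
For illustration, a sketch (not from the original posts) of how the lead
octet alone announces the length of a UTF-8 character inside a multibyte
string, in the original one-to-six-octet scheme:

#include <stdio.h>

/* Octet count of a UTF-8 character, read from its lead byte alone;
   returns -1 if b is not a lead byte. */
static int utf8_len(unsigned char b)
{
    if (b < 0x80) return 1;     /* 0xxxxxxx */
    if (b < 0xC0) return -1;    /* 10xxxxxx: continuation octet */
    if (b < 0xE0) return 2;     /* 110xxxxx */
    if (b < 0xF0) return 3;     /* 1110xxxx */
    if (b < 0xF8) return 4;     /* 11110xxx */
    if (b < 0xFC) return 5;     /* 111110xx */
    if (b < 0xFE) return 6;     /* 1111110x */
    return -1;                  /* 0xFE and 0xFF never appear in UTF-8 */
}

int main(void)
{
    printf("%d %d %d\n", utf8_len(0x41), utf8_len(0xC3), utf8_len(0xE2));
    return 0;                   /* prints: 1 2 3 */
}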

Dan
 

Sheldon Simms

Again, what part of the standard precludes ASCII, EBCDIC or ISO 8859-1
as being "the largest extended character set specified among the
supported locales" and, therefore, having wchar_t defined as char?

Nothing. However, I was only talking about cases where "the largest
extended character set" is Unicode.
It depends on the actual value of __STDC_ISO_10646__, which could
point to an earlier version of ISO 10646

All right. It might suck to know that your preferred implementation
is not capable of keeping up with ISO 10646 since it's stuck with a
16-bit wchar_t, but I guess that's a problem for the implementors and
users of such an implementation, and off topic here.
If the implementation chooses to support a recent enough version of
ISO 10646. Which the standard allows but doesn't require.

That's what I thought.
No way. utf-8 encodings need not fit in a 32-bit wchar_t (they take one
to six octets). They are clearly intended to be used in multibyte
character strings, which are composed of plain char's (e.g. printf's
format string).

My intention was to express that each of the 32-bit wide characters
contains the value of one octet of the UTF-8 encoding. I didn't
think that would be conforming.
 

Dan Pop

Sheldon Simms said:
Nothing. However, I was only talking about cases where "the largest
extended character set" is Unicode.


All right. It might suck to know that your preferred implementation
is not capable of keeping up with ISO 10646 since it's stuck with a
16-bit wchar_t, but I guess that's a problem for the implementors and
users of such an implementation, and off topic here.

Once you're talking about cases where "the largest extended character
set" is Unicode *only*, you're off-topic here, anyway.

However, I can see no reason why a certain implementation would be stuck
with a 16-bit wchar_t once its intended market is asking for more. For
the time being, there is little market pressure for a wider wchar_t, as
the 16-bit codes cover practically all locales of interest.

Widening wchar_t to 32 bits is not a no-cost decision: think about
programs manipulating huge amounts of wchar_t data.
My intention was to express that each of the 32-bit wide characters
contains the value of one octet of the UTF-8 encoding. I didn't
think that would be conforming.

Of course it wouldn't: wchar_t objects are supposed to contain character
values, not *encoded* character values. Encoded character values can be
stored in multibyte character strings only.
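
For illustration, a sketch of that division of labor (not from the
original post; the string literal assumes a UTF-8 locale):

#include <stdlib.h>
#include <locale.h>
#include <string.h>
#include <stdio.h>

int main(void)
{
    setlocale(LC_CTYPE, "");    /* pick up the user's locale */

    /* "cafe" with e-acute: five bytes of *encoded* characters in a
       multibyte string (assuming a UTF-8 locale)... */
    const char *mb = "caf\xC3\xA9";
    wchar_t wc[16];

    /* ...decoded by mbstowcs into four plain character values */
    size_t n = mbstowcs(wc, mb, 16);
    if (n != (size_t)-1)
        printf("%lu wide characters from %lu bytes\n",
               (unsigned long)n, (unsigned long)strlen(mb));
    return 0;
}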

Dan
 

Micah Cowan

Sheldon Simms said:
Unicode 4.0 p.1:
Unicode provides for three encoding forms: a 32-bit form (UTF-32),
a 16-bit form (UTF-16), and an 8-bit form (UTF-8).

I didn't mean quite what I wrote: What I meant was "Unicode
character codes have a width of 16 bits". This was true
regardless of the number of encodings available (Unicode 3.0 plus
addenda had UTF-32), yet sect. 2.2 still said "Unicode character
codes have a width of 16 bits". This appears to have been removed
from Unicode 4.0.
Unicode 4.0 p.1:
The Unicode Standard specifies a numeric value (code point) and a
name for each of its characters.
...
The Unicode Standard provides 1,114,112 code points,

Hm. The same area in Unicode 3.0 said "Using a 16-bit encoding
means that code values are available for more than 65,000
characters." They clearly supported more than that; sloppy
wording on their part.
Unicode 4.0 p.28:
UTF-32 is the simplest Unicode encoding form. Each Unicode code
point is represented directly by a single 32-bit code unit.
Because of this, UTF-32 has a one-to-one relationship between
encoded character and code unit;
...
In the UTF-16 encoding form, ... code points in the supplementary
planes, in the range U+10000..U+10FFFF, are instead represented
as pairs of 16-bit code units.
...
The distinction between characters represented with one versus
two 16-bit code units means that formally UTF-16 is a variable-
width encoding form.

Okay. Here's the chief difference then. In Unicode 3.0, UTF-16
was formally considered the one-to-one representation (which was
kind of sticky when you deal with surrogates; having to pretend
that they're really two separate characters...).
My conclusion is that 16-bit values can NOT in fact encode "all
members of the largest extended character set", if that character
set is Unicode. This means that a 16-bit wchar_t is NOT conforming
on implementations that claim to implement Unicode, and that
the only acceptable encoding for wide character strings in such
an implementation is UCS-4.

Alright, then: but it *is* conforming provided that they claim to
conform to a Unicode standard preceding 4.0 whose entire character
set could be represented in 16 bits.

I hadn't gotten around to reading 4.0 yet; I'm pleased to see
that they've eschewed all the "pay no attention to the man behind
the curtain; Unicode *is* a 16-bit character set" stuff that seemed
to be present in 3.0. Perhaps they had already remedied some of
this in their addenda: I didn't read many of those except some of
the new character codespaces.

-Micah
 

Sheldon Simms

Of course it wouldn't: wchar_t objects are supposed to contain character
values, not *encoded* character values. Encoded character values can be
stored in multibyte character strings only.

This gets back to the problem the original poster had. He seemed to
be confronted with an implementation that used a 16-bit wchar_t and
encoded wide character strings (including characters outside of
Unicode's Basic Multilingual Plane) in UTF-16, a variable-length
encoding.

I expressed the view that such an implementation would be non-conforming.
 

Dingo

Again, what part of the standard precludes ASCII, EBCDIC or ISO 8859-1
as being "the largest extended character set specified among the
supported locales" and, therefore, having wchar_t defined as char?


It depends on the actual value of __STDC_ISO_10646__, which could
point to an earlier version of ISO 10646, or not be defined at all,
as in my ASCII example above.

The way I read it, __STDC_ISO_10646__ doesn't indicate the Unicode
version that defines the extended character set. It just states
the version where wchar_t encodings may be found.

A seven-bit ASCII implementation with wchar_t defined as char could
define the most recent value for __STDC_ISO_10646__ and be conforming.
ASCII encodings map directly to the most recent version of ISO 10646.
And a char is wide enough to hold "the largest extended character set
among the supported locales."
 

Dan Pop

Dingo said:
The way I read it, __STDC_ISO_10646__ doesn't indicate the Unicode
version that defines the extended character set. It just states
the version where wchar_t encodings may be found.

A seven-bit ASCII implementation with wchar_t defined as char could
define the most recent value for __STDC_ISO_10646__ and be conforming.
ASCII encodings map directly to the most recent version of ISO 10646.
And a char is wide enough to hold "the largest extended character set
among the supported locales."

As I read it, it is the whole ISO/IEC 10646 specification that must be
supported by wchar_t, once this macro is defined. The words "along
with all amendments and technical corrigenda as of the specified year
and month" clearly suggest this interpretation to me. Of course, only
comp.std.c can say which interpretation is the intended one.

Dan
 

Dan Pop

Sheldon Simms said:
This gets back to the problem the original poster had. He seemed to
be confronted with an implementation that used a 16-bit wchar_t and
encoded wide character strings (including characters outside of
Unicode's Basic Multilingual Plane) in UTF-16, a variable-length
encoding.

Couldn't find anything suggesting this in OP's post:

From: "Zygmunt Krynicki" <zyga@_CUT_2zyga.MEdyndns._OUT_org>
Organization: Customers chello Poland
Date: Thu, 09 Oct 2003 12:54:00 GMT
Subject: Multibyte string length

Hello
I've browsed the FAQ but apparently it lacks any questions concerning wide
character strings. I'd like to calculate the length of a multibyte string
without converting the whole string.

Zygmunt

PS: The whole multibyte string vs wide character string concept is broken
IMHO since it allows wchar_t not to be large enough to contain a full
character (rendering both types virtually the same). What's the point of
standardizing wide characters if the standard makes portable usage of such
a mechanism a programming hell? Feel free to disagree.

PS2: On my implementation wchar_t is 'big enough', so I might overcome the
problem in some other way, but I'd like to see some fully portable approach.

He seemed to be worried about wchar_t not being wide enough for its
intended purpose, but the C standard makes it quite clear that this cannot
be the case, by definition, for the simple reason that it is the
implementor who decides what the extended character set actually is.

Dan
 
