Unicode character length (NOT size)


jonathanztaub

I'm not a C programmer but have to fix some legacy application written
in C (deployed on Linux).
It seems that the "strlen" function returns the number of bytes for a
character array.
I need to get the number of characters. For example, some Japanese
characters take 3 bytes each in UTF-8. If I had a character array
containing two of these characters, I would like to get 2 (length) as
opposed to 6 (byte size). I cannot make extensive code modifications
to this application and would like to make all necessary changes
within the scope of this function:

static int verifyChar(char *dst, char *src, int n)
{
    /* check length */
    if (strlen(src) > 100)
        return -1;

    return fieldCopy(dst, src, n);
}

I tried playing around with wcslen (wchar.h) and other conversion
methods I found on the internet like the following:

wchar_t wcs[10];
char mbs[10] = "\u00A9"; /* copyright character, though it is a single byte */
char *ptr = mbs;         /* pointer to the mbs string */
int length;

/* Determine the length of the multibyte string pointed to by */
/* mbs. Store the multibyte characters in the wchar_t array   */
/* pointed to by wcs.                                         */
length = mbsrtowcs(wcs, (const char **)&ptr, SIZE, NULL);

The length comes back as -1. However, if I change the mbs value to
"abcd", it comes back as four.


Please keep in mind that I'm not C savvy.
Thanks.
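For what it's worth, the snippet above does start returning a sensible count once a UTF-8 locale has been selected, which is what the replies below get at. A minimal sketch, assuming gcc's default UTF-8 execution character set and that an en_US.UTF-8 locale is installed (both are assumptions, not something stated in the original post):

#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    wchar_t wcs[10];
    char mbs[10] = "\u00A9";   /* copyright sign: two bytes (C2 A9) in UTF-8 */
    const char *ptr = mbs;
    size_t length;

    /* Without this call the default "C" locale is in effect and the
       conversion below fails with (size_t)-1. */
    if (setlocale(LC_ALL, "en_US.UTF-8") == NULL)
        return 1;

    length = mbsrtowcs(wcs, &ptr, sizeof wcs / sizeof wcs[0], NULL);
    printf("%zu\n", length);   /* prints 1: one character, two bytes */
    return 0;
}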
 

jonathanztaub

Actually, if you are certain it is legal UTF-8, all you need to do is
count the number of bytes for which (c & 0xC0) != 0x80, since all UTF-8
characters start with such a byte, and they do not have such a byte in
any other position.

-- Richard

Mmmmmmm.... not sure about "since all UTF-8 characters start with such
a byte". An ASCII-encoded 'A' has exactly the same value as a UTF-8-encoded
'A'. If a file is saved in UTF-8 without a BOM and it only contains the
letter 'A', you would not be able to tell the difference.
 

Ben Bacarisse

Mmmmm..... let me try to be a little more specific.
I have a database table column in MySQL which can store up to 100
characters. The encoding used in the database is UTF-8. The column can
store 100 characters regardless of whether they are single-byte or
multibyte. For example, it can store 100 Japanese characters, which
take up 300 bytes.
The *src points to a character array which may have multibyte
characters in it. Apparently, they are already encoded in UTF-8
(strlen returns the correct byte size). However, I need to check that
the character array (string) does not exceed 100 characters and NOT
that its size is greater than 100 bytes.

static int verifyChar(char *dst, char *src, int n)
{
    if (strlen(src) > 100)
        return -1;

    return fieldCopy(dst, src, n);
}

You can use C's mbxxx functions to do this provided that you have told the
C runtime that you are using UTF-8. You do this with an appropriate
call to setlocale. I am being vague about this because the C standard
does not say anything about these details and you don't say what
system you are using. On mine, I'd write

setlocale(LC_ALL, "en_GB.UTF-8");

since you don't want the more usual behaviour of picking up the
setting from the environment.
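To make that concrete, here is one way the mbxxx route can look once the locale is set; this is a sketch rather than anything from the original application, and the helper name is invented:

#include <string.h>
#include <wchar.h>

/* Returns the number of characters (not bytes) in a multibyte string, or
   (size_t)-1 if the string is not valid in the current locale's encoding.
   A UTF-8 locale must already have been selected with setlocale(). */
static size_t mb_count(const char *s)
{
    mbstate_t state;
    memset(&state, 0, sizeof state);   /* initial conversion state */

    /* With a null destination, mbsrtowcs converts (and so counts) the
       whole string without storing anything, so no buffer is needed. */
    return mbsrtowcs(NULL, &s, 0, &state);
}

The check in the original function would then read if (mb_count(src) > 100) return -1; instead of the strlen test.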

However... If all you need to do is count UTF-8 sequences it is a few
lines of C. This cropped up a few days ago. You can do it in a loop
like this:

size_t mb_strlen(const unsigned char *s)
{
     size_t l = 0;
     while (*s != 0) {
          l += *s < 128 || *s >= 0xC0;
          s++;
     }
     return l;
}
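A quick sanity check of that loop, assuming it sits in the same file and that the bytes really are UTF-8 (the two hex sequences below are the UTF-8 encodings of two Japanese kana, three bytes each):

#include <stdio.h>

int main(void)
{
    /* U+3042 and U+3044: three bytes each in UTF-8, six bytes in total */
    const unsigned char two_kana[] = "\xE3\x81\x82\xE3\x81\x84";

    printf("%zu\n", mb_strlen(two_kana));   /* prints 2, not 6 */
    return 0;
}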

Neater (more tricky) code was also posted but this makes the test
explicit. Note that invalid sequences will not be detected and that
might be very important for your DB. Obviously that, too, can be coded
in a few lines, but mblen, mbtowc, etc. probably are already debugged!
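For what it's worth, a sketch of that "few lines" variant using mblen() so that invalid sequences are reported rather than silently counted; it assumes a UTF-8 locale has already been selected with setlocale as above, and the function name is made up for illustration:

#include <stdlib.h>

/* Counts characters and reports invalid input.  Returns -1 if the string
   contains a sequence that is not valid in the current locale's encoding. */
static long mb_strlen_checked(const char *s)
{
    long count = 0;
    int n;

    mblen(NULL, 0);              /* reset any internal conversion state */
    while (*s != '\0') {
        n = mblen(s, MB_CUR_MAX);
        if (n <= 0)
            return -1;           /* invalid or truncated sequence */
        s += n;
        count++;
    }
    return count;
}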

Open source UTF-8 libraries are also available, but you'd want to be
doing a bit more than counting and testing for validity to make it
worthwhile using one.
 

Stephen Sprunk

Mmmmm..... let me try to be a little more specific.
I have a database table column in MySQL which can store up to 100
characters. The encoding used in the database is UTF-8. The column can
store 100 characters regardless of whether they are single-byte or
multibyte. For example, it can store 100 Japanese characters, which
take up 300 bytes.

My money says that the database is actually storing characters in UTF-16
(or UTF-32) internally and it translates them to/from UTF-8 externally.
The limitations don't make sense otherwise. Not that it matters...
The *src points to a character array which may have multibyte
characters in it. Apparently, they are already encoded in UTF-8
(strlen returns the correct byte size). However, I need to check that
the character array (string) does not exceed 100 characters and NOT
that its size is greater than 100 bytes.

static int verifyChar(char *dst, char *src, int n)
{
    if (strlen(src) > 100)
        return -1;

    return fieldCopy(dst, src, n);
}

If you _know_ the string is in UTF-8, the simplest way to count
characters is to write your own utf8len() function. Several examples
were posted here recently in another thread, including one by me.
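One possible shape for such a function, based on the (c & 0xC0) != 0x80 test mentioned earlier in the thread; treat it as a sketch, not necessarily the version posted in that other thread:

#include <stddef.h>

/* Counts every byte that is NOT a UTF-8 continuation byte (continuation
   bytes have the form 10xxxxxx).  Assumes the input is already valid UTF-8. */
static size_t utf8len(const char *s)
{
    size_t count = 0;

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}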

S
 

Stephen Sprunk

Mmmmmmm.... not sure about "since all UTF-8 characters start with such
a byte". An ASCII-encoded 'A' has exactly the same value as a UTF-8-encoded
'A'. If a file is saved in UTF-8 without a BOM and it only contains the
letter 'A', you would not be able to tell the difference.

You do not need to be able to tell the difference because UTF-8 was
specifically designed so that characters 0x00 to 0x7F would be encoded
identically to US-ASCII, i.e. for that range there is no difference!
The difference appears with characters 0x80 and higher, which require
multiple bytes.

The above test, (c & 0xC0) != 0x80, is a bit tricky, but it takes advantage
of how UTF-8 is encoded by ignoring "trailing" (continuation) bytes. The
alternate test, c <= 0x7F || c >= 0xC0, counts single or leading bytes. All
characters are encoded either as a single byte for US-ASCII or as one
leading byte plus one or more trailing bytes for non-US-ASCII, so the two
tests are functionally equivalent.
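A small demonstration of that equivalence; the byte values in the comment are the UTF-8 encodings of the characters named, and the whole program is illustrative only:

#include <assert.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* "A" is a single byte, the copyright sign is C2 A9, and U+3042 is
       E3 81 82, so the string holds 3 characters in 6 bytes. */
    const unsigned char s[] = "A\xC2\xA9\xE3\x81\x82";
    size_t i, lead_or_single = 0, not_trailing = 0;

    for (i = 0; i < strlen((const char *)s); i++) {
        lead_or_single += s[i] <= 0x7F || s[i] >= 0xC0;
        not_trailing   += (s[i] & 0xC0) != 0x80;
    }
    assert(lead_or_single == not_trailing);       /* both count the same */
    printf("%zu characters\n", lead_or_single);   /* prints "3 characters" */
    return 0;
}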

S
 

jonathanztaub

You can use C's mbxxx functions to do this provided that you have told the
C runtime that you are using UTF-8.  You do this with an appropriate
call to setlocale.  I am being vague about this because the C standard
does not say anything about these details and you don't say what
system you are using.  On mine, I'd write

  setlocale(LC_ALL, "en_GB.UTF-8");

since you don't want the more usual behaviour of picking up the
setting from the environment.

However...  If all you need to do is count UTF-8 sequences it is a few
lines of C.  This cropped up a few days ago.  You can do it in a loop
like this:

  size_t mb_strlen(const unsigned char *s)
  {
       size_t l = 0;
       while (*s != 0) {
            l += *s < 128 || *s >= 0xC0;
            s++;
       }
       return l;
  }

Neater (more tricky) code was also posted but this makes the test
explicit.  Note that invalid sequences will not be detected and that
might be very important for your DB.  Obviously that, too, can be coded
in a few lines, but mblen, mbtowc, etc. probably are already debugged!

Open source UTF-8 libraries are also available, but you'd want to be
doing a bit more than counting and testing for validity to make it
worthwhile using one.

Thanks. This is what I needed and it seems to work. I also tried setting
the locale as another post suggested, and it worked for my initial
"method", though it had some side effects. Anyway, this is definitely
much easier and seems to be working. I was also finally able to find a
similar solution elsewhere on the internet.
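Putting the pieces together, a minimal sketch of how the function from the original question might look with the byte check swapped for a character count; mb_strlen is the loop from Ben's reply, and fieldCopy is the existing application function, assumed unchanged:

static int verifyChar(char *dst, char *src, int n)
{
    /* check length in characters, not bytes */
    if (mb_strlen((const unsigned char *)src) > 100)
        return -1;

    return fieldCopy(dst, src, n);
}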
 
