Count chars (not bytes) in UTF8 strings

David RF

Hi friends, is this function correct? (count chars in UTF8 strings)

size_t strchars(const char *s)
{
    size_t i = 0;

    if (!s) return 0;
    while (*s) {
        if ((*s & 0xc0) != 0x80) i++;
        s++;
    }
    return i;
}
 
Ben Bacarisse

David RF said:
Hi friends, is this function correct? (count chars in UTF8 strings)

size_t strchars(const char *s)
{
    size_t i = 0;

    if (!s) return 0;
    while (*s) {
        if ((*s & 0xc0) != 0x80) i++;
        s++;
    }
    return i;
}

There are two small problems. One is that the name is in the
implementation space. It would be better not to call it strxxx where
xxx starts with a lowercase letter. A related point is that the name
is not ideal -- it is really too general.

The second is a little bit more important depending on the use you
plan to make of it. You don't count valid UTF-8 characters, you
simply count UTF-8 "start" bytes (and lone single-byte encodings).
This may not matter, but you should comment the fact at the very
least. The standard mbstowcs function reports an error if the string
contains (in the part it examines) an invalid encoding.
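
To make the "start bytes" point concrete, here is a minimal sketch. The name count_start_bytes is hypothetical; the logic is the same as the posted function, with a cast to unsigned char added for clarity. The comment lists a few malformed inputs for which it still returns a count:

```c
#include <stddef.h>

/* Hypothetical name; same counting rule as the posted strchars():
 * every byte that is not a continuation byte (10xxxxxx) is counted. */
size_t count_start_bytes(const char *s)
{
    size_t n = 0;

    if (!s) return 0;
    for (; *s; s++) {
        if (((unsigned char)*s & 0xc0) != 0x80) n++;
    }
    return n;
}

/* Malformed inputs that still produce a number:
 *   count_start_bytes("\x80\x80")  returns 0  (two stray continuation bytes)
 *   count_start_bytes("\xc3")      returns 1  (lead byte, continuation missing)
 *   count_start_bytes("\xff\xfe")  returns 2  (bytes that never occur in UTF-8)
 */
```

None of these three strings contains a single valid UTF-8 character, yet each call returns without any indication of a problem.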
 
david

Ben Bacarisse said:
There are two small problems.  One is that the name is in the
implementation space.  It would be better not to call it strxxx where
xxx starts with a lowercase letter.  A related point is that the name
is not ideal -- it is really too general.

The second is a little bit more important depending on the use you
plan to make of it.  You don't count valid UTF-8 characters, you
simply count UTF-8 "start" bytes (and lone single-byte encodings).
This may not matter, but you should comment the fact at the very
least.  The standard mbstowcs function reports an error if the string
contains (in the part it examines) an invalid encoding.

Thanks, strChars() seems a better name ;)
About the second point, don't know if that's the correct way:

#include <stdio.h>
#include <string.h>

/* Counts UTF-8 start bytes; the result is a character count only if
   s is valid UTF-8. */
size_t strChars(const char *s)
{
    size_t i = 0;

    if (!s) return 0;
    while (*s) {
        if ((*s & 0xc0) != 0x80) i++;
        s++;
    }
    return i;
}

int main(void)
{
    int i, len;

    printf("\nWithout strchars:\n");
    printf("%s\n", "Cañaveral");
    len = (int)strlen("Cañaveral");
    for (i = 0; i < len; i++) putchar('-');
    printf("\n");
    /*
    output:
    Without strchars:
    Cañaveral
    ----------
             ^ One more char
    */

    printf("\nWith strchars:\n");
    printf("%s\n", "Cañaveral");
    len = (int)strChars("Cañaveral");
    for (i = 0; i < len; i++) putchar('-');
    printf("\n");

    /*
    output:
    With strchars:
    Cañaveral
    ---------
    */

    return 0;
}
 
Ben Bacarisse

Best not to quote sigs.
david said:
Thanks, strChars() seems a better name ;)
About the second point, don't know if that's the correct way:

The count is correct when the string is valid UTF-8. The point is that
your function returns an answer when there is none. There are strings
that contain sequences of byte values that do not correspond to UTF-8
encoded characters, and your function will return a number when
presented with these. That is not an error by itself, but it is
considered good practice to count only valid character encodings.
Anyway, as I said, it simply may not matter for your application.
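
One way to count only valid encodings is to decode each sequence and reject anything malformed. What follows is a rough sketch, not a definitive implementation: the name utf8_count_valid is hypothetical, and returning (size_t)-1 on error is an assumption borrowed from the mbstowcs convention.

```c
#include <stddef.h>

/* Hypothetical helper: returns the number of well-formed UTF-8 characters
 * in str, or (size_t)-1 if a malformed sequence is found.
 * A null pointer yields 0, as in the posted function. */
size_t utf8_count_valid(const char *str)
{
    const unsigned char *s = (const unsigned char *)str;
    size_t count = 0;

    if (!s) return 0;
    while (*s) {
        unsigned long cp;   /* code point being decoded */
        unsigned long min;  /* smallest code point allowed at this length */
        int extra;          /* continuation bytes expected */

        if (*s < 0x80)                { cp = *s;        extra = 0; min = 0; }
        else if ((*s & 0xe0) == 0xc0) { cp = *s & 0x1f; extra = 1; min = 0x80; }
        else if ((*s & 0xf0) == 0xe0) { cp = *s & 0x0f; extra = 2; min = 0x800; }
        else if ((*s & 0xf8) == 0xf0) { cp = *s & 0x07; extra = 3; min = 0x10000; }
        else return (size_t)-1;  /* stray continuation or invalid lead byte */
        s++;

        while (extra-- > 0) {
            /* Also catches the terminating '\0' in a truncated sequence. */
            if ((*s & 0xc0) != 0x80) return (size_t)-1;
            cp = (cp << 6) | (*s & 0x3f);
            s++;
        }

        /* Reject overlong forms, UTF-16 surrogates, and values past U+10FFFF. */
        if (cp < min || cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff))
            return (size_t)-1;
        count++;
    }
    return count;
}
```

With this sketch, a truncated sequence such as "\xc3" or an overlong form such as "\xc0\x80" yields (size_t)-1, while well-formed text is counted one per character. Whether the extra checks are worth it depends, as said, on the application.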

<snip example>
 
