Count chars (not bytes) in UTF8 strings

David RF · May 8, 2009

Hi friends, is this function correct? (count chars in UTF8 strings)

size_t strchars(const char *s)
{
    size_t i = 0;

    if (!s) return 0;
    while (*s) {
        if ((*s & 0xc0) != 0x80) i++;
        s++;
    }
    return i;
}

Antoninus Twink · May 8, 2009

Hi friends, is this function correct? (count chars in UTF8 strings)

Yes it is.

You could also use mbstowcs(NULL, s, 0).
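
For anyone trying the mbstowcs route, here is a minimal sketch. The helper name count_chars_locale is made up for illustration. It assumes the program has already selected a UTF-8 locale (for example with setlocale(LC_CTYPE, "") on a system whose environment is UTF-8), and it relies on the POSIX behaviour of accepting a null destination, which ISO C itself does not spell out.

```c
#include <stdlib.h>

/* Hypothetical helper, not from the thread.
   Returns the number of wide characters needed to hold s, as decoded
   in the current locale's multibyte encoding, or (size_t)-1 if s
   contains a sequence that is invalid in that encoding.
   Assumption: the caller has already switched to a UTF-8 locale. */
size_t count_chars_locale(const char *s)
{
    if (!s) return 0;
    /* With a null destination the length argument is ignored (POSIX). */
    return mbstowcs(NULL, s, 0);
}
```

The result is a count of wchar_t units, so on platforms where wchar_t is 16 bits wide, a character outside the Basic Multilingual Plane may be counted as two.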

david · May 8, 2009

Yes it is.

You could also use mbstowcs(NULL, s, 0).

Thanks

Ben Bacarisse · May 8, 2009

David RF said:
Hi friends, is this function correct? (count chars in UTF8 strings)

size_t strchars(const char *s)
{
    size_t i = 0;

    if (!s) return 0;
    while (*s) {
        if ((*s & 0xc0) != 0x80) i++;
        s++;
    }
    return i;
}

There are two small problems. One is that the name is in the
implementation space: the C standard reserves function names that
begin with str and a lowercase letter for future library use, so it
would be better not to call it strxxx where xxx starts with a
lowercase letter. A related point is that the name is not ideal -- it
is really too general.

The second is a little bit more important depending on the use you
plan to make of it. You don't count valid UTF-8 characters, you
simply count UTF-8 "start" bytes (and lone single-byte encodings).
This may not matter, but you should comment the fact at the very
least. The standard mbstowcs function reports an error if the string
contains (in the part it examines) an invalid encoding.
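
To make the difference concrete, here is a rough sketch of a counter that also checks the encoding. The name utf8_count_valid and its return convention are my own assumptions, not something from the original post. It rejects stray continuation bytes, truncated sequences, overlong forms, UTF-16 surrogates and values above U+10FFFF.

```c
#include <stddef.h>

/* Hypothetical helper, not from the thread: counts code points in s
   and rejects malformed UTF-8. Returns 0 and stores the count in
   *count on success; returns -1 on a null argument or at the first
   invalid sequence (in which case *count is left untouched). */
int utf8_count_valid(const char *s, size_t *count)
{
    /* Smallest code point that legitimately needs 0..3 continuation bytes. */
    static const unsigned long min_cp[] = { 0, 0x80, 0x800, 0x10000 };
    const unsigned char *p = (const unsigned char *)s;
    size_t n = 0;

    if (!s || !count) return -1;
    while (*p) {
        unsigned long cp;
        int extra, k;   /* extra = continuation bytes expected */

        if (*p < 0x80)                { cp = *p;        extra = 0; }
        else if ((*p & 0xe0) == 0xc0) { cp = *p & 0x1f; extra = 1; }
        else if ((*p & 0xf0) == 0xe0) { cp = *p & 0x0f; extra = 2; }
        else if ((*p & 0xf8) == 0xf0) { cp = *p & 0x07; extra = 3; }
        else return -1;   /* stray continuation byte, or 0xf8..0xff */
        p++;

        for (k = 0; k < extra; k++, p++) {
            /* A terminating NUL fails this test too, so truncated
               sequences are caught without reading past the end. */
            if ((*p & 0xc0) != 0x80) return -1;
            cp = (cp << 6) | (*p & 0x3f);
        }

        if (cp < min_cp[extra]) return -1;            /* overlong form */
        if (cp >= 0xd800 && cp <= 0xdfff) return -1;  /* surrogate */
        if (cp > 0x10ffff) return -1;                 /* out of range */
        n++;
    }
    *count = n;
    return 0;
}
```

A caller would pass the address of a size_t and check the return value before using the count. Whether the extra checks are worth it depends, as noted, on where the strings come from.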

david · May 8, 2009

There are two small problems. One is that the name is in the
implementation space: the C standard reserves function names that
begin with str and a lowercase letter for future library use, so it
would be better not to call it strxxx where xxx starts with a
lowercase letter. A related point is that the name is not ideal -- it
is really too general.

The second is a little bit more important depending on the use you
plan to make of it. You don't count valid UTF-8 characters, you
simply count UTF-8 "start" bytes (and lone single-byte encodings).
This may not matter, but you should comment the fact at the very
least. The standard mbstowcs function reports an error if the string
contains (in the part it examines) an invalid encoding.

Thanks, strChars() seems a better name

About the second point, don't know if that's the correct way:

#include <stdio.h>
#include <string.h>

size_t strChars(const char *s)
{
    size_t i = 0;

    if (!s) return 0;
    while (*s) {
        if ((*s & 0xc0) != 0x80) i++;
        s++;
    }
    return i;
}

int main(void)
{
    int i, len;

    printf("\nWithout strchars:\n");
    printf("%s\n", "Cañaveral");
    len = (int)strlen("Cañaveral");   /* bytes, not characters */
    for (i = 0; i < len; i++) putchar('-');
    printf("\n");
    /*
    output:
    Without strchars:
    Cañaveral
    ----------
             ^ One more char
    */

    printf("\nWith strchars:\n");
    printf("%s\n", "Cañaveral");
    len = (int)strChars("Cañaveral");
    for (i = 0; i < len; i++) putchar('-');
    printf("\n");

    /*
    output:
    With strchars:
    Cañaveral
    ---------
    */

    return 0;
}

Ben Bacarisse · May 8, 2009

Best not to quote sigs.

Thanks, strChars() seems a better name
About the second point, don't know if that's the correct way:

The count is correct when the string is correct. The point is your
function returns an answer when there is none. There are strings that
contain collections of byte values that do not correspond to UTF-8
encoded characters and your function will return a number when
presented with these. That is not an error by itself but it is
considered good practice to count only valid character encodings.
Anyway, as I said, it simply may not matter for your application.
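
A few concrete cases may help. The sketch below repeats the logic of the posted function under a hypothetical name (utf8_start_bytes) so it can be tried on malformed input. The test strings are my own examples.

```c
#include <stddef.h>

/* Same logic as the function posted in this thread, under a
   hypothetical name: counts every byte that is not a UTF-8
   continuation byte (10xxxxxx), with no validity checking. */
size_t utf8_start_bytes(const char *s)
{
    size_t i = 0;

    if (!s) return 0;
    while (*s) {
        if ((*s & 0xc0) != 0x80) i++;
        s++;
    }
    return i;
}
```

Called with "\xc3" (a lead byte with its continuation byte missing) this returns 1. With "\xff\xfe" (two bytes that never occur in UTF-8) it returns 2, and with "\x80\x80" (two continuation bytes with no lead byte) it returns 0. None of these strings is valid UTF-8, yet each produces a plausible-looking count.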

<snip example>
