Count chars (not bytes) in UTF8 strings

David RF · May 8, 2009

Hi friends, is this function correct? (count chars in UTF8 strings)

size_t strchars(const char *s)
{
    size_t i = 0;

    if (!s) return 0;
    while (*s) {
        if ((*s & 0xc0) != 0x80) i++;
        s++;
    }
    return i;
}

Antoninus Twink · May 8, 2009

David RF said:
Hi friends, is this function correct? (count chars in UTF8 strings)

Yes it is.

You could also use mbstowcs(NULL, s, 0).
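
A note on the mbstowcs suggestion: it only counts UTF-8 characters when the current locale's multibyte encoding is UTF-8, and the behaviour with a NULL destination is specified by POSIX rather than by ISO C. Below is a minimal sketch under those assumptions. The names count_chars_mbstowcs and select_utf8_locale, and the list of locale names tried, are hypothetical and not from the thread.

```c
#include <locale.h>
#include <stdlib.h>

/* Hypothetical helper: tries a few common UTF-8 locale names and
   keeps the first one in which a two-byte UTF-8 sequence decodes
   to exactly one wide character.
   Returns 1 on success, 0 if no such locale is available. */
int select_utf8_locale(void)
{
    static const char *const names[] = { "C.UTF-8", "en_US.UTF-8", "" };
    size_t i;

    for (i = 0; i < sizeof names / sizeof names[0]; i++) {
        if (setlocale(LC_CTYPE, names[i]) != NULL &&
            mbstowcs(NULL, "\xc3\xb1", 0) == 1)
            return 1;
    }
    return 0;
}

/* Counts the characters in s using the current locale's multibyte
   encoding. Returns (size_t)-1 if s contains an invalid sequence.
   Assumption: a UTF-8 locale has already been selected, and the
   platform accepts a NULL destination (POSIX behaviour). */
size_t count_chars_mbstowcs(const char *s)
{
    if (!s) return 0;
    return mbstowcs(NULL, s, 0);
}
```

Call select_utf8_locale() once at startup. If it returns 0, no UTF-8 locale was found and the result of count_chars_mbstowcs() should not be read as a UTF-8 character count.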

david · May 8, 2009

Antoninus Twink said:
Yes it is.

You could also use mbstowcs(NULL, s, 0).

Thanks

Ben Bacarisse · May 8, 2009

David RF said:
Hi friends, is this function correct? (count chars in UTF8 strings)

size_t strchars(const char *s)
{
    size_t i = 0;

    if (!s) return 0;
    while (*s) {
        if ((*s & 0xc0) != 0x80) i++;
        s++;
    }
    return i;
}

There are two small problems. One is that the name is in the name
space reserved for the implementation. It would be better not to call
it strxxx where xxx starts with a lowercase letter. A related point is
that the name is not ideal -- it is really too general.

The second is a little bit more important depending on the use you
plan to make of it. You don't count valid UTF-8 characters, you
simply count UTF-8 "start" bytes (and lone single-byte encodings).
This may not matter, but you should comment the fact at the very
least. The standard mbstowcs function reports an error if the string
contains (in the part it examines) an invalid encoding.
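
To make the difference concrete, here is a rough sketch of a counter that reports an error instead of a number when the bytes are not well-formed UTF-8. The name utf8_count_valid and the choice of (size_t)-1 as the error value (borrowed from mbstowcs) are assumptions for illustration only. It needs no locale.

```c
#include <stddef.h>

/* Counts the code points in a NUL-terminated UTF-8 string.
   Returns (size_t)-1 if the string is not well-formed UTF-8:
   stray continuation byte, invalid lead byte, truncated sequence,
   overlong form, surrogate, or value above U+10FFFF. */
size_t utf8_count_valid(const char *str)
{
    const unsigned char *s = (const unsigned char *)str;
    size_t count = 0;

    if (!s) return 0;
    while (*s) {
        unsigned long cp;   /* decoded code point */
        int extra, k;       /* continuation bytes expected */

        if (*s < 0x80)                { cp = *s;        extra = 0; }
        else if ((*s & 0xe0) == 0xc0) { cp = *s & 0x1f; extra = 1; }
        else if ((*s & 0xf0) == 0xe0) { cp = *s & 0x0f; extra = 2; }
        else if ((*s & 0xf8) == 0xf0) { cp = *s & 0x07; extra = 3; }
        else return (size_t)-1;       /* not a valid start byte */
        s++;

        for (k = 0; k < extra; k++, s++) {
            /* A NUL here fails the test too, so a truncated
               sequence is rejected without reading past the end. */
            if ((*s & 0xc0) != 0x80) return (size_t)-1;
            cp = (cp << 6) | (*s & 0x3f);
        }

        if ((extra == 1 && cp < 0x80) ||        /* overlong forms */
            (extra == 2 && cp < 0x800) ||
            (extra == 3 && cp < 0x10000) ||
            (cp >= 0xd800 && cp <= 0xdfff) ||   /* surrogates */
            cp > 0x10ffff)                      /* out of range */
            return (size_t)-1;

        count++;
    }
    return count;
}
```

For a well-formed string this agrees with the start-byte count, so the extra work only pays off when the input may be malformed.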

david · May 8, 2009

Ben Bacarisse said:
There are two small problems. One is that the name is in the name
space reserved for the implementation. It would be better not to call
it strxxx where xxx starts with a lowercase letter. A related point is
that the name is not ideal -- it is really too general.

The second is a little bit more important depending on the use you
plan to make of it. You don't count valid UTF-8 characters, you
simply count UTF-8 "start" bytes (and lone single-byte encodings).
This may not matter, but you should comment the fact at the very
least. The standard mbstowcs function reports an error if the string
contains (in the part it examines) an invalid encoding.

Thanks, strChars() seems a better name.

About the second point, I don't know if this is the correct way:

#include <stdio.h>
#include <string.h>

/* Counts UTF-8 start bytes, i.e. every byte that is not a 10xxxxxx
   continuation byte. It does not validate the encoding. */
size_t strChars(const char *s)
{
    size_t i = 0;

    if (!s) return 0;
    while (*s) {
        if ((*s & 0xc0) != 0x80) i++;
        s++;
    }
    return i;
}

/* The outputs shown assume this source file and the terminal
   both use UTF-8. */
int main(void)
{
    int i, len;

    printf("\nWithout strchars:\n");
    printf("%s\n", "Cañaveral");
    len = (int)strlen("Cañaveral");   /* byte count: 10 */
    for (i = 0; i < len; i++) putchar('-');
    printf("\n");
    /*
    output:
    Without strchars:
    Cañaveral
    ----------
             ^ One more char
    */

    printf("\nWith strchars:\n");
    printf("%s\n", "Cañaveral");
    len = (int)strChars("Cañaveral"); /* character count: 9 */
    for (i = 0; i < len; i++) putchar('-');
    printf("\n");

    /*
    output:
    With strchars:
    Cañaveral
    ---------
    */

    return 0;
}

Ben Bacarisse · May 8, 2009

Best not to quote sigs.

david said:
Thanks, strChars() seems a better name
About the second point, don't know if that's the correct way:

The count is correct when the string is correct. The point is that
your function returns an answer when there is none. There are strings
that contain collections of byte values that do not correspond to
UTF-8 encoded characters, and your function will return a number when
presented with these. That is not an error by itself, but it is
considered good practice to count only valid character encodings.
Anyway, as I said, it simply may not matter for your application.
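
A few concrete inputs may help here. The sketch below repeats the counting logic under a hypothetical name, count_start_bytes, and the comments list what it returns for byte strings that are not valid UTF-8. The sample strings are illustrative choices, not taken from the thread.

```c
#include <stddef.h>

/* Same logic as the function under discussion, under a hypothetical
   name: counts every byte that is not a continuation byte
   (bit pattern 10xxxxxx). No validation is performed. */
size_t count_start_bytes(const char *s)
{
    size_t n = 0;

    if (!s) return 0;
    for (; *s; s++)
        if ((*s & 0xc0) != 0x80) n++;
    return n;
}

/* None of these inputs is valid UTF-8, yet each gets a count:
 *
 *   count_start_bytes("\xc3")      returns 1  (lead byte, nothing after it)
 *   count_start_bytes("\x80\x80")  returns 0  (only stray continuation bytes)
 *   count_start_bytes("\xff\xfe")  returns 2  (bytes that never occur in UTF-8)
 *   count_start_bytes("a\x80")     returns 1  (stray continuation after 'a')
 */
```

Whether any of these results is acceptable depends on where the strings come from. If they are always produced by code that emits valid UTF-8, the simple count is enough.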

<snip example>
