Count chars (not bytes) in UTF8 strings

Discussion in 'C Programming' started by David RF, May 8, 2009.

  1. David RF

    David RF Guest

    Hi friends, is this function correct? (count chars in UTF8 strings)

    size_t strchars(const char *s)
    {
        size_t i = 0;

        if (!s) return 0;
        while (*s) {
            if ((*s & 0xc0) != 0x80) i++;
            s++;
        }
        return i;
    }
     
    David RF, May 8, 2009
    #1

  2. On 8 May 2009 at 10:04, David RF wrote:
    > Hi friends, is this function correct? (count chars in UTF8 strings)


    Yes it is.

    You could also use mbstowcs(NULL, s, 0).
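A minimal sketch of the mbstowcs approach, under some assumptions. The helper names count_chars_locale and use_utf8_locale are hypothetical, and the locale names are guesses that vary between systems. The null-destination form is specified by POSIX rather than ISO C, and it only counts UTF-8 characters once a UTF-8 locale has been selected; in the default "C" locale it cannot be relied on for non-ASCII input.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>

/* Hypothetical helper, not from the thread: counts the characters in s
   according to the multibyte encoding of the current locale.
   Returns (size_t)-1 if s contains an invalid multibyte sequence. */
size_t count_chars_locale(const char *s)
{
    assert(s != NULL);
    /* With a null destination (POSIX behaviour), nothing is stored and
       the return value is the number of wide characters required. */
    return mbstowcs(NULL, s, 0);
}

/* Tries to select a UTF-8 locale for character handling. The locale
   names are assumptions. Returns 1 on success, 0 otherwise. */
int use_utf8_locale(void)
{
    return setlocale(LC_CTYPE, "C.UTF-8") != NULL
        || setlocale(LC_CTYPE, "en_US.UTF-8") != NULL;
}
```

A caller would check use_utf8_locale() once at startup and fall back to a hand-written counter if it fails.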
     
    Antoninus Twink, May 8, 2009
    #2

  3. David RF

    david Guest

    On 8 May, 13:11, Antoninus Twink <> wrote:
    > On  8 May 2009 at 10:04, David RF wrote:
    >
    > > Hi friends, is this function correct? (count chars in UTF8 strings)

    >
    > Yes it is.
    >
    > You could also use mbstowcs(NULL, s, 0).


    Thanks
     
    david, May 8, 2009
    #3
  4. David RF <> writes:

    > Hi friends, is this function correct? (count chars in UTF8 strings)
    >
    > size_t strchars(const char *s)
    > {
    > size_t i = 0;
    >
    > if (!s) return 0;
    > while (*s) {
    > if ((*s & 0xc0) != 0x80) i++;
    > s++;
    > }
    > return i;
    > }


    There are two small problems. One is that the name is in the
    implementation's reserved space: identifiers that begin with str
    followed by a lowercase letter are reserved for the library. It would
    be better not to call it strxxx where xxx starts with a lowercase
    letter. A related point is that the name is not ideal -- it is really
    too general.

    The second is a little bit more important depending on the use you
    plan to make of it. You don't count valid UTF-8 characters, you
    simply count UTF-8 "start" bytes (and lone single-byte encodings).
    This may not matter, but you should comment the fact at the very
    least. The standard mbstowcs function reports an error if the string
    contains (in the part it examines) an invalid encoding.
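To make the distinction concrete, here is one possible sketch of a counter that validates as it goes. It is not code from the thread: the name utf8_count_valid and the choice of (size_t)-1 as the error value are assumptions. It rejects stray continuation bytes, truncated sequences, overlong encodings, surrogates, and values above U+10FFFF.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical name, chosen to stay out of the reserved str* space.
   Counts the code points in a NUL-terminated UTF-8 string.
   Returns (size_t)-1 if the string is not well-formed UTF-8. */
size_t utf8_count_valid(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t count = 0;

    assert(s != NULL);
    while (*p) {
        unsigned long cp;   /* decoded code point */
        unsigned long min;  /* smallest value allowed for this length */
        int extra;          /* continuation bytes expected */

        if (*p < 0x80)                { cp = *p;        extra = 0; min = 0; }
        else if ((*p & 0xe0) == 0xc0) { cp = *p & 0x1f; extra = 1; min = 0x80; }
        else if ((*p & 0xf0) == 0xe0) { cp = *p & 0x0f; extra = 2; min = 0x800; }
        else if ((*p & 0xf8) == 0xf0) { cp = *p & 0x07; extra = 3; min = 0x10000; }
        else return (size_t)-1;  /* stray continuation or invalid lead byte */
        p++;

        while (extra-- > 0) {
            /* A terminating NUL fails this test too, so a truncated
               sequence never reads past the end of the string. */
            if ((*p & 0xc0) != 0x80) return (size_t)-1;
            cp = (cp << 6) | (*p & 0x3f);
            p++;
        }

        if (cp < min) return (size_t)-1;                      /* overlong */
        if (cp >= 0xd800 && cp <= 0xdfff) return (size_t)-1;  /* surrogate */
        if (cp > 0x10ffff) return (size_t)-1;                 /* out of range */
        count++;
    }
    return count;
}
```

Whether the extra checks are worth it depends on where the strings come from; for trusted input the simple start-byte count gives the same answer.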

    --
    Ben.
     
    Ben Bacarisse, May 8, 2009
    #4
  5. David RF

    david Guest

    On 8 May, 16:22, Ben Bacarisse <> wrote:

    > There are two small problems.  One is that the name is in the
    > implementation space.  It would be better not to call it strxxx where
    > xxx starts with a lowercase letter.  A related point is that the name
    > is not ideal -- it is really too general.
    >
    > The second is a little bit more important depending on the use you
    > plan to make of it.  You don't count valid UTF-8 characters, you
    > simply count UTF-8 "start" bytes (and lone single-byte encodings).
    > This may not matter, but you should comment the fact at the very
    > least.  The standard mbstowcs function reports an error if the string
    > contains (in the part it examines) an invalid encoding.
    >
    > --
    > Ben.


    Thanks, strChars() seems a better name ;)
    About the second point, I don't know if this is the correct way:

    #include <stdio.h>
    #include <string.h>

    size_t strChars(const char *s)
    {
        size_t i = 0;

        if (!s) return 0;
        while (*s) {
            if ((*s & 0xc0) != 0x80) i++;
            s++;
        }
        return i;
    }

    int main(void)
    {
        int i, len;

        printf("\nWithout strchars:\n");
        printf("%s\n", "Cañaveral");
        len = (int)strlen("Cañaveral");
        for (i = 0; i < len; i++) putchar('-');
        printf("\n");
        /*
        output:
        Without strchars:
        Cañaveral
        ----------
                 ^ One more char
        */

        printf("\nWith strchars:\n");
        printf("%s\n", "Cañaveral");
        len = (int)strChars("Cañaveral");
        for (i = 0; i < len; i++) putchar('-');
        printf("\n");

        /*
        output:
        With strchars:
        Cañaveral
        ---------
        */

        return 0;
    }
     
    david, May 8, 2009
    #5
  6. david <> writes:

    > On 8 May, 16:22, Ben Bacarisse <> wrote:
    >
    >> ...  You don't count valid UTF-8 characters, you
    >> simply count UTF-8 "start" bytes (and lone single-byte encodings).
    >>
    >> This may not matter, but you should comment the fact at the very
    >> least.  The standard mbstowcs function reports an error if the string
    >> contains (in the part it examines) an invalid encoding.
    >>
    >> --
    >> Ben.


    Best not to quote sigs.

    > Thanks, strChars() seems a better name ;)
    > About the second point, don't know if that's the correct way:


    The count is correct when the string is correct. The point is your
    function returns an answer when there is none. There are strings that
    contain collections of byte values that do not correspond to UTF-8
    encoded characters and your function will return a number when
    presented with these. That is not an error by itself but it is
    considered good practice to count only valid character encodings.
    Anyway, as I said, it simply may not matter for your application.
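As an illustration of that point, the sketch below feeds a few malformed byte strings to the counter from the thread. The names utf8_count_start_bytes and show_malformed_counts are hypothetical; the logic is the original loop unchanged. Each input is not valid UTF-8, yet each still produces a count.

```c
#include <assert.h>
#include <stddef.h>

/* The counter from the thread under a hypothetical, non-reserved name:
   it counts every byte that is not a continuation byte (10xxxxxx). */
size_t utf8_count_start_bytes(const char *s)
{
    size_t n = 0;

    if (!s) return 0;
    for (; *s; s++)
        if ((*s & 0xc0) != 0x80) n++;
    return n;
}

/* None of these inputs is well-formed UTF-8, but each gets an answer. */
void show_malformed_counts(void)
{
    /* Lead byte of a two-byte sequence with nothing after it. */
    assert(utf8_count_start_bytes("\xc3") == 1);
    /* Continuation bytes with no lead byte before them. */
    assert(utf8_count_start_bytes("\x80\x80") == 0);
    /* Byte values that never occur in UTF-8. */
    assert(utf8_count_start_bytes("\xff\xfe") == 2);
    /* Overlong two-byte encoding of U+0000. */
    assert(utf8_count_start_bytes("\xc0\x80") == 1);
}
```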

    <snip example>

    --
    Ben.
     
    Ben Bacarisse, May 8, 2009
    #6
