toupper UTF8 string

Discussion in 'C Programming' started by David RF, Sep 24, 2009.

  1. David RF

    David RF Guest

    Hi friends, here I am trying to avoid wchar_t in UTF-8 strings.
    I'd be glad to hear some criticism.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /*
    Returns a newly allocated string
    Upper-cases (a - z) and (ÿþýüûúùø÷öõôóòñðïîíìëêéèçæåäãâáà)
    -61 = first byte
    -65 = ÿ
    -96 = à
    */
    char *stoupper(const char *s)
    {
        size_t len;
        char *p = NULL;
        int c = 0;

        if (s) {
            len = strlen(s);
            p = malloc(len + 1);
            if (p) {
                while (*s) {
                    if ((*s >= 'a') && (*s <= 'z')) {
                        c = *p = *s - 'a' + 'A';
                    } else if ((c == -61) && ((*s <= -65) && (*s >= -96))) {
                        c = *p = *s - 32;
                    } else {
                        c = *p = *s;
                    }
                    p++;
                    s++;
                }
                *p = '\0';
                p -= len;
            }
        }
        return p;
    }

    int main(void)
    {
        char *s = "María tiene moño, Ramón tiene un camión.";

        s = stoupper(s);
        printf("%s\n", s);
        return 0;
    }
    David RF, Sep 24, 2009
    #1

  2. David RF writes:

    > Hi friends, here I am trying to avoid wchar_t in UTF-8 strings.


    Why? Without knowing why, it is almost impossible to comment on the
    code. It relies on a set of assumptions that might be acceptable but
    I can't tell without knowing why you are not using C's multi-byte
    string functions.

    For example you assume char is signed.

    > I'd be glad to hear some criticism.
    >
    > #include <stdio.h>
    > #include <stdlib.h>
    > #include <string.h>
    >
    > /*
    > Returns a newly allocated string
    > Upper-cases (a - z) and (ÿþýüûúùø÷öõôóòñðïîíìëêéèçæåäãâáà)
    > -61 = first byte
    > -65 = ÿ
    > -96 = à
    > */


    It can't work for ÿ (there is a Ÿ but it is not where your code
    expects it to be) and upper-casing ÷ to × is just odd!
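
    A minimal sketch of how the byte-level approach could sidestep both problems. The name stoupper_latin1 is hypothetical, not from the original post. It compares bytes as unsigned char, so it does not depend on char being signed, and it skips ÷ (U+00F7) and ÿ (U+00FF). It assumes well-formed UTF-8 and only covers a-z and U+00E0..U+00FE:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: upper-cases ASCII a-z and the two-byte UTF-8
   sequences for U+00E0..U+00FE, except U+00F7 (the division sign).
   U+00FF is left alone because its upper-case form, U+0178, starts
   with a different lead byte. Returns a newly allocated string or NULL. */
char *stoupper_latin1(const char *s)
{
    size_t len, i;
    unsigned char prev = 0;   /* previous input byte */
    char *out;

    if (!s)
        return NULL;
    len = strlen(s);
    out = malloc(len + 1);
    if (!out)
        return NULL;
    for (i = 0; i < len; i++) {
        unsigned char c = (unsigned char)s[i];
        unsigned char u = c;

        if (c >= 'a' && c <= 'z')
            u = (unsigned char)(c - 'a' + 'A');
        else if (prev == 0xC3 && c >= 0xA0 && c <= 0xBE && c != 0xB7)
            u = (unsigned char)(c - 0x20);   /* e.g. C3 B1 -> C3 91 */
        out[i] = (char)u;
        prev = c;
    }
    out[len] = '\0';
    return out;
}
```

    The caller owns the returned buffer and should free() it. Anything outside those two ranges is copied through unchanged.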

    <snip>
    --
    Ben.
    Ben Bacarisse, Sep 24, 2009
    #2

  3. David RF

    David RF Guest

    On 24 sep, 16:09, Ben Bacarisse wrote:

    > It can't work for ÿ (there is a Ÿ but it is not where your code
    > expects it to be) and upper-casing ÷ to × is just odd!


    You're right

    > I can't tell without knowing why you are not using C's multi-byte
    > string functions.


    Perhaps it's time to take a look at those libraries :)
    David RF, Sep 24, 2009
    #3
  4. David RF

    David RF Guest

    On 24 sep, 16:09, Ben Bacarisse wrote:

    > Why?  Without knowing why, it is almost impossible to comment on the
    > code.  It relies on a set of assumptions that might be acceptable but
    > I can't tell without knowing why you are not using C's multi-byte
    > string functions.


    Another way to do this? I am a rookie using wchars

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <locale.h>
    #include <wchar.h>
    #include <wctype.h>

    char *stoupper(const char *s)
    {
        char *p = NULL;
        wchar_t wc;
        size_t len;
        int mblen;

        if (s) {
            len = strlen(s);
            p = malloc(len + 1);
            if (p) {
                while (*s) {
                    mbtowc(&wc, s, MB_CUR_MAX);
                    wc = towupper(wc);
                    mblen = wctomb(p, wc);
                    p += mblen;
                    s += mblen;
                }
                *p = '\0';
                p -= len;
            }
        }
        return p;
    }

    int main(void)
    {
        char *s = "María tiene moño, Ramón tiene un camión.";

        setlocale(LC_CTYPE, "");
        s = stoupper(s);
        if (s) {
            printf("%s\n", s);
            free(s);
        }
        return 0;
    }

    Thanks again Ben
    David RF, Sep 24, 2009
    #4
  5. David RF writes:
    <snip>
    > Another way to do this? I am a rookie using wchars
    >
    > #include <stdio.h>
    > #include <stdlib.h>
    > #include <string.h>
    > #include <locale.h>
    > #include <wchar.h>
    > #include <wctype.h>
    >
    > char *stoupper(const char *s)
    > {
    > char *p = NULL;
    > wchar_t wc;
    > size_t len;
    > int mblen;
    >
    > if (s) {
    > len = strlen(s);
    > p = malloc(len + 1);
    > if (p) {
    > while (*s) {
    > mbtowc(&wc, s, MB_CUR_MAX);


    I'd make a few small changes here. (1) mbtowc tells you how many chars
    it used to make the wide one. You can use this later on to confirm
    your assumption that the overall length is not changed by
    upper-casing. (2) You can pass len instead of MB_CUR_MAX so long as
    you update it using the return from mbtowc. This means there is no
    possibility of ever looking past the end of s even with an ill-formed
    UTF-8 string. (3) mbtowc might fail (and it can tell you when the
    string has run out) so you can put the call in the while loop test:

    while ((mblen = mbtowc(&wc, s, len)) > 0) ...

    > wc = towupper(wc);
    > mblen = wctomb(p, wc);


    I'd use a new variable so that...

    > p += mblen;
    > s += mblen;


    ... here you can put the brakes on if you find the two lengths are not
    the same.

    > }
    > *p = '\0';
    > p -= len;
    > }
    > }
    > return p;
    > }
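
    Put together, the three suggestions could look roughly like the sketch below. The name stoupper_checked and the choice to return NULL when a length differs are assumptions for illustration, not something stated in the post. The converted character goes into a small scratch buffer first, so a longer upper-case form can never overrun the output:

```c
#include <limits.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Sketch only: returns a newly allocated upper-cased copy of s, or NULL
   if s is NULL, allocation fails, the input is not valid in the current
   locale's encoding, or upper-casing changes a character's byte length. */
char *stoupper_checked(const char *s)
{
    char tmp[MB_LEN_MAX];     /* scratch space for one converted character */
    char *buf, *p;
    size_t len;
    wchar_t wc;
    int inlen, outlen;

    if (!s)
        return NULL;
    len = strlen(s);
    buf = p = malloc(len + 1);
    if (!buf)
        return NULL;

    while (len > 0) {
        /* Never look at more than the len bytes that remain. */
        inlen = mbtowc(&wc, s, len);
        if (inlen <= 0)
            goto fail;                    /* ill-formed input */
        outlen = wctomb(tmp, (wchar_t)towupper((wint_t)wc));
        if (outlen != inlen)
            goto fail;                    /* length changed: put the brakes on */
        memcpy(p, tmp, (size_t)outlen);
        p += outlen;
        s += inlen;
        len -= (size_t)inlen;             /* keep len in step with s */
    }
    *p = '\0';
    return buf;

fail:
    free(buf);
    return NULL;
}
```

    As in the posted code, the caller would need setlocale(LC_CTYPE, "") with a UTF-8 locale for non-ASCII input to be recognised; in the default "C" locale only ASCII letters are converted.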


    <snip>
    --
    Ben.
    Ben Bacarisse, Sep 25, 2009
    #5
  6. David RF

    David RF Guest

    On 25 sep, 02:29, Ben Bacarisse wrote:
    > I'd make a few small changes here. (1) mbtowc tells you how many chars
    > it used to make the wide one.  You can use this later on to confirm
    > your assumption that the overall length is not changed by
    > upper-casing.  (2) You can pass len instead of MB_CUR_MAX so long as
    > you update it using the return from mbtowc.  This means there is no
    > possibility of ever looking past the end of s even with an ill-formed
    > UTF-8 string.  (3) mbtowc might fail (and it can tell you when the
    > string has run out) so you can put the call in the while loop test:
    >
    >   while ((mblen = mbtowc(&wc, s, len)) > 0) ...
    >
    > >                            wc = towupper(wc);
    > >                            mblen = wctomb(p, wc);

    >
    > I'd use a new variable so that...
    >
    > >                            p += mblen;
    > >                            s += mblen;

    >
    > ... here you can put the brakes on if you find the two lengths are not
    > the same.
    >
    > >                    }
    > >                    *p = '\0';
    > >                    p -= len;
    > >            }
    > >    }
    > >    return p;
    > > }


    Thanks again Ben, I miss Pascal (a lot) :)

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <locale.h>
    #include <wchar.h>
    #include <wctype.h>

    static char *stoupper(const char *s)
    {
        char *p = NULL, *oldp;
        size_t len, size, used;
        wchar_t wc;
        int wclen, mclen;

        if (s) {
            len = strlen(s);
            /* MB_CUR_MAX bytes of slack so wctomb can never overrun */
            size = len + MB_CUR_MAX + 1;
            oldp = p = malloc(size);
            if (p) {
                while ((wclen = mbtowc(&wc, s, len)) > 0) {
                    /* I know, too many casts, but it keeps -Wconversion happy */
                    mclen = wctomb(p, (wchar_t)towupper((wint_t)wc));
                    if (mclen < 0) {
                        free(oldp);
                        return NULL;
                    }
                    /* Strange ... but I always trust Ben :) */
                    if (mclen > wclen) {
                        used = (size_t)(p - oldp);
                        size += (size_t)(mclen - wclen);
                        /* realloc is a pain, but what else can I do? */
                        p = realloc(oldp, size);
                        if (!p) {
                            free(oldp);
                            return NULL;
                        }
                        oldp = p;
                        p += used;
                    }
                    p += mclen;
                    s += wclen;
                    len -= (size_t)wclen; /* bytes of s still unread */
                }
                *p = '\0';
                p = oldp; /* return the start of the buffer */
            }
        }
        return p;
    }

    int main(void)
    {
        char *s = "María tiene moño, Ramón tiene un camión.";

        setlocale(LC_CTYPE, "");
        s = stoupper(s);
        if (s) {
            printf("%s\n", s);
            free(s);
        }
        return 0;
    }
    David RF, Sep 25, 2009
    #6
  7. Nobody

    Nobody Guest

    On Thu, 24 Sep 2009 05:33:19 -0700, David RF wrote:

    > Hi friends, here I am trying to avoid wchar_t in UTF-8 strings.
    > I'd be glad to hear some criticism.


    Convert to wchar_t[], use towupper(), convert back to UTF-8.

    Note: the C standard doesn't guarantee that wchar_t is Unicode, nor does
    it provide any function which can reliably convert between a specific
    encoding and wchar_t (mbstowcs/wcstombs use the locale's encoding, and the
    details of locales are implementation-defined).

    Also, note that converting a string to upper-case isn't quite as simple as
    replacing each character with another character. For some characters, the
    upper-case equivalent consists of multiple characters; e.g. the upper-case
    equivalent of "ß" (German sharp s) is "SS".
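
    To make the "SS" point concrete, here is a small sketch of an upper-casing routine whose output may be longer than its input. The function name wcs_upper_expand is hypothetical, the code assumes wchar_t values are Unicode code points, and it special-cases only U+00DF. A full implementation would take its expansions from Unicode's special casing data rather than hard-coding one:

```c
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: upper-cases the wide string `in` into `out`,
   which has room for `cap` wide characters including the terminator.
   Assumes wchar_t values are Unicode code points. Only U+00DF (sharp s)
   is expanded; every other character goes through towupper().
   Returns the number of wide characters written, not counting the
   terminator, or (size_t)-1 if `out` is too small. */
size_t wcs_upper_expand(const wchar_t *in, wchar_t *out, size_t cap)
{
    size_t n = 0;

    if (cap == 0)
        return (size_t)-1;
    for (; *in; in++) {
        size_t need = (*in == 0xDF) ? 2 : 1;

        if (cap < n + need + 1)          /* keep room for the terminator */
            return (size_t)-1;
        if (need == 2) {
            out[n++] = L'S';             /* one character in, two out */
            out[n++] = L'S';
        } else {
            out[n++] = (wchar_t)towupper((wint_t)*in);
        }
    }
    out[n] = L'\0';
    return n;
}
```

    Because the result can grow, the output buffer cannot simply be sized from the input length, which is the same reason a byte-for-byte in-place conversion is not safe in general.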
    Nobody, Sep 25, 2009
    #7
  8. James Kuyper

    James Kuyper Guest

    Joe Wright wrote:
    ...
    > Nor does the C Standard know anything at all about Unicode.


    It may not know enough about Unicode, but it does know something: see
    6.4.3 and Annex D.
    James Kuyper, Sep 26, 2009
    #8
  9. Nobody

    Nobody Guest

    On Sat, 26 Sep 2009 13:25:23 -0400, James Kuyper wrote:

    > Joe Wright wrote:
    > ...
    >> Nor does the C Standard know anything at all about Unicode.

    >
    > It may not know enough about Unicode, but it does know something: see
    > 6.4.3 and Annex D.


    Also 6.10.8p2:

    __STDC_ISO_10646__  A decimal constant of the form yyyymmL (for
                        example, 199712L), intended to indicate that
                        values of type wchar_t are the coded
                        representations of the characters defined by
                        ISO/IEC 10646, along with all amendments and
                        technical corrigenda as of the specified year
                        and month.

    So wchar_t *might* be Unicode, and if it is, the implementation will state
    this. But it isn't required to be.

    If it isn't, then you have to either:

    a) figure out how to convert UTF-8 to/from wchar_t, in which case you can
    then use towupper(), or

    b) convert UTF-8 to/from Unicode codepoints yourself (easy enough), but
    then you need to write your own towupper() equivalent (which
    basically means that you need to get the tables).
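
    For option (b), the conversion part might look something like the sketch below. The names utf8_decode and utf8_encode are made up for this example, and unsigned long is just one choice of type wide enough for any code point. The case-mapping tables themselves are not shown:

```c
#include <stddef.h>

/* Decodes one UTF-8 sequence from s (at most n bytes) into *cp.
   Returns the number of bytes consumed, or 0 if the sequence is
   truncated, malformed, overlong, a surrogate, or above U+10FFFF. */
size_t utf8_decode(const unsigned char *s, size_t n, unsigned long *cp)
{
    unsigned long c, min;
    size_t len, i;

    if (n == 0)
        return 0;
    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {
        len = 2; c = s[0] & 0x1F; min = 0x80;
    } else if ((s[0] & 0xF0) == 0xE0) {
        len = 3; c = s[0] & 0x0F; min = 0x800;
    } else if ((s[0] & 0xF8) == 0xF0) {
        len = 4; c = s[0] & 0x07; min = 0x10000;
    } else {
        return 0;                        /* stray continuation or bad lead byte */
    }
    if (n < len)
        return 0;
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        c = (c << 6) | (s[i] & 0x3Fu);
    }
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;
    *cp = c;
    return len;
}

/* Encodes cp as UTF-8 into out (room for 4 bytes).
   Returns the number of bytes written, or 0 if cp is not encodable. */
size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                    /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

    With these two in hand, an upper-casing loop would decode a code point, look it up in whatever case table is available, and encode the result.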
    Nobody, Sep 27, 2009
    #9
  10. Nobody writes:
    > On Sat, 26 Sep 2009 13:25:23 -0400, James Kuyper wrote:
    >
    >> Joe Wright wrote:
    >> ...
    >>> Nor does the C Standard know anything at all about Unicode.

    >>
    >> It may not know enough about Unicode, but it does know something: see
    >> 6.4.3 and Annex D.

    >
    > Also 6.10.8p2:
    >
    > __STDC_ISO_10646__  A decimal constant of the form yyyymmL (for
    >                     example, 199712L), intended to indicate that
    >                     values of type wchar_t are the coded
    >                     representations of the characters defined by
    >                     ISO/IEC 10646, along with all amendments and
    >                     technical corrigenda as of the specified year
    >                     and month.
    >
    > So wchar_t *might* be Unicode, and if it is, the implementation will state
    > this. But it isn't required to be.
    >
    > If it isn't, then you have to either:
    >
    > a) figure out how to convert UTF-8 to/from wchar_t, in which case you can
    > then use towupper(), or


    If wchar_t values don't represent Unicode code points, then converting
    from UTF-8 to wchar_t might not be possible. For example, wchar_t
    might be only 16 bits.
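
    Both conditions can be checked in code. The sketch below uses two hypothetical helper names. It tests the __STDC_ISO_10646__ macro quoted above and compares WCHAR_MAX against the largest Unicode code point, U+10FFFF:

```c
#include <stdint.h>   /* WCHAR_MAX */
#include <wchar.h>

/* Returns 1 if the implementation states that wchar_t values are
   ISO/IEC 10646 code points, 0 if it makes no such statement. */
int wchar_is_iso10646(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}

/* Returns 1 if wchar_t is wide enough to hold any code point up to
   U+10FFFF, 0 otherwise (for example when wchar_t is only 16 bits). */
int wchar_holds_all_code_points(void)
{
    return WCHAR_MAX >= 0x10FFFF;
}
```

    A program that relies on towupper() for UTF-8 text would want both to be true, and would fall back to its own tables otherwise.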

    > b) convert UTF-8 to/from Unicode codepoints yourself (easy enough), but
    > then you need to write your own towupper() equivalent (which
    > basically means that you need to get the tables).


    --
    Keith Thompson (The_Other_Keith) <http://www.ghoti.net/~kst>
    Nokia
    "We must do something. This is something. Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"
    Keith Thompson, Sep 27, 2009
    #10