toupper UTF8 string

David RF · Sep 24, 2009

Hi friends, here I am trying to avoid wchar_t in UTF8 strings.
Glad to hear some criticism.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
Return a newly allocated string
Upper from (a - z) and (ÿþýüûúùø÷öõôóòñðïîíìëêéèçæåäãâáà)
-61 = first byte
-65 = ÿ
-96 = à
*/
char *stoupper(const char *s)
{
    size_t len;
    char *p = NULL;
    int c = 0;

    if (s) {
        len = strlen(s);
        p = malloc(len + 1);
        if (p) {
            while (*s) {
                if ((*s >= 'a') && (*s <= 'z')) {
                    c = *p = *s - 'a' + 'A';
                } else if ((c == -61) && ((*s <= -65) && (*s >= -96))) {
                    c = *p = *s - 32;
                } else {
                    c = *p = *s;
                }
                p++;
                s++;
            }
            *p = '\0';
            p -= len;
        }
    }
    return p;
}

int main(void)
{
    char *s = "María tiene moño, Ramón tiene un camión.";

    s = stoupper(s);
    printf("%s\n", s);
    return 0;
}

Ben Bacarisse · Sep 24, 2009

David RF said:
Hi friends, here I am trying to avoid wchar_t in UTF8 strings.

Why? Without knowing why, it is almost impossible to comment on the
code. It relies on a set of assumptions that might be acceptable but
I can't tell without knowing why you are not using C's multi-byte
string functions.

For example you assume char is signed.

Glad to hear some criticism.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
Return a newly allocated string
Upper from (a - z) and (ÿþýüûúùø÷öõôóòñðïîíìëêéèçæåäãâáà)
-61 = first byte
-65 = ÿ
-96 = à
*/

It can't work for ÿ (there is a Ÿ but it is not where your code
expects it to be) and upper-casing ÷ to × is just odd!
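
To make the two objections above concrete, here is a minimal sketch (not from the thread) of the same byte-level trick done through unsigned char, so it does not depend on char being signed, and with ÷ and ÿ left alone. The name stoupper_latin1 is made up for illustration. The sketch assumes the input is valid UTF-8 and an ASCII-based execution character set, and it only covers a-z plus U+00E0..U+00FE.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper, not the original poster's code.
 * Upper-cases ASCII a-z and the two-byte UTF-8 sequences C3 A0..C3 BE
 * (U+00E0..U+00FE), except C3 B7 (U+00F7, the division sign).
 * C3 BF (U+00FF, y with diaeresis) is copied unchanged because its
 * upper-case form, U+0178, is encoded with a different lead byte (C5 B8).
 */
char *stoupper_latin1(const char *s)
{
    size_t i, len;
    char *out;

    if (!s)
        return NULL;
    len = strlen(s);
    out = malloc(len + 1);
    if (!out)
        return NULL;

    for (i = 0; i < len; i++) {
        /* Work on unsigned values so the comparisons mean the same
           thing whether plain char is signed or unsigned. */
        unsigned char c = (unsigned char)s[i];
        unsigned char prev = i ? (unsigned char)s[i - 1] : 0;

        if (c >= 'a' && c <= 'z')
            c = (unsigned char)(c - 'a' + 'A');
        else if (prev == 0xC3 && c >= 0xA0 && c <= 0xBE && c != 0xB7)
            c = (unsigned char)(c - 0x20);
        out[i] = (char)c;
    }
    out[len] = '\0';
    return out;
}
```

The caller owns the returned string and should free() it. Anything outside the two ranges is copied byte for byte, so this is still far from a general Unicode upper-casing routine.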

<snip>

David RF · Sep 24, 2009

It can't work for ÿ (there is a Ÿ but it is not where your code
expects it to be) and upper-casing ÷ to × is just odd!

You're right

I can't tell without knowing why you are not using C's multi-byte
string functions.

Perhaps it is time to take a look at those functions.

David RF · Sep 24, 2009

Why? Without knowing why, it is almost impossible to comment on the
code. It relies on a set of assumptions that might be acceptable but
I can't tell without knowing why you are not using C's multi-byte
string functions.

Another way to do this? I am a rookie using wchars

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <locale.h>
#include <wchar.h>
#include <wctype.h>

char *stoupper(const char *s)
{
    char *p = NULL;
    wchar_t wc;
    size_t len;
    int mblen;

    if (s) {
        len = strlen(s);
        p = malloc(len + 1);
        if (p) {
            while (*s) {
                mbtowc(&wc, s, MB_CUR_MAX);
                wc = towupper(wc);
                mblen = wctomb(p, wc);
                p += mblen;
                s += mblen;
            }
            *p = '\0';
            p -= len;
        }
    }
    return p;
}

int main(void)
{
    char *s = "María tiene moño, Ramón tiene un camión.";

    setlocale(LC_CTYPE, "");
    s = stoupper(s);
    if (s) {
        printf("%s\n", s);
        free(s);
    }
    return 0;
}

Thanks again Ben

Ben Bacarisse · Sep 25, 2009

David RF said:
Another way to do this? I am a rookie using wchars

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <locale.h>
#include <wchar.h>
#include <wctype.h>

char *stoupper(const char *s)
{
    char *p = NULL;
    wchar_t wc;
    size_t len;
    int mblen;

    if (s) {
        len = strlen(s);
        p = malloc(len + 1);
        if (p) {
            while (*s) {
                mbtowc(&wc, s, MB_CUR_MAX);

I'd make a few small changes here. (1) mbtowc tells you how many chars
it used to make the wide one. You can use this later on to confirm
your assumption that the overall length is not changed by
upper-casing. (2) you can pass len instead of MB_CUR_MAX so long as
you update it using the return from mbtowc. This means there is no
possibility of ever looking past the end of s even with an ill-formed
UTF-8 string. (3) mbtowc might fail (and it can tell you when the
string has run out) so you can put the call in the while loop test:

while ((mblen = mbtowc(&wc, s, len)) > 0) ...

                wc = towupper(wc);
                mblen = wctomb(p, wc);

I'd use a new variable so that...

                p += mblen;
                s += mblen;

... here you can put the brakes on if you find the two lengths are not
the same.

            }
            *p = '\0';
            p -= len;
        }
    }
    return p;
}
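
A minimal sketch of how the three suggestions might fit together, written as a sizing pass rather than as the conversion itself. It uses the return value of mbtowc, passes the remaining length instead of MB_CUR_MAX, drives the loop from the call, and keeps the wctomb length in its own variable. The name upper_size is hypothetical, and the sketch assumes the caller has already selected a suitable locale with setlocale.

```c
#include <limits.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper: returns the number of bytes (not counting the
 * terminating '\0') that the upper-cased form of s would need in the
 * current locale's multibyte encoding, or (size_t)-1 if s contains an
 * invalid or truncated multibyte sequence.
 */
size_t upper_size(const char *s)
{
    char buf[MB_LEN_MAX];          /* scratch space for one character */
    size_t remaining = strlen(s);  /* bytes of s not yet consumed */
    size_t total = 0;
    wchar_t wc;
    int inlen, outlen;

    /* Reset any shift state kept by the non-restartable functions. */
    (void)mbtowc(NULL, NULL, 0);
    (void)wctomb(NULL, L'\0');

    while (remaining > 0 && (inlen = mbtowc(&wc, s, remaining)) > 0) {
        outlen = wctomb(buf, (wchar_t)towupper((wint_t)wc));
        if (outlen < 0)
            return (size_t)-1;
        total += (size_t)outlen;   /* may differ from inlen */
        s += inlen;
        remaining -= (size_t)inlen;
    }
    /* If input is left over, mbtowc rejected a sequence. */
    return remaining == 0 ? total : (size_t)-1;
}
```

With the size known up front, a single malloc(upper_size(s) + 1) is enough and the conversion loop never needs to grow its buffer. Comparing the result with strlen(s) also tells you whether upper-casing changed the length.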

<snip>

David RF · Sep 25, 2009

I'd make a few small changes here. (1) mbtowc tells you how many chars
it used to make the wide one. You can use this later on to confirm
your assumption that the overall length is not changed by
upper-casing. (2) you can pass len instead of MB_CUR_MAX so long as
you update it using the return from mbtowc. This means there is no
possibility of ever looking past the end of s even with an ill-formed
UTF-8 string. (3) mbtowc might fail (and it can tell you when the
string has run out) so you can put the call in the while loop test:

while ((mblen = mbtowc(&wc, s, len)) > 0) ...

I'd use a new variable so that...

... here you can put the brakes on if you find the two lengths are not
the same.

Thanks again Ben, I miss Pascal (a lot)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <locale.h>
#include <wchar.h>
#include <wctype.h>

static char *stoupper(const char *s)
{
    char *p = NULL, *oldp, *tmp;
    size_t len, size, used;
    wchar_t wc;
    int wclen, mclen;

    if (s) {
        len = strlen(s);
        /* MB_CUR_MAX bytes of slack, so wctomb can never overrun */
        size = len + MB_CUR_MAX + 1;
        oldp = p = malloc(size);
        if (p) {
            while ((wclen = mbtowc(&wc, s, len)) > 0) {
                /* I know, too many casts, but makes -Wconversion flag happy */
                mclen = wctomb(p, (wchar_t)towupper((wint_t)wc));
                if (mclen < 0) {
                    free(oldp);
                    return NULL;
                }
                /* Strange ... but I always trust Ben:
                   the upper-case form may need more bytes */
                if (mclen > wclen) {
                    used = (size_t)(p - oldp);
                    size += (size_t)(mclen - wclen);
                    /* realloc it's a pain, but what else can I do? */
                    tmp = realloc(oldp, size);
                    if (!tmp) {
                        free(oldp);
                        return NULL;
                    }
                    oldp = tmp;
                    p = oldp + used;
                }
                p += mclen;
                s += wclen;
                len -= (size_t)wclen; /* bytes of s still to read */
            }
            *p = '\0';
            p = oldp; /* back to the start of the buffer */
        }
    }
    return p;
}

int main(void)
{
    char *s = "María tiene moño, Ramón tiene un camión.";

    setlocale(LC_CTYPE, "");
    s = stoupper(s);
    if (s) {
        printf("%s\n", s);
        free(s);
    }
    return 0;
}

Nobody · Sep 25, 2009

Hi friends, here I am trying to avoid wchar_t in UTF8 strings.
Glad to hear some criticism.

Convert to wchar_t[], use towupper(), convert back to UTF-8.

Note: the C standard doesn't guarantee that wchar_t is Unicode, nor does
it provide any function which can reliably convert between a specific
encoding and wchar_t (mbstowcs/wcstombs use the locale's encoding, and the
details of locales are implementation-defined).

Also, note that converting a string to upper-case isn't quite as simple as
replacing each character with another character. For some characters, the
upper-case equivalent consists of multiple characters; e.g. the upper-case
equivalent of "ß" (German sharp s) is "SS".
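
A minimal sketch of the convert, towupper, convert back route, assuming the locale selected with setlocale uses the encoding of the input (UTF-8 here). The name stoupper_wide is made up for illustration. It uses the restartable mbsrtowcs/wcsrtombs functions because the C standard defines their behaviour for a null destination, which gives the required sizes. Since towupper maps one wide character to one wide character, this sketch leaves "ß" unchanged rather than producing "SS".

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper: upper-cases a multibyte string by going through
 * a temporary wchar_t array. Returns a malloc'ed string, or NULL on
 * allocation failure or invalid input for the current locale.
 */
char *stoupper_wide(const char *s)
{
    mbstate_t st;
    const char *src;
    const wchar_t *wsrc;
    wchar_t *w;
    char *out;
    size_t n, i, bytes;

    /* Pass 1: count the wide characters (a null dst only counts). */
    memset(&st, 0, sizeof st);
    src = s;
    n = mbsrtowcs(NULL, &src, 0, &st);
    if (n == (size_t)-1)
        return NULL;

    w = malloc((n + 1) * sizeof *w);
    if (!w)
        return NULL;
    memset(&st, 0, sizeof st);
    src = s;
    mbsrtowcs(w, &src, n + 1, &st);   /* n characters plus L'\0' */

    for (i = 0; i < n; i++)
        w[i] = (wchar_t)towupper((wint_t)w[i]);

    /* Pass 2: measure the multibyte result, then convert back. */
    memset(&st, 0, sizeof st);
    wsrc = w;
    bytes = wcsrtombs(NULL, &wsrc, 0, &st);
    if (bytes == (size_t)-1) {
        free(w);
        return NULL;
    }
    out = malloc(bytes + 1);
    if (out) {
        memset(&st, 0, sizeof st);
        wsrc = w;
        wcsrtombs(out, &wsrc, bytes + 1, &st);
    }
    free(w);
    return out;
}
```

Call setlocale(LC_CTYPE, "") before using it on non-ASCII text, and free() the result. Measuring before allocating means the output buffer is the right size even if a character's upper-case form takes a different number of bytes.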

James Kuyper · Sep 26, 2009

Joe Wright wrote:
...

Nor does the C Standard know anything at all about Unicode.

It may not know enough about Unicode, but it does know something: see
6.4.3 and Annex D.

Nobody · Sep 27, 2009

Joe Wright wrote:
...

It may not know enough about Unicode, but it does know something: see
6.4.3 and Annex D.

Also 6.10.8p2:

__STDC_ISO_10646__ A decimal constant of the form yyyymmL
(for example, 199712L), intended to
indicate that values of type wchar_t are
the coded representations of the
characters defined by ISO/IEC 10646,
along with all amendments and technical
corrigenda as of the specified year and
month.

So wchar_t *might* be Unicode, and if it is, the implementation will state
this. But it isn't required to be.

If it isn't, then you have to either:

a) figure out how to convert UTF-8 to/from wchar_t, in which case you can
then use towupper(), or

b) convert UTF-8 to/from Unicode codepoints yourself (easy enough), but
then you need to write your own towupper() equivalent (which
basically means that you need to get the tables).
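
For option (b), the conversion between UTF-8 and code points can be sketched as below. The function names utf8_decode and utf8_encode are hypothetical, not part of any standard library, and the case-mapping tables mentioned above are deliberately left out. The sketch rejects overlong forms, surrogates, and values above U+10FFFF.

```c
#include <stddef.h>

/*
 * Hypothetical helper: decode one code point from s[0..n).
 * Returns the number of bytes consumed, or 0 if the sequence is
 * invalid or truncated.
 */
size_t utf8_decode(const unsigned char *s, size_t n, unsigned long *cp)
{
    size_t len, i;
    unsigned long c, min;

    if (n == 0)
        return 0;
    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {
        len = 2; c = s[0] & 0x1F; min = 0x80;
    } else if ((s[0] & 0xF0) == 0xE0) {
        len = 3; c = s[0] & 0x0F; min = 0x800;
    } else if ((s[0] & 0xF8) == 0xF0) {
        len = 4; c = s[0] & 0x07; min = 0x10000;
    } else {
        return 0;   /* stray continuation byte or invalid lead byte */
    }
    if (n < len)
        return 0;
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        c = (c << 6) | (s[i] & 0x3F);
    }
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;
    *cp = c;
    return len;
}

/*
 * Hypothetical helper: encode cp into out, which must have room for
 * 4 bytes. Returns the number of bytes written, or 0 if cp is a
 * surrogate or lies above U+10FFFF.
 */
size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

An upper-casing routine built on these would decode a code point, look it up in a mapping table derived from the Unicode data files, and encode the result, allowing for mappings that change the encoded length.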

Keith Thompson · Sep 27, 2009

Nobody said:
Also 6.10.8p2:

__STDC_ISO_10646__ A decimal constant of the form yyyymmL
(for example, 199712L), intended to
indicate that values of type wchar_t are
the coded representations of the
characters defined by ISO/IEC 10646,
along with all amendments and technical
corrigenda as of the specified year and
month.

So wchar_t *might* be Unicode, and if it is, the implementation will state
this. But it isn't required to be.

If it isn't, then you have to either:

a) figure out how to convert UTF-8 to/from wchar_t, in which case you can
then use towupper(), or

If wchar_t values don't represent Unicode code points, then converting
from UTF-8 to wchar_t might not be possible. For example, wchar_t
might be only 16 bits.
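
Both conditions can be tested by the preprocessor. The sketch below combines the __STDC_ISO_10646__ check quoted above with a range check on WCHAR_MAX. The function name wchar_holds_unicode is made up for illustration. Note that the result only describes what wchar_t can represent, not what towupper does with it.

```c
#include <stdint.h>   /* WCHAR_MAX */
#include <wchar.h>

/*
 * Hypothetical helper: returns 1 when the implementation states that
 * wchar_t values are ISO/IEC 10646 code points and wchar_t is wide
 * enough for every code point up to U+10FFFF, otherwise 0.
 */
int wchar_holds_unicode(void)
{
#if defined(__STDC_ISO_10646__) && (WCHAR_MAX >= 0x10FFFF)
    return 1;
#else
    return 0;   /* e.g. a 16-bit wchar_t, or no ISO 10646 guarantee */
#endif
}
```

A program that gets 0 here would have to fall back on its own UTF-8 handling and case tables, as described in option (b).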
