toupper UTF8 string


David RF

Hi friends, here I am trying to avoid wchar_t in UTF8 strings.
I'd be glad to hear some criticism.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
Return a newly allocated string
Upper from (a - z) and (ÿþýüûúùø÷öõôóòñðïîíìëêéèçæåäãâáà)
-61 = first byte
-65 = ÿ
-96 = à
*/
char *stoupper(const char *s)
{
    size_t len;
    char *p = NULL;
    int c = 0;

    if (s) {
        len = strlen(s);
        p = malloc(len + 1);
        if (p) {
            while (*s) {
                if ((*s >= 'a') && (*s <= 'z')) {
                    c = *p = *s - 'a' + 'A';
                } else if ((c == -61) && ((*s <= -65) && (*s >= -96))) {
                    c = *p = *s - 32;
                } else {
                    c = *p = *s;
                }
                p++;
                s++;
            }
            *p = '\0';
            p -= len;
        }
    }
    return p;
}

int main(void)
{
    char *s = "María tiene moño, Ramón tiene un camión.";

    s = stoupper(s);
    if (s) {
        printf("%s\n", s);
        free(s);
    }
    return 0;
}
 

Ben Bacarisse

David RF said:
Hi friends, here I am trying to avoid wchar_t in UTF8 strings.

Why? Without knowing why, it is almost impossible to comment on the
code. It relies on a set of assumptions that might be acceptable but
I can't tell without knowing why you are not using C's multi-byte
string functions.

For example you assume char is signed.
I'd be glad to hear some criticism.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
Return a newly allocated string
Upper from (a - z) and (ÿþýüûúùø÷öõôóòñðïîíìëêéèçæåäãâáà)
-61 = first byte
-65 = ÿ
-96 = à
*/

It can't work for ÿ (there is a Ÿ but it is not where your code
expects it to be) and upper-casing ÷ to × is just odd!
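
For readers following along, here is a minimal sketch of how the byte-level approach could avoid both problems Ben points out. It works on unsigned char, so it does not depend on char being signed, and it leaves ÷ (0xC3 0xB7) and ÿ (0xC3 0xBF) untouched. The function name utf8_latin1_toupper is made up for this illustration and is not from the thread. It still covers only a-z and the two-byte sequences for à..þ, nothing more.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not from the thread): returns a newly allocated
   copy of s with ASCII a-z and the UTF-8 sequences 0xC3 0xA0..0xBE
   (à..þ, except ÷) upper-cased. Returns NULL on NULL input or if
   malloc fails. */
static char *utf8_latin1_toupper(const char *s)
{
    size_t i, len;
    unsigned char *out;

    if (!s)
        return NULL;
    len = strlen(s);
    out = malloc(len + 1);
    if (!out)
        return NULL;
    memcpy(out, s, len + 1);

    for (i = 0; i < len; i++) {
        if (out[i] >= 'a' && out[i] <= 'z') {
            out[i] = (unsigned char)(out[i] - 'a' + 'A');
        } else if (out[i] == 0xC3 && i + 1 < len
                   && (out[i + 1] & 0xC0) == 0x80) {
            unsigned char t = out[i + 1];
            /* 0xB7 is ÷ (its "upper case" would be ×) and 0xBF is ÿ
               (Ÿ lives elsewhere in Unicode), so both are skipped. */
            if (t >= 0xA0 && t <= 0xBE && t != 0xB7)
                out[i + 1] = (unsigned char)(t - 0x20);
            i++; /* step over the continuation byte */
        }
    }
    return (char *)out;
}
```

The caller owns the returned buffer and should free() it.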

<snip>
 

David RF

It can't work for ÿ (there is a Ÿ but it is not where your code
expects it to be) and upper-casing ÷ to × is just odd!

You're right
I can't tell without knowing why you are not using C's multi-byte
string functions.

Perhaps it is time to take a look at those functions :)
 

David RF

Why?  Without knowing why, it is almost impossible to comment on the
code.  It relies on a set of assumptions that might be acceptable but
I can't tell without knowing why you are not using C's multi-byte
string functions.

Is this another way to do it? I am a rookie with wchar_t.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <locale.h>
#include <wchar.h>
#include <wctype.h>

char *stoupper(const char *s)
{
    char *p = NULL;
    wchar_t wc;
    size_t len;
    int mblen;

    if (s) {
        len = strlen(s);
        p = malloc(len + 1);
        if (p) {
            while (*s) {
                mbtowc(&wc, s, MB_CUR_MAX);
                wc = towupper(wc);
                mblen = wctomb(p, wc);
                p += mblen;
                s += mblen;
            }
            *p = '\0';
            p -= len;
        }
    }
    return p;
}

int main(void)
{
    char *s = "María tiene moño, Ramón tiene un camión.";

    setlocale(LC_CTYPE, "");
    s = stoupper(s);
    if (s) {
        printf("%s\n", s);
        free(s);
    }
    return 0;
}

Thanks again Ben
 

Ben Bacarisse

David RF said:
Is this another way to do it? I am a rookie with wchar_t.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <locale.h>
#include <wchar.h>
#include <wctype.h>

char *stoupper(const char *s)
{
char *p = NULL;
wchar_t wc;
size_t len;
int mblen;

if (s) {
len = strlen(s);
p = malloc(len + 1);
if (p) {
while (*s) {
mbtowc(&wc, s, MB_CUR_MAX);

I'd make a few small changes here. (1) mbtowc tells you how many chars
it used to make the wide one. You can use this later on to confirm
your assumption that the overall length is not changed by
upper-casing. (2) You can pass len instead of MB_CUR_MAX so long as
you update it using the return from mbtowc. This means there is no
possibility of ever looking past the end of s, even with an ill-formed
UTF-8 string. (3) mbtowc might fail (and it can tell you when the
string has run out), so you can put the call in the while loop test:

while ((mblen = mbtowc(&wc, s, len)) > 0) ...
wc = towupper(wc);
mblen = wctomb(p, wc);

I'd use a new variable so that...
p += mblen;
s += mblen;

... here you can put the brakes on if you find the two lengths are not
the same.
}
*p = '\0';
p -= len;
}
}
return p;
}
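
As a rough illustration of the three suggestions taken together, the sketch below only checks whether upper-casing keeps every character's byte length, without building the result. The helper name upper_keeps_length is hypothetical and not from the thread. It assumes the caller has already selected a suitable locale with setlocale(); in the default "C" locale only plain ASCII input is guaranteed to convert.

```c
#include <limits.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper (not from the thread).
   Returns 1 if every character of s has an upper-case form with the
   same encoded length, 0 if some character changes length, and -1 if
   s is ill-formed or a character cannot be converted back. */
static int upper_keeps_length(const char *s)
{
    char out[MB_LEN_MAX];
    size_t left = strlen(s) + 1; /* count the '\0' so mbtowc can return 0 */
    wchar_t wc;
    int inlen, outlen;

    /* (3) the call is the loop test; (2) it never reads past left bytes */
    while ((inlen = mbtowc(&wc, s, left)) > 0) {
        outlen = wctomb(out, (wchar_t)towupper((wint_t)wc));
        if (outlen < 0)
            return -1;
        if (outlen != inlen) /* (1) compare the two lengths */
            return 0;
        s += inlen;
        left -= (size_t)inlen;
    }
    return inlen < 0 ? -1 : 1;
}
```

A caller could use the result to decide whether a buffer of strlen(s) + 1 bytes is enough or whether the output has to be allowed to grow.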

<snip>
 

David RF

I'd make a few small changes here. (1) mbtowc tells you how many chars
it used to make the wide one. You can use this later on to confirm
your assumption that the overall length is not changed by
upper-casing. (2) You can pass len instead of MB_CUR_MAX so long as
you update it using the return from mbtowc. This means there is no
possibility of ever looking past the end of s, even with an ill-formed
UTF-8 string. (3) mbtowc might fail (and it can tell you when the
string has run out), so you can put the call in the while loop test:

  while ((mblen = mbtowc(&wc, s, len)) > 0) ...


I'd use a new variable so that...


... here you can put the brakes on if you find the two lengths are not
the same.

Thanks again Ben, I miss Pascal (a lot) :)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <locale.h>
#include <wchar.h>
#include <wctype.h>

static char *stoupper(const char *s)
{
    char *buf = NULL, *tmp;
    size_t len, cap, used = 0;
    wchar_t wc;
    int wclen, mclen;

    if (s) {
        len = strlen(s);            /* input bytes still to be read */
        cap = len + MB_CUR_MAX + 1; /* room for one more character and '\0' */
        buf = malloc(cap);
        if (buf) {
            /* len + 1 lets mbtowc see the terminating '\0' and return 0 */
            while ((wclen = mbtowc(&wc, s, len + 1)) > 0) {
                /* Strange ... but I always trust Ben :) The upper-case
                   form may be longer, so keep MB_CUR_MAX + 1 bytes free */
                if (cap - used < MB_CUR_MAX + 1) {
                    /* realloc is a pain, but what else can I do? */
                    cap += len + MB_CUR_MAX;
                    tmp = realloc(buf, cap);
                    if (!tmp) {
                        free(buf);
                        return NULL;
                    }
                    buf = tmp;
                }
                /* I know, too many casts, but they keep -Wconversion happy */
                mclen = wctomb(buf + used, (wchar_t)towupper((wint_t)wc));
                if (mclen < 0) {
                    free(buf);
                    return NULL;
                }
                used += (size_t)mclen;
                s += wclen;
                len -= (size_t)wclen;
            }
            if (wclen < 0) { /* ill-formed input */
                free(buf);
                return NULL;
            }
            buf[used] = '\0';
        }
    }
    return buf;
}

int main(void)
{
    char *s = "María tiene moño, Ramón tiene un camión.";

    setlocale(LC_CTYPE, "");
    s = stoupper(s);
    if (s) {
        printf("%s\n", s);
        free(s);
    }
    return 0;
}
 

Nobody

Hi friends, here I am trying to avoid wchar_t in UTF8 strings.
I'd be glad to hear some criticism.

Convert to wchar_t[], use towupper(), convert back to UTF-8.

Note: the C standard doesn't guarantee that wchar_t is Unicode, nor does
it provide any function which can reliably convert between a specific
encoding and wchar_t (mbstowcs/wcstombs use the locale's encoding, and the
details of locales are implementation-defined).

Also, note that converting a string to upper-case isn't quite as simple as
replacing each character with another character. For some characters, the
upper-case equivalent consists of multiple characters; e.g. the upper-case
equivalent of "ß" (German sharp s) is "SS".
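
A minimal sketch of the "convert to wchar_t[], use towupper(), convert back" route might look like the following. The function name stoupper_wide is invented for this example. It relies on the current locale's encoding (so setlocale(LC_CTYPE, "") must have selected a UTF-8 locale for UTF-8 input), and because towupper() maps one character to one character, it does not turn "ß" into "SS".

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper (not from the thread):
   multibyte string -> wchar_t[] -> towupper() -> multibyte string.
   Returns a newly allocated string, or NULL on error. */
static char *stoupper_wide(const char *s)
{
    mbstate_t st;
    const char *src = s;
    const wchar_t *wsrc;
    wchar_t *w;
    char *out = NULL;
    size_t n, m, i;

    memset(&st, 0, sizeof st);
    n = mbsrtowcs(NULL, &src, 0, &st);  /* count the wide characters */
    if (n == (size_t)-1)
        return NULL;                    /* ill-formed input */
    w = malloc((n + 1) * sizeof *w);
    if (!w)
        return NULL;
    src = s;
    memset(&st, 0, sizeof st);
    mbsrtowcs(w, &src, n + 1, &st);

    for (i = 0; i < n; i++)
        w[i] = (wchar_t)towupper((wint_t)w[i]);

    wsrc = w;
    memset(&st, 0, sizeof st);
    m = wcsrtombs(NULL, &wsrc, 0, &st); /* bytes needed, without '\0' */
    if (m != (size_t)-1 && (out = malloc(m + 1)) != NULL) {
        wsrc = w;
        memset(&st, 0, sizeof st);
        wcsrtombs(out, &wsrc, m + 1, &st);
    }
    free(w);
    return out;
}
```

Measuring the output with wcsrtombs() before allocating means the result may be longer or shorter than the input without any realloc.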
 

James Kuyper

Joe Wright wrote:
....
Nor does the C Standard know anything at all about Unicode.

It may not know enough about Unicode, but it does know something: see
6.4.3 and Annex D.
 

Nobody

Joe Wright wrote:
...

It may not know enough about Unicode, but it does know something: see
6.4.3 and Annex D.

Also 6.10.8p2:

__STDC_ISO_10646__ A decimal constant of the form yyyymmL
(for example, 199712L), intended to
indicate that values of type wchar_t are
the coded representations of the
characters defined by ISO/IEC 10646,
along with all amendments and technical
corrigenda as of the specified year and
month.

So wchar_t *might* be Unicode, and if it is, the implementation will state
this. But it isn't required to be.

If it isn't, then you have to either:

a) figure out how to convert UTF-8 to/from wchar_t, in which case you can
then use towupper(), or

b) convert UTF-8 to/from Unicode codepoints yourself (easy enough), but
then you need to write your own towupper() equivalent (which
basically means that you need to get the tables).
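
To give an idea of what option (b) involves, here is a small sketch of converting between UTF-8 and code points by hand, independent of wchar_t and the locale. The names utf8_decode and utf8_encode are made up for this example, and the case-mapping tables that option (b) also requires are not shown.

```c
#include <stddef.h>

/* Hypothetical helper: decodes one code point from at most n bytes at s.
   Returns the number of bytes consumed, or 0 if the sequence is
   ill-formed (truncated, overlong, surrogate, or above U+10FFFF). */
static size_t utf8_decode(const unsigned char *s, size_t n, unsigned long *cp)
{
    static const unsigned long min[] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t len, i;
    unsigned long c;

    if (n == 0)
        return 0;
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; c = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; c = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; c = s[0] & 0x07; }
    else return 0;
    if (len > n)
        return 0;
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        c = (c << 6) | (s[i] & 0x3F);
    }
    if (c < min[len] || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;
    *cp = c;
    return len;
}

/* Hypothetical helper: encodes cp into out (at least 4 bytes).
   Returns the number of bytes written, or 0 if cp is not encodable. */
static size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0; /* surrogates are not valid in UTF-8 */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

With these two pieces, upper-casing becomes decode, look up in your own table, encode; the table is the part that takes real work.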
 

Keith Thompson

Nobody said:
Also 6.10.8p2:

__STDC_ISO_10646__ A decimal constant of the form yyyymmL
(for example, 199712L), intended to
indicate that values of type wchar_t are
the coded representations of the
characters defined by ISO/IEC 10646,
along with all amendments and technical
corrigenda as of the specified year and
month.

So wchar_t *might* be Unicode, and if it is, the implementation will state
this. But it isn't required to be.

If it isn't, then you have to either:

a) figure out how to convert UTF-8 to/from wchar_t, in which case you can
then use towupper(), or

If wchar_t values don't represent Unicode code points, then converting
from UTF-8 to wchar_t might not be possible. For example, wchar_t
might be only 16 bits.
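
A program can at least check both conditions at compile time. The sketch below is only an illustration, with a made-up function name; it tests __STDC_ISO_10646__ (quoted above) together with WCHAR_MAX, which is available in <wchar.h> and <stdint.h>.

```c
#include <stdint.h>
#include <wchar.h>

/* Hypothetical helper (not from the thread): returns 1 if the
   implementation promises that wchar_t holds ISO/IEC 10646 values
   and wchar_t is wide enough for every code point up to U+10FFFF,
   otherwise 0. */
static int wchar_is_unicode_capable(void)
{
#if defined(__STDC_ISO_10646__) && WCHAR_MAX >= 0x10FFFF
    return 1;
#else
    return 0;
#endif
}
```

When this returns 0, the wchar_t route cannot be relied on for arbitrary UTF-8 input and a hand-written code point conversion is the safer choice.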
 
