Simple conversion problem

Pakt · Feb 11, 2010

Hi all,

I am hoping someone can provide some help on what I expect is a simple
function. I want to mostly strip out non-ASCII characters (those >
127) from a UTF-8 string, except for a small set of exceptions.
Going into this I thought it would be an easy task, but I've found
surprisingly little information on this sort of
conversion/transliteration. My lack of Unicode experience certainly hasn't
helped and the number of days I've spent on this task is embarrassing.

I spent a lot of time fiddling with libiconv, but it's
incomprehensible and there are simply _no_ examples of
transliteration with iconv. So, I need to reinvent the wheel, albeit
a simpler wheel.

The simpler wheel so far:
...
wchar_t test_string[]= L"jeudi 11 février, le 31e anniversaire de la révolution";
int index=0;

while (test_string[index]) {
    if (test_string[index] > 127) {
        //preserve most accented characters, but strip funny quotes etc...
        if (!((test_string[index] > '\u00C0') &&
              (test_string[index] < '\u017F'))) {
            test_string[index]='';
        }
        else {
            // Do some transliteration to the accented characters here...
            test_string[index]=translit_lookup(test_string[index]);
        }
    }
    index++;
}
printf("String now:%s\n",test_string);
....

Does anyone please have any light to shed on this?

Along the same lines (but more complicated), the string is supplied by
the user so I can't really guarantee that it is in UTF-8, let alone
UCS-2. Is it a good idea to first convert the string (whatever is
supplied) using mbstowcs before attempting the above?

Thanks in advance for any saving of bacon.

Ben Bacarisse · Feb 11, 2010

Pakt said:
I am hoping someone can provide some help on what I expect is a simple
function. I want to mostly strip out non-ASCII characters (those >
127) from a UTF-8 string, except for a small set of exceptions.
Going into this I thought it would be an easy task, but I've found
surprisingly little information on this sort of
conversion/transliteration. My lack of Unicode experience certainly hasn't
helped and the number of days I've spent on this task is embarrassing.

I spent a lot of time fiddling with libiconv, but it's
incomprehensible and there are simply _no_ examples of
transliteration with iconv. So, I need to reinvent the wheel, albeit
a simpler wheel.

The simpler wheel so far:
..
wchar_t test_string[]= L"jeudi 11 février, le 31e anniversaire de la
révolution";

Your string is now not UTF-8 encoded. It is a wide string which makes
things simpler and is certainly one way to go.

int index=0;

while (test_string[index]) {
if (test_string[index] > 127) {
//preserve most accented characters, but strip funny quotes
etc...
if (!((test_string[index] > '\u00C0') &&
(test_string[index] < '\u017F'))) {

First, you want L'\u00C0' there. What you have written is quite
different (and it does not matter what it means -- it is not what you
want at all).

Secondly, that's a rather complex test. It would be simpler written
in the form:

x <= L'\u00C0' || x >= L'\u017F'

or you could just swap the if and else parts to get rid of the !.

test_string[index]='';

You can't set a wchar_t to ''. In fact '' is a syntax error. The way
to remove a character is to copy the string (you can copy it to itself
if you like) but to not copy those characters that you don't want.

}
else {
// Do some transliteration to the accented characters
here...
test_string[index]=translit_lookup(test_string[index]);

I'd put all the work into a function like this. The test for > 127,
and the range of ignored characters are all, logically, part of the
translation you are doing.

}
}
index++;
}
printf("String now:%s\n",test_string);

You need %ls to print a wide string.

...

Does anyone please have any light to shed on this?

I'd write it like this:

#include <stdio.h>
#include <wchar.h>

wchar_t translit_lookup(wchar_t in)
{
    if (in <= 127)
        return in; // unchanged
    else if (in <= L'\u00C0' || in >= L'\u017F')
        return 0; // ignore
    else return '?'; // purely illustrative
}

wchar_t *process(wchar_t *wstr)
{
    int src = 0, dst = 0;
    while (wstr[src]) {
        if ((wstr[dst] = translit_lookup(wstr[src])) != 0)
            ++dst;
        ++src;
    }
    return wstr;
}

int main(void)
{
    wchar_t test_string[] =
        L"jeudi 11 février, le 31e “anniversaire” de la révolution";
    printf("String now: \"%ls\"\n", process(test_string));
    return 0;
}

Along the same lines (but more complicated), the string is supplied by
the user so I can't really guarantee that it is in UTF-8, let alone
UCS-2. Is it a good idea to first convert the string (whatever is
supplied) using mbstowcs before attempting the above?

Your big problem may be knowing the encoding. If you are lucky, the
locale will specify the encoding and you will have to deal only with
strings encoded as per the locale setting. If so, mbstowcs will be the
simplest way to go.

If this is not true, you have a bigger problem to solve but I won't
go into that now.
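
A minimal sketch of the mbstowcs route, assuming the input really is
encoded as the current locale says. The helper name to_wide is
hypothetical and not part of any library. It sizes the buffer from
strlen(), relying on the fact that every wide character consumes at
least one input byte:

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert a multibyte string, encoded as per the
 * current LC_CTYPE locale, into a newly allocated wide string.
 * Returns NULL on an invalid multibyte sequence or allocation failure.
 * The caller frees the result. */
wchar_t *to_wide(const char *mbs)
{
    /* Upper bound: one wide character per input byte, plus terminator. */
    size_t max = strlen(mbs) + 1;
    wchar_t *wcs = malloc(max * sizeof *wcs);
    if (wcs == NULL)
        return NULL;

    /* mbstowcs() returns (size_t)-1 if it meets an invalid sequence. */
    if (mbstowcs(wcs, mbs, max) == (size_t)-1) {
        free(wcs);
        return NULL;
    }
    return wcs;
}
```

The program would call setlocale(LC_CTYPE, "") once at startup so that
the conversion follows the user's locale rather than the default "C"
locale. A NULL result is the cue that the input does not match the
locale's encoding.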

Ersek, Laszlo · Feb 11, 2010

I am hoping someone can provide some help on what I expect is a simple
function. I want to mostly strip out non-ASCII characters (those >
127) from a UTF-8 string, except for a small set of exceptions.

Here's what I propose.

1. Write your source code in the locale (or rather, charset / character
encoding) you use otherwise. (UTF-8 is the best choice, probably.) I
will assume that this locale (charset) will enable you to type all the
characters that you'll want to allow. (You mention a not very big
accepted alphabet.)

2. In said source file, specify the accepted set of characters like
this:

static const wchar_t accepted[] = L"abcdefg....";

3. Run the compiler on your source while in the same locale, aiming at
conformance to C99 6.4.5 "String literals" p5. (Roughly speaking, the
compiler will initialize the "accepted" array via mbstowcs(),
interpreting the multibyte characters according to the current locale.)
In more concrete terms, this will probably mean one of the following
conversions:

* UTF-8 -> UTF-16
* UTF-8 -> UTF-32
* ISO 8859-1 -> UTF-16
* ISO 8859-1 -> UTF-32
* ISO 8859-15 -> UTF-16
* ISO 8859-15 -> UTF-32

As said before, the "source" side of the conversion is determined by an
implementation-defined current locale when the compiler is run. The
target side is not mentioned by the C99 standard (or rather I don't
remember it).

FWIW, in case of gcc, you can explicitly specify both sides with
-finput-charset and -fwide-exec-charset, respectively. This shouldn't be
necessary, though.

4. Accept multibyte input from the user and convert it to a wide string
via mbstowcs() or mbsrtowcs(), or do both steps at once by way of
fscanf(). I would advise against fwscanf() if you want input files to be
portable.

If the stream to be used comes from another part of the program that you
have no control over, use the fwide() function to query the orientation
of the stream, and if it's already wide-oriented, use fwscanf() or
fgetws().

Don't forget to initialize the locale first via setlocale(LC_ALL, "") or
setlocale(LC_CTYPE, ""). In the end, you should have an array of
wchar_t, comparable against "accepted", even if the locale used at
compilation time and the locale used at execution time differ.

5. Use the wcsspn() and wcspbrk() functions in lock-step to find
sequences of accepted and not-accepted characters.

6. Output accepted sequences like written under 4.
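
Step 5 might look roughly like the sketch below. It is untested against
the rest of the proposal, and the function name filter_accepted is
hypothetical. It works on a wide string that is already in memory and
leaves the input and output of steps 4 and 6 aside:

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: copy only those characters of `in` that occur
 * in `accepted` to `out`, which must have room for wcslen(in) + 1 wide
 * characters. Returns the number of wide characters written, not
 * counting the terminator. */
size_t filter_accepted(const wchar_t *in, const wchar_t *accepted,
                       wchar_t *out)
{
    size_t written = 0;

    while (*in != L'\0') {
        /* Length of the leading run of accepted characters. */
        size_t run = wcsspn(in, accepted);
        wmemcpy(out + written, in, run);
        written += run;
        in += run;

        /* Skip the rejected run: wcspbrk() finds the next accepted
         * character, or returns NULL if none is left. */
        const wchar_t *next = wcspbrk(in, accepted);
        if (next == NULL)
            break;
        in = next;
    }
    out[written] = L'\0';
    return written;
}
```

Because the comparison is done against the "accepted" array itself, it
makes no assumption about the numeric values stored in wchar_t.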

Please anybody point out mistakes in the above, I didn't try it yet. I
hope to write an example demonstrating it later, like

static int
filter_stream_linewise(FILE *in_stream, FILE *out_stream,
                       const wchar_t *accepted)
{
    /* ... */
}

int
main(int argc, char **argv)
{
    /* ... */
    res = filter_stream_linewise(stdin, stdout, L"...");
    /* ... */
}

Cheers,
lacos

Pakt · Feb 14, 2010

Thank you both very much for the advice, both have been very helpful.

Out of interest I compiled and ran Ben's suggestion and it worked
perfectly, but I have a quick question about an embarrassingly simple
modification to the translit_lookup function.

If you are still reading, I would like to modify it so that any
accented characters within the range u00C0->u017F are returned as is,
instead of as '?', thereby (hopefully) preserving those characters in
the original string.

wchar_t translit_lookup(wchar_t in)
{
    if (in <= 127)
        return in; // unchanged
    else if (in <= L'\u00C0' || in >= L'\u017F')
        return 0; // ignore

    // else return '?'; // * Change this line from this

    else return in; //* to this
}

But, when I do this my output only shows:

String now: "

I am puzzled by why this happens, and how I should modify it to work
correctly. Thank you again for any assistance.

Ersek, Laszlo · Feb 14, 2010

If you are still reading, I would like to modify it so that any
accented characters within the range u00C0->u017F are returned as is,
instead of as '?', thereby (hopefully) preserving those characters in
the original string.

wchar_t translit_lookup(wchar_t in)
{
    if (in <= 127)
        return in; // unchanged
    else if (in <= L'\u00C0' || in >= L'\u017F')
        return 0; // ignore

    // else return '?'; // * Change this line from this

    else return in; //* to this
}

But, when I do this my output only shows:

String now: "

I am puzzled by why this happens, and how I should modify it to work
correctly. Thank you again for any assistance.

Please pipe the output of the program into "hexdump -C". The trailing
double-quote is not shown either; I think the output of the program may
be messing up your terminal.

What does "locale" say right before you compile the program? What does
it say right before you execute it? Also, please post the output of

LC_ALL=C grep 'de la r' source.c | hexdump -C

--o--

You could also merge the two "return in" statements under a single
condition.

return (in <= L'\u00C0' || in >= L'\u017F') ? L'\0' : in;

--o--

Ben's process() from

http://groups.google.com/group/comp.lang.c/msg/7dbb3e49ba19eeed

doesn't seem to NUL-terminate the string early enough when at least one
ignored wide character occurs.

--o--

I believe you cannot, in general, rely on

(wchar_t)0xABCD == L'\uABCD'

Consequently, if the ISO/IEC 10646 four-digit short identifier of
character X precedes (when interpreted as a hexadecimal string) that of
character Y, that doesn't imply that their wchar_t representations will
have the same relationship.

I think you should use an explicit alphabet of accepted characters for
portability, or at least check for __STDC_ISO_10646__:

C99 6.10.8 "Predefined macro names", paragraph 2:

----v----
The following macro names are conditionally defined by the
implementation:

[...]

__STDC_ISO_10646__

An integer constant of the form yyyymmL (for example, 199712L), intended
to indicate that values of type wchar_t are the coded representations of
the characters defined by ISO/IEC 10646, along with all amendments and
technical corrigenda as of the specified year and month.
----^----

See also

http://www.gnu.org/software/libc/manual/html_node/Extended-Char-Intro.html

----v----
One final comment about the choice of the wide character representation
is necessary at this point. We have said above that the natural choice
is using Unicode or ISO 10646. This is not required, but at least
encouraged, by the ISO C standard. The standard defines at least a macro
__STDC_ISO_10646__ that is only defined on systems where the wchar_t
type encodes ISO 10646 characters. If this symbol is not defined one
should avoid making assumptions about the wide character representation.
If the programmer uses only the functions provided by the C library to
handle wide character strings there should be no compatibility problems
with other systems.
----^----
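
A rough sketch of how the two approaches could be combined, assuming a
small explicit alphabet as the fallback. The names is_kept and
accepted_extra are hypothetical, and the three letters in the fallback
alphabet are placeholders. The range test keeps the thread's strict
bounds (above U+00C0, below U+017F) and is used only where the
implementation promises ISO/IEC 10646 values in wchar_t:

```c
#include <wchar.h>

/* Hypothetical fallback alphabet, written with universal character
 * names so it does not depend on the encoding of the source file. */
static const wchar_t accepted_extra[] = L"\u00E9\u00E8\u00E0";

/* Hypothetical helper: returns 1 if `c` should be kept, 0 otherwise. */
int is_kept(wchar_t c)
{
    if (c <= 127)
        return 1;  /* ASCII passes through, as in translit_lookup() */
#ifdef __STDC_ISO_10646__
    /* wchar_t values are ISO/IEC 10646 code points here, so comparing
     * against the numeric range is meaningful. */
    return c > 0x00C0 && c < 0x017F;
#else
    /* No such guarantee: compare against the explicit alphabet. */
    return wcschr(accepted_extra, c) != NULL;
#endif
}
```

On an implementation that does not define the macro, only the characters
listed in accepted_extra survive, so that list would need to hold the
full accepted alphabet.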

Cheers,
lacos

Ben Bacarisse · Feb 14, 2010

Pakt said:
Thank you both very much for the advice, both have been very helpful.

Out of interest I compiled and ran Ben's suggestion and it worked
perfectly, but I have a quick question about an embarrassingly simple
modification to the translit_lookup function.

No, not embarrassing for you, but for me. I forgot a call to
setlocale that is needed before you can convert wide chars to
multi-byte strings on output. I.e. you can do your transliteration,
but the printf won't work without it. My test worked because it
removed all problematic characters.

But, when I do this my output only shows:

String now: "

I am puzzled by why this happens, and how I should modify it to work
correctly. Thank you again for any assistance.

I also forgot to null-terminate the string in the "process" function.
Try this:

#include <stdio.h>
#include <locale.h>
#include <wchar.h>

wchar_t translit_lookup(wchar_t in)
{
    if (in <= 127)
        return in; // unchanged
    else if (in <= L'\u00C0' || in >= L'\u017F')
        return 0; // ignore
    // else return '?'; // purely illustrative
    else return in;
}

wchar_t *process(wchar_t *wstr)
{
    int src = 0, dst = 0;
    while (wstr[src]) {
        if ((wstr[dst] = translit_lookup(wstr[src])) != 0)
            ++dst;
        ++src;
    }
    wstr[dst] = 0;
    return wstr;
}

int main(void)
{
    wchar_t test_string[] =
        L"jeudi 11 février, le 31e “anniversaire” de la révolution";
    setlocale(LC_ALL, "");
    printf("String now: \"%ls\"\n", process(test_string));
    return 0;
}

(I've left your change in place). Fingers crossed that I've not made
any more basic errors!

Ersek, Laszlo · Feb 14, 2010

You could also merge the two "return in" statements under a single
condition.

return (in <= L'\u00C0' || in >= L'\u017F') ? L'\0' : in;

How stupid. Sorry.

lacos
/facepalm
