Pakt
Hi all,
I am hoping someone can provide some help on what I expect is a simple
function. I want to strip out most non-ASCII characters (those >
127) from a UTF-8 string, apart from a small set of exceptions.
Going into this I thought it would be an easy task, but I've found
surprisingly little information on this sort of conversion/
transliteration. My lack of Unicode experience certainly hasn't
helped and the number of days I've spent on this task is embarrassing.
I spent a lot of time fiddling with libiconv, but it's
incomprehensible and there are simply _no_ examples of
transliteration with iconv. So, I need to reinvent the wheel, albeit
a simpler wheel.
The simpler wheel so far:
...
wchar_t test_string[] = L"jeudi 11 février, le 31e anniversaire de la révolution";
size_t in = 0, out = 0;
while (test_string[in]) {
    wchar_t c = test_string[in++];
    if (c > 127) {
        // preserve most accented characters, but strip funny quotes etc...
        if (c < L'\u00C0' || c > L'\u017F') {
            continue;  // dropped: nothing is copied to the output position
        }
        // Do some transliteration to the accented characters here...
        c = translit_lookup(c);
    }
    test_string[out++] = c;  // compact the string in place
}
test_string[out] = L'\0';
// %ls (not %s) for wide strings; needs setlocale(LC_ALL, "") earlier
// for any non-ASCII characters that are left
printf("String now:%ls\n", test_string);
....
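To make that testable, here is the same idea as a self-contained function. Treat it as a rough sketch only: this translit_lookup() is a hypothetical stand-in with a handful of entries (the real table would have to cover the whole U+00C0..U+017F range), and strip_non_ascii() is just a name I made up for the loop.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical stand-in for translit_lookup(): maps a few accented
   letters to ASCII and returns anything else unchanged, so unmapped
   characters in the kept range stay non-ASCII. */
static wchar_t translit_lookup(wchar_t c)
{
    switch (c) {
    case 0x00C0: case 0x00C1: case 0x00C2: return L'A';
    case 0x00C8: case 0x00C9: case 0x00CA: return L'E';
    case 0x00E0: case 0x00E1: case 0x00E2: return L'a';
    case 0x00E7: return L'c';
    case 0x00E8: case 0x00E9: case 0x00EA: case 0x00EB: return L'e';
    default: return c;
    }
}

/* Filters s in place and returns the new length in wide characters. */
static size_t strip_non_ascii(wchar_t *s)
{
    size_t in = 0, out = 0;

    assert(s != NULL);
    while (s[in]) {
        wchar_t c = s[in++];
        if (c > 127) {
            if (c < 0x00C0 || c > 0x017F)
                continue;            /* outside the kept range: drop it */
            c = translit_lookup(c);  /* inside the range: try to map it */
        }
        s[out++] = c;                /* copy down over any dropped chars */
    }
    s[out] = L'\0';
    return out;
}
```

Called on a modifiable wchar_t array, this drops the curly quotes and turns é into e. Characters can't be deleted by assigning an empty constant, which is why the loop keeps separate read and write positions.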
Does anyone please have any light to shed on this?
Along the same lines (but more complicated), the string is supplied by
the user, so I can't really guarantee that it is in UTF-8, let alone
UCS-2. Is it a good idea to first convert the string (whatever is
supplied) using mbstowcs before attempting the above?
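For concreteness, the conversion step I have in mind is roughly the following. It is only a sketch: decode_user_string() is a name I made up, and it assumes the program has already called setlocale(LC_ALL, "") so that mbstowcs decodes with the user's locale encoding. As far as I understand, mbstowcs can only reject byte sequences that are invalid in that encoding; it can't tell me which encoding the user actually used.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: converts src (in the current locale's encoding)
   into dst, which has room for dst_len wide characters. Returns the
   number of characters written, excluding the terminator, or (size_t)-1
   if the input is invalid for the locale or does not fit. */
static size_t decode_user_string(const char *src, wchar_t *dst, size_t dst_len)
{
    size_t n;

    assert(src != NULL && dst != NULL);
    if (dst_len == 0)
        return (size_t)-1;

    n = mbstowcs(dst, src, dst_len);
    if (n == (size_t)-1)
        return (size_t)-1;  /* invalid multibyte sequence */
    if (n == dst_len)
        return (size_t)-1;  /* buffer filled, no room for the terminator */
    return n;
}
```

The n == dst_len check is there because mbstowcs does not write a terminating null when it fills the whole buffer.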
Thanks in advance for any saving of bacon.