Simple conversion problem

Discussion in 'C Programming' started by Pakt, Feb 11, 2010.

  1. Pakt

    Pakt Guest

    Hi all,

    I am hoping someone can provide some help on what I expect is a simple
    function. I want to mostly strip out non ascii characters (those >
    127) from a utf-8 string, except for a small set of exceptions.
    Going into this I thought it would be an easy task, but I've found
    surprisingly little information on this sort of conversion/
    transliteration. My lack of unicode experience certainly hasn't
    helped and the number of days I've spent on this task is embarrassing.

    I spent a lot of time fiddling with libiconv, but it's
    incomprehensible and there are simply _no_ examples of
    transliteration with iconv. So, I need to reinvent the wheel, albeit
    a simpler wheel.

    The simpler wheel so far:
    ...
    wchar_t test_string[]= L"jeudi 11 février, le 31e anniversaire de la
    révolution";
    int index=0;

    while (test_string[index]) {
    if (test_string[index] > 127) {
    //preserve most accented characters, but strip funny quotes
    etc...
    if (!((test_string[index] > '\u00C0') &&
    (test_string[index] < '\u017F'))) {
    test_string[index]='';
    }
    else {
    // Do some transliteration to the accented characters
    here...
    test_string[index]=translit_lookup(test_string[index]);
    }
    }
    index++;
    }
    printf("String now:%s\n",test_string);
    ....

    Does anyone please have any light to shed on this?


    Along the same lines (but more complicated), the string is supplied by
    the user so I can't really guarantee that it is in utf-8, let alone
    UCS-2. Is it a good idea to first convert the string (whatever is
    supplied) using mbstowcs before attempting the above?

    Thanks in advance for any saving of bacon.
    Pakt, Feb 11, 2010
    #1

  2. Pakt <> writes:

    > I am hoping someone can provide some help on what I expect is a simple
    > function. I want to mostly strip out non ascii characters (those >
    > 127) from a utf-8 string, except for a small set of exceptions.
    > Going into this I thought it would be an easy task, but I've found
    > surprisingly little information on this sort of conversion/
    > transliteration. My lack of unicode experience certainly hasn't
    > helped and the number of days I've spent on this task is embarrassing.
    >
    > I spent a lot of time fiddling with libiconv, but it's
    > incomprehensible and there are simply _no_ examples of
    > transliteration with iconv. So, I need to reinvent the wheel, albeit
    > a simpler wheel.
    >
    > The simpler wheel so far:
    > ..
    > wchar_t test_string[]= L"jeudi 11 février, le 31e anniversaire de la
    > révolution";


    Your string is now not UTF-8 encoded. It is a wide string which makes
    things simpler and is certainly one way to go.

    > int index=0;
    >
    > while (test_string[index]) {
    > if (test_string[index] > 127) {
    > //preserve most accented characters, but strip funny quotes
    > etc...
    > if (!((test_string[index] > '\u00C0') &&
    > (test_string[index] < '\u017F'))) {


    First, you want L'\u00C0' there. What you have written is quite
    different (and it does not matter what it means -- it is not what you
    want at all).

    Secondly, that's a rather complex test. It would be simpler written
    in the form:

    x <= L'\u00C0' || x >= L'\u017F'

    or you could just swap the if and else parts to get rid of the !.

    > test_string[index]='';


    You can't set a wchar_t to ''. In fact '' is a syntax error. The way
    to remove a character is to copy the string (you can copy it to itself
    if you like) but to not copy those characters that you don't want.

    > }
    > else {
    > // Do some transliteration to the accented characters
    > here...
    > test_string[index]=translit_lookup(test_string[index]);


    I'd put all the work into a function like this. The test for > 127,
    and the range of ignored characters are all, logically, part of the
    translation you are doing.

    > }
    > }
    > index++;
    > }
    > printf("String now:%s\n",test_string);


    You need %ls to print a wide string.

    > ...
    >
    > Does anyone please have any light to shed on this?


    I'd write it like this:

    #include <stdio.h>
    #include <wchar.h>

    wchar_t translit_lookup(wchar_t in)
    {
        if (in <= 127)
            return in; // unchanged
        else if (in <= L'\u00C0' || in >= L'\u017F')
            return 0; // ignore
        else return '?'; // purely illustrative
    }

    wchar_t *process(wchar_t *wstr)
    {
        int src = 0, dst = 0;
        while (wstr[src]) {
            if ((wstr[dst] = translit_lookup(wstr[src])) != 0)
                ++dst;
            ++src;
        }
        return wstr;
    }

    int main(void)
    {
        wchar_t test_string[] =
            L"jeudi 11 février, le 31e “anniversaire” de la révolution";
        printf("String now: \"%ls\"\n", process(test_string));
        return 0;
    }

    > Along the same lines (but more complicated), the string is supplied by
    > the user so I can't really guarantee that it is in utf-8, let alone
    > UCS-2. Is it a good idea to first convert the string (whatever is
    > supplied) using mbstowcs before attempting the above?


    Your big problem may be knowing the encoding. If you are lucky, the
    locale will specify the encoding and you will have to deal only with
    strings encoded as per the locale setting. If so, mbstowcs will be the
    simplest way to go.

    If this is not true, you have a bigger problem to solve but I won't
    go into that now.
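
    A minimal sketch of the mbstowcs route, assuming the input really is
    encoded as per the current locale. The helper name mb_to_wide is made
    up for illustration and is not from the thread:

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Convert a multibyte string, encoded as per the current locale, into a
   newly allocated wide string. Returns NULL if allocation fails or if the
   input is not valid in that encoding. The caller frees the result.
   (mb_to_wide is a hypothetical name used only in this sketch.) */
wchar_t *mb_to_wide(const char *mbs)
{
    /* Every wide character consumes at least one byte of input, so
       strlen(mbs) + 1 elements are always enough, terminator included. */
    size_t cap = strlen(mbs) + 1;
    wchar_t *ws = malloc(cap * sizeof *ws);

    if (ws == NULL)
        return NULL;

    /* mbstowcs returns (size_t)-1 on an invalid multibyte sequence. */
    if (mbstowcs(ws, mbs, cap) == (size_t)-1) {
        free(ws);
        return NULL;
    }
    return ws;
}
```

    Call setlocale(LC_CTYPE, "") once at startup before using it, otherwise
    the conversion runs in the "C" locale. A NULL return on a non-empty
    input is a hint that the string is not in the locale's encoding.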

    --
    Ben.
    Ben Bacarisse, Feb 11, 2010
    #2

  3. In article <>, Pakt <> writes:

    > I am hoping someone can provide some help on what I expect is a simple
    > function. I want to mostly strip out non ascii characters (those >
    > 127) from a utf-8 string, except for a small set of exceptions.


    Here's what I propose.

    1. Write your source code in the locale (or rather, charset / character
    encoding) you use otherwise. (UTF-8 is the best choice, probably.) I
    will assume that this locale (charset) will enable you to type all the
    characters that you'll want to allow. (You mention a not very big
    accepted alphabet.)

    2. In said source file, specify the accepted set of characters like
    this:

    static const wchar_t accepted[] = L"abcdefg....";

    3. Run the compiler on your source while in the same locale, aiming at
    conformance to C99 6.4.5 "String literals" p5. (Roughly speaking, the
    compiler will initialize the "accepted" array via mbstowcs(),
    interpreting the multibyte characters according to the current locale.)
    In more concrete terms, this will probably mean one of the following
    conversions:

    * UTF-8 -> UTF-16
    * UTF-8 -> UTF-32
    * ISO 8859-1 -> UTF-16
    * ISO 8859-1 -> UTF-32
    * ISO 8859-15 -> UTF-16
    * ISO 8859-15 -> UTF-32

    As said before, the "source" side of the conversion is determined by an
    implementation-defined current locale when the compiler is run. The
    target side is not mentioned by the C99 standard (or rather I don't
    remember it).

    FWIW, in case of gcc, you can explicitly specify both sides with
    -finput-charset and -fwide-exec-charset, respectively. This shouldn't be
    necessary, though.

    4. Accept multibyte input from the user and convert it to a wide string
    via mbstowcs() or mbsrtowcs(), or do both steps at once by way of
    fscanf(). I would advise against fwscanf() if you want input files to be
    portable.

    If the stream to be used comes from another part of the program that you
    have no control over, use the fwide() function to query the orientation
    of the stream, and if it's already wide-oriented, use fwscanf() or
    fgetws().

    Don't forget to initialize the locale first via setlocale(LC_ALL, "") or
    setlocale(LC_CTYPE, ""). In the end, you should have an array of
    wchar_t, comparable against "accepted", even if the locale used at
    compilation time and the locale used at execution time differ.

    5. Use the wcsspn() and wcspbrk() functions in lock-step to find
    sequences of accepted and not-accepted characters.

    6. Output accepted sequences in the way described under 4.
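
    Step 5 could be sketched roughly as follows, working on an in-memory
    wide string rather than a stream. The function name keep_accepted and
    the caller-supplied output buffer are assumptions made for this
    illustration:

```c
#include <wchar.h>

/* Copy into out only those characters of in that occur in accepted.
   out must have room for wcslen(in) + 1 wide characters.
   (keep_accepted is a hypothetical name used only in this sketch.) */
wchar_t *keep_accepted(wchar_t *out, const wchar_t *in,
                       const wchar_t *accepted)
{
    wchar_t *dst = out;
    const wchar_t *p = in;

    /* wcspbrk skips ahead to the start of the next accepted run,
       or returns NULL when no accepted character is left. */
    while ((p = wcspbrk(p, accepted)) != NULL) {
        /* wcsspn measures how long that accepted run is. */
        size_t run = wcsspn(p, accepted);

        wmemcpy(dst, p, run);
        dst += run;
        p += run;
    }
    *dst = L'\0';
    return out;
}
```

    Because the accepted set is an explicit wide string, nothing here
    depends on how wchar_t values are ordered.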


    Would anybody please point out mistakes in the above; I haven't tried
    it yet. I hope to write an example demonstrating it later, like

    static int
    filter_stream_linewise(FILE *in_stream, FILE *out_stream,
    const wchar_t *accepted)
    {
    /* ... */
    }

    int
    main(int argc, char **argv)
    {
    /* ... */
    res = filter_stream_linewise(stdin, stdout, L"...");
    /* ... */
    }

    Cheers,
    lacos
    Ersek, Laszlo, Feb 11, 2010
    #3
  4. Pakt

    Pakt Guest

    Thank you both very much for the advice, both have been very helpful.

    Out of interest I compiled and ran Ben's suggestion and it worked
    perfectly, but I have a quick question about an embarrassingly simple
    modification to the translit_lookup function.

    If you are still reading, I would like to modify it so that any
    accented characters within the range u00C0->u017F are returned as is,
    instead of as '?', thereby (hopefully) preserving those characters in
    the original string.

    wchar_t translit_lookup(wchar_t in)
    {
        if (in <= 127)
            return in; // unchanged
        else if (in <= L'\u00C0' || in >= L'\u017F')
            return 0; // ignore

        // else return '?'; // * Change this line from this

        else return in; //* to this
    }

    But, when I do this my output only shows:

    String now: "

    I am puzzled by why this happens, and how I should modify it to work
    correctly. Thank you again for any assistance.
    Pakt, Feb 14, 2010
    #4
  5. In article <>, Pakt <> writes:

    > If you are still reading, I would like to modify it so that any
    > accented characters within the range u00C0->u017F are returned as is,
    > instead of as '?', thereby (hopefully) preserving those characters in
    > the original string.
    >
    > wchar_t translit_lookup(wchar_t in)
    > {
    > if (in <= 127)
    > return in; // unchanged
    > else if (in <= L'\u00C0' || in >= L'\u017F')
    > return 0; // ignore
    >
    > // else return '?'; // * Change this line from this
    >
    > else return in; //* to this
    > }
    >
    > But, when I do this my output only shows:
    >
    > String now: "
    >
    > I am puzzled by why this happens, and how I should modify it to work
    > correctly. Thank you again for any assistance.


    Please pipe the output of the program into "hexdump -C". The trailing
    double-quote is not shown either; I think the output of the program may
    be messing up your terminal.

    What does "locale" say right before you compile the program? What does
    it say right before you execute it? Also, please post the output of

    LC_ALL=C grep 'de la r' source.c | hexdump -C

    --o--

    You could also merge the two "return in" statements under a single
    condition.

    return (in <= L'\u00C0' || in >= L'\u017F') ? L'\0' : in;

    --o--

    Ben's process() from

    http://groups.google.com/group/comp.lang.c/msg/7dbb3e49ba19eeed

    doesn't seem to NUL-terminate the string early enough when at least one
    ignored wide character occurs.

    --o--

    I believe you cannot, in general, rely on

    (wchar_t)0xABCD == L'\uABCD'

    Consequently, if the ISO/IEC 10646 four-digit short identifier of
    character X precedes (when interpreted as a hexadecimal string) that of
    character Y, that doesn't imply that their wchar_t representations will
    have the same relationship.

    I think you should use an explicit alphabet of accepted characters for
    portability, or at least check for __STDC_ISO_10646__:

    C99 6.10.8 "Predefined macro names", paragraph 2:

    ----v----
    The following macro names are conditionally defined by the
    implementation:

    [...]

    __STDC_ISO_10646__

    An integer constant of the form yyyymmL (for example, 199712L), intended
    to indicate that values of type wchar_t are the coded representations of
    the characters defined by ISO/IEC 10646, along with all amendments and
    technical corrigenda as of the specified year and month.
    ----^----

    See also

    http://www.gnu.org/software/libc/manual/html_node/Extended-Char-Intro.html

    ----v----
    One final comment about the choice of the wide character representation
    is necessary at this point. We have said above that the natural choice
    is using Unicode or ISO 10646. This is not required, but at least
    encouraged, by the ISO C standard. The standard defines at least a macro
    __STDC_ISO_10646__ that is only defined on systems where the wchar_t
    type encodes ISO 10646 characters. If this symbol is not defined one
    should avoid making assumptions about the wide character representation.
    If the programmer uses only the functions provided by the C library to
    handle wide character strings there should be no compatibility problems
    with other systems.
    ----^----
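
    A small sketch of such a check, assuming you still want the numeric
    range test where it is valid. The function name in_latin_range and its
    -1 "don't know" return value are made up for this illustration:

```c
#include <wchar.h>

/* Returns 1 if in lies strictly between U+00C0 and U+017F (the same open
   range the thread's test uses) and 0 if it does not. Returns -1 if the
   implementation does not promise that wchar_t values are ISO/IEC 10646
   code points, in which case a numeric comparison means nothing.
   (in_latin_range is a hypothetical name used only in this sketch.) */
int in_latin_range(wchar_t in)
{
#ifdef __STDC_ISO_10646__
    return in > 0x00C0 && in < 0x017F;
#else
    (void)in;   /* unused on such implementations */
    return -1;
#endif
}
```

    On a -1 result the caller would fall back to an explicit alphabet of
    accepted characters.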

    Cheers,
    lacos
    Ersek, Laszlo, Feb 14, 2010
    #5
  6. Pakt <> writes:

    > Thank you both very much for the advice, both have been very helpful.
    >
    > Out of interest I compiled and ran Ben's suggestion and it worked
    > perfectly, but I have a quick question about an embarrassingly simple
    > modification to the translit_lookup function.


    No, not embarrassing for you, but for me. I forgot a call to
    setlocale that is needed before you can convert wide chars to
    multi-byte strings on output. I.e. you can do your transliteration,
    but the printf won't work without it. My test worked because it
    removed all problematic characters.

    <snip>
    > But, when I do this my output only shows:
    >
    > String now: "
    >
    > I am puzzled by why this happens, and how I should modify it to work
    > correctly. Thank you again for any assistance.


    I also forgot to null-terminate the string in the "process" function.
    Try this:

    #include <stdio.h>
    #include <locale.h>
    #include <wchar.h>

    wchar_t translit_lookup(wchar_t in)
    {
        if (in <= 127)
            return in; // unchanged
        else if (in <= L'\u00C0' || in >= L'\u017F')
            return 0; // ignore
        // else return '?'; // purely illustrative
        else return in;
    }

    wchar_t *process(wchar_t *wstr)
    {
        int src = 0, dst = 0;
        while (wstr[src]) {
            if ((wstr[dst] = translit_lookup(wstr[src])) != 0)
                ++dst;
            ++src;
        }
        wstr[dst] = 0;
        return wstr;
    }

    int main(void)
    {
        wchar_t test_string[] =
            L"jeudi 11 février, le 31e “anniversaire” de la révolution";
        setlocale(LC_ALL, "");
        printf("String now: \"%ls\"\n", process(test_string));
        return 0;
    }

    (I've left your change in place). Fingers crossed that I've not made
    any more basic errors!

    --
    Ben.
    Ben Bacarisse, Feb 14, 2010
    #6
  7. In article <fR$lPqRS3H1P@ludens>, (Ersek, Laszlo) writes:
    > In article <>, Pakt <> writes:
    >
    >> wchar_t translit_lookup(wchar_t in)
    >> {
    >> if (in <= 127)
    >> return in; // unchanged
    >> else if (in <= L'\u00C0' || in >= L'\u017F')
    >> return 0; // ignore
    >>
    >> // else return '?'; // * Change this line from this
    >>
    >> else return in; //* to this
    >> }



    > You could also merge the two "return in" statements under a single
    > condition.
    >
    > return (in <= L'\u00C0' || in >= L'\u017F') ? L'\0' : in;


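    How stupid of me. Sorry. That merged test would also throw away every
    ASCII character, since (where L'\u00C0' is 0xC0) anything <= 127 is
    also <= L'\u00C0'.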

    lacos
    /facepalm
    Ersek, Laszlo, Feb 14, 2010
    #7