Re: UTF-8 and wchar_t

Discussion in 'C Programming' started by Ersek, Laszlo, Mar 2, 2010.

  1. In article <86.com>, Michal Nazarewicz <> writes:
    > Hello everyone,
    >
    > I am facing a situation where I need to handle UTF-8 input along with
    > input from standard input (ie. locale dependent multibyte). In the end,
    > after some computations, concatenations, etc I need to output it to
    > standard output (again locale dependent multibyte).
    >
    > What I want to do is convert both the UTF-8 input as well as data from
    > standard input to an array of wchar_t and then output it using wprintf()
    > (or one of the other "wide" functions).


    Handle stdin and stdout as you intend, i.e. with setlocale() and the
    implicit conversion provided by the <stdio.h>/<wchar.h> wide I/O functions.
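
    A rough sketch of that advice, for illustration only. The helper names
    use_environment_locale() and copy_wide() are made up here; the idea is
    that once setlocale() has selected the user's locale, fgetwc() and
    fputwc() convert between the locale's multibyte encoding and wchar_t
    on their own.

```c
#include <assert.h>
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: select the locale named by the environment
 * (LANG, LC_ALL, ...). Returns 1 on success, 0 on failure. */
int use_environment_locale(void)
{
    return setlocale(LC_ALL, "") != NULL;
}

/* Hypothetical helper: copy wide characters from `in` to `out` until EOF.
 * The stdio layer converts multibyte <-> wchar_t according to LC_CTYPE.
 * Returns the number of wide characters copied, or (size_t)-1 on a
 * conversion or I/O error. */
size_t copy_wide(FILE *in, FILE *out)
{
    assert(in != NULL && out != NULL);

    size_t n = 0;
    wint_t wc;

    while ((wc = fgetwc(in)) != WEOF) {
        if (fputwc((wchar_t)wc, out) == WEOF)
            return (size_t)-1;
        n++;
    }
    return ferror(in) ? (size_t)-1 : n;
}
```

    Typical use would be to call use_environment_locale() once at startup
    and then copy_wide(stdin, stdout). Keep in mind that a stream becomes
    wide-oriented on its first wide I/O call, so byte functions such as
    printf() should not be mixed in on the same stream afterwards.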

    For the UTF-8 input coming from elsewhere: if you can stick with glibc,
    just call

    #include <iconv.h>

    convdesc = iconv_open("WCHAR_T", "UTF-8");

    http://www.gnu.org/software/libc/manual/html_node/iconv-Examples.html
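
    To make the iconv route a bit more concrete, here is a minimal sketch.
    The helper name utf8_to_wcs() and its buffer-based interface are
    assumptions for illustration; only iconv_open(), iconv() and
    iconv_close() come from <iconv.h>. It relies on the implementation
    accepting "WCHAR_T" as a target encoding name, as glibc does.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: decode `inlen` bytes of UTF-8 from `in` into `out`,
 * which has room for `outcap` wide characters. Returns the number of wide
 * characters written, or (size_t)-1 on failure (descriptor could not be
 * opened, invalid or truncated input, or output buffer too small).
 * The result is NOT null-terminated. */
size_t utf8_to_wcs(const char *in, size_t inlen, wchar_t *out, size_t outcap)
{
    assert(in != NULL && out != NULL);

    iconv_t cd = iconv_open("WCHAR_T", "UTF-8");   /* to, from */
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *inp = (char *)in;       /* iconv() takes char **; input is only read */
    char *outp = (char *)out;
    size_t inleft = inlen;
    size_t outleft = outcap * sizeof(wchar_t);

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);

    if (rc == (size_t)-1)
        return (size_t)-1;

    /* Bytes consumed from the output buffer, expressed in wide characters. */
    return outcap - outleft / sizeof(wchar_t);
}
```

    A real program would probably keep the conversion descriptor open
    across calls and inspect errno (EILSEQ, EINVAL, E2BIG) to tell the
    failure cases apart; the sketch folds them together for brevity.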

    Otherwise, you'll have to switch at least the LC_CTYPE locale category
    manually, and proceed with the separate input like with stdin.
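
    One way the manual LC_CTYPE switch might look, as a sketch rather than
    a recommendation. The helper name utf8_to_wcs_via_locale() is invented
    here, and the name of a UTF-8 locale (for example "C.UTF-8" or
    "en_US.UTF-8") is an assumption that differs between systems, so it is
    passed in by the caller.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: temporarily switch LC_CTYPE to `utf8_locale`,
 * convert the null-terminated string `in` with mbstowcs(), then restore
 * the previous LC_CTYPE. Returns the number of wide characters stored
 * (excluding any terminator), or (size_t)-1 if the locale is unavailable,
 * memory runs out, or the input is not valid in that locale.
 * If the return value equals `outcap`, `out` is not null-terminated. */
size_t utf8_to_wcs_via_locale(const char *utf8_locale, const char *in,
                              wchar_t *out, size_t outcap)
{
    assert(utf8_locale != NULL && in != NULL && out != NULL);

    /* The string returned by setlocale() may be overwritten by a later
     * call, so keep a private copy of the current setting. */
    const char *cur = setlocale(LC_CTYPE, NULL);
    if (cur == NULL)
        return (size_t)-1;
    char *saved = malloc(strlen(cur) + 1);
    if (saved == NULL)
        return (size_t)-1;
    strcpy(saved, cur);

    size_t n = (size_t)-1;
    if (setlocale(LC_CTYPE, utf8_locale) != NULL)
        n = mbstowcs(out, in, outcap);

    setlocale(LC_CTYPE, saved);   /* restore the caller's locale */
    free(saved);
    return n;
}
```

    setlocale() changes process-wide state, so this approach is awkward in
    multithreaded programs, and stdin/stdout conversions done while the
    locale is switched would use the wrong encoding.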

    Cheers,
    lacos
     
    Ersek, Laszlo, Mar 2, 2010
    #1

  2. Nobody Guest

    On Tue, 02 Mar 2010 19:16:20 +0100, Ersek, Laszlo wrote:

    > For the UTF-8 input coming from elsewhere: if you can stick with glibc,
    > just call
    >
    > #include <iconv.h>
    >
    > convdesc = iconv_open("WCHAR_T", "UTF-8");


    Or you can write your own UTF-8 encoder/decoder. Personally, I wouldn't
    make iconv a requirement just for UTF-8. If you're going to be using iconv
    anyhow, then you may as well use it for this as well.
     
    Nobody, Mar 2, 2010
    #2

  3. Nick Guest

    Nobody <> writes:

    > On Tue, 02 Mar 2010 19:16:20 +0100, Ersek, Laszlo wrote:
    >
    >> For the UTF-8 input coming from elsewhere: if you can stick with glibc,
    >> just call
    >>
    >> #include <iconv.h>
    >>
    >> convdesc = iconv_open("WCHAR_T", "UTF-8");

    >
    > Or you can write your own UTF-8 encoder/decoder. Personally, I wouldn't
    > make iconv a requirement just for UTF-8. If you're going to be using iconv
    > anyhow, then you may as well use it for this as well.


    I found sqlite3_unicode a great help when doing that - lots of stuff to
    start you off and no copyright issues:

    You can find it at:
    <URL:http://ioannis.mpsounds.net/blog/2009/01/11/sqlite3_unicode-updated-for-sqlite3-v367/>

    (it's not that ioannis btw).
    --
    Online waterways route planner | http://canalplan.eu
    Plan trips, see photos, check facilities | http://canalplan.org.uk
     
    Nick, Mar 3, 2010
    #3
  4. In article <>, Nobody <> writes:
    > On Tue, 02 Mar 2010 19:16:20 +0100, Ersek, Laszlo wrote:
    >
    >> For the UTF-8 input coming from elsewhere: if you can stick with glibc,
    >> just call
    >>
    >> #include <iconv.h>
    >>
    >> convdesc = iconv_open("WCHAR_T", "UTF-8");

    >
    > Or you can write your own UTF-8 encoder/decoder.


    I would rather not try.

    Invalid byte sequences (overlong forms) have security implications:
    http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences
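
    For anyone who does go the hand-written route, this is roughly what
    the extra checks amount to. It is a sketch, not a vetted decoder: the
    function name utf8_decode_one() and its interface are made up for
    illustration. Besides the bit shuffling, it rejects overlong forms,
    UTF-16 surrogates, values above U+10FFFF and truncated sequences.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one code point from s[0..len).
 * On success stores the code point in *cp and returns the number of
 * bytes consumed (1 to 4). Returns 0 if the input is empty, truncated
 * or not well-formed UTF-8. */
size_t utf8_decode_one(const unsigned char *s, size_t len, uint32_t *cp)
{
    assert(s != NULL && cp != NULL);
    if (len == 0)
        return 0;

    unsigned char b0 = s[0];
    size_t need;       /* number of continuation bytes expected */
    uint32_t c, min;   /* min: smallest value this length may encode */

    if (b0 < 0x80) {
        *cp = b0;
        return 1;
    } else if ((b0 & 0xE0) == 0xC0) {
        need = 1; c = b0 & 0x1F; min = 0x80;
    } else if ((b0 & 0xF0) == 0xE0) {
        need = 2; c = b0 & 0x0F; min = 0x800;
    } else if ((b0 & 0xF8) == 0xF0) {
        need = 3; c = b0 & 0x07; min = 0x10000;
    } else {
        return 0;      /* stray continuation byte, or 0xF8..0xFF */
    }

    if (len < need + 1)
        return 0;      /* truncated sequence */

    for (size_t i = 1; i <= need; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;  /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }

    if (c < min)
        return 0;      /* overlong form, e.g. C0 AF for '/' */
    if (c > 0x10FFFF)
        return 0;      /* beyond the Unicode range */
    if (c >= 0xD800 && c <= 0xDFFF)
        return 0;      /* UTF-16 surrogate, not a valid scalar value */

    *cp = c;
    return need + 1;
}
```

    The overlong check is the one with the security angle: without it,
    C0 AF would decode to '/', which could slip past a filter that only
    looks for the single byte 2F.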

    --o--

    I would have expected iconv() to support some kind of normalization too:

    Combining characters (when already in UCS/wchar_t):
    http://www.cl.cam.ac.uk/~mgk25/unicode.html#comb

    Normalizing to NFC (when already in UCS/wchar_t -- "NFC is the preferred
    form for Linux and WWW"):
    http://www.cl.cam.ac.uk/~mgk25/unicode.html#ucsutf

    But it seems this is not a reasonable expectation:

    (2001) http://sources.redhat.com/ml/libc-alpha/2001-09/msg00170.html
    (2004) http://sourceware.org/ml/libc-alpha/2004-01/msg00287.html
    (2009) http://www.mail-archive.com//msg15501.html

    Cheers,
    lacos
     
    Ersek, Laszlo, Mar 3, 2010
    #4
