Unicode character in C++

Discussion in 'C++' started by liveshell, Apr 7, 2008.

  1. liveshell

    liveshell Guest

    Hi all,
    In my application, I am reading a file and storing it in an
    array of characters. The data is normally ASCII, but in certain
    situations I get Unicode characters (or, let's say, junk characters).
    I want to know whether the data is plain ASCII or Unicode. How can I
    do that?

    Thanks,
    LiveShell
    liveshell, Apr 7, 2008

  2. liveshell wrote:
    > Hi all,
    > In my application, I am reading a file and storing it in an
    > array of characters. The data is normally ASCII, but in certain
    > situations I get Unicode characters (or, let's say, junk characters).
    > I want to know whether the data is plain ASCII or Unicode. How can I
    > do that?


    Supposing your junk is UTF-8, you have to look for bytes whose MSB is 1.
    This is how it is done in UTF-8: bytes 0-127 are the historical ASCII
    characters, and the number of leading ones in the first byte gives the
    total number of bytes in the encoding:
    US-ASCII: 0xxxxxxx
    2 bytes: 110xxxxx xxxxxxxx
    3 bytes: 1110xxxx xxxxxxxx xxxxxxxx
    4 bytes: 11110xxx xxxxxxxx xxxxxxxx xxxxxxxx
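
A minimal sketch of the "look for the MSB" check, assuming the file contents have already been read into a std::string. The function name is_plain_ascii is made up for illustration. It only tells you whether every byte is 7-bit ASCII; it does not tell you which encoding the non-ASCII bytes are in.

```cpp
#include <string>

// Returns true when every byte in the buffer has its most significant
// bit clear, i.e. the data is plain 7-bit ASCII (an empty buffer counts
// as ASCII). The name is hypothetical, not a library function.
bool is_plain_ascii(const std::string& data) {
    for (unsigned char byte : data) {
        if (byte & 0x80) {
            return false;  // MSB set: not plain ASCII
        }
    }
    return true;
}
```

Iterating as unsigned char avoids surprises on platforms where plain char is signed and bytes above 127 would show up as negative values.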

    Michael
    Michael DOUBEZ, Apr 7, 2008

  3. James Kanze

    James Kanze Guest

    On Apr 7, 2:50 pm, Michael DOUBEZ wrote:
    > liveshell wrote:
    > > In my application, I am reading a file and storing it in an
    > > array of characters. The data is normally ASCII, but in certain
    > > situations I get Unicode characters (or, let's say, junk characters).
    > > I want to know whether the data is plain ASCII or Unicode. How can I
    > > do that?


    > Supposing your junk is UTF-8, you have to look for bytes whose MSB is 1.
    > This is how it is done in UTF-8: bytes 0-127 are the historical ASCII
    > characters, and the number of leading ones in the first byte gives the
    > total number of bytes in the encoding:
    > US-ASCII: 0xxxxxxx
    > 2 bytes: 110xxxxx xxxxxxxx
    > 3 bytes: 1110xxxx xxxxxxxx xxxxxxxx
    > 4 bytes: 11110xxx xxxxxxxx xxxxxxxx xxxxxxxx


    Note too that the following bytes will always have 10 in their
    upper two bits, and that the first byte of an n-byte sequence starts
    with n ones followed by a zero, so that should be something like:
    2 bytes: 110xxxxx 10xxxxxx
    3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
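
A rough sketch of how the byte layout could be checked in code. It uses the lead-byte patterns from the UTF-8 definition (0xxxxxxx, 110xxxxx, 1110xxxx, 11110xxx) and requires 10xxxxxx for every following byte. The names utf8_sequence_length and looks_like_utf8 are made up for illustration. This is only a structural check under those assumptions: it does not reject overlong encodings, surrogate code points, or values above U+10FFFF, so it is not a full validator.

```cpp
#include <cstddef>
#include <string>

// Number of bytes announced by a lead byte, or 0 when the byte cannot
// start a sequence (a 10xxxxxx continuation byte or 11111xxx).
std::size_t utf8_sequence_length(unsigned char lead) {
    if ((lead & 0x80) == 0x00) return 1;  // 0xxxxxxx: US-ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;
}

// Returns true when the buffer is a series of structurally well-formed
// sequences: a valid lead byte followed by the announced number of
// 10xxxxxx continuation bytes. Pure ASCII input also returns true.
bool looks_like_utf8(const std::string& data) {
    std::size_t i = 0;
    while (i < data.size()) {
        const std::size_t len =
            utf8_sequence_length(static_cast<unsigned char>(data[i]));
        if (len == 0 || i + len > data.size()) {
            return false;  // stray continuation byte or truncated sequence
        }
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char next = static_cast<unsigned char>(data[i + k]);
            if ((next & 0xC0) != 0x80) {
                return false;  // following byte is not 10xxxxxx
            }
        }
        i += len;
    }
    return true;
}
```

If the buffer contains bytes with the MSB set and this check fails, the data is probably in some other encoding (Latin-1, for example) rather than UTF-8.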

    --
    James Kanze (GABI Software) email:
    Conseils en informatique orientée objet/
    Beratung in objektorientierter Datenverarbeitung
    9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
    James Kanze, Apr 8, 2008
