C++ support for Unicode, UTF-8, encode/decode, ifstream, wstream?

Discussion in 'C++' started by Rafał Maj Raf256, Jan 20, 2006.

  1. Hi,
    I have a Unicode text file encoded in UTF-8.

    I should store the Unicode strings in my program in, for example,
    std::wstring, right? That way I can work with them normally, so that
    for std::wstring foo, foo[5] means the 5th _character_ and not the 5th
    byte of the encoded string.

    How do I read text from a UTF-8 file into a std::wstring? I need to do
    some conversion, right? From UTF-8 to the internal format used by
    std::wstring (probably UCS-2 or UCS-4, right?)

    Also, how do I save the string back, and how do I manipulate it (like
    replacing the 4th character, just str[4]=(wchar_t)'x'?)

    Thanks
     
    Rafał Maj Raf256, Jan 20, 2006
    #1

  2. TB Guest

    Rafał Maj Raf256 said:
    > Hi,
    > I have an UNICODE text file encoded in UTF-8.
    >
    > I should store the UNICODE strings in my program for example in
    > std::wstring right? To be able to work on them normally, so that
    > std::wstring foo; foo[5] would mean 5-th _character_, and not 5-th
    > byte of UNICODE encoded string.
    >
    > How do I read a text from UTF-8 file into std::wstring? I need to do
    > some conversion right? from utf-8 to internal format used by
    > std::wstring (probably UCS-2 or -4 right?)
    >
    > Also, how to save back the string, and how to manipulate it (like,
    > replace 4-th character, just str[4]=(wchar)'x' ?)
    >


    When you read the UTF-8 data, convert it internally to UTF-32 for
    easier parsing. The conversion process is quite easy to write.
    The problem with std::wstring is that it is templatized on
    wchar_t, and that primitive is, at least on my machine, only 2 bytes
    wide and therefore not practical to use with Unicode (unless you
    actually wish to use the abnormal UTF-16 variant in such a case).
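
    A minimal sketch of the UTF-8 to UTF-32 step described above, for
    illustration only. It assumes a C++11 compiler (char32_t and
    std::u32string did not exist when this thread was written), and
    decode_utf8 is a hypothetical name, not a standard or Boost function.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into UTF-32 code points.
// Throws std::runtime_error on malformed input.
std::u32string decode_utf8(const std::string& in) {
    // Smallest code point that legitimately needs 1, 2, 3 or 4 bytes;
    // anything smaller is an "overlong" encoding and is rejected.
    static const char32_t min_cp[] = {0, 0x0, 0x80, 0x800, 0x10000};

    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t len;
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (in.size() - i < len)
            throw std::runtime_error("truncated UTF-8 sequence");

        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }

        const bool overlong  = cp < min_cp[len];
        const bool surrogate = cp >= 0xD800 && cp <= 0xDFFF;
        if (overlong || surrogate || cp > 0x10FFFF)
            throw std::runtime_error("invalid code point in UTF-8 input");

        out.push_back(cp);
        i += len;
    }
    return out;
}

// Usage check: "Rafał" is 6 bytes in UTF-8 but 5 code points in UTF-32.
void decode_utf8_example() {
    const std::u32string name = decode_utf8("Rafa\xC5\x82");
    assert(name.size() == 5);
    assert(name[4] == U'\u0142');  // LATIN SMALL LETTER L WITH STROKE
}
```

    Most of the function is validation rather than bit-shuffling: the
    happy path is short, but truncated, overlong and out-of-range
    sequences all need explicit handling.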

    --
    TB @ SWEDEN
     
    TB, Jan 20, 2006
    #2

  3. TB wrote:

    > Upon reading the UTF-8 data convert it internally to UTF-32 for
    > easier parsing.


    How? Aren't there ready-to-use functions/classes for that? In std,
    perhaps in Boost?

    > The conversion process is quite easy to write.


    > The problem with std::wstring is that it's templatized with
    > wchar_t, and that primitive is at least on my machine only 2 bytes,
    > and therefore not practical to use with unicode (unless you actually
    > wish to use the abnormal UTF-16 variant in such a case).


    Hm... so which class is best for storing text in any language, then?
     
    Rafał Maj Raf256, Jan 20, 2006
    #3
  4. P.J. Plauger Guest

    "Rafal Maj Raf256" <> wrote in
    message news:dqqid4$dn6$...

    > TB wrote:
    >
    >> Upon reading the UTF-8 data convert it internally to UTF-32 for
    >> easier parsing.

    >
    > How? Arent there ready to use functions/classes doing that? In std,
    > perhaps in boost?


    You'll find a few codecvt facets (the critters you need) in various places,
    but for a complete set of all that you're likely to need -- ready made,
    tested, and supported -- see our CoreX library.

    >> The conversion process is quite easy to write.


    No it isn't. At least not correctly and robustly.

    >> The problem with std::wstring is that it's templatized with
    >> wchar_t, and that primitive is at least on my machine only 2 bytes,
    >> and therefore not practical to use with unicode (unless you actually
    >> wish to use the abnormal UTF-16 variant in such a case).

    >
    > Hm.. so which class is best to store any-language text string then?


    Depends on your goals. In truth and reality, you can still get away quite
    nicely with UCS-2. Effectively, you ignore the exotic characters with
    code values above 0xffff more recently added. Your input converter
    then treats as erroneous any UTF-8 sequence that specifies a code
    value that's too big. But if you feel the need to support the complete
    Unicode set in its current form, you need to convert UTF-8 to UTF-16
    internally, and accept the fact that characters can occupy either one or
    two storage elements. Whatever your choice, CoreX has the
    conversion tools you need to carry it out.
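
    To make the two choices above concrete, here is a rough sketch of
    what they mean for a single code point. It assumes C++11 (char16_t,
    std::u16string); append_utf16 and append_ucs2 are hypothetical names
    used only for illustration and have nothing to do with CoreX.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// UTF-16 policy: append one code point, using a surrogate pair (two
// storage elements) for anything above U+FFFF.
void append_utf16(std::u16string& out, char32_t cp) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::invalid_argument("not a Unicode scalar value");
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        cp -= 0x10000;  // 20 bits remain: high 10 and low 10
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    }
}

// UCS-2 policy: treat any code point above U+FFFF as an error, so every
// character occupies exactly one storage element.
void append_ucs2(std::u16string& out, char32_t cp) {
    if (cp > 0xFFFF)
        throw std::range_error("code point does not fit in UCS-2");
    append_utf16(out, cp);
}

// Usage check: U+1F600 needs two UTF-16 elements, U+0142 needs one.
void utf16_example() {
    std::u16string s;
    append_utf16(s, U'\U0001F600');
    assert(s.size() == 2 && s[0] == 0xD83D && s[1] == 0xDE00);
    append_ucs2(s, U'\u0142');
    assert(s.size() == 3 && s[2] == 0x0142);
}
```

    With the UCS-2 policy, s[i] is always one character; with UTF-16,
    indexing can land in the middle of a surrogate pair.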

    P.J. Plauger
    Dinkumware, Ltd.
    http://www.dinkumware.com
     
    P.J. Plauger, Jan 20, 2006
    #4
  5. TB Guest

    Rafał Maj Raf256 said:
    > TB wrote:
    >
    >> Upon reading the UTF-8 data convert it internally to UTF-32 for
    >> easier parsing.

    >
    > How? Arent there ready to use functions/classes doing that? In std,
    > perhaps in boost?
    >
    >> The conversion process is quite easy to write.

    >
    >> The problem with std::wstring is that it's templatized with
    >> wchar_t, and that primitive is at least on my machine only 2 bytes,
    >> and therefore not practical to use with unicode (unless you actually
    >> wish to use the abnormal UTF-16 variant in such a case).

    >
    > Hm.. so which class is best to store any-language text string then?


    If 'unsigned int' is 4 bytes on your machine, write a unicode
    implementation based on that primitive, or use an already available
    framework; hm, perhaps this 'CoreX'-thingy advocated by P.J.
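
    As an illustration of the "4-byte primitive" approach, the sketch
    below stores text as one code point per element, edits it by index,
    and re-encodes it as UTF-8 for saving. It assumes C++11 and uses
    char32_t rather than 'unsigned int'; encode_utf8 is a hypothetical
    helper name, not part of any library mentioned in this thread.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical helper: encode UTF-32 code points as UTF-8 bytes.
std::string encode_utf8(const std::u32string& in) {
    std::string out;
    for (char32_t cp : in) {
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("not a Unicode scalar value");
        if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {             // 3 bytes
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                               // 4 bytes
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Usage check: replace one code point by index, then save as UTF-8.
void encode_utf8_example() {
    std::u32string s = U"Rafa\u0142";  // 5 code points
    s[4] = U'l';                       // element 4 is a whole code point
    assert(encode_utf8(s) == "Rafal");
}
```

    Note that one code point is still not always one user-visible
    character (combining marks, for example), so indexing by code point
    is a simplification.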

    --
    TB @ SWEDEN
     
    TB, Jan 20, 2006
    #5
  6. P.J. Plauger Guest

    "TB" <> wrote in message
    news:43d10b59$0$8259$...

    > Rafal Maj Raf256 said:
    >> TB wrote:
    >>
    >>> Upon reading the UTF-8 data convert it internally to UTF-32 for
    >>> easier parsing.

    >>
    >> How? Arent there ready to use functions/classes doing that? In std,
    >> perhaps in boost?
    >>
    >>> The conversion process is quite easy to write.

    >>
    >>> The problem with std::wstring is that it's templatized with
    >>> wchar_t, and that primitive is at least on my machine only 2 bytes,
    >>> and therefore not practical to use with unicode (unless you actually
    >>> wish to use the abnormal UTF-16 variant in such a case).

    >>
    >> Hm.. so which class is best to store any-language text string then?

    >
    > If 'unsigned int' is 4 bytes on your machine, write a unicode
    > implementation based on that primitive, or use an already available
    > framework; hm, perhaps this 'CoreX'-thingy advocated by P.J.


    Yep. It includes UTF-8 to UCS-4 too. And it's templatized on the
    internal character type. Forgot to mention that.
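
    A rough idea of what "templatized on the internal character type"
    can look like, as a sketch only. This is not CoreX's interface (which
    is not shown in this thread); store_code_points is a hypothetical
    name, and the code assumes C++11.

```cpp
#include <cassert>
#include <limits>
#include <stdexcept>
#include <string>

// Hypothetical helper: store code points one per element of CharT,
// refusing any code point that a single element cannot hold.
template <class CharT>
std::basic_string<CharT> store_code_points(const std::u32string& in) {
    // Largest value one element can represent, widened for comparison.
    const unsigned long long max_elem =
        static_cast<unsigned long long>(std::numeric_limits<CharT>::max());

    std::basic_string<CharT> out;
    out.reserve(in.size());
    for (char32_t cp : in) {
        if (cp > max_elem)
            throw std::range_error("code point does not fit in one element");
        out.push_back(static_cast<CharT>(cp));
    }
    return out;
}

// Usage check: a 2-byte element type behaves like UCS-2, a 4-byte one
// like UCS-4.
void store_code_points_example() {
    assert(store_code_points<char16_t>(U"a\u0142") == u"a\u0142");
    assert(store_code_points<char32_t>(U"\U0001F600").size() == 1);
}
```

    Instantiated with wchar_t, the same template accepts or rejects
    U+1F600 depending on whether wchar_t is 2 or 4 bytes on the platform,
    which is the portability problem raised earlier in the thread.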

    P.J. Plauger
    Dinkumware, Ltd.
    http://www.dinkumware.com
     
    P.J. Plauger, Jan 20, 2006
    #6
  7. JustBoo Guest

    On Fri, 20 Jan 2006 17:13:37 +0100, TB <> wrote:

    >If 'unsigned int' is 4 bytes on your machine, write a unicode
    >implementation based on that primitive, or use an already available
    >framework; hm, perhaps this 'CoreX'-thingy advocated by P.J.


    Oh, it isn't just advocated by Mr. Plauger it's SOLD ($) by Mr.
    Plauger. Bit of a difference I think.

    "If you have ten thousand regulations you destroy
    all respect for the law." - Winston Churchill
     
    JustBoo, Jan 21, 2006
    #7
  8. P.J. Plauger Guest

    "JustBoo" <> wrote in message
    news:...

    > On Fri, 20 Jan 2006 17:13:37 +0100, TB <> wrote:
    >
    >>If 'unsigned int' is 4 bytes on your machine, write a unicode
    >>implementation based on that primitive, or use an already available
    >>framework; hm, perhaps this 'CoreX'-thingy advocated by P.J.

    >
    > Oh, it isn't just advocated by Mr. Plauger it's SOLD ($) by Mr.
    > Plauger. Bit of a difference I think.


    Really? In what way? I certainly *advocate* using an already
    available framework, as did TB. If you can get a free one that
    does the job (and it's still sufficiently "free" after you locate
    it, download it, figure out how to build it, integrate it into
    your product, deal with the surprises, and test it to your
    satisfaction) by all means do so. I also *advocate* using CoreX,
    if you're sufficiently professional that USD 90 is cheaper than
    the above parenthetical exercise costs you in your time and
    peace of mind.

    But if you think I *advocate* something just because I make
    ninety bucks off it, then by all means avoid anything that's
    $OLD and stick with open sour¢e. Just don't measure me by
    your standards.

    P.J. Plauger
    Dinkumware, Ltd.
    http://www.dinkumware.com
     
    P.J. Plauger, Jan 21, 2006
    #8
  9. JustBoo Guest

    On Sat, 21 Jan 2006 18:32:22 -0500, "P.J. Plauger"
    <> wrote:
    >"JustBoo" <> wrote in message
    >news:...
    >> Oh, it isn't just advocated by Mr. Plauger it's SOLD ($) by Mr.
    >> Plauger. Bit of a difference I think.


    >Really? In what way? I certainly *advocate* using an already
    >available framework, as did TB.


    In what way? Well, from the simple *fact* you make money selling
    your products here. I make that the point of this paragraph. Can you
    deny that? It is a fact. Leave emotion out of it. Leave capitalism out
    of it. Leave your perception that this is an insult out of it, and all
    the rest. You sell your products here. I truly *couldn't care less*
    whether you do or not. But you do. Please do not enumerate all the
    good you do for the free and not-so-free world by doing this. You sell
    them here on a consistent basis. Period. As the sun comes up every
    morning, it's just the obvious truth. Now please pay attention; in the
    *context* of this thread, I thought it important to point this out to
    the poster. That's it.

    Before getting your hackles up please read further.

    >If you can get a free one that
    >does the job (and it's still sufficiently "free" after you locate
    >it, download it, figure out how to build it, integrate it into
    >your product, deal with the surprises, and test it to your
    >satisfaction) by all means do so. I also *advocate* using CoreX,
    >if you're sufficiently professional that USD 90 is cheaper than
    >the above parenthetical exercise costs you in your time and
    >peace of mind.


    Fairly ironic that. I'm certain you don't remember, but *I have
    recommended* people look at your products on a regular basis, in
    this ng and others. Even in the real world. And ready for this, I have
    used precisely the exact same logic to justify the recommendation
    when *attacked* for doing so. That is the very definition of irony.

    [Note: I usually leave out the snarky remark about "sufficiently
    professional" though.] :)

    >But if you think I *advocate* something just because I make
    >ninety bucks off it, then by all means avoid anything that's
    >$OLD and stick with open sour¢e.


    Wow, an ocean's worth of assumption and presumptions to boot. You
    think me a socialist? Bwha. I'm a stone-cold capitalist. You've
    assumed far too much. <chuckling> You've read far too much into my
    simple statement of fact.

    And yes, I do think you advocate it because you make money from it.
    Welcome to the commerce of the human race. It's just human nature.
    I'll leave it up to you to decide if that is an insult or not.

    Trend your own posts, seriously. Look at what you respond to and what
    you always recommend. I believe you to be of a scientific mentality
    and if you are honest with yourself you will see the truth. Nothing
    more, nothing less.

    And in the end, so what. As I'm sure one of your arguments would/will
    be: people are free to buy it or not, and you're making them aware of
    its existence. And there you have it. Try to read this post without
    emotion and perhaps you'll see my intent.

    >Just don't measure me by
    >your standards.

    [Insult acknowledged but not accepted; like a refused package]

    Once again, you assume far too much. Especially given that I simply
    pointed out that you sell products, which is true. Does being a
    capitalist bother you? Guilt perhaps? Note those are questions, not
    assumptions.

    "I didn't fight my way to the top of the food chain to be a
    vegetarian."

    Have a *prosperous* week. :)
     
    JustBoo, Jan 22, 2006
    #9
  10. P.J. Plauger Guest

    "JustBoo" <> wrote in message
    news:...

    > On Sat, 21 Jan 2006 18:32:22 -0500, "P.J. Plauger"
    > <> wrote:
    >>"JustBoo" <> wrote in message
    >>news:...
    >>> Oh, it isn't just advocated by Mr. Plauger it's SOLD ($) by Mr.
    >>> Plauger. Bit of a difference I think.

    >
    >>Really? In what way? I certainly *advocate* using an already
    >>available framework, as did TB.

    >
    > In what way? Well, from the simple *fact* you make money selling
    > your products here. I make that the point of this paragraph. Can you
    > deny that?


    Uh, no.

    > [extensive rant elided]


    Got it. Now chill out.

    P.J. Plauger
    Dinkumware, Ltd.
    http://www.dinkumware.com
     
    P.J. Plauger, Jan 22, 2006
    #10
  11. JustBoo Guest

    On Sun, 22 Jan 2006 12:13:04 -0500, "P.J. Plauger"
    <> wrote:

    >> [extensive rant elided]

    >
    >Got it. Now chill out.


    Extensive rant? Wow. Obviously the "voice" I wrote the post with and
    the "voice" you read it with were in two different universes.

    I said repeatedly "without emotion." To me that means even and calm.
    But...

    Boo to airline stewardess on Plauger Air,
    "Miss, may I have a pillow?"

    Stewardess: "AHHHH! AHHH, STOP SCREAMING
    AT ME YOU ANGRY, BITTER, MEAN, ANGRY,
    UNREASONABLE COMMIE, SOCIALIST,
    BASTARD! AHHHHH, AHHHHHH!."

    <sigh> All righty then....

    Men occasionally stumble over the truth, but most
    of them pick themselves up and hurry off as if
    nothing had happened. - Winston Churchill

    Once again, Have a *prosperous* week.
    I'm out for some fun in the sun. :)
     
    JustBoo, Jan 22, 2006
    #11
  12. P.J. Plauger Guest

    "JustBoo" <> wrote in message
    news:o...

    > On Sun, 22 Jan 2006 12:13:04 -0500, "P.J. Plauger"
    > <> wrote:
    >
    >>> [extensive rant elided]

    >>
    >>Got it. Now chill out.

    >
    > Extensive rant? Wow. obviously the "voice" I wrote the post with and
    > the "voice" you read it with were in two different universes.
    >
    > I said repeatedly "without emotion." To me that means even and calm.
    > But...


    Yes, you said many things, but it's clear that you are
    intellectually dishonest. I stay in "dialogs" like this
    only when I feel a need to educate the lurkers, not in
    any hopes of finding common ground. Since I have no
    sermon to preach in this case, I see little reason to
    feed this particular troll.

    P.J. Plauger
    Dinkumware, Ltd.
    http://www.dinkumware.com
     
    P.J. Plauger, Jan 22, 2006
    #12
  13. JustBoo Guest

    On Sun, 22 Jan 2006 14:11:07 -0500, "P.J. Plauger"
    <> wrote:
    >"JustBoo" <> wrote in message
    >news:o...


    >> Extensive rant? Wow. obviously the "voice" I wrote the post with and
    >> the "voice" you read it with were in two different universes.


    Well, I now "see" what voice you hear. It seems obvious to me you have
    quite literally no sense-of-humor. I mean you literally have no
    ability to sense when humor is being applied to a situation. I now
    have no doubt when someone makes a witty/funny remark in a meeting
    you're that guy that sits there stone-faced because you just don't get
    it.

    If I make a "thumbs-up" gesture and say how many fingers am I holding
    up you're the guy that has to launch into a deadly-dull bone-dry
    dissertation about why thumbs aren't fingers and blah, blah... yawn,
    blah.... Wow.

    The airline joke (or an attempt at one :) ) was so over the top that
    I figured *anyone* with a social IQ above a warm doorknob would
    get it. Disappointed again I see. And yes, that is an insult. YOU
    insulted me numerous times in the past posts and I did *not*
    reciprocate. I left them unanswered. Think about that.

    "I like a man who grins when he fights." - Winston Churchill

    >> I said repeatedly "without emotion." To me that means even and calm.
    >> But...

    >
    >Yes, you said many things, but it's clear that you are
    >intellectually dishonest. I stay in "dialogs" like this
    >only when I feel a need to educate the lurkers, not in
    >any hopes of finding common ground.


    Gah! Where the heck did that come from? Oh, right, no ability to sense
    humor. But there seems to be more... hmm, let's see. Admitted to
    hawking wares, okay, but he feels "a need to educate the lurkers."
    Hmm, he knows he has good products so should not be so insecure that
    he has to defend that at every turn and that was not even brought up
    in this thread (but I am probably wrong about that...) but... oh no!
    Say it isn't so. :-(

    Well, I think he has decided he is the Alpha Male of this (his?)
    newsgroup and needs to prove to his underlings, lackeys and lurkers
    that he will have the ego-driven Last Word(tm)(r), by gawd! Yes, it's
    Good To Be King.

    Man, that explains so much of the behavior, tone and tenor of the
    petty back-biting Superior Sheep in this group. The lesser
    monkey-sheep are emulating The Alpha Male's behavior. Much of it to
    curry favor and not be "chewed on" by the emulating monkey-sheep.
    Ding! Because after all, they are so superior to the sub-creatures
    that will not submit.

    Man, what a small tiny vacuous life to live. To each his own.

    "To perceive is to suffer." - Aristotle

    >Since I have no
    >sermon to preach in this case, I see little reason to
    >feed this particular troll.


    And you call me *intellectually* dishonest? Bwha. Look in the mirror,
    pal.

    Life should NOT be a journey to the grave with the intention of
    arriving safely in a well preserved body, but to skid in sideways,
    champagne in one hand, chocolate in the other, body thoroughly
    used up and worn out, screaming "WOO HOO BOO- What a Ride!"
    - Terry Pratchett
     
    JustBoo, Jan 23, 2006
    #13