Write non-unicode in file?

Discussion in 'C++' started by Immortal Nephi, Apr 13, 2010.

  1. I define a wofstream variable. Why does Data.txt contain non-Unicode
    text? It should be Unicode text.

    #include <fstream>

    int main()
    {
        // A wide-character file name is an MSVC extension; a narrow name is portable.
        std::wofstream wFile;
        wFile.open( "Data.txt", std::ios::out );
        wFile << L"Hello World!" << std::endl;
        wFile.close();

        return 0;
    }
     
    Immortal Nephi, Apr 13, 2010
    #1

  2. Immortal Nephi

    Jonathan Lee Guest

    On Apr 13, 1:53 pm, Immortal Nephi <> wrote:
    >         I define wofstream variable.  Why do data.txt contain non-unicode
    > text?  It should be unicode text.


    Looks like you have to set the locale. Google took me to here:

    http://www.velocityreviews.com/forums/t517679-wofstream.html

    which describes a similar problem and is answered by James Kanze
    (looks like an archive of this newsgroup or something).

    Other than that, though, wouldn't "Hello World!" look exactly
    the same in UTF-8 as ASCII? Or are you expecting UTF-16?
    (I'm not familiar with wofstream...)
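A minimal sketch of the "set the locale" suggestion, assuming the goal is UTF-16LE output without a BOM. It imbues the stream with the standard `std::codecvt_utf16` facet from `<codecvt>`, which is deprecated since C++17 but still shipped by the major toolchains. The names `write_utf16le` and `read_bytes` are hypothetical helpers, not something from the thread.

```cpp
#include <codecvt>   // std::codecvt_utf16 (deprecated since C++17)
#include <fstream>
#include <iterator>
#include <locale>
#include <string>

// Writes text plus a trailing newline to path as UTF-16LE, without a BOM.
// Returns false if the stream reported a failure.
bool write_utf16le(const std::string& path, const std::wstring& text)
{
    std::wofstream out;
    // Imbue before opening so the facet is in place for the first write.
    // The locale takes ownership of the facet allocated here.
    out.imbue(std::locale(out.getloc(),
        new std::codecvt_utf16<wchar_t, 0x10ffff, std::little_endian>));
    // Binary mode stops the runtime from translating newline bytes.
    out.open(path, std::ios::out | std::ios::binary);
    out << text << L'\n';
    out.close();
    return !out.fail();
}

// Reads the whole file back as raw bytes, for inspecting what was written.
std::string read_bytes(const std::string& path)
{
    std::ifstream in(path, std::ios::in | std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

Calling `write_utf16le("Data.txt", L"Hello World!")` and then inspecting `read_bytes("Data.txt")` should show two bytes per character. Without the `imbue` call, the default "C" locale narrows each wide character to a single byte.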

    --Jonathan
     
    Jonathan Lee, Apr 13, 2010
    #2

  3. On Apr 13, 1:01 pm, Victor Bazarov <> wrote:
    > Immortal Nephi wrote:
    > >    I define wofstream variable.  Why do data.txt contain non-unicode
    > > text?  It should be unicode text.

    >
    > How do you know it contains non-Unicode text?  What gives it away?  Do
    > you look at the file with a hex editor?  What do you see?


    Look at the file's size: it reports 13 bytes, so you know
    it is non-Unicode. A hex editor tells you the same thing.
    It should show a 0x00 byte between each pair of characters,
    and the file size should be 26 bytes.
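The arithmetic behind the 13 and 26 figures can be sketched as follows: "Hello World!" plus a newline is 13 characters, which take one byte each when narrowed to ASCII (or UTF-8) and two bytes each in UTF-16. The function name `utf16le_bytes` is a hypothetical helper, and the sketch assumes little-endian output without a BOM, which is what the "0x00 between characters" description implies.

```cpp
#include <string>

// Serialises UTF-16 code units as little-endian bytes (low byte first), no BOM.
std::string utf16le_bytes(const std::u16string& text)
{
    std::string bytes;
    for (char16_t unit : text) {
        bytes.push_back(static_cast<char>(unit & 0xFF));         // low byte
        bytes.push_back(static_cast<char>((unit >> 8) & 0xFF));  // high byte, 0x00 for ASCII
    }
    return bytes;
}
```

For ASCII-only text every high byte is 0x00, which is the pattern a hex editor would show for a UTF-16LE file.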


    > > int main()
    > > {
    > >    wofstream wFile;
    > >    wFile.open( L"Data.txt", std::ios::out );
    > >    wFile << L"Hello World!" << std::endl;
    > >    wFile.close();

    >
    > >    return 0;
    > > }

    >
    > V
    > --
    > Please remove capital 'A's when replying by e-mail
    > I do not respond to top-posted replies, please don't ask
     
    Immortal Nephi, Apr 13, 2010
    #3
  4. Immortal Nephi

    Jonathan Lee Guest

    On Apr 13, 3:28 pm, Immortal Nephi <> wrote:
    > Look at file's byte size.  It reports 13 bytes.
    > You know that it is non-unicode.  You can look
    > at hex editor.  It tells you the same.  It should
    > show 0x00 each between two characters.  The byte
    > will be 26.


    You really should look up what UTF-8 is.

    --Jonathan
     
    Jonathan Lee, Apr 13, 2010
    #4
  5. On Apr 13, 12:35 pm, Jonathan Lee <> wrote:
    > On Apr 13, 3:28 pm, Immortal Nephi <> wrote:
    >
    > > Look at file's byte size.  It reports 13 bytes.
    > > You know that it is non-unicode.  You can look
    > > at hex editor.  It tells you the same.  It should
    > > show 0x00 each between two characters.  The byte
    > > will be 26.

    >
    > You really should look up what UTF-8 is.


    More specifically, you both should look up Unicode. He is using the
    old definition of Unicode from the old standard of Unicode where there
    was only one encoding. Today, we call this encoding UCS-2 (which is
    very similar to UTF-16).

    Jonathan Lee and Nephi both make the same mistake. They both assume
    "Unicode" means a particular encoding scheme of Unicode Code Points,
    Jonathan UTF-8 and Nephi UCS-2. It would help the conversation to stop
    making such assumptions or mistakes.

    I personally hate standard C++'s handling of all things unicode, so
    I'm sorry I cannot add more to this conversation.
     
    Joshua Maurice, Apr 13, 2010
    #5
  6. On Apr 13, 12:35 pm, Jonathan Lee <> wrote:
    > On Apr 13, 3:28 pm, Immortal Nephi <> wrote:
    >
    > > Look at file's byte size.  It reports 13 bytes.
    > > You know that it is non-unicode.  You can look
    > > at hex editor.  It tells you the same.  It should
    > > show 0x00 each between two characters.  The byte
    > > will be 26.

    >
    > You really should look up what UTF-8 is.


    Odd. Did google just drop my reply? Sorry if this is a double post of
    roughly the same content.

    Anyway, you are both making the same mistake. Unicode is not a single
    encoding scheme. At least, the current standard of Unicode does not
    have a single encoding scheme. Unicode is a mapping from characters to
    numbers. It specifies several encodings, such as UTF-8, UTF-16, and
    UTF-32. The old Unicode standard had characters for which 16 bits
    would suffice, and specified only a single encoding UCS-2 (which is
    very similar to UTF-16). Hence the unfortunate habit of some people
    calling UCS-2 Unicode, which Nephi is doing now. However, it is just
    as big of a mistake to say that Unicode is UTF-8 as Jonathan just did.
    It would help the conversation to use correct terms and stop making
    assumptions. Perhaps Nephi has a requirement for UTF-16 / UCS-2
    encoding (a valid business use case), in which case Jonathan was
    incorrect when he said "look up UTF-8".
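To make the "one mapping, several encodings" point concrete, here is a small sketch that reports how many bytes a single code point occupies in each encoding form. The function names are hypothetical, and the sketch assumes its input is a valid Unicode scalar value (not a surrogate, and no larger than U+10FFFF).

```cpp
#include <cstddef>

// Bytes needed to encode one Unicode scalar value in UTF-8.
constexpr std::size_t utf8_size(char32_t cp)
{
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

// UTF-16 uses one 16-bit unit inside the Basic Multilingual Plane
// and a surrogate pair (two units) above it.
constexpr std::size_t utf16_size(char32_t cp)
{
    return cp < 0x10000 ? 2 : 4;
}

// UTF-32 always uses one 32-bit unit.
constexpr std::size_t utf32_size(char32_t)
{
    return 4;
}
```

Under these rules 'H' (U+0048) takes 1, 2, and 4 bytes in UTF-8, UTF-16, and UTF-32 respectively, while a hiragana character such as U+3042 takes 3, 2, and 4. UCS-2 differs from UTF-16 only in that it has no surrogate pairs, so it cannot represent code points above U+FFFF.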
     
    Joshua Maurice, Apr 13, 2010
    #6
  7. Immortal Nephi

    Jonathan Lee Guest

    On Apr 13, 6:28 pm, Joshua Maurice <> wrote:
    > On Apr 13, 12:35 pm, Jonathan Lee <> wrote:
    > > You really should look up what UTF-8 is.

    > Odd. Did google just drop my reply? Sorry if this is a double post of
    > roughly the same content.
    >
    > Anyway, you are both making the same mistake.


    Well.. no. My point was that he should look up UTF-8 to
    discover that there is more than the UTF-16 way of
    encoding Unicode. I think Victor was implying something
    of the same. If "Hello World!" were encoded in an
    unspecified Unicode encoding, *how would* Immortal
    Nephi know it weren't in Unicode?

    --Jonathan
     
    Jonathan Lee, Apr 14, 2010
    #7
  8. Immortal Nephi

    James Kanze Guest

    On Apr 14, 5:40 pm, Juha Nieminen <> wrote:
    > On 04/13/2010 08:53 PM, Immortal Nephi wrote:


    > > Why do data.txt contain non-unicode text? It should be
    > > unicode text.


    > What do you mean by "unicode text". Unicode only defines
    > integral values and the characters they correspond to. It does
    > not define how those integral values should be stored in a
    > file.


    Recent versions do. Unicode defines both an "encoding" (a
    mapping between characters and integral values) and a number of
    different "encoding forms", which define how these integral
    values are represented in a linear sequence of 8 bit bytes.

    Of course, the original poster hasn't indicated what encoding
    form he is expecting: UTF-8, UTF-16LE, etc.

    > There's no such a thing as "unicode text".


    > When you store unicode characters in a file, you have to
    > decide on an encoding format. UTF-8 is one popular encoding
    > format. UTF-16 is another.


    For all practical purposes, the only encoding form used outside
    of a program should be UTF-8. This is the standard encoding
    form for the Internet, for example.

    --
    James Kanze
     
    James Kanze, Apr 15, 2010
    #8
  9. Immortal Nephi

    James Kanze Guest

    On 15 Apr, 11:14, Juha Nieminen <> wrote:
    > On 04/15/2010 03:23 AM, James Kanze wrote:


    > > For all practical purposes, the only encoding form used outside
    > > of a program should be UTF-8. This is the standard encoding
    > > form for the Internet, for example.


    > Why? For example Japanese text takes considerably less space when
    > encoded with UTF-16 instead of UTF-8 (most characters take 2 bytes with
    > UTF-16 but 3 bytes with UTF-8) so it's more space-efficient.


    Oh, not for technical reasons. Just because the Internet is
    8 bits, and practically everything reads and writes in 8 bit
    units. Use any encoding format but UTF-8, and you'll be bit by
    issues of byte order sooner or later.
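The byte-order issue can be sketched like this: the same UTF-16 code unit serialises to two different byte sequences depending on endianness, so a reader has to be told (or guess from a byte order mark) which one was used. UTF-8 is defined as a byte sequence and has no such ambiguity. All names below are hypothetical helpers for illustration.

```cpp
#include <array>
#include <cstdint>

using Bytes2 = std::array<std::uint8_t, 2>;

// Little-endian: low byte first.
Bytes2 utf16_le(char16_t unit)
{
    return {{ static_cast<std::uint8_t>(unit & 0xFF),
              static_cast<std::uint8_t>(unit >> 8) }};
}

// Big-endian: high byte first.
Bytes2 utf16_be(char16_t unit)
{
    return {{ static_cast<std::uint8_t>(unit >> 8),
              static_cast<std::uint8_t>(unit & 0xFF) }};
}

enum class ByteOrder { Little, Big, Unknown };

// Interprets the first two bytes of a file as a UTF-16 byte order mark (U+FEFF).
ByteOrder order_from_bom(std::uint8_t first, std::uint8_t second)
{
    if (first == 0xFF && second == 0xFE) return ByteOrder::Little;
    if (first == 0xFE && second == 0xFF) return ByteOrder::Big;
    return ByteOrder::Unknown;
}
```

For U+3042 the little-endian bytes are 42 30 and the big-endian bytes are 30 42. A file without a BOM gives the reader no reliable way to tell the two apart.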

    --
    James Kanze
     
    James Kanze, Apr 15, 2010
    #9