Write non-unicode in file?

Immortal Nephi · Apr 13, 2010

I define a wofstream variable. Why does Data.txt contain non-Unicode
text? It should be Unicode text.

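#include <fstream>

int main()
{
    std::wofstream wFile;
    // Wide file names are accepted by Visual C++; other compilers
    // may need a narrow name here.
    wFile.open( L"Data.txt", std::ios::out );
    wFile << L"Hello World!" << std::endl;
    wFile.close();

    return 0;
}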

Jonathan Lee · Apr 13, 2010

> I define a wofstream variable. Why does Data.txt contain non-Unicode
> text? It should be Unicode text.

Looks like you have to set the locale. Google took me here:

http://www.velocityreviews.com/forums/t517679-wofstream.html

which describes a similar problem and is answered by James Kanze
(looks like an archive of this newsgroup or something).

Other than that, though, wouldn't "Hello World!" look exactly
the same in UTF-8 as ASCII? Or are you expecting UTF-16?
(I'm not familiar with wofstream...)

--Jonathan
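
To see what a wofstream does when no locale is set, here is a minimal sketch that writes through a default-locale stream and reads the raw bytes back. The function name and the file path argument are hypothetical, not from the thread. The conversion is done by the stream locale's codecvt facet and is implementation-defined; on common implementations the default "C" locale should narrow each ASCII character to a single byte, which is the same byte sequence UTF-8 would produce for that text.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Writes text through a wofstream that keeps the default ("C") locale,
// then returns the raw bytes that ended up in the file.
// The path is a hypothetical placeholder supplied by the caller.
std::string bytes_written_by_default_wofstream(const char* path,
                                               const std::wstring& text)
{
    {
        // Binary mode, so no newline translation changes the byte count.
        std::wofstream out(path, std::ios::out | std::ios::binary);
        out << text;
    }   // the stream is flushed and closed here

    // Read the file back as plain bytes.
    std::ifstream in(path, std::ios::in | std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

Calling it with L"Hello World!\n" and checking size() on the result shows how many bytes the stream really wrote, without needing a hex editor.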

Immortal Nephi · Apr 13, 2010

> How do you know it contains non-Unicode text? What gives it away? Do
> you look at the file with a hex editor? What do you see?

Look at the file's size. It reports 13 bytes, so
you know it is non-Unicode. A hex editor tells
you the same. It should show a 0x00 byte between
each pair of characters, and the size should be
26 bytes.
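
The layout described here (two bytes per character, with a 0x00 after each ASCII character) is UTF-16 in little-endian byte order. As a rough sketch of one way to get it, the text can be encoded by hand and written to a binary narrow stream. This is a swapped-in technique, not the locale-based fix suggested earlier in the thread, and the function names are hypothetical.

```cpp
#include <fstream>
#include <string>

// Serializes UTF-16 code units as UTF-16LE bytes, with no byte order mark.
// A std::u16string already holds UTF-16 code units, so each unit is
// simply split into its low byte followed by its high byte.
std::string to_utf16le(const std::u16string& text)
{
    std::string bytes;
    bytes.reserve(text.size() * 2);
    for (char16_t unit : text) {
        bytes.push_back(static_cast<char>(unit & 0xFF));         // low byte first
        bytes.push_back(static_cast<char>((unit >> 8) & 0xFF));  // then high byte
    }
    return bytes;
}

// Writes the encoded bytes to a file opened in binary mode.
// Returns false if the file could not be written.
bool write_utf16le_file(const char* path, const std::u16string& text)
{
    std::ofstream out(path, std::ios::out | std::ios::binary);
    const std::string bytes = to_utf16le(text);
    out.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
    return static_cast<bool>(out);
}
```

For u"Hello World!\n" (13 characters) to_utf16le returns 26 bytes. Many Windows tools also expect a byte order mark (the code unit 0xFEFF) at the start of such a file; this sketch does not write one.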

Jonathan Lee · Apr 13, 2010

> Look at the file's size. It reports 13 bytes, so
> you know it is non-Unicode. A hex editor tells
> you the same. It should show a 0x00 byte between
> each pair of characters, and the size should be
> 26 bytes.

You really should look up what UTF-8 is.

--Jonathan

Joshua Maurice · Apr 13, 2010

> You really should look up what UTF-8 is.

More specifically, you both should look up Unicode. He is using the
old definition of Unicode from the old standard of Unicode where there
was only one encoding. Today, we call this encoding UCS-2 (which is
very similar to UTF-16).

Jonathan Lee and Nephi both make the same mistake. They both assume
"Unicode" means a particular encoding scheme of Unicode Code Points,
Jonathan UTF-8 and Nephi UCS-2. It would help the conversation to stop
making such assumptions or mistakes.

I personally hate standard C++'s handling of all things unicode, so
I'm sorry I cannot add more to this conversation.

Joshua Maurice · Apr 13, 2010

> You really should look up what UTF-8 is.

Odd. Did google just drop my reply? Sorry if this is a double post of
roughly the same content.

Anyway, you are both making the same mistake. Unicode is not a single
encoding scheme. At least, the current standard of Unicode does not
have a single encoding scheme. Unicode is a mapping from characters to
numbers. It specifies several encodings, such as UTF-8, UTF-16, and
UTF-32. The old Unicode standard had characters for which 16 bits
would suffice, and specified only a single encoding UCS-2 (which is
very similar to UTF-16). Hence the unfortunate habit of some people
calling UCS-2 Unicode, which Nephi is doing now. However, it is just
as big of a mistake to say that Unicode is UTF-8 as Jonathan just did.
It would help the conversation to use correct terms and stop making
assumptions. Perhaps Nephi has a requirement for UTF-16 / UCS-2
encoding (a valid business use case), in which case Jonathan was
incorrect when he said "look up UTF-8".
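
To make the distinction concrete, here is a small sketch that takes one code point (the number Unicode assigns to a character) and produces its UTF-8 and UTF-16 code units. The function names are hypothetical, and the code assumes a valid code point and does no error checking. UTF-32 needs no function at all, since the code point itself is the single 32-bit code unit.

```cpp
#include <string>

// UTF-8 encoding form: one to four 8-bit code units per code point.
// Assumption: cp is a valid Unicode scalar value (at most 0x10FFFF and
// not a surrogate); this sketch does no validation.
std::string encode_utf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);                         // same byte as ASCII
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// UTF-16 encoding form: one 16-bit code unit, or a surrogate pair for
// code points above 0xFFFF. UCS-2 has no surrogate pairs, which is
// where it differs from UTF-16.
std::u16string encode_utf16(char32_t cp)
{
    std::u16string out;
    if (cp < 0x10000) {
        out += static_cast<char16_t>(cp);
    } else {
        const char32_t v = cp - 0x10000;
        out += static_cast<char16_t>(0xD800 + (v >> 10));     // high surrogate
        out += static_cast<char16_t>(0xDC00 + (v & 0x3FF));   // low surrogate
    }
    return out;
}
```

For U+20AC (the euro sign), encode_utf8 gives the three bytes E2 82 AC and encode_utf16 gives the single unit 0x20AC. For U+1F600, which lies outside the 16-bit range, encode_utf16 gives the pair D83D DE00.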

Jonathan Lee · Apr 14, 2010

> Odd. Did google just drop my reply? Sorry if this is a double post of
> roughly the same content.
>
> Anyway, you are both making the same mistake.

Well.. no. My point was that he should look up UTF-8 to
discover that there is more than the UTF-16 way of
encoding Unicode. I think Victor was implying something
of the same. If "Hello World!" were encoded in an
unspecified Unicode encoding, *how would* Immortal
Nephi know it wasn't in Unicode?

--Jonathan

James Kanze · Apr 15, 2010

On 04/13/2010 08:53 PM, Immortal Nephi wrote:

> What do you mean by "unicode text". Unicode only defines
> integral values and the characters they correspond to. It does
> not define how those integral values should be stored in a
> file.

Recent versions do. Unicode defines both an "encoding" (a
mapping between characters and integral values) and a number of
different "encoding forms", which define how these integral
values are represented in a linear sequence of 8 bit bytes.

Of course, the original poster hasn't indicated what encoding
form he is expecting: UTF-8, UTF-16LE, etc.

> There's no such thing as "unicode text".
>
> When you store unicode characters in a file, you have to
> decide on an encoding format. UTF-8 is one popular encoding
> format. UTF-16 is another.

For all practical purposes, the only encoding form used outside
of a program should be UTF-8. This is the standard encoding
form for the Internet, for example.

James Kanze · Apr 15, 2010

On 04/15/2010 03:23 AM, James Kanze wrote:

> Why? For example Japanese text takes considerably less space when
> encoded with UTF-16 instead of UTF-8 (most characters take 2 bytes with
> UTF-16 but 3 bytes with UTF-8) so it's more space-efficient.

Oh, not for technical reasons. Just because the Internet is
8 bits, and practically everything reads and writes in 8 bit
units. Use any encoding format but UTF-8, and you'll be bit by
issues of byte order sooner or later.
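
The byte order issue can be shown with a single 16-bit code unit. The sketch below (hypothetical function names) serializes one UTF-16 code unit both ways. UTF-8 has no such choice to make, because its code units are already single bytes.

```cpp
#include <array>
#include <cstdint>

// Serializes one UTF-16 code unit in little-endian order (UTF-16LE).
std::array<std::uint8_t, 2> utf16_unit_le(char16_t unit)
{
    return {{ static_cast<std::uint8_t>(unit & 0xFF),    // low byte first
              static_cast<std::uint8_t>(unit >> 8) }};
}

// Serializes the same code unit in big-endian order (UTF-16BE).
std::array<std::uint8_t, 2> utf16_unit_be(char16_t unit)
{
    return {{ static_cast<std::uint8_t>(unit >> 8),      // high byte first
              static_cast<std::uint8_t>(unit & 0xFF) }};
}
```

For the hiragana letter U+3042, utf16_unit_le yields 42 30 and utf16_unit_be yields 30 42, so a reader has to know or detect which order was used. The UTF-8 form of the same character is always the three bytes E3 81 82, which is also the 2-byte versus 3-byte difference mentioned in the quoted question.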
