Write non-unicode in file?

Immortal Nephi

I defined a wofstream variable. Why does Data.txt contain non-Unicode
text? It should be Unicode text.

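#include <fstream>

int main()
{
    std::wofstream wFile;
    // The standard open() takes a narrow path; a wide (L"...") filename
    // is accepted only as a vendor extension.
    wFile.open( "Data.txt", std::ios::out );
    wFile << L"Hello World!" << std::endl;
    wFile.close();

    return 0;
}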
 
Jonathan Lee

Looks like you have to set the locale. Google took me to here:

http://www.velocityreviews.com/forums/t517679-wofstream.html

which describes a similar problem and is answered by James Kanze
(looks like an archive of this newsgroup or something).

Other than that, though, wouldn't "Hello World!" look exactly
the same in UTF-8 as ASCII? Or are you expecting UTF-16?
(I'm not familiar with wofstream...)

--Jonathan
 
Immortal Nephi

How do you know it contains non-Unicode text?  What gives it away?  Do
you look at the file with a hex editor?  What do you see?

Look at the file's size. It reports 13 bytes, so
you know that it is non-Unicode. You can look
at it in a hex editor, which tells you the same. It
should show 0x00 between every two characters, and
the size should be 26 bytes.
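
The 13-byte figure is easy to reproduce. Below is a minimal sketch, assuming a typical implementation where the global locale is still the default "C" locale; the helper name and the file path passed to it are made up for illustration. With that locale, the stream's codecvt facet narrows each wchar_t to a single char on output, so 12 characters plus a newline come out as 13 bytes rather than 26.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: writes `text` through a std::wofstream that keeps the
// default ("C") locale, then returns the number of bytes that reached the file.
std::size_t bytes_written_by_default_wofstream(const std::string& path,
                                               const std::wstring& text)
{
    {
        // Binary mode keeps '\n' as a single byte on every platform.
        std::wofstream out(path, std::ios::out | std::ios::binary | std::ios::trunc);
        out << text;
    }   // leaving the scope closes the stream and flushes its buffer

    // Read the file back as raw bytes and count them.
    std::ifstream in(path, std::ios::binary);
    std::string bytes((std::istreambuf_iterator<char>(in)),
                      std::istreambuf_iterator<char>());
    return bytes.size();
}
```

Calling it with L"Hello World!\n" should report 13 under the assumption above. The wide stream changes the character type used inside the program, not the encoding of the file.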
 
Jonathan Lee

You really should look up what UTF-8 is.

--Jonathan
 
Joshua Maurice

You really should look up what UTF-8 is.

More specifically, you both should look up Unicode. He is using the
old definition of Unicode from the old standard of Unicode where there
was only one encoding. Today, we call this encoding UCS-2 (which is
very similar to UTF-16).

Jonathan Lee and Nephi both make the same mistake. They both assume
"Unicode" means a particular encoding scheme of Unicode Code Points,
Jonathan UTF-8 and Nephi UCS-2. It would help the conversation to stop
making such assumptions or mistakes.

I personally hate standard C++'s handling of all things Unicode, so
I'm sorry I cannot add more to this conversation.
 
Joshua Maurice

You really should look up what UTF-8 is.

Odd. Did google just drop my reply? Sorry if this is a double post of
roughly the same content.

Anyway, you are both making the same mistake. Unicode is not a single
encoding scheme. At least, the current standard of Unicode does not
have a single encoding scheme. Unicode is a mapping from characters to
numbers. It specifies several encodings, such as UTF-8, UTF-16, and
UTF-32. The old Unicode standard had characters for which 16 bits
would suffice, and specified only a single encoding UCS-2 (which is
very similar to UTF-16). Hence the unfortunate habit of some people
calling UCS-2 Unicode, which Nephi is doing now. However, it is just
as big of a mistake to say that Unicode is UTF-8 as Jonathan just did.
It would help the conversation to use correct terms and stop making
assumptions. Perhaps Nephi has a requirement for UTF-16 / UCS-2
encoding (a valid business use case), in which case Jonathan was
incorrect when he said "look up UTF-8".
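
To make the distinction concrete, here is a rough sketch of encoding a single code point as UTF-8 and as UTF-16. The function names are invented for this example, and the input is assumed to be a valid code point (at most 0x10FFFF and not a surrogate), so there is no error handling. UCS-2 corresponds to the first branch of to_utf16 only: it has no surrogate pairs and cannot represent code points above 0xFFFF.

```cpp
#include <string>

// Encode one code point as UTF-8 (1 to 4 bytes).
// Assumes cp is valid: <= 0x10FFFF and not in the surrogate range.
std::string to_utf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80) {                       // ASCII: one byte, unchanged
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // two bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // three bytes (e.g. most Japanese text)
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // four bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Encode one code point as UTF-16 (1 or 2 sixteen-bit code units).
std::u16string to_utf16(char32_t cp)
{
    std::u16string out;
    if (cp < 0x10000) {                    // fits in one unit, same as UCS-2
        out += static_cast<char16_t>(cp);
    } else {                               // surrogate pair, not available in UCS-2
        cp -= 0x10000;
        out += static_cast<char16_t>(0xD800 | (cp >> 10));
        out += static_cast<char16_t>(0xDC00 | (cp & 0x3FF));
    }
    return out;
}
```

The same code point U+0048 ('H') is one byte in UTF-8 and one 16-bit unit in UTF-16, which is why the byte count of a file of ASCII text says which encoding was used and not whether the text is Unicode.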
 
Jonathan Lee

Anyway, you are both making the same mistake.

Well... no. My point was that he should look up UTF-8 to
discover that there is more than the UTF-16 way of
encoding Unicode. I think Victor was implying much
the same. If "Hello World!" were encoded in an
unspecified Unicode encoding, *how would* Immortal
Nephi know it wasn't in Unicode?

--Jonathan
 
James Kanze

On 04/13/2010 08:53 PM, Immortal Nephi wrote:
What do you mean by "unicode text". Unicode only defines
integral values and the characters they correspond to. It does
not define how those integral values should be stored in a
file.

Recent versions do. Unicode defines both an "encoding" (a
mapping between characters and integral values) and a number of
different "encoding forms", which define how these integral
values are represented in a linear sequence of 8 bit bytes.

Of course, the original poster hasn't indicated what encoding
form he is expecting: UTF-8, UTF-16LE, etc.

There's no such thing as "unicode text".
When you store unicode characters in a file, you have to
decide on an encoding format. UTF-8 is one popular encoding
format. UTF-16 is another.

For all practical purposes, the only encoding form used outside
of a program should be UTF-8. This is the standard encoding
form for the Internet, for example.
 
James Kanze

On 04/15/2010 03:23 AM, James Kanze wrote:
Why? For example Japanese text takes considerably less space when
encoded with UTF-16 instead of UTF-8 (most characters take 2 bytes with
UTF-16 but 3 bytes with UTF-8) so it's more space-efficient.

Oh, not for technical reasons. Just because the Internet is
8 bits, and practically everything reads and writes in 8 bit
units. Use any encoding format but UTF-8, and you'll be bitten by
issues of byte order sooner or later.
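
For anyone who does need the 26-byte UTF-16 file from the original question, the byte order has to be chosen explicitly somewhere. The following is only a sketch: it sidesteps wofstream and locale facets entirely and serialises the code units by hand through a narrow binary stream. Both helper names are hypothetical, and no byte order mark is written.

```cpp
#include <fstream>
#include <string>

// Serialise UTF-16 code units into bytes with an explicit byte order.
std::string utf16_bytes(const std::u16string& units, bool little_endian)
{
    std::string bytes;
    bytes.reserve(units.size() * 2);
    for (char16_t u : units) {
        const char lo = static_cast<char>(u & 0xFF);         // low-order byte
        const char hi = static_cast<char>((u >> 8) & 0xFF);  // high-order byte
        if (little_endian) { bytes += lo; bytes += hi; }
        else               { bytes += hi; bytes += lo; }
    }
    return bytes;
}

// Hypothetical helper: writes `text` to `path` as UTF-16LE without a BOM.
// Returns true if the stream is still in a good state after flushing.
bool write_utf16le(const std::string& path, const std::u16string& text)
{
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    const std::string bytes = utf16_bytes(text, true);
    out.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
    out.flush();
    return static_cast<bool>(out);
}
```

Passing u"Hello World!\n" gives 26 bytes with a 0x00 after each ASCII character. Flipping little_endian to false puts the 0x00 first instead, which is exactly the ambiguity a reader of the file has to resolve and that UTF-8 avoids.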
 
