Confusion between UTF-8 and Unicode


Celia

I've looked up UTF-8 and Unicode in the Wikipedia, and at Dictionary.com,
but I'm not grokking it yet.

From what I understand:

Unicode:
Every human language character 'a', '7', '*', etc is converted into a
16 bit number.

UTF-8:
Every human language character is converted into a 1 or 2 byte number
to make it align with ASCII and be useable with non Unicode enabled apps.


According to Wikipedia:
UTF-8 (8-bit Unicode Transformation Format) is a lossless, variable-
length character encoding for Unicode...


If these are correct descriptions then that would make UTF-8 _not_
something which is on top of, or for, unicode, but a variation of unicode.
I thought Unicode _is_ a character encoding.



Please show me my ignorance.
Non-technical analogies would be particularly helpful.



-C
 

shakah

I think Unicode is always two 8-bit bytes per character, UTF-8 is
either 1- or 2-bytes per character.
 

Malte

Celia wrote:


This is, I think, as good a description as any ;-)

malte@linux:~> whatis unicode
Unicode (7) - (unknown subject)
unicode (7) - the Universal Character Set
malte@linux:~> whatis utf-8
UTF-8 (7) - (unknown subject)
utf-8 (7) - an ASCII compatible multi-byte Unicode encoding

http://www.google.dk/search?hl=da&lr=&oi=defmore&q=define:UTF-8

The above link will show you a lot of links to stuff that basically says
the same.
 

Steve W. Jackson

Celia <[email protected]> said:
I've looked up UTF-8 and Unicode in the Wikipedia, and at Dictionary.com,
but I'm not grokking it yet.

From what I understand:

Unicode:
Every human language character 'a', '7', '*', etc is converted into a
16 bit number.

UTF-8:
Every human language character is converted into a 1 or 2 byte number
to make it align with ASCII and be useable with non Unicode enabled apps.


According to Wikipedia:
UTF-8 (8-bit Unicode Transformation Format) is a lossless, variable-
length character encoding for Unicode...


If these are correct descriptions then that would make UTF-8 _not_
something which is on top of, or for, unicode, but a variation of unicode.
I thought Unicode _is_ a character encoding.



Please show me my ignorance.
Non-technical analogies would be particularly helpful.



-C

Rather than looking up somebody's definition, pay a visit to a solid
source: <http://www.unicode.org/>. That should get you rolling.

= Steve =
 

Alan Moore

If these are correct descriptions then that would make UTF-8 _not_
something which is on top of, or for, unicode, but a variation of unicode.
I thought Unicode _is_ a character encoding.

An encoding is a way to translate a stream of bits (on disk, in
memory, etc.) into characters. Unicode is not an encoding, it's a
character set: a way of assigning numeric values to characters, like
ASCII. With ASCII, we never needed to make the distinction between a
character set and an encoding, because each byte represents a
character. But Unicode characters can have values up to 2^32, which
means we would need four bytes to represent each character if we were
to use the same approach to encoding as we do with ASCII.
(Originally, Unicode character values only went up to 2^16, but they
discovered that wasn't sufficient.) That was a pretty difficult pill
for programmers and administrators to swallow, so UTF-8 was invented
as a compromise. Characters in the 7-bit ASCII range only take one
byte to encode, while two bytes can convey the extended ASCII
characters plus many characters from other Western character sets. As
the character values become larger, UTF-8 becomes less efficient than
a simple numeric-value encoding, but if you're dealing mainly with
ASCII, it works very well.

This is all terribly simplified but I hope I've made the point:
Unicode is not an encoding, it's a character set (THE character set,
if you will). UTF-8 is an encoding of the Unicode character set.
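
To make that distinction concrete, here's a minimal Java sketch (the class
name and the sample character are just illustrative) showing both sides: the
abstract code point that the character set assigns, and the bytes that the
UTF-8 encoding produces for it.

import java.io.UnsupportedEncodingException;

public class CharsetVsEncoding {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String s = "\u00e9";   // one Unicode character: e-acute, code point U+00E9

        // The character-set side: one abstract numeric value.
        System.out.println(Integer.toHexString(s.codePointAt(0)));   // prints "e9"

        // The encoding side: how that value is laid out as bytes.
        for (byte b : s.getBytes("UTF-8")) {
            System.out.printf("%02X ", b);   // prints "C3 A9" -- two bytes, one character
        }
        System.out.println();
    }
}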
 

Edwin Martin

Celia said:
I've looked up UTF-8 and Unicode in the Wikipedia, and at Dictionary.com,
but I'm not grokking it yet.

UTF-8 is an encoding of Unicode in such a way that a plain ASCII file is
also a valid UTF-8 file (with the same contents, of course).
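
One quick way to see that in Java (just an illustrative snippet; the class
name is mine): encode a pure-ASCII string as US-ASCII and as UTF-8 and
compare the bytes.

import java.util.Arrays;

public class AsciiIsUtf8 {
    public static void main(String[] args) throws Exception {
        String ascii = "Hello, world!";
        byte[] asAscii = ascii.getBytes("US-ASCII");
        byte[] asUtf8  = ascii.getBytes("UTF-8");
        // For pure ASCII text the two byte sequences are identical,
        // which is why a plain ASCII file is already a valid UTF-8 file.
        System.out.println(Arrays.equals(asAscii, asUtf8));   // prints "true"
    }
}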

See also:

The Absolute Minimum Every Software Developer Absolutely, Positively
Must Know About Unicode and Character Sets (No Excuses!)

http://www.joelonsoftware.com/articles/Unicode.html

Edwin Martin
 

Lee Fesperman

Celia said:
I've looked up UTF-8 and Unicode in the Wikipedia, and at Dictionary.com,
but I'm not grokking it yet.

From what I understand:

Unicode:
Every human language character 'a', '7', '*', etc is converted into a
16 bit number.

UTF-8:
Every human language character is converted into a 1 or 2 byte number
to make it align with ASCII and be useable with non Unicode enabled apps.

BTW, UTF-8 also produces 3 byte results. It has to in order to be lossless in encoding
16-bit Unicode (think about it). Also, it doesn't encode every human language character
because it only encodes 16-bit Unicode.
 

Aquila Deus

Celia said:
I've looked up UTF-8 and Unicode in the Wikipedia, and at Dictionary.com,
but I'm not grokking it yet.

From what I understand:

Unicode:
Every human language character 'a', '7', '*', etc is converted into a
16 bit number.

UTF-8:
Every human language character is converted into a 1 or 2 byte number
to make it align with ASCII and be useable with non Unicode enabled apps.


According to Wikipedia:
UTF-8 (8-bit Unicode Transformation Format) is a lossless, variable-
length character encoding for Unicode...


If these are correct descriptions then that would make UTF-8 _not_
something which is on top of, or for, unicode, but a variation of unicode.
I thought Unicode _is_ a character encoding.

Ummmm, because many people and a lot of software actually refer to the
UTF-16 encoding when they use the word "Unicode".
 

blake.ong

Is it true that a webpage for example... which uses Unicode can display
different languages in the same webpage? while UTF-8 cant....?
 

Antti S. Brax

Is it true that a webpage for example... which uses Unicode can display
different languages in the same webpage? while UTF-8 cant....?

That is incorrect information.

Unicode defines a _set_ _of_ _characters_.

UTF-8 defines a way to represent the characters defined by
Unicode in binary (== bytes). There are also other ways to
represent unicode characters in binary than just UTF-8
(for example UTF-16BE and UTF-16LE).

Web pages and different languages are most likely related
to XML and its lang attribute.
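
As a rough illustration (a Java sketch; the class name and the sample text
are mine), the same two characters come out as three different byte
sequences under three different encodings of the one character set:

public class OneCharSetManyEncodings {
    public static void main(String[] args) throws Exception {
        String s = "A\u00e5";   // 'A' (U+0041) and 'a-with-ring' (U+00E5)
        dump("UTF-8   ", s.getBytes("UTF-8"));      // 41 C3 A5
        dump("UTF-16BE", s.getBytes("UTF-16BE"));   // 00 41 00 E5
        dump("UTF-16LE", s.getBytes("UTF-16LE"));   // 41 00 E5 00
    }

    private static void dump(String label, byte[] bytes) {
        StringBuilder sb = new StringBuilder(label).append(": ");
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b));
        }
        System.out.println(sb);   // same characters, different bytes
    }
}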
 

Aquila Deus

Antti said:
Is it true that a webpage for example... which uses Unicode can display
different languages in the same webpage?

Yes.

while UTF-8 cant....?

Yes it can. UTF-8 is just one of the Unicode encodings.

That is incorrect information.

Unicode defines a _set_ _of_ _characters_.

UTF-8 defines a way to represent the characters defined by
Unicode in binary (== bytes). There are also other ways to
represent unicode characters in binary than just UTF-8
(for example UTF-16BE and UTF-16LE).

Exactly right; what's confusing is that many people use the term "Unicode"
to refer to the UTF-16BE and/or UTF-16LE encodings.

Web pages and different languages are most likely related
to XML and its lang attribute.

Save all files in UTF-8 and then you don't need to worry about
languages anymore :)
 

Bryce

I've looked up UTF-8 and Unicode in the Wikipedia, and at Dictionary.com,
but I'm not grokking it yet.

Read this:
http://www.joelonsoftware.com/articles/Unicode.html
From what I understand:

Unicode:
Every human language character 'a', '7', '*', etc is converted into a
16 bit number.

UTF-8:
Every human language character is converted into a 1 or 2 byte number
to make it align with ASCII and be useable with non Unicode enabled apps.

UTF-8 is a variable-length encoding. For normal US-ASCII characters,
only a single byte is required.

According to Wikipedia:
UTF-8 (8-bit Unicode Transformation Format) is a lossless, variable-
length character encoding for Unicode...
True

If these are correct descriptions then that would make UTF-8 _not_
something which is on top of, or for, unicode, but a variation of unicode.
I thought Unicode _is_ a character encoding.

No, Unicode is a "dictionary" of all characters. UTF-8 is an encoding;
it's a way of representing that Unicode character in memory.

For example:
Let's take the Euro symbol. In Unicode, it's represented as:
U+20AC

It's represented in UTF-8 in memory as:
E2 82 AC

Please show me my ignorance.
Non-technical analogies would be particularly helpful.

Read the article I posted above, and it should shed some light on the
subject.
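
For anyone who wants to check the Euro example, something along these lines
(a Java snippet; the class name is mine) prints the UTF-8 bytes:

public class EuroUtf8 {
    public static void main(String[] args) throws Exception {
        String euro = "\u20ac";   // the Euro sign, U+20AC in the Unicode character set
        for (byte b : euro.getBytes("UTF-8")) {
            System.out.printf("%02X ", b);   // prints "E2 82 AC" -- three bytes in UTF-8
        }
        System.out.println();
    }
}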
 

Bryce

An encoding is a way to translate a stream of bits (on disk, in
memory, etc.) into characters. Unicode is not an encoding, it's a
character set: a way of assigning numeric values to characters, like
ASCII. With ASCII, we never needed to make the distinction between a
character set and an encoding, because each byte represents a
character. But Unicode characters can have values up to 2^32, which
means we would need four bytes to represent each character if we were
to use the same approach to encoding as we do with ASCII.
(Originally, Unicode character values only went up to 2^16, but they
discovered that wasn't sufficient.) That was a pretty difficult pill
for programmers and administrators to swallow, so UTF-8 was invented
as a compromise. Characters in the 7-bit ASCII range only take one
byte to encode, while two bytes can convey the extended ASCII
characters plus many characters from other Western character sets. As
the character values become larger, UTF-8 becomes less efficient than
a simple numeric-value encoding, but if you're dealing mainly with
ASCII, it works very well.

This is all terribly simplified but I hope I've made the point:
Unicode is not an encoding, it's a character set (THE character set,
if you will). UTF-8 is an encoding of the Unicode character set.

Couldn't have said it any better.
 

Bryce

BTW, UTF-8 also produces 3 byte results. It has to in order to be lossless in encoding
16-bit Unicode (think about it). Also, it doesn't encode every human language character
because it only encodes 16-bit Unicode.

UTF-8 can have up to 6 bytes.
 

Bryce

Is it true that a webpage for example... which uses Unicode can display
different languages in the same webpage? while UTF-8 cant....?

A webpage can't "use" Unicode. It is either UTF-8, or some other
encoding.
 

Lee Fesperman

Bryce said:
UTF-8 can have up to 6 bytes.

Oops, I didn't realize that. I'm afraid my information had come from reverse
engineering. After thinking on it later, I had come to the conclusion that the encoding
could support 32-bit Unicode with 6 bytes. Thanks for the correction.
 

Chris Smith

Lee Fesperman said:
Oops, I didn't realize that. I'm afraid my information had come from reverse
engineering. After thinking on it later, I had come to the conclusion that the encoding
could support 32-bit Unicode with 6 bytes. Thanks for the correction.

It's actually even more complicated than that. Java, in all cases where
it implements UTF-8, supports a kind of pseudo-UTF-8. This Java-
specific encoding first encodes the Unicode text as UTF-16, and then
uses only the 1-byte, 2-byte, and 3-byte forms of UTF-8. So it's
*correct* to say that UTF-8 can be up to six bytes long, but it's
perhaps misleading in the context of Java unless a disclaimer is added.
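
A small sketch, if you want to see the difference for yourself (the class
name is mine; DataOutputStream.writeUTF() is one place the modified form
shows up, and it also prepends a two-byte length):

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;

public class ModifiedUtf8Demo {
    public static void main(String[] args) throws Exception {
        String s = "A\0";   // 'A' followed by NUL (U+0000)

        // Standard UTF-8: 'A' -> 41, NUL -> 00
        for (byte b : s.getBytes("UTF-8")) {
            System.out.printf("%02X ", b);   // prints "41 00"
        }
        System.out.println();

        // Java's modified UTF-8, as written by writeUTF(): NUL becomes C0 80.
        // The first two bytes are writeUTF's length prefix, not character data.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeUTF(s);
        for (byte b : buf.toByteArray()) {
            System.out.printf("%02X ", b);   // prints "00 03 41 C0 80"
        }
        System.out.println();
    }
}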

--
www.designacourse.com
The Easiest Way To Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation
 

Antti S. Brax

Save all files in UTF-8 and then you don't need to worry about
languages anymore :)

Saving pages in UTF-8 only relieves me from worrying about
encoding Å, Ä and Ö. Using UTF-8 won't magically give the
English-speaking world a clue about how to pronounce them.
:)
 

Chris Uppal

Chris said:
It's actually even more complicated than that.

I hate to say it, but you are over-simplifying ;-)

Unfortunately, the picture has become quite confused (and Sun, IMO, have
unnecessarily and irresponsibly added to this). So here's my attempt to add to
the confusion...

Let's start with UTF-8. There are two "official" standards for the encoding
known as UTF-8. One is in ISO/IEC 10646 (which I haven't read, btw, I'm going
on hearsay here), and is summarised in RFC 2279. That defines an encoding of
31-bit values in up to 6 bytes. I believe the same encoding would work
perfectly well for the full 32-bit range, but it is artificially limited to
31-bit values. The second "official" standard for UTF-8 is that of the
Unicode consortium; their version of it is identical to the ISO version except
that it is further limited (artificially) to the 24-bit range, and hence never
requires more than 4 bytes to encode a value. IMO, this is a mistake on the
part of the Unicode people -- implementations should be required to decode the
full ISO range (including the extended private use area) rather than being
required (as I understand it) to abort with an error if ISO-encoded >24-bit
data is encountered. Still, in practice, for Unicode data (which is always 24
bit or less) there is no difference between the formats.

Now Sun enter the picture. Start with the situation before Java 5. Java (as
of then) used Unicode internally. Not any /encoding/, just pure abstract
Unicode
data -- each String corresponds to a sequence of characters from the Unicode
repertoire. That's all very nice and clear, unfortunately there are a couple
of snakes in this Eden.

One is that the primitive type 'char' is a 16-bit quantity, so most Unicode
characters cannot be represented in Java. Fortunately those characters (the
ones outside the 16-bit range) are used relatively infrequently, so we mostly
managed to get along with Java the way it was. It's obviously a problem
waiting to happen, though, especially if a Java program is receiving Unicode
data from a source that is not hamstrung by a crippled Unicode implementation.
(XML data is Unicode, for instance, and it'd be unfortunate if a Java XML
implementation barfed when faced with perfectly valid XML).

The second problem is less severe -- in fact it only causes confusion, not
actual functional limitations. Sun decided to define their own encoding for
Unicode data. I have no problems with that, it's a sensible encoding for its
purpose(s). Where they displayed flabbergasting irresponsibility was to call
it "UTF-8" too. Admittedly it's closely related to UTF-8, but it is neither
upwardly nor downwardly compatible with it. That encoding (call it
pseudo-UTF-8) can only encode values in the 16 bit range, and so never uses
more than 3 bytes per "character" (however it uses 2 bytes for 0, whereas true
UTF-8 uses only 1 byte). Since Sun blithely named various methods that
manipulate data in this format with some variation on 'utf' (e.g.
DataOutputStream.writeUTF() or the JNI function GetStringUTFChars()), that
has added to the confusion. OTOH, the CharsetEncoder called "UTF-8" does
perform true UTF-8 encoding (not pseudo-UTF-8), at least for the sequences of
16-bit limited 'char's that could be fed to it prior to Java 5.

But Java programmers are rarely satisfied. We demand ever greater complexity,
baroque over-engineering piled on confounding intricacy. So Sun, responding as
ever to the needs of the community, decided to Act...

Java 5 adds another layer of confusion. To Sun's credit, the misnamed
references to "UTF-8" have been clearly documented as such (but not, alas,
deprecated and renamed). However it was necessary to do something about the
16-bit limit. To be honest, I don't think that Sun had any choice in the
solution they've adopted, but that doesn't make it any less vile.

Since Java 5, Strings (and similar) are no longer pure abstractions of Unicode
character sequences. The 'char' datatype no longer represents (in any useful
sense) a Unicode character. No, by fiat the objects that used to hold pure
abstract Unicode, now contain an /encoded/ representation -- specifically
UTF-16. The so-called 'char' datatype no longer holds pure Unicode characters,
but instead is used to hold the 16-bit quantities that are used by the UTF-16
encoding. String.charAt() no longer returns the nth character of the Unicode
string, but returns the nth 16-bit value from the UTF-16 encoding of the
Unicode string (and, as such, is useless in any context that is about the
textual meaning of the string -- Character.isUpperCase(char) for instance no
longer makes any sense at all). Actual semantic textual elements are now
represented as 'int's. (Of course, Unicode makes it clear that the
"characters" in a Unicode sequence do not necessarily map directly onto the
"textual elements" that a human reader would perceive -- there are diacritical
marks and so on -- but that's just another delicious layer of complexity in the
cake...)
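
A short Java 5+ sketch of that char-vs-code-point split (the class name is
mine, and MUSICAL SYMBOL G CLEF is just a convenient character outside the
16-bit range):

public class SurrogateDemo {
    public static void main(String[] args) {
        String s = "\uD834\uDD1E";   // U+1D11E, stored as a UTF-16 surrogate pair

        System.out.println(s.length());                        // 2 -- counts 16-bit 'char' values
        System.out.println(s.codePointCount(0, s.length()));   // 1 -- one actual Unicode character

        System.out.println(Integer.toHexString(s.charAt(0)));       // d834 -- just a surrogate
        System.out.println(Integer.toHexString(s.codePointAt(0)));  // 1d11e -- the real code point
    }
}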

Incidentally, this means that some legal Java Strings are no longer legal
Unicode. Not merely that they can (in principle) contain sequences that are
meaningless when interpreted as UTF-16, but that they can contain sequences
that conforming Unicode implementations are required to reject (for security
reasons). I am reasonably hopeful that the Unicode CharsetEncoders will detect
such malformed sequences and refuse to generate correspondingly malformed (and
illegal) byte-sequences, but I haven't yet checked.

All this is pretty unfortunate. We are left in a position where we can either
do our own handling of the UTF-16 encoding (very error prone, especially as
many mistaken assumptions about the textual meaning of 'char' values won't be
caught by the compiler /or/ by unsophisticated testing), or switch over to
using the newer APIs (which are unnecessarily clunky, IMO. For instance, why
is there no easy way to iterate over the logical elements of a String? They are
also confusingly low-level and technical, with much talk of 'surrogate pairs' and
so on). Or, I suppose, we could create our own Unicode-aware objects and use
those in preference to the supplied 'char' and java.lang.String, but then what
do we do with all the other software that expects to work with Strings (and
similar) ?

Oh yes, and what about quasi-UTF-8 ? Sun have seized the bull by the horns and
/made no change/... An admittedly ingenious solution to a technical problem --
arguably even quite elegant. But it does mean that the JVM communicates with
the real world using data that is encoded twice; 24-bit Unicode data is first
encoded into UTF-16, and then that is encoded again using the old quasi-UTF-8
format. Thus a 24-bit character can require 1, 2, 3 or 6 bytes to encode.

I love this stuff. Just love it...

-- chris
 

Alan Moore

I love this stuff. Just love it...

We can tell. ;)

I'm sure this was far more than the OP wanted to know, but it clears
up some questions I've had for a while, so thanks, Chris. If only we
could go back and make 'char' a 32-bit type...
 
