A few questions about encoding

Discussion in 'Python' started by Νικόλαος Κούρας, Jun 9, 2013.

  1. A few questions about encoding please:

    >> Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for
    >> values up to 256?


    >Because then how do you tell when you need one byte, and when you need
    >two? If you read two bytes, and see 0x4C 0xFA, does that mean two
    >characters, with ordinal values 0x4C and 0xFA, or one character with
    >ordinal value 0x4CFA?


    I mean utf-8 could use 1 byte for storing the 1st 256 characters. I meant up to 256, not above 256.


    >> UTF-8 and UTF-16 and UTF-32
    >> I thought the number beside UTF- was to declare how many bits the
    >> character set was using to store a character into the hdd, no?


    >Not exactly, but close. UTF-32 is completely 32-bit (4 byte) values.
    >UTF-16 mostly uses 16-bit values, but sometimes it combines two 16-bit
    >values to make a surrogate pair.


    A surrogate pair is like hitting, for example, Ctrl-A, which means it is a combination character that consists of 2 different characters?
    Is this what a surrogate is? A pair of 2 chars?


    >UTF-8 uses 8-bit values, but sometimes
    >it combines two, three or four of them to represent a single code-point.


    'a' to be utf8 encoded needs 1 byte to be stored ? (since ordinal = 65)
    'α΄' to be utf8 encoded needs 2 bytes to be stored ? (since ordinal is > 127 )
    'a chinese ideogram' to be utf8 encoded needs 4 bytes to be stored ? (since ordinal > 65000 )

    The number of bytes needed to store a character solely depends on the character's ordinal value in the Unicode table?
     
    Νικόλαος Κούρας, Jun 9, 2013
    #1

  2. On 9 Jun 2013 11:49, "Νικόλαος Κούρας" <> wrote:
    >
    > A few questions about encoding please:
    >
    > >> Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for
    > >> values up to 256?

    >
    > >Because then how do you tell when you need one byte, and when you need
    > >two? If you read two bytes, and see 0x4C 0xFA, does that mean two
    > >characters, with ordinal values 0x4C and 0xFA, or one character with
    > >ordinal value 0x4CFA?

    >
    > I mean utf-8 could use 1 byte for storing the 1st 256 characters. I meant
    > up to 256, not above 256.
    >
    >
    > >> UTF-8 and UTF-16 and UTF-32
    > >> I thought the number beside UTF- was to declare how many bits the
    > >> character set was using to store a character into the hdd, no?

    >
    > >Not exactly, but close. UTF-32 is completely 32-bit (4 byte) values.
    > >UTF-16 mostly uses 16-bit values, but sometimes it combines two 16-bit
    > >values to make a surrogate pair.

    >
    > A surrogate pair is like hitting, for example, Ctrl-A, which means it is a
    > combination character that consists of 2 different characters?
    > Is this what a surrogate is? A pair of 2 chars?
    >
    >
    > >UTF-8 uses 8-bit values, but sometimes
    > >it combines two, three or four of them to represent a single code-point.

    >
    > 'a' to be utf8 encoded needs 1 byte to be stored ? (since ordinal = 65)
    > 'α΄' to be utf8 encoded needs 2 bytes to be stored ? (since ordinal is > 127 )
    > 'a chinese ideogram' to be utf8 encoded needs 4 bytes to be stored ? (since ordinal > 65000 )
    >
    > The number of bytes needed to store a character solely depends on the
    > character's ordinal value in the Unicode table?
    > --
    > http://mail.python.org/mailman/listinfo/python-list


    In short, a utf-8 character takes 1 to 4 bytes. A utf-16 character takes 2
    to 4 bytes. A utf-32 always takes 4 bytes.

    The process of converting characters to bytes is called encoding. The
    opposite is decoding. This is all made transparent in python with the
    encode() and decode() methods. You normally don't care about this kind of
    thing.
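
    A minimal sketch of that round trip (my own addition, not part of the
    original post), assuming Python 3:

    # str.encode() turns characters into bytes; bytes.decode() reverses it.
    text = 'αβγ'                       # three Greek characters
    data = text.encode('utf-8')        # b'\xce\xb1\xce\xb2\xce\xb3'
    assert data.decode('utf-8') == text
    print(len(text), len(data))        # 3 characters, 6 bytes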
     
    Fábio Santos, Jun 9, 2013
    #2

  3. Νικόλαος Κούρας

    Nobody Guest

    On Sun, 09 Jun 2013 03:44:57 -0700, Νικόλαος Κούρας wrote:

    >>> Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for
    >>> values up to 256?

    >
    >>Because then how do you tell when you need one byte, and when you need
    >>two? If you read two bytes, and see 0x4C 0xFA, does that mean two
    >>characters, with ordinal values 0x4C and 0xFA, or one character with
    >>ordinal value 0x4CFA?

    >
    > I mean utf-8 could use 1 byte for storing the 1st 256 characters. I
    > meant up to 256, not above 256.


    But then you've used up all 256 possible bytes for storing the first 256
    characters, and there aren't any left for use in multi-byte sequences.

    You need some means to distinguish between a single-byte character and an
    individual byte within a multi-byte sequence.

    UTF-8 does that by allocating specific ranges to specific purposes.
    0x00-0x7F are single-byte characters, 0x80-0xBF are continuation bytes of
    multi-byte sequences, 0xC0-0xFF are leading bytes of multi-byte sequences.

    This scheme has the advantage of making UTF-8 non-modal, i.e. if a byte is
    corrupted, added or removed, it will only affect the character containing
    that particular byte; the decoder can re-synchronise at the beginning of
    the following character.

    OTOH, with encodings such as UTF-16, UTF-32 or ISO-2022, adding or
    removing a byte will result in desynchronisation, with all subsequent
    characters being corrupted.
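
    A small sketch of those byte ranges (my own illustration, assuming
    Python 3): classify each byte of a UTF-8 encoded string.

    def utf8_byte_kind(b):
        # Ranges as described above: 0x00-0x7F single-byte, 0x80-0xBF
        # continuation, 0xC0-0xFF leading byte of a multi-byte sequence.
        if b <= 0x7F:
            return 'single-byte (ASCII)'
        if b <= 0xBF:
            return 'continuation'
        return 'leading'

    for b in 'aα中'.encode('utf-8'):   # 1-, 2- and 3-byte characters
        print(hex(b), utf8_byte_kind(b))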

    > A surrogate pair is like hitting, for example, Ctrl-A, which means it is a
    > combination character that consists of 2 different characters? Is this
    > what a surrogate is? A pair of 2 chars?


    A surrogate pair is a pair of 16-bit codes used to represent a single
    Unicode character whose code is greater than 0xFFFF.

    The 2048 codepoints from 0xD800 to 0xDFFF inclusive aren't used to
    represent characters, but "surrogates". Unicode characters with codes
    in the range 0x10000-0x10FFFF are represented in UTF-16 as a pair of
    surrogates. First, 0x10000 is subtracted from the code, giving a value in
    the range 0-0xFFFFF (20 bits). The top ten bits are added to 0xD800 to
    give a value in the range 0xD800-0xDBFF, while the bottom ten bits are
    added to 0xDC00 to give a value in the range 0xDC00-0xDFFF.
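
    A short sketch of that arithmetic (my own illustration, assuming Python 3),
    using U+1D401 as the example codepoint:

    cp = 0x1D401                      # a character above 0xFFFF
    v = cp - 0x10000                  # 20-bit value
    high = 0xD800 + (v >> 10)         # top ten bits  -> high surrogate
    low = 0xDC00 + (v & 0x3FF)        # bottom ten bits -> low surrogate
    print(hex(high), hex(low))        # 0xd835 0xdc01
    # Cross-check against Python's own UTF-16 (big-endian, no BOM) encoder:
    assert chr(cp).encode('utf-16-be') == bytes([high >> 8, high & 0xFF,
                                                 low >> 8, low & 0xFF])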

    Because the codes used for surrogates aren't valid as individual
    characters, scanning a string for a particular character won't
    accidentally match part of a multi-word character.

    > 'a' to be utf8 encoded needs 1 byte to be stored ? (since ordinal = 65)
    > 'α΄' to be utf8 encoded needs 2 bytes to be stored ? (since ordinal is > 127 )
    > 'a chinese ideogram' to be utf8 encoded needs 4 bytes to be stored ? (since ordinal > 65000 )


    Most Chinese, Japanese and Korean (CJK) characters have codepoints within
    the BMP (i.e. <= 0xFFFF), so they only require 3 bytes in UTF-8. The
    codepoints above the BMP are mostly for archaic ideographs (those no
    longer in normal use), mathematical symbols, dead languages, etc.

    > The number of bytes needed to store a character solely depends on the
    > character's ordinal value in the Unicode table?


    Yes. UTF-8 is essentially a mechanism for representing 31-bit unsigned
    integers such that smaller integers require fewer bytes than larger
    integers (subsequent revisions of Unicode cap the range of possible
    codepoints to 0x10FFFF, as that's all that UTF-16 can handle).
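
    A quick check of that claim (my own addition, assuming Python 3): the
    UTF-8 length grows with the codepoint's ordinal value.

    for cp in (0x41, 0x3B1, 0x4E2D, 0x1D401):        # 'A', 'α', '中', '𝐁'
        print(hex(cp), len(chr(cp).encode('utf-8')))  # 1, 2, 3, 4 bytes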
     
    Nobody, Jun 9, 2013
    #3
  4. On Sun, Jun 9, 2013 at 12:44 PM, Νικόλαος Κούρας <> wrote:
    > A few questions about encoding please:
    >
    >>> Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for
    >>> values up to 256?

    >
    >>Because then how do you tell when you need one byte, and when you need
    >>two? If you read two bytes, and see 0x4C 0xFA, does that mean two
    >>characters, with ordinal values 0x4C and 0xFA, or one character with
    >>ordinal value 0x4CFA?

    >
    > I mean utf-8 could use 1 byte for storing the 1st 256 characters. I meant up to 256, not above 256.


    It is required so the computer can know where characters begin.
    0x0080 (first non-ASCII character) becomes 0xC280 in UTF-8. Further
    details here: http://en.wikipedia.org/wiki/UTF-8#Description
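
    A quick check of that example (my own addition, assuming Python 3):

    print('\u0080'.encode('utf-8'))   # b'\xc2\x80' -- U+0080 becomes 0xC2 0x80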

    >>> UTF-8 and UTF-16 and UTF-32
    >>> I thought the number beside UTF- was to declare how many bits the
    >>> character set was using to store a character into the hdd, no?

    >
    >>Not exactly, but close. UTF-32 is completely 32-bit (4 byte) values.
    >>UTF-16 mostly uses 16-bit values, but sometimes it combines two 16-bit
    >>values to make a surrogate pair.

    >
    > A surrogate pair is like hitting, for example, Ctrl-A, which means it is a combination character that consists of 2 different characters?
    > Is this what a surrogate is? A pair of 2 chars?


    http://en.wikipedia.org/wiki/UTF-16#Code_points_U.2B10000_to_U.2B10FFFF

    Long story short: codepoint - 0x10000 (up to 20 bits) → two 10-bit
    numbers → 0xD800 + first_half, 0xDC00 + second_half. Rephrasing:

    We take MATHEMATICAL BOLD CAPITAL B (U+1D401). If you have UTF-8: 𝐁

    It is over 0xFFFF, and we need to use surrogate pairs. We end up with
    0xD401, or 0b1101010000000001. Both representations are worthless, as
    we have a 16-bit number, not a 20-bit one. We throw in some leading
    zeroes and end up with 0b00001101010000000001. Split it in half and
    we get 0b0000110101 and 0b0000000001, which we can now shorten to
    0b110101 and 0b1, or translate to hex as 0x0035 and 0x0001. 0xD800 +
    0x0035 and 0xDC00 + 0x0001 → 0xD835 0xDC01. Type it into python and:

    >>> b'\xD8\x35\xDC\x01'.decode('utf-16be')

    '𝐁'

    And before you ask: that "BE" stands for Big-Endian. Little-Endian
    would mean reversing the bytes within each code unit, which would make it
    '\x35\xD8\x01\xDC' (for example, 'a' (U+0061) comes out as the bytes
    0x61 0x00 in a little-endian encoding).
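
    A quick illustration of the byte order (my own addition, assuming Python 3):

    print('a'.encode('utf-16-be'))   # b'\x00a'  -> bytes 0x00 0x61
    print('a'.encode('utf-16-le'))   # b'a\x00'  -> bytes 0x61 0x00, reversed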

    Another question you may ask: 0xD800…0xDFFF are reserved in Unicode
    for the purposes of UTF-16, so there are no conflicts.

    >>UTF-8 uses 8-bit values, but sometimes
    >>it combines two, three or four of them to represent a single code-point.

    >
    > 'a' to be utf8 encoded needs 1 byte to be stored ? (since ordinal = 65)
    > 'α΄' to be utf8 encoded needs 2 bytes to be stored ? (since ordinal is > 127 )


    yup. α is at 0x03B1, or 945 decimal.
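
    A quick check (my own addition, assuming Python 3):

    print('α'.encode('utf-8'))   # b'\xce\xb1' -- two bytes, as expected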

    > 'a chinese ideogram' to be utf8 encoded needs 4 bytes to be stored ? (since ordinal > 65000 )


    Not necessarily, as CJK characters start at U+2E80, which is in the
    3-byte range (0x0800 through 0xFFFF) — the table is here:
    http://en.wikipedia.org/wiki/UTF-8#Description

    --
    Kwpolska <http://kwpolska.tk> | GPG KEY: 5EAAEA16
    stop html mail | always bottom-post
    http://asciiribbon.org | http://caliburn.nl/topposting.html
     
    Chris "Kwpolska" Warrick, Jun 9, 2013
    #4
  5. On Wed, 12 Jun 2013 09:09:05 +0000, Νικόλαος Κούρας wrote:

    > Isn't 14 bits way too many to store a character ?


    No.

    There are 1114111 possible characters in Unicode. (And in Japan, they
    sometimes use TRON instead of Unicode, which has even more.)

    If you list out all the combinations of 14 bits:

    0000 0000 0000 00
    0000 0000 0000 01
    0000 0000 0000 10
    0000 0000 0000 11
    [...]
    1111 1111 1111 10
    1111 1111 1111 11

    you will see that there are only 32767 (2**15-1) such values. You can't
    fit 1114111 characters with just 32767 values.



    --
    Steven
     
    Steven D'Aprano, Jun 12, 2013
    #5
  6. On 12/6/2013 12:24 μμ, Steven D'Aprano wrote:
    > On Wed, 12 Jun 2013 09:09:05 +0000, Νικόλαος Κούρας wrote:
    >
    >> Isn't 14 bits way too many to store a character ?

    >
    > No.
    >
    > There are 1114111 possible characters in Unicode. (And in Japan, they
    > sometimes use TRON instead of Unicode, which has even more.)
    >
    > If you list out all the combinations of 14 bits:
    >
    > 0000 0000 0000 00
    > 0000 0000 0000 01
    > 0000 0000 0000 10
    > 0000 0000 0000 11
    > [...]
    > 1111 1111 1111 10
    > 1111 1111 1111 11
    >
    > you will see that there are only 32767 (2**15-1) such values. You can't
    > fit 1114111 characters with just 32767 values.
    >
    >
    >

    Thanks Steven,
    So, how many bytes does UTF-8 store for codepoints > 127 ?

    example for codepoint 256, 1345, 16474 ?
     
    Νικόλαος Κούρας, Jun 12, 2013
    #6
  7. Νικόλαος Κούρας

    Dave Angel Guest

    On 06/12/2013 05:24 AM, Steven D'Aprano wrote:
    > On Wed, 12 Jun 2013 09:09:05 +0000, Νικόλαος Κούρας wrote:
    >
    >> Isn't 14 bits way too many to store a character ?

    >
    > No.
    >
    > There are 1114111 possible characters in Unicode. (And in Japan, they
    > sometimes use TRON instead of Unicode, which has even more.)
    >
    > If you list out all the combinations of 14 bits:
    >
    > 0000 0000 0000 00
    > 0000 0000 0000 01
    > 0000 0000 0000 10
    > 0000 0000 0000 11
    > [...]
    > 1111 1111 1111 10
    > 1111 1111 1111 11
    >
    > you will see that there are only 32767 (2**15-1) such values. You can't
    > fit 1114111 characters with just 32767 values.
    >
    >


    Actually, it's worse. There are 16384 such values (2**14), assuming you
    include null, which you did in your list.
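
    A quick check of the arithmetic (my own addition, assuming Python 3):

    print(2**14)      # 16384 distinct 14-bit patterns
    print(0x110000)   # 1114112 codepoints in the range U+0000..U+10FFFF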

    --
    DaveA
     
    Dave Angel, Jun 12, 2013
    #7
  8. On 12.06.2013 13:23, Νικόλαος Κούρας wrote:
    > So, how many bytes does UTF-8 store for codepoints > 127 ?


    What has your research turned up? I personally consider it lazy and
    disrespectful to get lots of pointers that you could use for further
    research and then ask for more info before you have even followed those links.


    > example for codepoint 256, 1345, 16474 ?


    Yes, examples exist. Gee, if only there were an information network that
    you could access and where you could locate information on various
    programming-related topics somehow. Seriously, someone should invent
    this thing! But still, even without it, you have all the tools (i.e.
    Python) in your hand to generate these examples yourself! Check out ord,
    bin, encode, decode for a start.
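
    For instance, a small sketch along those lines (my own addition, not
    Ulrich's, assuming Python 3), using one of the codepoints asked about:

    ch = chr(16474)                    # the character at codepoint 16474
    print(ord(ch), bin(ord(ch)))       # 16474 0b100000001011010
    print(ch.encode('utf-8'))          # b'\xe4\x81\x9a' -> three bytes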


    Uli
     
    Ulrich Eckhardt, Jun 12, 2013
    #8
  9. Νικόλαος Κούρας

    Nobody Guest

    On Wed, 12 Jun 2013 14:23:49 +0300, Νικόλαος Κούρας wrote:

    > So, how many bytes does UTF-8 store for codepoints > 127 ?


    U+0000..U+007F 1 byte
    U+0080..U+07FF 2 bytes
    U+0800..U+FFFF 3 bytes
    >=U+10000 4 bytes


    So, 1 byte for ASCII, 2 bytes for other Latin characters, Greek, Cyrillic,
    Arabic, and Hebrew, 3 bytes for Chinese/Japanese/Korean, 4 bytes for dead
    languages and mathematical symbols.

    The mechanism used by UTF-8 allows sequences of up to 6 bytes, for a total
    of 31 bits, but UTF-16 is limited to U+10FFFF (slightly more than 20 bits).
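
    A quick check of those boundaries (my own illustration, assuming Python 3):

    for cp in (0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000):
        print(hex(cp), len(chr(cp).encode('utf-8')))   # 1, 2, 2, 3, 3, 4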
     
    Nobody, Jun 12, 2013
    #9
  10. On Wed, 12 Jun 2013 14:23:49 +0300, Νικόλαος Κούρας wrote:

    > So, how many bytes does UTF-8 store for codepoints > 127 ?


    Two, three or four, depending on the codepoint.


    > example for codepoint 256, 1345, 16474 ?


    You can do this yourself. I have already given you enough information in
    previous emails to answer this question on your own, but here it is again:

    Open an interactive Python session, and run this code:

    c = ord(16474)
    len(c.encode('utf-8'))


    That will tell you how many bytes are used for that example.



    --
    Steven
     
    Steven D'Aprano, Jun 13, 2013
    #10
  11. On Wed, 12 Jun 2013 21:30:23 +0100, Nobody wrote:

    > The mechanism used by UTF-8 allows sequences of up to 6 bytes, for a
    > total of 31 bits, but UTF-16 is limited to U+10FFFF (slightly more than
    > 20 bits).


    Same with UTF-8 and UTF-32, both of which are limited to U+10FFFF because
    that is what Unicode is limited to.

    The *mechanism* of UTF-8 can go up to 6 bytes (or even 7 perhaps?), but
    that's not UTF-8, that's UTF-8-plus-extra-codepoints. Likewise the
    mechanism of UTF-32 could go up to 0xFFFFFFFF, but doing so means you
    don't have Unicode chars any more, and hence your byte-string is not
    valid UTF-32:

    py> b = b'\xFF'*8
    py> b.decode('UTF-32')
    Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'utf32' codec can't decode bytes in position 0-3:
    codepoint not in range(0x110000)


    --
    Steven
     
    Steven D'Aprano, Jun 13, 2013
    #11
  12. On Thu, Jun 13, 2013 at 11:40 AM, Steven D'Aprano
    <> wrote:
    > The *mechanism* of UTF-8 can go up to 6 bytes (or even 7 perhaps?), but
    > that's not UTF-8, that's UTF-8-plus-extra-codepoints.


    And a proper UTF-8 decoder will reject "\xC0\x80" and "\xed\xa0\x80",
    even though mathematically they would translate into U+0000 and U+D800
    respectively. The UTF-16 *mechanism* is limited to no more than
    Unicode has currently used, but I'm left wondering if that's actually
    the other way around - that Unicode planes were deemed to stop at the
    point where UTF-16 can't encode any more. Not that it matters; with
    most of the current planes completely unallocated, it seems unlikely
    we'll be needing more.
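
    A quick demonstration of that rejection (my own sketch, assuming Python 3):

    for bad in (b'\xc0\x80', b'\xed\xa0\x80'):   # overlong U+0000, lone surrogate
        try:
            bad.decode('utf-8')
        except UnicodeDecodeError as e:
            print(bad, '->', e.reason)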

    ChrisA
     
    Chris Angelico, Jun 13, 2013
    #12
  13. On 13/6/2013 3:13 πμ, Steven D'Aprano wrote:
    > On Wed, 12 Jun 2013 14:23:49 +0300, Νικόλαος Κούρας wrote:
    >
    >> So, how many bytes does UTF-8 store for codepoints > 127 ?

    >
    > Two, three or four, depending on the codepoint.


    The number of bytes needed by UTF-8 to store a code-point (character)
    depends on the ordinal value of the code-point in the Unicode charset,
    correct?

    If this is correct, then the higher the ordinal value (which is a decimal
    integer) in the Unicode charset, the more bytes are needed for storage.

    It's like the bigger a decimal integer is, the bigger the binary number it
    produces.

    Is this correct?


    >> example for codepoint 256, 1345, 16474 ?

    >
    > You can do this yourself. I have already given you enough information in
    > previous emails to answer this question on your own, but here it is again:
    >
    > Open an interactive Python session, and run this code:
    >
    > c = ord(16474)
    > len(c.encode('utf-8'))
    >
    >
    > That will tell you how many bytes are used for that example.

    This is actually wrong.

    ord()'s argument must be a character for which we expect its ordinal value.

    >>> chr(16474)

    '䁚'

    Some Chinese symbol.
    So code-point '䁚' has a Unicode ordinal value of 16474, correct?

    wherein encoding this glyph's ordinal value to binary gives us
    the following bytes:

    >>> bin(16474).encode('utf-8')

    b'0b100000001011010'

    Now, we take two symbols out:

    the 'b' symbol, which is there to tell us that we are looking at a bytes
    object, as well as the
    '0b' symbol, which is there to tell us that we are looking at a binary
    representation of a bytes object.

    Thus, there we count 15 bits left.
    So it says 15 bits, which is 1 bit less than 2 bytes.
    Are the above statements correct, please?


    but thinking this through more and more:

    >>> chr(16474).encode('utf-8')

    b'\xe4\x81\x9a'
    >>> len(b'\xe4\x81\x9a')

    3

    it seems that the bytestring the encode process produces is of length 3.

    So I take it that is 3 bytes?

    but there is a mismatch between what >>> bin(16474).encode('utf-8') and >>>
    chr(16474).encode('utf-8') are telling us here.

    Care to explain that too please ?
     
    Νικόλαος Κούρας, Jun 13, 2013
    #13
  14. On 12/6/2013 11:30 μμ, Nobody wrote:
    > On Wed, 12 Jun 2013 14:23:49 +0300, Νικόλαος Κούρας wrote:
    >
    >> So, how many bytes does UTF-8 store for codepoints > 127 ?

    >
    > U+0000..U+007F 1 byte
    > U+0080..U+07FF 2 bytes
    > U+0800..U+FFFF 3 bytes
    > >=U+10000 4 bytes


    'U' stands for Unicode code-point, which means a character, right?

    How are you able to tell up to which character utf-8 needs 1 byte or 2
    bytes or 3?


    And some of the bytes' bits are used to tell where a code-point's
    representation stops, right? I mean, if we have a code-point that needs
    2 bytes to be stored, then the high bit must be set to 1 to signify that
    this character's encoding stops at 2 bytes.

    I just know that 2^8 = 256; that is, at first look, 256 places, which means
    256 positions to hold a code-point, which in turn means a character.

    We take the high bit out and then we have 2^7, which is enough positions
    for 0-127 standard ASCII. The high bit is set to '0' to signify that the
    char is encoded in 1 byte.

    Please tell me whether I understood correctly so far.

    But how about for 2 or 3 or 4 bytes?

    Am I saying it correctly?
     
    Νικόλαος Κούρας, Jun 13, 2013
    #14
  15. Νικόλαος Κούρας

    jmfauth Guest

    ------

    UTF-8, Unicode (consortium): 1 to 4 *Unicode Transformation Units*

    UTF-8, ISO 10646: 1 to 6 *Unicode Transformation Units*

    (still current, unless really recently modified)

    jmf
     
    jmfauth, Jun 13, 2013
    #15
  16. On Thu, Jun 13, 2013 at 4:21 PM, Νικόλαος Κούρας <> wrote:
    > How are you able to tell up to which character utf-8 needs 1 byte or 2
    > bytes or 3?


    You look up Wikipedia, using the handy links that have been put to you
    MULTIPLE TIMES.

    ChrisA
     
    Chris Angelico, Jun 13, 2013
    #16
  17. On Thu, 13 Jun 2013 09:09:19 +0300, Νικόλαος Κούρας wrote:

    > On 13/6/2013 3:13 πμ, Steven D'Aprano wrote:


    >> Open an interactive Python session, and run this code:
    >>
    >> c = ord(16474)
    >> len(c.encode('utf-8'))
    >>
    >>
    >> That will tell you how many bytes are used for that example.

    > This is actually wrong.
    >
    > ord()'s argument must be a character for which we expect its ordinal
    > value.


    Gah!

    That's twice I've screwed that up. Sorry about that!


    > >>> chr(16474)

    > '䁚'
    >
    > Some Chinese symbol.
    > So code-point '䁚' has a Unicode ordinal value of 16474, correct?


    Correct.


    > wherein encoding this glyph's ordinal value to binary gives us
    > the following bytes:
    >
    > >>> bin(16474).encode('utf-8')

    > b'0b100000001011010'


    No! That creates a string from 16474 in base two:

    '0b100000001011010'

    The leading 0b is just syntax to tell you "this is base 2, not base 8
    (0o) or base 10 or base 16 (0x)". Also, leading zero bits are dropped.

    Then you encode the string '0b100000001011010' into UTF-8. There are 17
    characters in this string, and they are all ASCII characters so they take
    up 1 byte each, giving you bytes b'0b100000001011010' (in ASCII form). In
    hex form, they are:

    b'\x30\x62\x31\x30\x30\x30\x30\x30\x30\x30\x31\x30\x31\x31\x30\x31\x30'

    which takes up a lot more room, which is why Python prefers to show ASCII
    characters as characters rather than as hex.

    What you want is:

    chr(16474).encode('utf-8')


    [...]
    > Thus, there we count 15 bits left.
    > So it says 15 bits, which is 1 bit less than 2 bytes. Are the above
    > statements correct, please?


    No. There are 17 BYTES there. The string "0" doesn't get turned into a
    single bit. It still takes up a full byte, 0x30, which is 8 bits.
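
    A side-by-side check of the two expressions (my own addition, assuming
    Python 3):

    print(len(bin(16474).encode('utf-8')))   # 17 -- bytes of the text '0b100000001011010'
    print(len(chr(16474).encode('utf-8')))   # 3  -- bytes of the character itself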


    > but thinking this through more and more:
    >
    > >>> chr(16474).encode('utf-8')

    > b'\xe4\x81\x9a'
    > >>> len(b'\xe4\x81\x9a')

    > 3
    >
    > it seems that the bytestring the encode process produces is of length 3.


    Correct! Now you have got the right idea.




    --
    Steven
     
    Steven D'Aprano, Jun 13, 2013
    #17
  18. On 13/6/2013 10:11 πμ, Steven D'Aprano wrote:

    >> >>> chr(16474)

    >> '䁚'
    >>
    >> Some Chinese symbol.
    >> So code-point '䁚' has a Unicode ordinal value of 16474, correct?

    >
    > Correct.
    >
    >
    >> wherein encoding this glyph's ordinal value to binary gives us
    >> the following bytes:
    >>
    >> >>> bin(16474).encode('utf-8')

    >> b'0b100000001011010'


    An observation here that I would ask you to please confirm as valid.

    1. A code-point and the code-point's ordinal value are associated in a
    Unicode charset. They have the so-called 1:1 mapping.

    So, I was under the impression that encoding the code-point into
    utf-8 was the same as encoding the code-point's ordinal value into utf-8.

    That is why I tried:
    bin(16474).encode('utf-8') instead of chr(16474).encode('utf-8')

    So, now I believe they are two different things.
    The code-point *is what actually* needs to be encoded and *not* its
    ordinal value.


    > The leading 0b is just syntax to tell you "this is base 2, not base 8
    > (0o) or base 10 or base 16 (0x)". Also, leading zero bits are dropped.


    But bytes objects are represented with '\x' instead of the aforementioned
    '0x'. Why is that?


    > No! That creates a string from 16474 in base two:
    > '0b100000001011010'


    I disagree here.
    16474 is a number in base 10. Doing bin(16474) we get the binary
    representation of the number 16474 and not a string.
    Why do you say we receive a string while python presents a binary number?


    > Then you encode the string '0b100000001011010' into UTF-8. There are 17
    > characters in this string, and they are all ASCII characters so they take
    > up 1 byte each, giving you bytes b'0b100000001011010' (in ASCII form).


    0b100000001011010 stands for a number in base 2 to me, not a string.
    Have I understood something wrong?
     
    Νικόλαος Κούρας, Jun 13, 2013
    #18
  19. On 13/6/2013 10:58 πμ, Chris Angelico wrote:
    > On Thu, Jun 13, 2013 at 5:42 PM, Νικόλαος Κούρας <> wrote:
    >> On 13/6/2013 10:11 πμ, Steven D'Aprano wrote:
    >>> No! That creates a string from 16474 in base two:
    >>> '0b100000001011010'

    >>
    >> I disagree here.
    >> 16474 is a number in base 10. Doing bin(16474) we get the binary
    >> representation of the number 16474 and not a string.
    >> Why do you say we receive a string while python presents a binary number?

    >
    > You can disagree all you like. Steven cited a simple point of fact,
    > one which can be verified in any Python interpreter. Nikos, you are
    > flat wrong here; bin(16474) creates a string.


    Indeed, python enclosed it in single quotes as '0b100000001011010' and not
    as 0b100000001011010, which in fact makes it a string.

    But since bin(16474) seems to create a string rather than an expected
    number (at least in my mind), then how do we get the binary
    representation of the number 16474 as a number?
     
    Νικόλαος Κούρας, Jun 13, 2013
    #19
  20. On Thu, Jun 13, 2013 at 6:08 PM, Νικόλαος Κούρας <> wrote:
    > On 13/6/2013 10:58 πμ, Chris Angelico wrote:
    >>
    >> On Thu, Jun 13, 2013 at 5:42 PM, Νικόλαος Κούρας <>
    >> wrote:
    >>
    >>> On 13/6/2013 10:11 πμ, Steven D'Aprano wrote:
    >>>>
    >>>> No! That creates a string from 16474 in base two:
    >>>> '0b100000001011010'
    >>>
    >>>
    >>> I disagree here.
    >>> 16474 is a number in base 10. Doing bin(16474) we get the binary
    >>> representation of the number 16474 and not a string.
    >>> Why do you say we receive a string while python presents a binary number?

    >>
    >>
    >> You can disagree all you like. Steven cited a simple point of fact,
    >> one which can be verified in any Python interpreter. Nikos, you are
    >> flat wrong here; bin(16474) creates a string.

    >
    >
    > Indeed, python enclosed it in single quotes as '0b100000001011010' and not as
    > 0b100000001011010, which in fact makes it a string.
    >
    > But since bin(16474) seems to create a string rather than an expected
    > number (at least in my mind), then how do we get the binary representation of
    > the number 16474 as a number?


    In Python 2:
    >>> 16474
    16474

    In Python 3, you have to fiddle around with ctypes, but broadly
    speaking, the same thing.
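
    To illustrate the point being made (my own sketch, assuming Python 3): an
    int has no inherent base, only its string representations do.

    n = 0b100000001011010      # a binary literal; the value is just the int 16474
    print(n)                   # 16474
    print(bin(n))              # bin(n) returns the string '0b100000001011010'
    print(format(n, 'b'))      # same digits as a string, without the prefix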

    ChrisA
     
    Chris Angelico, Jun 13, 2013
    #20
