unicode by default


harrismh777

hi folks,
I am puzzled by unicode generally, and within the context of python
specifically. For one thing, what do we mean that unicode is used in
python 3.x by default? (I know what default means, I mean, what changed?)

I think part of my problem is that I'm spoiled (American, ascii
heritage) and have been either stuck in ascii knowingly, or UTF-8
without knowing (just because the code points lined up). I am confused
by the implications for using 3.x, because I am reading that there are
significant things to be aware of... what?

On my installation 2.6 sys.maxunicode comes up with 1114111, and my
2.7 and 3.2 installs come up with 65535 each. So, I am assuming that 2.6
was compiled with UCS-4 (UTF-32) option for 4 byte unicode(?) and that
the default compile option for 2.7 & 3.2 (I didn't change anything) is
set for UCS-2 (UTF-16) or 2 byte unicode(?). Do I understand this much
correctly?

The books say that the .py sources are UTF-8 by default... and that
3.x is either UCS-2 or UCS-4. If I use the file handling capabilities
of Python in 3.x (by default) what encoding will be used, and how will
that affect the output?

If I do not specify any code points above ascii 0xFF does any of
this matter anyway?



Thanks.

kind regards,
m harris
 

Ian Kelly

> hi folks,
> I am puzzled by unicode generally, and within the context of python
> specifically. For one thing, what do we mean that unicode is used in python
> 3.x by default? (I know what default means, I mean, what changed?)

The `unicode' class was renamed to `str', and a stripped-down version
of the 2.X `str' class was renamed to `bytes'.
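
A minimal Python 3 session makes the rename concrete:

    >>> type('abc')      # 3.x str holds text, like the 2.x unicode type
    <class 'str'>
    >>> type(b'abc')     # 3.x bytes holds raw data, like a slimmed 2.x str
    <class 'bytes'>
    >>> 'abc' == b'abc'  # text and bytes never compare equal in 3.x
    False
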
> I think part of my problem is that I'm spoiled (American, ascii heritage)
> and have been either stuck in ascii knowingly, or UTF-8 without knowing
> (just because the code points lined up). I am confused by the implications
> for using 3.x, because I am reading that there are significant things to be
> aware of... what?

Mainly, Python 3 no longer does implicit conversion between bytes and
unicode, requiring the programmer to be explicit about such
conversions. If you have Python 2 code that is sloppy about this, you
may get some Unicode encode/decode errors when trying to run the same
code in Python 3. The 2to3 tool can help somewhat with this, but it
can't prevent all problems.
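
For example, an operation that Python 2 would quietly handle with an
implicit ascii conversion now fails until you convert explicitly (a
Python 3.2 session; the exact message wording varies by version):

    >>> 'abc' + b'abc'
    Traceback (most recent call last):
      ...
    TypeError: Can't convert 'bytes' object to str implicitly
    >>> 'abc' + b'abc'.decode('ascii')   # explicit decode is required
    'abcabc'
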
> On my installation 2.6 sys.maxunicode comes up with 1114111, and my 2.7
> and 3.2 installs come up with 65535 each. So, I am assuming that 2.6 was
> compiled with UCS-4 (UTF-32) option for 4 byte unicode(?) and that the
> default compile option for 2.7 & 3.2 (I didn't change anything) is set for
> UCS-2 (UTF-16) or 2 byte unicode(?). Do I understand this much correctly?

I think that UCS-2 has always been the default unicode width for
CPython, although the exact representation used internally is an
implementation detail.
> The books say that the .py sources are UTF-8 by default... and that 3.x is
> either UCS-2 or UCS-4. If I use the file handling capabilities of Python in
> 3.x (by default) what encoding will be used, and how will that affect the
> output?

If you open a file in binary mode, the result is a non-decoded byte stream.

If you open a file in text mode and do not specify an encoding, then
the result of locale.getpreferredencoding() is used for decoding, and
the result is a unicode stream.
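
A sketch of the three cases ('data.txt' is only a placeholder name):

    # Binary mode: raw bytes, no decoding at all.
    with open('data.txt', 'rb') as f:
        raw = f.read()       # bytes

    # Text mode, no encoding given: decoded per the locale.
    with open('data.txt', 'r') as f:
        text = f.read()      # str, via locale.getpreferredencoding()

    # Text mode, explicit encoding: no guessing involved.
    with open('data.txt', 'r', encoding='utf-8') as f:
        text = f.read()      # str, decoded as UTF-8
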
> If I do not specify any code points above ascii 0xFF does any of this
> matter anyway?

You mean 0x7F, and probably, due to the need to explicitly encode and decode.
 

Benjamin Kaplan

> hi folks,
> I am puzzled by unicode generally, and within the context of python
> specifically. For one thing, what do we mean that unicode is used in python
> 3.x by default? (I know what default means, I mean, what changed?)
>
> I think part of my problem is that I'm spoiled (American, ascii heritage)
> and have been either stuck in ascii knowingly, or UTF-8 without knowing
> (just because the code points lined up). I am confused by the implications
> for using 3.x, because I am reading that there are significant things to be
> aware of... what?
>
> On my installation 2.6 sys.maxunicode comes up with 1114111, and my 2.7
> and 3.2 installs come up with 65535 each. So, I am assuming that 2.6 was
> compiled with UCS-4 (UTF-32) option for 4 byte unicode(?) and that the
> default compile option for 2.7 & 3.2 (I didn't change anything) is set for
> UCS-2 (UTF-16) or 2 byte unicode(?). Do I understand this much correctly?

Not really sure about that, but it doesn't matter anyway, because even
though internally the string is stored as either a UCS-2 or a UCS-4
string, you never see that. You just see the string as a sequence of
characters. If you want to turn it into a sequence of bytes, you have
to use an encoding.
> The books say that the .py sources are UTF-8 by default... and that 3.x is
> either UCS-2 or UCS-4. If I use the file handling capabilities of Python in
> 3.x (by default) what encoding will be used, and how will that affect the
> output?

> If I do not specify any code points above ascii 0xFF does any of this
> matter anyway?

ASCII only goes up to 0x7F. If you were using UTF-8 byte strings, then
there is a difference for anything over that range. A byte string is a
sequence of bytes. A unicode string is a sequence of these mythical
abstractions called characters. So a unicode string u'\u00a0' will
have a length of 1. Encode that to UTF-8 and you'll find it has a
length of 2 (UTF-8 uses two to four bytes to encode everything over
0x7F; the top bit of each byte signals that it belongs to a multi-byte
sequence).
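
That length difference is easy to check in a Python 3 session:

    >>> s = '\u00a0'            # NO-BREAK SPACE, one character
    >>> len(s)
    1
    >>> b = s.encode('utf-8')   # serialize the character to bytes
    >>> b
    b'\xc2\xa0'
    >>> len(b)
    2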

If you want the history behind the whole encoding mess, Joel Spolsky
wrote a rather amusing article explaining how this all came about:
http://www.joelonsoftware.com/articles/Unicode.html

And the biggest reason to use Unicode is so that you don't have to
worry about your program messing up because someone hands you input in
a different encoding than you used.
 

harrismh777

Ian Kelly wrote:

Ian, Benjamin, thanks much.
> The `unicode' class was renamed to `str', and a stripped-down version
> of the 2.X `str' class was renamed to `bytes'.

... thank you, this is very helpful.
> You mean 0x7F, and probably, due to the need to explicitly encode and decode.

Yes, actually, I did... and from Benjamin's reply it seems that
this matters only if I am working with bytes. Is it true that if I am
working without using bytes sequences that I will not need to care about
the encoding anyway, unless of course I need to specify a unicode code
point?

Thanks again.

kind regards,
m harris
 

John Machin

> Is it true that if I am
> working without using bytes sequences that I will not need to care about
> the encoding anyway, unless of course I need to specify a unicode code
> point?

Quite the contrary.

(1) You cannot work without using bytes sequences. Files are byte
sequences. Web communication is in bytes. You need to (know / assume / be
able to extract / guess) the input encoding. You need to encode your
output using an encoding that is expected by the consumer (or use an
output method that will do it for you).

(2) You don't need to use bytes to specify a Unicode code point. Just use
an escape sequence e.g. "\u0404" is a Cyrillic character.
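
A short session showing that the escape produces a character, and that
bytes only appear on an explicit encode:

    >>> s = '\u0404'       # CYRILLIC CAPITAL LETTER UKRAINIAN IE
    >>> print(s)
    Є
    >>> len(s)             # one character
    1
    >>> s.encode('utf-8')  # two bytes, but only after encoding
    b'\xd0\x84'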
 

harrismh777

John said:
> (1) You cannot work without using bytes sequences. Files are byte
> sequences. Web communication is in bytes. You need to (know / assume / be
> able to extract / guess) the input encoding. You need to encode your
> output using an encoding that is expected by the consumer (or use an
> output method that will do it for you).
>
> (2) You don't need to use bytes to specify a Unicode code point. Just use
> an escape sequence e.g. "\u0404" is a Cyrillic character.

Thanks John. In reverse order, I understand point (2). I'm less clear
on point (1).

If I generate a string of characters that I presume to be ascii/utf-8
(no \u0404 type characters) and write them to a file (stdout), how does
the default encoding affect that file, by default? I'm not seeing that
there is anything unusual going on... If I open the file with vi? If
I open the file with gedit? emacs?

.....

Another question... in mail I'm receiving many small blocks that look
like sprites with four small hex codes, scattered about the mail...
mostly punctuation, maybe? ... guessing, are these unicode code
points, and if so what is the best way to 'guess' the encoding? ... is
it coded in the stream somewhere...protocol?

thanks
 

MRAB

> Thanks John. In reverse order, I understand point (2). I'm less clear on
> point (1).
>
> If I generate a string of characters that I presume to be ascii/utf-8
> (no \u0404 type characters) and write them to a file (stdout), how does
> the default encoding affect that file, by default? I'm not seeing that
> there is anything unusual going on... If I open the file with vi? If I
> open the file with gedit? emacs?
>
> ....
>
> Another question... in mail I'm receiving many small blocks that look
> like sprites with four small hex codes, scattered about the mail...
> mostly punctuation, maybe? ... guessing, are these unicode code points,
> and if so what is the best way to 'guess' the encoding? ... is it coded
> in the stream somewhere...protocol?

You need to understand the difference between characters and bytes.

A string contains characters, a file contains bytes.

The encoding specifies how a character is represented as bytes.

For example:

In the Latin-1 encoding, the character "£" is represented by the
byte 0xA3.

In the UTF-8 encoding, the character "£" is represented by the byte
sequence 0xC2 0xA3.

In the ASCII encoding, the character "£" can't be represented at all.

The advantage of UTF-8 is that it can represent _all_ Unicode
characters (codepoints, actually) as byte sequences, and all those in
the ASCII range are represented by the same single bytes which the
original ASCII system used. Use the UTF-8 encoding unless you have to
use a different one.
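
The three cases above, as a Python 3 session:

    >>> '\u00a3'.encode('latin-1')   # one byte
    b'\xa3'
    >>> '\u00a3'.encode('utf-8')     # two bytes
    b'\xc2\xa3'
    >>> '\u00a3'.encode('ascii')     # not representable
    Traceback (most recent call last):
      ...
    UnicodeEncodeError: 'ascii' codec can't encode character '\xa3' in
    position 0: ordinal not in range(128)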

A file contains only bytes, a socket handles only bytes. Which encoding
you should use for characters is down to protocol. A system such as
email, which can handle different encodings, should have a way of
specifying the encoding, and perhaps also a default encoding.
 

harrismh777

Steven said:

Thanks for being patient guys, here's what I've done:


When I edit "myfile" with vi I see the 'characters' :

pound sign £

... same with emacs, same with gedit ...


When I hexdump myfile I see this:

0000000 6f70 6375 2064 6973 6e67 c220 00a3


This is *not* what I expected... well it is (little-endian) right up to
the 'c2' and that is what is confusing me....

I did not open the file with an encoding of UTF-8... so I'm assuming
UTF-16 by default (python3) so I was expecting a '00A3' little-endian as
'A300' but what I got instead was UTF-8 little-endian 'c2a3' ....

See my problem?... when I open the file with emacs I see the character
pound sign... same with gedit... they're all using UTF-8 by default. By
default it looks like Python3 is writing output with UTF-8 as default...
and I thought that by default Python3 was using either UTF-16 or UTF-32.
So, I'm confused here... also, I used the character sequence \u00A3
which I thought was UTF-16... but Python3 changed my intent to 'c2a3'
which is the normal UTF-8...

Thanks again for your patience... I really do hate to be dense about
this... but this is another area where I'm just beginning to dabble and
I'd like to know soon what I'm doing...

Thanks for the link Steve... I'm headed there now...




kind regards,
m harris
 

John Machin

> Thanks John. In reverse order, I understand point (2). I'm less clear
> on point (1).
>
> If I generate a string of characters that I presume to be ascii/utf-8
> (no \u0404 type characters) and write them to a file (stdout), how does
> the default encoding affect that file, by default? I'm not seeing that
> there is anything unusual going on...

About """characters that I presume to be ascii/utf-8 (no \u0404 type
characters)""": All Unicode characters (including U+0404) are encodable in
bytes using UTF-8.

The result of sys.stdout.write(unicode_characters) to a TERMINAL depends
mostly on sys.stdout.encoding. This is likely to be UTF-8 on a
linux/OSX platform. On a typical American / Western European / [former]
colonies Windows box, this is likely to be cp850 in a Command Prompt
window, and cp1252 in IDLE.

UTF-8: All Unicode characters are encodable in UTF-8. Only problem arises
if the terminal can't render the character -- you'll get spaces or blobs
or boxes with hex digits in them or nothing.

Windows (Command Prompt window): only a small subset of characters can be
encoded in e.g. cp850; anything else causes an exception.

Windows (IDLE): ignores sys.stdout.encoding and renders the characters
itself. Same outcome as *x/UTF-8 above.

If you write directly (or sys.stdout is redirected) to a FILE, the default
encoding is obtained by sys.getdefaultencoding() and is AFAIK ascii unless
the machine's site.py has been fiddled with to make it UTF-8 or something
else.
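
You can probe all three settings on your own box (the values printed
will differ from system to system):

    import locale
    import sys

    print(sys.stdout.encoding)            # used when writing to a terminal
    print(sys.getdefaultencoding())       # the interpreter-wide default
    print(locale.getpreferredencoding())  # used by text-mode open() in 3.x
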
> If I open the file with vi? If
> I open the file with gedit? emacs?

Any editor will have a default encoding; if that doesn't match the file
encoding, you have a (hopefully obvious) problem if the editor doesn't
detect the mismatch. Consult your editor's docs or HTFF1K.
> Another question... in mail I'm receiving many small blocks that look
> like sprites with four small hex codes, scattered about the mail...
> mostly punctuation, maybe? ... guessing, are these unicode code
> points,
yes

> and if so what is the best way to 'guess' the encoding?

google("chardet") or rummage through the mail headers (but 4 hex digits in
a box are a symptom of inability to render, not necessarily caused by an
incorrect decoding)

> ... is
> it coded in the stream somewhere...protocol?

Should be.
 

Terry Reedy

> Thanks for being patient guys, here's what I've done:
>
> When I edit "myfile" with vi I see the 'characters' :
>
> pound sign £
>
> ... same with emacs, same with gedit ...
>
> When I hexdump myfile I see this:
>
> 0000000 6f70 6375 2064 6973 6e67 c220 00a3
>
> This is *not* what I expected... well it is (little-endian) right up to
> the 'c2' and that is what is confusing me....
>
> I did not open the file with an encoding of UTF-8... so I'm assuming
> UTF-16 by default (python3) so I was expecting a '00A3' little-endian as
> 'A300' but what I got instead was UTF-8 little-endian 'c2a3' ....
>
> See my problem?... when I open the file with emacs I see the character
> pound sign... same with gedit... they're all using UTF-8 by default. By
> default it looks like Python3 is writing output with UTF-8 as default...
> and I thought that by default Python3 was using either UTF-16 or UTF-32.
> So, I'm confused here... also, I used the character sequence \u00A3
> which I thought was UTF-16... but Python3 changed my intent to 'c2a3'
> which is the normal UTF-8...

If you open a file as binary (bytes), you must write bytes, and they are
stored without transformation. If you open in text mode, you must write
text (string as unicode in 3.2) and Python will encode to bytes using
either some default or the encoding you specified in the open statement.
It does not matter how Python stored the unicode internally. Does this
help? Your intent is signalled by how you open the file.
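
A sketch of both modes, reproducing the "myfile" experiment (Python 3):

    # Text mode: you write str, and Python encodes on the way out.
    with open('myfile', 'w', encoding='utf-8') as f:
        f.write('pound sign \u00a3')      # last two bytes on disk: c2 a3

    # Binary mode: you write bytes, stored exactly as given.
    with open('myfile', 'wb') as f:
        f.write(b'pound sign \xc2\xa3')   # the same bytes, encoded by hand

Writing str to a binary-mode file, or bytes to a text-mode file, raises
TypeError, which is how the signalled intent is enforced.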
 

John Machin

> By
> default it looks like Python3 is writing output with UTF-8 as default...
> and I thought that by default Python3 was using either UTF-16 or UTF-32.
> So, I'm confused here... also, I used the character sequence \u00A3
> which I thought was UTF-16... but Python3 changed my intent to 'c2a3'
> which is the normal UTF-8...

Python uses either a 16-bit or a 32-bit INTERNAL representation of Unicode
code points. Those NN bits have nothing to do with the UTF-NN encodings,
which can be used to encode the codepoints as byte sequences for EXTERNAL
purposes. In your case, UTF-8 has been used as it is the default encoding
on your platform.
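
The internal width only shows through sys.maxunicode; the external bytes
are whatever encoding you explicitly ask for (narrow-build session shown):

    >>> import sys
    >>> sys.maxunicode               # 65535 on a narrow build, 1114111 on wide
    65535
    >>> '\u00a3'.encode('utf-8')     # external bytes: an explicit choice
    b'\xc2\xa3'
    >>> '\u00a3'.encode('utf-32-le')
    b'\xa3\x00\x00\x00'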
 

Benjamin Kaplan

> Thanks for being patient guys, here's what I've done:
>
> When I edit "myfile" with vi I see the 'characters' :
>
> pound sign £
>
> ... same with emacs, same with gedit ...
>
> When I hexdump myfile I see this:
>
> 0000000 6f70 6375 2064 6973 6e67 c220 00a3
>
> This is *not* what I expected... well it is (little-endian) right up to the
> 'c2' and that is what is confusing me....
>
> I did not open the file with an encoding of UTF-8... so I'm assuming UTF-16
> by default (python3) so I was expecting a '00A3' little-endian as 'A300' but
> what I got instead was UTF-8 little-endian 'c2a3' ....

Quick note here: UTF-8 doesn't have an endianness. It's always read
from left to right, with the high bits of each byte telling you whether
the sequence continues or not, so byte order simply doesn't apply.
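
Compare UTF-16, where byte order genuinely matters and both orders exist:

    >>> '\u00a3'.encode('utf-16-le')  # low byte first
    b'\xa3\x00'
    >>> '\u00a3'.encode('utf-16-be')  # high byte first
    b'\x00\xa3'
    >>> '\u00a3'.encode('utf-8')      # only one possible byte sequence
    b'\xc2\xa3'
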
> See my problem?... when I open the file with emacs I see the character pound
> sign... same with gedit... they're all using UTF-8 by default. By default it
> looks like Python3 is writing output with UTF-8 as default... and I thought
> that by default Python3 was using either UTF-16 or UTF-32. So, I'm confused
> here... also, I used the character sequence \u00A3 which I thought was
> UTF-16... but Python3 changed my intent to 'c2a3' which is the normal
> UTF-8...

The fact that CPython uses UCS-2 or UCS-4 internally is an
implementation detail and isn't actually part of the Python
specification. As far as a Python program is concerned, a Unicode
string is a list of character objects, not bytes. Much like any other
object, a unicode character needs to be serialized before it can be
written to a file. An encoding is a serialization function for
characters.

If the file you're writing to doesn't specify an encoding, Python will
default to locale.getdefaultencoding(), which tries to get your
system's preferred encoding from environment variables (in other
words, the same source that emacs and gedit will use to get the
default encoding).
 

John Machin

> If the file you're writing to doesn't specify an encoding, Python will
> default to locale.getdefaultencoding(),

No such attribute. Perhaps you mean locale.getpreferredencoding()
 

harrismh777

Ben said:
> I'd phrase that as:
> * Text is a sequence of characters. Most inputs to the program,
> including files, sockets, etc., contain a sequence of bytes.
> * Always know whether you're dealing with text or with bytes. No object
> can be both.
> * In Python 2, ‘str’ is the type for a sequence of bytes. ‘unicode’ is
> the type for text.
> * In Python 3, ‘str’ is the type for text. ‘bytes’ is the type for a
> sequence of bytes.


That is very helpful... thanks


MRAB, Steve, John, Terry, Ben F, Ben K, Ian...
...thank you guys so much, I think I've got a better picture now of
what is going on... this is also one place where I don't think the books
are as clear as they need to be at least for me...(Lutz, Summerfield).

So, the UTF-16 UTF-32 is INTERNAL only, for Python... and text in/out is
based on locale... in my case UTF-8 ...that is enormously helpful for
me... understanding locale on this system is as mystifying as unicode is
in the first place.
Well, after reading about unicode tonight (about four hours) I realize
that it's not really that hard... there's just a lot of details that have
to come together. Straightening out that whole tower-of-babel thing is
sure a pain in the butt.
I also was not aware that UTF-8 chars could be up to six (6) bytes long
from left to right. I see now that the little-endianness I was
ascribing to python is just a function of hexdump... and I was a little
disappointed to find that hexdump does not support UTF-8, just ascii... doh.
Anyway, thanks again... I've got enough now to play around a bit...

PS thanks Steve for that link, informative and entertaining too... Joel
says, "If you are a programmer . . . and you don't know the basics of
characters, character sets, encodings, and Unicode, and I catch you, I'm
going to punish you by making you peel onions for 6 months in a
submarine. I swear I will". :)








kind regards,
m harris
 

harrismh777

Terry said:
> It does not matter how Python stored the unicode internally. Does this
> help? Your intent is signalled by how you open the file.

Very much, actually, thanks. I was missing the 'internal' piece, and
did not realize that if I didn't specify the encoding on the open that
python would pull the default encoding from locale...


kind regards,
m harris
 

John Machin

> So, the UTF-16 UTF-32 is INTERNAL only, for Python

NO. See one of my previous messages. UTF-16 and UTF-32, like UTF-8, are
encodings for the EXTERNAL representation of Unicode characters in byte
streams.
> I also was not aware that UTF-8 chars could be up to six (6) bytes long
> from left to right.

It could be, once upon a time in ISO faerieland, when it was thought that
Unicode could grow to 2**32 codepoints. However ISO and the Unicode
consortium have agreed that 17 planes is the utter max, and accordingly a
valid UTF-8 byte sequence can be no longer than 4 bytes ... see below
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\python32\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xf5 in position 0:
invalid start byte
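
The statement that raised this traceback was lost above; an assumed
minimal reproduction (any lead byte >= 0xF5 would start a sequence
beyond U+10FFFF, so it can never be valid UTF-8):

    >>> b'\xf5\x80\x80\x80'.decode('utf8')
    Traceback (most recent call last):
      ...
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xf5 in position 0:
    invalid start byte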
 

TheSaint

John said:
> No such attribute. Perhaps you mean locale.getpreferredencoding()

What about sys.getfilesystemencoding()? If you distribute a program,
how can you guess which encoding the user will have?
 

Ian Kelly

> NO. See one of my previous messages. UTF-16 and UTF-32, like UTF-8, are
> encodings for the EXTERNAL representation of Unicode characters in byte
> streams.

Right. *Under the hood* Python uses UCS-2 (which is not exactly the
same thing as UTF-16, by the way) to represent Unicode strings.
However, this is entirely transparent. To the Python programmer, a
unicode string is just an abstraction of a sequence of code-points.
You don't need to think about UCS-2 at all. The only times you need
to worry about encodings are when you're encoding unicode characters
to byte strings, or decoding bytes to unicode characters, or opening a
stream in text mode; and in those cases the only encoding that matters
is the external one.
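
One place where "not exactly the same thing as UTF-16" does leak
through, assuming a narrow (UCS-2) build: a codepoint outside the Basic
Multilingual Plane is stored as a surrogate pair, so its length shows
as 2, where a wide (UCS-4) build reports 1:

    >>> len('\U00010404')   # a non-BMP character, from the Deseret block
    2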
 
