Q: The `print' statement over Unicode

  • Thread starter François Pinard

François Pinard

Hi, people. I hope someone would like to enlighten me.

For any application handling Unicode internally, I'm usually careful
to convert those Unicode strings properly into 8-bit strings before
writing them out.

However, this morning, I mistakenly forgot to do so before using one
Unicode string (containing a non-ASCII character) as an argument to
the `print' statement, and I did _not_ get an error. This is rather
surprising to me. I reread the section of the Python reference manual
(version 2.3.4, this machine uses 2.3.3 currently), and it does not say
anything about special processing for Unicode strings.

In my understanding, when `print' is given an argument which is not
already a string (I read: 8-bit string), it first gets converted into
a string (I read: calling __str__). But if I call `str()' explicitly,
_then_ I get an error as expected. The question is, why is there no
error if I do not call `str()' explicitly?

For example, given file `question.py' with these contents:

# -*- coding: UTF-8 -*-
texte = unicode("Fran\xe7ois", 'latin1')
print type(texte), repr(texte), texte
print type(texte), repr(texte), str(texte)

doing `python question.py' yields:

<type 'unicode'> u'Fran\xe7ois' François
<type 'unicode'> u'Fran\xe7ois'
Traceback (most recent call last):
File "question.py", line 4, in ?
print type(texte), repr(texte), str(texte)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe7' \
in position 4: ordinal not in range(128)

(last line wrapped for legibility).

So (trying to be crystal clear), why is the first `print' working over
its third argument, but not the second? How does `print' convert that
Unicode string to an 8-bit string for output, if not through `str()'?
What is missing from the documentation, or from my way of understanding it?
 

Thomas Heller

François Pinard said:
So (trying to be crystal clear), why is the first `print' working over
its third argument, but not the second? How does `print' convert that
Unicode string to an 8-bit string for output, if not through `str()'?
What is missing from the documentation, or from my way of understanding it?

AFAIK, print uses sys.stdout.encoding to encode the unicode string.

Thomas
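
For readers following along, Thomas's answer can be illustrated with a small sketch. It is not the interpreter's actual code: `encode_for_stream' is a hypothetical helper, and the fallback to ASCII when the stream has no encoding is an assumption, chosen because it matches the error `str()' raised in the question. The syntax is picked so that it also runs on current Python.

```python
# Hypothetical helper, not part of Python: mimics how `print' is said to
# choose an encoding when it is handed Unicode text.
def encode_for_stream(text, stream_encoding):
    # Assumption: with no stream encoding, fall back to ASCII, the same
    # codec that str() used in the question.
    encoding = stream_encoding or "ascii"
    return text.encode(encoding)


texte = u"Fran\xe7ois"

# A terminal reporting Latin-1 receives a single byte for the c-cedilla.
latin1_bytes = encode_for_stream(texte, "latin-1")

# A stream with no encoding (a pipe, say) fails the way str() did.
try:
    encode_for_stream(texte, None)
    failed = False
except UnicodeEncodeError:
    failed = True
```

Under these assumptions, the first `print' in `question.py' succeeds because the terminal supplied an encoding, while the explicit `str()' never consults the stream at all.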
 

François Pinard

[Thomas Heller]
AFAIK, print uses sys.stdout.encoding to encode the unicode string.

Many thanks for this information.

I was not aware of this file attribute. Looking around, I found a
quick description in the Library Reference, under "2.3.8 File Objects".
However, I did not find in the documentation the rules stating how
or when this attribute receives a value, and in particular here, for
the case of `sys.stdout'. The Reference Manual, under "6.6 The print
statement", is silent about how Unicode strings are handled.

Am I looking in the wrong places, or else, should not the standard
documentation more handily explain such things?
 

Martin v. Löwis

François Pinard said:
Am I looking in the wrong places, or else, should not the standard
documentation more handily explain such things?

It should, but, alas, it doesn't. Contributions are welcome.

The algorithm to set sys.std{in,out}.encoding is in
sysmodule.c:_PySys_Init and pythonrun.c:Py_InitializeEx
and goes roughly as follows:

- On Windows, if isatty returns true, use GetConsoleCP and
  GetConsoleOutputCP.
- On Unix, if isatty returns true, langinfo.h is present,
  CODESET is defined, and nl_langinfo(CODESET) returns a
  non-empty string, use that.
- Otherwise, .encoding will not be set.

Regards,
Martin
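
Martin's rules can be restated as a short sketch. This is a paraphrase, not the C code from sysmodule.c or pythonrun.c: `guess_stream_encoding' and all of its parameters are hypothetical stand-ins for the results of isatty, GetConsoleOutputCP (or GetConsoleCP for stdin) and nl_langinfo(CODESET), and the "cp%d" naming for Windows code pages is an assumption.

```python
# Hypothetical restatement of the rules above; every name here is made up
# for illustration and none of it is taken from the interpreter source.
def guess_stream_encoding(is_tty, on_windows, console_cp=None, codeset=None):
    """Return an encoding name, or None when .encoding would stay unset."""
    if not is_tty:
        # Pipes and files: no encoding is set.
        return None
    if on_windows:
        # Assumption: the console code page number is rendered as "cpNNN".
        if console_cp:
            return "cp%d" % console_cp
        return None
    if codeset:
        # Unix: a non-empty nl_langinfo(CODESET) result is used as is.
        return codeset
    return None
```

For instance, a Unix terminal whose locale reports ISO-8859-1 would give `guess_stream_encoding(True, False, codeset="ISO-8859-1")', while redirecting output to a file corresponds to `is_tty=False' and yields None.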
 

Jeremy Bowers

François Pinard said:
[Martin von Löwis]
It should, but, alas, it doesn't. Contributions are welcome.

My contributions are not that welcome. If they were, the core team
would not try forcing me into using robots and bug trackers! :)

I'm not sure that the smiley completely de-fangs this comment.

Have you ever tried managing a project even a tenth the size of Python
*without* those tools? If you had any idea of the kind of continuous,
day-in, day-out *useless busywork* you were asking of the developers,
merely to save you a minute or two on the one occasion you have something
to contribute, you'd apologize for the incredibly unreasonable demand you
are making from people giving you an amazing amount of free stuff. (You'd
get a lot less of it, too; administration isn't coding, and excessive
administration makes the coding even less fun and thus less likely to be
done.) An apology would not be out of line, smiley or no.

I've never administered anything the size of Python. I have, however, been
up close and personal with a project that had about five developers
full-time, and administering *that* without bug trackers would have been a
nightmare. I can't even imagine trying to run Python by hand... at least
not that and getting useful work done too.
 

François Pinard

[Martin von Löwis]
It should, but, alas, it doesn't. Contributions are welcome.

My contributions are not that welcome. If they were, the core team
would not try forcing me into using robots and bug trackers! :)

[Martin von Löwis]
The algorithm to set sys.std{in,out}.encoding is in
sysmodule.c:_PySys_Init and pythonrun.c:Py_InitializeEx
and goes roughly as follows:
- On Windows, if isatty returns true, use GetConsoleCP and
  GetConsoleOutputCP.
- On Unix, if isatty returns true, langinfo.h is present,
  CODESET is defined, and nl_langinfo(CODESET) returns a
  non-empty string, use that.
- Otherwise, .encoding will not be set.

Thanks. Your kind explanation, above, should make it, as is, somewhere
in the documentation -- until Xah decides to rewrite it, of course! :).
 

Martin v. Löwis

François Pinard said:
My contributions are not that welcome. If they were, the core team
would not try forcing me into using robots and bug trackers! :)

Ok, then we need to wait for somebody else to contribute a documentation
patch.

François Pinard said:
Thanks. Your kind explanation, above, should make it, as is, somewhere
in the documentation

But how will that happen? Unless somebody contributes a documentation
patch, the documentation will not change magically!

Regards,
Martin
 

John J. Lee

Jeremy Bowers said:
[François Pinard]
Am I looking in the wrong places, or else, should not the standard
documentation more handily explain such things?

[Martin von Löwis]
It should, but, alas, it doesn't. Contributions are welcome.

[François Pinard]
My contributions are not that welcome. If they were, the core team
would not try forcing me into using robots and bug trackers! :)

[Jeremy Bowers]
I'm not sure that the smiley completely de-fangs this comment.
Have you ever tried managing a project even a tenth the size of Python
*without* those tools? If you had any idea of the kind of continuous
[...]

I don't mean to put words into François' mouth, but IIRC he managed,
for example, GNU tar for some time and, while using some kind of
tracking system "under the covers", didn't impose it on his users.

IMVHO, that was very nice of him, but I'd be reluctant to attempt to
enforce this way of working on a hard-working and competent
contributor to an open source project to which I'm not a core
contributor myself.


John
 

Jeremy Bowers

John J. Lee said:
I don't mean to put words into François' mouth, but IIRC he managed,
for example, GNU tar for some time and, while using some kind of
tracking system "under the covers", didn't impose it on his users.

IMVHO, that was very nice of him, but I'd be reluctant to attempt to
enforce this way of working on a hard-working and competent
contributor to an open source project to which I'm not a core
contributor myself.

Then I'd honor his consistency of belief, but still consider it impolite
in general, as asking someone to do tons of work overall to save you a bit
is almost always impolite.
 

Guest

Jeremy said:
Then I'd honor his consistency of belief, but still consider it impolite
in general, as asking someone to do tons of work overall to save you a bit
is almost always impolite.

This is not what he did, though - he did not break "the protocol" by
sending in patches by email (which indeed we would reject). Instead, he
said (before) that he cannot contribute because he is
unwilling to/incapable of using a bug tracker. This is an acceptable
position: contributors are volunteers, and he chooses not to volunteer.
He then has to accept (in the specific case) that the documentation is
imprecise/incomplete.

More precisely, he is correct that *his* contribution is not welcome,
contrary to my broad statement "contributions are welcome". The
narrower statement "contributions that follow the guidelines
are welcome" still stands.

Regards,
Martin
 
