Q: The `print' statement over Unicode

François Pinard · May 4, 2005

Hi, people. I hope someone would like to enlighten me.

For any application handling Unicode internally, I'm usually careful
to properly convert those Unicode strings into 8-bit strings before
writing them out.

However, this morning, I mistakenly forgot to do so before using one
Unicode string (containing a non-ASCII character) as an argument to
the `print' statement, and I did _not_ get an error. This is rather
surprising to me. I reread the section of the Python reference manual
(version 2.3.4, this machine uses 2.3.3 currently), and it does not say
anything about a special processing for Unicode strings.

In my understanding, when `print' is given an argument which is not
already a string (I read: 8-bit string), it first gets converted into
a string (I read: calling __str__). But if I call `str()' explicitly,
_then_ I get an error as expected. The question is, why is there no
error if I do not call `str()' explicitly?

For example, given file `question.py' with these contents:

# -*- coding: UTF-8 -*-
texte = unicode("Fran\xe7ois", 'latin1')
print type(texte), repr(texte), texte
print type(texte), repr(texte), str(texte)

doing `python question.py' yields:

<type 'unicode'> u'Fran\xe7ois' François
<type 'unicode'> u'Fran\xe7ois'
Traceback (most recent call last):
  File "question.py", line 4, in ?
    print type(texte), repr(texte), str(texte)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe7' \
in position 4: ordinal not in range(128)

(last line wrapped for legibility).

So (trying to be crystal clear), why is the first `print' working over
its third argument, but not the second? How does `print' convert that
Unicode string to an 8-bit string for output, if not through `str()'?
What is missing from the documentation, or from my way of understanding it?

Thomas Heller · May 4, 2005

AFAIK, print uses sys.stdout.encoding to encode the unicode string.
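
To make the difference concrete, here is a minimal sketch, not the interpreter's actual code. It assumes that `print' encodes with the stream's encoding when one is set and otherwise falls back to ASCII, which is what `str()' does by default. The helper name encode_for_stream is hypothetical, and the snippet is written so that it runs on a current Python rather than on 2.3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

texte = u"Fran\xe7ois"


def encode_for_stream(text, stream_encoding):
    """Hypothetical helper: encode text with the stream's encoding,
    falling back to ASCII when the stream has none (assumed behaviour)."""
    return text.encode(stream_encoding or "ascii")


# A Latin-1 terminal can represent the c-cedilla, so this succeeds.
print(repr(encode_for_stream(texte, "iso-8859-1")))

# With no stream encoding, the ASCII fallback fails like str(texte) did.
try:
    encode_for_stream(texte, None)
except UnicodeEncodeError as exc:
    print("failed:", exc)
```

If the assumption holds, it would also explain why the same script can fail once its output is redirected to a file or a pipe, since the stream then has no terminal encoding to use.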

Thomas

François Pinard · May 7, 2005

[Thomas Heller]

AFAIK, print uses sys.stdout.encoding to encode the unicode string.

Much thanks for this information.

I was not aware of this file attribute. Looking around, I found a
quick description in the Library Reference, under "2.3.8 File Objects".
However, I did not find in the documentation the rules stating how
or when this attribute receives a value, and in particular here, for
the case of `sys.stdout'. The Reference Manual, under "6.6 The print
statement", is silent about how Unicode strings are handled.
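
For anyone wanting to check what a given interpreter reports, a small sketch along these lines may help. The function name stream_encodings is made up for illustration. On older interpreters the attribute may be missing or None, for instance when output is redirected, whereas recent versions normally set it.

```python
from __future__ import print_function

import sys


def stream_encodings():
    """Return {stream name: encoding or None} for the standard streams."""
    result = {}
    for name in ("stdin", "stdout", "stderr"):
        stream = getattr(sys, name, None)
        # getattr with a default copes with streams lacking the attribute.
        result[name] = getattr(stream, "encoding", None)
    return result


for name, encoding in sorted(stream_encodings().items()):
    print("sys.%s.encoding = %r" % (name, encoding))
```

Running it once on a terminal and once with output piped through another program shows whether the value depends on where the stream is connected.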

Am I looking in the wrong places, or else, should not the standard
documentation more handily explain such things?

"Martin v. Löwis" · May 7, 2005

François Pinard said:
Am I looking in the wrong places, or else, should not the standard
documentation more handily explain such things?

It should, but, alas, it doesn't. Contributions are welcome.

The algorithm to set sys.std{in,out}.encoding is in
sysmodule.c:_PySys_Init and pythonrun.c:Py_InitializeEx
and goes roughly as follows:

- On Windows, if isatty returns true, use GetConsoleCP and
  GetConsoleOutputCP.
- On Unix, if isatty returns true, langinfo.h is present,
  CODESET is defined, and nl_langinfo(CODESET) returns a
  non-empty string, use that.
- Otherwise, .encoding will not be set.
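
The rules above can be approximated in Python itself. This is only a rough analogue of the C code, written for a current interpreter. The names guess_stream_encoding and _unix_codeset are invented for the sketch, and the "cp%d" naming of Windows code pages is an assumption rather than something stated in the list.

```python
from __future__ import print_function

import io
import locale
import os
import sys


def _unix_codeset():
    """Return nl_langinfo(CODESET) for the user's locale, or None."""
    if not (hasattr(locale, "nl_langinfo") and hasattr(locale, "CODESET")):
        return None
    saved = locale.setlocale(locale.LC_CTYPE)  # query the current setting
    try:
        locale.setlocale(locale.LC_CTYPE, "")  # adopt the environment's locale
        return locale.nl_langinfo(locale.CODESET) or None
    except locale.Error:
        return None
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)  # restore the earlier setting


def guess_stream_encoding(stream, for_input=False):
    """Hypothetical helper following the three rules; None means 'not set'."""
    isatty = getattr(stream, "isatty", None)
    if isatty is None or not isatty():
        return None  # rule 3: not a terminal, so no encoding
    if os.name == "nt":
        import ctypes  # Windows only: ask the console for its code page
        kernel32 = ctypes.windll.kernel32
        if for_input:
            code_page = kernel32.GetConsoleCP()
        else:
            code_page = kernel32.GetConsoleOutputCP()
        return "cp%d" % code_page if code_page else None
    return _unix_codeset()


print(guess_stream_encoding(io.StringIO()))  # an in-memory stream is not a tty
print(guess_stream_encoding(sys.stdout))
```

The second line of output depends on the platform, the locale, and whether stdout is attached to a terminal, so no particular value should be expected.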

Regards,
Martin

Jeremy Bowers · May 7, 2005

[Martin von Löwis]

It should, but, alas, it doesn't. Contributions are welcome.

My contributions are not that welcome. If they were, the core team
would not try forcing me into using robots and bug trackers!

I'm not sure that the smiley completely de-fangs this comment.

Have you ever tried managing a project even a tenth the size of Python
*without* those tools? If you had any idea of the kind of continuous,
day-in, day-out *useless busywork* you were asking of the developers,
merely to save you a minute or two on the one occasion you have something
to contribute, you'd apologize for the incredibly unreasonable demand you
are making from people giving you an amazing amount of free stuff. (You'd
get a lot less of it, too; administration isn't coding, and excessive
administration makes the coding even less fun and thus less likely to be
done.) An apology would not be out of line, smiley or no.

I've never administered anything the size of Python. I have, however, been
up close and personal with a project that had about five developers
full-time, and administering *that* without bug trackers would have been a
nightmare. I can't even imagine trying to run Python by hand.... at least
not that and getting useful work done too.

François Pinard · May 7, 2005

[Martin von Löwis]

It should, but, alas, it doesn't. Contributions are welcome.

My contributions are not that welcome. If they were, the core team
would not try forcing me into using robots and bug trackers!

The algorithm to set sys.std{in,out}.encoding is in
sysmodule.c:_PySys_Init and pythonrun.c:Py_InitializeEx
and goes roughly as follows:

- On Windows, if isatty returns true, use GetConsoleCP and
GetConsoleOutputCP.
- On Unix, if isatty returns true, langinfo.h is present,
CODESET is defined, and nl_langinfo(CODESET) returns a
non-empty string, use that.
- otherwise, .encoding will not be set.

Thanks. Your kind explanation, above, should make it, as is, somewhere
in the documentation -- until Xah decides to rewrite it, of course!


"Martin v. Löwis" · May 7, 2005

François Pinard said:
My contributions are not that welcome. If they were, the core team
would not try forcing me into using robots and bug trackers!

Ok, then we need to wait for somebody else to contribute a documentation
patch.

Thanks. Your kind explanation, above, should make it, as is, somewhere
in the documentation

But how will that happen? Unless somebody contributes a documentation
patch, the documentation will not change magically!

Regards,
Martin

John J. Lee · May 8, 2005

Jeremy Bowers said:
[Martin von Löwis]

François Pinard wrote:

Am I looking in the wrong places, or else, should not the standard
documentation more handily explain such things?

It should, but, alas, it doesn't. Contributions are welcome.

My contributions are not that welcome. If they were, the core team
would not try forcing me into using robots and bug trackers!

I'm not sure that the smiley completely de-fangs this comment.
Have you ever tried managing a project even a tenth the size of Python
*without* those tools? If you had any idea of the kind of continuous

[...]

I don't mean to put words into François' mouth, but IIRC he managed,
for example, GNU tar for some time and, while using some kind of
tracking system "under the covers", didn't impose it on his users.

IMVHO, that was very nice of him, but I'd be reluctant to attempt to
enforce this way of working on a hard-working and competent
contributor to an open source project to which I'm not a core
contributor myself.

John

Jeremy Bowers · May 8, 2005

I don't mean to put words into François' mouth, but IIRC he managed,
for example, GNU tar for some time and, while using some kind of
tracking system "under the covers", didn't impose it on his users.

IMVHO, that was very nice of him, but I'd be reluctant to attempt to
enforce this way of working on a hard-working and competent
contributor to an open source project to which I'm not a core
contributor myself.

Then I'd honor his consistency of belief, but still consider it impolite
in general, as asking someone to do tons of work overall to save you a bit
is almost always impolite.

Guest · May 8, 2005

Jeremy said:
Then I'd honor his consistency of belief, but still consider it impolite
in general, as asking someone to do tons of work overall to save you a bit
is almost always impolite.

This is not what he did, though - he did not break "the protocol" by
sending in patches by email (which indeed we would reject). Instead, he
said (before) that he cannot contribute because he is
unwilling to/incapable of using a bug tracker. This is an acceptable
position: contributors are volunteers, and he chooses not to volunteer.
He then has to accept (in the specific case) that the documentation is
imprecise/incomplete.

More precisely, he is correct that *his* contribution is not welcome,
contrary to my broad statement "contributions are welcome". The
narrower statement "contributions that follow the guidelines
are welcome" still stands.

Regards,
Martin
