Python and unicode

Goran Novosel · Sep 19, 2010

Hi everybody.

I've played with encodings in Python for a few hours, but they're still
somewhat confusing to me. So I've written a test file (encoded as UTF-8)
and put everything I think is true in a comment at the beginning of the
script. Could you check whether it's correct? (On a side note, the script
does what I intended it to do.)

One more thing: is there some mechanism to avoid writing
'something'.decode('utf-8') all the time? Some sort of function call to
tell the py interpreter that I'd like implicit decoding, with a specified
encoding, for all string constants in the script?

Here's my script:
-------------------
# vim: set encoding=utf-8 :

"""
----- encoding and py -----

- 1st (or 2nd) line tells py interpreter encoding of file
- if this line is missing, interpreter assumes 'ascii'
- it's possible to use variations of first line
- the first or second line must match the regular expression
"coding[:=]\s*([-\w.]+)" (PEP-0263)
- some variations:

'''
# coding=<encoding name>
'''

'''
#!/usr/bin/python
# -*- coding: <encoding name> -*-
'''

'''
#!/usr/bin/python
# vim: set fileencoding=<encoding name> :
'''

- this version works for my vim:
'''
# vim: set encoding=utf-8 :
'''

- constants can be given via str.decode() method or via unicode
constructor

- if locale is used, it shouldn't be set to 'LC_ALL' as it changes
encoding

"""

import datetime, locale

#locale.setlocale(locale.LC_ALL,'croatian') # changes encoding
# sets correct date format, but encoding is left alone
locale.setlocale(locale.LC_TIME,'croatian')

print 'default locale:', locale.getdefaultlocale()

s='abcdef ČčĆćĐđŠšŽž'.decode('utf-8')
ss=unicode('ab ČćŠđŽ','utf-8')

# date part of string is decoded as cp1250, because it's default locale
all=datetime.date(2000,1,6).strftime("'%d.%m.%Y.', %x, %A, %B, ").decode('cp1250')+'%s, %s' % (s, ss)

print all
-------------------
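
The script above joins text from two sources that use different byte encodings: the locale-formatted date (cp1250) and the literals in the UTF-8 source file. A minimal sketch of that idea, written for Python 3, follows. The byte strings are hypothetical stand-ins built by hand (Croatian names for Thursday and January), not real strftime output, and the variable names are only illustrative.

```python
# Hypothetical stand-ins for the two byte strings in the Python 2 script.
# Assumption: strftime under a cp1250 locale returns cp1250-encoded bytes.
date_bytes = "četvrtak, siječanj".encode("cp1250")
# Assumption: a plain str literal in a UTF-8 source file holds UTF-8 bytes.
literal_bytes = "ab ČćŠđŽ".encode("utf-8")

# Decode each piece with the codec it was actually encoded in...
date_text = date_bytes.decode("cp1250")
literal_text = literal_bytes.decode("utf-8")

# ...and only then join them, once both are Unicode text.
combined = date_text + ", " + literal_text
print(combined)
```

Decoding at the boundary and joining only Unicode values is what the `.decode('cp1250')` call and the `'...'.decode('utf-8')` calls do in the original script.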

Martin v. Loewis · Sep 19, 2010

Goran said:
One more thing, is there some mechanism to avoid writing all the time
'something'.decode('utf-8')?

Yes, use u'something' instead (i.e. put the letter u before the literal,
to make it a unicode literal). Since Python 2.6, you can also put

from __future__ import unicode_literals

at the top of the file to make all string literals Unicode objects.
Since Python 3.0, this is the default (i.e. all string literals
*are* unicode objects).

Regards,
Martin
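
As a rough illustration of Martin's last point, here is a small sketch for Python 3, where a plain literal is already a Unicode string and bytes only appear when you encode explicitly. The variable names are made up for the example; the sample text is the one from Goran's script.

```python
# Python 3: a plain string literal is a Unicode (str) object.
s = "abcdef ČčĆćĐđŠšŽž"

# The u prefix is still accepted (Python 3.3+) but changes nothing.
legacy = u"abcdef ČčĆćĐđŠšŽž"

# Bytes only appear when encoding explicitly; decoding reverses it.
data = s.encode("utf-8")
round_trip = data.decode("utf-8")

print(type(s).__name__, type(data).__name__, len(s), len(data))
```

Under Python 2.6 or 2.7, putting `from __future__ import unicode_literals` at the top of the file gives literals the same behaviour, as described above.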

Carl Banks · Sep 19, 2010

An earlier reply said:
This will help Vim, but won't help Python. Use the PEP 263 encoding
declaration <URL:http://www.python.org/dev/peps/pep-0263/> to let Python
know the encoding of the program source file.

That's funny because I went to PEP 263 and the line he used was listed
there. Apparently, you're the one that needs to read PEP 263.

Carl Banks

Steven D'Aprano · Sep 19, 2010

An earlier reply said:
This will help Vim, but won't help Python.

It will actually -- the regex Python uses to detect encoding lines is
documented, and Vim-style declarations are allowed as are Emacs style. In
fact, something as minimal as:

# coding=utf-8

will do the job.

The same reply continued:
Use the PEP 263 encoding
declaration <URL:http://www.python.org/dev/peps/pep-0263/> to let Python
know the encoding of the program source file.

While PEPs are valuable, once accepted or rejected they become historical
documents. They don't necessarily document the current behaviour of the
language.

See here for documentation on encoding declarations:

http://docs.python.org/reference/lexical_analysis.html#encoding-declarations
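
To see why the Vim-style line counts as a declaration, one can try the pattern quoted earlier in the thread against a few candidate lines. This is only a sketch: the helper name `declared_encoding` and the sample lines are made up for the example, and the pattern is the one quoted from PEP 263, which may be looser than what a given Python version actually applies.

```python
import re

# Pattern as quoted in the thread (from PEP 263).
CODING_RE = re.compile(r"coding[:=]\s*([-\w.]+)")

def declared_encoding(line):
    """Return the encoding name the pattern finds in a line, or None."""
    match = CODING_RE.search(line)
    return match.group(1) if match else None

# Sample lines: minimal, Emacs-style, two Vim-style, and a plain comment.
candidates = [
    "# coding=utf-8",
    "# -*- coding: latin-1 -*-",
    "# vim: set fileencoding=cp1250 :",
    "# vim: set encoding=utf-8 :",
    "# just a comment",
]

results = [declared_encoding(line) for line in candidates]
for line, enc in zip(candidates, results):
    print("%-36s -> %s" % (line, enc))
```

The `# vim: set encoding=utf-8 :` line matches because "encoding=utf-8" contains "coding=utf-8"; the pattern does not care what precedes "coding".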

Goran Novosel · Sep 20, 2010

Can't believe I missed something as simple as u'smt', and I even saw
that on many occasions...
Thank you.

Dotan Cohen · Sep 20, 2010

Steven said:
While PEPs are valuable, once accepted or rejected they become historical
documents. They don't necessarily document the current behaviour of the
language.

See here for documentation on encoding declarations:

http://docs.python.org/reference/lexical_analysis.html#encoding-declarations

This is the first time that I've read the PEP document regarding
Unicode / UTF-8. I see that it mentions that the declaration must be
on the second or first line of the file. Is this still true in Python
3? I have been putting it further down (still before all python code,
but after some comments) in code that I write (for my own use, not
commercial code).

Peter Otten · Sep 20, 2010

Dotan said:
This is the first time that I've read the PEP document regarding
Unicode / UTF-8. I see that it mentions that the declaration must be
on the second or first line of the file. Is this still true in Python
3?

Yes.

Dotan said:
I have been putting it further down (still before all python code,
but after some comments) in code that I write (for my own use, not
commercial code).

It may work by accident, if you declare it as UTF-8, because that is also
the default in Python 3.

Peter

Dotan Cohen · Sep 20, 2010

Peter said:
It may work by accident, if you declare it as UTF-8, because that is also
the default in Python 3.

That does seem to be the case.

Thank you for the enlightenment and information.

Martin v. Loewis · Sep 20, 2010

On 20.09.2010 12:57, Dotan Cohen wrote:

That does seem to be the case.

Thank you for the enlightenment and information.

It's as Peter says. Python really will ignore any encoding declaration
on the third or later line. This was added to the spec on explicit
request from Guido van Rossum.

It's still the case today. However, in Python 3, in the absence of an
encoding declaration, the file encoding is assumed to be UTF-8
(producing an error if it actually is not). So it worked for you
by accident.

Regards,
Martin
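
The two-line rule and the UTF-8 default can be checked in Python 3 with the standard library's `tokenize.detect_encoding`, which inspects at most the first two lines of a source file. The sketch below builds two small cp1250-encoded sources in memory; the helper `source_encoding` and the sample sources are invented for illustration.

```python
import io
import tokenize

def source_encoding(source_bytes):
    """Return the source encoding tokenize detects for these bytes."""
    readline = io.BytesIO(source_bytes).readline
    encoding, _lines_read = tokenize.detect_encoding(readline)
    return encoding

# A line of code containing a non-ASCII character, stored as cp1250 bytes.
body = "name = 'Č'\n".encode("cp1250")

# Declaration on line 2: honoured.
on_line_2 = b"#!/usr/bin/python\n# -*- coding: cp1250 -*-\n" + body
# Declaration on line 3: ignored, so the UTF-8 default applies.
on_line_3 = b"# a comment\n# another comment\n# -*- coding: cp1250 -*-\n" + body

enc_2 = source_encoding(on_line_2)
enc_3 = source_encoding(on_line_3)
print(enc_2, enc_3)

# Decoding with the detected encoding works for the first source...
text = on_line_2.decode(enc_2)

# ...but fails for the second, because the bytes are not valid UTF-8.
try:
    on_line_3.decode(enc_3)
    late_declaration_ok = True
except UnicodeDecodeError:
    late_declaration_ok = False
print(late_declaration_ok)
```

Had the second source really been UTF-8, the late declaration would have gone unnoticed, which is the "worked by accident" case described above.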
