Python and unicode

Goran Novosel

Hi everybody.

I've played with encoding in Python for a few hours, but it's still somewhat
confusing to me. So I've written a test file (encoded as utf-8) and put
everything I think is true in a comment at the beginning of the script.
Could you check whether it's correct? (On a side note, the script does what I
intended it to do.)

One more thing: is there some mechanism to avoid writing
'something'.decode('utf-8') all the time? Some sort of function call to tell
the interpreter that I'd like implicit decoding with a specified
encoding for all string constants in the script?

Here's my script:
-------------------
# vim: set encoding=utf-8 :

"""
----- encoding and py -----

- 1st (or 2nd) line tells py interpreter encoding of file
- if this line is missing, interpreter assumes 'ascii'
- it's possible to use variations of first line
- the first or second line must match the regular expression
"coding[:=]\s*([-\w.]+)" (PEP-0263)
- some variations:

'''
# coding=<encoding name>
'''

'''
#!/usr/bin/python
# -*- coding: <encoding name> -*-
'''

'''
#!/usr/bin/python
# vim: set fileencoding=<encoding name> :
'''

- this version works for my vim:
'''
# vim: set encoding=utf-8 :
'''

- constants can be given via str.decode() method or via unicode
constructor

- if locale is used, it shouldn't be set to 'LC_ALL' as it changes
encoding

"""

import datetime, locale

#locale.setlocale(locale.LC_ALL,'croatian') # changes encoding
# sets correct date format, but encoding is left alone
locale.setlocale(locale.LC_TIME,'croatian')

print 'default locale:', locale.getdefaultlocale()

s='abcdef ČčĆćĐđŠšŽž'.decode('utf-8')
ss=unicode('ab ČćŠđŽ','utf-8')

# date part of string is decoded as cp1250, because it's default locale
all=datetime.date(2000,1,6).strftime("'%d.%m.%Y.', %x, %A, %B, ").decode('cp1250')+'%s, %s' % (s, ss)

print all
-------------------
 
Martin v. Loewis

Goran said:
One more thing, is there some mechanism to avoid writing all the time
'something'.decode('utf-8')?

Yes, use u'something' instead (i.e. put the letter u before the literal,
to make it a unicode literal). Since Python 2.6, you can also put

from __future__ import unicode_literals

at the top of the file to make all string literals Unicode objects.
Since Python 3.0, this is the default (i.e. all string literals
*are* unicode objects).
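
To illustrate that last point, here is a small sketch for Python 3 (not taken from anyone's script in this thread). A plain literal, a u-prefixed literal (accepted again since Python 3.3) and UTF-8 bytes decoded explicitly should all give the same string. The variable names are made up for the example, and the characters are written as escapes so the snippet does not depend on the encoding of the file it is saved in.

```python
# Python 3 sketch: every string literal is already a unicode string.
plain = "\u010c\u0107\u0160\u0111\u017d"      # the letters Č ć Š đ Ž
prefixed = u"\u010c\u0107\u0160\u0111\u017d"  # u prefix is optional here

# The same five letters as raw UTF-8 bytes (two bytes per letter).
raw = b"\xc4\x8c\xc4\x87\xc5\xa0\xc4\x91\xc5\xbd"
decoded = raw.decode("utf-8")

print(plain == prefixed == decoded)  # True
print(len(decoded), len(raw))        # 5 10
```

So on Python 3 an explicit .decode('utf-8') is only needed for data that really arrives as bytes (files opened in binary mode, sockets and so on), not for literals in the source.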

Regards,
Martin
 
Steven D'Aprano

This will help Vim, but won't help Python.

It will actually -- the regex Python uses to detect encoding lines is
documented, and Vim-style declarations are allowed as are Emacs style. In
fact, something as minimal as:

# coding=utf-8

will do the job.

Use the PEP 263 encoding declaration
<URL:http://www.python.org/dev/peps/pep-0263/> to let Python know the
encoding of the program source file.

While PEPs are valuable, once accepted or rejected they become historical
documents. They don't necessarily document the current behaviour of the
language.

See here for documentation on encoding declarations:

http://docs.python.org/reference/lexical_analysis.html#encoding-declarations
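
The regex mentioned above can be tried out directly. This is only a sketch: it uses the pattern as quoted earlier in this thread (the current reference documentation words it slightly differently), and the helper name declared_encoding and the sample lines are made up for illustration.

```python
import re

# Pattern as quoted in the thread (PEP 263 wording).
CODING_RE = re.compile(r"coding[:=]\s*([-\w.]+)")

def declared_encoding(line):
    """Return the encoding named in a comment line, or None if there is none."""
    if not line.lstrip().startswith("#"):
        return None  # only comment lines can carry a declaration
    match = CODING_RE.search(line)
    return match.group(1) if match else None

samples = [
    "# coding=utf-8",
    "# -*- coding: latin-1 -*-",
    "# vim: set fileencoding=cp1250 :",
    "# vim: set encoding=utf-8 :",
    "# just a comment",
]
for line in samples:
    print(repr(line), "->", declared_encoding(line))
```

The Vim-style "encoding=utf-8" line matches because "encoding" happens to end in "coding", which is why it is accepted by Python as well.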
 
Goran Novosel

Can't believe I missed something as simple as u'smt', and I even saw
that on many occasions...
Thank you.
 
Dotan Cohen

Steven said:
While PEPs are valuable, once accepted or rejected they become historical
documents. They don't necessarily document the current behaviour of the
language.

See here for documentation on encoding declarations:

http://docs.python.org/reference/lexical_analysis.html#encoding-declarations

This is the first time that I've read the PEP document regarding
Unicode / UTF-8. I see that it mentions that the declaration must be
on the second or first line of the file. Is this still true in Python
3? I have been putting it further down (still before all python code,
but after some comments) in code that I write (for my own use, not
commercial code).
 
Peter Otten

Dotan said:
This is the first time that I've read the PEP document regarding
Unicode / UTF-8. I see that it mentions that the declaration must be
on the second or first line of the file. Is this still true in Python
3?
Yes.

Dotan said:
I have been putting it further down (still before all python code,
but after some comments) in code that I write (for my own use, not
commercial code).

It may work by accident, if you declare it as UTF-8, because that is also
the default in Python 3.

Peter
 
Dotan Cohen

Peter said:
It may work by accident, if you declare it as UTF-8, because that is also
the default in Python 3.

That does seem to be the case.

Thank you for the enlightenment and information.
 
Martin v. Loewis

On 20.09.2010 12:57, Dotan Cohen wrote:
That does seem to be the case.

Thank you for the enlightenment and information.

It's as Peter says. Python really will ignore any encoding declaration
on the third or later line. This was added to the spec on explicit
request from Guido van Rossum.

It's still the case today. However, in Python 3, in the absence of an
encoding declaration, the file encoding is assumed to be UTF-8
(producing an error if it actually is not). So it worked for you
by accident.
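
The two-line rule can be checked from Python 3 itself. The sketch below relies on tokenize.detect_encoding from the standard library, which reads at most the first two lines of a source file and falls back to UTF-8. The helper name detected and the sample sources are made up for the example.

```python
import io
import tokenize

def detected(source_bytes):
    """Return the source encoding Python would assume for these bytes."""
    readline = io.BytesIO(source_bytes).readline
    encoding, _lines_read = tokenize.detect_encoding(readline)
    return encoding

# Declaration on the second line: honoured.
on_line_2 = b"#!/usr/bin/python\n# coding: cp1250\nprint('x')\n"
# Same declaration pushed down to the third line: ignored.
on_line_3 = b"#!/usr/bin/python\n# some comment\n# coding: cp1250\nprint('x')\n"

print(detected(on_line_2))  # cp1250
print(detected(on_line_3))  # utf-8
```

With the declaration on the third line the result is simply the UTF-8 default, which is why a misplaced "coding: utf-8" line appears to work.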

Regards,
Martin
 
