Python and unicode

Discussion in 'Python' started by Goran Novosel, Sep 19, 2010.

  1. Hi everybody.

    I've played for a few hours with encoding in py, but it's still somewhat
    confusing to me. So I've written a test file (encoded as utf-8). I've
    put everything I think is true in a comment at the beginning of the script.
    Could you check if it's correct (on a side note, the script does what I
    intended it to do).

    One more thing, is there some mechanism to avoid writing all the time
    'something'.decode('utf-8')? Some sort of function call to tell py
    interpreter that I'd like to do implicit decoding with a specified
    encoding for all string constants in script?

    Here's my script:
    -------------------
    # vim: set encoding=utf-8 :

    """
    ----- encoding and py -----

    - 1st (or 2nd) line tells py interpreter encoding of file
    - if this line is missing, interpreter assumes 'ascii'
    - it's possible to use variations of first line
    - the first or second line must match the regular expression
    "coding[:=]\s*([-\w.]+)" (PEP-0263)
    - some variations:

    '''
    # coding=<encoding name>
    '''

    '''
    #!/usr/bin/python
    # -*- coding: <encoding name> -*-
    '''

    '''
    #!/usr/bin/python
    # vim: set fileencoding=<encoding name> :
    '''

    - this version works for my vim:
    '''
    # vim: set encoding=utf-8 :
    '''

    - constants can be given via str.decode() method or via unicode
    constructor

    - if locale is used, it shouldn't be set to 'LC_ALL' as it changes
    encoding

    """

    import datetime, locale

    #locale.setlocale(locale.LC_ALL,'croatian') # changes encoding
    # sets correct date format, but encoding is left alone
    locale.setlocale(locale.LC_TIME,'croatian')

    print 'default locale:', locale.getdefaultlocale()

    s='abcdef ČčĆćĐ𩹮ž'.decode('utf-8')
    ss=unicode('ab ČćŠđŽ','utf-8')

    # date part of string is decoded as cp1250, because it's the default locale
    all=datetime.date(2000,1,6).strftime("'%d.%m.%Y.', %x, %A, %B, ").decode('cp1250')+'%s, %s' % (s, ss)

    print all
    -------------------
    Goran Novosel, Sep 19, 2010
    #1

  2. > One more thing, is there some mechanism to avoid writing all the time
    > 'something'.decode('utf-8')?


    Yes, use u'something' instead (i.e. put the letter u before the literal,
    to make it a unicode literal). Since Python 2.6, you can also put

    from __future__ import unicode_literals

    at the top of the file to make all string literals Unicode objects.
    Since Python 3.0, this is the default (i.e. all string literals
    *are* unicode objects).
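
    A minimal sketch of the equivalence, not taken from the original post:
    decoding UTF-8 bytes gives the same text as writing a unicode literal
    directly. It assumes Python 3.3 or later, where the u prefix is accepted
    and plain literals are already unicode; the variable names are only
    illustrative.

```python
# UTF-8 bytes for 'ab ČćŠđŽ', spelled with escapes so the file encoding does not matter.
raw = b'ab \xc4\x8c\xc4\x87\xc5\xa0\xc4\x91\xc5\xbd'

# Explicit decoding, as in the original script.
decoded = raw.decode('utf-8')

# The same text as a unicode literal; no decode() call is needed.
literal = u'ab \u010c\u0107\u0160\u0111\u017d'

# Both routes produce equal unicode strings.
same_text = (decoded == literal)

# Characters versus bytes: each of the five accented letters takes two bytes in UTF-8.
char_count = len(literal)
byte_count = len(raw)
print(same_text, char_count, byte_count)
```

    Going the other way, literal.encode('utf-8') gives back the original bytes.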

    Regards,
    Martin
    Martin v. Loewis, Sep 19, 2010
    #2

  3. On Sep 19, 4:09 pm, Ben Finney wrote:
    > Goran Novosel writes:
    > > # vim: set encoding=utf-8 :

    >
    > This will help Vim, but won't help Python. Use the PEP 263 encoding
    > declaration <URL:http://www.python.org/dev/peps/pep-0263/> to let Python
    > know the encoding of the program source file.


    That's funny because I went to PEP 263 and the line he used was listed
    there. Apparently, you're the one that needs to read PEP 263.


    Carl Banks
    Carl Banks, Sep 20, 2010
    #3
  4. On Mon, 20 Sep 2010 09:09:31 +1000, Ben Finney wrote:

    > Goran Novosel writes:
    >
    >> # vim: set encoding=utf-8 :

    >
    > This will help Vim, but won't help Python.


    It will actually -- the regex Python uses to detect encoding lines is
    documented, and Vim-style declarations are allowed as are Emacs style. In
    fact, something as minimal as:

    # coding=utf-8

    will do the job.
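
    To illustrate, here is a rough sketch that applies the pattern quoted in
    the first post (the one from PEP 263) to a few comment styles. The helper
    name declared_encoding is made up for this example, and the interpreter's
    real check is somewhat stricter (it only considers comments on the first
    two lines of the file), so treat it as an approximation.

```python
import re

# Pattern as quoted from PEP 263 earlier in the thread.
CODING_RE = re.compile(r"coding[:=]\s*([-\w.]+)")


def declared_encoding(line):
    """Return the encoding named in a comment line, or None (hypothetical helper)."""
    match = CODING_RE.search(line)
    return match.group(1) if match else None


samples = [
    "# coding=utf-8",
    "# -*- coding: utf-8 -*-",
    "# vim: set fileencoding=utf-8 :",
    "# vim: set encoding=utf-8 :",  # matches because "encoding" ends in "coding"
    "# just a comment",
]

for line in samples:
    print(repr(line), "->", declared_encoding(line))
```

    The Vim-style line from the original script matches only because the
    pattern is searched for anywhere in the comment, not anchored to a
    specific keyword.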

    > Use the PEP 263 encoding
    > declaration <URL:http://www.python.org/dev/peps/pep-0263/> to let Python
    > know the encoding of the program source file.


    While PEPs are valuable, once accepted or rejected they become historical
    documents. They don't necessarily document the current behaviour of the
    language.

    See here for documentation on encoding declarations:

    http://docs.python.org/reference/lexical_analysis.html#encoding-declarations



    --
    Steven
    Steven D'Aprano, Sep 20, 2010
    #4
  5. Can't believe I missed something as simple as u'smt', and I even saw
    that on many occasions...
    Thank you.
    Goran Novosel, Sep 20, 2010
    #5
  6. On Mon, Sep 20, 2010 at 05:42, Steven D'Aprano wrote:
    >> Use the PEP 263 encoding
    >> declaration <URL:http://www.python.org/dev/peps/pep-0263/> to let Python
    >> know the encoding of the program source file.

    >
    > While PEPs are valuable, once accepted or rejected they become historical
    > documents. They don't necessarily document the current behaviour of the
    > language.
    >
    > See here for documentation on encoding declarations:
    >
    > http://docs.python.org/reference/lexical_analysis.html#encoding-declarations
    >
    >


    This is the first time that I've read the PEP document regarding
    Unicode / UTF-8. I see that it mentions that the declaration must be
    on the second or first line of the file. Is this still true in Python
    3? I have been putting it further down (still before all python code,
    but after some comments) in code that I write (for my own use, not
    commercial code).

    --
    Dotan Cohen

    http://gibberish.co.il
    http://what-is-what.com
    Dotan Cohen, Sep 20, 2010
    #6
  7. Dotan Cohen wrote:

    > On Mon, Sep 20, 2010 at 05:42, Steven D'Aprano wrote:
    >>> Use the PEP 263 encoding
    >>> declaration <URL:http://www.python.org/dev/peps/pep-0263/> to let Python
    >>> know the encoding of the program source file.

    >>
    >> While PEPs are valuable, once accepted or rejected they become historical
    >> documents. They don't necessarily document the current behaviour of the
    >> language.
    >>
    >> See here for documentation on encoding declarations:
    >>
    >> http://docs.python.org/reference/lexical_analysis.html#encoding-declarations
    >>
    >>

    >
    > This is the first time that I've read the PEP document regarding
    > Unicode / UTF-8. I see that it mentions that the declaration must be
    > on the second or first line of the file. Is this still true in Python
    > 3?


    Yes

    > I have been putting it further down (still before all python code,
    > but after some comments) in code that I write (for my own use, not
    > commercial code).


    It may work by accident, if you declare it as UTF-8, because that is also
    the default in Python 3.

    Peter
    Peter Otten, Sep 20, 2010
    #7
  8. On Mon, Sep 20, 2010 at 12:20, Peter Otten wrote:
    > It may work by accident, if you declare it as UTF-8, because that is also
    > the default in Python 3.
    >


    That does seem to be the case.

    Thank you for the enlightenment and information.


    --
    Dotan Cohen

    http://gibberish.co.il
    http://what-is-what.com
    Dotan Cohen, Sep 20, 2010
    #8
  9. On 20.09.2010 12:57, Dotan Cohen wrote:
    > On Mon, Sep 20, 2010 at 12:20, Peter Otten wrote:
    >> It may work by accident, if you declare it as UTF-8, because that is also
    >> the default in Python 3.
    >>

    >
    > That does seem to be the case.
    >
    > Thank you for the enlightenment and information.


    It's as Peter says. Python really will ignore any encoding declaration
    on the third or later line. This was added to the spec on explicit
    request from Guido van Rossum.

    It's still the case today. However, in Python 3, in the absence of an
    encoding declaration, the file encoding is assumed to be UTF-8
    (producing an error if it actually is not). So it worked for you
    by accident.
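
    One way to see this for yourself, as a sketch assuming Python 3 and its
    standard tokenize module: tokenize.detect_encoding applies the same
    first-two-lines rule. The helper name and the sample sources below are
    invented for the illustration.

```python
import io
import tokenize


def detected_encoding(source_bytes):
    """Return the source encoding that the tokenize module detects (hypothetical helper)."""
    encoding, _lines_read = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


# Declaration on line 2: it is honoured.
on_line_2 = b"#!/usr/bin/python\n# -*- coding: cp1250 -*-\nx = 1\n"

# Declaration on line 3: it is ignored and the UTF-8 default applies.
on_line_3 = b"#!/usr/bin/python\n# a comment\n# -*- coding: cp1250 -*-\nx = 1\n"

# No declaration at all: UTF-8 default.
no_declaration = b"x = 1\n"

for name, source in [("line 2", on_line_2), ("line 3", on_line_3), ("none", no_declaration)]:
    print(name, "->", detected_encoding(source))
```

    A utf-8 declaration on line 3 would give the same answer as no
    declaration, which is why the misplaced comment went unnoticed.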

    Regards,
    Martin
    Martin v. Loewis, Sep 20, 2010
    #9