unicode string literals and "u" prefix

Discussion in 'Python' started by nico, Nov 8, 2004.

  1. nico

    nico Guest

    In my python scripts, I use a lot of accented characters as I work in
    In order to do this, I put the line
    # -*- coding: UTF-8 -*-
    at the beginning of the script file.
    Then, when I need to store accented characters in a string, I used to
    prefix the literal string with 'u', like this:
    mystring = u"prénom"

    But if I understand correctly, prefixing a unicode string literal with
    'u' will eventually become obsolete (in Python 3.0?), as all strings
    will be unicode in a more or less distant future.

    So, to write "clean" script code, is it a good idea to write a script
    like this ?

    ---- myscript ----

    #! /usr/local/bin/python -U
    # -*- coding: UTF-8 -*-

    s = 'hélène'
    print len(s)
    print s


    The second line says that all string literals are encoded in UTF-8, as
    I work with an editor that saves all my files as UTF-8.

    Normally, I should write
    s = u'hélène'
    but the -U python option makes python treat string literals as
    unicode strings.
    ( I know the -U option may disappear in a future python version, but
    isn't it better to delete the "-U" option at the top of the scripts
    than all the "u" unicode prefixes, once python treats all strings as
    unicode ?... )
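
For comparison, a minimal sketch of what the script above measures, written for Python 3, where a plain literal is already a unicode string (an assumption about the reader's interpreter; the thread itself is about Python 2). The escapes spell 'hélène' so the example does not depend on the file's encoding. In Python 2 without -U or the u prefix, the same literal would be stored as its UTF-8 bytes and len would report 8.

```python
# Python 3 sketch: a plain literal is already a unicode string.
# "\u00e9" is é and "\u00e8" is è, so this spells 'hélène'.
s = "h\u00e9l\u00e8ne"

# len counts characters (code points), not bytes.
char_count = len(s)

# The same text stored as UTF-8 needs two bytes per accented letter.
utf8_bytes = s.encode("utf-8")
byte_count = len(utf8_bytes)

print(char_count)  # 6
print(byte_count)  # 8
```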

    Finally, I write
    print s
    instead of
    print s.encode('utf-8')
    as I used to because I want this script to work on computer with other
    It seems that "print" encodes by default with the shell current
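
A rough sketch of the encoding step that print performs on a unicode string, written for Python 3. The helper name encode_for_stream is hypothetical, and the fallback to UTF-8 is an assumption made only for this illustration; it is not a claim about how print itself is implemented.

```python
import sys

def encode_for_stream(text, encoding=None):
    """Encode text for an output stream that uses the given encoding.

    Characters the encoding cannot represent become '?' instead of raising.
    """
    if encoding is None:
        # The attribute may be missing or None if sys.stdout has been
        # replaced by another object; fall back to UTF-8 (an assumption).
        encoding = getattr(sys.stdout, "encoding", None) or "utf-8"
    return text.encode(encoding, errors="replace")

s = "h\u00e9l\u00e8ne"  # 'hélène'
print(encode_for_stream(s, "utf-8"))    # b'h\xc3\xa9l\xc3\xa8ne'
print(encode_for_stream(s, "latin-1"))  # b'h\xe9l\xe8ne'
print(encode_for_stream(s, "ascii"))    # b'h?l?ne'
```

Calling print(s) directly leaves that choice to the interpreter, which is why the same script can run unchanged on machines whose terminals use different encodings.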

    Is this the best way to deal with accented characters ?
    Do you think that a script written like this will still work with
    python 3.0 ?
    Any comment ?
    nico, Nov 8, 2004

  2. I think you misunderstand. It might become deprecated (in the sense
    of becoming redundant yet still possible); in any case, this future
    is certainly distant (maybe five or ten years).

    No. Do use Unicode literals whenever you can.

    As you say: *when*. Current Python doesn't, and explicit is better than
    implicit. There is no plan yet as to when (or even if) to release Python

    Yes, it should.

    Most certainly. Even when string literals become Unicode by default,
    u""-prefixes will still be accepted - most likely so for ten or
    twenty years.
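
A small check of that prediction against later releases, as a sketch assuming Python 3.3 or newer: Python 3.0 through 3.2 rejected the u prefix, and 3.3 accepted it again (PEP 414), so prefixed literals keep working and mean the same as plain ones.

```python
# Python 3.3+ sketch: the u prefix is accepted and changes nothing.
prefixed = u"pr\u00e9nom"   # 'prénom'
plain = "pr\u00e9nom"

same_value = prefixed == plain
same_type = type(prefixed) is type(plain) is str
print(same_value, same_type)  # True True
```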

    Martin v. Löwis, Nov 8, 2004

  3. nico

    nico Guest

    Thank you a lot for your answer.

    I understand better, now.
    Nevertheless, this whole unicode issue is quite confusing for beginners
    ( I started learning Python two months ago... ).
    And it seems that I am not the only one in this situation.
    In fact, I just came across this discussion from April 2003, "[Zope3-dev]
    i18n, unicode, and the underline"

    As I work for an insurance company, most of our data contains French
    accented characters.
    So we are condemned to work essentially with unicode strings.
    In fact, it is hard to find examples where plain ascii strings would
    be useful in our case.
    Even data we retrieve from databases are returned to us as unicode

    That's why I tried to find a way to get rid of all those "u" prefixes
    instead of systematically putting one in front of each unicode string
    literal, which is somewhat "noisy".
    It is also because I am afraid that someday someone will forget the
    "u" prefix, and the error will only be detected at a much later
    stage, or too late.
    A way of defaulting all string literal as unicode would have been a

    It would be good if we could just write a declaration at the beginning
    of the source file like
    # strings_are_unicode_by_default
    We would write unicode strings without "u" prefix like this:
    and if we really must have plain ascii strings, we could explicitly
    prefix them with "a", for instance s=a"my plain ascii string".
    Thus, everybody would be happy, and there would be no impact on
    existing code or libraries.
    But there must be issues I am not aware of, I suppose...

    I think you have the same problem when you write strings in german
    But if it is no problem for you to prefix your strings with "u" like
    in :
    s=u"Vielen Dank für Ihre Antwort"
    then we can live with it too, for the next twenty years.

    Sometimes, I feel like an ethnic minority, when I see in a
    well-known book about Python that "Because Unicode is a relatively
    advanced and rarely used tool, we will omit further details in this
    introductory text."
    Working in a language with accented characters is definitely bad

    Freundliche Grüsse

    Nicolas Riesch
    nico, Nov 9, 2004
  4. Hi,
    This statement looks as if you confuse utf-8 and unicode. They are not the
    same. The former is an encoding of the latter.
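
To make the distinction concrete, a minimal Python 3 sketch (assumed interpreter; the variable names are only illustrative): the unicode string is the text itself, and UTF-8 or Latin-1 are two different ways of turning that text into bytes.

```python
text = "pr\u00e9nom"  # 'prénom', a sequence of Unicode code points

as_utf8 = text.encode("utf-8")      # é becomes two bytes: C3 A9
as_latin1 = text.encode("latin-1")  # é becomes one byte: E9

print(as_utf8)    # b'pr\xc3\xa9nom'
print(as_latin1)  # b'pr\xe9nom'

# Decoding with the codec that produced the bytes restores the text.
round_trip = as_utf8.decode("utf-8")

# Decoding with the wrong codec does not fail here; it silently garbles.
garbled = as_utf8.decode("latin-1")
```

The coding line at the top of a script only tells Python how to read the bytes of the file; it does not decide whether a literal becomes a unicode string.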
    I can understand that wish, but it certainly would break too much existing
    third-party code. But I wonder if a tool such as pychecker could be enhanced to
    issue warnings on python code if string literals are not prefixed by a u.
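
Such a check could be sketched with the standard tokenize module. This is not pychecker's API; find_unprefixed_literals is a hypothetical helper, written for Python 3, that reports every string literal carrying neither a u nor a b prefix.

```python
import io
import tokenize

def find_unprefixed_literals(source):
    """Return (line, literal) pairs for string literals lacking a u or b prefix."""
    hits = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type != tokenize.STRING:
            continue
        # The prefix is everything before the first quote character.
        quote_at = min(i for i, ch in enumerate(tok.string) if ch in "\"'")
        prefix = tok.string[:quote_at].lower()
        if "u" not in prefix and "b" not in prefix:
            hits.append((tok.start[0], tok.string))
    return hits

sample = 'a = u"x"\nb = "y"\nc = b"z"\n'
print(find_unprefixed_literals(sample))  # [(2, '"y"')]
```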
    Diez B. Roggisch, Nov 9, 2004
  5. I try to avoid putting non-English messages into source code (Python
    or not). Instead, I often put English into source code, then use
    gettext to fetch translations.
    I do use Unicode strings a lot in my Python applications. However,
    I rarely use them in string literals. If I had to put accented/umlauted
    characters into a Unicode literal, I would have no problem putting u""
    in front of the literal.

    If you really need a way of declaring all string literals as Unicode,
    on a per-module basis, then

    from __future__ import string_literals_are_unicode

    is an appropriate way of doing this. Of course, it does not work in
    the current versions (including 2.4); it doesn't work because nobody
    has contributed code to make it work.
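
For later readers: a per-module directive of this kind did ship, in Python 2.6, under the name unicode_literals rather than the spelling above. A small sketch, assuming a Python 3 interpreter, where the import is still accepted but has no effect; the exec call is only there so the directive sits at the top of its own source text.

```python
import __future__

# The directive that shipped is named unicode_literals, not the
# string_literals_are_unicode spelling suggested in this post.
feature = __future__.unicode_literals
optional_in = feature.getOptionalRelease()[:2]    # first release accepting it
mandatory_in = feature.getMandatoryRelease()[:2]  # release where it is the default

# A tiny module whose first statement is the directive.
module_source = 'from __future__ import unicode_literals\ns = "h\\u00e9l\\u00e8ne"\n'
namespace = {}
exec(compile(module_source, "<sketch>", "exec"), namespace)

print(optional_in, mandatory_in)      # (2, 6) (3, 0)
print(type(namespace["s"]).__name__)  # str
```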

    So if you really need the feature, please implement it, and submit
    the change to sf.net/projects/python. There is nothing wrong with
    such a feature - just that nobody has implemented it.
    This is how open source works.

    Martin v. Löwis, Nov 10, 2004
  6. Andrew Dalke

    Andrew Dalke Guest

    Were it to be done, would that also introduce new syntax for
    generating a byte string?

    Perhaps b"" as in

    s = b"\N{LATIN"

    Andrew Dalke, Nov 10, 2004
  7. Just

    Just Guest

    IMO we should plan to move towards the following:

    - all string literals should become unicode
    - there should be a bytes() type for binary
    - there should be a way to use byte string
    literals. b"..." seems a good candidate.
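
The three points above are roughly how Python 3 later turned out. A minimal sketch, assuming a Python 3 interpreter (the variable names are only illustrative):

```python
data = b"abc"   # bytes literal: ASCII characters and escapes only
text = "abc"    # plain literal: unicode text

print(type(data).__name__, type(text).__name__)  # bytes str
print(data[0])  # 97: indexing bytes yields an integer

# Mixing the two raises instead of guessing an encoding.
try:
    data + text
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Conversion is always explicit and names the encoding.
encoded = "\u00e9".encode("utf-8")   # b'\xc3\xa9'
decoded = encoded.decode("utf-8")    # 'é'
```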

    I doubt this can be done without breaking stuff (although a __future__
    directive may make it possible), so maybe this is a 3.0 project.

    There already is a PEP for a bytes type:
    ...but it seems it's been dormant since 2002. Time to revive it?

    Just, Nov 10, 2004
