unicode string literals and "u" prefix

nico · Nov 8, 2004

In my Python scripts, I use a lot of accented characters, as I work in
French.
To do this, I put the line
# -*- coding: UTF-8 -*-
at the beginning of the script file.
Then, when I need to store accented characters in a string, I usually
prefix the string literal with 'u', like this:
mystring = u"prénom"

But if I understand correctly, prefixing a unicode string literal with 'u'
will eventually become obsolete (in Python 3.0?), as all strings
will be unicode in a more or less distant future.

So, to write "clean" script code, is it a good idea to write a script
like this ?

---- myscript ----

#! /usr/local/bin/python -U
# -*- coding: UTF-8 -*-

s = 'hélène'
print len(s)
print s

-------------------

The second line says that all string literals are encoded in UTF-8, as
I work with an editor that saves all my files as UTF-8.

Normally, I should write
s = u'hélène'
but the -U option makes Python treat string literals as unicode strings.
(I know the -U option may disappear in a future Python version, but isn't it
better to delete the "-U" option at the top of the scripts than
all the "u" prefixes, once Python treats all strings as
unicode?)

Finally, I write
print s
instead of
print s.encode('utf-8')
as I used to, because I want this script to work on computers with other
encodings.
It seems that "print" encodes by default with the shell's current
encoding.

Is this the best way to deal with accented characters?
Do you think that a script written like this will still work with
Python 3.0?
Any comments?

Martin v. Löwis · Nov 8, 2004

nico said:
But if I understand correctly, prefixing a unicode string literal with 'u'
will eventually become obsolete (in Python 3.0?), as all strings
will be unicode in a more or less distant future.

I think you misunderstand. It might become deprecated (in the sense
of becoming redundant yet still possible); in any case, this future
is certainly distant (maybe five or ten years).

So, to write "clean" script code, is it a good idea to write a script
like this ?

No. Do use Unicode literals whenever you can.

(I know the -U option may disappear in a future Python version, but isn't it
better to delete the "-U" option at the top of the scripts than
all the "u" prefixes, once Python treats all strings as
unicode?)

As you say: *when*. Current Python doesn't, and explicit is better than
implicit. There is no plan yet as to when (or even if) to release Python
3.

It seems that "print" encodes by default with the shell's current
encoding.

Yes, it should.
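
As a rough illustration of the behaviour discussed here, the sketch below encodes a unicode string with the encoding Python reports for standard output. The fallback to UTF-8 and the "replace" error handler are assumptions made for this example, not a description of what print does internally.

```python
import sys

# Written with escapes so the example does not depend on the file's encoding.
text = u"h\xe9l\xe8ne"

# sys.stdout.encoding can be None (for instance when output is piped),
# so this sketch falls back to UTF-8; that fallback is an assumption.
encoding = sys.stdout.encoding or "utf-8"

# "replace" substitutes characters the target encoding cannot represent
# instead of raising UnicodeEncodeError.
encoded = text.encode(encoding, "replace")

print("stdout encoding: %s" % encoding)
print("characters: %d, encoded bytes: %d" % (len(text), len(encoded)))
```

The number of encoded bytes depends on the terminal's encoding, which is why the script should not hard-code one when printing.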

Do you think that a script written like this will still work with
Python 3.0?

Most certainly. Even when string literals become Unicode by default,
u""-prefixes will still be accepted - most likely so for ten or
twenty years.

Regards,
Martin

nico · Nov 9, 2004

Thank you a lot for your answer.

I understand better, now.
Nevertheless, this whole unicode issue is quite confusing for beginners
(I started to learn Python two months ago).
And it seems that I am not the only one in this situation.
In fact, I just came across this discussion from April 2003, "[Zope3-dev]
i18n, unicode, and the underline":
http://mail.zope.org/pipermail/zope3-dev/2003-April/006410.html

Working for an insurance company, most of our data contains French
accented characters.
So, we are condemned to work essentially with unicode strings.
In fact, it is hard to find examples where plain ASCII strings would
be useful in our case.
Even the data we retrieve from databases is returned to us as unicode
strings.
That's why I tried to find a way to get rid of all those "u" prefixes
instead of systematically putting one in front of each unicode string
literal, which is somewhat "noisy".
It is also because I am afraid that one day someone will forget the
"u" prefix, and the error will only be detected at a much later stage, or
too late.
A way of making all string literals unicode by default would have been a
relief.

It would be good if we could just write a declaration at the beginning
of the source file like
# strings_are_unicode_by_default
We would write unicode strings without the "u" prefix like this:
s="élément"
and if we really must have plain ASCII strings, we could explicitly
prefix them with "a", for instance s=a"my plain ascii string".
Thus, everybody would be happy, and there would be no impact on
code or libraries that have already been written.
But there must be issues I am not aware of, I suppose...

I think you have the same problem when you write strings in
German.
But if it is no problem for you to prefix your strings with "u", as
in:
s=u"Vielen Dank für Ihre Antwort"
then we can live with it too, for the next twenty years.

Sometimes, I feel like an ethnic minority, when I see in a
well-known book about Python that "Because Unicode is a relatively
advanced and rarely used tool, we will omit further details in this
introductory text."
Working in a language with accented characters is definitely bad
luck.

Freundliche Grüsse

Nicolas Riesch

Diez B. Roggisch · Nov 9, 2004

Hi,

Working for an insurance company, most of our data contains French
accented characters.
So, we are condemned to work essentially with unicode strings.
In fact, it is hard to find examples where plain ASCII strings would
be useful in our case.
Even the data we retrieve from databases is returned to us as unicode
strings.

This statement reads as if you are confusing UTF-8 and Unicode. They are not the
same. The former is an encoding of the latter.
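
To make the distinction concrete, here is a minimal sketch: the same unicode string is encoded in two different ways, giving different byte sequences that both decode back to the same text. The variable names are only illustrative.

```python
# One unicode string: six characters, written with escapes so the
# example does not depend on the source file's encoding.
text = u"h\xe9l\xe8ne"

# Two different encodings of that same string.
utf8_bytes = text.encode("utf-8")      # accented letters take two bytes each
latin1_bytes = text.encode("latin-1")  # every character takes one byte

print(len(text))
print(len(utf8_bytes))
print(len(latin1_bytes))

# Decoding with the matching codec recovers the original text.
assert utf8_bytes.decode("utf-8") == text
assert latin1_bytes.decode("latin-1") == text
```

The unicode string is the abstract text; UTF-8 and Latin-1 are just two ways of writing it down as bytes.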

A way of making all string literals unicode by default would have been a
relief.

I can understand that wish, but it would certainly break too much existing
third-party code. But I wonder if a tool such as pychecker could be enhanced to
issue warnings on Python code if string literals are not prefixed with a u.
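
A check of this kind could be sketched with the standard tokenize module. This is not pychecker code; the function name `find_unprefixed_literals` is hypothetical, and the sketch assumes Python 3, where tokens are named tuples. It flags every literal without a u, including docstrings and byte strings, so a real tool would need finer rules.

```python
import io
import tokenize


def find_unprefixed_literals(source):
    """Return (line, literal) pairs for string literals lacking a u prefix."""
    findings = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type != tokenize.STRING:
            continue
        # Everything before the first quote character is the prefix (u, r, ...).
        positions = (tok.string.find("'"), tok.string.find('"'))
        quote_pos = min(p for p in positions if p != -1)
        prefix = tok.string[:quote_pos].lower()
        if "u" not in prefix:
            findings.append((tok.start[0], tok.string))
    return findings


sample = 'a = u"ok"\nb = "missing prefix"\nc = \'also missing\'\n'
for line_no, literal in find_unprefixed_literals(sample):
    print("line %d: %s has no u prefix" % (line_no, literal))
```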

Martin v. Löwis · Nov 10, 2004

nico said:
I think you have the same problem when you write strings in
German.

I try to avoid putting non-English messages into source code (Python
or not). Instead, I often put English into source code, then use
gettext to fetch translations.

Sometimes, I feel like an ethnic minority, when I see in a
well-known book about Python that "Because Unicode is a relatively
advanced and rarely used tool, we will omit further details in this
introductory text."

I do use Unicode strings a lot in my Python applications. However,
I rarely use them in string literals. If I had to put accented/umlauted
characters into a Unicode literal, I would have no problem putting u""
in front of the literal.

If you really need a way of declaring all string literals as Unicode,
on a per-module basis, then

from __future__ import string_literals_are_unicode

is an appropriate way of doing this. Of course, it does not work in
the current versions (including 2.4); it doesn't work because nobody
has contributed code to make it work.

So if you really need the feature, please implement it, and submit
the change to sf.net/projects/python. There is nothing wrong with
such a feature - just that nobody has implemented it.
This is how open source works.
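
A directive of this kind did later appear under a different name: Python 2.6 and newer accept "from __future__ import unicode_literals". The sketch below assumes such a version; on Python 3 the import is accepted and has no further effect.

```python
from __future__ import unicode_literals

# With the directive in effect, an unprefixed literal is a unicode string.
s = "pr\xe9nom"

# type(u"") is unicode on Python 2 and str on Python 3, so this check
# works on both.
print(isinstance(s, type(u"")))
print(len(s))
```

The directive applies per module and must be placed at the top of the file, before any other statement.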

Regards,
Martin

Andrew Dalke · Nov 10, 2004

Martin said:
If you really need a way of declaring all string literals as Unicode,
on a per-module basis, then

from __future__ import string_literals_are_unicode

Were it to be done, would that also introduce new syntax for
generating a byte string?

Perhaps b"" as in

s = b"\N{LATIN"

?
Andrew

Just · Nov 10, 2004

Andrew Dalke said:
Were it to be done, would that also introduce new syntax for
generating a byte string?

Perhaps b"" as in

s = b"\N{LATIN"

?

IMO we should plan to move towards the following:

- all string literals should become unicode
- there should be a bytes() type for binary
strings
- there should be a way to use byte string
literals. b"..." seems a good candidate.

I doubt this can be done without breaking stuff (although a __future__
directive may make it possible), so maybe this is a 3.0 project.

There already is a PEP for a bytes type:
http://www.python.org/peps/pep-0296.html
...but it seems it's been dormant since 2002. Time to revive it?
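
Python 3 ended up with essentially this split. The sketch below assumes Python 3 and only illustrates the three points listed above; it does not reproduce the PEP's proposed API.

```python
# Plain literals are unicode text (type str on Python 3).
text = "h\xe9l\xe8ne"

# Encoding text produces a bytes object, the binary string type.
data = text.encode("utf-8")

# b"..." is the literal syntax for byte strings.
literal = b"abc"

print(type(text).__name__, len(text))
print(type(data).__name__, len(data))
print(type(literal).__name__, len(literal))

# Text and bytes are distinct types; converting between them is explicit.
print(data.decode("utf-8") == text)
```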

Just
