Simpler transition to PEP 3000 "Unicode only strings"?

Petr Prikryl · Sep 20, 2005

Hi all,

My question is: how do you deal with mixing
Unicode and non-Unicode parts of your application?

Context:
========

PEP 3000 says:
"Make all strings be Unicode, and have a separate bytes() type."

Until then, I am forced to write
# -*- coding: cp123456 -*-
(see 2.1.4 Encoding declarations) and use...
myString = u'text with funny letters'

This leads to source pollution that will be
difficult to remove later.

The idea:
=========

What do you think about the following proposal,
which goes halfway?

If the Python source file is stored in UTF-8 (or
other recognised Unicode file format), then the
encoding declaration must reflect the format or
can be omitted entirely. In such a case, all
simple string literals will be treated as
unicode string literals.

Would this break any existing code?

Thanks for your time and experience,
pepr

Martin v. Löwis · Sep 20, 2005

Petr said:
Would this break any existing code?

Yes, it would break code which currently contains

# -*- coding: utf-8 -*-

and also contains byte string literals.

Notice that there is an alternative form of the UTF-8
declaration: if the Python file starts with a UTF-8
signature (BOM), then it is automatically considered
as UTF-8, with no explicit coding: declaration
required. Set IDLE's Options/General/Default Source
Encoding to UTF-8 to have IDLE automatically use the
UTF-8 signature when saving files with non-ASCII
characters.
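
To illustrate the two forms of declaration, here is a minimal sketch that asks
Python's own tokenizer which encoding it infers for a piece of source. It
assumes a Python 3 interpreter, where tokenize.detect_encoding() is available
(the Python of this thread did not have it); the helper name
detect_source_encoding and the sample sources are made up for the example.

```python
import codecs
import io
import tokenize


def detect_source_encoding(source_bytes):
    """Return the encoding the tokenizer infers for raw source bytes."""
    readline = io.BytesIO(source_bytes).readline
    encoding, _lines_read = tokenize.detect_encoding(readline)
    return encoding


# Source that starts with the UTF-8 signature (BOM) and has no declaration.
with_bom = codecs.BOM_UTF8 + b"s = 'text'\n"

# Source with an explicit coding declaration on the first line.
with_declaration = b"# -*- coding: latin-1 -*-\ns = 'text'\n"

bom_encoding = detect_source_encoding(with_bom)
declared_encoding = detect_source_encoding(with_declaration)

print(bom_encoding)
print(declared_encoding)
```

The BOM alone is enough for the file to be recognised as UTF-8; the
declaration comment is only needed when there is no signature.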

As for dropping the u prefix on string literals:
Just try the -U option of the interpreter some time,
which makes all string literals Unicode. If you manage
to get the standard library working this way, you
won't need a per-file decision anymore: just start
your program with 'python -U'.

Regards,
Martin

Dieter Maurer · Sep 21, 2005

Petr Prikryl said:
...
The idea:
=========

What do you think about the following proposal,
which goes halfway?

If the Python source file is stored in UTF-8 (or
other recognised Unicode file format), then the
encoding declaration must reflect the format or
can be omitted entirely. In such a case, all
simple string literals will be treated as
unicode string literals.

Would this break any existing code?

Yes: modules that construct byte strings (i.e. strings
which should *not* be unicode strings).

Nevertheless, such a module may be stored in UTF-8.
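
A small sketch of the kind of code that depends on byte strings staying byte
strings. It is written for Python 3, where the text/bytes split is explicit,
and the frame() helper is a made-up example, not something from this thread.

```python
import struct


def frame(payload):
    """Prefix a byte payload with its length as a 2-byte big-endian header."""
    return struct.pack(">H", len(payload)) + payload


text = "L\u00f6wis"            # text string: 5 characters
data = text.encode("utf-8")    # byte string: the o-umlaut takes 2 bytes

framed = frame(data)

# The header must count bytes, not characters.
print(len(text), len(data))
print(framed[:2])

# Passing the text string instead of bytes fails outright in Python 3.
try:
    frame(text)
    text_accepted = True
except TypeError:
    text_accepted = False
print(text_accepted)
```

If the literals in such a module silently became Unicode strings, lengths
and concatenations like these would no longer describe the bytes on the wire.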
