Simpler transition to PEP 3000 "Unicode only strings"?

Petr Prikryl

Hi all,

My question is: How do you deal with mixing
Unicode and non-Unicode parts of your application?

Context:
========

PEP 3000 says
"Make all strings be Unicode, and have a separate bytes() type."

Until then, I am forced to write
# -*- coding: cp123456 -*-
(see 2.1.4 Encoding declarations) and use...
myString = u'text with funny letters'

This leads to source pollution that will be
difficult to remove later.

The idea:
=========

What do you think about the following proposal,
which goes halfway?

If the Python source file is stored in UTF-8 (or
another recognised Unicode file format), then the
encoding declaration must reflect the format or
can be omitted entirely. In such a case, all
simple string literals will be treated as
Unicode string literals.

Would this break any existing code?

Thanks for your time and experience,
pepr
 

Martin v. Löwis

Petr said:
Would this break any existing code?

Yes, it would break code which currently contains

# -*- coding: utf-8 -*-

and also contains byte string literals.
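To make the breakage concrete, here is a minimal sketch. It assumes Python 3, where the str/bytes split from PEP 3000 is already in place; in Python 2 the plain literal would be the byte string. The variable names are only for illustration.

```python
# Python 3 sketch: the same characters as text and as UTF-8 bytes.
# The escape \u00f6 (o with diaeresis) keeps this example independent
# of the encoding the source file itself is saved in.
text = "L\u00f6wis"          # Unicode string: counts characters
data = text.encode("utf-8")  # byte string: counts bytes

print(len(text))  # number of characters
print(len(data))  # number of bytes; the non-ASCII letter takes two
print(data)       # raw bytes as they would go to a file or socket
```

Code that measures lengths, slices at fixed offsets, or writes the literal straight to a binary file sees different values in the two cases, so silently switching the literal type changes its behaviour.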

Notice that there is an alternative form of the UTF-8
declaration: if the Python file starts with a UTF-8
signature (BOM), then it is automatically considered
as UTF-8, with no explicit coding: declaration
required. Set IDLE's Options/General/Default Source
Encoding to UTF-8 to have IDLE automatically use the
UTF-8 signature when saving files with non-ASCII
characters.
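The two forms of declaration can be told apart with a short sketch like the one below. It assumes a Python 3 interpreter, whose standard tokenize.detect_encoding detects the encoding from a UTF-8 BOM or an encoding cookie as specified in PEP 263. The helper name source_encoding and the sample sources are made up for illustration.

```python
import codecs
import io
import tokenize


def source_encoding(source_bytes):
    """Return the encoding name detected for a Python source given as bytes."""
    readline = io.BytesIO(source_bytes).readline
    encoding, _first_lines = tokenize.detect_encoding(readline)
    return encoding


# Source that starts with the UTF-8 signature and has no coding line.
with_bom = codecs.BOM_UTF8 + b"print('hello')\n"

# Source without a signature but with an explicit coding declaration.
with_cookie = b"# -*- coding: utf-8 -*-\nprint('hello')\n"

print(source_encoding(with_bom))     # reported as 'utf-8-sig'
print(source_encoding(with_cookie))  # reported as 'utf-8'
```

Both sources end up decoded as UTF-8; the only difference is whether the encoding is announced by the signature or by the comment line.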

As for dropping the u prefix on string literals:
Just try the -U option of the interpreter some time,
which makes all string literals Unicode. If you manage
to get the standard library working this way, you
won't need a per-file decision anymore: just start
your program with 'python -U'.

Regards,
Martin
 

Dieter Maurer

Petr Prikryl said:
...
The idea:
=========

What do you think about the following proposal,
which goes halfway?

If the Python source file is stored in UTF-8 (or
another recognised Unicode file format), then the
encoding declaration must reflect the format or
can be omitted entirely. In such a case, all
simple string literals will be treated as
Unicode string literals.

Would this break any existing code?

Yes: it would break modules that construct byte strings
(i.e. strings which should *not* be Unicode strings).

Such a module may nevertheless be stored in UTF-8.
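A minimal sketch of such a module, assuming Python 3 syntax (b'...' for byte strings). The binary format, the MAGIC marker and both function names are hypothetical and only serve to illustrate the point.

```python
import struct

# Marker for a made-up binary format: raw bytes, not text.
MAGIC = b"\x89DEMO"


def build_header(width, height):
    """Pack the marker followed by two big-endian unsigned 32-bit integers."""
    return MAGIC + struct.pack(">II", width, height)


def is_valid_utf8(raw):
    """Report whether raw could be decoded as UTF-8 text at all."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


header = build_header(640, 480)
print(len(header))            # 5 marker bytes plus 8 bytes of integers
print(is_valid_utf8(header))  # False: the byte 0x89 cannot start a UTF-8 sequence
```

The file holding this code can be saved as UTF-8 without any problem, yet the header it builds is not text in any encoding, so treating its literals as Unicode strings would be wrong.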
 
