Use of Unicode in Python 2.5 source code literals

Uncle Bruce · May 3, 2009

I'm working with Python 2.5.4 and the NLTK (Natural Language
Toolkit). I'm an experienced programmer, but new to Python.

This question arose when I tried to create a literal in my source code
for a Unicode codepoint greater than 255. (I also posted this
question in the NLTK discussion group).

The Python HELP (at least for version 2.5.4) states:

+++++++
Python supports writing Unicode literals in any encoding, but you have
to declare the encoding being used. This is done by including a
special comment as either the first or second line of the source file:

#!/usr/bin/env python
# -*- coding: latin-1 -*-
++++++++++++

Based on some experimenting I've done, I suspect that the support for
Unicode literals in ANY encoding isn't really accurate. What seems to
happen is that there must be an 8-bit mapping between the set of
Unicode literals and what can be used as literals.

Even when I set Options / General / Default Source Encoding to UTF-8,
IDLE won't allow Unicode literals (e.g. characters copied and pasted
from the Windows Character Map program) to be used, even in a quoted
string, if they represent an ord value greater than 255.

I noticed, in researching this question, that Marc Andre Lemburg
stated, back in 2001, "Since Python source code is defined to be
ASCII..."

I'm writing code for linguistics (other than English), so I need
access to lots more characters. Most of the time, the characters come
from files, so no problem. But for some processing tasks, I simply
must be able to use "real" Unicode literals in the source code.
(Writing hex escape sequences in a complex regex would be a
nightmare).

Was this taken care of in the switch from Python 2.X to 3.X?

Is there a way to use more than 255 Unicode characters in source code
literals in Python 2.5.4?

Also, in the Windows version of Python, how can I tell if it was
compiled to support 16 bits of Unicode or 32 bits of Unicode?

Bruce in Toronto

Steven D'Aprano · May 3, 2009

Uncle Bruce wrote:

Based on some experimenting I've done, I suspect that the support for
Unicode literals in ANY encoding isn't really accurate. What seems to
happen is that there must be an 8-bit mapping between the set of Unicode
literals and what can be used as literals.

Even when I set Options / General / Default Source Encoding to UTF-8,
IDLE won't allow Unicode literals (e.g. characters copied and pasted
from the Windows Character Map program) to be used, even in a quoted
string, if they represent an ord value greater than 255.

When you say it "won't allow", what do you mean? That you can't paste
them into the document? Does it give an error? An exception at compile
time or runtime?

I assume you have included the coding line at the top of the file. Make
sure it says utf-8 and not latin-1.

# -*- coding: utf-8 -*-

This is especially important if you use a Windows text editor that puts a
Unicode BOM at the start of the file.
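
To illustrate why the declaration has to match what the editor actually saved, here is a minimal sketch (not taken from the thread). It assumes the editor wrote the file as UTF-8; the helper name decode_as is hypothetical. It decodes the same two bytes once as utf-8 and once as latin-1. It uses escapes rather than literal characters so it does not depend on its own file encoding, and it should run on Python 2.6+ and 3.3+.

```python
import codecs

theta = u'\u03b8'              # GREEK SMALL LETTER THETA
raw = theta.encode('utf-8')    # the two bytes (CE B8) a UTF-8 editor saves


def decode_as(data, encoding):
    """Hypothetical helper: decode source bytes as a given declaration would."""
    return data.decode(encoding)


right = decode_as(raw, 'utf-8')      # one character: U+03B8
wrong = decode_as(raw, 'latin-1')    # two characters: U+00CE, U+00B8

# A UTF-8 BOM is the three bytes EF BB BF at the very start of the file.
saved_with_bom = codecs.BOM_UTF8 + raw
has_bom = saved_with_bom.startswith(codecs.BOM_UTF8)
```

With the latin-1 reading, the single theta turns into two unrelated characters. As far as PEP 263 goes, a file that starts with a UTF-8 BOM may only declare utf-8; any other declared encoding is reported as an error.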

What happens if you use a different editor to insert the characters in
the file, and then open it in IDLE?

How are you writing the literals? As byte strings or unicode strings? E.g.

# filename = nonascii.py
theta = 'θ' # byte string, probably will lead to problems
sigma = u'Σ' # unicode, this is the Right Way
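
A rough sketch of why the byte string version tends to cause trouble, assuming the file is saved as UTF-8. In Python 2 a plain 'θ' literal holds the encoded bytes, which is simulated here with .encode() so the snippet also runs on Python 3. The variable names are made up for illustration.

```python
theta_unicode = u'\u03b8'                     # one code point
theta_bytes = theta_unicode.encode('utf-8')   # what a plain 'θ' holds in 2.x

unicode_length = len(theta_unicode)   # counts characters
bytes_length = len(theta_bytes)       # counts bytes

# Slicing the byte string can cut a character in half.
try:
    theta_bytes[:1].decode('utf-8')
    half_is_valid = True
except UnicodeDecodeError:
    half_is_valid = False
```

So len(), slicing and indexing all operate on bytes rather than characters unless the literal carries the u prefix.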

Uncle Bruce wrote:

Is there a way to use more than 255 Unicode characters in source code
literals in Python 2.5.4?

It works for me in Python 2.4 and 2.5, although I'm not using IDLE.
θ

Perhaps it is a problem with IDLE?

Matt Nordhoff · May 3, 2009

Works for me:

--- snip ---
$ cat snowman.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import unicodedata

snowman = u'☃'

print len(snowman)
print unicodedata.name(snowman)
$ python2.6 snowman.py
1
SNOWMAN
--- snip ---

What did you set the encoding to in the declaration at the top of the
file? The help text you quoted uses latin-1 as an example, an encoding
which, of course, only supports 256 code points. Did you try utf-8 instead?
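
The 256 code point limit is easy to check. This is only a sketch; the helper name fits is hypothetical, and it should run on Python 2.6+ and 3.3+.

```python
snowman = u'\u2603'     # SNOWMAN, well above code point 255
e_acute = u'\xe9'       # LATIN SMALL LETTER E WITH ACUTE, inside latin-1


def fits(text, encoding):
    """Hypothetical helper: True if text can be written in the encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# Every possible byte value decodes to exactly one latin-1 character.
all_latin1 = bytearray(range(256)).decode('latin-1')
highest = max(ord(ch) for ch in all_latin1)

snowman_in_latin1 = fits(snowman, 'latin-1')
snowman_in_utf8 = fits(snowman, 'utf-8')
e_acute_in_latin1 = fits(e_acute, 'latin-1')
```

Here highest comes out as 255, and the snowman only fits under utf-8, which matches the behaviour described in the original post when the declaration says latin-1.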

The regular expression engine's Unicode support is a different question,
and I do not know the answer.
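
For what it's worth, here is a small sketch of how literal characters could be used in a regex instead of hex escapes. It assumes the file is saved as UTF-8 with a matching declaration; the sample text and variable names are made up. The re.UNICODE flag is what makes \w cover non-ASCII letters in Python 2 (on Python 3 it is already the default for text patterns).

```python
# -*- coding: utf-8 -*-
import re

text = u'the word λογος appears here'

# Character class written with literal Greek letters.
greek_word = re.compile(u'[α-ω]+', re.UNICODE)
matches = greek_word.findall(text)

# The same class spelled with escapes, for comparison.
escaped = re.compile(u'[\u03b1-\u03c9]+', re.UNICODE)
same = escaped.findall(text)

# With re.UNICODE, \w matches the Greek letters as well as the ASCII ones.
words = re.findall(u'\\w+', text, re.UNICODE)
```

Both patterns find the same single Greek word, so the literal form is only a readability choice. Note that the range α-ω covers unaccented lowercase letters only; accented forms sit outside it.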

By the way, Python 2.x only supports using non-ASCII characters in
source code in string literals. Python 3 adds support for Unicode
identifiers (e.g. variable names, function argument names, etc.).
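
On the 16-bit versus 32-bit question from the original post, sys.maxunicode reports which kind of build is running. A minimal sketch follows; the function name unicode_build is hypothetical, and the optional argument exists only so both cases can be tried on any interpreter.

```python
import sys


def unicode_build(maxunicode=None):
    """Hypothetical helper: describe the interpreter's Unicode width."""
    if maxunicode is None:
        maxunicode = sys.maxunicode
    if maxunicode == 0xFFFF:
        return 'narrow build (16-bit code units)'
    return 'wide build (full range up to U+10FFFF)'


this_interpreter = unicode_build()
narrow_example = unicode_build(0xFFFF)
wide_example = unicode_build(0x10FFFF)
```

On a narrow build, characters above U+FFFF are stored as surrogate pairs, so len() of such a character is 2. As far as I know the standard Windows installers for 2.x were narrow builds, and from Python 3.3 on the distinction is gone and sys.maxunicode is always 0x10FFFF.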

Uncle Bruce · May 3, 2009

I think I've figured it out!

What I was trying to do was to enter the literal strings directly into
the IDLE interpreter. The IDLE interpreter will not accept high
codepoints directly.

However, when I put a function definition in a separate file with high
codepoints, IDLE processes them just fine! display produced the
expected hex strings, and print displayed the correct characters.

Success!
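
For anyone landing here later, the "separate file" route can be sketched roughly as follows. This is an illustration under assumptions, not the code from the thread: the helper name load_source and the temporary file are made up, and it is written to run on Python 2.6+ and 3.3+. It writes a small UTF-8 source file with a coding declaration, then compiles the raw bytes so that the declaration is what decides how the literal is read.

```python
import io
import os
import tempfile

# Source text of the separate file; the inner literal ends up on disk
# as the real snowman character, not as an escape.
SOURCE = u'# -*- coding: utf-8 -*-\nsnowman = u"\u2603"\n'


def load_source(source_text):
    """Hypothetical helper: save the source as UTF-8, then compile and run it."""
    fd, path = tempfile.mkstemp(suffix='.py')
    os.close(fd)
    try:
        with io.open(path, 'w', encoding='utf-8') as handle:
            handle.write(source_text)
        with io.open(path, 'rb') as handle:
            raw = handle.read()
        namespace = {}
        # Compiling bytes makes Python honour the coding declaration.
        exec(compile(raw, path, 'exec'), namespace)
        return namespace
    finally:
        os.remove(path)


loaded = load_source(SOURCE)
snowman = loaded['snowman']
```

The literal comes back as a single code point, U+2603, which is the same result as importing the file as a module.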
