Use of Unicode in Python 2.5 source code literals


Uncle Bruce

I'm working with Python 2.5.4 and the NLTK (Natural Language
Toolkit). I'm an experienced programmer, but new to Python.

This question arose when I tried to create a literal in my source code
for a Unicode codepoint greater than 255. (I also posted this
question in the NLTK discussion group).

The Python HELP (at least for version 2.5.4) states:

+++++++
Python supports writing Unicode literals in any encoding, but you have
to declare the encoding being used. This is done by including a
special comment as either the first or second line of the source file:

#!/usr/bin/env python
# -*- coding: latin-1 -*-
++++++++++++

Based on some experimenting I've done, I suspect that the support for
Unicode literals in ANY encoding isn't really accurate. What seems to
happen is that there must be an 8-bit mapping between the set of
Unicode literals and what can be used as literals.

Even when I set Options / General / Default Source Encoding to UTF-8,
IDLE won't allow Unicode literals (e.g. characters copied and pasted
from the Windows Character Map program) to be used, even in a quoted
string, if they represent an ord value greater than 255.

I noticed, in researching this question, that Marc Andre Lemburg
stated, back in 2001, "Since Python source code is defined to be
ASCII..."

I'm writing code for linguistics (other than English), so I need
access to lots more characters. Most of the time, the characters come
from files, so no problem. But for some processing tasks, I simply
must be able to use "real" Unicode literals in the source code.
(Writing hex escape sequences in a complex regex would be a
nightmare).

Was this taken care of in the switch from Python 2.X to 3.X?

Is there a way to use more than 255 Unicode characters in source code
literals in Python 2.5.4?

Also, in the Windows version of Python, how can I tell if it was
compiled to support 16 bits of Unicode or 32 bits of Unicode?

Bruce in Toronto
 

Steven D'Aprano

Uncle said:
Based on some experimenting I've done, I suspect that the support for
Unicode literals in ANY encoding isn't really accurate. What seems to
happen is that there must be an 8-bit mapping between the set of Unicode
literals and what can be used as literals.

Even when I set Options / General / Default Source Encoding to UTF-8,
IDLE won't allow Unicode literals (e.g. characters copied and pasted
from the Windows Character Map program) to be used, even in a quoted
string, if they represent an ord value greater than 255.

When you say it "won't allow", what do you mean? That you can't paste
them into the document? Does it give an error? An exception at compile
time or runtime?

I assume you have included the coding line at the top of the file. Make
sure it says utf-8 and not latin-1.

# -*- coding: utf-8 -*-

This is especially important if you use a Windows text editor that puts a
Unicode BOM at the start of the file.

What happens if you use a different editor to insert the characters in
the file, and then open it in IDLE?

How are you writing the literals? As byte strings or unicode strings? E.g.

# filename = nonascii.py
theta = 'θ' # byte string, probably will lead to problems
sigma = u'Σ' # unicode, this is the Right Way
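
To make the difference concrete, here is a small sketch (not from the original post). It assumes the file is saved as UTF-8; the names theta_text and theta_bytes are just for illustration. Under Python 2, a plain '...' literal holds the raw bytes from the file, which is what the encode() call imitates here.

```python
# -*- coding: utf-8 -*-
# Sketch: one character as text versus the same character as UTF-8 bytes.

theta_text = u'θ'                          # unicode string: one code point, U+03B8
theta_bytes = theta_text.encode('utf-8')   # the bytes a UTF-8 source file stores for it

print(len(theta_text))     # 1
print(len(theta_bytes))    # 2

# Decoding with the right encoding recovers the original character.
print(theta_bytes.decode('utf-8') == theta_text)   # True
```

The byte string has length 2, so indexing, slicing, and regex character classes on it would operate on half-characters, which is why the u'' form is the safer choice.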


Is there a way to use more than 255 Unicode characters in source code
literals in Python 2.5.4?

It works for me in Python 2.4 and 2.5, although I'm not using IDLE.
θ

Perhaps it is a problem with IDLE?
 

Matt Nordhoff

Works for me:

--- snip ---
$ cat snowman.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import unicodedata

snowman = u'☃'

print len(snowman)
print unicodedata.name(snowman)
$ python2.6 snowman.py
1
SNOWMAN
--- snip ---

What did you set the encoding to in the declaration at the top of the
file? The help text you quoted uses latin-1 as an example, an encoding
which, of course, only supports 256 code points. Did you try utf-8 instead?
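
A quick way to see why the declared encoding matters is to check whether a given character can be represented in it at all. This is only a sketch; the helper name fits is made up for illustration.

```python
# -*- coding: utf-8 -*-
# Sketch: test whether a character is representable in a given encoding.

snowman = u'☃'   # U+2603, well above 255


def fits(text, encoding):
    """Return True if every character of text can be encoded in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


print(fits(snowman, 'latin-1'))   # False
print(fits(snowman, 'utf-8'))     # True
print(fits(u'é', 'latin-1'))      # True
```

A source file declared as latin-1 therefore has no way to carry the snowman as a literal character, while a file declared (and actually saved) as utf-8 does.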

The regular expression engine's Unicode support is a different question,
and I do not know the answer.
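
For what it is worth, a minimal sketch along these lines may be a starting point. It assumes the pattern and the text are both unicode strings and that the file is saved as UTF-8; the sample text and the name greek_word are invented for the example, and this is not a statement about everything the re module supports.

```python
# -*- coding: utf-8 -*-
# Sketch: a regex written with literal non-ASCII characters instead of escapes.
import re

# Character range from GREEK SMALL LETTER ALPHA to OMEGA, written literally.
greek_word = re.compile(u'[α-ω]+', re.UNICODE)

sample = u'the word λογος appears here'
match = greek_word.search(sample)

print(len(match.group(0)))   # 5
```

Note that the range only covers the unaccented lowercase letters; accented forms sit at other code points and would need to be added to the character class.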

By the way, Python 2.x only supports using non-ASCII characters in
source code in string literals. Python 3 adds support for Unicode
identifiers (e.g. variable names, function argument names, etc.).
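
On the question of 16-bit versus 32-bit Unicode support: sys.maxunicode reports the largest code point the interpreter can hold in a single character, so a check like the following sketch should tell the two builds apart. The function name unicode_build is made up for illustration.

```python
# Sketch: distinguish a narrow (16-bit) from a wide (32-bit) Unicode build.
import sys


def unicode_build(maxunicode=None):
    """Describe the build based on sys.maxunicode (or a value passed in)."""
    if maxunicode is None:
        maxunicode = sys.maxunicode
    if maxunicode == 0xFFFF:
        return 'narrow build (16-bit code units)'
    return 'wide build (full range up to U+10FFFF)'


print(unicode_build())
```

On a narrow build, characters above U+FFFF are stored as surrogate pairs, so len() reports 2 for them. From Python 3.3 onward the distinction no longer exists and sys.maxunicode is always 0x10FFFF.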
 

Uncle Bruce

I think I've figured it out!

What I was trying to do was to enter the literal strings directly into
the IDLE interpreter. The IDLE interpreter will not accept high
codepoints directly.

However, when I put a function definition containing high codepoints
in a separate file, IDLE processes it just fine! Display produced the
expected hex strings, and print displayed the correct characters.

Success!
 
