Uncle Bruce
I'm working with Python 2.5.4 and the NLTK (Natural Language
Toolkit). I'm an experienced programmer, but new to Python.
This question arose when I tried to create a literal in my source code
for a Unicode codepoint greater than 255. (I also posted this
question in the NLTK discussion group).
The Python HELP (at least for version 2.5.4) states:
+++++++
Python supports writing Unicode literals in any encoding, but you have
to declare the encoding being used. This is done by including a
special comment as either the first or second line of the source file:
#!/usr/bin/env python
# -*- coding: latin-1 -*-
++++++++++++
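A minimal sketch of the declaration the quoted passage describes, using utf-8 rather than latin-1. It assumes the editor really saves the file as UTF-8 so that the bytes on disk match the declaration; the variable names are only illustrative.

```python
# -*- coding: utf-8 -*-
# Assumption: this file is saved as UTF-8, matching the declaration above.

# Cyrillic and Greek text; every character has a code point above 255.
word = u"язык"
greek = u"λόγος"

# Collect the code points to confirm they fall outside the 8-bit range.
code_points = [ord(ch) for ch in word]
above_255 = [cp > 255 for cp in code_points]
```

If the declared encoding and the saved encoding disagree, the literals may be decoded incorrectly or rejected, which could produce symptoms like the ones described here.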
Based on some experimenting I've done, I suspect that the claim of
support for Unicode literals in ANY encoding isn't really accurate.
What seems to happen is that a character can be used as a literal only
if it maps to an 8-bit value.
Even when I set Options / General / Default Source Encoding to UTF-8,
IDLE won't allow Unicode literals (e.g. characters copied and pasted
from the Windows Character Map program) to be used, even in a quoted
string, if they represent an ord value greater than 255.
I noticed, in researching this question, that Marc Andre Lemburg
stated, back in 2001, "Since Python source code is defined to be
ASCII..."
I'm writing code for linguistics (other than English), so I need
access to lots more characters. Most of the time, the characters come
from files, so no problem. But for some processing tasks, I simply
must be able to use "real" Unicode literals in the source code.
(Writing hex escape sequences in a complex regex would be a
nightmare).
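To illustrate the readability concern, here is a small sketch of the same character class written with literal characters and with \u escapes. The Russian sample text and the pattern names are hypothetical, and the file is assumed to be saved as UTF-8.

```python
# -*- coding: utf-8 -*-
import re

# The same character class written two ways:
# literal characters versus \u escape sequences.
literal_pattern = re.compile(u"[а-яё]+", re.UNICODE)
escaped_pattern = re.compile(u"[\u0430-\u044f\u0451]+", re.UNICODE)

sample = u"слово word ёлка"

# Both patterns should pick out the two Cyrillic words and skip "word".
literal_hits = literal_pattern.findall(sample)
escaped_hits = escaped_pattern.findall(sample)
```

Both patterns are meant to match identically; the difference is only in how legible the source is.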
Was this taken care of in the switch from Python 2.X to 3.X?
Is there a way to use Unicode characters with code points above 255
in source code literals in Python 2.5.4?
Also, in the Windows version of Python, how can I tell if it was
compiled to support 16 bits of Unicode or 32 bits of Unicode?
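One check that may answer this is sys.maxunicode, which reports the largest code point the interpreter supports. A short sketch follows; the build labels are just descriptive strings, not anything Python itself prints.

```python
import sys

# sys.maxunicode is 0xFFFF on a narrow (16-bit) build
# and 0x10FFFF on a wide (32-bit) build.
if sys.maxunicode == 0xFFFF:
    build = "narrow (16-bit)"
else:
    build = "wide (32-bit)"

print(build)
```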
Bruce in Toronto