Short questions wrt Python & Unicode

KvS

Hi all,

I've been reading about Unicode in general, and about using it in Python
in particular, since it turns out to be less straightforward than I
expected. I wanted to ask two questions:

1) I'm writing a program that interacts with the user through wxPython
(unicode build) and stores & retrieves data using PySQLite. As far as I
know now, both packages are capable of handling Python unicode objects
(wxPython returns the values of text controls etc. by default as Python
unicode objects and "TEXT" columns in PySQLite have unicode entries),
and since both interface with me through Python unicode objects, I
should be able to pass the unicode objects produced by one package to
the other's functions without any fear, right?

2) How do I get a representation of a unic. object in terms of Unicode
code points? repr() doesn't do that, it sometimes parses or encodes the
code points right:
u'@\u0166\xe6'

(does this latter \xe6 have to do with the internal representation of
unic. objects, maybe with this UCS-2 encoding?)

Thanks in advance!

- Kees
 
John Machin

KvS said:
2) How do I get a representation of a unic. object in terms of Unicode
code points? repr() doesn't do that, it sometimes parses or encodes the
code points right:

|>>> s=u"\u0040\u0166\u00e6"
|>>> s
u'@\u0166\xe6'

|>>> ' '.join('U+%04X' % ord(c) for c in s)
'U+0040 U+0166 U+00E6'

If you'd prefer it more Pythonic than unicode.orgic, adjust the format
string and separator to suit your taste.
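
As a rough illustration of adjusting the format string and separator, here is a minimal sketch. It assumes Python 3 (where every str is a Unicode string and the u prefix is optional); the helper name code_points and its parameters are made up for this example and are not part of any library.

```python
def code_points(s, fmt='U+%04X', sep=' '):
    """Format each character of s as its code point, joined by sep."""
    return sep.join(fmt % ord(c) for c in s)

s = '\u0040\u0166\u00e6'

# unicode.org style, as in the one-liner above
unicode_style = code_points(s)

# a more "Pythonic" hex style, with a different separator
python_style = code_points(s, fmt='0x%04x', sep=', ')

print(unicode_style)
print(python_style)
```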

KvS said:
(does this latter \xe6 have to do with the internal representation of
unic. objects, maybe with this UCS-2 encoding?)

|>>> u'\xe6' == u'\u00e6' == unichr(0xe6)
True
|>>> hex(ord(u'\u00e6'))
'0xe6'

U+nnnnnn is represented internally as the integer 0xnnnnnn -- except if
it won't fit, but you can pretend that surrogate pairs don't exist, for
the moment :)
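
For the curious, the "won't fit" case can be sketched as follows. This assumes Python 3, where ord() always reports the full code point; the surrogate pair only shows up once the character is encoded as UTF-16, which is how narrow (UCS-2) builds of Python 2 stored it. The character U+1D11E is just an arbitrary example from outside the 16-bit range.

```python
import struct

c = '\U0001D11E'  # a code point that does not fit in 16 bits

code_point = ord(c)

# Encode as big-endian UTF-16 and split into two 16-bit code units.
utf16 = c.encode('utf-16-be')
high, low = struct.unpack('>HH', utf16)

print(hex(code_point), hex(high), hex(low))
```

The two units fall in the ranges 0xD800-0xDBFF and 0xDC00-0xDFFF, which are reserved for surrogates and never stand for characters on their own.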

Cheers,
John
 
Fredrik Lundh

KvS said:
u'@\u0166\xe6'

(does this latter \xe6 have to do with the internal representation of
unic. objects, maybe with this UCS-2 encoding?)

no, it's simply the shortest way to represent U+00E6 as Python Unicode
string literal, when limited to ASCII only.
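
A small sketch of the same point, assuming Python 3: there repr() no longer escapes printable non-ASCII characters, but the built-in ascii() produces the ASCII-only form, using \x for code points below 0x100 and \u for larger ones in the 16-bit range.

```python
s = '\u0040\u0166\u00e6'

# ASCII-only representation of the string literal
escaped = ascii(s)
print(escaped)

# \xe6, \u00e6 and chr(0xe6) are three spellings of the same character
same = '\xe6' == '\u00e6' == chr(0xe6)
print(same)
```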

</F>
 
KvS

John said:
|>>> ' '.join('U+%04X' % ord(c) for c in s)
'U+0040 U+0166 U+00E6'

If you'd prefer it more Pythonic than unicode.orgic, adjust the format
string and separator to suit your taste.


|>>> u'\xe6' == u'\u00e6' == unichr(0xe6)
True
|>>> hex(ord(u'\u00e6'))
'0xe6'

U+nnnnnn is represented internally as the integer 0xnnnnnn -- except if
it won't fit, but you can pretend that surrogate pairs don't exist, for
the moment :)

Cheers,
John

Thanks to you and Fredrik! What about q1? I know it's silly, since for
integers, e.g., one doesn't give such an issue any thought at all; it's
just that this understanding of en-/decodings etc. makes things a bit
more blurry to me. It should be the case that a package may do
internally (en-/decoding etc.) whatever it wants to represent/manipulate
unicode strings, but should always communicate with the outside world
via the interchangeable & uniform Python unicode object, right?
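
One way to check this kind of round trip is a quick experiment like the sketch below. It makes some assumptions: it uses Python 3 and the standard-library sqlite3 module (which grew out of PySQLite) rather than PySQLite itself, the table name notes is made up, and the variable text merely stands in for a value that a wxPython text control would return, since wxPython is left out to keep the example self-contained.

```python
import sqlite3

# Stand-in for a Unicode string obtained from a GUI text control.
text = '@\u0166\u00e6'

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE notes (body TEXT)')

# Pass the Unicode string straight in; no manual encoding step.
conn.execute('INSERT INTO notes (body) VALUES (?)', (text,))

# Read it back; TEXT columns come back as Unicode strings.
(stored,) = conn.execute('SELECT body FROM notes').fetchone()
conn.close()

print(stored == text, type(stored).__name__)
```

If the value read back compares equal to the original, whatever encoding the database uses on disk stays hidden behind the Unicode object.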
 
