Short questions wrt Python & Unicode

KvS · Jun 9, 2006

Hi all,

I've been reading about Unicode in general, and about using it in Python in
particular, since this turns out to be less straightforward than I expected.
I wanted to ask two questions:

1) I'm writing a program that interacts with the user through wxPython
(unicode build) and stores & retrieves data using PySQLite. As far as I
know now, both packages are capable of handling Python unicode objects
(wxPython returns the values of text controls etc. by default as Python
unicode objects, and "TEXT" columns in PySQLite have unicode entries).
Since both interface with me through Python unicode objects, I should be
able to pass the unicode objects generated by one package to the functions
of the other without any fear, right?

2) How do I get a representation of a unic. object in terms of Unicode
code points? repr() doesn't do that, it sometimes parses or encodes the
code points right:
u'@\u0166\xe6'

(does this latter \xe6 have to do with the internal representation of
unic. objects, maybe with this UCS-2 encoding?)

Thanks in advance!

- Kees

John Machin · Jun 9, 2006

2) How do I get a representation of a unic. object in terms of Unicode
code points? repr() doesn't do that, it sometimes parses or encodes the
code points right:

|>>> s=u"\u0040\u0166\u00e6"
|>>> s
u'@\u0166\xe6'

|>>> ' '.join('U+%04X' % ord(c) for c in s)
'U+0040 U+0166 U+00E6'

If you'd prefer it more Pythonic than unicode.orgic, adjust the format
string and separator to suit your taste.
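
The one-liner above can be wrapped in a small helper so the format string and separator become parameters. This is only a sketch: the name code_points and its defaults are made up for illustration and are not part of any library.

```python
def code_points(text, fmt="U+%04X", sep=" "):
    """Format each character of `text` as its code point (hypothetical helper)."""
    # ord() gives the integer code point; fmt decides how it is displayed.
    return sep.join(fmt % ord(ch) for ch in text)


s = u"\u0040\u0166\u00e6"
print(code_points(s))                         # unicode.org style
print(code_points(s, fmt="0x%x", sep=", "))   # plain hex integers
```

With the defaults this should give the same 'U+0040 U+0166 U+00E6' as the interactive session; the second call is one possible "more Pythonic" variant.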

(does this latter \xe6 have to do with the internal representation of
unic. objects, maybe with this UCS-2 encoding?)

|>>> u'\xe6' == u'\u00e6' == unichr(0xe6)
True
|>>> hex(ord(u'\u00e6'))
'0xe6'

U+nnnnnn is represented internally as the integer 0xnnnnnn -- except if
it won't fit (on a narrow, UCS-2 build, code points above U+FFFF are stored
as surrogate pairs), but you can pretend that surrogate pairs don't exist,
for the moment.
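
For the curious, here is a rough way to see what a surrogate pair looks like without depending on how the interpreter was built: encode the character as UTF-16 and look at the 16-bit units. The helper name utf16_units is an assumption made for this sketch, and U+1D11E (MUSICAL SYMBOL G CLEF) is just a convenient character outside the 16-bit range.

```python
import struct


def utf16_units(ch):
    """Return the UTF-16 code units of `ch` as a list of ints (hypothetical helper)."""
    data = ch.encode("utf-16-be")        # big-endian, no byte order mark
    count = len(data) // 2               # each code unit is two bytes
    return list(struct.unpack(">%dH" % count, data))


print([hex(u) for u in utf16_units(u"\u00e6")])       # fits in one unit
print([hex(u) for u in utf16_units(u"\U0001D11E")])   # needs a surrogate pair
```

The first call yields the single unit 0xe6; the second yields the two units 0xd834 and 0xdd1e, which is the "won't fit" case.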

Cheers,
John

Fredrik Lundh · Jun 9, 2006

KvS said:
u'@\u0166\xe6'

(does this latter \xe6 have to do with the internal representation of
unic. objects, maybe with this UCS-2 encoding?)

no, it's simply the shortest way to represent U+00E6 as Python Unicode
string literal, when limited to ASCII only.
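
A quick way to see which escape form gets picked for which code point is the unicode_escape codec, which follows the same ASCII-only rules. This is a sketch, not a description of how repr() is implemented, and the name shortest_escape is invented for the example.

```python
def shortest_escape(ch):
    """Return the ASCII-only escaped form of `ch` (hypothetical helper)."""
    # unicode_escape produces ASCII bytes; decode them back to text for display.
    return ch.encode("unicode_escape").decode("ascii")


for ch in (u"@", u"\u00e6", u"\u0166", u"\U0001D11E"):
    print(shortest_escape(ch))
```

Printable ASCII comes out unchanged, code points below 0x100 get the two-digit \x form, code points below 0x10000 get the four-digit \u form, and anything larger gets the eight-digit \U form.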

</F>

KvS · Jun 9, 2006

John said:
|>>> ' '.join('U+%04X' % ord(c) for c in s)
'U+0040 U+0166 U+00E6'

If you'd prefer it more Pythonic than unicode.orgic, adjust the format
string and separator to suit your taste.

|>>> u'\xe6' == u'\u00e6' == unichr(0xe6)
True
|>>> hex(ord(u'\u00e6'))
'0xe6'

U+nnnnnn is represented internally as the integer 0xnnnnnn -- except if
it won't fit, but you can pretend that surrogate pairs don't exist, for
the moment

Cheers,
John

Thanks to you and Fredrik! What about q1? I know it's silly, since for
integers, for example, one doesn't give such an issue any thought at all;
it's just that all this en-/decoding business makes things a bit blurrier
to me. It should be the case that a package may do internally whatever it
wants (en-/decoding etc.) to represent and manipulate unicode strings, but
should always communicate with the outside world via the interchangeable &
uniform Python unicode object, right?
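
One way to check the database half of q1 is a simple round trip: store a unicode object in a TEXT column, read it back, and compare. The sketch below uses the standard library's sqlite3 module (the descendant of PySQLite) as a stand-in, with an in-memory database; the table and column names are made up, and the wxPython side is not exercised at all.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (body TEXT)")

original = u"@\u0166\u00e6"
# Parameter binding hands the unicode object to the driver as-is;
# any encoding to the database's storage format happens internally.
conn.execute("INSERT INTO notes (body) VALUES (?)", (original,))
(stored,) = conn.execute("SELECT body FROM notes").fetchone()
conn.close()

print(stored == original)              # same code points came back
print(type(stored) is type(original))  # and as the same Python type
```

If both lines print True, the value survived the trip unchanged, which is the behaviour the question hopes for from any package that talks to the outside world in unicode objects.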
