A bug for unicode strings in Python 2.4?

Thomas Moore · Jan 11, 2006

Hi:

Python 2.4.1 (#65, Mar 30 2005, 09:13:57) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> u = u'\u9019\u662f\u4e2d\u6587\u5b57\u4e32'
>>> u.split()
[u'\u9019\u662f\u4e2d\u6587\u5b57\u4e32']

I think u should get split.

--Frank

Neil Hodgson · Jan 11, 2006

Thomas Moore:

[u'\u9019\u662f\u4e2d\u6587\u5b57\u4e32']

I think u should get split.

Where do you think "這是中文字串" should be split and why?

Neil

Thomas Moore · Jan 11, 2006

Thomas Moore:

>>> u = u'\u9019\u662f\u4e2d\u6587\u5b57\u4e32'
>>> u.split()
[u'\u9019\u662f\u4e2d\u6587\u5b57\u4e32']

I think u should get split.

Where do you think "這是中文字串" should be split and why?

Shouldn't a unicode string be split character by character?

-Frank

Fredrik Lundh · Jan 11, 2006

Thomas said:
Python 2.4.1 (#65, Mar 30 2005, 09:13:57) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> u = u'\u9019\u662f\u4e2d\u6587\u5b57\u4e32'
>>> u.split()
[u'\u9019\u662f\u4e2d\u6587\u5b57\u4e32']

I think u should get split.

why? split splits on whitespace (basically unicode category Zs), and
there are no whitespace symbols in there:

>>> u = u'\u9019\u662f\u4e2d\u6587\u5b57\u4e32'
>>> [c.isspace() for c in u]
[False, False, False, False, False, False]

there's no universal "split on words in all languages" function in the
standard python library. You may be able to roll your own using the
information in http://www.unicode.org/reports/tr29/ plus functions
in the unicodedata module (which currently doesn't include the
BreakTest tables; patches are welcome). Or maybe google can
help you find an existing implementation.

</F>
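
The whitespace point above can be checked directly. The following is a minimal sketch, not part of the original posts: it inserts an ideographic space (U+3000, which is in category Zs) into the same string and shows that split() then does break it. The names text, spaced, category and parts are only illustrative.

```python
import unicodedata

text = u'\u9019\u662f\u4e2d\u6587\u5b57\u4e32'

# No character in text is whitespace, so split() returns one item.
assert text.split() == [text]

# Insert an ideographic space (U+3000) after the second character.
spaced = text[:2] + u'\u3000' + text[2:]

category = unicodedata.category(u'\u3000')  # 'Zs' (space separator)
parts = spaced.split()                      # two pieces: 2 and 4 characters

print(category)
print(len(parts))
```

So split() is behaving as documented here: it only breaks where the string itself contains a whitespace character.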

Thomas Moore · Jan 11, 2006

Hi:

Thanks. I'll write my own split().

Frank

Szabolcs Nagy · Jan 11, 2006

Thanks. I'll write my own split().

do you want to split character by character?
then use
list(u'\u9019\u662f\u4e2d\u6587\u5b57\u4e32')
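
A small sketch of both options, in case it helps. The first part is just the list() call above. The second is a hypothetical helper (split_words_and_ideographs is a made-up name, not a library function) that splits on whitespace and then emits each CJK unified ideograph as its own item. It relies on the character names in unicodedata and is only a rough approximation, not an implementation of the TR29 word-break rules.

```python
import unicodedata

text = u'\u9019\u662f\u4e2d\u6587\u5b57\u4e32'

# Option 1: one item per character.
chars = list(text)
assert len(chars) == 6


def split_words_and_ideographs(s):
    """Hypothetical helper: whitespace-split, then isolate CJK ideographs."""
    pieces = []
    for word in s.split():
        run = u''  # accumulates consecutive non-ideograph characters
        for ch in word:
            if unicodedata.name(ch, u'').startswith('CJK UNIFIED IDEOGRAPH'):
                if run:
                    pieces.append(run)
                    run = u''
                pieces.append(ch)
            else:
                run += ch
        if run:
            pieces.append(run)
    return pieces


mixed = split_words_and_ideographs(u'abc \u4e2d\u6587 def')
# mixed == [u'abc', u'\u4e2d', u'\u6587', u'def']
```

For a string made only of ideographs, the helper gives the same result as list(); the difference only shows up for mixed text, where runs of other characters stay together.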
