A bug for unicode strings in Python 2.4?


Neil Hodgson

Thomas Moore:
[u'\u9019\u662f\u4e2d\u6587\u5b57\u4e32']


I think u should get split.

Where do you think "這是中文字串" should be split and why?

Neil
 

Fredrik Lundh

Thomas said:
Python 2.4.1 (#65, Mar 30 2005, 09:13:57) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> u=u'\u9019\u662f\u4e2d\u6587\u5b57\u4e32'
>>> u.split()
[u'\u9019\u662f\u4e2d\u6587\u5b57\u4e32']

I think u should get split.

why? split() with no arguments splits on whitespace (basically the
characters for which isspace() is true, such as unicode category Zs),
and there are no whitespace characters in there:
>>> u=u'\u9019\u662f\u4e2d\u6587\u5b57\u4e32'
>>> [c.isspace() for c in u]
[False, False, False, False, False, False]
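
As a quick check of the point above, here is a minimal sketch that looks up the category of each character and then splits a variant of the string that does contain whitespace. The variant with U+3000 (IDEOGRAPHIC SPACE) is made up for illustration and does not come from the original post.

```python
import unicodedata

u = u'\u9019\u662f\u4e2d\u6587\u5b57\u4e32'

# Each character is a letter (category 'Lo'), so split() has nothing to split on.
categories = [unicodedata.category(c) for c in u]

# Illustrative variant: the same text with U+3000 IDEOGRAPHIC SPACE
# (category 'Zs') inserted after the second character.
spaced = u'\u9019\u662f\u3000\u4e2d\u6587\u5b57\u4e32'
space_category = unicodedata.category(u'\u3000')

# With a whitespace character present, split() returns two pieces.
parts = spaced.split()
```

Here `parts` holds two strings, one on each side of the ideographic space.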

there's no universal "split on words in all languages" function in the
standard Python library. You may be able to roll your own using the
word boundary rules in http://www.unicode.org/reports/tr29/ plus functions
in the unicodedata module (which currently doesn't include the
BreakTest tables; patches are welcome). Or maybe Google can
help you find an existing implementation.

</F>
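
For anyone who wants to roll their own, a rough sketch of one possible starting point follows. It is not an implementation of TR29 and not anything posted in this thread: it splits on whitespace and additionally treats every CJK unified ideograph as a token of its own. The names `is_cjk_ideograph` and `crude_tokens` are hypothetical, and using the character name as the ideograph test is an assumption made to keep the example short. Real Chinese word segmentation generally needs a dictionary.

```python
import unicodedata

def is_cjk_ideograph(ch):
    # Assumption: the Unicode character name is a good enough test here.
    return unicodedata.name(ch, u'').startswith('CJK UNIFIED IDEOGRAPH')

def crude_tokens(text):
    """Split on whitespace; emit each CJK ideograph as a separate token."""
    tokens = []
    current = []
    for ch in text:
        if ch.isspace() or is_cjk_ideograph(ch):
            # Either case ends the run of non-CJK characters collected so far.
            if current:
                tokens.append(u''.join(current))
                current = []
            if not ch.isspace():
                tokens.append(ch)
        else:
            current.append(ch)
    if current:
        tokens.append(u''.join(current))
    return tokens
```

Called on mixed text such as `u'abc \u4e2d\u6587 def'`, this keeps `abc` and `def` whole and returns the two ideographs as separate items.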
 

Szabolcs Nagy

Thanks. I'll write my own split().

Do you want to split character by character?
Then use:
list(u'\u9019\u662f\u4e2d\u6587\u5b57\u4e32')
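
A small sketch of what that gives, assuming the same string as above. The helper `split_chars` is hypothetical and only shows how to skip whitespace while splitting character by character.

```python
u = u'\u9019\u662f\u4e2d\u6587\u5b57\u4e32'

# list() iterates over the string and yields one-character strings.
chars = list(u)

# Hypothetical helper: one item per character, with whitespace dropped.
def split_chars(text):
    return [c for c in text if not c.isspace()]

# Joining the pieces gives back the original string.
rejoined = u''.join(chars)
```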
 
