Problem processing Chinese

Anthony Liu

I believe Chinese text processing has been discussed here
before, but I could not find the information I need in the
mailing list archive.

My Python script reads some Chinese text and then
splits each line on whitespace. I get lists
like

['\xbc\xc7\xd5\xdf', '\xd0\xbb\xbd\xf0\xbb\xa2',
'\xa1\xa2']

I had

#-*- coding: gbk -*-

on top of the script.

My Windows 2000 system's default language is Chinese
(GB2312) and displays Chinese perfectly.

I don't know how to configure Python, or what else I
need, to process such two-byte-character text properly.

Thanks.







Peter Otten

Suppose you have a file with the following contents:
'\xbc\xc7\xd5\xdf \xd0\xbb\xbd\xf0\xbb\xa2 \xa1\xa2'

Then it's best to open it via codecs -- of course you have to know the
encoding:
u'\u8bb0\u8005 \u8c22\u91d1\u864e \u3001'
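The thread dates from the Python 2 era. As a rough Python 3 sketch of the same step, assuming the file really is GBK-encoded: the built-in open() with an encoding argument stands in for codecs.open here, and the file name chinese.txt, the temporary directory, and the helper read_text are made up for illustration.

```python
import os
import tempfile

# Sample bytes from the post, assumed to be GBK-encoded.
RAW = b"\xbc\xc7\xd5\xdf \xd0\xbb\xbd\xf0\xbb\xa2 \xa1\xa2"


def read_text(path, encoding="gbk"):
    """Read a whole file and decode it to str using the given encoding."""
    with open(path, encoding=encoding) as f:
        return f.read()


# Write the sample to a throwaway file (hypothetical name) and read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "chinese.txt")
    with open(path, "wb") as f:
        f.write(RAW)
    text = read_text(path)

# ascii() shows the escaped form regardless of the console's encoding.
print(ascii(text))
```

If the encoding passed in does not match the file, decoding either raises UnicodeDecodeError or silently produces the wrong characters, so the encoding has to be known in advance.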

This may still look strange to you but it's the unicode string's repr().
If sys.stdout.encoding is properly set on your system you can just print it:
记者 谢金虎 、

If that fails, provide the encoding explicitly:
记者 谢金虎 、
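One way to provide the encoding explicitly, sketched in Python 3. The helper name to_console_bytes and the fallback to UTF-8 are assumptions for illustration, not something from the thread.

```python
import sys


def to_console_bytes(text, encoding=None):
    """Encode text for output; unencodable characters become '?'."""
    # Fall back to UTF-8 when stdout reports no encoding (an assumption).
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "utf-8"
    return text.encode(encoding, errors="replace")


text = "\u8bb0\u8005 \u8c22\u91d1\u864e \u3001"  # 记者 谢金虎 、

# Encode explicitly as GBK; these are the bytes a GBK console would receive.
data = to_console_bytes(text, "gbk")
print(data)
```

In a real script the resulting bytes could be written to sys.stdout.buffer, which bypasses whatever encoding stdout would otherwise apply.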

Because now you are in unicode all further operations are performed on
characters rather than bytes. Processing Chinese is no longer more
difficult than any language that confines itself to plain ASCII.
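The difference between bytes and characters can be seen in a small Python 3 sketch, again assuming GBK input; the variable names are illustrative only.

```python
raw = b"\xbc\xc7\xd5\xdf"   # two characters stored as four GBK bytes
text = raw.decode("gbk")    # the same two characters as a str

byte_count = len(raw)       # counts bytes
char_count = len(text)      # counts characters
first_char = text[0]        # a whole character, not half of one

print(byte_count, char_count, ascii(first_char))
```

Indexing or slicing the raw bytes could cut a character in half, whereas the decoded string can only be cut between characters.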
But if you split your text into a list
[u'\u8bb0\u8005', u'\u8c22\u91d1\u864e', u'\u3001']

you probably think you are back to square one. That is because Python prints
the repr() of the list items (otherwise a comma would give the impression
that the list contains more items than it actually does). To get the actual
characters, choose an item explicitly
items = u.split()
print items[0]
记者

or convert the entire list to a string of your liking, e.g.:
print u"[%s]" % u", ".join(items)
[记者, 谢金虎, 、]
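The same split-and-join steps as a Python 3 sketch. In Python 3, repr() no longer escapes printable non-ASCII characters, so ascii() is used here to reproduce the escaped form; the variable names are illustrative.

```python
text = "\u8bb0\u8005 \u8c22\u91d1\u864e \u3001"  # 记者 谢金虎 、

items = text.split()                   # split on whitespace
list_repr = ascii(items)               # escaped form of the whole list
formatted = "[%s]" % ", ".join(items)  # the characters themselves, joined

print(list_repr)
print(ascii(formatted))                # escaped so any console can show it
```

The list itself is the same in both cases; only the way it is displayed differs.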

Peter
 
