Problem processing Chinese

Discussion in 'Python' started by Anthony Liu, Oct 14, 2005.

  1. Anthony Liu

    Anthony Liu Guest

    I believe that a topic related to Chinese processing was
    discussed before, but I could not dig the info I want
    out of the mailing list archive.

    My Python script reads some Chinese text and then
    splits each line on whitespace. I get lists
    like

    ['\xbc\xc7\xd5\xdf', '\xd0\xbb\xbd\xf0\xbb\xa2',
    '\xa1\xa2']

    I had

    #-*- coding: gbk -*-

    on top of the script.

    My Windows 2000 system's default language is Chinese
    (GB2312) and displays Chinese perfectly.

    I don't know how to configure python or what else I
    need to properly process such two-byte-character text.

    Thanks.

    Anthony Liu, Oct 14, 2005

  2. Peter Otten

    Peter Otten Guest

    Anthony Liu wrote:

    > I believe that a topic related to Chinese processing was
    > discussed before, but I could not dig the info I want
    > out of the mailing list archive.
    >
    > My Python script reads some Chinese text and then
    > splits each line on whitespace. I get lists
    > like
    >
    > ['\xbc\xc7\xd5\xdf', '\xd0\xbb\xbd\xf0\xbb\xa2',
    > '\xa1\xa2']
    >
    > I had
    >
    > #-*- coding: gbk -*-
    >
    > on top of the script.
    >
    > My Windows 2000 system's default language is Chinese
    > (GB2312) and displays Chinese perfectly.
    >
    > I don't know how to configure python or what else I
    > need to properly process such two-byte-character text.
    >
    > Thanks.


    Suppose you have a file with the following contents:

    >>> file("chinese.txt").read()

    '\xbc\xc7\xd5\xdf \xd0\xbb\xbd\xf0\xbb\xa2 \xa1\xa2'

    Then it's best to open it via codecs -- of course you have to know the
    encoding:

    >>> import codecs
    >>> codecs.open("chinese.txt", "r", "gbk").read()

    u'\u8bb0\u8005 \u8c22\u91d1\u864e \u3001'

    This may still look strange to you but it's the unicode string's repr().
    If sys.stdout.encoding is properly set on your system you can just print it:

    >>> u = codecs.open("chinese.txt", "r", "gbk").read()
    >>> print u

    记者 谢金虎 、

    If that fails, provide the encoding explicitly:

    >>> print u.encode("utf-8") # probably "gbk" instead of "utf-8" on your system

    记者 谢金虎 、
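
For readers on Python 3, where `str` is already Unicode and `open()` accepts an `encoding` argument, the same decode-then-encode round trip might look roughly like the sketch below. It is not part of the original reply: the file name `chinese.txt` and the sample bytes come from the example above, while the scratch directory and the helper name `read_gbk` are assumptions made for illustration.

```python
import os
import tempfile

# The GBK-encoded bytes from the example above.
raw = b'\xbc\xc7\xd5\xdf \xd0\xbb\xbd\xf0\xbb\xa2 \xa1\xa2'


def read_gbk(path):
    """Read a GBK-encoded text file and return a Unicode str (hypothetical helper)."""
    with open(path, "r", encoding="gbk") as f:
        return f.read()


# Write the sample bytes to a scratch file so the sketch is self-contained.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "chinese.txt")
with open(path, "wb") as f:
    f.write(raw)

text = read_gbk(path)               # decoded: a str of characters, not bytes
utf8_bytes = text.encode("utf-8")   # explicit encoding for a UTF-8 consumer
gbk_bytes = text.encode("gbk")      # explicit encoding for a GBK consumer

# The byte count and the character count differ once the text is decoded.
print(len(raw), len(text))
print(gbk_bytes == raw)
```

Here `len(raw)` counts bytes and `len(text)` counts characters, which is the practical difference between the byte string in the question and the decoded text.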

    Because you are now working with unicode, all further operations are
    performed on characters rather than bytes. Processing Chinese is no longer
    more difficult than processing a language that confines itself to plain ASCII.
    But if you split your text into a list

    >>> u.split()

    [u'\u8bb0\u8005', u'\u8c22\u91d1\u864e', u'\u3001']

    you probably think you are back to square one. That is because Python prints
    the repr() of the list items (otherwise a comma would give the impression
    that the list contains more items than it actually does). To get the actual
    characters, choose an item explicitly

    >>> items = u.split()
    >>> print items[0]

    记者

    or convert the entire list to a string of your liking, e.g.:

    >>> print u"[%s]" % u", ".join(items)

    [记者, 谢金虎, 、]
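
The split-and-join steps above can be sketched for Python 3 as well. This is only an illustrative sketch under the assumption that the text has already been decoded; `format_items` is a hypothetical helper name, and `ascii()` is used here to mimic the escaped output that Python 2's `repr()` gives for unicode strings.

```python
# The decoded text from the example above, written with escapes so the
# source file itself stays plain ASCII.
text = '\u8bb0\u8005 \u8c22\u91d1\u864e \u3001'

# split() works on characters, so multi-byte characters are never cut in half.
items = text.split()


def format_items(items):
    """Join items into a bracketed, comma-separated string (hypothetical helper)."""
    return "[%s]" % ", ".join(items)


formatted = format_items(items)

# ascii() escapes non-ASCII characters, similar to what Python 2 shows
# when it prints the repr() of a list of unicode strings.
escaped = ascii(items)
print(escaped)
print(len(items))
```

In Python 3, `print(items)` would show the characters directly, because `repr()` no longer escapes printable non-ASCII characters; the escaped form only appears when you ask for it with `ascii()`.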

    Peter
    Peter Otten, Oct 14, 2005