How to Split Chinese Character with backslash representation?

Discussion in 'Python' started by Wijaya Edward, Oct 27, 2006.

  1. Hi all,

    I was trying to split a string that
    represents Chinese characters, shown below:

    ['\xc5\xeb\xc7\xd5\xbc']

    But why doesn't the split function here seem
    to do the job of obtaining the desired result:

    ['\xc5','\xeb','\xc7','\xd5','\xbc']



    Regards,
    -- Edward WIJAYA
    SINGAPORE



    ------------ Institute For Infocomm Research - Disclaimer -------------
    This email is confidential and may be privileged. If you are not the intended recipient, please delete it and notify us immediately. Please do not copy or use it for any purpose, or disclose its contents to any other person. Thank you.
    --------------------------------------------------------
     
    Wijaya Edward, Oct 27, 2006
    #1

  2. Depends on what you want to do with them:
    print char


    Å
    ë
    Ç
    Õ
    ¼
    char


    '\xc5'
    '\xeb'
    '\xc7'
    '\xd5'
    '\xbc'

    print char


    Å
    ë
    Ç
    Õ
    ¼
    '\xeb\xc7'

    Basically, your characters are already separated into a list of
    characters; that's effectively what a string is (but with a few more
    methods applicable only to lists of characters, not to other lists).
     
    Cameron Walsh, Oct 27, 2006
    #2
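The point made in post #2, that a string is already a sequence of individual characters, can be sketched as follows. This is a minimal illustration and not the poster's original session: it assumes Python 3, where the thread's Python 2 `str` corresponds to `bytes`, and the names `data`, `single_bytes` and `middle` are made up for the example.

```python
# Python 3 sketch: the thread's Python 2 byte string becomes a bytes object.
data = b'\xc5\xeb\xc7\xd5\xbc'

# Iterating over bytes in Python 3 yields integers, so take one-byte slices
# to get single-byte strings like the ones shown in the thread.
single_bytes = [data[i:i + 1] for i in range(len(data))]
print(single_bytes)

# Slicing works as on any other sequence.
middle = data[1:3]
print(middle)
```

No call to split is needed to get at the individual bytes; indexing and slicing are enough.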

  3. Thanks, but my intention is to strictly use regex,
    since there are separators I need to treat as delimiters,
    especially in a case like this:
    ['\xc5', '\xeb', '\xc7', '\xd5', '\xbc', '-', '-', 'F', 'O', 'O', '-', '-', 'B', 'A', 'R']

    What we want as the output is this instead:
    ['\xc5', '\xeb', '\xc7', '\xd5', '\xbc', 'FOO', 'BAR']

    What's the best way to do it?

    -- Edward WIJAYA
    SINGAPORE

    Wijaya Edward, Oct 27, 2006
    #3
  4. If the case is very simple, why not just replace '-' with '', for example:

    str.replace('-', '')
     
    limodou, Oct 27, 2006
    #4
  5. Except he appears to want the Chinese characters as elements of the
    list, and English words as elements of the list. Note carefully the
    last two elements in his desired list. I'm still puzzling this one...
     
    Cameron Walsh, Oct 27, 2006
    #5
  6. ['\xc5', '\xeb', '\xc7', '\xd5', '\xbc', 'FOO', 'BAR']

    the RE matches either a sequence of latin characters, *or* a single
    non-ASCII character.

    you may want to adjust the character ranges to match the encoding you're
    using, and your definition of non-chinese words.

    </F>
     
    Fredrik Lundh, Oct 27, 2006
    #6
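The regular expression from post #6 did not survive in this copy of the thread, so the following is only a sketch of the approach it describes: match either a run of Latin letters or a single non-ASCII byte. It assumes Python 3 with a `bytes` pattern; the pattern itself and the names `data`, `pattern` and `tokens` are assumptions for illustration, not the poster's actual code.

```python
import re

# Sample input: five high bytes followed by dash-separated ASCII words.
data = b'\xc5\xeb\xc7\xd5\xbc--FOO--BAR'

# Either a run of ASCII letters, or exactly one byte in the range 0x80-0xff.
# The dashes match neither alternative, so findall simply skips them.
pattern = re.compile(rb'[A-Za-z]+|[\x80-\xff]')

tokens = pattern.findall(data)
print(tokens)
```

As the post notes, the byte range and the definition of a "word" would need adjusting for the encoding actually in use.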
  7. Oh, I see. I made a mistake.
     
    limodou, Oct 27, 2006
    #7
  8. There are no backslash characters in the string str, so split finds nothing
    to split on. I know it looks like there are, but the backslashes shown are
    part of the \x escape sequence for defining characters when you can't or
    don't want to use plain ASCII characters (such as in your example in which
    the characters are all in the range 0x80 to 0xff). Look at this example:
    @

    I defined s using the escaped \x notation, but s does not contain any
    backslashes, it contains the '@' character, whose ordinal character value is
    64, or 40hex.

    Also, str is not the best name for a string variable, since this masks the
    built-in str type.

    -- Paul
     
    Paul McGuire, Oct 27, 2006
    #8
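The explanation in post #8 can be checked directly. The sketch below assumes Python 3 and uses the names `s` and `raw`, which are chosen for this example only. It contrasts an ordinary string written with a \x escape against a raw string that really does contain a backslash.

```python
# '\x40' is an escape sequence for one character, the one with code 0x40.
s = '\x40'
print(s)
print(len(s), ord(s), hex(ord(s)))

# There is no backslash in s, so splitting on a backslash finds nothing.
print(s.split('\\'))

# A raw string keeps the backslash as a literal character.
raw = r'\x40'
print(len(raw), raw.split('\\'))
```

The same reasoning applies to the bytes 0xc5, 0xeb and so on in the original question: the backslashes exist only in the source-code spelling of the string.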
  9. Moreover, you are not splitting on a backslash; since you used a
    r'raw_string', you are in fact splitting on TWO backslashes. It looks
    like you want to treat str as a raw string to get at the slashes, but it
    isn't a raw string and I don't think you can directly convert it to one.
    If you want the numeric values of each byte, you can do the following:

    Py >>> char_values = [ ord(c) for c in str ]
    Py >>> char_values
    [ 197, 235, 199, 213, 188 ]
    Py >>>

    Note that those numbers are decimal equivalents of the hex values given
    in your string, but are now in integer format.

    On the other hand, you may want to use str.decode('gbk') (or whatever
    your encoding is) so that you're actually dealing with characters rather
    than bytes:

    Py >>> str.decode('gbk')

    Traceback (most recent call last):
      File "<pyshell#29>", line 1, in -toplevel-
        str.decode('gbk')
    UnicodeDecodeError: 'gbk' codec can't decode byte 0xbc in position 4:
    incomplete multibyte sequence
    Py >>> str[0:4].decode('gbk')
    u'\u70f9\u94a6'

    Py >>> print str[0:4].decode('gbk')
    烹钦
    Py >>> print str[0:4]
    ÅëÇÕ

    OK, so gbk choked on the odd character at the end. Maybe you need a
    different encoding, or maybe your string got truncated somewhere along
    the line....

    Cheers,
    Cliff
     
    J. Clifford Dyer, Oct 27, 2006
    #9
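The session in post #9 is Python 2. A rough Python 3 equivalent might look like the sketch below, which assumes the data is held as `bytes` and, like the post, only guesses that the encoding is GBK. The variable names are illustrative.

```python
data = b'\xc5\xeb\xc7\xd5\xbc'

# Numeric value of each byte; iterating bytes yields ints in Python 3.
char_values = list(data)
print(char_values)

# Decoding the whole string fails because the last byte is only the first
# half of a two-byte GBK sequence.
try:
    data.decode('gbk')
    decode_failed = False
except UnicodeDecodeError as exc:
    decode_failed = True
    print(exc)

# The first four bytes form two complete GBK characters.
first_two = data[:4].decode('gbk')
print(first_two)
```

Whether the trailing byte means the wrong codec was chosen or the data was truncated cannot be decided from the code alone, which matches the caveat at the end of the post.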