How to Split Chinese Character with backslash representation?

Wijaya Edward

Oct 26, 2006

#1

Hi all,

I was trying to split a string that
represents the Chinese characters below:

['\xc5\xeb\xc7\xd5\xbc']

But why doesn't the split function seem
to do the job of producing the desired result:

['\xc5','\xeb','\xc7','\xd5','\xbc']

Regards,
-- Edward WIJAYA
SINGAPORE

------------ Institute For Infocomm Research - Disclaimer -------------
This email is confidential and may be privileged. If you are not the intended recipient, please delete it and notify us immediately. Please do not copy or use it for any purpose, or disclose its contents to any other person. Thank you.
--------------------------------------------------------

Cameron Walsh

Oct 27, 2006

#2

Wijaya said:
Hi all,

I was trying to split a string that
represents the Chinese characters below:

['\xc5\xeb\xc7\xd5\xbc']

But why doesn't the split function seem
to do the job of producing the desired result:

['\xc5','\xeb','\xc7','\xd5','\xbc']

Depends on what you want to do with them:

>>> string = '\xc5\xeb\xc7\xd5\xbc'
>>> for char in string:
...     print char
...
Å
ë
Ç
Õ
¼
>>> list_of_characters = list(string)
>>> list_of_characters
['\xc5', '\xeb', '\xc7', '\xd5', '\xbc']
>>> for char in string:
...     char
...
'\xc5'
'\xeb'
'\xc7'
'\xd5'
'\xbc'
>>> for char in list_of_characters:
...     print char
...
Å
ë
Ç
Õ
¼
>>> string[3]
'\xd5'
>>> string[1:3]
'\xeb\xc7'

Basically, your characters are already separated into a list of
characters; that's effectively what a string is (but with a few more
methods applicable only to lists of characters, not to other lists).
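
For readers on Python 3 (an assumption; the session above is Python 2, where a str is a byte string), the same idea can be sketched with a bytes object. Iterating over bytes yields integers in Python 3, so one-byte slices are used here to get single-byte pieces. The helper name split_bytes is made up for this illustration.

```python
def split_bytes(data):
    """Return a list of one-byte bytes objects (hypothetical helper)."""
    # Slicing keeps the bytes type; plain indexing (data[i]) would give an int.
    return [data[i:i + 1] for i in range(len(data))]


data = b'\xc5\xeb\xc7\xd5\xbc'
pieces = split_bytes(data)

print(pieces)       # one element per byte
print(data[3:4])    # a one-byte slice, the Python 3 analogue of string[3]
print(data[1:3])    # a two-byte slice, as in string[1:3]
```

No call to split is involved: the byte string is already an indexable sequence, so it only needs to be walked or sliced.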

Wijaya Edward

Oct 27, 2006

#3

Thanks, but my intention is to strictly use regex,
since there are separators I need to treat as delimiters,
especially in a case like this:
['\xc5', '\xeb', '\xc7', '\xd5', '\xbc', '-', '-', 'F', 'O', 'O', '-', '-', 'B', 'A', 'R']

What we want as the output is this instead:
['\xc5', '\xeb', '\xc7', '\xd5', '\xbc', 'FOO', 'BAR']

What's the best way to do it?

-- Edward WIJAYA
SINGAPORE


limodou

Oct 27, 2006

#4

Wijaya said:
Thanks, but my intention is to strictly use regex,
since there are separators I need to treat as delimiters,
especially in a case like this:
['\xc5', '\xeb', '\xc7', '\xd5', '\xbc', '-', '-', 'F', 'O', 'O', '-', '-', 'B', 'A', 'R']

What we want as the output is this instead:
['\xc5', '\xeb', '\xc7', '\xd5', '\xbc', 'FOO', 'BAR']

What's the best way to do it?

If the case is very simple, why not just replace '-' with '', for example:

str.replace('-', '')

Cameron Walsh

Oct 27, 2006

#5

limodou said:
Wijaya said:
Thanks, but my intention is to strictly use regex,
since there are separators I need to treat as delimiters,
especially in a case like this:

str = '\xc5\xeb\xc7\xd5\xbc--FOO--BAR'
field = list(str)
print field
['\xc5', '\xeb', '\xc7', '\xd5', '\xbc', '-', '-', 'F', 'O', 'O', '-',
'-', 'B', 'A', 'R']

What we want as the output is this instead:
['\xc5', '\xeb', '\xc7', '\xd5', '\xbc', 'FOO', 'BAR']

What's the best way to do it?

If the case is very simple, why not just replace '-' with '', for example:

str.replace('-', '')

Except he appears to want the Chinese characters as elements of the
list, and English words as elements of the list. Note carefully the
last two elements in his desired list. I'm still puzzling this one...

Fredrik Lundh

Oct 27, 2006

#6

Wijaya said:
Since there are separators I need to treat as delimiters,
especially in a case like this:
['\xc5', '\xeb', '\xc7', '\xd5', '\xbc', '-', '-', 'F', 'O', 'O', '-', '-', 'B', 'A', 'R']

What we want as the output is this instead:
['\xc5', '\xeb', '\xc7', '\xd5', '\xbc', 'FOO', 'BAR']

>>> import re
>>> s = '\xc5\xeb\xc7\xd5\xbc--FOO--BAR'
>>> re.findall("(?i)[a-z]+|[\xA0-\xFF]", s)
['\xc5', '\xeb', '\xc7', '\xd5', '\xbc', 'FOO', 'BAR']

the RE matches either a sequence of latin characters, *or* a single
non-ASCII character.

you may want to adjust the character ranges to match the encoding you're
using, and your definition of non-chinese words.
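
A minimal sketch of the same pattern for Python 3, assuming the data arrives as a bytes object (the thread itself uses Python 2 byte strings). The function name split_tokens is hypothetical, and the 0xA0 to 0xFF range is simply carried over from the pattern in this post; it is not a general definition of "Chinese" bytes.

```python
import re

# Runs of ASCII letters (case-insensitive), or a single byte in 0xA0-0xFF.
# Anything else, such as the '-' separators, matches neither alternative
# and is skipped by findall.
TOKEN_RE = re.compile(rb"(?i)[a-z]+|[\xA0-\xFF]")


def split_tokens(data):
    """Return the letter runs and high bytes found in data (hypothetical helper)."""
    return TOKEN_RE.findall(data)


s = b'\xc5\xeb\xc7\xd5\xbc--FOO--BAR'
tokens = split_tokens(s)
print(tokens)
```

Because findall collects matches instead of splitting on a delimiter, the separators never need to be named in the pattern at all.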

</F>

limodou

Oct 27, 2006

#7

Cameron Walsh said:
Except he appears to want the Chinese characters as elements of the
list, and English words as elements of the list. Note carefully the
last two elements in his desired list. I'm still puzzling this one...

Oh, I see. I made a mistake.

Paul McGuire

Oct 27, 2006

#8

Wijaya Edward said:
Hi all,

I was trying to split a string that
represents the Chinese characters below:

['\xc5\xeb\xc7\xd5\xbc']

But why doesn't the split function seem
to do the job of producing the desired result:

['\xc5','\xeb','\xc7','\xd5','\xbc']

There are no backslash characters in the string str, so split finds nothing
to split on. I know it looks like there are, but the backslashes shown are
part of the \x escape sequence for defining characters when you can't or
don't want to use plain ASCII characters (such as in your example in which
the characters are all in the range 0x80 to 0xff). Look at this example:

>>> s = '\x40'
>>> print s
@

I defined s using the escaped \x notation, but s does not contain any
backslashes, it contains the '@' character, whose ordinal character value is
64, or 40hex.
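
To make the distinction concrete, here is a small sketch comparing an escaped literal with a raw one. It is written for Python 3 (an assumption; the thread uses Python 2, where the escape rules for these literals are the same), and the variable names are illustrative only.

```python
escaped = '\x40'    # escape sequence: a single character, '@'
raw = r'\x40'       # raw string: four characters, starting with a real backslash

print(len(escaped), escaped == '@', '\\' in escaped)
print(len(raw), '\\' in raw)

# Splitting on a backslash only has an effect when one is really present.
escaped_parts = escaped.split('\\')
raw_parts = raw.split('\\')
print(escaped_parts)
print(raw_parts)
```

The escaped string comes back from split unchanged as a one-element list, which is the behaviour the original question ran into.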

Also, str is not the best name for a string variable, since this masks the
built-in str type.

-- Paul

J. Clifford Dyer

Oct 27, 2006

#9

Paul said:
There are no backslash characters in the string str, so split finds nothing
to split on. I know it looks like there are, but the backslashes shown are
part of the \x escape sequence for defining characters when you can't or
don't want to use plain ASCII characters (such as in your example in which
the characters are all in the range 0x80 to 0xff).

Moreover, you are not splitting on a backslash; since you used a
r'raw_string', you are in fact splitting on TWO backslashes. It looks
like you want to treat str as a raw string to get at the slashes, but it
isn't a raw string and I don't think you can directly convert it to one.
If you want the numeric values of each byte, you can do the following:

Py >>> char_values = [ ord(c) for c in str ]
Py >>> char_values
[ 197, 235, 199, 213, 188 ]
Py >>>

Note that those numbers are decimal equivalents of the hex values given
in your string, but are now in integer format.
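
As a side note for Python 3 users (an assumption; the session above is Python 2), iterating over a bytes object already yields integers, so ord is not needed. A short sketch:

```python
data = b'\xc5\xeb\xc7\xd5\xbc'

# In Python 3 each element of a bytes object is already an int.
char_values = list(data)
print(char_values)

# Formatting the integers back to two-digit hex shows they are the same
# values that appear after each \x in the literal.
hex_values = [format(value, '02x') for value in data]
print(hex_values)
```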

On the other hand, you may want to use str.decode('gbk') (or whatever
your encoding is) so that you're actually dealing with characters rather
than bytes:

Py >>> str.decode('gbk')

Traceback (most recent call last):
  File "<pyshell#29>", line 1, in -toplevel-
    str.decode('gbk')
UnicodeDecodeError: 'gbk' codec can't decode byte 0xbc in position 4:
incomplete multibyte sequence
Py >>> str[0:4].decode('gbk')
u'\u70f9\u94a6'

Py >>> print str[0:4].decode('gbk')
烹钦
Py >>> print str[0:4]
ÅëÇÕ

OK, so gbk choked on the odd character at the end. Maybe you need a
different encoding, or maybe your string got truncated somewhere along
the line....
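
A rough Python 3 equivalent of that experiment, assuming (as above) that the bytes are GBK; whether that is really the poster's encoding is not known. Decoding first means each list element is a character, not half of a two-byte sequence.

```python
data = b'\xc5\xeb\xc7\xd5\xbc'

# The full string fails because the final byte 0xbc starts a two-byte
# GBK sequence that is never completed.
try:
    text = data.decode('gbk')
    error = None
except UnicodeDecodeError as exc:
    text = None
    error = exc
    print('decode failed:', exc.reason)

# The first four bytes form two complete two-byte characters.
prefix = data[:4].decode('gbk')
characters = list(prefix)
print(characters)
```

Splitting the decoded text gives two characters where splitting the raw bytes gave four pieces, which is worth keeping in mind when deciding what "one Chinese character" should mean in the regex.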

Cheers,
Cliff
