help needed with regex and unicode

Pradnyesh Sawant · Mar 4, 2008

Hi all,
I have a file which contains Chinese characters. I just want to find
all the places where these Chinese characters occur.

The following script doesn't seem to work:

**********************************************************************
import re

class RemCh(object):
    def __init__(self, fName):
        self.pattern = re.compile(r'[\u2F00-\u2FDF]+')
        fp = open(fName, 'r')
        content = fp.read()
        s = re.search('[\u2F00-\u2fdf]', content, re.U)
        if s:
            print s.group(0)

if __name__ == '__main__':
    rc = RemCh('/home/pradnyesh/removeChinese/delFolder.php')
**********************************************************************

The PHP file content is something like the following:

**********************************************************************
// Check if the folder still has subscribed blogs
$subCount = function1($param1, $param2);
if ($subCount > 0) {
$errors['summary'] = 'Ã¦ÂÃ¯Â½Â Ã¦Â½Ã¥Â¤æ¤Ã¥Ã¯Â«Ã¥Ã©Ã©Â§Ã§Â²Ã¨';
$errorMessage = 'Ã¦ÂÃ¯Â½Â Ã¦Â½Ã¥Â¤æ¤Ã¥Ã¯Â«Ã¥Ã©Ã©Â§Ã§Â²Ã¨';
}

if (empty($errors)) {
$ret = function2($blog_res, $yuid, $fid);
if ($ret >= 0) {
$saveFalg = TRUE;
} else {
error_log("ERROR:: ret: $ret, function1($param1, $param2)");
$errors['summary'] = "Ã¦ÂÃ¯Â½Â Ã¦Â½Ã¥Â¤æ¤Ã¨Â±Ã¥Ã£
$errorMessage = "Ã¦ÂÃ¯Â½Â Ã¦Â½Ã¥Â¤æ¤Ã¨Â±Ã¥Ã£
}
}
**********************************************************************

--
warm regards,
Pradnyesh Sawant
--
Luck is the residue of good design. --Anon


Marc 'BlackJack' Rintsch · Mar 4, 2008

Looks like a UTF-8 encoded file viewed as ISO-8859-1. So you should
decode `content` to unicode before searching for the Chinese characters.
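
A minimal sketch of that effect, written for Python 3 rather than the Python 2 used in the thread. The sample text is an assumption for illustration, not the poster's actual data. It shows how UTF-8 bytes turn into accented-Latin garbage when read as ISO-8859-1, and how decoding with the right codec restores them:

```python
# Python 3 sketch; the sample text below is an assumption, not the poster's data.
text = "我是美国人"

raw = text.encode("utf-8")        # the bytes as they would be stored on disk
misread = raw.decode("latin-1")   # what a viewer shows if it assumes ISO-8859-1
recovered = raw.decode("utf-8")   # decoding with the correct codec restores the text

print(len(raw), "bytes on disk")
print(repr(misread))              # one garbage character per byte
print(recovered)
```

Each Chinese character here takes three bytes in UTF-8, so the misread version is three times as long as the real text.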

Ciao,
Marc 'BlackJack' Rintsch

Mark Tolonen · Mar 4, 2008

I couldn't get your data to decode into anything resembling Chinese, so I
created my own file as an example. If reading an encoded text file, it
comes in as just a bunch of bytes:
ï»¿æˆ‘æ˜¯ç¾Žå›½äººã€‚ WÇ’ shÃ¬ MÄ›iguÃ³rÃ©n. I am an American.

Garbage, because the encoding isn't known. Provide the correct encoding and
decode it to Unicode:
我是美国人。 Wǒ shì Měiguórén. I am an American.

Here's the Unicode string. Note the 'u' before the quotes to indicate
Unicode.
u'\ufeff\u6211\u662f\u7f8e\u56fd\u4eba\u3002 W\u01d2 sh\xec M\u011bigu\xf3r\xe9n. I am an American.'
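
The leading \ufeff in that string is the byte order mark (BOM) stored at the start of the file. As a hedged sketch in Python 3 (not the Python 2 of this thread), the `utf-8-sig` codec decodes the file and drops the BOM in one step. The temporary file and the sample sentence are assumptions made for the example:

```python
import codecs
import os
import tempfile

# Assumed sample text, mirroring the example sentence in this reply.
sample = "我是美国人。 Wǒ shì Měiguórén. I am an American."

# Write a UTF-8 file that starts with a BOM, as some editors do.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as fp:
    fp.write(codecs.BOM_UTF8 + sample.encode("utf-8"))

with open(path, "rb") as fp:
    raw = fp.read()                     # undecoded bytes, BOM included
with open(path, encoding="utf-8") as fp:
    with_bom = fp.read()                # decoded, but the BOM survives as U+FEFF
with open(path, encoding="utf-8-sig") as fp:
    without_bom = fp.read()             # decoded, BOM stripped

os.remove(path)
print(repr(with_bom[0]))
print(without_bom)
```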

If working with Unicode strings, the re module should be provided Unicode
strings also:

>>> print re.search(ur'[\u4E00-\u9FA5]',s).group(0)
我
>>> print re.findall(ur'[\u4E00-\u9FA5]',s)
[u'\u6211', u'\u662f', u'\u7f8e', u'\u56fd', u'\u4eba']
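
Since the original question asked for all the places where the characters occur, here is a rough Python 3 sketch that reports line and column for each run of ideographs. The helper name `find_cjk` and the sample PHP text are hypothetical. It reuses the \u4E00-\u9FA5 range from above; the \u2F00-\u2FDF range in the original script is the Kangxi Radicals block, which ordinary Chinese text rarely uses.

```python
import re

# Same range as in the examples above: the main CJK Unified Ideographs block.
CJK_RUN = re.compile(r"[\u4e00-\u9fa5]+")

def find_cjk(text):
    """Return (line_number, column, run) for each run of CJK ideographs.

    Line numbers start at 1; columns are 0-based offsets within the line.
    """
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in CJK_RUN.finditer(line):
            hits.append((lineno, match.start(), match.group(0)))
    return hits

# Hypothetical, already-decoded file content used only for illustration.
php = "$errors['summary'] = '我是美国人。';\n$ok = TRUE;\n$errorMessage = '美国';"
for hit in find_cjk(php):
    print(hit)
```

The text passed in must already be decoded (a `str` in Python 3), for example by opening the file with `encoding="utf-8"`. The full-width period is outside the range, so it is not part of the reported run.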

Hope that helps you.

--Mark
