help needed with regex and unicode


Pradnyesh Sawant

Hi all,
I have a file that contains Chinese characters. I just want to find
all the places where these Chinese characters occur.

The following script doesn't seem to work :(

**********************************************************************
import re

class RemCh(object):
    def __init__(self, fName):
        self.pattern = re.compile(r'[\u2F00-\u2FDF]+')
        fp = open(fName, 'r')
        content = fp.read()
        s = re.search('[\u2F00-\u2fdf]', content, re.U)
        if s:
            print s.group(0)

if __name__ == '__main__':
    rc = RemCh('/home/pradnyesh/removeChinese/delFolder.php')
**********************************************************************

The PHP file content is something like the following:

**********************************************************************
// Check if the folder still has subscribed blogs
$subCount = function1($param1, $param2);
if ($subCount > 0) {
$errors['summary'] = 'æ­ï½ æ½å¤此åï«åéé§ç²è';
$errorMessage = 'æ­ï½ æ½å¤此åï«åéé§ç²è';
}

if (empty($errors)) {
$ret = function2($blog_res, $yuid, $fid);
if ($ret >= 0) {
$saveFalg = TRUE;
} else {
error_log("ERROR:: ret: $ret, function1($param1, $param2)");
$errors['summary'] = "æ­ï½ æ½å¤此è±åã
$errorMessage = "æ­ï½ æ½å¤此è±åã
}
}
**********************************************************************

--
warm regards,
Pradnyesh Sawant
--
Luck is the residue of good design. --Anon

 

Marc 'BlackJack' Rintsch

Looks like a UTF-8 encoded file viewed as ISO-8859-1. So you should
decode `content` to unicode before searching for the Chinese characters.
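
A minimal sketch of that diagnosis, written for Python 3 (the thread itself uses Python 2). The sample text is hypothetical, since the original file's content cannot be recovered from the post, and the range U+4E00 to U+9FFF (the CJK Unified Ideographs block) is an assumption about which characters count as Chinese here. It shows that UTF-8 bytes read as ISO-8859-1 turn into one junk character per byte, so a pattern for Chinese characters cannot match until the bytes are decoded with the right codec.

```python
import re

# Hypothetical sample text standing in for the file's real content.
text = "我是美国人"
raw = text.encode("utf-8")            # the bytes as they sit on disk

# Reading UTF-8 bytes as ISO-8859-1 produces one junk character per byte.
mojibake = raw.decode("iso-8859-1")

# Decoding with the correct codec restores the original characters.
decoded = raw.decode("utf-8")

# Assumed range: the CJK Unified Ideographs block.
cjk = re.compile("[\u4e00-\u9fff]")

print(len(mojibake), len(decoded))
print(cjk.search(mojibake) is not None, cjk.search(decoded) is not None)
```

The mis-decoded string is three times as long as the real one because each of these characters takes three bytes in UTF-8.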

Ciao,
Marc 'BlackJack' Rintsch
 

Mark Tolonen

I couldn't get your data to decode into anything resembling Chinese, so I
created my own file as an example. If reading an encoded text file, it comes
in as just a bunch of bytes:
我是美国人。 Wǒ shì Měiguórén. I am an American.

Garbage, because the encoding isn't known. Provide the correct encoding and
decode it to Unicode:
我是美国人。 Wǒ shì Měiguórén. I am an American.

Here's the Unicode string. Note the 'u' before the quotes to indicate
Unicode.
u'\ufeff\u6211\u662f\u7f8e\u56fd\u4eba\u3002 W\u01d2 sh\xec M\u011bigu\xf3r\xe9n. I am an American.'
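
The leading `\ufeff` in that string is a byte order mark (BOM) that was stored at the start of the file. A hedged Python 3 sketch of reading such a file follows; the sample line and the temporary file are made up for illustration. The `utf-8` codec keeps the BOM as the first character, while `utf-8-sig` strips it.

```python
import codecs
import os
import tempfile

# Hypothetical sample line; not the content of the file in the thread.
sample = "我是美国人。 I am an American."

# Write a UTF-8 file that begins with a BOM.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as fp:
    fp.write(codecs.BOM_UTF8 + sample.encode("utf-8"))

# Plain "utf-8" decodes the BOM into a leading U+FEFF character.
with open(path, encoding="utf-8") as fp:
    with_bom = fp.read()

# "utf-8-sig" decodes the same bytes but drops the BOM.
with open(path, encoding="utf-8-sig") as fp:
    without_bom = fp.read()

os.remove(path)

print(repr(with_bom[0]))
print(without_bom == sample)
```

If the BOM is left in, it does not affect a search for Chinese characters, but it does shift every reported position by one.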

If working with Unicode strings, the re module should be provided Unicode
strings also:
>>> print re.search(ur'[\u4E00-\u9FA5]',s).group(0)
我
>>> print re.findall(ur'[\u4E00-\u9FA5]',s)
[u'\u6211', u'\u662f', u'\u7f8e', u'\u56fd', u'\u4eba']
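
The original question asked for all the places where the characters occur, which `re.finditer` can report. The following is a sketch in Python 3, not code from the thread: `find_cjk` and the sample PHP-like text are hypothetical, and the range U+4E00 to U+9FFF (the CJK Unified Ideographs block) is an assumption. The range in the original script, U+2F00 to U+2FDF, is the Kangxi Radicals block, which ordinary Chinese text rarely uses, so it would find little even after decoding.

```python
import re

# Assumed range: the CJK Unified Ideographs block (U+4E00 to U+9FFF).
CJK_RUN = re.compile("[\u4e00-\u9fff]+")

def find_cjk(text):
    """Return (line number, column, matched run) for each run of CJK ideographs.

    Line numbers start at 1; columns are 0-based character offsets in the line.
    """
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for m in CJK_RUN.finditer(line):
            hits.append((line_no, m.start(), m.group(0)))
    return hits

# Hypothetical, already-decoded sample resembling a PHP source file.
sample = "$a = '我是';\n$b = 'abc';\n$c = \"美国人\";"

for hit in find_cjk(sample):
    print(hit)
```

The text passed to `find_cjk` must already be decoded; passing raw bytes or a mis-decoded string gives no matches.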

Hope that helps you.

--Mark
 
