Shiao
The regex below identifies words in all languages I tested, but not in
Hindi:
# -*- coding: utf-8 -*-
import re
pat = re.compile(r'^(\w+)$', re.U)
langs = ('English', '中文', 'हिन्दी')
for l in langs:
    m = pat.search(l.decode('utf-8'))
    print l, m and m.group(1)
Output:
English English
中文 中文
हिन्दी None
From this I assumed that the Hindi text contains punctuation or other
characters that prevent the word match. What puzzles me even more is
this:
pat = re.compile(r'^(\W+)$', re.U) # note: now \W
for l in langs:
    m = pat.search(l.decode('utf-8'))
    print l, m and m.group(1)
Output:
English None
中文 None
हिन्दी None
How can the Hindi text be both not a word and "not not a word"?
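A diagnostic sketch that may help narrow this down: the snippet below lists
each character of the Hindi string with its code point, its Unicode general
category (from the standard unicodedata module), and whether a
single-character \w pattern matches it. The names describe and word_char are
made up for this illustration, and it uses print() so that it should run on
Python 3 as well as Python 2. Categories starting with "L" are letters;
those starting with "M" are combining marks.

```python
# -*- coding: utf-8 -*-
import re
import unicodedata

# The Hindi string from the post, written with escapes to avoid
# any dependence on the source file encoding.
word = u'\u0939\u093f\u0928\u094d\u0926\u0940'

# Hypothetical helper pattern: tests one character at a time.
word_char = re.compile(r'\w', re.U)


def describe(text):
    """Return (code point, Unicode category, matched by \\w) per character."""
    rows = []
    for ch in text:
        rows.append(('U+%04X' % ord(ch),
                     unicodedata.category(ch),
                     bool(word_char.match(ch))))
    return rows


for row in describe(word):
    print(row)
```

If some characters come out as letters that \w accepts and others as marks
that it rejects, then neither ^(\w+)$ nor ^(\W+)$ can cover the whole
string, which would account for both None results.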
Any clue would be much appreciated!
Best.