Unicode: matching a word and unaccenting characters

Jeremie Le Hen · Nov 14, 2007

Hi list,

(Please Cc: me when replying, as I'm not subscribed to this list.)

I'm working with Unicode strings to handle accented characters but I'm
experiencing a few problems.

The first one is with regular expressions. I want to match a word
composed of letters only. One can easily use '[a-zA-Z]+' when
working in ASCII, but unfortunately there is no equivalent when working
with unicode strings: that character class doesn't match accented
characters. The only means the re package provides is '\w' along with
the re.UNICODE flag, but unfortunately it also matches digits and
underscores. It appears there is no suitable solution for this
currently. Am I right?
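
One workaround that may fit here is to subtract digits and the underscore from '\w' with a negated class, '[^\W\d_]'. The sketch below assumes Python 3, where str patterns are Unicode-aware by default (on Python 2 the same pattern would need unicode strings plus the re.UNICODE flag), and assumes the text is in precomposed (NFC) form. The names LETTERS_RE and find_words are made up for illustration. It is only an approximation of "letters only": a few non-decimal numeric characters that '\w' accepts may still get through.

```python
import re

# [^\W\d_] reads as "not (non-word, or digit, or underscore)", which leaves
# word characters minus decimal digits and '_'.
# re.UNICODE is already the default for str patterns in Python 3; it is
# spelled out here only to mirror the flag mentioned in the post.
LETTERS_RE = re.compile(r'[^\W\d_]+', re.UNICODE)


def find_words(text):
    """Return every run of letter-like characters found in text."""
    return LETTERS_RE.findall(text)


# \u00e9 is a precomposed 'e' with an acute accent.
sample = "J\u00e9r\u00e9mie a 2 v\u00e9los_rouges"
print(find_words(sample))
```

With this pattern the digit and the underscore act as separators, so "vélos_rouges" comes back as two words.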

Secondly, I need to translate accented characters to their unaccented
form. I've written this function (sorry if the code isn't as efficient
as possible, I'm not a long-time Python programmer; feel free to correct
me, I'd be glad to learn anything):

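% import re
% import types
% import unicodedata
%
% def unaccent(s):
%     """Map each character to the lowercased single letter in its Unicode name."""
%     # Python 2: leave non-unicode input untouched.
%     if not isinstance(s, types.UnicodeType):
%         return s
%     # A standalone capital letter in the name, e.g. the 'E' in
%     # 'LATIN SMALL LETTER E WITH ACUTE'.
%     singleletter_re = re.compile(r'(?:^|\s)([A-Z])(?:$|\s)')
%     result = ''
%     for l in s:
%         # Default to '' so unnamed characters don't raise ValueError.
%         desc = unicodedata.name(l, '')
%         m = singleletter_re.search(desc)
%         if m is None:
%             result += str(l)
%             continue
%         result += m.group(1).lower()
%     return result
%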

But I don't feel comfortable with it. It strongly depends on the UCD
file format, and characters whose names don't contain a standalone
letter obviously cannot all be converted to ASCII. How would you
implement this function?
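
One alternative that avoids parsing character names is to decompose the string with unicodedata.normalize and drop the combining marks. The sketch below assumes Python 3 str input; it reuses the name unaccent only for comparison with the function above and is not a drop-in replacement. Unlike the original it preserves case, and characters with no decomposition (for example 'ø') are left unchanged rather than mapped to ASCII.

```python
import unicodedata


def unaccent(s):
    """Strip combining marks after compatibility decomposition (NFKD)."""
    # NFKD splits a precomposed character into its base character followed
    # by combining marks, e.g. U+00E9 becomes 'e' + U+0301.
    decomposed = unicodedata.normalize('NFKD', s)
    # unicodedata.combining() returns 0 for non-combining characters, so
    # this keeps the base characters and drops the accents.
    return ''.join(c for c in decomposed if not unicodedata.combining(c))


# \u00e9 = e acute, \u00e0 = a grave, \u00eb = e diaeresis
print(unaccent("\u00e9t\u00e9 \u00e0 No\u00ebl"))
```

Because NFKD is a compatibility decomposition, it also rewrites things like ligatures into their plain-letter equivalents; NFD could be used instead if only canonical decompositions are wanted.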

Thank you for your help.
Regards,
