unicode categories -- regex

koara · Sep 22, 2007

Hello all, my question concerns the special metacharacters of the re
module. The re module documentation says that '\w' matches any
alphanumeric Unicode character. However, it gives no information on
constructing patterns for other Unicode categories, such as purely
alphabetic characters or punctuation symbols.
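For reference, here is a minimal sketch of what '\w' does and does not give you. It assumes Python 3 (not the 2.5 discussed in this thread), where str patterns are Unicode-aware without the re.UNICODE flag; the sample text and variable names are made up for illustration. The `[^\W\d_]` class is a common workaround, not an exact "letters only" class, since it can still admit a few numeric characters that '\d' does not cover.

```python
import re

# Illustrative sample text (an assumption, not from the thread).
text = "na\u00efve caf\u00e9 123 _x"

# \w matches letters, digits and the underscore, from any script.
words = re.findall(r"\w+", text)
print(words)

# \w alone cannot separate letters from digits. A common workaround is
# "word characters that are neither digits nor underscore".
letters_only = re.findall(r"[^\W\d_]+", text)
print(letters_only)
```

Neither pattern can express a category such as punctuation, which is the gap the question is about.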

I found that this category information actually IS available in Python,
in the standard module unicodedata. For example,
unicodedata.category(u'.') gives 'Po', meaning 'Punctuation, other'.
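A short sketch of that lookup, assuming Python 3 (where every str is Unicode and the u'' prefix is optional). The sample characters and the `categories` name are chosen only for illustration.

```python
import unicodedata

# A few ASCII characters whose general categories are easy to check.
samples = [".", "#", "a", "A", "7", " ", "-", "("]

# Map each character to its two-letter Unicode general category.
categories = {ch: unicodedata.category(ch) for ch in samples}

for ch, cat in categories.items():
    print(repr(ch), cat)
```

The first letter of each code is the major class (L letter, N number, P punctuation, Z separator) and the second letter is the subclass.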

So how do I use this information in a regular expression search? Any
ideas? Thanks.

I'm talking about Python 2.5 here.

Martin v. Löwis · Sep 22, 2007

> So how do i include this information in regular pattern search? Any
> ideas?

At the moment, you have to generate a character class for this yourself,
e.g.

py> import re, sys, unicodedata
py> chars = [unichr(i) for i in range(sys.maxunicode + 1)]
py> chars = [c for c in chars if unicodedata.category(c)=='Po']
py> expr = u'[\\' + u'\\'.join(chars)+u"]"
py> expr = re.compile(expr)
py> expr.match(u"#")
<_sre.SRE_Match object at 0xb7ce1d40>
py> expr.match(u"a")
py> expr.match(u"\u05be")
<_sre.SRE_Match object at 0xb7ce1d78>

Creating this expression is fairly expensive; however, once compiled,
it has a compact representation in memory, and matching it is
efficient.
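The same idea can be sketched for Python 3 as follows. This is an illustration under stated assumptions, not part of the re module: `category_class` and `PUNCT_OTHER` are hypothetical names, `chr` stands in for Python 2's `unichr`, and consecutive code points are collapsed into ranges so that the pattern text stays short. The class is compiled once at import time so the expensive scan is not repeated.

```python
import re
import sys
import unicodedata


def category_class(*wanted):
    """Build a regex character class for the given general categories.

    Hypothetical helper: scans every code point once and collapses
    consecutive matches into first-last ranges.
    """
    runs = []  # list of [first, last] code point pairs
    for cp in range(sys.maxunicode + 1):
        if unicodedata.category(chr(cp)) in wanted:
            if runs and runs[-1][1] == cp - 1:
                runs[-1][1] = cp          # extend the current run
            else:
                runs.append([cp, cp])     # start a new run
    if not runs:
        raise ValueError("no code points in categories %r" % (wanted,))
    parts = []
    for first, last in runs:
        if first == last:
            parts.append(re.escape(chr(first)))
        else:
            parts.append(re.escape(chr(first)) + "-" + re.escape(chr(last)))
    return "[" + "".join(parts) + "]"


# Build and compile once; reuse the compiled object for every search.
PUNCT_OTHER = re.compile(category_class("Po"))

print(bool(PUNCT_OTHER.match("#")))   # '#' is in category Po
print(bool(PUNCT_OTHER.match("a")))   # 'a' is a letter, so no match
```

Several categories can be combined in one class, for example `category_class("Po", "Pd")`. Which characters fall into a category depends on the Unicode database version shipped with the interpreter, so results for individual characters may differ between Python releases.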

Contributions to support categories directly in re are welcome. Look
at the relevant Unicode recommendation (UTS #18, Unicode Regular
Expressions) on how to do that.

HTH,
Martin

koara · Sep 22, 2007

> At the moment, you have to generate a character class for this yourself,
> e.g.
> ...

Thank you, Martin, this is exactly what I wanted to know.
