unicode categories -- regex


koara

Hello all -- my question is about the special meta characters in the re
module. The re module documentation says that '\w' can be used to match
any alphanumeric unicode character. However, there was no info on
constructing patterns for other unicode categories, such as purely
alphabetical characters or punctuation symbols.

I found that this category information actually IS available in python
-- in the standard module unicodedata. For example,
unicodedata.category(u'.') gives 'Po' for 'Punctuation, other' etc.
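
As a small illustration of the lookup described above, here is a minimal sketch that prints the general category of a few sample characters. It assumes a Python 3 interpreter (where every string is unicode and the u'' prefix is optional) rather than the Python 2.5 discussed in this thread; the names `samples` and `categories` are only illustrative.

```python
import unicodedata

# A few sample characters; the list itself is arbitrary.
samples = [".", "#", "a", "A", "1", " ", "-"]

# Map each character to its two-letter Unicode general category.
categories = {ch: unicodedata.category(ch) for ch in samples}

for ch in samples:
    print(repr(ch), categories[ch])
```

The first letter of each code is the major class (L for letters, P for punctuation, N for numbers, Z for separators) and the second letter is the subclass.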

So how do I use this information in a regular expression pattern? Any
ideas? Thanks.


I'm talking about Python 2.5 here.

"Martin v. Löwis"

> So how do i include this information in regular pattern search? Any
> ideas?

At the moment, you have to generate a character class for this yourself,
e.g.

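py> import re, sys, unicodedata
py> # every code point (sys.maxunicode is 0xFFFF on narrow builds)
py> chars = [unichr(i) for i in range(sys.maxunicode + 1)]
py> # keep only 'Punctuation, other'
py> chars = [c for c in chars if unicodedata.category(c) == 'Po']
py> # backslash-escape each character inside the class
py> expr = u'[\\' + u'\\'.join(chars) + u']'
py> expr = re.compile(expr)
py> expr.match(u"#")
<_sre.SRE_Match object at 0xb7ce1d40>
py> expr.match(u"a")
py> expr.match(u"\u05be")
<_sre.SRE_Match object at 0xb7ce1d78>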

Creating this expression is fairly expensive. Once compiled, however,
it has a compact representation in memory, and matching it is
efficient.
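
Since the cost is paid when the class is built, one option is to wrap the construction in a cached helper. The following is only a sketch under stated assumptions: it targets Python 3 (chr instead of unichr, functools.lru_cache for caching) rather than Python 2.5, and `category_class` is a hypothetical name, not something the re module provides.

```python
import re
import sys
import unicodedata
from functools import lru_cache


@lru_cache(maxsize=None)
def category_class(*categories):
    """Compile a character class matching one code point in any given category.

    Hypothetical helper, not part of the re module. The result is cached,
    so the expensive scan over all code points runs once per category set.
    """
    wanted = frozenset(categories)
    chars = [
        chr(i)
        for i in range(sys.maxunicode + 1)
        if unicodedata.category(chr(i)) in wanted
    ]
    if not chars:
        raise ValueError("no code points in categories: %r" % (categories,))
    # re.escape protects characters that are special inside [...],
    # such as ']', '^', '-' and the backslash itself.
    return re.compile("[" + re.escape("".join(chars)) + "]")


punct = category_class("Po")
print(bool(punct.match("#")))  # '#' has category 'Po'
print(bool(punct.match("a")))  # 'a' has category 'Ll'
```

Which characters fall into a category depends on the Unicode database version shipped with the interpreter, so results for individual code points can differ between Python releases.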

Contributions to support categories directly in re are welcome. Look
at the relevant Unicode recommendation on how to do that.

HTH,
Martin
 
