Enabling the use of POSIX character classes in Python

Perry Johnson

Python's re module does not support POSIX character classes, for
example [:alpha:]. It is, of course, trivial to simulate them using
character ranges when the text to be matched uses the ASCII character
set. Sadly, my problem is that I need to process Unicode text. The re
module has its own character classes that do support Unicode; however,
they are not sufficient.

I would find it extremely useful if there were information on the
Unicode code points that map to each of the POSIX character classes.
 
Martin v. Loewis

On 11.12.2010 18:33, Perry Johnson wrote:
> Python's re module does not support POSIX character classes, for
> example [:alpha:]. It is, of course, trivial to simulate them using
> character ranges when the text to be matched uses the ASCII character
> set. Sadly, my problem is that I need to process Unicode text. The re
> module has its own character classes that do support Unicode; however,
> they are not sufficient.
>
> I would find it extremely useful if there were information on the
> Unicode code points that map to each of the POSIX character classes.

By definition, this is not possible. The POSIX character classes are
locale-dependent, whereas the recommendation for Unicode regular
expressions is that they are not (i.e., a Unicode regex character class
should refer to the same characters independent of the locale).

If you want to construct locale-dependent Unicode character classes,
you should use this procedure:
- iterate over all byte values (0..255)
- perform the relevant locale-specific tests
- decode each byte into Unicode, using the locale's encoding
- construct a character class out of that

Unfortunately, that will work only for single-byte encodings.
I'm not aware of a procedure that does this for multi-byte encodings.
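
The steps above could be sketched roughly as follows in Python 3. This is
only an illustration under some assumptions: build_char_class and
libc_byte_test are hypothetical names, and the locale-specific test is
passed in as a function because the bytes methods in Python 3 (such as
bytes.isalpha) only know about ASCII. The libc_byte_test helper assumes a
POSIX system where ctypes can find the C library, and that
locale.setlocale(locale.LC_ALL, "") has already been called.

```python
import ctypes
import ctypes.util
import re


def build_char_class(byte_test, encoding):
    """Build a regex character class from a per-byte test.

    byte_test: callable taking an int in 0..255 and returning a truthy
               value if that byte belongs to the class.
    encoding:  the single-byte encoding used to decode each byte.
    """
    chars = []
    for value in range(256):              # iterate over all byte values
        if not byte_test(value):          # perform the (locale-specific) test
            continue
        try:
            # decode the byte into Unicode using the given encoding
            chars.append(bytes([value]).decode(encoding))
        except UnicodeDecodeError:
            continue                      # byte is unassigned in this encoding
    if not chars:
        raise ValueError("no byte passed the test")
    # construct a character class out of the collected characters
    return "[" + "".join(re.escape(c) for c in chars) + "]"


def libc_byte_test(name):
    """Hypothetical helper: wrap a C library test such as 'isalpha'.

    Assumption: POSIX system, libc reachable through ctypes, and the
    process locale already set with locale.setlocale().
    """
    libc = ctypes.CDLL(ctypes.util.find_library("c"))
    func = getattr(libc, name)
    return lambda value: bool(func(value))


# Portable stand-in for the locale test: ASCII letters only.
ascii_alpha = build_char_class(lambda b: bytes([b]).isalpha(), "latin-1")
print(re.findall(ascii_alpha + "+", "abc 123 def"))  # ['abc', 'def']
```

With a real locale, one might call
build_char_class(libc_byte_test("isalpha"), locale.getpreferredencoding())
instead. Under a UTF-8 locale this would most likely yield only the ASCII
letters, which is the single-byte limitation in practice.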

But perhaps you didn't mean "POSIX character class" in this literal
way.

Regards,
Martin
 
MRAB

Have a look at the new regex implementation on PyPI:

http://pypi.python.org/pypi/regex
 
Perry Johnson

This is exactly what I needed! Thanks!
 
