Enabling the use of POSIX character classes in Python

Perry Johnson · Dec 11, 2010

Python's re module does not support POSIX character classes such as
[:alpha:]. It is, of course, trivial to simulate them using
character ranges when the text to be matched uses the ASCII character
set. Sadly, my problem is that I need to process Unicode text. The re
module has its own character classes that do support Unicode, but
they are not sufficient.

I would find it extremely useful if there were information on which
Unicode code points map to each of the POSIX character classes.

Martin v. Loewis · Dec 11, 2010

On 11.12.2010 18:33, Perry Johnson wrote:

Python's re module does not support POSIX character classes, for
example [:alpha:]. It is, of course, trivial to simulate them using
character ranges when the text to be matched uses the ASCII character
set. Sadly, my problem is that I need to process Unicode text. The re
module has its own character classes that do support Unicode, however
they are not sufficient.

I would find it extremely useful if there was information on the
Unicode code points that map to each of the POSIX character classes.

By definition, this is not possible. The POSIX character classes are
locale-dependent, whereas the recommendation for Unicode regular
expressions is that character classes are not (i.e. a Unicode regex
character class should refer to the same characters regardless of the
locale).
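
As a small illustration of locale-independent matching: in Python 3 the
standard re module's \w is Unicode-aware for str patterns, whatever the
locale. The sketch below uses the common idiom [^\W\d_] as an
approximation of "letters". It is not an exact equivalent of [:alpha:]
(it also admits a few non-decimal numeric characters, for instance), and
the sample text is made up.

```python
import re

# \w is Unicode-aware for str patterns in Python 3, independent of the locale.
# Removing decimal digits (\d) and the underscore leaves roughly "letters".
letters = re.compile(r"[^\W\d_]+")

sample = "naïve café 123"  # made-up sample text
words = letters.findall(sample)
print(words)
```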

If you want to construct locale-dependent Unicode character classes,
you should use this procedure:
- iterate over all byte values (0..255)
- apply the relevant locale-specific test (e.g. isalpha()) to each byte
- decode each byte that passes into Unicode, using the locale's encoding
- construct a character class out of the resulting characters

Unfortunately, that will work only for single-byte encodings.
I'm not aware of a procedure that does the same for multi-byte encodings.
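
The procedure above could be sketched roughly as follows for a
single-byte encoding. The names build_char_class, is_member and
stand_in_isalpha are hypothetical, not part of any library. In Python 3,
bytes.isalpha() only knows ASCII, so a real locale-specific test would
have to come from elsewhere (one option, stated here as an assumption,
is wrapping the C library's isalpha() via ctypes). The stand-in below
consults Python's Unicode database instead, purely so the sketch runs
anywhere.

```python
import re


def build_char_class(is_member, encoding):
    """Return a regex character class for a single-byte encoding.

    is_member(byte_value) is a hypothetical hook for the locale-specific
    test (for example a wrapper around the C library's isalpha()).
    """
    chars = []
    for value in range(256):                 # iterate over all byte values
        if not is_member(value):             # locale-specific test
            continue
        try:
            chars.append(bytes([value]).decode(encoding))
        except UnicodeDecodeError:
            continue                         # byte is unassigned in this encoding
    if not chars:
        raise ValueError("no byte passed the test; the class would be empty")
    return "[" + "".join(re.escape(c) for c in chars) + "]"


def stand_in_isalpha(value):
    # Stand-in only: asks Python's Unicode database, not the C locale.
    return bytes([value]).decode("latin-1").isalpha()


alpha_class = build_char_class(stand_in_isalpha, "latin-1")
alpha = re.compile(alpha_class)
```

With these assumptions, alpha matches single characters such as "A" or
"é" but not "1". As noted, the approach cannot cover multi-byte
encodings such as UTF-8, because it only ever looks at one byte at a time.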

But perhaps you didn't mean "POSIX character class" in this literal
way.

Regards,
Martin

MRAB · Dec 11, 2010

Have a look at the new regex implementation on PyPI:

http://pypi.python.org/pypi/regex

Perry Johnson · Dec 11, 2010

This is exactly what I needed! Thanks!
