Request for Feedback; a module making it easier to use regular expressions.

Kenneth McDonald

I'm working on the 0.8 release of my 'rex' module, and would appreciate
feedback, suggestions, and criticism as I work towards finalizing the
API and feature sets. rex is a module intended to make regular expressions
easier to create and use (and in my experience as a regular expression
user, it makes them MUCH easier to create and use.)

I'm still working on formal documentation, and in any case, such
documentation isn't necessarily the easiest way to learn rex. So, I've
appended below a rex interactive session which was then annotated
with explanations of the features being used. I believe it presents a
reasonably good view of what rex can do. If you have time, please
read it, and then send your feedback via email. Unfortunately, I do not
currently have time to keep track of everything on comp.lang.python.

Thanks,
Ken McDonald

=============================================

What follows is an illustration by example of how the 'rex' module works, for those already knowledgeable about regular expressions as used in Python's 're' (or a similar) regular expression package. It consists of quick explanations of rex features, each followed by an interactive demo of that feature. You need to understand a couple of quick points to understand rex and the demo.

1) To distinguish between standard regular expressions as constructed by hand and used with the 're' package, and regular expressions constructed by and used in 'rex', I'll call the former 'regexps', and the latter 'rexps'.

2) The Rexp class, of which every rexp is an instance, is simply a subclass of Python's regular string class, with some modified functionality (for example, the __add__ method has been changed to modify the action of the '+' operation), and many more operators and methods. I'm not sure this was the wisest thing to do, but it sure helps when trying to relate rexps to regexps; just construct a rexp interactively or in a program and print it, and in either case you'll see the underlying string that is passed to the 're' module functions and methods.

On to the tutorial.

'rex' is designed to have few public names, so the easiest way to use
it is to import all the names:
>>> from rex import *

The most basic rex function is PATTERN, which simply takes a string or strings, and produces a rexp which will match exactly the argument strings when used to match or search text. As mentioned above, what you see printed as the result of executing PATTERN is the string that will be (invisibly) passed to 're' as a regexp string.
'abc'

If given more than one argument, PATTERN will concatenate them into a single rexp.
'abcd'

The other rex function which converts standard strings to rexps is CHARSET, which produces patterns which match a single character in searched text if that character is in a set of characters defined by the CHARSET operation. This is the equivalent of the regexp [...] notation. Every character in a string passed to CHARSET will end up in the resulting set of characters.
'[ab]'

If CHARSET is passed more than one string, all characters in all arguments are included in the result rexp.
'[abcd]'

If an argument to CHARSET is a two-tuple of characters, it is taken as indicating the range of characters between and including those two characters. This is the same as the regexp [a-z] type notation. For example, this defines a rexp matching any single consonant.
'[bcdfghj-np-tvwxz]'

When using CHARSET (or any other rexp operation), you do _not_ need to worry about escaping any characters which have special meanings in regexps; that is handled automatically. For example, in the following character set containing square brackets, a - sign, and a backslash, we have to escape the backslash only because it has a special meaning in normal Python strings. This could be avoided by using raw strings. The other three characters, which have special meaning in regexps, would have to be escaped if building this character set by hand.
'[\\[\\]\\-\\\\]'
The result above is what you'd need to type using re and regexps to directly define this character set. Think you can get it right the first time?
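
For comparison, here is a minimal sketch of building the same character set by hand with only the standard 're' module. It does not use rex; the helper name charset_from is hypothetical, and re.escape stands in for the automatic escaping described above.

```python
import re

def charset_from(chars):
    """Build a regexp character class matching any one character in `chars`.

    Hypothetical helper for illustration only; it is not part of rex.
    """
    return "[" + "".join(re.escape(ch) for ch in chars) + "]"

# Square brackets, a minus sign, and a backslash.
special = charset_from("[]-\\")
compiled = re.compile(special)

# Every listed character matches; an ordinary letter does not.
matches = [compiled.fullmatch(ch) is not None for ch in "[]-\\"]
rejects_letter = compiled.fullmatch("a") is None
```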

CHARSET provides a number of useful attributes defining commonly used character sets. Some of these are defined using special sequences defined in regexp syntax, others are defined as standard character sets. In all cases, the common factor is that CHARSET attributes all define patterns matching a _single_ character. Here are a few examples:
'[~`!@#$%\\^&*()_\\-+={\\[}\\]|\\\\:;"\'<,>.?/]'

Character sets can be negated using the '~' operator. Here is a rexp which matches anything _except_ a digit.
'[^0-9]'

Remember from above that PATTERN constructs rexps out of literals, and also concatenates multiple arguments to form a rexp which matches if all of those arguments match in sequence. However, the arguments to PATTERN don't have to be just strings; they can be other rexps, which are concatenated correctly to produce a new rexp. The following expression produces a rexp which matches the string 'abc' followed by any of 'd', 'e', or 'f'.
'abc[def]'

Instead of passing multiple arguments to PATTERN to obtain concatenation, you can simply use the '+' operator, which has exactly the same effect, but in many circumstances may produce easier-to-read code. However, if '+' is used in this way, its left operand _must_ be a rexp; a plain string won't work.
'abc[def]'

To obtain an alternation rexp--one which matches if any one of several other rexps match--we use the ANYONEOF function. This is equivalent to the "|" character in regexp notation.
'a|b|c'

As with PATTERN and '+', the '|' operator may be used in place of ANYONEOF to obtain alternation. As usual, the left-hand operand must be a rexp:
'a|b|c'
Note in the above that only the _first_ operand needs to be a rexp; this is because the first and second operands combine to form a rexp, and that rexp then becomes the left operand for the second '|' operator.

Now we come to a very significant difference between regexps and rexps: the ability to combine smaller expressions into larger expressions. Below are two regexps, the first matching any one of 'a' or 'b' or 'c', the second matching 'd', 'e', or 'f'. It would be nice if there were an easy way to combine them to match strings of the form ('a' or 'b' or 'c') followed by ('d' or 'e' or 'f') using simple string addition:
'a|b|cd|e|f'
Unfortunately, this produces a regexp which matches any one of 'a', 'b', 'cd', 'e', or 'f'. The simplest way I know of to achieve the desired result in this case is something like "("+"a|b|c"+")("+"d|e|f"+")". This is not exactly pretty, or easy to type. Something like this isn't necessary when dealing with all string literals as above, but what if the two operands were other regexps? Then you would have to type something like "("+X+")("+Y+")".

This is much clearer using rexps:
'(?:a|b|c)(?:d|e|f)'
or, shortening the expression using the '+' operator:
'(?:a|b|c)(?:d|e|f)'
Note that when rexps are put together like this, the parentheses used for grouping are 'numberless' parentheses--they will not be considered when extracting match subresults using numbered groups. Since the insertion of these parentheses in the produced regexp is invisible to the rex user, this is exactly what is desired.
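
The difference between the two regexps above can be checked with plain 're'. This is only a sketch: the seq helper is hypothetical, and unlike rex it naively wraps every part in numberless parentheses whether or not they are needed.

```python
import re

def seq(*parts):
    """Concatenate regexp strings, wrapping each in a non-capturing group.

    Hypothetical helper for illustration; not the rex implementation.
    """
    return "".join("(?:" + part + ")" for part in parts)

naive = "a|b|c" + "d|e|f"         # plain string addition
grouped = seq("a|b|c", "d|e|f")   # each operand grouped first

naive_hit = re.fullmatch(naive, "ad")      # fails: alternatives are a, b, cd, e, f
grouped_hit = re.fullmatch(grouped, "ad")  # succeeds
numbered = grouped_hit.groups()            # no numbered groups were created
```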

Precedence works as you might expect, with '+' having higher precedence than '|' (though the example below is rather simple as an illustration of this.)
'ab|ef'

To match a pattern 0 or more times, use the ZEROORMORE function. This is analogous to the regexp '*' character. Note that parentheses are inserted to ensure the function applies to all of what you pass in.
'(?:abc)*'

ONEORMORE matches a sequence of one or more rexps, and is like the "+" regexp operator.
'(?:abc)+'

The short way of obtaining repetition, and of matching more limited repetitions of a pattern, is to use the "*" operator. This expression is the same as ZEROORMORE("abc"):
'(?:abc)*'

...and this is the same as ONEORMORE("abc"):
'(?:abc)+'

If a negative sign precedes the match number, it indicates the resulting rexp should match _no more_ than that many repetitions of the (positive) number. This matches anywhere from 0 to 3 repetitions of "abc":
'(?:abc){0,3}'
Use a two-tuple to specify both an upper and lower bound. Match anywhere from 2 to five repetitions of "abc".
'(?:abc){2,5}'
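
As a quick check of the generated regexps shown above, the following sketch feeds them directly to the standard 're' module. It does not use rex, and the accepted_counts helper is made up for this illustration.

```python
import re

def accepted_counts(regexp, unit="abc", limit=7):
    """Return the repetition counts of `unit` that `regexp` matches in full."""
    compiled = re.compile(regexp)
    return [n for n in range(limit) if compiled.fullmatch(unit * n)]

up_to_three = accepted_counts(r"(?:abc){0,3}")
two_to_five = accepted_counts(r"(?:abc){2,5}")

# Why the inserted parentheses matter: without them, '*' applies only to 'c'.
ungrouped = re.fullmatch(r"abc*", "abccc") is not None
grouped = re.fullmatch(r"(?:abc)*", "abccc") is not None
```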

The OPTIONAL function indicates that the argument rexp is optional (the containing pattern will match whether or not the rexp produced by OPTIONAL matches.)
'(?:\\-)?'
There is no shorthand form for OPTIONAL. However, the following is semantically identical, though it produces a different regexp:
'(?:\\-){0,1}'

Let's look a bit more at how easy it is to combine rexps into more complex rexps. PATTERN provides an attribute defining a rexp which matches floating-point numbers (without exponent):
'(?:\\+|\\-)?\\d+(?:\\.\\d*)?'

Using this to build a complex number matcher (assuming no whitespace) is trivial:
'(?:\\+|\\-)?\\d+(?:\\.\\d*)?(?:\\+|\\-)(?:\\+|\\-)?\\d+(?:\\.\\d*)?i'
I think the rexp construct is a little easier to understand and modify than the produced regexp :)

What if we want to extract the real and imaginary parts of any complex number we happen to match? To do this, we use named groups, created by indexing a rexp with the group name:
>>> complexrexp = PATTERN.float['re'] + ANYONEOF("+", "-") + PATTERN.float['im'] + "i"
>>> complexresult = complexrexp.match("-3.14+2.17i")

By the way, here's the regexp resulting from the above rexp.
'(?P<re>(?:\\+|\\-)?\\d+(?:\\.\\d*)?)(?:\\+|\\-)(?P<im>(?:\\+|\\-)?\\d+(?:\\.\\d*)?)i'
Would you really like to write it out by hand?

To extract what matched the named group, we simply index the match result:
'-3.14'
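
The generated regexp can also be tried directly in the standard 're' module, which makes it easy to confirm what each named group captures. In this sketch the constant names FLOAT and COMPLEX are mine, not rex's; the pattern text is the one printed above.

```python
import re

# The float regexp shown earlier, and the complex-number regexp built from it.
FLOAT = r"(?:\+|\-)?\d+(?:\.\d*)?"
COMPLEX = r"(?P<re>" + FLOAT + r")(?:\+|\-)(?P<im>" + FLOAT + r")i"

m = re.match(COMPLEX, "-3.14+2.17i")
real = m.group("re")
imag = m.group("im")   # the sign between the two parts is not inside either group
```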

I highly recommend using named groups when constructing rexps; it makes code more readable and less error-prone. However, if you do want to use a numbered group for some reason, use the group() method on an existing rexp:
'((?:\\+|\\-)?\\d+(?:\\.\\d*)?)(?:\\+|\\-)((?:\\+|\\-)?\\d+(?:\\.\\d*)?)i'

If a match fails, we get a MatchResult which evaluates to False when used as a boolean:
False

Attempting to extract a subgroup from a failed match raises a KeyError and prints an appropriate error message.
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/local/python/packages/rex/__init__.py", line 508, in __getitem__
    return self.get(key)
  File "/local/python/packages/rex/__init__.py", line 518, in get
    else: raise KeyError, "Invalid group index: "+ `key` + " (a failed match result only has one group, indexed by 0)."
KeyError: "Invalid group index: 're' (a failed match result only has one group, indexed by 0)."

However, extracting group 0 (which in a successful match always represents the entirety of the matched text) of a failed MatchResult still results in the entire string against which the match was attempted. This may seem pointless now, but will be very useful when we get into iterative searches using rexps.
'-3.14*2.17i'

We can do some nice things with named groups which cannot be accomplished with standard regexps. The keys() method returns the names of all named subgroups which participated in a match (but does not return the names of subgroups which did _not_ participate in the match, such as a subgroup contained in a failed OPTIONAL rexp):
['re', 'im']
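
With plain 're', a similar result takes a little filtering, since groupdict() reports non-participating groups as None. The sketch below is one way to approximate the keys() behaviour described above; the function name and the sample pattern are invented for illustration.

```python
import re

def participating_groups(match):
    """Names of the named groups that actually took part in `match`."""
    return [name for name, value in match.groupdict().items()
            if value is not None]

signed = re.compile(r"(?P<sign>\-)?(?P<digits>\d+)")

without_sign = participating_groups(signed.match("42"))
with_sign = sorted(participating_groups(signed.match("-42")))
```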

In addition, if a named group matches the _entire_ matched string, then the name of that group can be obtained with the 'getname' method. This is useful for determining which of a number of top-level alternative rexps matched.
>>> altrexp = PATTERN.float['number'] | PATTERN.word['symbol']
>>> altrexp.match("3.14").getname()
'number'
>>> altrexp.match("abc").getname()
'symbol'
Note that if more than one named group matches the entire matched substring, then getname() will return one of the appropriate names, but which one is not predictable.
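
For readers comparing against plain 're': match objects there have a lastgroup attribute that serves a similar purpose for top-level named alternatives. In the sketch below the 'symbol' pattern is my own stand-in, since the regexp behind PATTERN.word is not shown here.

```python
import re

# Two top-level named alternatives. The 'number' part is the float regexp
# shown earlier; the 'symbol' part is an assumed identifier-like pattern.
ALT = re.compile(r"(?P<number>(?:\+|\-)?\d+(?:\.\d*)?)|(?P<symbol>[A-Za-z_]\w*)")

kind_of_number = ALT.match("3.14").lastgroup
kind_of_symbol = ALT.match("abc").lastgroup
```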

Lesser-used pattern matching facilities have not been neglected. Non-greedy repetition can be expressed in the same way as standard (greedy) repetition, by using the ** operator in place of *:
'(?:a)+?'
In this next example, the big number in the resulting regexp is sys.maxint. This is the closest I know how to come to expressing "three to infinity" in a regexp pattern.
'(?:a){3,2147483647}?'
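
The effect of the non-greedy forms is easy to see with the standard 're' module. This sketch uses plain regexps rather than rex. It also uses the open-ended {3,} form, which 're' accepts as "three or more" without an explicit upper bound.

```python
import re

text = "aaaaa"

greedy = re.match(r"(?:a)+", text).group(0)        # takes as much as possible
lazy = re.match(r"(?:a)+?", text).group(0)         # takes as little as possible
lazy_three = re.match(r"(?:a){3,}?", text).group(0)  # at least three, as few as possible
```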

Lookahead and lookbehind assertions are supported with the '+' and '-' unary operators:
'(?<=a)'

Both types of assertions can be negated by prepending with a tilde, as can be done with CHARSET rexps:
'(?!a)'
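
As a reminder of what the four assertion forms do, here is a small sketch in plain 're' (not rex). The positions helper and the sample text are made up for the illustration.

```python
import re

text = "ab cb ad"   # indices: a=0 b=1 ' '=2 c=3 b=4 ' '=5 a=6 d=7

def positions(regexp):
    """Start offsets of every match of `regexp` in the sample text."""
    return [m.start() for m in re.finditer(regexp, text)]

b_after_a = positions(r"(?<=a)b")       # lookbehind: 'b' preceded by 'a'
b_not_after_a = positions(r"(?<!a)b")   # negated lookbehind
a_before_b = positions(r"a(?=b)")       # lookahead: 'a' followed by 'b'
a_not_before_b = positions(r"a(?!b)")   # negated lookahead
```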

Any regular expression can be considered as denoting the set of all strings which it matches. (Or, for those who've taken a formal class on REs and finite automata, the set of strings which it "generates".) So, matching a piece of text against a regular expression is really the same thing as asking if that text is in the set of strings "generated" by the regular expression. Rexps provide a nice way of doing this using Python's "in" operator. The examples below ask if a couple of strings are in the set of strings consisting of a sequence of one or more 'a' characters:
False
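
To show how a string subclass can support this kind of membership test, here is a toy sketch. MiniRexp is hypothetical and is not the actual rex implementation; in particular, treating "in" as a full-string match is my assumption based on the set interpretation described above.

```python
import re

class MiniRexp(str):
    """Toy str subclass whose 'in' operator means 'is matched in full by'."""

    def __contains__(self, text):
        # str(self) hands re a plain string holding the regexp source.
        return re.fullmatch(str(self), text) is not None

one_or_more_a = MiniRexp("(?:a)+")

all_a = "aaa" in one_or_more_a
not_all_a = "aab" in one_or_more_a
```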

Searching text is done using a rexp's search() method. Let's find the string "cd" in the text "abcdef":
We know the search succeeded by evaluating the MatchResult as a boolean...
True
...and can easily extract the start and end positions of the matched string, and the string itself (which might be useful if the search rexp was not a literal):
>>> searchresult.start(0)
2
>>> searchresult.end(0)
4
>>> searchresult[0]
'cd'
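
The same search written against the standard 're' module looks like this, for readers who want to compare the two interfaces. Note that 're' returns None on failure rather than a false match object.

```python
import re

searchresult = re.search("cd", "abcdef")

found = bool(searchresult)      # a successful match object is truthy
start = searchresult.start(0)
end = searchresult.end(0)
matched_text = searchresult.group(0)

missing = re.search("xy", "abcdef")   # None when nothing matches
```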

Iterative searching--that is, searching for _all_ instances in a piece of text matched by a regular expression--can be a bit awkward when using regexps. It is very easy when using rexps. The example below uses the fact that __str__ in a MatchResult object is defined so that str(matchresult) returns the entire substring matched by the MatchResult; str() is mapped over the sequence of MatchResult instances generated by itersearch() to get a list of the matched substrings.
['0', '9', '7']

'itersearch' is a generator function, which means that it only computes and returns MatchResult instances as they are requested by the enclosing loop. So, itersearch() can be used in a memory-efficient manner even on very large pieces of text.

'itersearch' can also be used more flexibly. It defines an optional parameter named 'matched' which defaults to True and indicates that only successful MatchResults should be returned. If we perform a search with this parameter set to False, then only _failing_ MatchResults will be returned...
['ab', 'c', 'de']
...and if None is passed as the value of 'matched', then both successful and failed MatchResults will be returned:
['ab', '0', 'c', '9', 'de', '7']
We can still determine which of these results are failures and which are successes by using the MatchResults as a boolean:
[False, True, False, True, False, True]

This leads to a great little idiom for going through _all_ the text of a string, and processing each part as appropriate (the bit of Python code below is not part of the interactive session):

for result in myRexp.itersearch(myText, matched=None):
    if result: ...process the successful match...
    else: ...process the failed match...
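
For readers without rex at hand, the idiom above can be approximated with re.finditer. This is a rough sketch, not rex's implementation: iter_segments is a made-up name, it yields (matched, text) pairs instead of MatchResult objects, and the sample text is inferred from the outputs shown earlier.

```python
import re

def iter_segments(regexp, text):
    """Yield (matched, substring) pairs covering all of `text` in order.

    Matched pieces come from re.finditer; the gaps between them are
    reported as unmatched pieces. Empty gaps are skipped.
    """
    pos = 0
    for m in re.finditer(regexp, text):
        if m.start() > pos:
            yield False, text[pos:m.start()]
        yield True, m.group(0)
        pos = m.end()
    if pos < len(text):
        yield False, text[pos:]

segments = list(iter_segments(r"\d", "ab0c9de7"))
pieces = [piece for _, piece in segments]
flags = [matched for matched, _ in segments]
```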

Rexps also have a 'replace' method, to replace found text with other text. Let's replace all digits in a string with the word "DIGIT":
'abDIGITcDIGITdeDIGIT'

More specific replacements can be achieved by passing in a dictionary as the replace argument. Any matched substring must have a key defined in the dictionary (else a KeyError will be thrown), and is replaced with the value associated with that key:
'abZEROcNINEdeSEVEN'

For the ultimate in flexibility, we can pass in a function as the replace argument. Whenever a match is found, its MatchResult will be passed as an argument to the function, and the result of the function will be used as the replacement value. Here's an example which increments the integer interpretation of each digit in some text by 1.
...     return str(1+int(matchresult[0]))
...
'ab1c10de8'
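
The three kinds of replacement described above can be reproduced with re.sub from the standard library, as in the sketch below. It does not use rex; the input text "ab0c9de7" is inferred from the outputs shown, and the dictionary case is emulated with a small function because re.sub does not accept a dictionary directly.

```python
import re

text = "ab0c9de7"
words = {"0": "ZERO", "9": "NINE", "7": "SEVEN"}

# 1. Fixed replacement string.
fixed = re.sub(r"\d", "DIGIT", text)

# 2. Dictionary lookup; a digit missing from `words` raises KeyError.
mapped = re.sub(r"\d", lambda m: words[m.group(0)], text)

# 3. Arbitrary function of the match: add 1 to each digit.
def increment(matchresult):
    return str(1 + int(matchresult.group(0)))

bumped = re.sub(r"\d", increment, text)
```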
 
