ANN: 'rex', a module for easy creation and use of regular expressions

Discussion in 'Python' started by Kenneth McDonald, Jun 10, 2004.

    This is a pre-release of the 'rex' module, a Python module
    intended to ease the use of regular expressions. While I think the code
    is currently quite useful, I advise against using it except on an
    experimental basis, as the API is subject to change. One of the purposes
    of this release is to solicit feedback on the final API.

    The module is available as two text files appended at the end of this
    message: '__init__.py' should be placed in a folder called 'rex' on
    your python path, while the test file can be placed anywhere, and
    simply does some basic testing of rex. I'd attempted to release this
    about a week ago with a uuencoded copy of the module, but I believe my
    news service disallowed that message; at least, I never saw it on my
    news feed. Sorry if this is a repeat.

    Immediately below are a couple of snippets from the internal rex documentation,
    to give you an idea of what rex does. I hope this will intrigue you enough
    to try it and give me feedback. rex is free software.

    At this point, I would greatly appreciate feedback. Do you think this
    could be a useful module? Do you have any suggestions for the API?
    What would you like to see for additional functionality?

    If there is sufficient enthusiasm, I would certainly welcome code
    or documentation contributions. But we'll worry about that if it
    turns out enough people are interested... :)


    A Bit About rex...
    ==================

    'rex' stands for any of (your choice):
    - Regular Expression eXtensions
    - Regular Expressions eXpanded
    - Rex, King of Regular Expressions (ha, ha ha, ha).

    rex provides a completely different way of writing regular expressions (REs).
    You do not use strings to write any part of the RE _except_ for
    regular expression literals. No escape characters, metacharacters,
    etc. Regular expression operations, such as repetition, alternation,
    concatenation, etc., are done via Python operators, methods, or
    functions.

    As an example, take a look at the definition of an RE matching a complex
    number, an example included in test_rex.py. The rex Python code to do this is:

    COMPLEX = PAT.aFloat['re'] + \
              PAT.anyWhitespace + \
              ALT("+", "-")['op'] + \
              PAT.anyWhitespace + \
              PAT.aFloat['im'] + \
              'i'

    while the analogous RE is:

    (?P<re>(?:\+|\-)?\d+(?:.\d*)?)\s*(?P<op>\+|\-)\s*(?P<im>(?:\+|\-)?\d+(?:.\d*)?)i

    The rex code is more verbose than
    the simple RE (which, by the way, was the RE generated by the rex code,
    and is pretty much what you'd produce by hand). It is also FAR easier to
    read, modify, and debug. And, it illustrates how easy it is to reuse rex patterns:
    PAT.aFloat and PAT.anyWhitespace are predefined patterns provided in rex which
    match, respectively, a string representation of a floating point number (no exponent),
    and a sequence of zero or more whitespace characters.
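
    For anyone who wants to poke at the generated pattern without installing rex, here is a minimal sketch that uses only the standard re module (written for Python 3). The names FLOAT, COMPLEX_RE and parse_complex are made up for this illustration and are not part of rex. One deliberate difference: the decimal point is escaped here, whereas the RE quoted above contains a bare '.', which matches any character.

```python
import re

# Hand-written counterpart of the COMPLEX pattern, built from a reusable piece.
# Assumption: the decimal point is escaped (\.), unlike the RE quoted in the post.
FLOAT = r"(?:\+|-)?\d+(?:\.\d*)?"
COMPLEX_RE = re.compile(
    r"(?P<re>%s)\s*(?P<op>\+|-)\s*(?P<im>%s)i" % (FLOAT, FLOAT)
)


def parse_complex(text):
    """Return (re, op, im) strings for the first complex number in text, or None."""
    m = COMPLEX_RE.search(text)
    if m is None:
        return None
    return m.group("re"), m.group("op"), m.group("im")


print(parse_complex("z = 5.78 - 3.14i"))
```

    Building the pattern from the FLOAT fragment is the same reuse idea that PAT.aFloat provides, only done with string formatting instead of operators.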


    FILE __init__.py
    ============================================================================
    '''Module to provide a better interface to regular expressions.

    LICENSE

    This is the 'rex' module, written by Kenneth M. McDonald, (c) 2004 Kenneth M. McDonald.
    You may use it for any reason, subject to the following simple conditions:

    1) You may not distribute or publish modified versions of this
    module or files in it, unless you change the name of the module.

    2) You may incorporate this code into other files. If you use more
    than 30 lines (total) of code/text from this module in another file or group of files, you must
    include a line stating "Some of the code in this file/these files are from the
    free Python module 'rex', by Kenneth M. McDonald."

    3) If you sell or otherwise gain revenue from a product using this
    code, then:

    a) The product or its documentation must state, in a reasonably
    prominent place, "Some of the functionality of this program is
    provided by freely available source code", or words to that effect.
    Incorporating this phrase in the 'About' box of the program, or in
    the introduction to the User's manual, will satisfy the requirement
    of 'reasonable prominence'. You do not need to mention this module
    specifically, nor do you need to indicate what functionality this module
    is used for.

    b) You must ensure the buyers of the product have access to the
    source code for this module. You can do this by including this module
    with your software, by providing a URL to a download site for this
    module (I do not maintain such a URL), or by other similar means.

    INTRODUCTION

    'rex' stands for any of (your choice):
    - Regular Expression eXtensions
    - Regular Expressions eXpanded
    - Rex, King of Regular Expressions (ha, ha ha, ha).

    rex provides a completely different way of writing regular expressions (REs). You
    do not use strings to write any part of the RE _except_ for
    regular expression literals. No escape characters, metacharacters,
    etc. Regular expression operations, such as repetition, alternation,
    concatenation, etc., are done via Python operators, methods, or
    functions.

    The major advantages of rex are:

    - rex expressions are checked for well-formedness by the Python
    parser; this will typically provide earlier and easier-to-understand
    diagnoses of syntactically malformed regular expressions

    - rex expressions are all strings! They are, in fact, a specialized subclass
    of strings, which means you can pass them to existing code
    which expects REs. [NOTE: This may change in the future.]

    - rex goes to some lengths to produce REs which are similar to
    those written by hand, i.e. it tries to avoid unnecessary use of
    nongrouping parentheses, uses special escape sequences
    where possible, writes 'A?' instead of 'A{0,1}', etc. In general,
    rex tries to produce concise REs, on the theory that if you
    really need to read the buggers at some point, it's easier to
    read simpler ones than more complex ones.

    - [This is the biggie.] rex permits complex REs to be built up easily
    of smaller parts. In fact, a rex definition for a complex RE is likely
    to end up looking somewhat like a mini grammar.

    - [Another biggie.] As an ancillary to the above, rex permits REs to be easily reused.

    As an example, take a look at the definition of an RE matching a complex
    number, an example included in test_rex.py. The rex Python code to do this is:

    COMPLEX = PAT.aFloat['re'] + \
              PAT.anyWhitespace + \
              ALT("+", "-")['op'] + \
              PAT.anyWhitespace + \
              PAT.aFloat['im'] + \
              'i'

    while the analogous RE is:

    (?P<re>(?:\+|\-)?\d+(?:.\d*)?)\s*(?P<op>\+|\-)\s*(?P<im>(?:\+|\-)?\d+(?:.\d*)?)i

    The rex code is more verbose than
    the simple RE (which, by the way, was the RE generated by the rex code,
    and is pretty much what you'd produce by hand). It is also FAR easier to
    read, modify, and debug. And, it illustrates how easy it is to reuse rex patterns:
    PAT.aFloat and PAT.anyWhitespace are predefined patterns provided in rex which
    match, respectively, a string representation of a floating point number (no exponent),
    and a sequence of zero or more whitespace characters.

    USE

    This is a quick overview of how to use rex. See documentation associated
    with a specific method/function/name for details on that entity.

    In the following, we use the abbreviation RE to refer to standard regular
    expressions defined as strings, and the word 'rexp' to refer to rex objects
    which denote regular expressions.

    - The starting point for building a rexp is either rex.PAT,
    which we'll just call PAT, or rex.CHAR, which we'll just call CHAR.
    CHAR builds rexps which match single character strings. PAT builds rexps
    which match strings of varying lengths.

    - PAT(string) returns a rexp which will match exactly the string given, and nothing else.

    - PAT._someattribute_ returns (for defined attributes) a corresponding rexp.
    For example, PAT.aDigit returns a rexp matching a single digit.

    - CHAR(a1, a2, . . .) returns a rexp matching a single character from a set
    of characters defined by its arguments. For example, CHAR("-", ["0","9"], ".")
    matches the characters necessary to build basic floating point numbers.
    See CHAR docs for details.

    - Now assume that A, B, C,... are rexps. The following Python expressions
    (_not_ strings) may be used to build more complex rexps:

    - A | B | C . . . : returns a rexp which matches a string if any of the operands
    match that string. Similar to "A|B|C" in normal REs, except of course you can't
    use Python code to define a normal RE.

    - A + B + C ...: returns a rexp which matches a string if all of A, B, C match consecutive
    substrings of the string in succession. Like "ABC" in normal REs.

    - A*n : returns a rexp which matches A repeated a number of times determined by n.
    This replaces '?', '+', '*', and '{m,n}' as used in normal REs. See the docs for repeat() for details.

    - A**n : Like A*n, but does nongreedy matching.

    - +A : positive lookahead assertion: matches if A matches, but doesn't
    consume any of the input.

    - ~+A : negative lookahead assertion: matches if A _doesn't_ match,
    but doesn't consume any of the input.

    - -A, ~-A : positive and negative lookback assertions. Like lookahead assertions,
    but in the other direction.

    - A[name] : name must be a string: anything matched by A can be referred
    to by the given name in the match result object. (This is the equivalent
    of named groups in the re module).

    - A.group() : A will be in an unnamed group, referable by number.

    - In addition, a few other operations can be done:

    - Some of the attributes defined in PAT have "natural inverses"; for such
    attributes, the inverse may be taken. For example, ~PAT.aDigit is
    a pattern matching any character except a digit.

    - Character classes may be inverted: ~CHAR("aeiouAEIOU") returns a pattern
    matching anything except a vowel.

    - 'ALT' gives a different way to denote alternation: ALT(A, B, C,...) does
    the same thing as A | B | C | . . ., except that none of the arguments
    to ALT need be rexps; any which are normal strings will be converted
    to a rexp using PAT.

    - 'PAT' can take multiple arguments: PAT(A, B, C,...), which gives the same
    result as PAT(A) + PAT(B) + PAT(C) + . . . .

    - Finally, a very convenient shortcut is that only the first object in a sequence of
    operator/method calls needs to be a rexp; all others will be automatically
    converted as if PAT(...) had been called on them. For example, the
    sequence A | "hello" is the same as A | PAT("hello").

    'CHAR' USE

    CHAR(args...) defines a character class. Arguments are any number of strings or two-tuples/two-element lists.

    e.g.
    CHAR("ab-z")
    is the same as the regular expression r"[ab\-z]". NOTE that there are no 'character range metacharacters';
    the preceding defines a character class containing four characters, one of which is a '-'.

    This is a character
    class containing a backslash, hyphen, and open/close brackets:

    CHAR(r"\-[]") or CHAR("\\-[]")

    Note that we still need a raw string (or a doubled backslash) so that normal Python string escaping does not consume the backslash.

    To define ranges, do this :

    CHAR(["a","z"], ["A","Z"])

    To define inverse ranges, use the ~ operator, e.g. to define the class of all non-numeric characters:

    ~CHAR(["0","9"])

    Character classes cannot (yet) be doubly negated: ~~CHAR("A") is an error.
    '''

    import re, string

    def _escapeSpecialRangeChars(char):
        '''Function to escape characters which have a special meaning in character ranges.
        We don't actually need to escape '[', but I think it makes the string representation a little
        less confusing.'''
        if char in "^-\\[]": return "\\"+char
        else: return char

    class _rexobj(str):
        '''Class of strings which are to be treated as regular expressions.'''
        def __init__(self, s):
            str.__init__(self, s)

        def compile(self, ignorecase=False, locale=False, multiline=False, dotmatchesnewline=False, unicode=False):
            '''Compile this rexp with the re module; each true keyword argument sets the matching re flag.'''
            flags = (ignorecase and re.IGNORECASE) | \
                    (locale and re.LOCALE) | \
                    (multiline and re.MULTILINE) | \
                    (dotmatchesnewline and re.DOTALL) | \
                    (unicode and re.UNICODE)
            return re.compile(self, flags)

        def __mul__(self, num):
            '''Greedy repetition operator'''
            return self.repeat(num)

        def __pow__(self, num):
            '''Nongreedy repetition operator'''
            return self.repeat(num, greedy=False)

        def __pos__(self):
            '''Lookahead assertion'''
            return _relookaheadassertion("(?=%s)" % (self, ))

        def __neg__(self):
            '''Lookback assertion'''
            return _relookbackassertion("(?<=%s)" % (self, ))

        def __contains__(self, text):
            '''Another abuse of operators. A regular expression can be considered as the set
            of all strings it can match (or generate, for those of you who know the theory.)
            The code

                if text in rexp: body

            executes 'body' if text is in the set of strings generated by the rexp, and skips it otherwise.
            '''
            pattern = PAT.stringStart + self + PAT.stringEnd
            return bool( pattern.match(text) )

        def repeat(self, num=0, greedy=True, doc=None):
            '''This is the repetition function. However, repetition is normally done
            with the * (greedy) operator, or ** (nongreedy) operator,
            like so:
                A*3 : Three or more occurrences of A. In re terms, same as A{3,}
                A*0 : In re terms, same as A*
                A*1 : In re terms, same as A+
                A*(2, 4) : 2-4 occurrences of A. In re terms, same as A{2,4}
                A*-5 : Up to 5 occurrences of A. In re terms, same as A{0,5}
                A*-1 : In re terms, same as A?

                A**x : nongreedy versions of the above, like A*?, A+?, A{1,3}?.

            You can use repeat() as a functional equivalent; in this case, the
            'num' parameter is what you would pass as the second argument
            to */**, 'greedy' is a boolean determining if the operation should
            be greedy or nongreedy, and 'doc' can be used for a documentation
            comment (but is not currently used in any way.)
            '''
            min=0
            max=None
            if isinstance(num, int):
                if num >=0: min=num
                else: max=-num
            else:
                assert isinstance(num, tuple) and len(num)==2
                min, max = num

            if greedy: nongreedy=""
            else: nongreedy="?"
            del greedy
            if min==0 and max==None:
                return _reblock(str(self.block())+"*"+nongreedy)
            elif min==1 and max==None:
                return _reblock(str(self.block())+"+"+nongreedy)
            elif min==0 and max==1:
                return _reblock(str(self.block())+"?"+nongreedy)
            else:
                # An open-ended repetition must be written 'A{n,}'; formatting
                # max=None directly would produce the invalid 'A{n,None}'.
                if max==None: max=""
                return _reblock("%s{%s,%s}%s" % (self.block(), min, max, nongreedy))

        def name(self, name, doc=None):
            '''Enclose this RE in a named group.'''
            return _reblock('(?P<%s>%s)' % (name, self))

        def __getitem__(self, key):
            if isinstance(key, str): return self.name(key)
            else: return str.__getitem__(self, key)

        def group(self, doc=None):
            '''Enclose this RE in a numbered group.'''
            return _reblock('(%s)' % (self,))

        def __or__(self, other):
            '''Alternation (choice) operator.'''
            other = _convert(other)
            if self.precedence() > _realt.precedence():
                self = self.block()
            if other.precedence() > _realt.precedence():
                other = other.block()
            return _realt('%s|%s' % (self, other))

        def block(self, doc=None):
            '''Enclose this RE, if necessary, in an anonymous block, i.e. '(?:...)'. If
            the RE is already in a block, it will be returned unchanged.'''
            if self.precedence()==0: return self
            else: return _reblock('(?:%s)' % (self,))

        def precedence():
            '''Indicates the binding precedence of the top-level operators used to form this
            RE. 0 is highest precedence.'''
            raise NotImplementedError, "precedence() should be defined in a subclass."
        precedence = staticmethod(precedence)

        def __add__(self, other):
            '''Concatenation operator.'''
            other = _convert(other)
            if self.precedence() > _recat.precedence():
                self = self.block()
            if other.precedence() > _recat.precedence():
                other = other.block()
            return _recat('%s%s' % (self, other))

        def itersearch(self, text, matched=None):
            '''Iterates sequentially through nonoverlapping substrings in 'text'
            which match and do not match self. See docs on match results.

            @param matched: If None (the default), returns both substrings from text
            matching the pattern, and those substrings between the matches. If true,
            returns only the matching substrings. If false (other than None), returns only the nonmatching substrings.

            EXAMPLE:
            To print all digit sequences in a string:
                for matchresult in PAT.someDigits.itersearch(aString):
                    if matchresult: print matchresult[0]
            To print all the sequences from the string that _weren't_ digits:
                for matchresult in PAT.someDigits.itersearch(aString):
                    if not matchresult: print matchresult[0]
            '''
            # Named 'compiled' so that the local does not shadow the re module.
            compiled = self.compile()
            start = 0
            while True:
                match = compiled.search(text, start)
                # If we couldn't find any more matches, yield whatever nonmatching remnant is left, and return
                if not match:
                    if not matched and start < len(text):
                        yield MatchResult(text[start:], start, len(text))
                    return
                else:
                    # If there was an unmatched substring before the found match, yield it.
                    if not matched and start < match.start(): yield MatchResult(text[start:match.start()], start, match.start())
                    if matched==None or matched: yield MatchResult(match)
                    start = match.end()

        def iterstrings(self, text, matched=None):
            '''Convenience function; simply extracts and returns group 0 from an underlying call
            to itersearch().'''
            for m in self.itersearch(text, matched): yield m[0]

        def search(self, text, matched=None):
            '''Returns the first element from a search of the text using itersearch and the 'matched' param'''
            # Pass 'matched' through; the original call dropped it.
            return self.itersearch(text, matched).next()

        def match(self, text):
            '''@todo: Not yet fully implemented, please don't use.'''
            return MatchResult(self.compile().match(text))


    class MatchResult(object):
        '''A MatchResult specifies what was found by an attempt to look for a pattern
        in a string.'''
        def __init__(self, match, start=None, end=None):
            self.__match = match
            if self:
                self.__start = match.start()
                self.__end = match.end()
            else:
                self.__start = start
                self.__end = end

        def __nonzero__(self):
            '''A match result is 'true' if it is the result of successfully finding
            a pattern in a target string, in which case self.__match will be
            an re.MatchObj or something like that. However, match results
            may also be used to indicate the part of a target string that a
            pattern _failed_ to match; in this case, self.__match will be
            that part of the target, and the match result will be 'false'.'''
            return not (isinstance(self.__match, str) or self.__match == None)

        def __getitem__(self, key):
            return self.get(key)

        def expand(self, expansion):
            '''@todo: don't use this yet.'''
            return self.__match.expand(expansion)

        def get(self, key=0):
            '''Returns a string matching a group identified by name or by index. If this is a
            failed match result (see docs for '__nonzero__' in this class), the only allowable
            index is 0, which indicates the entire unmatched string.'''
            if not self:
                if key==0: return self.__match
                else: raise KeyError, "Invalid group index: "+ `key` + " (a failed match result only has one group, indexed by 0)."
            result = self.__match.group(key)
            if result==None: raise KeyError, "Invalid group index: "+ `key`
            return result

        start = property(fget=lambda self: self.__start, doc="The starting position of the match result in the search string")
        end = property(fget=lambda self: self.__end, doc="The ending position of the match result in the search string")

        string = property(fget=lambda self: self.get(), doc="The string found by the match result.")

        def __str__(self): return self.string


    class _recat(_rexobj):
        '''RE constructed via concatenation'''
        def __init__(self, s):
            _rexobj.__init__(self, s)

        def precedence():
            '''The precedence of a single character can always be considered as 0, since
            it can't be split into subparts.'''
            #if len(self)==1: return 0
            return 1
        precedence = staticmethod(precedence)

    class _realt(_rexobj):
        '''re constructed from alternation operators'''
        def __init__(self, s):
            _rexobj.__init__(self, s)

        def precedence(): return 2
        precedence = staticmethod(precedence)

    class _reblock(_rexobj):
        '''Superclass for all classes of res which are in a block of
        some sort, i.e. which do not need to be enclosed in parentheses
        in order to assure associativity.'''
        def __init__(self, s):
            _rexobj.__init__(self, s)

        def precedence(): return 0
        precedence = staticmethod(precedence)

    class _relookaheadassertion(_reblock):
        _inverses = {'=':'!', '!':'='}
        def __init__(self, s):
            _reblock.__init__(self, s)

        def __invert__(self):
            assert self[0:2] == "(?"
            return _relookaheadassertion("(?" + self._inverses[self[2]] + self[3:])

    class _relookbackassertion(_reblock):
        _inverses = {'=':'!', '!':'='}
        def __init__(self, s):
            _reblock.__init__(self, s)

        def __invert__(self):
            assert self[0:3] == "(?<"
            return _relookbackassertion("(?<" + self._inverses[self[3]] + self[4:])

    class _range(_reblock):
        def __init__(self, s):
            _reblock.__init__(self, s)

        def __invert__(self):
            return _inverserange(self[0] + "^" +self[1:])

    class _inverserange(_reblock):
        def __init__(self, s):
            _reblock.__init__(self, s)

    class _reprimitive(_reblock):
        '''Atomic REs, i.e. not composed of other REs. Typically single-characters, special sequences, etc.'''
        def __init__(self, s):
            _reblock.__init__(self, s)

        def __invert__(self):
            if self in _primitiveInverses: return _primitiveInverses[self]
            raise NotImplementedError, 'No inverse for ' + self

    def RAW(s):
        '''@todo: This is a hack to let something else work. It will go away.'''
        return _reprimitive(s)

    def _convert(s):
        '''Convert s to a literal rexobj; if it is already a rexobj, leave it unchanged.'''
        if isinstance(s, _rexobj): return s
        else: return _recat(re.escape(s))

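    def _CHARfun(*args):
        '''Build a character class from primitives, strings of literal characters,
        and (start, end) ranges given as two-tuples or two-element lists.'''
        strings = ["["]
        for a in args:
            if isinstance(a, _reprimitive):
                strings.append(a)
            elif isinstance(a, str):
                strings.append("".join(map(_escapeSpecialRangeChars, a)))
            else:
                # The documentation allows two-element lists as well as two-tuples.
                assert isinstance(a, (tuple, list)) and len(a)==2
                startchar, endchar = a
                strings.append('%s-%s' %(_escapeSpecialRangeChars(startchar), _escapeSpecialRangeChars(endchar)))

        strings.append("]")
        return _range("".join(strings))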

    class _CHAR(object):

        def __call__(self, *args):
            return _CHARfun(*args)

    CHAR = _CHAR()

    MAYBE = -1
    ANY = 0
    SOME = 1

    class _PAT(object):
        dot = _reprimitive(".")
        # Any character including newline. (The earlier r"[.\n]" matched only a
        # literal '.' or a newline, because '.' is not special inside brackets.)
        aChar = _reprimitive(r"[\s\S]")
        aDigit = _reprimitive(r'\d')
        aWhitespace = _reprimitive(r'\s')
        aBackslash = _reprimitive(r'\\')
        anAlphanum = _reprimitive(r'\w')
        aLetter = _CHARfun(('a','z'), ('A','Z'))
        aPunctuationMark = _CHARfun("""~`!@#$%^&*()_-+={[}]|\:;"'<,>.?/""") # The standard US keyboard punctuation marks: does _not_ include whitespace chars.
        stringStart = _reprimitive(r'\A')
        stringEnd = _reprimitive(r'\Z')
        wordBorder = _reprimitive(r'\b')
        emptyString = _reprimitive('')
        someDigits = aDigit * 1
        anyDigits = aDigit * 0
        someLetters = aLetter*1
        anyLetters = aLetter * 0
        someChars = aChar*1
        anyChars = aChar * 0
        someWhitespace = aWhitespace*1
        anyWhitespace = aWhitespace*0
        anInt = (_convert("+")|"-")*MAYBE + someDigits
        # NOTE: _recat(".") is not escaped, so the '.' in the generated RE matches
        # any character, not just a decimal point; _convert(".") would match a literal dot.
        aFloat = anInt + (_recat(".") + anyDigits)*MAYBE

        def __call__(self, arg, *rest):
            '''Returns a _rexobj, a subclass of the builtin string class, which happens to know it
            is a regular expression.'''
            arg = _convert(arg)
            for next in rest:
                arg = arg + next
            return arg

    PAT = _PAT()

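    # Inverses are wrapped in _reprimitive so that the result of ~ is still a rexp
    # (a plain string would lose the rex operators).
    _primitiveInverses = {
        PAT.aDigit : _reprimitive(r'\D'),
        PAT.aWhitespace : _reprimitive(r'\S'),
        PAT.wordBorder : _reprimitive(r'\B'),
        PAT.anAlphanum: _reprimitive(r'\W'),
    }
    # Fill in the reverse mappings
    for key, val in _primitiveInverses.items():
        _primitiveInverses[val] = key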

    def ALT(arg, *rest):
        '''ALT(a, b, c,...) is the same as PAT(a) | b | c | ...'''
        arg = _convert(arg)
        for next in rest:
            arg = arg | next
        return arg

    if __name__ == '__main__':
        print "Look in 'rex/__init__.py' for documentation. Look in 'rex/_test/test_rex.py' for some examples of using rex."



    =============================================================================
    END __init__.py
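
    The core trick in the file above (a str subclass whose operators return new pattern strings, with a precedence number deciding when to add '(?:...)') can be sketched in a few lines. This is a simplified Python 3 illustration and not the rex implementation: Rx and lit are hypothetical names, and only '+', '|' and three repetition counts are handled.

```python
import re


class Rx(str):
    """Hypothetical, simplified stand-in for a rexp: a str that carries a precedence.

    0 = atomic or already grouped, 1 = concatenation, 2 = alternation.
    """

    def __new__(cls, text, prec=0):
        obj = super().__new__(cls, text)
        obj.prec = prec
        return obj

    def block(self):
        """Wrap in a non-capturing group unless the pattern is already atomic."""
        return self if self.prec == 0 else Rx("(?:%s)" % str(self))

    def __add__(self, other):
        # Concatenation: only alternations need grouping first.
        left, right = self, lit(other)
        if left.prec > 1:
            left = left.block()
        if right.prec > 1:
            right = right.block()
        return Rx(str(left) + str(right), prec=1)

    def __or__(self, other):
        # Alternation binds loosest, so operands never need grouping.
        return Rx("%s|%s" % (str(self), str(lit(other))), prec=2)

    def __mul__(self, n):
        # Only the three basic counts: 0 -> '*', 1 -> '+', -1 -> '?'.
        suffix = {0: "*", 1: "+", -1: "?"}[n]
        return Rx(str(self.block()) + suffix)


def lit(s):
    """Escape a plain string into a literal pattern; pass Rx objects through."""
    if isinstance(s, Rx):
        return s
    return Rx(re.escape(s), prec=0 if len(s) == 1 else 1)


sign = (lit("+") | "-") * -1
integer = sign + Rx(r"\d") * 1
grouped = (lit("ab") | "cd") + "e"
print(integer)
print(grouped)
```

    Because the result is still a str, it can be handed straight to re.compile(), which is the same property the rex documentation advertises.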



    FILE test_rex.py
    =============================================================================
    import unittest
    from rex import *

    class rex_test(unittest.TestCase):

        COMPLEX = PAT.aFloat['re'] + \
                  PAT.anyWhitespace + \
                  ALT("+", "-")['op'] + \
                  PAT.anyWhitespace + \
                  PAT.aFloat['im'] + \
                  'i'

        def testNames(self):
            '''Test extraction of data from named groups.'''
            results = []
            for c in self.COMPLEX.itersearch("3+4i 5.78- +3.14i 1. +2.i"):
                if c: results.append(c['re'] + c['op'] + c['im'] + 'i')
            self.assertEquals(results, ["3+4i", "5.78-+3.14i", "1.+2.i"])

        def testCharacterRanges(self):
            aRange = CHAR(('a','z'), 'C', '+', '\\', '\t', "F-H[]")
            self.assert_('c' in aRange)
            self.assert_('X' not in aRange)
            self.assert_('\t' in aRange, 'Does the raw tab in the char range get processed correctly?')
            self.assert_('\n' not in aRange)
            self.assert_('G' not in aRange)
            self.assert_('-' in aRange)
            self.assert_('[' in aRange and ']' in aRange)

        def testPrecedence(self):
            '''Tests to ensure that '+' functions correctly as having higher precedence than '|'.
            This test is necessary because rex pulls a few tricks to avoid creating too many
            nongrouping parentheses in patterns.'''
            pattern = PAT('a') + 'b' | 'c' + 'd'
            self.assert_('ab' in pattern)
            self.assert_('a' not in pattern)
            self.assert_('b' not in pattern)
            self.assert_('cd' in pattern)
            self.assert_('c' not in pattern)
            self.assert_('d' not in pattern)

            pattern2 = PAT('a') | 'b' + 'c' | 'd'
            self.assert_('bc' in pattern2)
            self.assert_('a' in pattern2)
            self.assert_('d' in pattern2)
            self.assert_('b' not in pattern2)
            self.assert_('c' not in pattern2)

        def testLookAheadBack1(self):
            '''Tests lookahead and lookback assertions.'''
            phrase1 = "12wordOne ()wordTwo_wordThree_wordFour wordFive! wordSix3 wordSeven"

            # For testing purposes, this definition of a 'word' isn't really a word.
            aWordPat1 = PAT(
                ~-(PAT.aDigit | PAT("_") | PAT.aLetter), # The character before the possible word can't be a digit, underscore, or letter.
                PAT.someLetters, # Find the word.
                (+(PAT.aPunctuationMark | PAT.aWhitespace | PAT.stringEnd)) # must be followed by something which will end a word.
                )
            self.assertEquals(
                list(aWordPat1.iterstrings(phrase1, matched=True)),
                ['wordTwo', 'wordFive', 'wordSeven']
                )

        def testLookAheadBack2(self):
            '''Tests lookahead and lookback assertions.'''
            phrase1 = "12wordOne ()wordTwo_wordThree_wordFour wordFive! wordSix3 wordSeven"

            # For testing purposes, this definition of a 'word' isn't really a word.
            aWordPat2 = PAT(
                -(~PAT.aLetter), # Check that the character before the first letter of a possible word isn't a letter...
                ~-PAT('_'), #...or an underscore.
                PAT.someLetters, # Find the word.
                ~+PAT.aLetter, # The following character can't be a letter.
                ~+(PAT("!")) # but only accept words not followed by an exclamation!
                )
            self.assertEquals(
                list(aWordPat2.iterstrings(phrase1, matched=True)),
                ['wordOne', 'wordTwo', 'wordSix', 'wordSeven']
                )

    if __name__=='__main__':
        unittest.main()
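
    For comparison, the second lookahead/lookback test can be written directly against the standard re module. This is a sketch in Python 3 of roughly what those rex expressions expand to, not output captured from rex; the names PHRASE, WORD and words are made up for the illustration.

```python
import re

PHRASE = "12wordOne ()wordTwo_wordThree_wordFour wordFive! wordSix3 wordSeven"

# Each piece is annotated with the rex expression it is meant to mirror.
WORD = re.compile(
    r"(?<=[^a-zA-Z])"   # -(~PAT.aLetter): previous character is not a letter
    r"(?<!_)"           # ~-PAT('_'): previous character is not an underscore
    r"[a-zA-Z]+"        # PAT.someLetters: the word itself
    r"(?![a-zA-Z])"     # ~+PAT.aLetter: next character is not a letter
    r"(?!!)"            # ~+PAT("!"): next character is not an exclamation mark
)

words = WORD.findall(PHRASE)
print(words)
```

    The five assertions line up one-to-one with the five arguments passed to PAT in testLookAheadBack2, which makes the operator spellings (-, ~-, ~+) easier to check against the RE syntax they stand for.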
     