ANN: 'rex' 0.5, a module for easier creation and use of regular expressions.

Discussion in 'Python' started by bp, Jun 27, 2004.


    NOTE: This is my second attempt at posting this; as far as I can tell,
    the first posting never appeared on the newsgroups. If you've already
    seen this, sorry for the repeat.

    WHAT IS 'rex'?

    rex is a module which makes the creation and use of regular expressions,
    IMHO, much easier than is the case using the 're' module. It does this
    by subclassing strings to provide a special regular-expression-type
    string, and then defining operations on that class. This permits
    building regular expressions without the use of arcane syntax, and
    with the help of Python edit modes and the Python syntax checker.
    It also makes regular expressions far, far easier to use.

    WHERE TO GET IT?

    http://homepage.mac.com/ken.mcdonald/FileSharing2.html


    WHAT ELSE?

    I'd appreciate feedback and suggestions. Code contributions
    are welcome too. I don't think I've mentioned the license
    in the 0.5 docs, but it will change from a custom license
    in the previous version to a PSA or similar standard license
    in the final version.

    Appended below is the (current) documentation, which could be
    better, but isn't too bad. It's missing function/method reference
    documentation, but at least shows you what rex can do. The next version
    will have substantially more complete documentation. My editor
    'helpfully' autowrapped the documentation--it's easier to read in the
    copy included in the module.




    rex: A Module to Provide a Better Interface to Regular Expressions.
    ===================================================================

    'rex' provides a much better interface for creating and using regular
    expressions. It is built on top of, and intended as a functional
    replacement for, the Python 're' module.

    Introduction
    ============

    'rex' stands for any of (your choice):

    - Regular Expression eXtensions

    - Regular Expressions eXpanded

    - Rex, King of Regular Expressions (ha, ha ha, ha).


    rex provides a completely different way of writing regular expressions
    (REs). You do not use strings to write any part of the RE _except_ for
    regular expression literals. No escape characters, metacharacters, etc.
    Regular expression operations, such as repetition, alternation,
    concatenation, etc., are done via Python operators, methods, or
    functions.

    The major advantages of rex are:

    - [This is a biggie.] rex permits complex REs to be built up easily
      out of smaller parts. In fact, a rex definition for a complex RE is
      likely to end up looking somewhat like a mini grammar.

    - [Another biggie.] As an ancillary to the above, rex permits REs to
      be easily reused.

    - rex expressions are checked for well-formedness by the Python
      parser; this will typically provide earlier and easier-to-understand
      diagnoses of syntactically malformed regular expressions.

    - rex expressions are all strings! They are, in fact, a specialized
      subclass of strings, which means you can pass them to existing code
      which expects REs.

    - rex goes to some lengths to produce REs which are similar to those
      written by hand, i.e. it tries to avoid unnecessary use of
      nongrouping parentheses, uses special escape sequences where
      possible, writes 'A?' instead of 'A{0,1}', etc. In general, rex
      tries to produce concise REs, on the theory that if you really need
      to read the buggers at some point, it's easier to read simpler ones
      than more complex ones.
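
    The idea of a string subclass whose Python operators build up an RE can
    be sketched in a few lines. This is a toy illustration only, not rex's
    actual code: the class name Rexp, the helper _as_rexp, and the values
    given to ANY, SOME, and MAYBE are all assumptions made for the example.

```python
import re

# Quantifier constants, named after the ones the text mentions (toy values).
ANY, SOME, MAYBE = "*", "+", "?"


class Rexp(str):
    """Toy regex-building string. Illustrative only; not rex's real code."""

    def _grouped(self):
        # Wrap in a non-capturing group so operators apply to the whole part.
        return "(?:" + str(self) + ")"

    def __add__(self, other):
        # Concatenation: self followed by other.
        return Rexp(self._grouped() + _as_rexp(other)._grouped())

    def __or__(self, other):
        # Alternation: self or other.
        return Rexp(self._grouped() + "|" + _as_rexp(other)._grouped())

    def __mul__(self, quantifier):
        # Repetition using one of the quantifier constants above.
        return Rexp(self._grouped() + quantifier)

    def __getitem__(self, name):
        if not isinstance(name, str):
            # Keep ordinary indexing and slicing working.
            return str.__getitem__(self, name)
        # A string key wraps the pattern in a named group.
        return Rexp("(?P<" + name + ">" + str(self) + ")")


def _as_rexp(value):
    # Plain strings are treated as literals and escaped.
    return value if isinstance(value, Rexp) else Rexp(re.escape(value))


digit = Rexp(r"\d")
number = (digit * SOME)["num"]
greeting = Rexp("hello") | "goodbye"
pattern = greeting + " " + number

match = re.fullmatch(str(pattern), "goodbye 42")
print(match.group("num"))  # prints: 42
```

    Unlike rex as described above, this toy makes no attempt to keep the
    generated RE concise; it wraps every operand in a nongrouping
    parenthesis to stay correct.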


    As an example, take a look at the definition of an RE matching a
    complex number, an example included in test_rex.py. The rex Python code
    to do this is:

    COMPLEX = (
        PAT.aFloat['re']
        + PAT.anyWhitespace
        + ALT("+", "-")['op']
        + PAT.anyWhitespace
        + PAT.aFloat['im']
        + 'i'
    )


    while the analogous RE is:


    (?P<re>(?:\+|\-)?\d+(?:.\d*)?)\s*(?P<op>\+|\-)\s*(?P<im>(?:\+|\-)?\d+(?:.\d*)?)i


    The rex code is more verbose than the simple RE (which, by the way, was
    the RE generated by the rex code, and is pretty much what you'd produce
    by hand). It is also FAR easier to read, modify, and debug. And, it
    illustrates how easy it is to reuse rex patterns: PAT.aFloat and
    PAT.anyWhitespace are predefined patterns provided in rex which match,
    respectively, a string representation of a floating point number (no
    exponent), and a sequence of zero or more whitespace characters.
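
    To see what the named groups in the generated RE give you, the RE
    quoted above can be fed straight to the standard 're' module. The
    variable name COMPLEX_RE and the sample input are assumptions made for
    this illustration.

```python
import re

# The generated RE quoted above, split across two literals for readability.
COMPLEX_RE = (
    r"(?P<re>(?:\+|\-)?\d+(?:.\d*)?)\s*(?P<op>\+|\-)\s*"
    r"(?P<im>(?:\+|\-)?\d+(?:.\d*)?)i"
)

match = re.match(COMPLEX_RE, "1.5 + 2.0i")

# Each named group corresponds to one ['name'] subscript in the rex code.
print(match.group("re"), match.group("op"), match.group("im"))
# prints: 1.5 + 2.0
```

    Note that the '.' in this RE is unescaped, so it matches any character
    rather than only a decimal point.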

    Using rex
    =========

    This is a quick overview of how to use rex. See documentation
    associated with a specific method/function/name for details on that
    entity.

    In the following, we use the abbreviation RE to refer to standard
    regular expressions defined as strings, and the word 'rexp' to refer to
    rex objects which denote regular expressions.

    The starting point for building a rexp is either rex.PAT, which we'll
    just call PAT, or rex.CHAR, which we'll just call CHAR. CHAR builds
    rexps which match single character strings. PAT builds rexps which
    match strings of varying lengths.

    - PAT(string) returns a rexp which will match exactly the string
      given, and nothing else.

    - PAT._someattribute_ returns (for defined attributes) a corresponding
      rexp. For example, PAT.aDigit returns a rexp matching a single digit.

    - CHAR(a1, a2, ...) returns a rexp matching a single character from a
      set of characters defined by its arguments. For example,
      CHAR("-", ["0","9"], ".") matches the characters necessary to build
      basic floating point numbers. See CHAR docs for details.
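
    In terms of the standard 're' module, "match exactly the string given"
    amounts to escaping every metacharacter in the string. The sketch below
    shows that idea with re.escape; the helper name literal is hypothetical
    and stands in for PAT(string), and the character class is a
    hand-written guess at what the CHAR example denotes, not rex output.

```python
import re


def literal(text):
    """Hypothetical stand-in for PAT(string): an RE matching exactly `text`."""
    return re.escape(text)


price = literal("$9.99")

# The '$' and '.' are matched literally, not as metacharacters.
assert re.fullmatch(price, "$9.99")
assert re.fullmatch(price, "$9x99") is None

# Hand-written equivalent of CHAR("-", ["0","9"], "."): one character
# that is a hyphen, a digit, or a period.
float_char = r"[\-0-9.]"
assert re.fullmatch(float_char + "+", "-3.14")
assert re.fullmatch(float_char, "e") is None
```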


    Now assume that A, B, C,... are rexps. The following Python expressions
    (_not_ strings) may be used to build more complex rexps:

    - A | B | C ... : returns a rexp which matches a string if any of the
      operands match that string. Similar to "A|B|C" in normal REs, except
      of course you can't use Python code to define a normal RE.

    - A + B + C ... : returns a rexp which matches a string if all of A,
      B, C match consecutive substrings of the string in succession. Like
      "ABC" in normal REs.

    - A*n : returns a rexp which matches A repeated a number of times as
      defined by n. This replaces '?', '+', and '*' as used in normal REs.
      See docs for details. 'rex' defines constants which allow you to say
      A*ANY, A*SOME, or A*MAYBE, indicating (0 or more matches), (1 or
      more matches), or (0 or 1 matches), respectively.

    - A**n : Like A*n, but does nongreedy matching.

    - +A : positive lookahead assertion: matches if A matches, but doesn't
      consume any of the input.

    - ~+A : negative lookahead assertion: matches if A _doesn't_ match,
      but doesn't consume any of the input.

    - -A, ~-A : positive and negative lookback (lookbehind) assertions.
      Like lookahead assertions, but in the other direction.

    - A[name] : name must be a string: anything matched by A can be
      referred to by the given name in the match result object. (This is
      the equivalent of named groups in the re module).

    - A.group() : A will be in an unnamed group, referable by number.
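
    For readers who know the 're' syntax, the operators above correspond
    roughly to the constructs below. The mapping in the comments follows
    the descriptions in this list; the exact REs rex generates may differ,
    and the sample strings are made up for the illustration.

```python
import re

# A*SOME vs. A**SOME: greedy vs. nongreedy repetition.
assert re.match(r"a+", "aaa").group() == "aaa"
assert re.match(r"a+?", "aaa").group() == "a"

# +A: positive lookahead. "bar" must follow but is not consumed.
assert re.match(r"foo(?=bar)", "foobar").group() == "foo"

# ~+A: negative lookahead. Matches only if "bar" does not follow.
assert re.match(r"foo(?!bar)", "foobar") is None
assert re.match(r"foo(?!bar)", "foobaz").group() == "foo"

# -A: positive lookbehind. The digits must be preceded by "$".
assert re.search(r"(?<=\$)\d+", "cost: $42").group() == "42"

# ~-A: negative lookbehind. The digits must not be preceded by "$".
assert re.search(r"(?<!\$)\b\d+", "cost: $42, qty 7").group() == "7"

# A[name] and A.group(): named and numbered groups.
m = re.match(r"(?P<word>\w+) (\d+)", "qty 7")
assert m.group("word") == "qty"
assert m.group(2) == "7"
```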


    In addition, a few other operations can be done:

    - Some of the attributes defined in PAT have "natural inverses"; for
      such attributes, the inverse may be taken. For example, ~PAT.aDigit
      is a pattern matching any character except a digit.

    - Character classes may be inverted: ~CHAR("aeiouAEIOU") returns a
      pattern matching anything except a vowel.

    - 'ALT' gives a different way to denote alternation: ALT(A, B, C,...)
      does the same thing as A | B | C | ..., except that none of the
      arguments to ALT need be rexps; any which are normal strings will be
      converted to a rexp using PAT.

    - 'PAT' can take multiple arguments: PAT(A, B, C,...), which gives the
      same result as PAT(A) + PAT(B) + PAT(C) + ... .


    Finally, a very convenient shortcut is that only the first object in a
    sequence of operator/method calls needs to be a rexp; all others will
    be automatically converted as if PAT(...) had been called on them. For
    example, the sequence A | "hello" is the same as A | PAT("hello").

    rex Character Classes
    =====================

    CHAR(args...) defines a character class. Arguments are any number of
    strings or two-tuples/two-element lists, e.g.

    CHAR("ab-z")


    is the same as the regular expression r"[ab\-z]". NOTE that there are no
    'character range metacharacters'; the preceding defines a character
    class containing four characters, one of which is a '-'.

    This is a character class containing a backslash, hyphen, and open/close
    brackets:

    CHAR(r"\-[]")


    or

    CHAR("\\-[]")


    Note that we still need to use raw strings to turn off normal Python
    string escaping.

    To define ranges, do this :

    CHAR(["a","z"], ["A","Z"])


    To define inverse ranges, use the ~ operator. For example, to define
    the class of all non-numeric characters:

    ~CHAR(["0","9"])


    Character classes cannot (yet) be doubly negated: ~~CHAR("A") is an
    error.
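
    The CHAR rules above (every character in a string argument is literal,
    two-element arguments denote ranges, ~ negates) can be mimicked with a
    small helper on top of 're'. This is a sketch under assumptions, not
    rex's implementation: the function name char_class and its negate
    parameter are invented for the example.

```python
import re


def char_class(*args, negate=False):
    """Build a character-class RE from literal strings and (lo, hi) pairs."""
    parts = []
    for arg in args:
        if isinstance(arg, str):
            # Every character is a literal, including '-', '[' and ']'.
            parts.append("".join(re.escape(ch) for ch in arg))
        else:
            # A two-element list or tuple denotes an inclusive range.
            lo, hi = arg
            parts.append(re.escape(lo) + "-" + re.escape(hi))
    return "[" + ("^" if negate else "") + "".join(parts) + "]"


four_chars = char_class("ab-z")                   # like CHAR("ab-z")
letters = char_class(["a", "z"], ["A", "Z"])      # like CHAR(["a","z"], ["A","Z"])
non_digit = char_class(["0", "9"], negate=True)   # like ~CHAR(["0","9"])
specials = char_class("\\-[]")                    # backslash, hyphen, brackets

assert re.fullmatch(four_chars, "-")
assert re.fullmatch(four_chars, "c") is None      # no a-z range was created
assert re.fullmatch(non_digit, "x")
assert all(re.fullmatch(specials, ch) for ch in "\\-[]")
```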

    Predefined Constants
    ====================

    rex provides a number of predefined patterns which will likely be of
    use in common cases. Generally speaking, rex constant pattern names
    begin with 'a' or 'an' (indicating a pattern that matches a single
    instance), 'any' (indicating a pattern that matches 0 or more
    instances), 'some' (indicating a pattern that matches 1 or more
    instances), or 'optional' (meaning the pattern matches 0 or 1
    instance). Some special names are also provided.

    The 'rex' module may define other constant names, but you should only
    use those below; others may change in future releases of rex.

    - Matches any character: aChar, someChars, anyChars

    - Matches digits (0-9): aDigit, someDigits, anyDigits

    - Matches whitespace characters: aWhitespace, someWhitespace,
      anyWhitespace

    - Matches letters (a-z, A-Z): aLetter, someLetters, anyLetters

    - Numeric values (signed or unsigned, no exponent): anInt, aFloat

    - Match only the start or end of the string: stringStart, stringEnd

    - Match only at a word border: wordBorder

    - Matches the emptyString: emptyString

    - Any punctuation (non-whitespace) character on a standard US
      keyboard: aPunctuationMark
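
    The naming convention can be read as a choice of quantifier applied to
    a base pattern. The sketch below spells that out with plain 're'
    strings. It is an assumption-laden illustration: the function name
    variants and the base REs are guesses, and only the pairing of
    anyWhitespace with \s* is confirmed by the complex-number example
    earlier in this documentation.

```python
import re


def variants(base):
    """Map the naming prefixes to quantified forms of a base RE (a guess)."""
    return {
        "a": base,               # exactly one instance
        "some": base + "+",      # one or more instances
        "any": base + "*",       # zero or more instances
        "optional": base + "?",  # zero or one instance
    }


digits = variants(r"\d")
whitespace = variants(r"\s")

assert re.fullmatch(digits["a"], "7")
assert re.fullmatch(digits["some"], "2004")
assert re.fullmatch(digits["some"], "") is None
assert re.fullmatch(whitespace["any"], "")
assert whitespace["any"] == r"\s*"
```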