compound regex

Discussion in 'Python' started by spir, Feb 9, 2009.

  1. spir

    spir Guest

    Hello,

    (new here)

    Below is an extension to the standard module re. The point is to allow writing and testing sub-expressions individually, then nesting them into a super-expression. It is more or less like using a parser generator, but keeps regex grammar and power.
    I used the format {sub_expr_name}: since in standard regexes {} is only used to express a repetition count, a pair of curly braces enclosing an identifier should not conflict.
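
The claim that {name} does not collide with repetition braces can be checked with a small sketch. It reuses the inclusion pattern from the listing (with the braces escaped for readability, which is an editorial choice and not part of the original code); the sample strings are made up for illustration.

```python
import re

# Same inclusion pattern as in the post: braces around an identifier.
inclusion = re.compile(r"\{[a-zA-Z_][a-zA-Z_0-9]*\}")

samples = [
    r"\S{5}",              # repetition count: not an inclusion
    r"a{2,3}",             # repetition range: not an inclusion
    r"x{,4}",              # open-ended range: not an inclusion
    "{time}",              # inclusion
    "#{code} --- {desc}",  # two inclusions
    "[{x}]",               # caveat: braces inside a character class
]

# Collect what the inclusion pattern picks up in each sample.
found = {s: inclusion.findall(s) for s in samples}
for s in samples:
    print("%-22r -> %r" % (s, found[s]))
```

Repetition braces always start with a digit or a comma, so they are never picked up. The last sample shows one case where the assumption does not hold: a brace-wrapped identifier inside a character class is still treated as an inclusion.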

    The extension is new and barely tested. I would welcome comments, criticism, etc. I would like to know if you find such a feature useful. You will probably find the code simple enough ;-)

    Denis
    ------
    la vida e estranya

    ===============
    # coding: utf-8

    ''' super_regex

    Define & check sub-patterns individually,
    then include them in global super-pattern.

    uses format {name} for inclusion:
        sub1 = Regex(...)
        sub2 = Regex(...)
        super_format = "...{sub1}...{sub2}..."
        # final regex object:
        super_regex = superRegex(super_format)
    '''

    from re import compile as Regex

    # sub-pattern inclusion format
    sub_pattern = Regex(r"{[a-zA-Z_][a-zA-Z_0-9]*}")

    # sub-pattern expander
    def sub_pattern_expansion(inclusion, dic=None):
        # strip the enclosing braces to get the sub-pattern name
        name = inclusion.group()[1:-1]
        ### namespace dict may be specified -- else globals()
        if dic is None:
            dic = globals()
        if name not in dic:
            raise NameError("Cannot find sub-pattern '%s'." % name)
        return dic[name].pattern

    # super-pattern generator
    def superRegex(format):
        # replace every {name} with the source of the named sub-pattern
        expanded_format = sub_pattern.sub(sub_pattern_expansion, format)
        return Regex(expanded_format)

    if __name__ == "__main__": # purely artificial example use
        # pattern
        time = Regex(r"\d\d:\d\d:\d\d") # hh:mm:ss
        code = Regex(r"\S{5}") # non-whitespace x 5
        desc = Regex(r"[\w\s]+$") # alphanum|space --> EOL
        ref_format = "^ref: {time} #{code} --- {desc}"
        ref_regex = superRegex(ref_format)
        # output (parenthesized print works in Python 2 and 3)
        print('super pattern:\n"%s" ==>\n"%s"\n' % (ref_format, ref_regex.pattern))
        text = "ref: 12:04:59 #%+.?% --- foo 987 bar"
        result = ref_regex.match(text)
        print('text: "%s" ==>\n"%s"' % (text, result.group()))
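
As posted, superRegex never passes a namespace to the expander, so sub-patterns are always looked up in the globals() of the module that defines it. A possible variation is sketched below under a few assumptions that are not in the original: the names INCLUSION, expand and super_regex are hypothetical, the namespace is an explicit dict, plain pattern strings are accepted alongside compiled patterns, and each expansion is wrapped in a non-capturing group so that alternation inside a sub-pattern stays local.

```python
import re
from functools import partial

# Inclusion format {name}; the name itself is captured in group 1.
INCLUSION = re.compile(r"\{([a-zA-Z_][a-zA-Z_0-9]*)\}")

def expand(match, namespace):
    """Return the pattern source of the sub-pattern named in the match."""
    name = match.group(1)
    if name not in namespace:
        raise NameError("Cannot find sub-pattern '%s'." % name)
    sub = namespace[name]
    # Accept compiled patterns as well as plain pattern strings.
    source = sub.pattern if hasattr(sub, "pattern") else sub
    # Non-capturing group keeps alternation and quantifiers local.
    return "(?:%s)" % source

def super_regex(fmt, namespace, flags=0):
    """Expand every {name} in fmt from namespace, then compile the result."""
    expander = partial(expand, namespace=namespace)
    return re.compile(INCLUSION.sub(expander, fmt), flags)

# Same artificial example as in the post, with an explicit namespace.
subs = {
    "time": re.compile(r"\d\d:\d\d:\d\d"),  # hh:mm:ss
    "code": re.compile(r"\S{5}"),           # non-whitespace x 5
    "desc": r"[\w\s]+$",                    # plain string also accepted
}
ref_regex = super_regex("^ref: {time} #{code} --- {desc}", subs)
result = ref_regex.match("ref: 12:04:59 #%+.?% --- foo 987 bar")
print(ref_regex.pattern)
print(result.group())
```

Because the replacement passed to sub() is a function, its return value is inserted as-is, so backslashes in the sub-patterns need no extra escaping.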
     