Get all strings matching given RegExp

Discussion in 'Python' started by Alex9968, Apr 3, 2008.

  1. Alex9968

    Alex9968 Guest

    Can I get sequence of all strings that can match a given regular
    expression? For example, for expression '(a|b)|(x|y)' it would be ['ax',
    'ay', 'bx', 'by']

    It would be useful for example to pass these strings to a search engine
    not supporting RegExp (therefore adding such support to it). A program
    can also let user specify sequence of strings using RegExp (filenames to
    process, etc.). If there are other types of expressions for these
    purposes, please let me know.

    I know that for some expressions there would be infinite amount of
    matching strings, but these aren't the cases I'm considering. It'd still
    be possible if string length is limited (there might be large but finite
    number of matching strings).

    Alex9968, Apr 3, 2008
    1. Advertisements

  2. This will give you all (byte-)strings upto a given length which match a
    given regular expression. But beware, it can be slow ;)

    import re

    all_chars = [chr(i) for i in xrange(256)]

    def gen_strings(length, alphabet=all_chars):
    if length == 1:
    for c in alphabet:
    yield c
    for i in alphabet:
    yield c
    for s in gen_strings(length - 1, alphabet):
    yield c + s

    def regex_matches(regex, max_length, alphabet=all_chars):
    r = re.compile('(' + regex + r')\Z')
    return (s for s in gen_strings(max_length, alphabet) if r.match(s))

    Marc Christiansen, Apr 3, 2008
    1. Advertisements

  3. Alex9968

    Jeff Guest

    I don't think there is any built in way. Regular expressions are
    compiled into an expanded pattern internally, but I don't think that
    it is anything that would be useful for you to directly access.

    If you are interested in a lot of work, you could do something with
    PLY and write an re parser that would expand it into a series of
    possible textual matches :)
    Jeff, Apr 3, 2008
  4. Alex9968

    Paul McGuire Guest

    Actually, this regex will only match 'a', 'b', 'x', or 'y' (assuming
    you meant to bracket it with leading '^' and trailing '$'). To get
    the set you listed, you would need to invert '(a|b)(x|y)'.

    The thread that Gabriel Genellina referenced includes a link to a
    pyparsing regex inverter (
    showimage/, which I used to test your regex - I've added
    both regexes in the test cases of that script. Some of the other test
    cases include regexes for all US time zones, and all chemical symbols.

    -- Paul
    Paul McGuire, Apr 3, 2008
    1. Advertisements

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments (here). After that, you can post your question and our members will help you out.