Get all strings matching given RegExp

A

Alex9968

Can I get sequence of all strings that can match a given regular
expression? For example, for expression '(a|b)|(x|y)' it would be ['ax',
'ay', 'bx', 'by']

It would be useful for example to pass these strings to a search engine
not supporting RegExp (therefore adding such support to it). A program
can also let user specify sequence of strings using RegExp (filenames to
process, etc.). If there are other types of expressions for these
purposes, please let me know.

I know that for some expressions there would be infinite amount of
matching strings, but these aren't the cases I'm considering. It'd still
be possible if string length is limited (there might be large but finite
number of matching strings).

Thanks
 
M

Marc Christiansen

Alex9968 said:
Can I get sequence of all strings that can match a given regular
expression? For example, for expression '(a|b)|(x|y)' it would be ['ax',
'ay', 'bx', 'by']

It would be useful for example to pass these strings to a search engine
not supporting RegExp (therefore adding such support to it). A program
can also let user specify sequence of strings using RegExp (filenames to
process, etc.). If there are other types of expressions for these
purposes, please let me know.

I know that for some expressions there would be infinite amount of
matching strings, but these aren't the cases I'm considering. It'd still
be possible if string length is limited (there might be large but finite
number of matching strings).

This will give you all (byte-)strings upto a given length which match a
given regular expression. But beware, it can be slow ;)

import re

all_chars = [chr(i) for i in xrange(256)]

def gen_strings(length, alphabet=all_chars):
if length == 1:
for c in alphabet:
yield c
else:
for i in alphabet:
yield c
for s in gen_strings(length - 1, alphabet):
yield c + s

def regex_matches(regex, max_length, alphabet=all_chars):
r = re.compile('(' + regex + r')\Z')
return (s for s in gen_strings(max_length, alphabet) if r.match(s))

Marc
 
J

Jeff

I don't think there is any built in way. Regular expressions are
compiled into an expanded pattern internally, but I don't think that
it is anything that would be useful for you to directly access.

If you are interested in a lot of work, you could do something with
PLY and write an re parser that would expand it into a series of
possible textual matches :)
 
P

Paul McGuire

Can I get sequence of all strings that can match a given regular
expression? For example, for expression '(a|b)|(x|y)' it would be ['ax',
'ay', 'bx', 'by']

Actually, this regex will only match 'a', 'b', 'x', or 'y' (assuming
you meant to bracket it with leading '^' and trailing '$'). To get
the set you listed, you would need to invert '(a|b)(x|y)'.

The thread that Gabriel Genellina referenced includes a link to a
pyparsing regex inverter (http://pyparsing-public.wikispaces.com/space/
showimage/invRegex.py), which I used to test your regex - I've added
both regexes in the test cases of that script. Some of the other test
cases include regexes for all US time zones, and all chemical symbols.

-- Paul
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,754
Messages
2,569,525
Members
44,997
Latest member
mileyka

Latest Threads

Top