a regexp riddle: re.search(r'(?:(\w+), |and (\w+))+', 'whatever a,bbb, and c') =? ('a', 'bbb', 'c')

Phlip · Nov 25, 2010

HypoNt:

I need to turn a human-readable list into a list():

print re.search(r'(?:(\w+), |and (\w+))+', 'whatever a, bbb, and c').groups()

That currently returns ('c',). I'm trying to match "any word \w+
followed by a comma, or a final word preceded by and."

The match returns 'a, bbb, and c', but the groups return ('bbb', 'c').
What do I type for .groups() to also get the 'a'?

Please go easy on me (and no RTFM!), because I have only been using
regular expressions for about 20 years...
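For readers following along, here is a minimal sketch (written for Python 3, so print is a function) of the behaviour described above. A group inside a repeated construct keeps only the text from its last successful iteration, which is why 'a' is missing from .groups() even though the overall match covers it.

```python
import re

pattern = r'(?:(\w+), |and (\w+))+'
text = 'whatever a, bbb, and c'

m = re.search(pattern, text)

# The whole match spans every repetition of the outer group...
print(m.group(0))   # a, bbb, and c

# ...but each capturing group holds only its most recent capture.
print(m.groups())   # ('bbb', 'c')
```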

Alice Bevan–McGregor · Nov 25, 2010

Accepting input from a human is fraught with dangers and edge cases.

Some time ago I wrote a regular expression generator that creates
regexen that can parse arbitrarily delimited text, supports quoting (to
avoid accidentally separating two elements that should be treated as
one), and works in both directions (text<->native).

The code that generates the regex is heavily commented:

https://github.com/pulp/marrow.util/blob/master/marrow/util/convert.py#L123-234

You should be able to use this as-is and simply handle the optional 'and'
on the last element yourself. You can even create an instance of the
class with the options you want then get the generated regular
expression by running print(parser.pattern).

Note that I have friends who use 'and' multiple times when describing
lists of things.

-- Alice.

Alice Bevan–McGregor · Nov 25, 2010

Now that I think about it, 'and' can be stripped using a callback
function as the 'normalize' argument to my KeywordProcessor class:

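def normalize(value):
    value = value.strip()

    # Drop a leading 'and' only when it is a whole word, so that
    # values such as 'android' are left alone.
    if value.startswith("and "):
        value = value[4:].strip()

    return value

parser = KeywordProcessor(',', normalize=normalize, result=list)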

-- Alice.

python · Nov 25, 2010

Phlip,

I'm trying to match "any word \w+ followed by a comma, or a final word preceded by and."

Here's a non-regex solution that handles multi-word values and multiple
instances of 'and' (as pointed out by Alice). The posted code could be
simplified via list comprehension - I chose the more verbose method to
illustrate the logic.

def to_list( text ):
    text = text.replace( ' and ', ',' )
    output = list()
    for item in text.split( ',' ):
        if item:
            output.append( item.strip() )
    return output

test = 'cat, dog, big fish, goat and puppy and horse'

print to_list( test )

Outputs:

['cat', 'dog', 'big fish', 'goat', 'puppy', 'horse']

Malcolm
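The list-comprehension form mentioned above might look like the following sketch (Python 3 syntax). It is an illustrative rewrite rather than Malcolm's own code, and it differs in one small way: pieces that consist only of whitespace are dropped too, not just empty strings.

```python
def to_list(text):
    # Treat ' and ' as just another separator, then split on commas.
    parts = text.replace(' and ', ',').split(',')
    # Strip each piece and skip the empty ones left behind by ', and '.
    return [part.strip() for part in parts if part.strip()]

print(to_list('cat, dog, big fish, goat and puppy and horse'))
# ['cat', 'dog', 'big fish', 'goat', 'puppy', 'horse']
```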

Steve Holden · Nov 25, 2010

HypoNt:

I need to turn a human-readable list into a list():

print re.search(r'(?:(\w+), |and (\w+))+', 'whatever a, bbb, and c').groups()

That currently returns ('c',). I'm trying to match "any word \w+
followed by a comma, or a final word preceded by and."

The match returns 'a, bbb, and c', but the groups return ('bbb', 'c').
What do I type for .groups() to also get the 'a'?

Please go easy on me (and no RTFM!), because I have only been using
regular expressions for about 20 years...

A kind of lazy way just uses a pattern for the separators to fuel a call
to re.split(). I assume that " and " and " , " are both acceptable in
any position:

The best I've been able to do so far (due to split's annoying habit of
including the matches of any groups in the pattern I have to throw away
every second element) is:

re.split("\s*(,|and)?\s*", 'whatever a, bbb, and c')[::2]

['whatever', 'a', 'bbb', '', 'c']

That empty string is because of the ", and" which isn't recognised as a
single delimiter.

A parsing package might give you better results.

regards
Steve
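A note for anyone trying the re.split() approach on a current interpreter: since Python 3.7, re.split() also splits on empty matches, and the pattern above can match the empty string, so its output no longer looks like the list shown. The sketch below is a variant, not Steve's code. It assumes the same rules (commas, 'and', or plain whitespace separate words), uses a separator pattern that can never be empty, and treats ", and" as a single delimiter. The names SEPARATOR and split_words are made up for the illustration.

```python
import re

# Alternatives, tried in order:
#   a comma with optional whitespace and an optional following 'and',
#   a standalone 'and' surrounded by whitespace,
#   plain whitespace.
# No capturing groups, so re.split() returns only the pieces.
SEPARATOR = re.compile(r'\s*,\s*(?:and\s+)?|\s+and\s+|\s+')

def split_words(text):
    # Filter out empty strings from leading or trailing separators.
    return [piece for piece in SEPARATOR.split(text) if piece]

print(split_words('whatever a, bbb, and c'))
# ['whatever', 'a', 'bbb', 'c']
```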

MRAB · Nov 25, 2010

HypoNt:

I need to turn a human-readable list into a list():

print re.search(r'(?:(\w+), |and (\w+))+', 'whatever a, bbb, and c').groups()

That currently returns ('c',). I'm trying to match "any word \w+
followed by a comma, or a final word preceded by and."

The match returns 'a, bbb, and c', but the groups return ('bbb', 'c').
What do I type for .groups() to also get the 'a'?

Please go easy on me (and no RTFM!), because I have only been using
regular expressions for about 20 years...

Try re.findall:

>>> re.findall(r'(\w+), |and (\w+)', 'whatever a, bbb, and c')
[('a', ''), ('bbb', ''), ('', 'c')]

You can get a list of strings like this:

>>> [x or y for x, y in re.findall(r'(\w+), |and (\w+)', 'whatever a, bbb, and c')]
['a', 'bbb', 'c']

Phlip · Nov 25, 2010

Accepting input from a human is fraught with dangers and edge cases.

Here's a non-regex solution

Thanks all for playing! And as usual I forgot a critical detail:

I'm writing a matcher for a Morelia /viridis/ Scenario step, so the
matcher must be a single regexp.

http://c2.com/cgi/wiki?MoreliaViridis

I'm avoiding the current situation, where Morelia pulls out (.*), and
the step handler "manually" splits that up with:

flags = re.split(r', (?:and )?', flags)

That means I already had a brute-force version. A regexp version is
always better because, especially in Morelia, it validates input. (.*)
is less specific than (\w+).

So if the step says:

Alice has crypto keys apple, barley, and flax

Then the step handler could say (if this worked):

def step_user_has_crypto_keys_(self, user, *keys):
    r'(\w+) has crypto keys (?:(\w+), )+and (\w+)'

    # assert that user with those keys here

That does not work because "a capturing group only remembers the last
match". This would appear to be an irritating 'feature' in Regexp. The
total match is 'apple, barley, and flax', but the stored groups behave
as if each () were a slot, so (\w+)+ would not store "more than one
group". Unless there's a Regexp workaround to mean "arbitrary number
of slots for each ()", then I /might/ go with this:

got = re.findall(r'(?:(\w+), )?(?:(\w+), )?(?:(\w+), )?(?:(\w+), )?'
                 r'(?:(\w+), and )?(\w+)$', 'whatever a, bbb, and c')
print got # [('a', '', '', '', 'bbb', 'c')]

The trick is to simply paste in a high number of (?:(\w+), )?
segments, assuming that nobody should plug in too many. Behavior
Driven Development scenarios should be readable and not run-on.
(Morelia has a table feature for when you actually need lots of
arguments.)

Next question: Does re.search() return a match object that I can get
('a', '', '', '', 'bbb', 'c') out of? The calls to groups() and such
always return this crazy ('a', 2, 'bbb', 'c') thing that would disturb
my user-programmers.
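On the question of getting the same tuple out of a match object: Match.groups() accepts a default argument that is substituted for every group that did not participate in the match. The sketch below (Python 3 syntax) assumes a six-group pattern, matching the six slots in the tuple quoted above.

```python
import re

# Four optional 'word, ' slots, one optional 'word, and ' slot,
# and a mandatory final word anchored at the end of the string.
pattern = (r'(?:(\w+), )?(?:(\w+), )?(?:(\w+), )?(?:(\w+), )?'
           r'(?:(\w+), and )?(\w+)$')

m = re.search(pattern, 'whatever a, bbb, and c')

print(m.groups())     # ('a', None, None, None, 'bbb', 'c')
print(m.groups(''))   # ('a', '', '', '', 'bbb', 'c')

# Keep only the slots that actually captured something.
keys = [g for g in m.groups() if g]
print(keys)           # ['a', 'bbb', 'c']
```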

MRAB · Nov 25, 2010

Thanks all for playing! And as usual I forgot a critical detail:

I'm writing a matcher for a Morelia /viridis/ Scenario step, so the
matcher must be a single regexp.

http://c2.com/cgi/wiki?MoreliaViridis

I'm avoiding the current situation, where Morelia pulls out (.*), and
the step handler "manually" splits that up with:

flags = re.split(r', (?:and )?', flags)

That means I already had a brute-force version. A regexp version is
always better because, especially in Morelia, it validates input. (.*)
is less specific than (\w+).

So if the step says:

Alice has crypto keys apple, barley, and flax

Then the step handler could say (if this worked):

def step_user_has_crypto_keys_(self, user, *keys):
    r'(\w+) has crypto keys (?:(\w+), )+and (\w+)'

    # assert that user with those keys here

[snip]
You could do:

def step_user_has_crypto_keys_(self, user, keys):
    r'(\w+) has crypto keys ((?:\w+, )+and \w+)'

to validate and capture, and then split the keys string.
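The validate-then-split idea can be sketched outside Morelia as a plain function. Here parse_step and STEP are hypothetical names used only for this illustration and nothing below is Morelia's API. The split pattern is the one Phlip quoted earlier in the thread, and re.fullmatch (Python 3.4+) is used so that trailing junk is rejected.

```python
import re

# One group for the user, one group for the whole validated key list.
STEP = re.compile(r'(\w+) has crypto keys ((?:\w+, )+and \w+)')

def parse_step(line):
    m = STEP.fullmatch(line)
    if m is None:
        raise ValueError('step does not match: %r' % line)
    user, keys = m.groups()
    # The regex has already validated the list, so a simple split is safe.
    return user, re.split(r', (?:and )?', keys)

print(parse_step('Alice has crypto keys apple, barley, and flax'))
# ('Alice', ['apple', 'barley', 'flax'])
```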

Aahz · Nov 26, 2010

Thanks all for playing! And as usual I forgot a critical detail:

I'm writing a matcher for a Morelia /viridis/ Scenario step, so the
matcher must be a single regexp.

Why? (You're apparently the author of Morelia, but I don't really
understand it.)
