a regexp riddle: re.search(r'(?:(\w+), |and (\w+))+', 'whatever a,bbb, and c') =? ('a', 'bbb', 'c')

Phlip

HypoNt:

I need to turn a human-readable list into a list():

print(re.search(r'(?:(\w+), |and (\w+))+', 'whatever a, bbb, and c').groups())

That currently returns ('c',). I'm trying to match "any word \w+
followed by a comma, or a final word preceded by and."

The match returns 'a, bbb, and c', but the groups return ('bbb', 'c').
What do I type for .groups() to also get the 'a'?

Please go easy on me (and no RTFM!), because I have only been using
regular expressions for about 20 years...
 
Alice Bevan–McGregor

Accepting input from a human is fraught with dangers and edge cases. ;)

Some time ago I wrote a regular expression generator that creates
regexen that can parse arbitrarily delimited text, supports quoting (to
avoid accidentally separating two elements that should be treated as
one), and works in both directions (text<->native).

The code that generates the regex is heavily commented:

https://github.com/pulp/marrow.util/blob/master/marrow/util/convert.py#L123-234

You should be able to use this as-is and simply handle the optional 'and'
on the last element yourself. You can even create an instance of the
class with the options you want then get the generated regular
expression by running print(parser.pattern).

Note that I have friends who use 'and' multiple times when describing
lists of things. :p

— Alice.
 
Alice Bevan–McGregor

Now that I think about it, 'and' can be stripped using a callback
function as the 'normalize' argument to my KeywordProcessor class:

def normalize(value):
    value = value.strip()

    # Strip a leading "and " only as a whole word, so "andrew" survives.
    if value.startswith("and "):
        value = value[4:].strip()

    return value

parser = KeywordProcessor(',', normalize=normalize, result=list)

— Alice.
 
python

Phlip,
I'm trying to match "any word \w+ followed by a comma, or a final word preceded by and."

Here's a non-regex solution that handles multi-word values and multiple
instances of 'and' (as pointed out by Alice). The posted code could be
simplified via list comprehension - I chose the more verbose method to
illustrate the logic.

def to_list( text ):
    # Treat ' and ' as just another comma, then split on commas.
    text = text.replace( ' and ', ',' )
    output = list()
    for item in text.split( ',' ):
        if item:
            output.append( item.strip() )
    return output

test = 'cat, dog, big fish, goat and puppy and horse'

print( to_list( test ) )

Outputs:

['cat', 'dog', 'big fish', 'goat', 'puppy', 'horse']
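
The list-comprehension form mentioned above might look like the following. This is only a sketch of the same replace-then-split logic; it additionally drops items that are empty after stripping, which the loop version does not do.

```python
def to_list(text):
    # Treat ' and ' as a comma, split, and keep only non-blank items.
    parts = text.replace(' and ', ',').split(',')
    return [part.strip() for part in parts if part.strip()]


print(to_list('cat, dog, big fish, goat and puppy and horse'))
```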

Malcolm
 
Steve Holden

A kind of lazy way just uses a pattern for the separators to fuel a call
to re.split(). I assume that " and " and " , " are both acceptable in
any position:

The best I've been able to do so far (due to split's annoying habit of
including the matches of any groups in the pattern I have to throw away
every second element) is:
>>> re.split(r"\s*(,|and)?\s*", 'whatever a, bbb, and c')[::2]
['whatever', 'a', 'bbb', '', 'c']

That empty string is because of the ", and" which isn't recognised as a
single delimiter.
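
One way to avoid both the empty string and the [::2] slice is a delimiter pattern with only non-capturing groups that treats ", and" as one separator. This is a sketch under assumptions, not a tested replacement: the function name split_items is made up for illustration, and unlike the pattern above it does not split on bare whitespace.

```python
import re

# A comma, optionally followed by "and", or a bare " and " between words.
# Only non-capturing groups are used, so re.split() returns no separators.
DELIMITER = re.compile(r'\s*,\s*(?:and\s+)?|\s+and\s+')


def split_items(text):
    # split_items is a hypothetical helper name, not part of any library.
    return DELIMITER.split(text)


print(split_items('a, bbb, and c'))
print(split_items('cat and dog'))
```

Because the pattern always consumes at least a comma or the word "and", it can never match the empty string, so the result should not depend on how a given Python version treats empty matches in re.split().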

A parsing package might give you better results.

regards
Steve
 
MRAB

Try re.findall:
>>> re.findall(r'(\w+), |and (\w+)', 'whatever a, bbb, and c')
[('a', ''), ('bbb', ''), ('', 'c')]

You can get a list of strings like this:
>>> [x or y for x, y in re.findall(r'(\w+), |and (\w+)', 'whatever a, bbb, and c')]
['a', 'bbb', 'c']
 
Phlip

Accepting input from a human is fraught with dangers and edge cases.
Here's a non-regex solution

Thanks all for playing! And as usual I forgot a critical detail:

I'm writing a matcher for a Morelia /viridis/ Scenario step, so the
matcher must be a single regexp.

http://c2.com/cgi/wiki?MoreliaViridis

I'm avoiding the current situation, where Morelia pulls out (.*), and
the step handler "manually" splits that up with:

flags = re.split(r', (?:and )?', flags)

That means I already had a brute-force version. A regexp version is
always better because, especially in Morelia, it validates input. (.*)
is less specific than (\w+).

So if the step says:

Alice has crypto keys apple, barley, and flax

Then the step handler could say (if this worked):

def step_user_has_crypto_keys_(self, user, *keys):
    r'(\w+) has crypto keys (?:(\w+), )+and (\w+)'

    # assert that user with those keys here

That does not work because "a capturing group only remembers the last
match". This would appear to be an irritating 'feature' in Regexp. The
total match is 'apple, barley, and flax', but the stored groups behave
as if each () were a slot, so (\w+)+ would not store "more than one
group". Unless there's a Regexp workaround to mean "arbitrary number
of slots for each ()", then I /might/ go with this:
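
The "last match only" behavior is easy to reproduce in isolation with the standard re module. The snippet below is a minimal illustration using the example sentence from this post; the variable names are arbitrary.

```python
import re

m = re.search(r'(?:(\w+), )+and (\w+)', 'apple, barley, and flax')

whole = m.group(0)    # the full matched text
groups = m.groups()   # one slot per (...) in the pattern, not per repetition

print(whole)
print(groups)
```

Group 1 matched twice ('apple', then 'barley'), but only the final repetition is kept, so 'apple' is lost. The third-party regex module is understood to expose every repetition through a captures() method, though that would not help where a plain re pattern is required.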

got = re.findall(r'(?:(\w+), )?(?:(\w+), )?(?:(\w+), )?(?:(\w+), )?'
                 r'(?:(\w+), and )?(\w+)$', 'whatever a, bbb, and c')
print(got)  # [('a', '', '', '', 'bbb', 'c')]

The trick is to simply paste in a high number of (?:(\w+), )?
segments, assuming that nobody should plug in too many. Behavior
Driven Development scenarios should be readable and not run-on.
(Morelia has a table feature for when you actually need lots of
arguments.)

Next question: Does re.search() return a match object that I can get
('a', '', '', '', 'bbb', 'c') out of? The calls to groups() and such
always return this crazy ('a', 2, 'bbb', 'c') thing that would disturb
my user-programmers.
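
For what it's worth, a match object's groups() method accepts a default argument that is substituted for every group that did not participate in the match, which gives the same empty-string padding that findall produces. A short sketch with the padded pattern from above (PATTERN, padded, and keys are just illustrative names):

```python
import re

PATTERN = (r'(?:(\w+), )?(?:(\w+), )?(?:(\w+), )?(?:(\w+), )?'
           r'(?:(\w+), and )?(\w+)$')

m = re.search(PATTERN, 'whatever a, bbb, and c')

raw = m.groups()               # unmatched groups come back as None
padded = m.groups(default='')  # unmatched groups come back as ''
keys = [k for k in padded if k]  # drop the empty slots

print(raw)
print(padded)
print(keys)
```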
 
MRAB

You could do:

def step_user_has_crypto_keys_(self, user, keys):
    r'(\w+) has crypto keys ((?:\w+, )+and \w+)'

to validate and capture, and then split the keys string.
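
Outside of Morelia, the validate-then-split idea can be sketched with plain re calls. The parse_step function below is hypothetical and stands in for whatever the step handler does; it reuses the split pattern Phlip already posted.

```python
import re

# The outer group validates and captures the whole list as one string.
STEP = re.compile(r'(\w+) has crypto keys ((?:\w+, )+and \w+)')


def parse_step(line):
    # parse_step is an illustrative name, not a Morelia API.
    m = STEP.fullmatch(line)
    if m is None:
        raise ValueError('step does not match: %r' % line)
    user, keys = m.groups()
    # Split the validated list on ", " with an optional "and ".
    return user, re.split(r', (?:and )?', keys)


print(parse_step('Alice has crypto keys apple, barley, and flax'))
```

Since the list has already been validated by the outer group, the split cannot produce surprises such as empty items.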
 
Aahz

Thanks all for playing! And as usual I forgot a critical detail:

I'm writing a matcher for a Morelia /viridis/ Scenario step, so the
matcher must be a single regexp.

Why? (You're apparently the author of Morelia, but I don't really
understand it.)
 