Make RE more clever to avoid inappropriate sre_constants.error: redefinition of group name

aspineux

I want to parse

'foo@bar' or '<foo@bar>' and get the email address foo@bar

the regex is

r'<\w+@\w+>|\w+@\w+'

now, I want to give it a name

r'<(?P<email>\w+@\w+)>|(?P<email>\w+@\w+)'

sre_constants.error: redefinition of group name 'email' as group 2; was group 1

BUT because I use a | , I will get only one group named 'email' !

Any comment ?

PS: I know the solution for this case is to use
r'(?P<lt><)?(?P<email>\w+@\w+)(?(lt)>)'
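
For anyone who wants to reproduce this, here is a minimal sketch using the standard re module. The names compiles and extract_email are made up for illustration, and the sketch assumes a current Python 3, where the exception is exposed as re.error.

```python
import re

# The pattern from the question: the same group name on both sides of "|".
DUPLICATE = r'<(?P<email>\w+@\w+)>|(?P<email>\w+@\w+)'
# The conditional-group workaround from the PS.
CONDITIONAL = r'(?P<lt><)?(?P<email>\w+@\w+)(?(lt)>)'

def compiles(pattern):
    """Return True if re accepts the pattern, False if it raises re.error."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True

def extract_email(text):
    """Hypothetical helper: return the address found by CONDITIONAL, or None."""
    m = re.search(CONDITIONAL, text)
    return m.group('email') if m else None
```

Here compiles(DUPLICATE) reports the rejection described above, while extract_email handles both the bare and the bracketed form.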
 
attn.steven.kuo

Regular expressions, alternation, named groups ... oh my!

It tends to get quite complex, especially if you need
to reject cases where the string contains a left bracket
but not the right, or vice versa.
...     matched = pattern.search(email)
...     if matched is not None:
...         print(matched.group('email'))
...
foo@bar
<foo@bar>


I suggest you try some other solution (maybe pyparsing).
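
As a rough illustration of the "reject unbalanced brackets" case, one possible sketch combines the conditional-group pattern from the original post with re.fullmatch, so that the whole string has to be consumed. The names STRICT and parse_strict are hypothetical, and this is only one way to do it, not necessarily what was meant above.

```python
import re

# Conditional group: the closing ">" is required only if "<" was matched.
STRICT = re.compile(r'(?P<lt><)?(?P<email>\w+@\w+)(?(lt)>)')

def parse_strict(text):
    """Return the address if the whole string is well formed, else None."""
    m = STRICT.fullmatch(text)  # fullmatch: no leftover characters allowed
    return m.group('email') if m else None
```

With search instead of fullmatch, a string such as '<foo@bar' would still yield a match starting after the bracket, which is why the sketch anchors the match to the whole string.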
 
aspineux

I want to parse

'foo@bar' or '<foo@bar>' and get the email address foo@bar

the regex is

r'<\w+@\w+>|\w+@\w+'

now, if I want to give it a name

r'<(?P<email>\w+@\w+)>|(?P<email>\w+@\w+)'

sre_constants.error: redefinition of group name 'email' as group 2; was group 1

BUT because I use a | , I will get only one group named 'email' !

THEN my regex is meaningful, the error is meaningless, and
something should be changed in 're'.

But maybe I'm wrong ?
Any comment ?

I'm trying to start a discussion about something that can be improved
in 're', not looking for a solution to email parsing :)
 
Paddy

Use two group names, one for each alternative, and if you don't
care which one matched, do something like the following:

>>> import re
>>> s1 = 'foo@bare'
>>> s2 = '<foo@bare>'
>>> matchobj = re.search(r'<(?P<email1>\w+@\w+)>|(?P<email2>\w+@\w+)', s1)
>>> matchobj.groupdict()['email1'] or matchobj.groupdict()['email2']
'foo@bare'
>>> matchobj = re.search(r'<(?P<email1>\w+@\w+)>|(?P<email2>\w+@\w+)', s2)
>>> matchobj.groupdict()['email1'] or matchobj.groupdict()['email2']
'foo@bare'
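
The same idea can be wrapped in a small helper so the caller does not have to name each alternative. This is only a sketch; first_named_group and TWO_NAMES are hypothetical names, not part of re.

```python
import re

TWO_NAMES = re.compile(r'<(?P<email1>\w+@\w+)>|(?P<email2>\w+@\w+)')

def first_named_group(match):
    """Return the value of the first named group that took part in the match."""
    for value in match.groupdict().values():
        if value is not None:
            return value
    return None
```

Testing against None rather than chaining with "or" also keeps the helper correct if a group legitimately matches the empty string.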

- Paddy.
 
aspineux

The problem is the way I create this regex :)

regex={}
regex['email']=r'(?P<email1>\w+@\w+)'

path=r'<%(email)s>|%(email)s' % regex

Once more, the original question is:
Is it normal to get an error when the same group name is used on both
sides of a | ?
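
Until 're' accepts that, one way to keep building patterns from fragments is to number the group name at substitution time. This is only a sketch of a workaround; the helper names named and find_email are hypothetical.

```python
import re

EMAIL_BODY = r'\w+@\w+'

def named(name, index, body=EMAIL_BODY):
    """Build a named group whose name carries a numeric suffix, e.g. email1."""
    return r'(?P<%s%d>%s)' % (name, index, body)

# Each use of the fragment gets its own suffix, so the names never collide.
PATH = r'<%s>|%s' % (named('email', 1), named('email', 2))

def find_email(text):
    """Return whichever of email1/email2 matched, or None if nothing matched."""
    m = re.search(PATH, text)
    if m is None:
        return None
    return m.group('email1') or m.group('email2')
```

The cost is that the caller has to know how many numbered variants exist, which is exactly the inconvenience the question is about.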
 
Paddy

Groups are numbered left to right, irrespective of the expression
contents. I am quite happy with the names being merely a pseudonym for
the positional group number, and I don't see a problem with not allowing
multiple occurrences of the same group name.

I did see an article about REs and their speed. It seems that if Python's
RE package distinguished between 'grep style' REs and the full set of
Python REs, then there are much faster and more efficient algorithms
available for the grep-style subset.
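
To illustrate the point about names being aliases for positional numbers, here is a small sketch that inspects a compiled pattern through its groupindex attribute. The variable names are chosen for illustration only.

```python
import re

pattern = re.compile(r'<(?P<email1>\w+@\w+)>|(?P<email2>\w+@\w+)')

# groupindex maps each group name to its left-to-right group number.
name_to_number = dict(pattern.groupindex)

m = pattern.search('<foo@bar>')
by_name = m.group('email1')   # lookup by name
by_number = m.group(1)        # lookup by position, same group
unused = m.group(2)           # the other alternative did not take part
```

Both sides of the | receive their own number even though only one of them can take part in a given match.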

- Paddy.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,755
Messages
2,569,537
Members
45,020
Latest member
GenesisGai

Latest Threads

Top