Help with Regular Expressions

Harlin Seritt

I have been looking at the Python re module and have been trying to
make sense of a simple function that I'd like to do. However, no amount
of reading or googling has helped me with this. Forgive my
stone-headedness. I have done this with .NET and Java in the past but
damn if I can't get it done with Python for some reason. As such I am
sure it is something even simpler.

I am trying to find some matches and have them put into a list when
processing is done. I'll use a simple example like email addresses.

My input is the following:
wordList = ['myname1', '(e-mail address removed)', '(e-mail address removed)',
'myname4@domain', '(e-mail address removed)']

My regular expression would be something like '\w\@\w\.\w' (I realize
it could and should be more detailed but that's not the point for now).

I would like to find out how to output the matches for this expression
of my 'wordList' into a neat list variable. How do I get this done?

Thanks,

Harlin Seritt
 
Devan L

Harlin said:
I have been looking at the Python re module and have been trying to
make sense of a simple function that I'd like to do. However, no amount
of reading or googling has helped me with this. Forgive my
stone-headedness. I have done this with .NET and Java in the past but
damn if I can't get it done with Python for some reason. As such I am
sure it is something even simpler.

I am trying to find some matches and have them put into a list when
processing is done. I'll use a simple example like email addresses.

My input is the following:
wordList = ['myname1', '(e-mail address removed)', '(e-mail address removed)',
'myname4@domain', '(e-mail address removed)']

My regular expression would be something like '\w\@\w\.\w' (I realize
it could and should be more detailed but that's not the point for now).

I would like to find out how to output the matches for this expression
of my 'wordList' into a neat list variable. How do I get this done?

Thanks,

Harlin Seritt

To get the pieces back separately, you need to enclose the '\w's in
parentheses: the groups() method of a match object only returns the parts
of the pattern that are enclosed in parentheses. Also, you need to use
'+' so that \w won't just match a single alphanumeric character, but
will match one or more. You also need to escape the '.' (as you already
do), because an unescaped '.' matches any character. So your regular
expression would be more like

r'(\w+)@(\w+)\.(\w+)'

Anyways, you can use a list comprehension and the groups() method of a
match object to build a list of tuples:

[re.match(r'(\w+)@(\w+)\.(\w+)', address).groups() for address in
wordList]

On a side note, some of the entries in your list don't match the
pattern. For those, re.match returns None and the call to groups()
fails. You should use

wordList = ['(e-mail address removed)', '(e-mail address removed)',
'(e-mail address removed)']
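
If trimming the list is not an option, the non-matching entries can be skipped instead. The following is only a minimal sketch: the addresses are made-up placeholders (the real ones were removed from the post), and the names word_list, pattern and parts are illustrative.

```python
import re

# Hypothetical sample data; the real addresses were removed from the post.
word_list = ['myname1', 'myname2@domain.tld', 'myname3@domain.tld',
             'myname4@domain', 'myname5@domain.tldx']

pattern = re.compile(r'(\w+)@(\w+)\.(\w+)')

# re.match returns None when the pattern does not match, so check the
# result before calling groups() on it.
parts = []
for address in word_list:
    m = pattern.match(address)
    if m is not None:
        parts.append(m.groups())

print(parts)
```

Each tuple in parts holds the three parenthesized groups (user, domain, suffix) of one matching entry.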
 
Fredrik Lundh

Harlin said:
I am trying to find some matches and have them put into a list when
processing is done. I'll use a simple example like email addresses.

My input is the following:
wordList = ['myname1', '(e-mail address removed)', '(e-mail address removed)',
'myname4@domain', '(e-mail address removed)']

My regular expression would be something like '\w\@\w\.\w' (I realize
it could and should be more detailed but that's not the point for now).

I would like to find out how to output the matches for this expression
of my 'wordList' into a neat list variable. How do I get this done?

that's more of a list manipulation question than a regular expression
question, of course. to apply a regular expression to all items in a
list, apply it to all items in a list. a list comprehension is the shortest
way to do this:
>>> import re
>>> out = [word for word in wordList if re.match(r"\w+@\w+\.\w+", word)]
>>> out
['(e-mail address removed)', '(e-mail address removed)', '(e-mail address removed)']

</F>
 
Harlin Seritt

Ahh that's it Fredrik. That's what I was looking for. The regular
expression problems I will take care of, but first wanted to walk
before running. ;)

Thanks,

Harlin Seritt
 
Harlin Seritt

Forgive another question here, but what is the 'r' for when used with
expression: r'\w+...' ?
 
Benjamin Niemann

Harlin said:
Forgive another question here, but what is the 'r' for when used with
expression: r'\w+...' ?

r'..' or r".." are "raw strings", in which backslashes do not introduce an
escape sequence, so you don't have to write '\\' if you need a backslash
in the string, e.g. r'\w+' == '\\w+'.
They are useful for regular expressions (because the re module parses the '\X'
sequences itself) or Windows paths (e.g. r'C:\newfile.txt').
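
A small sketch of the difference, using only string literals (the variable names are illustrative):

```python
# In a normal string literal '\n' is a single newline character;
# in a raw string literal r'\n' is two characters: a backslash and 'n'.
normal = '\n'
raw = r'\n'
print(len(normal), len(raw))

# A raw string and its doubled-backslash spelling are the same string.
same = (r'\w+' == '\\w+')
print(same)

# Without the r prefix, the '\n' in this path would become a newline.
path = r'C:\newfile.txt'
print(path)
```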

And you should append a '$' to the regular expression, because
r"\w+@\w+\.\w+" would match '(e-mail address removed)-+*junk', too.
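
The effect of the trailing '$' can be checked with a quick sketch. The candidate string is a made-up placeholder, since the address in the post was removed.

```python
import re

unanchored = re.compile(r"\w+@\w+\.\w+")
anchored = re.compile(r"\w+@\w+\.\w+$")

# Hypothetical input: a plausible address followed by trailing junk.
candidate = 'myname1@domain.tld-+*junk'

# re.match only anchors at the start, so trailing junk is ignored
# unless the pattern itself requires the end of the string.
loose = bool(unanchored.match(candidate))
strict = bool(anchored.match(candidate))
clean = bool(anchored.match('myname1@domain.tld'))

print(loose, strict, clean)
```

On Python 3.4 and later, re.fullmatch with the unanchored pattern is an alternative way to require that the whole string matches.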
 
Paul McGuire

If your re demands get more complicated, you could take a look at
pyparsing. The code is a bit more verbose, but many find it easier to
compose their expressions using pyparsing's classes, such as Literal,
OneOrMore, Optional, etc., plus a number of built-in helper functions
and expressions, including delimitedList, quotedString, and
cStyleComment. Pyparsing is intended for writing recursive-descent
parsers, but can also be used (and is best learned) with simple
applications such as this one.

Here is a simple script for parsing your e-mail addresses. Note the
use of results names to give you access to the individual parsed fields
(the re module supports a similar capability through named groups).

Download pyparsing at http://pyparsing.sourceforge.net.

-- Paul

from pyparsing import Literal, Word, Optional, \
    delimitedList, alphanums

# define format of an email address
AT = Literal("@").suppress()
emailWord = Word(alphanums + "_")
emailDomain = delimitedList(emailWord, ".", combine=True)
emailAddress = emailWord.setResultsName("user") + \
    Optional(AT + emailDomain).setResultsName("host")

# parse each word in wordList
wordList = ['myname1', '(e-mail address removed)', '(e-mail address removed)',
            'myname4@domain', '(e-mail address removed)']

# print() calls below assume Python 3
for w in wordList:
    addr = emailAddress.parseString(w)
    print(w)
    print(addr)
    print("user:", addr.user)
    print("host:", addr.host)
    print()

Will print out:
myname1
['myname1']
user: myname1
host:

(e-mail address removed)
['myname1', 'domain.tld']
user: myname1
host: domain.tld

(e-mail address removed)
['myname2', 'domain.tld']
user: myname2
host: domain.tld

myname4@domain
['myname4', 'domain']
user: myname4
host: domain

(e-mail address removed)
['myname5', 'domain.tldx']
user: myname5
host: domain.tldx
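
For comparison, the named-group capability of the re module can be sketched as follows. This is a rough counterpart of the grammar above, not an exact equivalent, and the sample inputs are made-up placeholders.

```python
import re

# A user part followed by an optional "@host", where the host is one or
# more words separated by dots. (?P<name>...) defines a named group.
pattern = re.compile(r'(?P<user>\w+)(?:@(?P<host>\w+(?:\.\w+)*))?$')

# Hypothetical sample data.
samples = ['myname1', 'myname1@domain.tld', 'myname4@domain']

results = []
for w in samples:
    m = pattern.match(w)
    # group('host') is None when the optional part did not take part
    # in the match.
    results.append((m.group('user'), m.group('host')))

print(results)
```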
 
Jeff Schwab

Harlin said:
I am trying to find some matches and have them put into a list when
processing is done. I'll use a simple example like email addresses.

My input is the following:
wordList = ['myname1', '(e-mail address removed)', '(e-mail address removed)',
'myname4@domain', '(e-mail address removed)']

My regular expression would be something like '\w\@\w\.\w' (I realize
it could and should be more detailed but that's not the point for now).

FYI, matching all compliant email addresses is ridiculously complicated.
Before you spend too much time on it, you might want to borrow the
complete and thoroughly explained example in Mastering Regular Expressions
(O'Reilly):

http://www.oreilly.com/catalog/regex/
 
Christopher Subich

Paul said:
If your re demands get more complicated, you could take a look at
pyparsing. The code is a bit more verbose, but many find it easier to
compose their expressions using pyparsing's classes, such as Literal,
OneOrMore, Optional, etc., plus a number of built-in helper functions
and expressions, including delimitedList, quotedString, and
cStyleComment. Pyparsing is intended for writing recursive-descent
parsers, but can also be used (and is best learned) with simple
applications such as this one.

As a slightly unrelated pyparsing question, is there a good set of API
documentation around for pyparsing?

I've looked into it for my mud client, but for now have gone with
DParser because I need (desire) custom token generation sometimes.
Pyparsing looks easier to internationalize, though.
 
