Help with Regex for domain names

Feyo

I'm trying to figure out how to efficiently write a regex for
domain names with a particular top-level domain. Let's say I want to
grab all domain names with the country codes .us, .au, and .de.

I could create three different regexes that would work:
regex = re.compile(r'[\w\-\.]+\.us')
regex = re.compile(r'[\w\-\.]+\.au')
regex = re.compile(r'[\w\-\.]+\.de')

How would I write one to accommodate all three, or, better yet, to
accommodate a list of them that I can pass into a method call? Thanks!
 
Tim Daneliuk

Just a point of interest: A correctly formed domain name may have a
trailing period at the end of the TLD [1]. Example:

foo.bar.com.

Though you do not often see this, it's worth accommodating "just in
case"...
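As a rough sketch of one way to accommodate that, an optional `\.?` can be appended after the TLD. This uses the `.us` pattern from the question; the optional trailing dot is an assumption about how you would want to handle the fully qualified form, not something prescribed by the reference below.

```python
import re

# Assumption: a single optional "\.?" after the TLD is enough to accept
# the fully qualified form; the rest of the pattern is from the question.
regex = re.compile(r'[\w\-\.]+\.us\.?')

for name in ('foo.bar.us', 'foo.bar.us.'):
    # group(0) is the whole matched text, with or without the final dot
    print(regex.match(name).group(0))
```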


[1] http://homepages.tesco.net/J.deBoynePollard/FGA/web-fully-qualified-domain-name.html
 
MRAB

regex = re.compile(r'[\w\-\.]+\.(?:us|au|de)')

If you have a list of country codes ["us", "au", "de"] then you can
build the regular expression from it:

regex = re.compile(r'[\w\-\.]+\.(?:%s)' % '|'.join(domains))
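A minimal sketch of the same idea wrapped in a function, so the list can be passed into a call. The name `build_tld_regex` is hypothetical, and `re.escape` is an added precaution (not part of the pattern above) in case an entry contains regex metacharacters.

```python
import re

def build_tld_regex(tlds):
    """Compile one pattern matching names that end in any of the given TLDs.

    build_tld_regex is a hypothetical helper name, not from the thread.
    """
    # Escape each entry so characters such as "." are taken literally.
    alternatives = '|'.join(re.escape(tld) for tld in tlds)
    return re.compile(r'[\w\-\.]+\.(?:%s)' % alternatives)

regex = build_tld_regex(["us", "au", "de"])
found = regex.findall("see example.us, shop.example.au and python.org")
print(found)
```

Because the group is non-capturing, `findall` returns the full matched names rather than just the TLD.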
 
Feyo

Perfect! Thanks.
 
rurpy

You might also want to consider that some country
codes, such as "co" for Colombia, might match more than
you want. For example:

re.match(r'[\w\-\.]+\.(?:us|au|de|co)', 'foo.boo.com')

will match.
 
Nobody

... so put \b at the end, i.e.:

regex = re.compile(r'[\w\-\.]+\.(?:us|au|de)\b')
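To see the difference, here is a small comparison using the "co" case and the 'foo.boo.com' input mentioned earlier in the thread. The variable names `loose` and `bounded` are just illustrative.

```python
import re

# Same pattern with and without the trailing word boundary.
loose = re.compile(r'[\w\-\.]+\.(?:us|au|de|co)')
bounded = re.compile(r'[\w\-\.]+\.(?:us|au|de|co)\b')

# Without \b, the "co" alternative matches the start of "com".
print(loose.match('foo.boo.com'))

# With \b, "co" must be followed by a non-word character or the end
# of the string, so "com" is rejected while a real ".co" still matches.
print(bounded.match('foo.boo.com'))
print(bounded.match('foo.boo.co'))
```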
 
MRAB

With "co" in the list, it would still match "www.bbc.co.uk", so you might need:

regex = re.compile(r'[\w\-\.]+\.(?:us|au|de)\b(?!\.\b)')
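A quick check of what the negative lookahead does, with "co" added to the list for the sake of the example. The test names are made up for illustration.

```python
import re

regex = re.compile(r'[\w\-\.]+\.(?:us|au|de|co)\b(?!\.\b)')

# ".co" followed by ".uk": the lookahead sees a dot followed by a word
# character, so the match is rejected.
print(regex.match('www.bbc.co.uk'))

# ".co" at the end of the name matches as before.
print(regex.match('www.example.co'))

# A bare trailing dot is not followed by a word character, so the
# fully qualified form is still accepted (the dot itself is not consumed).
print(regex.match('www.example.co.'))
```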
 
Aahz

If it's a string containing just the candidate domain, you can do

regex = re.compile(r'[\w\-\.]+\.(?:us|au|de)$')
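A short sketch of that filtering, assuming each string holds exactly one candidate name. The sample names are invented. One caveat worth knowing: `$` also matches just before a trailing newline, so `fullmatch` (available in Python 3.4 and later) on a pattern without `$` is a stricter alternative.

```python
import re

regex = re.compile(r'[\w\-\.]+\.(?:us|au|de)$')

candidates = ['example.de', 'www.example.us', 'example.de.evil.com', 'example.com']
# Keep only names whose last label is one of the wanted TLDs.
matches = [name for name in candidates if regex.match(name)]
print(matches)

# "$" tolerates one trailing newline; fullmatch does not.
strict = re.compile(r'[\w\-\.]+\.(?:us|au|de)')
print(bool(regex.match('example.de\n')))
print(strict.fullmatch('example.de\n'))
```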
--
Aahz <*> http://www.pythoncraft.com/

"Many customs in this life persist because they ease friction and promote
productivity as a result of universal agreement, and whether they are
precisely the optimal choices is much less important." --Henry Spencer
 
