Help with Regex for domain names

Discussion in 'Python' started by Feyo, Jul 30, 2009.

  1. Feyo

    Feyo Guest

    I'm trying to figure out how to efficiently write a regex for
    domain names with a particular top level domain. Let's say, I want to
    grab all domain names with country codes .us, .au, and .de.

    I could create three different regexes that would work:
    regex = re.compile(r'[\w\-\.]+\.us')
    regex = re.compile(r'[\w\-\.]+\.au')
    regex = re.compile(r'[\w\-\.]+\.de')

    How would I write one to accommodate all three, or, better yet, to
    accommodate a list of them that I can pass into a method call? Thanks!
     
    Feyo, Jul 30, 2009
    #1

  2. Tim Daneliuk

    Tim Daneliuk Guest

    Feyo wrote:
    > I'm trying to figure out how to efficiently write a regex for
    > domain names with a particular top level domain. Let's say, I want to
    > grab all domain names with country codes .us, .au, and .de.
    >
    > I could create three different regexes that would work:
    > regex = re.compile(r'[\w\-\.]+\.us')
    > regex = re.compile(r'[\w\-\.]+\.au')
    > regex = re.compile(r'[\w\-\.]+\.de')
    >
    > How would I write one to accommodate all three, or, better yet, to
    > accommodate a list of them that I can pass into a method call? Thanks!


    Just a point of interest: A correctly formed domain name may have a
    trailing period at the end of the TLD [1]. Example:

    foo.bar.com.

    Though you do not often see this, it's worth accommodating "just in
    case"...


    [1] http://homepages.tesco.net/J.deBoynePollard/FGA/web-fully-qualified-domain-name.html
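
    One way to accommodate the trailing period is sketched below. This is
    only an illustration: the names TLD_RE and strip_root_dot are made up
    for this example, and the character class is the one from the original
    question.

```python
import re

# Hypothetical pattern: the question's character class, the example TLDs,
# and an optional trailing root dot.
TLD_RE = re.compile(r'[\w\-\.]+\.(?:us|au|de)\.?')

def strip_root_dot(name):
    """Return the name without a trailing root dot, if it has one."""
    return name[:-1] if name.endswith('.') else name

m = TLD_RE.match('example.de.')
print(m.group(0))                  # example.de.
print(strip_root_dot(m.group(0)))  # example.de
```

    Normalizing with strip_root_dot before comparing names keeps
    "example.de" and "example.de." from being treated as different hosts.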



    --
    ----------------------------------------------------------------------------
    Tim Daneliuk
    PGP Key: http://www.tundraware.com/PGP/
     
    Tim Daneliuk, Jul 30, 2009
    #2

  3. MRAB

    MRAB Guest

    Feyo wrote:
    > I'm trying to figure out how to efficiently write a regex for
    > domain names with a particular top level domain. Let's say, I want to
    > grab all domain names with country codes .us, .au, and .de.
    >
    > I could create three different regexes that would work:
    > regex = re.compile(r'[\w\-\.]+\.us')
    > regex = re.compile(r'[\w\-\.]+\.au')
    > regex = re.compile(r'[\w\-\.]+\.de')
    >
    > How would I write one to accommodate all three, or, better yet, to
    > accommodate a list of them that I can pass into a method call? Thanks!
    >

    regex = re.compile(r'[\w\-\.]+\.(?:us|au|de)')

    If you have a list of country codes, e.g. domains = ["us", "au", "de"],
    then you can build the regular expression from it:

    regex = re.compile(r'[\w\-\.]+\.(?:%s)' % '|'.join(domains))
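
    As a sketch of the "pass a list into a method call" part of the
    question, the same idea can be wrapped in a function. The name
    build_tld_regex is hypothetical; re.escape is added as a precaution so
    that an entry containing a dot (such as "co.uk") is taken literally.

```python
import re

def build_tld_regex(tlds):
    """Compile one pattern matching names that end in any of the given TLDs."""
    # Escape each entry so regex metacharacters in it are matched literally.
    alternatives = '|'.join(re.escape(t) for t in tlds)
    return re.compile(r'[\w\-\.]+\.(?:%s)' % alternatives)

regex = build_tld_regex(['us', 'au', 'de'])
text = 'mail from a.example.us, b.example.de and c.example.com'
print(regex.findall(text))  # ['a.example.us', 'b.example.de']
```

    findall returns the whole match here because the pattern contains only
    a non-capturing group.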
     
    MRAB, Jul 30, 2009
    #3
  4. Feyo

    Feyo Guest

    On Jul 30, 11:56 am, MRAB wrote:
    > Feyo wrote:
    > > I'm trying to figure out how to efficiently write a regex for
    > > domain names with a particular top level domain. Let's say, I want to
    > > grab all domain names with country codes .us, .au, and .de.
    > >
    > > I could create three different regexes that would work:
    > > regex = re.compile(r'[\w\-\.]+\.us')
    > > regex = re.compile(r'[\w\-\.]+\.au')
    > > regex = re.compile(r'[\w\-\.]+\.de')
    > >
    > > How would I write one to accommodate all three, or, better yet, to
    > > accommodate a list of them that I can pass into a method call? Thanks!
    >
    > regex = re.compile(r'[\w\-\.]+\.(?:us|au|de)')
    >
    > If you have a list of country codes ["us", "au", "de"] then you can
    > build the regular expression from it:
    >
    > regex = re.compile(r'[\w\-\.]+\.(?:%s)' % '|'.join(domains))


    Perfect! Thanks.
     
    Feyo, Jul 30, 2009
    #4
  5. rurpy

    rurpy Guest

    On Jul 30, 9:56 am, MRAB wrote:
    > Feyo wrote:
    > > I'm trying to figure out how to efficiently write a regex for
    > > domain names with a particular top level domain. Let's say, I want to
    > > grab all domain names with country codes .us, .au, and .de.
    > >
    > > I could create three different regexes that would work:
    > > regex = re.compile(r'[\w\-\.]+\.us')
    > > regex = re.compile(r'[\w\-\.]+\.au')
    > > regex = re.compile(r'[\w\-\.]+\.de')
    > >
    > > How would I write one to accommodate all three, or, better yet, to
    > > accommodate a list of them that I can pass into a method call? Thanks!
    >
    > regex = re.compile(r'[\w\-\.]+\.(?:us|au|de)')


    You might also want to consider that some country
    codes such as "co" for Colombia might match more than
    you want, for example:

    re.match(r'[\w\-\.]+\.(?:us|au|de|co)', 'foo.boo.com')

    will match.
     
    rurpy, Jul 30, 2009
    #5
  6. Nobody

    Nobody Guest

    On Thu, 30 Jul 2009 10:29:09 -0700, rurpy wrote:

    >> regex = re.compile(r'[\w\-\.]+\.(?:us|au|de)')
    >
    > You might also want to consider that some country
    > codes such as "co" for Colombia might match more than
    > you want, for example:
    >
    > re.match(r'[\w\-\.]+\.(?:us|au|de|co)', 'foo.boo.com')
    >
    > will match.


    ... so put \b at the end, i.e.:

    regex = re.compile(r'[\w\-\.]+\.(?:us|au|de)\b')
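
    A quick check of what \b changes, with "co" added to the list as in
    the example being replied to. The names loose and bounded are just
    labels for this comparison.

```python
import re

loose = re.compile(r'[\w\-\.]+\.(?:us|au|de|co)')
bounded = re.compile(r'[\w\-\.]+\.(?:us|au|de|co)\b')

for name in ('foo.boo.com', 'foo.boo.co', 'www.bbc.co.uk'):
    l, b = loose.match(name), bounded.match(name)
    # Show the matched text (or None) for each pattern.
    print(name, l and l.group(0), b and b.group(0))
```

    The \b version rejects 'foo.boo.com' because there is no word boundary
    between "co" and "m", but a name that merely continues with another
    label after ".co" still produces a match.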
     
    Nobody, Jul 30, 2009
    #6
  7. MRAB

    MRAB Guest

    Nobody wrote:
    > On Thu, 30 Jul 2009 10:29:09 -0700, rurpy wrote:
    >
    >>> regex = re.compile(r'[\w\-\.]+\.(?:us|au|de)')
    >>
    >> You might also want to consider that some country
    >> codes such as "co" for Colombia might match more than
    >> you want, for example:
    >>
    >> re.match(r'[\w\-\.]+\.(?:us|au|de|co)', 'foo.boo.com')
    >>
    >> will match.
    >
    > ... so put \b at the end, i.e.:
    >
    > regex = re.compile(r'[\w\-\.]+\.(?:us|au|de)\b')
    >

    With "co" in the list, it would still match "www.bbc.co.uk" (as
    "www.bbc.co"), so you might need:

    regex = re.compile(r'[\w\-\.]+\.(?:us|au|de)\b(?!\.\b)')
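
    To see what the negative lookahead does, here is a small test with
    "co" added to the alternatives (an assumption carried over from the
    earlier example, not part of the pattern above):

```python
import re

regex = re.compile(r'[\w\-\.]+\.(?:us|au|de|co)\b(?!\.\b)')

for name in ('example.co', 'example.co.', 'www.bbc.co.uk', 'foo.boo.com'):
    m = regex.match(name)
    # Print the matched text, or None when the name is rejected.
    print(name, '->', m.group(0) if m else None)
```

    The lookahead (?!\.\b) only rejects a dot that is followed by another
    label, so a name with a trailing root dot is still accepted; the
    matched text stops before that dot.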
     
    MRAB, Jul 30, 2009
    #7
  8. Aahz

    Aahz Guest

    MRAB wrote:
    >Nobody wrote:
    >> On Thu, 30 Jul 2009 10:29:09 -0700, rurpy wrote:
    >>
    >>>> regex = re.compile(r'[\w\-\.]+\.(?:us|au|de)')
    >>>
    >>> You might also want to consider that some country
    >>> codes such as "co" for Colombia might match more than
    >>> you want, for example:
    >>>
    >>> re.match(r'[\w\-\.]+\.(?:us|au|de|co)', 'foo.boo.com')
    >>>
    >>> will match.
    >>
    >> ... so put \b at the end, i.e.:
    >>
    >> regex = re.compile(r'[\w\-\.]+\.(?:us|au|de)\b')
    >
    >It would still match "www.bbc.co.uk", so you might need:
    >
    >regex = re.compile(r'[\w\-\.]+\.(?:us|au|de)\b(?!\.\b)')


    If it's a string containing just the candidate domain, you can do

    regex = re.compile(r'[\w\-\.]+\.(?:us|au|de)$')
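
    One caveat with $: it also matches just before a newline at the end of
    the string. A possible alternative, assuming Python 3.4 or later, is
    re.fullmatch. The names TLDS, CANDIDATE and has_wanted_tld below are
    made up for this sketch.

```python
import re

TLDS = ('us', 'au', 'de')  # example list from the thread
CANDIDATE = re.compile(r'[\w\-\.]+\.(?:%s)' % '|'.join(TLDS))

def has_wanted_tld(name):
    """True if the whole string is a name ending in one of TLDS."""
    return CANDIDATE.fullmatch(name) is not None

print(has_wanted_tld('example.de'))     # True
print(has_wanted_tld('example.de\n'))   # False
print(has_wanted_tld('www.bbc.co.uk'))  # False
```
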
    --
    Aahz <*> http://www.pythoncraft.com/

    "Many customs in this life persist because they ease friction and promote
    productivity as a result of universal agreement, and whether they are
    precisely the optimal choices is much less important." --Henry Spencer
     
    Aahz, Aug 2, 2009
    #8
