Regular Expression to match the domain part of an email address

Discussion in 'Java' started by emzyme20@hotmail.com, Nov 29, 2006.

  1. Guest

    Hi,

    I'm trying to compile a regular expression that will match the domain
    part of an email address. The email address has been split into 2
    strings, the part before the @ sign and the part after the @ sign.

    This regular expression is just working with the part after the @ sign.
    The pattern that I have compiled appears to work for all combinations
    except for something like:

    a.com
    b.com

    However, the following do get matched:

    a.co.uk
    b.co.uk

    I think the problem I have is because this combination is only a single
    character long. The regular expression is truly horrendous, but I'm now
    stuck with the way it has been done and need to figure out how to
    modify it to accept the combination of "a.com" as a domain part of an
    email address.

    Can anyone tell me what's causing this problem from the expression
    below?

    Pattern.compile("^([\\*\\w]([\\*\\w\\-]{0,61}[\\*\\w])?\\.)*[\\*\\w]([\\*\\w\\-]{0,61}[\\*\\w])\\.[\\*\\w]([\\*\\w\\-]{0,61}[\\*\\w])?$");

    Many thanks,

    Emma
     
    , Nov 29, 2006
    #1

  2. On 29.11.2006 11:26, wrote:
    > Hi,
    >
    > I'm trying to compile a regular expression that will match the domain
    > part of an email address. The email address has been split into 2
    > strings, the part before the @ sign and the part after the @ sign.
    >
    > This regular expression is just working with the part after the @ sign.
    > The pattern that I have compiled appears to work for all combinations
    > except for something like:
    >
    > a.com
    > b.com
    >
    > However, the following do get matched:
    >
    > a.co.uk
    > b.co.uk
    >
    > I think the problem I have is because this combination is only a single
    > character long. The regular expression is truly horrendous, but I'm now
    > stuck with the way it has been done and need to figure out how to
    > modify it to accept the combination of "a.com" as a domain part of an
    > email address.
    >
    > Can anyone tell me what's causing this problem from the expression
    > below?
    >
    > Pattern.compile("^([\\*\\w]([\\*\\w\\-]{0,61}[\\*\\w])?\\.)*[\\*\\w]([\\*\\w\\-]{0,61}[\\*\\w])\\.[\\*\\w]([\\*\\w\\-]{0,61}[\\*\\w])?$");


    What exactly do you want to achieve? Do you want to verify that the
    string you have is actually a proper domain name? If so, then looking
    at the RFC, \\w seems to match too much (namely the underscore):

    http://tools.ietf.org/html/rfc1034#section-3.5

    When making the pattern case insensitive you should be able to match a
    "label" in that spec with

    [a-z](?:[a-z0-9-]*[a-z0-9])?

    From that you can easily construct a complete RX to match a full domain
    name.
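
    As a rough sketch of that construction (not Robert's own code): the
    label pattern above is reused as-is, and the assumption that a domain
    is two or more dot-separated labels is mine. The names LabelDomain,
    LABEL, DOMAIN and isDomain are made up for illustration.

```java
import java.util.regex.Pattern;

class LabelDomain {
    // The RFC 1034 style "label" from the post: a letter, optionally
    // followed by letters, digits or hyphens, ending in a letter or digit.
    static final String LABEL = "[a-z](?:[a-z0-9-]*[a-z0-9])?";

    // Assumption: a domain is one or more "label." pieces plus a final label.
    static final Pattern DOMAIN = Pattern.compile(
        "(?:" + LABEL + "\\.)+" + LABEL, Pattern.CASE_INSENSITIVE);

    static boolean isDomain(String s) {
        // matches() requires the whole input to match, so no ^ or $ needed.
        return DOMAIN.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"a.com", "b.co.uk", "a-.com", "a_b.com"}) {
            System.out.println(s + " -> " + isDomain(s));
        }
    }
}
```

    With this sketch a.com and b.co.uk are accepted, while a-.com and
    a_b.com are rejected. It does not allow the * wildcard.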

    robert
     
    Robert Klemme, Nov 29, 2006
    #2

  3. writes:
    ....
    > The pattern that I have compiled appears to work for all
    > combinations except for something like:
    >
    > a.com
    > b.com
    >
    > However, the following do get matched:
    >
    > a.co.uk
    > b.co.uk
    >
    > I think the problem I have is because this combination is only a
    > single character long. The regular expression is truly horrendous,
    > but I'm now stuck with the way it has been done and need to figure
    > out how to modify it to accept the combination of "a.com" as a
    > domain part of an email address.
    >
    > Can anyone tell me what's causing this problem from the expression
    > below?


    Here is your expression laid out on several lines, with the two
    required characters before the last \. marked with <-----.

    ([\\*\\w]([\\*\\w\\-]{0,61}[\\*\\w])?\\.)*
    [\\*\\w] <-------
    ([\\*\\w\\-]{0,61}
    [\\*\\w]) <-------
    \\.
    [\\*\\w]
    ([\\*\\w\\-]{0,61}[\\*\\w])?

    It seems to me the parenthesised expression before the second arrow
    should be optional. The other two identical expressions are.
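
    A small check of that diagnosis, as a sketch: ORIGINAL is the pattern
    exactly as posted, and FIXED differs only by a ? after the group
    before the second arrow. The class and constant names are made up
    for illustration.

```java
import java.util.regex.Pattern;

class OptionalGroupFix {
    // The pattern as posted: the group before the last "\\." is required,
    // so the second-to-last label needs at least two characters.
    static final Pattern ORIGINAL = Pattern.compile(
        "^([\\*\\w]([\\*\\w\\-]{0,61}[\\*\\w])?\\.)*"
        + "[\\*\\w]([\\*\\w\\-]{0,61}[\\*\\w])"
        + "\\.[\\*\\w]([\\*\\w\\-]{0,61}[\\*\\w])?$");

    // Same pattern with that group made optional by a trailing '?'.
    static final Pattern FIXED = Pattern.compile(
        "^([\\*\\w]([\\*\\w\\-]{0,61}[\\*\\w])?\\.)*"
        + "[\\*\\w]([\\*\\w\\-]{0,61}[\\*\\w])?"
        + "\\.[\\*\\w]([\\*\\w\\-]{0,61}[\\*\\w])?$");

    public static void main(String[] args) {
        for (String s : new String[] {"a.com", "b.com", "a.co.uk", "ab.com"}) {
            System.out.println(s
                + " original=" + ORIGINAL.matcher(s).matches()
                + " fixed=" + FIXED.matcher(s).matches());
        }
    }
}
```

    In a.co.uk the required two characters are supplied by "co", which
    is why that input matched while a.com did not.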
     
    Jussi Piitulainen, Nov 29, 2006
    #3
  4. Guest

    Hi,

    I am trying to validate the domain portion (everything after the @
    sign, including the .com or .co.uk etc.).

    I modified the expression to try to make the part before the second
    arrow optional. From the regular expression help that I have found,
    this is done by putting a ?: at the start of the parenthesis.

    (?:[\\*\\w\\-]{0,61}[\\*\\w])

    This change allowed me to recognise a-b.com as a domain, but I am still
    having issues with single character domains e.g. a.com, b.org

    Is this because I've got two sections with \\w?

    I agree with the comment about the underscores being invalid, I'll work
    on a better expression to eliminate them.

    Emma

    Jussi Piitulainen wrote:
    > writes:
    > ...
    > > The pattern that I have compiled appears to work for all
    > > combinations except for something like:
    > >
    > > a.com
    > > b.com
    > >
    > > However, the following do get matched:
    > >
    > > a.co.uk
    > > b.co.uk
    > >
    > > I think the problem I have is because this combination is only a
    > > single character long. The regular expression is truly horrendous,
    > > but I'm now stuck with the way it has been done and need to figure
    > > out how to modify it to accept the combination of "a.com" as a
    > > domain part of an email address.
    > >
    > > Can anyone tell me what's causing this problem from the expression
    > > below?

    >
    > Here is your expression laid out on several lines, with the two
    > required characters before the last \. marked with <-----.
    >
    > ([\\*\\w]([\\*\\w\\-]{0,61}[\\*\\w])?\\.)*
    > [\\*\\w] <-------
    > ([\\*\\w\\-]{0,61}
    > [\\*\\w]) <-------
    > \\.
    > [\\*\\w]
    > ([\\*\\w\\-]{0,61}[\\*\\w])?
    >
    > It seems to me the parenthesised expression before the second arrow
    > should be optional. The other two identical expressions are.
     
    , Nov 29, 2006
    #4
  5. On 29.11.2006 13:26, wrote:
    > I am trying to validate that the domain portion (everything after the @
    > sign, including the .com or .co.uk etc).
    >
    > I modified the expression to try to make the part before the second
    > arrow optional. From the regular expression help that I have found,
    > this is done by putting a ?: at the start of the parenthesis.


    No. Please reread your documentation. "(?:)" is simply a non capturing
    group as opposed to "()".

    http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html

    > I agree with the comment about the underscores being invalid, I'll work
    > on a better expression to eliminate them.


    Yes, do that.

    Regards

    robert
     
    Robert Klemme, Nov 29, 2006
    #5
  6. writes:

    [reordered]

    > Jussi Piitulainen wrote:
    >> Here is your expression laid out on several lines, with the two
    >> required characters before the last \. marked with <-----.
    >>
    >> ([\\*\\w]([\\*\\w\\-]{0,61}[\\*\\w])?\\.)*
    >> [\\*\\w] <-------
    >> ([\\*\\w\\-]{0,61}
    >> [\\*\\w]) <-------
    >> \\.
    >> [\\*\\w]
    >> ([\\*\\w\\-]{0,61}[\\*\\w])?
    >>
    >> It seems to me the parenthesised expression before the second arrow
    >> should be optional. The other two identical expressions are.

    ....
    > I modified the expression to try to make the part before the second
    > arrow optional. From the regular expression help that I have found,
    > this is done by putting a ?: at the start of the parenthesis.
    >
    > (?:[\\*\\w\\-]{0,61}[\\*\\w])


    Er, no. It's made optional by adding a ? in the end, like so:

    ([\\*\\w\\-]{0,61}[\\*\\w])?

    Your original expression contained two of these already, so I thought
    you knew this. Other optional expressions are E* and E{0,61} but they
    are also repeatable.

    It is a good idea to put in that ?: anyway, but for a different
    reason. A merely parenthesised expression is used to "capture" the
    part of the match that corresponds to that expression, and if you
    don't use that mechanism, this computation is just wasted.

    > This change allowed me to recognise a-b.com as a domain, but I am
    > still having issues with single character domains e.g. a.com, b.org


    That should have matched already. The ?: does not change what the
    expression matches, only what parts of the match are captured as
    groups.
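
    To illustrate the difference, here is a sketch using a cut-down
    fragment (one label followed by a literal ".com") rather than the
    full expression. The fragment and the names GroupKinds, CAPTURING,
    NON_CAPTURING and OPTIONAL are made up for this example.

```java
import java.util.regex.Pattern;

class GroupKinds {
    // Plain parentheses: the group is required and captured.
    static final Pattern CAPTURING =
        Pattern.compile("[\\*\\w]([\\*\\w\\-]{0,61}[\\*\\w])\\.com");

    // (?: ) only switches capturing off; the group is still required.
    static final Pattern NON_CAPTURING =
        Pattern.compile("[\\*\\w](?:[\\*\\w\\-]{0,61}[\\*\\w])\\.com");

    // A trailing '?' is what makes the group optional.
    static final Pattern OPTIONAL =
        Pattern.compile("[\\*\\w](?:[\\*\\w\\-]{0,61}[\\*\\w])?\\.com");

    public static void main(String[] args) {
        for (String s : new String[] {"a-b.com", "a.com"}) {
            System.out.println(s
                + " capturing=" + CAPTURING.matcher(s).matches()
                + " nonCapturing=" + NON_CAPTURING.matcher(s).matches()
                + " optional=" + OPTIONAL.matcher(s).matches());
        }
        // groupCount() reports the number of capturing groups in the pattern.
        System.out.println(CAPTURING.matcher("").groupCount());      // 1
        System.out.println(NON_CAPTURING.matcher("").groupCount());  // 0
    }
}
```

    Both of the first two patterns accept a-b.com and reject a.com; only
    the number of captured groups differs.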

    > Is this because I've got two sections with \\w?


    You have two top-level segments that both _have_ to match either a
    literal * or a \w. By the way, you can write just "[*\\w]", the * is
    not special inside brackets.
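
    A quick way to convince yourself that the two spellings behave the
    same, as a sketch (the class name and the sample inputs are made up):

```java
class StarInBrackets {
    // The class as written in the original pattern, with * escaped.
    static final String ESCAPED = "[\\*\\w]+";
    // The shorter spelling: * has no special meaning inside brackets.
    static final String PLAIN = "[*\\w]+";

    // True when both spellings give the same answer for this input.
    static boolean agree(String input) {
        return input.matches(ESCAPED) == input.matches(PLAIN);
    }

    public static void main(String[] args) {
        for (String s : new String[] {"*", "a*b", "a.b", ""}) {
            System.out.println("\"" + s + "\": escaped=" + s.matches(ESCAPED)
                + " plain=" + s.matches(PLAIN));
        }
    }
}
```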

    > I agree with the comment about the underscores being invalid, I'll
    > work on a better expression to eliminate them.


    I don't even know what is allowed in domain names. Is * really
    allowed? Is a-.com really disallowed?
     
    Jussi Piitulainen, Nov 29, 2006
    #6
  7. Guest

    > Er, no. It's made optional by adding a ? in the end, like so:
    >
    > ([\\*\\w\\-]{0,61}[\\*\\w])?
    >
    > Your original expression contained two of these already, so I thought
    > you knew this. Other optional expressions are E* and E{0,61} but they
    > also repeatable.


    heh thanks for that.. I inherited this particular piece of code. I'm
    trying to diagnose and fix a few problems that have been highlighted
    since conception. When I sat down with the expression and separated it
    into sections following a guide I was using, it stated that ? stood for
    1 or more times, so that's why I never noticed that.

    > You have two top-level segments that both _have_ to match either a
    > literal * or a \w. By the way, you can write just "[*\\w]", the * is
    > not special inside brackets.


    ah yes, now I see those, the regular expression makes it really
    difficult to spot everything and there are just far too many
    backslashes for my liking....

    > I don't even know what is allowed in domain names. Is * really
    > allowed? Is a-.com really disallowed?


    The * is for our benefit I think, we're allowing users to enter
    wildcarded email addresses to save them having to specifically enter
    every single combination in. I'm not sure about the - ended domain
    name, I know you're not allowed to start or end with a dot.
     
    , Nov 29, 2006
    #7
  8. writes:

    >> Er, no. It's made optional by adding a ? in the end, like so:
    >>
    >> ([\\*\\w\\-]{0,61}[\\*\\w])?
    >>
    >> Your original expression contained two of these already, so I
    >> thought you knew this. Other optional expressions are E* and
    >> E{0,61} but they also repeatable.

    >
    > heh thanks for that.. I inherited this particular piece of code. I'm
    > trying to diagnose and fix a few problems that have been highlighted
    > since conception. When I sat down with the expression and separated
    > it into sections following a guide I was using, it stated that ?
    > stood for 1 or more times, so that's why I never noticed that.


    Ok, here are some suggestions. First, if the guide really says ?
    stands for one or more, don't trust it: ? means zero or one. Sun's
    documentation for java.util.regex.Pattern is actually rather good:
    <http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/Pattern.html>

    Second, my main point here, this particular pattern is a good
    candidate for some abstraction, because it contains a repeated
    sub-pattern. Tame it by naming that pattern. I do this below starting
    from Robert Klemme's pattern and putting in that `*' that you want.

    Third, I'm not convinced that you need to bother with {0,61}. That 61
    is so many that I would just use *.

    Fourth, when a single expression becomes unwieldy, you may be able to
    write separate tests. One test to see that only the allowed characters
    are used, another to see that the input starts and ends properly, for
    example.

    Consider this:

    class Roska {
        public static void main(String[] args) {

            // Wrapping `word' in (?: ) is a redundant safety
            // measure here, but matters a lot if `word' ends
            // before a quantifier or something.

            String word = "(?:[a-z*](?:[a-z0-9\\-*]*[a-z0-9*])?)";
            String words = "(?:" + word + "[.])+" + word;

            for (int k = 0; k < args.length; ++k) {
                System.out.println(args[k].matches(words));
            }
        }
    }

    It seems to work. One or more words ending in a period, and then one
    more word, where a word starts with ...

    I'm not sure if the escape is needed for `-' in a character class, and
    Sun does not seem to tell. It appears to work with or without.
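
    The fourth suggestion above (separate tests instead of one big
    expression) might look something like the following sketch. It is
    meant to accept the same lowercase inputs as `words', but the split
    into these two particular checks and all the names are assumptions
    made for illustration.

```java
class SplitChecks {
    // Check 1: only lowercase letters, digits, '*', '.' and '-' appear.
    static boolean allowedCharsOnly(String s) {
        return s.matches("[a-z0-9*.-]+");
    }

    // Check 2: at least two dot-separated pieces, each starting with a
    // letter or '*' and not ending with '-'.
    static boolean piecesWellFormed(String s) {
        String[] pieces = s.split("\\.", -1);  // -1 keeps trailing empty pieces
        if (pieces.length < 2) {
            return false;
        }
        for (String piece : pieces) {
            if (!piece.matches("[a-z*](?:.*[^-])?")) {
                return false;
            }
        }
        return true;
    }

    static boolean isDomain(String s) {
        return allowedCharsOnly(s) && piecesWellFormed(s);
    }

    public static void main(String[] args) {
        for (String s : args) {
            System.out.println(s + " -> " + isDomain(s));
        }
    }
}
```

    Each check is short enough to read at a glance, at the cost of
    scanning the input more than once.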
     
    Jussi Piitulainen, Nov 29, 2006
    #8
  9. Lew Guest

    Jussi Piitulainen wrote:
    > I'm not sure if the escape is needed for `-' in a character class, and
    > Sun does not seem to tell. It appears to work with or without.


    You don't need to escape the '-' in a character class if it's the first or
    last character indicated:

    [a-z] matches any character from 'a' to 'z'.
    [a\-z] matches 'a', 'z' or '-'.
    [az-] matches 'a', 'z' or '-'.

    - Lew
     
    Lew, Dec 2, 2006
    #9
  10. Lew writes:
    > Jussi Piitulainen wrote:
    >> I'm not sure if the escape is needed for `-' in a character class,
    >> and Sun does not seem to tell. It appears to work with or without.

    >
    > You don't need to escape the '-' in a character class if it's the
    > first or last character indicated:
    >
    > [a-z] matches any character from 'a' to 'z'.
    > [a\-z] matches 'a', 'z' or '-'.
    > [az-] matches 'a', 'z' or '-'.


    Or otherwise at a point where it does not form a range, when read from
    left to right: [a-z-*]. Apparently.

    It seems to be that way, but this is not documented. At least I can't
    find it stated in Sun's documentation of java.util.regex.Pattern, 1.5,
    which is otherwise rather thorough.
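
    The cases discussed above can be tried out with a sketch like this
    one. It only records observed behaviour of java.util.regex for these
    particular classes; as noted, the unescaped forms are not spelled
    out in the documentation. The class and method names are made up.

```java
class HyphenInClass {
    // Helper: does the single-character input match the character class?
    static boolean in(String charClass, String ch) {
        return ch.matches(charClass);
    }

    public static void main(String[] args) {
        String[] classes = {"[a-z]", "[a\\-z]", "[az-]", "[a-z-*]"};
        String[] inputs = {"m", "-", "*"};
        for (String c : classes) {
            for (String s : inputs) {
                System.out.println(c + " matches '" + s + "': " + in(c, s));
            }
        }
    }
}
```

    Here [a-z] treats the hyphen as a range operator, while the other
    three classes all accept a literal '-'.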
     
    Jussi Piitulainen, Dec 2, 2006
    #10
