Regular Expression to match the domain part of an email address

emzyme20

Hi,

I'm trying to compile a regular expression that will match the domain
part of an email address. The email address has been split into 2
strings, the part before the @ sign and the part after the @ sign.

This regular expression is just working with the part after the @ sign.
The pattern that I have compiled appears to work for all combinations
except for something like:

a.com
b.com

However, the following do get matched:

a.co.uk
b.co.uk

I think the problem I have is because this combination is only a single
character long. The regular expression is truly horrendous, but I'm now
stuck with the way it has been done and need to figure out how to
modify it to accept the combination of "a.com" as a domain part of an
email address.

Can anyone tell me what's causing this problem from the expression
below?

Pattern.compile("^([\\*\\w]([\\*\\w\\-]{0,61}[\\*\\w])?\\.)*[\\*\\w]([\\*\\w\\-]{0,61}[\\*\\w])\\.[\\*\\w]([\\*\\w\\-]{0,61}[\\*\\w])?$");

Many thanks,

Emma
 
Robert Klemme

What exactly do you want to achieve? Do you want to verify that the
string you have is actually a proper domain name? If so, then looking
at the RFC \\w seems to be matching too much (namely the underscore):

http://tools.ietf.org/html/rfc1034#section-3.5

When making the pattern case insensitive you should be able to match a
"label" in that spec with

[a-z](?:[a-z0-9-]*[a-z0-9])?

From that you can easily construct a complete RX to match a full domain
name.

robert
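
As a rough illustration of the construction described above, the label pattern can be joined with literal dots to form a whole-domain pattern. This is only a sketch: the class name `DomainName`, the helper `isDomain`, and the choice of "one or more labels separated by single dots" are assumptions made for the example, and it deliberately ignores length limits and the `*` wildcard discussed elsewhere in the thread.

```java
import java.util.regex.Pattern;

class DomainName {
    // Label pattern quoted from the post above; it relies on case-insensitive matching.
    static final String LABEL = "[a-z](?:[a-z0-9-]*[a-z0-9])?";

    // Assumption for this sketch: a domain is one or more labels joined by single dots.
    static final Pattern DOMAIN =
        Pattern.compile(LABEL + "(?:\\." + LABEL + ")*", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: true if the whole string is a dot-separated list of labels.
    static boolean isDomain(String s) {
        return DOMAIN.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"a.com", "A.CO.UK", "a_b.com", "a-.com"}) {
            System.out.println(s + " -> " + isDomain(s));
        }
    }
}
```

With this sketch, single-character labels such as the "a" in "a.com" are accepted because the tail of each label is optional, while underscores and a trailing hyphen in a label are rejected.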
 
Jussi Piitulainen

Here is your expression laid out on several lines, with the two
required characters before the last \. marked with <-----.

([\\*\\w]([\\*\\w\\-]{0,61}[\\*\\w])?\\.)*
[\\*\\w] <-------
([\\*\\w\\-]{0,61}
[\\*\\w]) <-------
\\.
[\\*\\w]
([\\*\\w\\-]{0,61}[\\*\\w])?

It seems to me the parenthesised expression before the second arrow
should be optional. The other two identical expressions are.
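
The diagnosis above can be checked with a small program. This is a sketch, not the poster's code: the class name `OptionalGroupFix` and the constants `HEAD` and `TAIL` are made up for the example, and the pattern is rebuilt from pieces only to make the one-character difference visible.

```java
import java.util.regex.Pattern;

class OptionalGroupFix {
    // Pieces of the original pattern: a first character, then a tail that ends in a non-hyphen.
    static final String HEAD = "[\\*\\w]";
    static final String TAIL = "([\\*\\w\\-]{0,61}[\\*\\w])";

    // Original pattern: the tail of the second-to-last segment is required.
    static final Pattern ORIGINAL = Pattern.compile(
        "^(" + HEAD + TAIL + "?\\.)*" + HEAD + TAIL + "\\." + HEAD + TAIL + "?$");

    // Same pattern with a ? added after that tail, making it optional.
    static final Pattern FIXED = Pattern.compile(
        "^(" + HEAD + TAIL + "?\\.)*" + HEAD + TAIL + "?\\." + HEAD + TAIL + "?$");

    static boolean original(String s) { return ORIGINAL.matcher(s).matches(); }
    static boolean fixed(String s)    { return FIXED.matcher(s).matches(); }

    public static void main(String[] args) {
        for (String s : new String[] {"a.com", "a.co.uk", "a-b.com", "a-.com"}) {
            System.out.println(s + ": original=" + original(s) + " fixed=" + fixed(s));
        }
    }
}
```

In the original pattern the segment before the last dot needs at least two characters, so "a.com" fails while "a.co.uk" succeeds because "co" fills that segment.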
 
emzyme20

Hi,

I am trying to validate the domain portion (everything after the @
sign, including the .com or .co.uk etc).

I modified the expression to try to make the part before the second
arrow optional. From the regular expression help that I have found,
this is done by putting a ?: at the start of the parenthesis.

(?:[\\*\\w\\-]{0,61}[\\*\\w])

This change allowed me to recognise a-b.com as a domain, but I am still
having issues with single character domains e.g. a.com, b.org

Is this because I've got two sections with \\w?

I agree with the comment about the underscores being invalid, I'll work
on a better expression to eliminate them.

Emma

 
Robert Klemme

emzyme20 said:
I am trying to validate the domain portion (everything after the @
sign, including the .com or .co.uk etc).

I modified the expression to try to make the part before the second
arrow optional. From the regular expression help that I have found,
this is done by putting a ?: at the start of the parenthesis.

No. Please reread your documentation. "(?:)" is simply a non-capturing
group as opposed to "()".

http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html

emzyme20 said:
I agree with the comment about the underscores being invalid, I'll work
on a better expression to eliminate them.

Yes, do that.

Regards

robert
 
Jussi Piitulainen

emzyme20 writes:

[reordered]
Jussi said:
Here is your expression laid out on several lines, with the two
required characters before the last \. marked with <-----.

([\\*\\w]([\\*\\w\\-]{0,61}[\\*\\w])?\\.)*
[\\*\\w] <-------
([\\*\\w\\-]{0,61}
[\\*\\w]) <-------
\\.
[\\*\\w]
([\\*\\w\\-]{0,61}[\\*\\w])?

It seems to me the parenthesised expression before the second arrow
should be optional. The other two identical expressions are.
....
I modified the expression to try to make the part before the second
arrow optional. From the regular expression help that I have found,
this is done by putting a ?: at the start of the parenthesis.

(?:[\\*\\w\\-]{0,61}[\\*\\w])

Er, no. It's made optional by adding a ? at the end, like so:

([\\*\\w\\-]{0,61}[\\*\\w])?

Your original expression contained two of these already, so I thought
you knew this. Other optional expressions are E* and E{0,61}, but they
are also repeatable.

It is a good idea to put in that ?: anyway, but for a different
reason. A merely parenthesised expression is used to "capture" the
part of the match that corresponds to that expression, and if you
don't use that mechanism, this computation is just wasted.

emzyme20 said:
This change allowed me to recognise a-b.com as a domain, but I am
still having issues with single character domains e.g. a.com, b.org

That should have matched already. The ?: does not change what the
expression matches, only what parts of the match are captured as
groups.

emzyme20 said:
Is this because I've got two sections with \\w?

You have two top-level segments that both _have_ to match either a
literal * or a \w. By the way, you can write just "[*\\w]", the * is
not special inside brackets.

emzyme20 said:
I agree with the comment about the underscores being invalid, I'll
work on a better expression to eliminate them.

I don't even know what is allowed in domain names. Is * really
allowed? Is a-.com really disallowed?
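
The distinction between `?:` and a trailing `?` can be seen in a short experiment. This is a sketch with invented names (`GroupKinds`, `groups`), using toy patterns rather than the domain expression from the thread.

```java
import java.util.regex.Pattern;

class GroupKinds {
    // Hypothetical helper: number of capturing groups the pattern defines.
    static int groups(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        // ?: only turns off capturing; it does not make the group optional.
        System.out.println(groups("a(b)?c"));          // capturing group
        System.out.println(groups("a(?:b)?c"));        // non-capturing group
        System.out.println("ac".matches("a(?:b)c"));   // group still required
        System.out.println("ac".matches("a(?:b)?c"));  // trailing ? makes it optional

        // Inside brackets * is an ordinary character, escaped or not.
        System.out.println("*".matches("[*\\w]"));
        System.out.println("*".matches("[\\*\\w]"));
    }
}
```

The first pattern reports one capturing group and the second reports none, yet both accept the same strings; only the trailing `?` decides whether "ac" matches.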
 
emzyme20

Jussi said:
Er, no. It's made optional by adding a ? in the end, like so:
([\\*\\w\\-]{0,61}[\\*\\w])?

Your original expression contained two of these already, so I thought
you knew this. Other optional expressions are E* and E{0,61} but they
are also repeatable.

Heh, thanks for that. I inherited this particular piece of code. I'm
trying to diagnose and fix a few problems that have been highlighted
since conception. When I sat down with the expression and separated it
into sections following a guide I was using, it stated that ? stood for
1 or more times, so that's why I never noticed that.

Jussi said:
You have two top-level segments that both _have_ to match either a
literal * or a \w. By the way, you can write just "[*\\w]", the * is
not special inside brackets.

Ah yes, now I see those. The regular expression makes it really
difficult to spot everything and there are just far too many backslashes
for my liking.

Jussi said:
I don't even know what is allowed in domain names. Is * really
allowed? Is a-.com really disallowed?

The * is for our benefit I think: we're allowing users to enter
wildcarded email addresses to save them having to specifically enter
every single combination. I'm not sure about the - ended domain
name; I know you're not allowed to start or end with a dot.
 
Jussi Piitulainen

Ok, here are some suggestions. First, if the guide really says ?
stands for one or more, don't trust it. Sun's documentation for
java.util.regex.Pattern is actually rather good:
<http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/Pattern.html>

Second, my main point here, this particular pattern is a good
candidate for some abstraction, because it contains a repeated
sub-pattern. Tame it by naming that pattern. I do this below starting
from Robert Klemme's pattern and putting in that `*' that you want.

Third, I'm not convinced that you need to bother with {0,61}. That 61
is so many that I would just use *.

Fourth, when a single expression becomes unwieldy, you may be able to
write separate tests. One test to see that only the allowed characters
are used, another to see that the input starts and ends properly, for
example.

Consider this:

class Roska {
    public static void main(String[] args) {

        // Wrapping `word' in (?: ) is a redundant safety
        // measure here, but matters a lot if `word' ends
        // before a quantifier or something.
        String word = "(?:[a-z*](?:[a-z0-9\\-*]*[a-z0-9*])?)";

        // One or more words each followed by a period, then a final word.
        String words = "(?:" + word + "[.])+" + word;

        // String.matches succeeds only if the whole argument matches.
        for (int k = 0; k < args.length; ++k) {
            System.out.println(args[k].matches(words));
        }
    }
}

It seems to work. One or more words ending in a period, and then one
more word, where a word starts with ...

I'm not sure if the escape is needed for `-' in a character class, and
Sun does not seem to tell. It appears to work with or without.
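
The fourth suggestion, splitting one unwieldy expression into separate tests, might look roughly like this. It is a sketch under assumptions: the class name `SplitChecks` and the method `looksLikeDomain` are invented, and the rules (lower-case letters, digits, `*`, `-` and `.` only; at least two labels; each label starting with a letter or `*` and not ending in `-`) are chosen to mirror the `word` pattern above rather than any specification.

```java
class SplitChecks {
    // Hypothetical helper combining several small checks instead of one large regex.
    static boolean looksLikeDomain(String s) {
        // Test 1: only the allowed characters appear anywhere.
        if (!s.matches("[a-z0-9*.-]+")) {
            return false;
        }
        // Test 2: at least two dot-separated labels (limit -1 keeps trailing empty labels).
        String[] labels = s.split("\\.", -1);
        if (labels.length < 2) {
            return false;
        }
        // Test 3: each label starts with a letter or * and does not end with a hyphen.
        for (String label : labels) {
            if (!label.matches("[a-z*].*") || label.endsWith("-")) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"a.com", "*.example.com", "a..com", "a-.com", "com"}) {
            System.out.println(s + " -> " + looksLikeDomain(s));
        }
    }
}
```

Each check is simple enough to read on its own, at the cost of scanning the input more than once.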
 
Lew

Jussi said:
I'm not sure if the escape is needed for `-' in a character class, and
Sun does not seem to tell. It appears to work with or without.

You don't need to escape the '-' in a character class if it's the first or
last character indicated:

[a-z] matches any character from 'a' to 'z'.
[a\-z] matches 'a', 'z' or '-'.
[az-] matches 'a', 'z' or '-'.

- Lew
 
Jussi Piitulainen

Lew said:
Jussi said:
I'm not sure if the escape is needed for `-' in a character class,
and Sun does not seem to tell. It appears to work with or without.

You don't need to escape the '-' in a character class if it's the
first or last character indicated:

[a-z] matches any character from 'a' to 'z'.
[a\-z] matches 'a', 'z' or '-'.
[az-] matches 'a', 'z' or '-'.

Or otherwise at a point where it does not form a range, when read from
left to right: [a-z-*]. Apparently.

It seems to be that way, but this is not documented. At least I can't
find it stated in Sun's documentation of java.util.regex.Pattern, 1.5,
which is otherwise rather thorough.
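
The hyphen cases discussed above are easy to try out. The following is a sketch with an invented class name (`HyphenInClass`) and helper (`m`); the last case reflects the behaviour observed in this thread rather than a documented guarantee.

```java
class HyphenInClass {
    // Hypothetical helper: does the whole input match the regex?
    static boolean m(String regex, String input) {
        return input.matches(regex);
    }

    public static void main(String[] args) {
        System.out.println(m("[a-z]", "m"));    // range from a to z
        System.out.println(m("[a-z]", "-"));    // hyphen is not in the range
        System.out.println(m("[a\\-z]", "-"));  // escaped hyphen is a literal
        System.out.println(m("[a\\-z]", "m"));  // no range is formed
        System.out.println(m("[az-]", "-"));    // trailing hyphen is a literal
        System.out.println(m("[a-z-*]", "-"));  // hyphen right after a complete range
        System.out.println(m("[a-z-*]", "*"));
    }
}
```

Escaping the hyphen, as in `[a-z0-9\\-*]`, remains the safest spelling when the class contains other ranges.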
 
