Hi Clinton,
Regular Expressions are a bear to learn, even if you have good tools to
work with them. I've spent hours working out a relatively "simple" one
(at least it seemed simple at first), but I learn a bit more with each
hour. Still, I'm a long way from being an expert. I can read most of it fairly
well by now, but certain concepts are still a bit difficult to deal
with. I still struggle some with Lookarounds in particular. One thing to
keep in mind is that Regular Expressions consume a string as they move
through it, with a few exceptions (like Lookarounds). They are basically
sequential in nature.
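The point about consuming versus looking around can be seen in a small sketch. The sample string "foobar" and the class and method names are made up for illustration; only the standard .NET Regex class is assumed.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaheadDemo
{
    // Returns how many characters of the input the first match consumed.
    public static int ConsumedLength(string input, string pattern)
    {
        return Regex.Match(input, pattern).Length;
    }

    public static void Main()
    {
        // Plain pattern: all six characters become part of the match.
        Console.WriteLine(ConsumedLength("foobar", "foobar"));      // 6

        // Lookahead: "bar" must follow, but it is only checked, not consumed.
        Console.WriteLine(ConsumedLength("foobar", "foo(?=bar)"));  // 3

        // Because the lookahead consumed nothing, "bar" can still be matched after it.
        Console.WriteLine(Regex.Match("foobar", "foo(?=bar)bar").Value); // foobar
    }
}
```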
You may find the "Analyze" tool helpful with this sort of thing.
Fortunately, I have not 2 but THREE Regular Expression tools to work
with (2 of them are Freeware), which enables me to use the one(s) that
are best for the particular type of work I need regarding any individual
Regular Expression and/or problem with one.
The expression you posted,
\w*@\w*\.\w*((\.\w*)*)?
can be analyzed in so many words as follows (the right-hand column shows
the part of the email address matched at each step):
Match any word character, zero or more times            \w*        someone
Next, match the '@' character once                      @          @
Next, match any word character zero or more times       \w*        somewhere
Next, match the '.' character once                      \.         .
Next, match any word character zero or more times       \w*        com
Next, put the following into Group 1 zero or 1 time     (......)?
  Match the following into Group 2 zero or more times   (......)*
    Match the '.' character once                        \.
    Match any word character zero or more times         \w*
Result of Group 1   (\.\w*)*   Group 2 (Nothing)
Result of Group 2   \.\w*      Nothing
Basically, there is no match for either Group 1 or Group 2, as the '.'
has been consumed by the previous Match. However, as both Groups specify
a minimum of Zero times, they don't disqualify the Match, as they appear
zero times each.
Why does Expresso report Group 1 at position 32 (end of string)? Well,
no match has been returned prior to the end of the string. So, that's
where the null match begins. Why does Expresso report Group 2 at
position 0? Well, I'm not that good with it!
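For what it's worth, the positions can be reproduced outside Expresso with the .NET Regex class. This is only a sketch: the sample address is assembled from the pieces in the analysis above (the real one was removed from the thread), and the class and helper names are made up.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupPositions
{
    public const string Pattern = @"\w*@\w*\.\w*((\.\w*)*)?";

    // Hypothetical sample string; the address in the thread was removed.
    public const string Sample = "An example someone@somewhere.com of an email address.";

    // Formats the three properties that a tool would typically display for a group.
    public static string Describe(Group g)
    {
        return string.Format("Success={0} Index={1} Length={2}", g.Success, g.Index, g.Length);
    }

    public static void Main()
    {
        Match m = Regex.Match(Sample, Pattern);
        Console.WriteLine("Match:   " + Describe(m) + " Value=" + m.Value);
        Console.WriteLine("Group 1: " + Describe(m.Groups[1]));
        Console.WriteLine("Group 2: " + Describe(m.Groups[2]));
    }
}
```

With this sample, Group 1 should come back as a successful zero-length capture at index 32 (right after "com"), while Group 2 never takes part in the match. In .NET a group that did not participate has Success == false and reports Index 0 and Length 0, which may be all that the "Position 0 Length 0" entry means.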
Still, your regular expression is a bit lax in terms of standards. We
worked one up for valid email addresses the other day, and you may want
to borrow it:
(?i)([-.\w]+)\@(?:((?:\d{1,3}\.){3}\d{1,3})|([-a-z0-9]+(?:\.[-a-z0-9]+)*)\.((?:com|edu|gov|int|mil|net|org|biz|info|name|museum|coop|aero|[a-z]{2})))
It is case-insensitive, and matches both domain name and IP domain email
addresses. It puts the results into 4 possible groups:
1-User Name, 2-Domain IP Address, 3-Domain Name, 4-Root Domain.
Note that groups 2 and (3,4) are exclusive of one another. The email
address can either be an IP address, or a named domain, but not both. It
supports 2-letter country suffixes, and multiple-dot domain addresses.
And, as noted, it is case-insensitive.
I'm not sure we covered all the possible permutations, but it's pretty
strong.
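A rough sketch of how the four groups could be read out in C#. Note the assumption: the pattern as posted appears to have lost a couple of characters after "\@", and "(?:(" is a guess at what was there, chosen so that the parentheses balance and the groups line up with the numbering given. The addresses and the class and method names are made up.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EmailParts
{
    // Assumed reconstruction: the "(?:(" right after "\@" is a guess at lost characters.
    public const string Pattern =
        @"(?i)([-.\w]+)\@(?:((?:\d{1,3}\.){3}\d{1,3})|" +
        @"([-a-z0-9]+(?:\.[-a-z0-9]+)*)\.((?:com|edu|gov|int|mil|net|org|biz|info|name|museum|coop|aero|[a-z]{2})))";

    // Returns { user name, domain IP, domain name, root domain }, or null when nothing matches.
    // Groups that did not take part in the match come back as empty strings.
    public static string[] Split(string address)
    {
        Match m = Regex.Match(address, Pattern);
        if (!m.Success) return null;
        return new[] { m.Groups[1].Value, m.Groups[2].Value, m.Groups[3].Value, m.Groups[4].Value };
    }

    public static void Main()
    {
        // Both addresses are invented for the example.
        foreach (string address in new[] { "Some.One@Mail.Example.com", "someone@192.168.0.1" })
        {
            string[] parts = Split(address);
            Console.WriteLine("{0} -> user='{1}' ip='{2}' domain='{3}' root='{4}'",
                address, parts[0], parts[1], parts[2], parts[3]);
        }
    }
}
```

For the named-domain address, groups 3 and 4 should be filled and group 2 empty; for the IP address it should be the other way around.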
--
HTH,
Kevin Spencer
Microsoft MVP
.Net Developer
Ambiguity has a certain quality to it.
Basically, the whole string has been consumed by the
Hello Kevin,
Well I'm bright eyed but not so bushy-tailed this morning. Thanks for
working this out. It's one of those 'must know' issues one needs to be
concerned with when generating valid XML from an application. I'll be
working with it later today and I'm starting to get a feel for Expresso
which I have a question about. I'm at the point where I've almost come
to understand how expressions are actually processed which -- for me --
means I will understand how I need to think to put them together.
You've been a real help again and your source is an inspiration which
shows how elegant self-documenting code can be.
As for the Expresso question, what is 1:? supposed to indicate? (Noting
that's the closest I could come at the moment to replicating the
rectangular 'non-printable' character Expresso uses to indicate some
'thing' it has matched.) In the following simple example it seems to
match a white space, although in a confusing manner, as I will point
out. But in other examples, with many more characters and white space
in the string to be matched, I have counted the position where the ? is
said to be matched, and the position reported does not fall on a white
space at all.
// Expression
\w*@\w*\.\w*((\.\w*)*)?
// String to match
An example (e-mail address removed) of an email address.
Expresso reports 1:? at Position 32 Length 0, which implies white space
in the simple example as given. There were white space characters
before the matched characters, though, which motivates one to ask why
Expresso would ignore those and then report 2:? at Position 0 Length 0.
That suggests the parser returned to the beginning of the string to be
matched and found what?
Is this clear as mud or what?
<%= Clinton Gallagher
Hi Clinton,
The following Regular Expression will give you the ability to do a
Regex.Replace on a string containing both single "&" characters and
"&amp;" strings. It captures the "&amp;" strings into their own
separate matches, and the "&" characters into their own matches,
putting the "&" characters into a Group. It is also case-insensitive:
(?i)[^&][^&]*|&amp;|(&(?!amp;))
Here's some sample code for replacing the single "&" characters with
"&amp;":
// Requires: using System.Text.RegularExpressions;

/// <summary>
/// Replaces Ampersand in a Match with "&amp;"
/// </summary>
/// <param name="m">Match</param>
/// <returns>Replaced Match value</returns>
public static string ampReplacer(Match m)
{
    // Group 1 only captures a bare "&"; every other match passes through unchanged.
    if (m.Groups[1].Captures.Count == 0) return m.Value;
    return m.Value.Replace("&", "&amp;");
}

/// <summary>
/// Replaces all single Ampersand characters in a string with "&amp;"
/// </summary>
/// <param name="s">String to process</param>
/// <returns>Processed String</returns>
public static string ReplaceAmpersand(string s)
{
    return Regex.Replace(s, @"(?i)[^&][^&]*|&amp;|(&(?!amp;))",
        new MatchEvaluator(ampReplacer));
}
The "ampReplacer" function is the function passed as the MatchEvaluator
delegate in the Regex.Replace() method used in the "ReplaceAmpersand"
method. The "ReplaceAmpersand" method takes a string as an argument,
and uses Regex.Replace to replace all matches in the string that
contain a value in Groups[1] with "&amp;".
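If the only goal is to leave existing "&amp;" strings alone and escape every other "&", a single negative lookahead may be enough. The following is a minimal alternative sketch, not the MatchEvaluator approach described above; the class and method names are made up, and note that other entities such as "&lt;" would have their "&" escaped as well.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AmpersandEscaper
{
    // Matches "&" only when it is not already the start of "&amp;" (in any letter case).
    private static readonly Regex BareAmpersand =
        new Regex(@"&(?!amp;)", RegexOptions.IgnoreCase);

    // Replaces each bare "&" with "&amp;" and leaves existing "&amp;" untouched.
    public static string Escape(string s)
    {
        return BareAmpersand.Replace(s, "&amp;");
    }

    public static void Main()
    {
        // Made-up title containing one bare and one already-escaped ampersand.
        Console.WriteLine(Escape("Lawn Mowers, Repairs & Services &amp; Parts"));
        // Lawn Mowers, Repairs &amp; Services &amp; Parts
    }
}
```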
As a side note, I used both Expresso and Regex Buddy to come up with
this. It was indeed a challenge, as I'm not quite a master of Regular
Expressions. But I enjoy learning, so it was a good exercise for me!
--
HTH,
Kevin Spencer
Kevin, have you ever heard the expression "preaching to the choir?"
I've got the basic pattern matching theory understood, but it's the use
of expressions to disallow or replace certain characters and/or
strings that I'm trying to really understand thoroughly. The
following example illustrates...
// Example
Lawn Mowers, Repairs & Services - lawnmowers.com
A typical page title that, when entered into a TextBox meant to
capture string data for an RSS 2.0 title element, should use &amp;
instead of the & to represent the ampersand. I've got an expression
that works well for the example but can't figure out (with the
expression I have) how to match the & and replace it with &amp;
(yet) -- or -- how to use the expression I have to force the 2.0
Regular Expression Validator to fail when the & is present in the
string.
// Expression
[a-z]+([a-z0-9-]*[a-z0-9]+)?(\.([a-z]+([a-z0-9-]*[a-z0-9]+)?)+)*
I also really appreciate Expresso's Analyzer. It is outstanding that
Expresso seems to make it easy for us to pick expressions apart piece
by piece and explain them in English.
<%= Clinton Gallagher
Hi Juan,
The kind of RegEx tool I'd like is one which can take a string
I write, and create a RegEx expression which matches it.
The problem with that is that you can write a Regular Expression
that matches a literal string quite easily. For example:
literal string
The above is a regular expression which will match the substring
"literal string" in my first sentence. Of course, the real power of
regular expressions is the ability to match *patterns* in a string,
perform grouping, etc. So, like any programming language (which it
is, in a sense), Regular Expressions have a shorthand syntax that
allows one to create patterns of a large variety of types. A simple
example of this would be:
(literal) (string)
This captures the same match as the first, but puts the string
"literal" into a group, and the string "string" into a second group.
But of course, we have already exceeded your desired requirement. On
the other hand, we have made a regular expression that is perhaps
more useful (in some situations) than the first.
And of course, the possible types and combinations of patterns are
almost endless, including wildcard patterns, special characters,
boolean rules, and so on.
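The difference between the plain literal and the grouped version can be sketched in C# like this. The input sentence and the class and method names are only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LiteralGroups
{
    // Runs the grouped pattern from the post against an arbitrary input.
    public static Match Find(string input)
    {
        return Regex.Match(input, "(literal) (string)");
    }

    public static void Main()
    {
        Match m = Find("you can write a Regular Expression that matches a literal string quite easily");
        Console.WriteLine(m.Value);            // literal string
        Console.WriteLine(m.Groups[1].Value);  // literal
        Console.WriteLine(m.Groups[2].Value);  // string
    }
}
```

The overall match is the same text the ungrouped pattern would find; the parentheses only add the two numbered groups on top of it.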
Yeah, it's like reading some kind of incredibly concise shorthand
code, without even line breaks or brackets to help. That's why I was
so pleased to see that Expresso allows you to break your regular
expression across multiple lines while building it. That helps a
good bit!
--
HTH,
Kevin Spencer
The kind of RegEx tool I'd like is one which can take a string
I write, and create a RegEx expression which matches it.
*That* will be the RegEx tool that will corner the market.
Thanks Kevin. I saw that post too and am going to download
Expresso in a few minutes. I know you don't need to be psychic to
figure out what I'm likely to be asking next.
<%= Clinton Gallagher
I saw a response to this question in the CSharp group, regarding a
product named "Expresso":
http://www.ultrapico.com/Expresso.htm
Expresso is .Net freeware, and after downloading, installing, and
playing with it, I'd give it a try! So far I have found it to be
excellent, having capabilities that Regex Buddy does not have,
and a much more intuitive GUI.
--
HTH,
Kevin Spencer
Hi Clinton,
Yes, I have it. I previously used the freeware Regex Coach
Utility, but it is nowhere near as complete in its support for
various newer Regular Expression syntax and programming
languages in general. It did have one nice feature about it. You
could split a Regular Expression across multiple lines, which
often made it easier to analyze. However, Regex Buddy has the
graphical tree view, and it is synchronized with the Regular
Expression itself, which more than makes up for the omission of
breaking a Regular Expression across multiple lines.
BTW, it also has a GREP utility built in.
In short, it is well worth the 30 bucks.
--
HTH,
Kevin Spencer
I was looking at PowerGrep from the same dev group, but as with
Regex Buddy I don't like the buy-before-you-try business model,
so that choice has to be on the shelf for the moment. Thanks
for bringing it up, though. I assume you've used Regex Buddy?
<%= Clinton Gallagher
Regex Buddy is very good. It costs around $30.00, includes
quite a few nice features, including the ability to copy
regular expressions in various language string syntaxes,
including C#. It has the ability to create libraries of
regular expressions, a nice visual builder, color-coding, and
quite a bit more. Good testing environment. And it has some
nice reference material included.
--
HTH,
Kevin Spencer
"clintonG" wrote:
I'm using an .aspx tool I found at [1] but as nice as the
interface is I think I need to consider using others. Some
can generate C# I understand. Your preferences please...
<%= Clinton Gallagher
[1] http://forta.com/books/0672325667/