regex

Horatiu Stanciu

Hi list

I want to validate a string that should respect the following rules:

a. Should not contain two or more consecutive dots
b. Any ASCII graphic (printing) character may appear except:
@ \ " , [ ]
The above excluded characters are also allowed if they are quoted either by
using a backslash ("\") before each excluded character or by surrounding the
entire local-part that contains one or more excluded character(s) with
double-quote characters.

I tried to implement this with regular expressions, but without success.

Thank you,
H
 
Carl Howells

Post what you tried, and we'll let you know what mistake you may have made.
 
Alan Moore

So if there are three double-quotes in a row, should they be interpreted
as a single, escaped double-quote? No, that would make the job
prohibitively complex (if not impossible), so I'll assume that a
double-quote can only be escaped with a backslash.


String regex = "(?:"
    + "\"(?:[^\"\\\\]++|\\\\.)*\""  // any quoted chars, including
                                    // escaped quotes
    + "|\\\\."                      // backslash plus anything
    + "|\\.(?!\\.)"                 // dot, if not followed by a dot
    + "|[^.@\"\\[\\]\\\\]++"        // any non-special chars
    + ")*";


Are you actually likely to see any non-ASCII or control characters?
Because if you really have to limit the match to printing ASCII chars,
the regex becomes about three times as ugly as it already is. In
fact, you would probably be better off making two passes:


if (str.matches("\\p{Graph}*") && str.matches(regex))
...


Of course, if you're going to be doing a lot of validating, you'll
want to precompile the regexes:


private Pattern p1 = Pattern.compile("\\p{Graph}*");
private Pattern p2 = Pattern.compile(regex);

...

if (p1.matcher(str).matches() && p2.matcher(str).matches())
...
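
As a rough, self-contained sketch of the two-pass check with precompiled patterns: the class name TwoPassValidator and the method isValid are made up for illustration, and the quoted-section alternative is written with the opening quote outside the inner group, which is an assumption about the intended pattern.

```java
import java.util.regex.Pattern;

class TwoPassValidator {
    // Pass 1: only ASCII graphic characters (note: this excludes the space).
    private static final Pattern GRAPHIC = Pattern.compile("\\p{Graph}*");

    // Pass 2: the structural rules (quoting, escapes, no consecutive dots).
    private static final Pattern RULES = Pattern.compile(
        "(?:"
        + "\"(?:[^\"\\\\]++|\\\\.)*\""  // quoted section, escapes allowed inside
        + "|\\\\."                      // backslash plus any character
        + "|\\.(?!\\.)"                 // dot, if not followed by a dot
        + "|[^.@\"\\[\\]\\\\]++"        // run of non-special characters
        + ")*");

    // True only if the whole string passes both checks.
    static boolean isValid(String str) {
        return GRAPHIC.matcher(str).matches() && RULES.matcher(str).matches();
    }

    public static void main(String[] args) {
        String[] samples = { "john.doe", "john..doe", "john doe", "\"a@b\"", "a@b" };
        for (String s : samples) {
            System.out.println(s + " -> " + isValid(s));
        }
    }
}
```

Because Pattern objects are immutable, they can be compiled once and shared; only the Matcher is created per input.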
 
Mark Wright

One joyful day (Thu, 09 Sep 2004 07:49:03 GMT to be precise), Alan Moore said:
So if there are three double-quotes in a row, should they be interpreted
as a single, escaped double-quote? No, that would make the job
prohibitively complex (if not impossible), so I'll assume that a
double-quote can only be escaped with a backslash.

Although this seems like a case of doing somebody's homework for them,
I'll jump in here since your assertion seems incorrect.

Three double-quotes in a row is easily accommodated by:

\".+\"

Placing this at the start of the regexp group will allow it to contain
any number of " characters since the final one will be used (and
necessary) to complete the pattern match. Regular expressions are, by
default, greedy but exhaustive. That is, they will eat as much as they
can whilst allowing the pattern to match.
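
A small sketch of that greedy-then-backtrack behaviour, using only java.util.regex (the class GreedyQuoteDemo and the helper firstMatch are hypothetical names chosen for this illustration):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyQuoteDemo {
    // Returns the first substring matched by the regex, or null if none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Three double-quotes in a row: .+ first grabs everything up to the
        // end of the input, then backs off until a closing quote can match.
        System.out.println(firstMatch("\".+\"", "a \"\"\" b"));

        // An empty quoted section is not matched, because .+ needs
        // at least one character between the quotes.
        System.out.println(firstMatch("\".+\"", "\"\""));
    }
}
```

The first call matches all three quote characters; the second finds no match at all.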

The first predicate is achieved via lookahead as you said by:

\\.(?!\\.)

The backslash escape can't use lookbehind for the \ since this would
require allowing it in a non-escape context (it must first be allowed
before it can be used as a lookbehind), so a normal match is required:

\\\\[@\"\\\\\\[\\]]

Giving a regexp of:

"^(?:" +
"\".+\"" + "|" +
"\\\\[@\"\\\\\\[\\]]" + "|" +
"\\.(?!\\.)" + "|" +
"[^@\"\\\\\\[\\]\\.]" +
")*$"


And a test demo:

String test_data[] = new String[]
{
    "This is a pass test",
    "This is a .. fail test",
    "This @ is also a fail test",
    "This is a \\@ \\\" pass test",
    "This is also a \"@[]\"\"@[\"]\" \\@ \\[ \\] pass test",
    "This is a \"\"\" \\\" pass test",
    "This is a \"#ABC[]blah\" pass test",
    "This is a [fail] test"
};

Pattern pattern = Pattern.compile(
    "^(?:" +
    "\".+\"" + "|" +
    "\\\\[@\"\\\\\\[\\]]" + "|" +
    "\\.(?!\\.)" + "|" +
    "[^@\"\\\\\\[\\]\\.]" +
    ")*$");

System.out.println("Trying data...");
for (int i = 0; i < test_data.length; ++i)
{
    // Match each test string in full against the pattern.
    String test_string = test_data[i];
    System.out.print(" Testing: '" + test_string + "': ");
    System.out.println(pattern.matcher(test_string).matches() ?
        "Pass" : "Fail");
}

System.out.println("...done");



Mark Wright
- (e-mail address removed)

================Today's Thought====================
"In places where books are burned, one day,
people will be burned" - Heinrich Heine, Germany -
100 years later, Hitler proved him right
===================================================
 
Alan Moore

Mark said:
Although this seems like a case of doing somebody's homework for them,
I'll jump in here since your assertion seems incorrect.

Three double-quotes in a row is easily accommodated by:

\".+\"

Placing this at the start of the regexp group will allow it to contain
any number of " characters since the final one will be used (and
necessary) to complete the pattern match. Regular expressions are, by
default, greedy but exhaustive. That is, they will eat as much as they
can whilst allowing the pattern to match.

If the input contains more than one quote-escaped sequence, like

"xxx\"quoted\"yyy\"more quoted\"zzz"

...the \".+\" will gobble up everything from the first quote to the
last one, meaning the 'yyy' part won't be validated correctly. You
have to either require quotes to be escaped within quoted sequences,
or forbid them. Whether other backslash escapes are permitted in
quoted sequences is another question; if not, that part of the regex
should read

"\"(?:[^\"\\\\]++|\\\\\")*\""

...but I think my "escaped anything" approach is probably okay.
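
The difference between the two quoted-section patterns can be checked directly. The sketch below is only an illustration: the class and method names are made up, the input strings are hypothetical, and the second pattern simply swaps the "escaped anything" alternative into the same surrounding regex.

```java
import java.util.regex.Pattern;

class QuotedSectionDemo {
    // Quoted sections matched with the greedy \".+\" alternative.
    static final Pattern GREEDY = Pattern.compile(
        "^(?:"
        + "\".+\"" + "|"
        + "\\\\[@\"\\\\\\[\\]]" + "|"
        + "\\.(?!\\.)" + "|"
        + "[^@\"\\\\\\[\\]\\.]"
        + ")*$");

    // Quoted sections that end at the first unescaped double-quote.
    static final Pattern BOUNDED = Pattern.compile(
        "^(?:"
        + "\"(?:[^\"\\\\]++|\\\\.)*\"" + "|"
        + "\\\\[@\"\\\\\\[\\]]" + "|"
        + "\\.(?!\\.)" + "|"
        + "[^@\"\\\\\\[\\]\\.]"
        + ")*$");

    static boolean greedyAccepts(String s)  { return GREEDY.matcher(s).matches(); }
    static boolean boundedAccepts(String s) { return BOUNDED.matcher(s).matches(); }

    public static void main(String[] args) {
        // Two quoted sections with "..@" between them, outside any quotes.
        String bad = "\"a\"..@\"b\"";
        System.out.println(greedyAccepts(bad));   // the greedy version lets it through
        System.out.println(boundedAccepts(bad));  // the bounded version rejects it

        // A well-formed input is accepted by both.
        String good = "\"a\".x\"b\"";
        System.out.println(greedyAccepts(good) && boundedAccepts(good));
    }
}
```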

Mark said:
The backslash escape can't use lookbehind for the \ since this would
require allowing it in a non-escape context (it must first be allowed
before it can be used as a lookbehind), so a normal match is required:

Lookbehind? I never used lookbehind. But I did overlook that
backslash is only allowed if it's escaping another special character,
which means "\\\\." isn't correct. Also, I just noticed the comma in
the list of special characters. Adding it in yields

"\\\\[@,\"\\[\\]\\\\]"

Also, I was assuming that a dot could be escaped to get around the "no
consecutive dots" rule, but maybe that isn't the case.

This gives a regex of:

"^(?:" +
"\"(?:[^\"\\\\]++|\\\\.)*\"" + "|" +
"\\\\[@,\"\\[\\]\\\\]" + "|" +
"\\.(?!\\.)" + "|" +
"[^@,.\"\\[\\]\\\\]++" +
")*$"

With these changes, your fifth and sixth tests fail instead of
passing, and I think that's correct. Maybe the OP can give us a
ruling, though.
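
For anyone who wants to reproduce that, here is a minimal harness, assuming the regex exactly as written above and the eight test strings from the earlier demo (the class name RevisedRegexCheck and the method run are invented for this sketch):

```java
import java.util.regex.Pattern;

class RevisedRegexCheck {
    static final Pattern PATTERN = Pattern.compile(
        "^(?:" +
        "\"(?:[^\"\\\\]++|\\\\.)*\"" + "|" +   // quoted section
        "\\\\[@,\"\\[\\]\\\\]" + "|" +         // escaped special character
        "\\.(?!\\.)" + "|" +                   // dot not followed by a dot
        "[^@,.\"\\[\\]\\\\]++" +               // run of non-special characters
        ")*$");

    static final String[] TEST_DATA = {
        "This is a pass test",
        "This is a .. fail test",
        "This @ is also a fail test",
        "This is a \\@ \\\" pass test",
        "This is also a \"@[]\"\"@[\"]\" \\@ \\[ \\] pass test",
        "This is a \"\"\" \\\" pass test",
        "This is a \"#ABC[]blah\" pass test",
        "This is a [fail] test"
    };

    // Returns one pass/fail flag per test string, in order.
    static boolean[] run() {
        boolean[] results = new boolean[TEST_DATA.length];
        for (int i = 0; i < TEST_DATA.length; i++) {
            results[i] = PATTERN.matcher(TEST_DATA[i]).matches();
        }
        return results;
    }

    public static void main(String[] args) {
        boolean[] results = run();
        for (int i = 0; i < results.length; i++) {
            System.out.println((results[i] ? "Pass" : "Fail") + ": " + TEST_DATA[i]);
        }
    }
}
```

The fifth string is rejected at the unquoted ] that follows the second quoted section, and the sixth at the third double-quote, which opens a section that is never closed.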
 
Mark Wright

One joyful day (Thu, 09 Sep 2004 19:55:21 GMT to be precise), Alan Moore said:
If the input contains more than one quote-escaped sequence, like

"xxx\"quoted\"yyy\"more quoted\"zzz"

...the \".+\" will gobble up everything from the first quote to the
last one, meaning the 'yyy' part won't be validated correctly.

Good point!
Alan said:
You have to either require quotes to be escaped within quoted sequences,
or forbid them. Whether other backslash escapes are permitted in
quoted sequences is another question; if not, that part of the regex
should read

"\"(?:[^\"\\\\]++|\\\\\")*\""

...but I think my "escaped anything" approach is probably okay.

Probably. Escaping is also a convenient way to resolve potential
ambiguities for the author/reader.

In any case, lookbehind should allow escaped quotes, with a reluctant *
quantifier to allow empty strings (something else I missed before) and a
separate independent group to isolate the quoted sections:

"(?>\".*?(?<!\\\\)\")"

Without the independent group the regex will allow this to pass:

"This is a \"#ABC\"..[]blah \".\" fail test"

Alan said:
Lookbehind? I never used lookbehind.

No, I didn't mean to imply you did. I was just pre-empting the
apparently obvious use of lookbehind for an escaped sequence.

Alan said:
With these changes, your fifth and sixth tests fail instead of
passing, and I think that's correct. Maybe the OP can give us a
ruling, though.

Indeed. Clearly defined problems are essential with regular expressions
in order to avoid the symbol spaghetti required for ambiguous logic.

In any case, assuming the fifth and sixth cases should fail, this is the
result:

"^(?:" +
"(?>\".*?(?<!\\\\)\")" + "|" +
"\\\\[@\"\\\\\\[\\]]" + "|" +
"\\.(?!\\.)" + "|" +
"[^@\"\\\\\\[\\]\\.]" +
")*$"
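
To see what the independent group buys, the following sketch builds the regex twice, once with the (?>...) group and once with an ordinary non-capturing group, and runs both on the failing example above. The class AtomicGroupDemo and its build helper are hypothetical names used only for this comparison.

```java
import java.util.regex.Pattern;

class AtomicGroupDemo {
    // Builds the full regex around the given quoted-section alternative.
    static Pattern build(String quoted) {
        return Pattern.compile(
            "^(?:"
            + quoted + "|"
            + "\\\\[@\"\\\\\\[\\]]" + "|"
            + "\\.(?!\\.)" + "|"
            + "[^@\"\\\\\\[\\]\\.]"
            + ")*$");
    }

    // Independent (atomic) group: once a quoted section has matched,
    // the engine cannot go back and stretch it to a later quote.
    static final Pattern ATOMIC = build("(?>\".*?(?<!\\\\)\")");

    // Ordinary group: backtracking may extend the reluctant .*? further.
    static final Pattern PLAIN = build("(?:\".*?(?<!\\\\)\")");

    public static void main(String[] args) {
        String input = "This is a \"#ABC\"..[]blah \".\" fail test";
        System.out.println("atomic: " + ATOMIC.matcher(input).matches());
        System.out.println("plain:  " + PLAIN.matcher(input).matches());
    }
}
```

With the ordinary group the engine backtracks into the first quoted section and extends it up to the last double-quote, swallowing the ..[] in between, so the input is accepted; the atomic version rejects it.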

Mark Wright
- (e-mail address removed)

================Today's Thought====================
"In places where books are burned, one day,
people will be burned" - Heinrich Heine, Germany -
100 years later, Hitler proved him right
===================================================
 
