simple regex pattern sought


Roedy Green

I often have to search for things of the form

"xxxxx"
or
'xxxxx'

where xxx is anything not " or '. It might be Russian or English or
any other language.

What is the cleanest way to do that?
--
Roedy Green Canadian Mind Products
http://mindprod.com
I would be quite surprised if the NSA (National Security Agency)
did not have a computer program to scan bits of shredded
documents and electronically put them back together like a giant
jigsaw puzzle. This suggests you cannot just shred, you must also burn.
 

markspace

I often have to search for things of the form

"xxxxx"
or
'xxxxx'

where xxx is anything not " or '. It might be Russian or English or
any other language.

What is the cleanest way to do that?


Would this work?

'[^']+'|"[^"]+"
 

Lew

Roedy said:
I often have to search for things of the form

"xxxxx"
or
'xxxxx'

where xxx is anything not " or '. It might be Russian or English or
any other language.

What is the cleanest way to do that?

A regex like "[\"'][^\"']+[\"']" is one way. The cleanest? I don't know.
 

Lew

Roedy said:
I often have to search for things of the form

"xxxxx"
or
'xxxxx'

where xxx is anything not " or '. It might be Russian or English or
any other language.

What is the cleanest way to do that?

Use a regex like "[\"'][^\"']+[\"']" is one way. The cleanest? I don't know.

"([\"'])[^\"']+\\1"

That way you match the opening quote.

(The extra backslashes escape characters inside the Java string literal; the regex engine itself sees (["'])[^"']+\1.)
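A minimal sketch of the back-reference idea, to make the two levels of escaping concrete. The class name BackrefQuoteDemo and the helper findAll are made up for illustration; only java.util.regex is assumed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackrefQuoteDemo {
    // Java source literal. After the compiler strips one level of
    // backslashes, the regex engine receives: (["'])[^"']+\1
    static final Pattern QUOTED = Pattern.compile("([\"'])[^\"']+\\1");

    /** Collects every substring of text that the pattern finds. */
    static List<String> findAll(Pattern p, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // Prints the pattern exactly as the regex engine sees it.
        System.out.println(QUOTED.pattern());
        // The closing quote must equal the opening one, so 'oops" is skipped.
        System.out.println(findAll(QUOTED, "say 'hi' or \"bye\" but not 'oops\""));
    }
}
```

Printing pattern() is a quick way to check what survived the string-literal escaping before blaming the regex itself.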
 

markspace

Use a regex like "[\"'][^\"']+[\"']" is one way. The cleanest? I don't know.

This would match "John's restaurant" as "John'.

The first quote matches ", John does not contain either ' or " as
specified, and the last character class matches the '. Not I think what
is wanted.
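The mismatch described above is easy to reproduce. This is only a small sketch; MismatchedQuoteDemo and firstMatch are illustrative names, and the sample sentence is invented.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MismatchedQuoteDemo {
    // Either quote may open and either quote may close.
    static final Pattern EITHER = Pattern.compile("[\"'][^\"']+[\"']");
    // One alternative per quote type, so the delimiters always pair up.
    static final Pattern ALTERNATION = Pattern.compile("'[^']+'|\"[^\"]+\"");

    /** Returns the first match of p in text, or null if there is none. */
    static String firstMatch(Pattern p, String text) {
        Matcher m = p.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "Try \"John's restaurant\" tonight";
        System.out.println(firstMatch(EITHER, text));      // stops at the apostrophe
        System.out.println(firstMatch(ALTERNATION, text)); // whole quoted phrase
    }
}
```

With this input the first pattern yields "John' while the alternation yields "John's restaurant".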
 

Robert Klemme

Roedy said:
I often have to search for things of the form

"xxxxx"
or
'xxxxx'

where xxx is anything not " or '. It might be Russian or English or
any other language.

What is the cleanest way to do that?

Use a regex like "[\"'][^\"']+[\"']" is one way. The cleanest? I don't know.

That does not match quoting properly. Better do something like

"([\"'])[^\"']*\\1"

Still I prefer

"\"[^\"]*\"|'[^']*'"

Because it allows for quotes of the other type inside quotes.

With proper escaping (using \ as escape char, any other works, too) this
becomes

"\"(?:\\\\.|[^\\\"])*\"|'(?:\\\\.|[^\\'])*'"

Kind regards

robert


package rx;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Quotes {

private static final Pattern Q1 = Pattern.compile("([\"'])[^\"']*\\1");
private static final Pattern Q2 = Pattern.compile("\"[^\"]*\"|'[^']*'");
private static final Pattern Q3 =
Pattern.compile("\"(?:\\\\.|[^\\\"])*\"|'(?:\\\\.|[^\\'])*'");

public static void main(String[] args) {
System.out.println(Q1);
for (final Matcher m = Q1.matcher("'a' \"b\" 'c'"); m.find();) {
System.out.println(m.group());
}

System.out.println(Q2);
for (final Matcher m = Q2.matcher("'a' \"b\" 'c'"); m.find();) {
System.out.println(m.group());
}

System.out.println(Q3);
for (final Matcher m = Q3.matcher("'a' \"\\\"b\" 'c'"); m.find();) {
System.out.println(m.group());
}
}

}
 

markspace

"\"(?:\\\\.|[^\\\"])*\"|'(?:\\\\.|[^\\'])*'"


This looks overly baroque to me. You don't need to escape \ single
quotes ' in a Java string, and I don't think you need to in a regex
either (although I didn't check that). I'm also not seeing the need for
the parenthesis around the character classes [] (but again, without
having tried it, I could be wrong). And the dot . inside the
parenthesis just looks wrong.

Great post overall though.
 

Roedy Green

/*
* [TestRegexFindQuotedString.java]
*
* Summary: Finding a quoted String with a regex.
..
*
* Copyright: (c) 2012 Roedy Green, Canadian Mind Products,
http://mindprod.com
*
* Licence: This software may be copied and used freely for any
purpose but military.
* http://mindprod.com/contact/nonmil.html
*
* Requires: JDK 1.7+
*
* Created with: JetBrains IntelliJ IDEA IDE
http://www.jetbrains.com/idea/
*
* Version History:
* 1.0 2012-05-25 initial release
*/
package com.mindprod.example;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import static java.lang.System.out;

/**
* Finding a quoted String with a regex.
*
* @author Roedy Green, Canadian Mind Products
* @version 1.0 2012-05-25 initial release
* @since 2012-05-25
*/
public class TestRegexFindQuotedString
{
// ------------------------------ CONSTANTS ------------------------------

private static final String lookIn = "George said \"that's the ticket\"." +
" Jeb replied '\"ticket?\" what ticket'." +
" \"How na\u00efve!\"." +
" empty: \"\"" +
" 'unbalanced\"";

// -------------------------- STATIC METHODS --------------------------

/**
* exercise the pattern to see what it can find
*/
static void exercisePattern( Pattern pattern )
{
out.println();
out.println( "Pattern: " + pattern.toString() );
// Matchers are used both for matching and finding.
final Matcher m = pattern.matcher( lookIn );
while ( m.find() )
{
out.println( m.group( 0 ) );
}
}

// --------------------------- main() method ---------------------------

/**
* test harness
*
* @param args not used
*/
public static void main( String[] args )
{
// We want to find Strings of the form "xx'xx" or 'xx"xx'
// We want to avoid the following problems:
// 1. Works even if String contains foreign languages, even Russian or accented letters.
// 2. If starts with " must end with ", if starts with ' must end with '.
// 3. ' is ok inside "...", and " is ok inside '...'
// 4. We don't worry about how to use ' inside '...'.

// here are some suggested techniques:

exercisePattern( Pattern.compile( "[\"']\\p{Print}+?[\"']" ) ); // fails 1 2 3

exercisePattern( Pattern.compile( "[\"'][^\"']+[\"']" ) ); // fails 2 3

exercisePattern( Pattern.compile( "([\"'])[^\"']+\\1" ) ); // fails 3, uses a capturing group.

exercisePattern( Pattern.compile( "\"[^\"]+\"|'[^']+'" ) ); // works, rejects empty strings by Mark Space.

exercisePattern( Pattern.compile( "\"[^\"]*\"|'[^']*'" ) ); // works, accepts empty strings by Robert Klemme.

exercisePattern( Pattern.compile( "\"(?:\\\\.|[^\\\"])*\"|'(?:\\\\.|[^\\'])*'" ) ); // works, accepts empty strings
// (?: ) is a non-capturing group. This is Robert Klemme's contribution. I don't understand how it works.
}
}
 

markspace

exercisePattern( Pattern.compile( "\"[^\"]+\"|'[^']+'" ) ); //
works, rejects empty strings by Mark Space.


If you want it to accept empty strings, replace the +'s with *'s. You
didn't specify empty strings in your original problem statement, so I
decided to disallow them.
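A small sketch of what the + versus * choice does in practice. The names EmptyQuoteDemo, NON_EMPTY, ALLOW_EMPTY and the sample text are assumptions made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmptyQuoteDemo {
    // + requires at least one character between the delimiters.
    static final Pattern NON_EMPTY = Pattern.compile("\"[^\"]+\"|'[^']+'");
    // * also accepts nothing at all between the delimiters.
    static final Pattern ALLOW_EMPTY = Pattern.compile("\"[^\"]*\"|'[^']*'");

    /** Collects every substring of text that the pattern finds. */
    static List<String> findAll(Pattern p, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "a=\"\" b=\"x\"";
        System.out.println(findAll(NON_EMPTY, text));
        System.out.println(findAll(ALLOW_EMPTY, text));
    }
}
```

One thing worth noticing: on this input the + version does not simply skip the empty string. It cannot match at the first quote, so it starts at the second one and reports `" b="`, pairing quotes that belong to different strings. The * version reports `""` and `"x"`. If empty strings can occur in the input, accepting them and filtering afterwards may be the safer route.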

Thanks for posting that SSCCE, btw. I was too lazy to cook one up.
 

Robert Klemme

"\"(?:\\\\.|[^\\\"])*\"|'(?:\\\\.|[^\\'])*'"


This looks overly baroque to me. You don't need to escape \ single
quotes ' in a Java string,

I didn't.
and I don't think you need to in a regex
either (although I didn't check that).

There is also no regexp escaping of single quotes either. The only
regexp escaping you can see are the \\\\ which translate into \\ in the
string which is a literal backslash for the regexp engine.
I'm also not seeing the need for
the parenthesis around the character classes [] (but again, without
having tried it, I could be wrong).

It's not parenthesis around character classes but around the alternative
of "match a backslash followed by any char" and "any char which is not
backslash or the opening quote type of this string variant".
And the dot . inside the parenthesis just looks wrong.

It isn't - see above.
Great post overall though.

Thank you! It does seem to need some time to sink in though... :)

Kind regards

robert
 

markspace

exercisePattern( Pattern.compile(
"\"(?:\\\\.|[^\\\"])*\"|'(?:\\\\.|[^\\'])*'" ) ); // works, accepts
empty strings
// (?: ) is a non-capturing group. This is Robert Klemme's
contribution. I don't understand how it works.


Ah, OK, so here's my contribution to your excellent SSCCE. First this
pattern is basically the same as mine. It uses alternation (the
vertical bar |) to pick a string delimited by either ' or "

Here's his regex string without the extra escapes for Java:

"(?:\\.|[^\"])*"|'(?:\\.|[^\'])*'
^^^^^^^^^^^^^^^^

Let's look at just the first half for a moment, without the (?:\\. part.

"[^\"]*"
^^^^^^^^
12     3
Example for the first part:
1. " string starts with double quote
2. [^\"]* doesn't contain a "
3. " ends with double quote

Same for the second half of the string.

Notice he's using * instead of +'s, which is why his matches 0 width
strings.

The other part didn't appear in your problem statement, but in HTML/XML
it's allowed to escape characters. E.g., 'Bob\'s your uncle.' So his
inclusion is very reasonable.

So Robert adds the following to the first part:

(\\.|[^\"])*
12 345     6

1. Start a group
2. A backslash. It needs to be escaped for regex, hence \\.
3. . is regex "any character". 2 and 3 together mean "match \ followed
by any character"
4. OR (alternation again)
5. character class, negated (the ^), matches anything except \ or ". I
think this is a mistake: the \ needs to be quoted.
6. zero or more.

Then after that mess, he does the obvious thing and makes the group
non-capturing, so the regex does a little less work.

"(?:\\.|[^\"])*"

Phew! Next, he adds one alternation and does the same for a ' delimited
string.

|'(?:\\.|[^\'])*'

Same thing, just ' instead of ".

Finally I think this could be simplified slightly with Lew's
back-reference idea.

(['"])(?:\\.|[^\1\\])*

(Untested.) This allows empty strings between delimiters; instead of a
* use + for only non-empty strings between the quotes.



My executive summary:

Regex is a great rapid development tool, except when it isn't. You
realize your problem is simple, and you could have hand-coded a parser
to do this much quicker than all these news post exchanges?
 

markspace

"\"(?:\\\\.|[^\\\"])*\"|'(?:\\\\.|[^\\'])*'"
....
and I don't think you need to in a regex
either (although I didn't check that).

There is also no regexp escaping of single quotes either. The only
regexp escaping you can see are the \\\\ which translate into \\ in the
string which is a literal backslash for the regexp engine.


Yes, there is, although I think it's a typo. Both \\\" and \\' get
passed to the regex as \" and \', which means just a single character "
and ' respectively.
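To see what the missing pair of backslashes changes, here is a hedged sketch that compares the double-quote half of the pattern as posted with a version where the backslash is escaped inside the character class. EscapeClassDemo, AS_POSTED, CORRECTED and the two inputs are names and data invented for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapeClassDemo {
    // As posted: the engine sees "(?:\\.|[^\"])*" and the class excludes only ".
    static final Pattern AS_POSTED = Pattern.compile("\"(?:\\\\.|[^\\\"])*\"");
    // With the extra pair of backslashes the engine sees "(?:\\.|[^\\"])*"
    // and the class excludes both the backslash and the double quote.
    static final Pattern CORRECTED = Pattern.compile("\"(?:\\\\.|[^\\\\\"])*\"");

    /** Returns the first match of p in text, or null if there is none. */
    static String firstMatch(Pattern p, String text) {
        Matcher m = p.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String complete = "\"a\\\"b\""; // the six characters "a\"b"
        String dangling = "\"a\\\"";    // the four characters "a\" with no real closing quote
        System.out.println(firstMatch(AS_POSTED, complete));
        System.out.println(firstMatch(CORRECTED, complete));
        System.out.println(firstMatch(AS_POSTED, dangling));
        System.out.println(firstMatch(CORRECTED, dangling));
    }
}
```

Both versions agree on well-formed input. They differ when the only remaining quote is escaped: the posted version backtracks, lets the character class consume the backslash, and then treats the escaped quote as the terminator, while the corrected version finds no match.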

You're right about the rest of it though. With so many \'s floating
around, I have a hard time reading Java regex!

It's not parenthesis around character classes but around the alternative
of "match a backslash followed by any char" and "any char which is not
backslash or the opening quote type of this string variant".


Yup, I totally missed this too. Thanks for pointing it out.
 

Robert Klemme

exercisePattern( Pattern.compile(
"\"(?:\\\\.|[^\\\"])*\"|'(?:\\\\.|[^\\'])*'" ) ); // works, accepts
empty strings
// (?: ) is a non-capturing group. This is Robert Klemme's
contribution. I don't understand how it works.


Ah, OK, so here's my contribution to your excellent SSCCE. First this
pattern is basically the same as mine. It uses alternation (the vertical
bar |) to pick a string delimited by either ' or "

Here's his regex string without the extra escapes for Java:

"(?:\\.|[^\"])*"|'(?:\\.|[^\'])*'
^^^^^^^^^^^^^^^^

Let's look at just the first half for a moment, without the (?:\\. part.

"[^\"]*"
^^^^^^^^
12 3
Example for the first part:
1. " string starts with double quote
2. [^\"]* doesn't contain a "
3. " ends with double quote

Same for the second half of the string.

Notice he's using * instead of +'s, which is why his matches 0 width
strings.

The other part didn't appear in your problem statement, but in HTML/XML
it's allowed to escape characters. E.g., 'Bob\'s your uncle.' So his
inclusion is very reasonable.

So Robert adds the following to the first part:

(\\.|[^\"])*
12 345     6

1. Start a group
2. A slash. It needs to be escaped for regex, hence \\.
3. . is regex "any character". 2 and 3 together mean "match \ followed
by any character"
4. OR (alternation again)
5. character class, negated (the ^), matches anything except \ or ". I
think this is a mistake: the \ needs to be quoted.

Oh, right, thanks for finding that!
6. zero or more.

Then after that mess, he does the obvious thing and adds non-capturing
group, to make the regex do a little less work.

"(?:\\.|[^\"])*"

Phew! Next, he adds one alternation and does the same for a ' delimited
string.

|'(?:\\.|[^\'])*'

Same thing, just ' instead of ".

Finally I think this could be simplified slightly with Lew's
back-reference idea.

(['"])(?:\\.|[^\1\\])*

(Untested.) This allows empty strings between delimiters; instead of a *
use + for only non-empty strings between the quotes.

Interesting approach - but it doesn't work. Simple test with
Pattern.compile("(.)[a\\1]"):

Exception in thread "main" java.util.regex.PatternSyntaxException:
Illegal/unsupported escape sequence near index 6
(.)[a\1]
^
My executive summary:

Regex is a great rapid development tool, except when it isn't. You
realize your problem is simple, and you could have hand-coded a parser
to do this much quicker than all these news post exchanges?

Maybe, maybe not.

Kind regards

robert
 

Robert Klemme

On 5/25/2012 3:12 PM, Robert Klemme wrote:

"\"(?:\\\\.|[^\\\"])*\"|'(?:\\\\.|[^\\'])*'"
...
and I don't think you need to in a regex
either (although I didn't check that).

There is also no regexp escaping of single quotes either. The only
regexp escaping you can see are the \\\\ which translate into \\ in the
string which is a literal backslash for the regexp engine.


Yes, there is, although I think it's a typo. Both \\\" and \\' get
passed to the regex as \" and \', which means just a single character "
and ' respectively.

Right you are - both times: there is regexp escaping and it was in fact
a typo (missing \\)!
You're right about the rest of it though. With so many \'s floating
around, I have a hard time reading Java regex!

That's true for other languages as well - the basic reason is that the
same character is used for

- escaping in strings
- escaping in regular expressions
- escaping in the source text (in this case we could pick another
character)
Yup, I totally missed this too. Thanks for pointing it out.

You're welcome! Thank you again for finding the missing escape.

Cheers

robert
 

markspace

Finally I think this could be simplified slightly with Lew's
back-reference idea.

(['"])(?:\\.|[^\1\\])*

(Untested.) This allows empty strings between delimiters; instead of a *
use + for only non-empty strings between the quotes.

Interesting approach - but it doesn't work. Simple test with
Pattern.compile("(.)[a\\1]"):

Exception in thread "main" java.util.regex.PatternSyntaxException:
Illegal/unsupported escape sequence near index 6
(.)[a\1]
^


Yup, [] is for characters, and \1 could be a string. Gets rejected. I
think you could use "negative lookahead" to say "not this string" when
parsing. Gets kinda ugly though.

<http://www.regular-expressions.info/conditional.html>

Java:

"(['\"])(?:\\\\.|(?!\\1|\\\\).)+\\1"

Regex:

(['"])(?:\\.|(?!\1|\\).)+\1
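A short sketch of how the lookahead version behaves on a few inputs. LookaheadQuoteDemo and firstMatch are illustrative names; the inputs are borrowed from the test strings used elsewhere in this thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadQuoteDemo {
    // Regex seen by the engine: (['"])(?:\\.|(?!\1|\\).)+\1
    //   (['"])        remember which quote opened the string
    //   \\.           a backslash followed by any one character
    //   (?!\1|\\).    any one character that is neither the opening quote nor a backslash
    //   \1            the same quote that opened the string
    static final Pattern QUOTED =
            Pattern.compile("(['\"])(?:\\\\.|(?!\\1|\\\\).)+\\1");

    /** Returns the first quoted string found in text, or null if there is none. */
    static String firstMatch(String text) {
        Matcher m = QUOTED.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("George said \"that's the ticket\"."));
        System.out.println(firstMatch("escaped: 'Bob\\'s your uncle.'"));
        System.out.println(firstMatch("\"\""));           // null, + rejects empty strings
        System.out.println(firstMatch("'unbalanced\""));  // null, delimiters differ
    }
}
```

The lookahead does the job that [^\1\\] was meant to do: it is checked before each ordinary character, so the group reference sits outside any character class. Note that . does not match line terminators by default, so a quoted string spanning lines would need Pattern.DOTALL.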

I re-did Roedy's test program to be a bit more clear about what it was
looking for, and the results. This could be even cleaner if it was run
with a JUnit test harness.

At this point though the regex is basically just a mess. Download antlr
and get an XML/HTML grammar from online.



package quicktest;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import static java.lang.System.out;

/**
*
* @author Brenden
*/
public class MindProdRegex {

}

/*
* [TestRegexFindQuotedString.java]
*
* Summary: Finding a quoted String with a regex.
..
*
* Copyright: (c) 2012 Roedy Green, Canadian Mind Products,
http://mindprod.com
*
* Licence: This software may be copied and used freely for any
purpose but military.
* http://mindprod.com/contact/nonmil.html
*
* Requires: JDK 1.7+
*
* Created with: JetBrains IntelliJ IDEA IDE
http://www.jetbrains.com/idea/
*
* Version History:
* 1.0 2012-05-25 initial release
*/

/**
* Finding a quoted String with a regex.
*
* @author Roedy Green, Canadian Mind Products
* @version 1.0 2012-05-25 initial release
* @since 2012-05-25
*/
class TestRegexFindQuotedString
{
// ------------------------------ CONSTANTS ------------------------------

private static final String[] vectors =
{"Basic: George said \"that's the ticket\".",
"\"that's the ticket\"",
"Nested: Jeb replied '\"ticket?\" what ticket'.",
"'\"ticket?\" what ticket'",
"Non-ASCII: \"How na\u00efve!\".",
"\"How na\u00efve!\"",
" empty: \"\"xx",
"\"\"",
" escaped: 'Bob\\'s your uncle.'",
"'Bob\\'s your uncle.'",
" 'unbalanced\"",
"",
};

// -------------------------- STATIC METHODS--------------------------

/**
* exercise the pattern to see what it can find
*/
static void exercisePattern( Pattern pattern )
{
out.println();
out.println( "Pattern: " + pattern.toString() );
for( int i = 0; i < vectors.length; i+=2 ) {
String test = vectors[i];
String result = vectors[i+1];
final Matcher m = pattern.matcher( test );
boolean found = m.find();
boolean correct = false;
String groupString = null;
if( found ) {
correct = m.group(0).equals( result );
groupString = m.group();
}
System.out.println( test+", found: "+ found +
", correct: "+correct+" ("+groupString+")");
}
}

// --------------------------- main() method---------------------------

/**
* test harness
*
* @param args not used
*/
public static void main( String[] args )
{
// We want to find Strings of the form "xx'xx" or 'xx"xx'
// We want to avoid the following problems:
// 1. Works even if String contains foreign languages, even Russian or accented letters.
// 2. If starts with " must end with ", if starts with ' must end with '.
// 3. ' is ok inside "...", and " is ok inside '...'
// 4. We don't worry about how to use ' inside '...'.

// here are some suggested techniques:

exercisePattern( Pattern.compile( "[\"']\\p{Print}+?[\"']" ) ); // fails 1 2 3

exercisePattern( Pattern.compile( "[\"'][^\"']+[\"']" ) ); // fails 2 3

exercisePattern( Pattern.compile( "([\"'])[^\"']+\\1" ) ); // fails 3, uses a capturing group.

exercisePattern( Pattern.compile( "\"[^\"]+\"|'[^']+'" ) ); // works, rejects empty strings by Mark Space.
exercisePattern( Pattern.compile( "(['\"])(?:\\\\.|(?!\\1|\\\\).)+\\1" ) ); // works, rejects empty strings by Mark Space.

exercisePattern( Pattern.compile( "\"[^\"]*\"|'[^']*'" ) ); // works, accepts empty strings by Robert Klemme.
exercisePattern( Pattern.compile( "\"(?:\\\\.|[^\\\"])*\"|'(?:\\\\.|[^\\'])*'" ) ); // works, accepts empty strings
// (?: ) is a non-capturing group. This is Robert Klemme's contribution. I don't understand how it works.
}
}
 

Lew

markspace said:
Lew said:
Use a regex like "[\"'][^\"']+[\"']" is one way. The cleanest? I don't know.
This would match "John's restaurant" as "John'.

The first quote matches ", John does not contain either ' or " as specified,
and the last character class matches the '. Not I think what is wanted.

As I corrected in my very next post.
 

Roedy Green

I re-did Roedy's test program to be a bit more clear about what it was
looking for, and the results. This could be even cleaner if it was run
with a JUnit test harness.

Thanks Brendan. I have incorporated your suggestions plus a bit more
polishing.

See http://mindprod.com/jgloss/regex.html#FINDQUOTED

for a formatted listing + output.

The next task, probably procrastinated, is to solve it with a little
finite state automaton that decodes \x as well, and a simpler version
without. If a newbie is interested in tackling that, they can look at
my Java snippet parser as part of JPrep/JDisplay and strip it down.
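For anyone curious what such a little automaton might look like, here is a rough sketch. It is not taken from JPrep/JDisplay; the class QuoteScanner, its three states, and the rule that a backslash simply makes the next character literal are all assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class QuoteScanner {
    private enum State { OUTSIDE, INSIDE, ESCAPE }

    /**
     * Returns the decoded contents (without delimiters) of every quoted
     * string in text. A backslash inside a string makes the next character
     * literal. A string still open at end of input is dropped.
     */
    static List<String> scan(String text) {
        List<String> found = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        State state = State.OUTSIDE;
        char delimiter = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (state) {
                case OUTSIDE:
                    if (c == '"' || c == '\'') {
                        delimiter = c;          // remember which quote opened
                        current.setLength(0);
                        state = State.INSIDE;
                    }
                    break;
                case INSIDE:
                    if (c == '\\') {
                        state = State.ESCAPE;   // next character is taken literally
                    } else if (c == delimiter) {
                        found.add(current.toString());
                        state = State.OUTSIDE;
                    } else {
                        current.append(c);      // includes the other quote type
                    }
                    break;
                case ESCAPE:
                    current.append(c);
                    state = State.INSIDE;
                    break;
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(scan("George said \"that's the ticket\" and 'Bob\\'s your uncle.'"));
    }
}
```

Unlike the regex versions, this scanner never retries from a later position: an unbalanced opening quote swallows everything up to the next matching delimiter. Whether that is desirable depends on the input, so treat it as one possible design rather than a drop-in replacement.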
 

markspace

markspace said:
Lew said:
Use a regex like "[\"'][^\"']+[\"']" is one way. The cleanest? I
don't know.
This would match "John's restaurant" as "John'.

The first quote matches ", John does not contain either ' or " as
specified,
and the last character class matches the '. Not I think what is wanted.

As I corrected in my very next post.


Unfortunately that one doesn't work either. The central part, [^"'],
doesn't allow a match of a ' if the starting delimiter was a ", and that
doesn't match Roedy's spec. "John's restaurant" wouldn't be matched at
all, because the matcher couldn't match past the ' to get to the ".

I think the easiest is to write out a grammar for the expression, then
translate to regex.

QUOTED_STRING := SQUOTED_STRING | DQUOTED_STRING

SQUOTED_STRING := ' NON_S_QUOTE + '

DQUOTED_STRING := " NON_D_QUOTE + "

NON_S_QUOTE := [^']

NON_D_QUOTE := [^"]

At this point the grammar is very clear. (Note I haven't included
Robert's \x escape sequences.) I think it's worth learning to use antlr
rather than regex, which tends to obfuscate more than it helps.
However, a literal translation into regex isn't hard, and a literal
translation avoids mis-optimizations.
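One way to do the literal translation is to keep the grammar's rule names as Java constants and build the regex from them. This is only a sketch; the class name GrammarToRegex is made up, and the rule names are copied from the grammar above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GrammarToRegex {
    // Terminal rules, one character class each.
    static final String NON_S_QUOTE = "[^']";
    static final String NON_D_QUOTE = "[^\"]";

    // Delimiter, one or more permitted characters, delimiter.
    static final String SQUOTED_STRING = "'" + NON_S_QUOTE + "+'";
    static final String DQUOTED_STRING = "\"" + NON_D_QUOTE + "+\"";

    // The top-level rule is a plain alternation of the two.
    static final Pattern QUOTED_STRING =
            Pattern.compile(SQUOTED_STRING + "|" + DQUOTED_STRING);

    /** Returns the first quoted string found in text, or null if there is none. */
    static String firstMatch(String text) {
        Matcher m = QUOTED_STRING.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(QUOTED_STRING.pattern());
        System.out.println(firstMatch("Try \"John's restaurant\" tonight"));
    }
}
```

The assembled pattern comes out as '[^']+'|"[^"]+", which is the alternation suggested at the start of the thread, only with each piece named.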
 

Lew

markspace said:
Lew said:
markspace said:
Lew wrote:
Use a regex like "[\"'][^\"']+[\"']" is one way. The cleanest? I
don't know.

This would match "John's restaurant" as "John'.

The first quote matches ", John does not contain either ' or " as
specified,
and the last character class matches the '. Not I think what is wanted.

As I corrected in my very next post.

Unfortunately that one doesn't work either. The central part, [^"'], doesn't
allow a match of a ' if the starting delimiter was a ", and that doesn't match
Roedy's spec. "John's restaurant" wouldn't be matched at all, because the
matcher couldn't match past the ' to get to the ".

I think the easiest is to write out a grammar for the expression, then
translate to regex.

QUOTED_STRING := SQUOTED_STRING | DQUOTED_STRING

SQUOTED_STRING := ' NON_S_QUOTE + '

DQUOTED_STRING := " NON_D_QUOTE + "

NON_S_QUOTE := [^']

NON_D_QUOTE := [^"]

At this point the grammar is very clear. (Note I haven't included Robert's \x
escape sequences.) I think it's worth learning to use antlr rather than regex,
which tends to obfuscate more than it helps. However, a literal translation
into regex isn't hard, and a literal translation avoids mis-optimizations.

Very illuminating. Thank you.
 
