Regex: Any character in character class

Sebastian · Jan 30, 2013

I want to match any sequence of characters, including line breaks, in a
suffix of a multi-line string.

I do not want to use Pattern.DOTALL, because line breaks are not
permissible everywhere. I cannot write [.]* because dot loses its
special meaning inside a character class.

I have come up with [\S\s]*
as meaning any sequence of non-whitespace or whitespace (incl.
line-breaks). Is there a better way?

-- Sebastian
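
A minimal sketch of the behaviour described in the question, assuming java.util.regex with no flags set. The class name, helper name, and sample strings are made up for illustration. It shows that `.` stops at a line break, that `[.]` matches only a literal dot, and that `[\s\S]` matches any character.

```java
import java.util.regex.Pattern;

public class AnyCharDemo {
    // Hypothetical helper: true if the whole input matches the regex (no flags).
    static boolean matchesAll(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        String twoLines = "a\nb";
        System.out.println(matchesAll(".*", twoLines));        // false: '.' skips '\n'
        System.out.println(matchesAll("[.]*", "..."));         // true: literal dots only
        System.out.println(matchesAll("[.]*", "abc"));         // false
        System.out.println(matchesAll("[\\s\\S]*", twoLines)); // true
    }
}
```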

Mikhail Vladimirov · Jan 30, 2013

What about [^]?

Mikhail Vladimirov · Jan 30, 2013

Another option is .|\n
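
One caveat worth checking before adopting `.|\n`: without flags, Java's `.` excludes every line terminator, not only `\n`, so a carriage return is matched by neither branch. The sketch below uses a made-up class name and sample strings to illustrate this, with `[\s\S]` for comparison.

```java
import java.util.regex.Pattern;

public class DotOrNewline {
    // Hypothetical helper: true if the whole input matches the regex (no flags).
    static boolean matchesAll(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesAll("(?:.|\\n)*", "a\nb"));   // true
        System.out.println(matchesAll("(?:.|\\n)*", "a\r\nb")); // false: '\r' fits neither branch
        System.out.println(matchesAll("[\\s\\S]*", "a\r\nb"));  // true
    }
}
```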

Arne Vajhøj · Jan 31, 2013

What about [^]?

java.util.regex.PatternSyntaxException

Arne
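
A small sketch that reproduces the exception reported above. The class and helper names are hypothetical; the only library calls are Pattern.compile and the PatternSyntaxException it throws.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class EmptyNegatedClass {
    // Hypothetical helper: true if java.util.regex accepts the pattern.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(compiles("[^]"));      // false: rejected by Java
        System.out.println(compiles("[\\s\\S]")); // true
    }
}
```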

Arne Vajhøj · Jan 31, 2013

I want to match any sequence of characters, including line breaks, in a
suffix of a multi-line string.

I do not want to use Pattern.DOTALL, because line breaks are not
permissible everywhere. I cannot write [.]* because dot loses its
special meaning inside a character class.

I have come up with [\S\s]*
as meaning any sequence of non-whitespace or whitespace (incl.
line-breaks). Is there a better way?

Do you always want to accept line breaks or not? If not then when?

Arne

Arved Sandstrom · Feb 1, 2013

I want to match any sequence of characters, including line breaks, in a
suffix of a multi-line string.

I do not want to use Pattern.DOTALL, because line breaks are not
permissible everywhere. I cannot write [.]* because dot loses its
special meaning inside a character class.

I have come up with [\S\s]*
as meaning any sequence of non-whitespace or whitespace (incl.
line-breaks). Is there a better way?

Do you always want to accept line breaks or not? If not then when?

Arne

Good question.

I take it the suffix is a generic last-N characters of the string
(Assumption #1). I take it that line breaks are OK in the suffix, not
necessarily so in the rest of the string (Assumption #2).

If you don't mind me asking, why don't you just grab the suffix, the
last N characters, with substring()? That *is* your match.

AHS

Sebastian · Feb 1, 2013

On 31.01.2013 at 04:27, Arne Vajhøj wrote:

I want to match any sequence of characters, including line breaks, in a
suffix of a multi-line string.

I do not want to use Pattern.DOTALL, because line breaks are not
permissible everywhere. I cannot write [.]* because dot loses its
special meaning inside a character class.

I have come up with [\S\s]*
as meaning any sequence of non-whitespace or whitespace (incl.
line-breaks). Is there a better way?

Do you always want to accept line breaks or not? If not then when?

Arne

The string I want to match basically has two parts (a "protocol" and a
"selection expression"). I want to allow line breaks anywhere in the
selection expression, but not in the protocol.
-- S.

Lew · Feb 1, 2013

Sebastian said:
The string I want to match basically has two parts (a "protocol" and a
"selection expression"). I want to allow line breaks anywhere in the
selection expression, but not in the protocol.

How do you tell which part is which?

Arne Vajhøj · Feb 1, 2013

On 31.01.2013 at 04:27, Arne Vajhøj wrote:

I want to match any sequence of characters, including line breaks, in a
suffix of a multi-line string.

I do not want to use Pattern.DOTALL, because line breaks are not
permissible everywhere. I cannot write [.]* because dot loses its
special meaning inside a character class.

I have come up with [\S\s]*
as meaning any sequence of non-whitespace or whitespace (incl.
line-breaks). Is there a better way?

Do you always want to accept line breaks or not? If not then when?

The string I want to match basically has two parts (a "protocol" and a
"selection expression"). I want to allow line breaks anywhere in the
selection expression, but not in the protocol.

Do you have a separator between the two parts like colon in URL's?

If yes then something like:

[.]+:[.|\n]+

Arne

markspace · Feb 1, 2013

[.]+:[.|\n]+

Watch out for this. +, being greedy, will match a : in the selection
expression (the 2nd part) if : is allowed in the second part.

The reluctant modifier might be a better idea here:

.+?:[.|\n]+

Note that I don't think the initial brackets [] were needed. Also we're
yet again starting to see the problem with regex: it always evolves into
something that looks like your cat walked across the keyboard.
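
To make the greedy versus reluctant point concrete, here is a minimal sketch. The input string, class name, and helper are invented for illustration, and a capturing group is added around the first part so the difference is visible.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsReluctant {
    // Hypothetical helper: group 1 (text before the separator), or null if no match.
    static String firstPart(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "proto:sel:more"; // made-up input with ':' in the second part
        System.out.println(firstPart("(.+):(?:.|\\n)+", input));  // proto:sel
        System.out.println(firstPart("(.+?):(?:.|\\n)+", input)); // proto
    }
}
```

The greedy form backs off only as far as the last colon, while the reluctant form stops at the first one.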

Arne Vajhøj · Feb 1, 2013

[.]+:[.|\n]+

Watch out for this. +, being greedy, will match a : in the selection
expression (the 2nd part) if : is allowed in the second part.

The reluctant modifier might be a better idea here:

.+?:[.|\n]+

Note that I don't think the initial brackets [] were needed. Also we're
yet again starting to see the problem with regex: it always evolves into
something that looks like your cat walked across the keyboard.

You are absolutely right.

Non greedy.

No square brackets for first part.

And also round brackets for the last part.

.+?:(.|\n)+

I think I must have set a new world record. 3 bugs in 12 characters.

:-(

Arne
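
The bracket bug is easy to demonstrate: inside square brackets, `.` and `|` are ordinary characters, so `[.|\n]` matches only a dot, a pipe, or a newline. A minimal sketch, with a made-up class name and sample inputs:

```java
import java.util.regex.Pattern;

public class BracketVsGroup {
    // Hypothetical helper: true if the whole input matches the regex (no flags).
    static boolean matchesAll(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesAll("[.|\\n]+", ".|\n"));     // true: literal '.', '|', newline
        System.out.println(matchesAll("[.|\\n]+", "abc"));      // false: letters are not in the class
        System.out.println(matchesAll("(.|\\n)+", "abc\ndef")); // true: alternation in a group
    }
}
```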

Robert Klemme · Feb 1, 2013

On 31.01.2013 at 04:27, Arne Vajhøj wrote:

I want to match any sequence of characters, including line breaks, in a
suffix of a multi-line string.

I do not want to use Pattern.DOTALL, because line breaks are not
permissible everywhere. I cannot write [.]* because dot loses its
special meaning inside a character class.

I have come up with [\S\s]*
as meaning any sequence of non-whitespace or whitespace (incl.
line-breaks). Is there a better way?

Yes.

Do you always want to accept line breaks or not? If not then when?

The string I want to match basically has two parts (a "protocol" and a
"selection expression"). I want to allow line breaks anywhere in the
selection expression, but not in the protocol.

Of course you can use DOTALL - as an embedded flag:

package rx;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Dotty {

    // (?s:...) switches DOTALL on for the "sel" part only
    private static final Pattern PAT =
        Pattern.compile("proto.*(?s:sel.*)");

    public static void main(String[] args) {
        test("protoPselS");
        test("protoPPselS\nS");
        test("protoP\nPselS\nS");
    }

    public static void test(final CharSequence cs) {
        System.out.println("cs=\"" + cs + "\"");
        final Matcher m = PAT.matcher(cs);

        if (m.matches()) {
            System.out.println("Match: \"" + m.group() + "\"");
        } else {
            System.out.println("Mismatch");
        }

        System.out.println();
    }

}

Kind regards

robert

Sebastian · Feb 2, 2013

On 01.02.2013 at 23:13, Arne Vajhøj wrote:
[snip]

And also round brackets for the last part.

.+?:(.|\n)+

I think I must have set a new world record. 3 bugs in 12 characters.

:-(

Arne

Here's a concrete example:

SCA:LIST, select[werks_s:default_plant],values[bukrs:bukrs,
company:company]

The second part is everything after the first comma. I was using
(.+?),[\s\S]+

Arne's suggestion modified for my needs (comma as separator, and I only
want to capture the first part as a group) will work fine as well:
(.+?),(?:.|\n)+

Can't say though that I find anything to prefer the one to the other.
Perhaps the second looks even more like the result of a cat walk...

-- Sebastian
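
For comparison, a minimal sketch that runs the candidate patterns against the sample above. The class and helper names are hypothetical, and the third pattern is an adaptation of the embedded-flag idea from earlier in the thread, not something Sebastian posted.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ProtocolDemo {
    // Hypothetical helper: group 1 (the "protocol"), or null if there is no match.
    static String protocol(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "SCA:LIST, select[werks_s:default_plant],values[bukrs:bukrs,\n"
                     + "company:company]";
        System.out.println(protocol("(.+?),[\\s\\S]+", input));   // SCA:LIST
        System.out.println(protocol("(.+?),(?:.|\\n)+", input));  // SCA:LIST
        System.out.println(protocol("(.+?),(?s:.+)", input));     // SCA:LIST
    }
}
```

One possible reason to prefer the character class or the embedded flag: a repeated alternation such as `(?:.|\n)+` is known to recurse inside java.util.regex and may overflow the stack on very long inputs.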

markspace · Feb 2, 2013

SCA:LIST, select[werks_s:default_plant],values[bukrs:bukrs,
company:company]

For something this simple you might want to consider just String::split().

String test =
    "SCA:LIST,select[werks_s:default_plant],values[bukrs:bukrs,company:company]";
// A limit of 2 splits only at the first comma (printing needs java.util.Arrays)
String[] parse = test.split( ",\\s*", 2 );
System.out.println( Arrays.toString( parse ) );

This could be faster since the second half of the regex, (?:.|\n)+,
doesn't have to execute.

Arne Vajhøj · Feb 2, 2013

On 01.02.2013 at 23:13, Arne Vajhøj wrote:
[snip]

And also round brackets for the last part.

.+?:(.|\n)+

I think I must have set a new world record. 3 bugs in 12 characters.

:-(

Here's a concrete example:

SCA:LIST, select[werks_s:default_plant],values[bukrs:bukrs,
company:company]

The second part is everything after the first comma. I was using
(.+?),[\s\S]+

Arne's suggestion modified for my needs (comma as separator, and I only
want to capture the first part as a group) will work fine as well:
(.+?),(?:.|\n)+

Can't say though that I find anything to prefer the one to the other.
Perhaps the second looks even more like the result of a cat walk...

It is not unusual that there is more than one regex that
does the job.

Arne

Lew · Feb 2, 2013

Arne said:
Sebastian said:

Arne Vajhøj wrote:
[snip]

And also round brackets for the last part.

.+?:(.|\n)+

I think I must have set a new world record. 3 bugs in 12 characters.
:-(

Here's a concrete example:

SCA:LIST, select[werks_s:default_plant],values[bukrs:bukrs,
company:company]

The second part is everything after the first comma. I was using

You mean 'expression.substring(expression.indexOf(',') + 1)'?
(modulo the usual error checks, of course)

(.+?),[\s\S]+
Arne's suggestion modified for my needs (comma as separator, and I only
want to capture the first part as a group) will work fine as well:

You mean 'expression.substring(0, expression.indexOf(','))'?

If all you need to do is split a string on a comma, why use regexes at all?

It is not unusual that there is more than one regex that
does the job.

It is not unusual that there is more than one non-regex that does the job.
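
A sketch of the non-regex route with one of "the usual error checks" spelled out. The class and method names are hypothetical, and returning null when no comma is present is an assumption about what the caller would want.

```java
public class CommaSplit {
    // Hypothetical helper: {before, after} around the first comma, or null if none.
    static String[] splitAtFirstComma(String expression) {
        int comma = expression.indexOf(',');
        if (comma < 0) {
            return null; // assumption: a missing separator is reported as null
        }
        return new String[] {
            expression.substring(0, comma),
            expression.substring(comma + 1)
        };
    }

    public static void main(String[] args) {
        String[] parts = splitAtFirstComma(
            "SCA:LIST, select[werks_s:default_plant],values[bukrs:bukrs,\ncompany:company]");
        System.out.println(parts[0]); // SCA:LIST
        System.out.println(parts[1]); // everything after the first comma, line break included
    }
}
```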

Arne Vajhøj · Feb 3, 2013

If all you need to do is split a string on a comma, why use regexes at all?

It is not unusual that there is more than one non-regex that does the job.

True.

But less surprising.

Arne

Gene Wirchenko · Feb 4, 2013

[snip]

I think I must have set a new world record. 3 bugs in 12 characters.

:-(

I may be able to save your honour. <G>

IBM had bugs in a one-instruction program two bytes long. The
program was IEFBR14, and you can read about it on Wikipedia. There
was a series of corrections which resulted in a program several times
larger.

Sincerely,

Gene Wirchenko
