Keeping the split token in a Java regular expression

Jim Janney

laredotornado said:
Hi,

I'm using Java 6. I want to split a Java string on a regular
expression, but I would like to keep part of the string used to split
in the results. What I have are Strings like

Fri 7:30 PM, Sat 2 PM, Sun 2:30 PM

What I would like to do is split the expression wherever I have an
expression matching /(am|pm),?/i . Hopefully I got that right. In
the above example, I would like the results to be

Fri 7:30 PM
Sat 2 PM
Sun 2:30 PM

But with String.split, the split token is not kept within the
results. How would I write a Java parsing expression to do what I
want?

Thanks, - Dave

You want to match ,? only when it is preceded by (am|pm). That's what
lookbehind is for:

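public class LookBehind {
  public static void main(String[] args) {

    String data = "Fri 7:30 PM, Sat 2 PM, Sun 2:30 PM";
    // (?i) ignores case; (?<=am|pm) requires AM/PM directly before the
    // optional comma, so only the comma itself is consumed by the split.
    String pattern = "(?i)(?<=am|pm),?";

    String[] split = data.split(pattern);
    for (String s : split) {
      System.out.println("'" + s + "'");
    }
  }
}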

See http://www.regular-expressions.info/lookaround.html for a tutorial.
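
A small variation, offered as a sketch and not as part of the original answer: with the pattern above, the space that follows each comma stays at the front of the next token. Appending `\s*` to the pattern should absorb it. The class name `LookBehindTrim` and the helper `splitTimes` are made up for this illustration.

```java
import java.util.Arrays;

final class LookBehindTrim {
    // Split after AM/PM (any case). The optional comma and any whitespace
    // after it are consumed, so the AM/PM marker stays in each token.
    static String[] splitTimes(String data) {
        return data.split("(?i)(?<=am|pm),?\\s*");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(
                splitTimes("Fri 7:30 PM, Sat 2 PM, Sun 2:30 PM")));
    }
}
```

The zero-width match at the very end of the input only produces a trailing empty string, which `String.split` discards by default.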
 
laredotornado

Jim, That's absolutely brilliant and does exactly what I want in a
short amount of code.

Stefan, thanks for your solution as well. I tried that out first and
it works too. - Dave
 
Jim Janney

laredotornado said:
Jim, That's absolutely brilliant and does exactly what I want in a
short amount of code.

Stefan, thanks for your solution as well. I tried that out first and
it works too. - Dave

It turns out that lookbehind only works with some patterns; the engine
has to be able to determine the length of the match in advance. Not
surprising when you think about it. It's an interesting question and
gave me a reason to learn something new.
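
One way to explore that restriction is to ask the engine directly which lookbehind patterns it accepts. The sketch below is only a probe: `LookBehindCheck` and `compiles` are hypothetical names, and the second and third patterns are made-up examples. Alternatives of known length and bounded repetition are accepted by `java.util.regex`; the unbounded `\s*` case is the kind described above, and since its handling may differ between Java versions the code reports the result instead of assuming it.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

final class LookBehindCheck {
    // Returns true if java.util.regex accepts the pattern, false if it
    // rejects it with a PatternSyntaxException.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] candidates = {
            "(?i)(?<=am|pm),?",      // alternatives of known length
            "(?<=[AP]M\\s{0,3}),",   // bounded repetition inside the lookbehind
            "(?<=[AP]M\\s*),"        // unbounded repetition inside the lookbehind
        };
        for (String regex : candidates) {
            System.out.println(regex + " -> " + compiles(regex));
        }
    }
}
```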
 
Gene Wirchenko

exact rules to parse the input, and what to do when the input format
fails quality checks.
You've been awfully poetic lately Lew.

I prefer the "new" Lew. He has dropped the antagonism that I
often saw, and it has made his posts much more readable and useful.

Sincerely,

Gene Wirchenko
 
Daniel Pitts

Jim Janney said:
It turns out that lookbehind only works with some patterns; the engine
has to be able to determine the length of the match in advance. Not
surprising when you think about it. It's an interesting question and
gave me a reason to learn something new.

That's interesting. I've written my own Deterministic FSA to implement a
subset of regex functionality, and arbitrary lookbehind actually would
be an easy feature to add. Easier than zero-width matches (for example
word-boundaries).

Anyway, one thing to point out is that Stefan's is likely to perform
better, and definitely has lower memory overhead for long inputs than
"split".
 
Lew

Gene said:
I prefer the "new" Lew. He has dropped the antagonism that I
often saw, and it has made his posts much more readable and useful.

I give your preference all the consideration that it is due.
 
Gene Wirchenko

I give your preference all the consideration that it is due.

As manners are a social lubricant and a fairly inexpensive one,
that would be quite a lot. Thank you. If you did not mean that,
consider meaning that. You are quite knowledgeable, and without an
antagonistic curve, your posts are very good indeed. This same
statement applies to many people posting on USENET.

Call my preference the USENET Manners Project if you want.
Disagreeing is one thing; being disagreeable is quite another.
http://xkcd.com/386/
is a good joke but a poor reality.

I look forward to your next politely informative post, Lew. Your
recent one clarifying a sentence of yours was very nice indeed.

Sincerely,

Gene Wirchenko
 
Robert Klemme

StringTokenizer is somewhat obsoleted by String split.

I find regular expressions are quite a bit of overhead for splitting at
commas only. (Now we know that the OP has more demanding requirements
so regexp is probably the tool of choice.)

Hmm... I don't like those methods in class String that much which use a
String with a regular expression which is then parsed on every
invocation of the method. That might be good for one-off usage, but for
everything else I prefer solutions which at least use a Pattern constant
to avoid parsing overhead per call. Even if it wasn't for the runtime
overhead of parsing, I like to have the constant, which can have its own
JavaDoc explaining what's going on, plus I can reuse it and quickly find
all places of usage etc.

Kind regards

robert
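
A minimal sketch of the Pattern-constant approach described above, assuming the lookbehind separator from earlier in the thread with `\s*` added to swallow the space after the comma. The names `TimeSplitter` and `SEPARATOR` are invented for the example.

```java
import java.util.regex.Pattern;

final class TimeSplitter {
    /**
     * Matches the separator between entries: an optional comma and optional
     * whitespace, but only directly after AM or PM (any case).
     */
    private static final Pattern SEPARATOR =
            Pattern.compile("(?<=am|pm),?\\s*", Pattern.CASE_INSENSITIVE);

    // Reuses the compiled pattern instead of recompiling it on each call.
    static String[] split(CharSequence text) {
        return SEPARATOR.split(text);
    }
}
```

Calling `TimeSplitter.split("Fri 7:30 PM, Sat 2 PM, Sun 2:30 PM")` compiles nothing at call time; the regex is parsed once when the class is initialised, and the constant gives the Javadoc a place to live.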
 
Arne Vajhøj

I find regular expressions are quite a bit of overhead for splitting at
commas only. (Now we know that the OP has more demanding requirements so
regexp is probably the tool of choice.)

Hmm... I don't like those methods in class String that much which use a
String with a regular expression which is then parsed on every
invocation of the method. That might be good for one-off usage, but for
everything else I prefer solutions which at least use a Pattern constant
to avoid parsing overhead per call. Even if it wasn't for the runtime
overhead of parsing, I like to have the constant, which can have its own
JavaDoc explaining what's going on, plus I can reuse it and quickly find
all places of usage etc.

Split is the way you do it.

To cut down on overhead a non-regex split should be added.

Arne
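
For illustration only, a non-regex split of the kind suggested above could look roughly like this. `PlainSplit` is a hypothetical helper, not an existing JDK method, and it splits on a single character using `indexOf`.

```java
import java.util.ArrayList;
import java.util.List;

final class PlainSplit {
    // Splits text at every occurrence of the separator character using
    // indexOf only; no regular expression is involved.
    static List<String> split(String text, char separator) {
        List<String> parts = new ArrayList<String>();
        int start = 0;
        int end;
        while ((end = text.indexOf(separator, start)) >= 0) {
            parts.add(text.substring(start, end));
            start = end + 1;
        }
        parts.add(text.substring(start)); // remainder after the last separator
        return parts;
    }
}
```

Unlike `String.split` with its default limit, this sketch keeps trailing empty strings, so `split("a,b,", ',')` yields three elements.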
 
Arne Vajhøj

[snip]

A hackish solution:

String[] p = s.replaceAll("[AP]M", "$0X$0").split("X[AP]M");

Nice. As far as hackish, using "split" for this purpose at all is
hackish.

That type of split is the typical way in most modern languages
(though usually in a non regex flavor).

Arne
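
The quoted one-liner can be traced in two steps. This is just a sketch to show what it produces; the class name `HackishSplit` is made up, and the trick assumes the input never contains a literal "X" directly followed by AM or PM.

```java
final class HackishSplit {
    static String[] split(String s) {
        // Step 1: "PM" becomes "PMXPM", so a copy of the marker sits on
        // each side of an artificial "X".
        String doubled = s.replaceAll("[AP]M", "$0X$0");
        // Step 2: splitting on "XPM"/"XAM" removes the X and the second
        // copy, leaving the first copy attached to its token.
        return doubled.split("X[AP]M");
    }

    public static void main(String[] args) {
        for (String part : split("Fri 7:30 PM, Sat 2 PM, Sun 2:30 PM")) {
            System.out.println("'" + part + "'");
        }
    }
}
```

Note that the comma and the space are not consumed, so the second and third tokens start with ", " and would still need trimming.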
 
Daniel Pitts

I find regular expressions are quite a bit of overhead for splitting at
commas only. (Now we know that the OP has more demanding requirements so
regexp is probably the tool of choice.)

Hmm... I don't like those methods in class String that much which use a
String with a regular expression which is then parsed on every
invocation of the method. That might be good for one-off usage but for
everything else I prefer solutions which at least use a Pattern constant
to avoid parsing overhead per call.

Premature optimization. Regex parsing inside an inner loop *might* add
unacceptable overhead; however, that should be determined via profiling.

Even if it wasn't for runtime
overhead of parsing I like to have the constant which can have its own
JavaDoc explaining what's going on plus I can reuse it and quickly find
all places of usage etc.

That's a better reason to factor it out.

My personal philosophy for this kind of thing:
Correct first, easy second, fast third.

If it's not correct, it doesn't matter.
If it's not easy, it's likely not correct, at least not for long.
If it's not fast, it should be "easy" to make it fast as long as it's
already correct and easy :)
 
Robert Klemme

Stefan said:
laredotornado said:
What I would like to do is split the expression wherever I have an

public class Main
{
  public static void split
  ( final java.lang.String text )
  { java.util.regex.Pattern pattern =
      java.util.regex.Pattern.compile
      ( ".*?(?:am|pm),?", java.util.regex.Pattern.CASE_INSENSITIVE );
    java.util.regex.Matcher matcher = pattern.matcher( text );
    while( matcher.find() )
      java.lang.System.out.println( matcher.group( 0 )); }

  public static void main( final java.lang.String[] args )
  { split( "Fri 7:30 PM, Sat 2 PM, Sun 2:30 PM" ); }}

This excellent (except for layout) example deserves to be archived.

What do you find excellent about this? I find it has some deficiencies

- the separator is included in the match (which goes against the
requirements despite the thread subject)
- spaces after a separator comma are included in the next token as
leading text
- the method really does more than splitting (namely printing), so the
name does not reflect what's going on
- the Pattern is compiled on _every_ invocation of the method
- the method is unnecessarily restricted, argument type CharSequence is
sufficient

Test output for
"Fri 7:30 PM, Sat 2 PM, Sun 2:30 PM"
"Fri 8 PM, Sat 1, 3, and 5 PM"

Fri 7:30 PM,
Sat 2 PM,
Sun 2:30 PM
---
Fri 8 PM,
Sat 1, 3, and 5 PM
---

I would change that to

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    private static final Pattern SPLIT_PATTERN = Pattern.compile(
            "(\\S.*?(?:am|pm))(?:,\\s*)?", Pattern.CASE_INSENSITIVE);

    public static void splitPrint(final CharSequence text) {
        for (final Matcher m = SPLIT_PATTERN.matcher(text); m.find();) {
            System.out.println(m.group(1));
        }
    }

    public static List<String> split(final CharSequence text) {
        final List<String> result = new ArrayList<String>();

        for (final Matcher m = SPLIT_PATTERN.matcher(text); m.find();) {
            result.add(m.group(1));
        }

        return result;
    }

    public static void main(final java.lang.String[] args) {
        splitPrint("Fri 7:30 PM, Sat 2 PM, Sun 2:30 PM");
        System.out.println("---");
        splitPrint("Fri 8 PM, Sat 1, 3, and 5 PM");
        System.out.println("---");
    }
}

I might even sneak a "\\s*" in between "pm)" and "(?:," to even catch
cases where there are spaces before the separator.

Kind regards

robert
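
The variant mentioned in the last paragraph, with `\\s*` placed between "pm)" and "(?:,", might look like the sketch below. The class name `TolerantSplit` is invented here; the rest follows the structure of the code in the post.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TolerantSplit {
    // Same as SPLIT_PATTERN in the post, plus "\\s*" between "pm)" and
    // "(?:," so that spaces before the comma are skipped as well.
    private static final Pattern PATTERN = Pattern.compile(
            "(\\S.*?(?:am|pm))\\s*(?:,\\s*)?", Pattern.CASE_INSENSITIVE);

    static List<String> split(CharSequence text) {
        List<String> result = new ArrayList<String>();
        for (Matcher m = PATTERN.matcher(text); m.find();) {
            result.add(m.group(1));
        }
        return result;
    }
}
```

Without the extra `\\s*`, an input such as "Fri 7:30 PM , Sat 2 PM" would leave the comma at the start of the second token, because the comma is the first non-space character the next match sees.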
 
Daniel Pitts

As manners are a social lubricant and a fairly inexpensive one,
that would be quite a lot. Thank you. If you did not mean that,
consider meaning that. You are quite knowledgeable, and without an
antagonistic curve, your posts are very good indeed. This same
statement applies to many people posting on USENET.
At the same time, it is one's personal loss to ignore something because
of who said it or how it was said. Part of the problem is the jadedness
that some of the old-timers on this group have, due to certain
trolls-who-shall-not-be-named. Lew is a very analytical and structured
person; arguing facts logically, with references, is more likely to
persuade him than talking about feelings. I'm very much the same way,
though I have tried to include my understanding of psychology in my
responses.
Call my preference the USENET Manners Project if you want.
Disagreeing is one thing; being disagreeable is quite another.
http://xkcd.com/386/
is a good joke but a poor reality.

I look forward to your next politely informative post, Lew. Your
recent one clarifying a sentence of yours was very nice indeed.

I just want to point out that while your intentions *may* be good, the
tone of your message comes off just as smug as what you're attempting to
decry. I'm not trying to stir up a flame war, but I'm hoping that you
can see the other side of this as well. Lew has been a long time
contributor to the Java newsgroups, and I have never found any of his
posts personally distasteful in any way. This is the internet, and some
slight thickness of skin is expected.

So, please, stop baiting each other, and keep these messages on topic.
 
Martin Gregorie

It's rather late here, so I'll leave this as an exercise for anybody
who feels keen. If nobody has touched it by mid morning tomorrow I
may see if it works.

I put together the following this morning. Hopefully it's enough of an SSCE
to pass muster.

As promised, I first implemented a two-pass splitter (the 'classico'
method): it's ugly all right, even though it does the trick.

Then I swiped Stefan's code (the 'patternista' method), tweaked it
slightly and used it to drive both his and my regexes. The only other
change it needs is to parameterise Matcher.group() because Stefan's regex
treats the whole pattern as a capture group while mine only uses the
first capture group in the pattern, which lets it discard the comma
separators. This was one of my design aims: to output the exact same
strings as the classico() method does.

==========================================================================
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Splitter
{
   public static ArrayList<String> classico(String in)
   {
      // Pass 1: split on PM and put the marker back on every piece.
      String[] sList = in.split("PM, +|PM");
      for (int i=0; i<sList.length; i++)
         sList[i] = sList[i].trim() + " PM";

      // Pass 2: split each piece on AM and restore that marker too.
      ArrayList<String> aList = new ArrayList<String>();
      for (String s : sList)
      {
         String sp[] = s.split("AM, +|AM");
         for (int j=0; j < sp.length - 1; j++)
            aList.add(sp[j].trim() + " AM");

         aList.add(sp[sp.length - 1]);  // The last element is
                                        // always ended with PM
      }

      return aList;
   }

   public static ArrayList<String> patternista(String p, int g, String in)
   {
      Pattern pattern = Pattern.compile(p, Pattern.CASE_INSENSITIVE);
      Matcher matcher = pattern.matcher(in);
      ArrayList<String> aList = new ArrayList<String>();
      while(matcher.find())
      {
         String s = matcher.group(g);
         aList.add(s.trim());
      }

      return aList;
   }

   public static void showResult(String source,
                                 String method,
                                 ArrayList<String> s)
   {
      System.out.println(String.format("\n'%s' ==> '%s'",
                                       source,
                                       method));
      for (int i = 0; i < s.size(); i++)
         System.out.println(String.format("%2d: %s", i, s.get(i)));
   }

   public static void main(String[] args)
   {
      String SOURCE = "Fri 7:30 PM, Sat 1, 3 and 5 AM, Sun 2:30 PM";
      String martin = "(.*?[AP]M),?";
      String stefan = ".*?(?:am|pm),?";

      ArrayList<String> s;
      s = classico(SOURCE);
      showResult(SOURCE, "classico", s);
      s = patternista(martin, 1, SOURCE);
      showResult(SOURCE, martin, s);
      s = patternista(stefan, 0, SOURCE);
      showResult(SOURCE, stefan, s);
   }
}
==========================================================================
'Fri 7:30 PM, Sat 1, 3 and 5 AM, Sun 2:30 PM' ==> 'classico'
0: Fri 7:30 PM
1: Sat 1, 3 and 5 AM
2: Sun 2:30 PM

'Fri 7:30 PM, Sat 1, 3 and 5 AM, Sun 2:30 PM' ==> '(.*?[AP]M),?'
0: Fri 7:30 PM
1: Sat 1, 3 and 5 AM
2: Sun 2:30 PM

'Fri 7:30 PM, Sat 1, 3 and 5 AM, Sun 2:30 PM' ==> '.*?(?:am|pm),?'
0: Fri 7:30 PM,
1: Sat 1, 3 and 5 AM,
2: Sun 2:30 PM
==========================================================================

As you can see, once I'd swapped greedy matches for non-greedy in my regex
(the second test run), both regexes do the job and to my mind use much more
elegant code than the two-pass classico approach, but of course ymmv.
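
To see why the switch from greedy to non-greedy matters, the two forms can be run side by side. The greedy regex below, "(.*[AP]M),?", is an assumption about what the earlier version looked like (it is not shown in the post), and `GreedyDemo` and `tokens` are illustrative names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GreedyDemo {
    // Collects the trimmed first capture group of every match.
    static List<String> tokens(String regex, String input) {
        List<String> found = new ArrayList<String>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group(1).trim());
        }
        return found;
    }

    public static void main(String[] args) {
        String source = "Fri 7:30 PM, Sat 1, 3 and 5 AM, Sun 2:30 PM";
        // Greedy: ".*" runs to the end and backtracks to the LAST AM/PM.
        System.out.println(tokens("(.*[AP]M),?", source));
        // Non-greedy: ".*?" stops at the FIRST AM/PM it can reach.
        System.out.println(tokens("(.*?[AP]M),?", source));
    }
}
```

The greedy form returns the whole input as a single token, while the non-greedy form returns the three entries listed in the test run above.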
 
Daniel Pitts

On 3/26/2012 2:54 PM, laredotornado wrote:
[snip]

A hackish solution:

String[] p = s.replaceAll("[AP]M", "$0X$0").split("X[AP]M");

Nice. As far as hackish, using "split" for this purpose at all is
hackish.

That type of split is the typical way in most modern languages
(though usually in a non regex flavor).
For functional languages, yes, but those modern languages don't
necessarily return an array. Ideally they would return "iterable" of
some sort.

And in any case, this particular problem is not a "split" kind of
problem, but a "parse" kind of problem. So, split for this is hackish,
 
Arne Vajhøj

On 3/26/12 6:58 PM, Arne Vajhøj wrote:
On 3/26/2012 2:54 PM, laredotornado wrote:
[snip]

A hackish solution:

String[] p = s.replaceAll("[AP]M", "$0X$0").split("X[AP]M");

Nice. As far as hackish, using "split" for this purpose at all is
hackish.

That type of split is the typical way in most modern languages
(though usually in a non regex flavor).
For functional languages, yes, but those modern languages don't
necessarily return an array. Ideally they would return "iterable" of
some sort.

.NET String Split return string[] (non regex)
.NET Regex Split return string[] (regex)
PHP split return array (regex)
PHP explode return array (non regex)
PHP preg_split return array (regex)
And in any case, this particular problem is not a "split" kind of
problem, but a "parse" kind of problem. So, split for this is hackish,

I think it would be rather common in practice.

Arne
 
Gene Wirchenko

On Tue, 27 Mar 2012 14:29:33 -0700, Daniel Pitts

[snip]
At the same time, it is one's personal loss to ignore something because
of who said it or how it was said. Part of the problem is the jadedness

One must balance the loss of missing something with the loss of
spending time trying to uncurve a response.

[snip]
I just want to point out that while your intentions *may* be good, the
tone of your message comes off just as smug as what you're attempting to
decry. I'm not trying to stir up a flame war, but I'm hoping that you
can see the other side of this as well. Lew has been a long time
contributor to the Java newsgroups, and I have never found any of his
posts personally distasteful in any way. This is the internet, and some
slight thickness of skin is expected.

"slight". And that does not mean that being rude is good.

So, please, stop baiting each other, and keep these messages on topic.

I am not baiting him. I like the polite Lew. There is no reason
why people can not be polite on USENET. They just have to decide to
do so.

Sincerely,

Gene Wirchenko
 
Daniel Pitts

On 3/27/2012 12:14 AM, Daniel Pitts wrote:
On 3/26/12 6:58 PM, Arne Vajhøj wrote:
On 3/26/2012 2:54 PM, laredotornado wrote:
[snip]

A hackish solution:

String[] p = s.replaceAll("[AP]M", "$0X$0").split("X[AP]M");

Nice. As far as hackish, using "split" for this purpose at all is
hackish.

That type of split is the typical way in most modern languages
(though usually in a non regex flavor).
For functional languages, yes, but those modern languages don't
necessarily return an array. Ideally they would return "iterable" of
some sort.

.NET String Split return string[] (non regex)
.NET Regex Split return string[] (regex)
PHP split return array (regex)
PHP explode return array (non regex)
PHP preg_split return array (regex)
And in any case, this particular problem is not a "split" kind of
problem, but a "parse" kind of problem. So, split for this is hackish,

I think it would be rather common in practice.

Arne
I thought you meant modern languages like python or ruby :)
 
Arved Sandstrom

On 3/26/12 6:58 PM, Arne Vajhøj wrote:
On 3/26/2012 2:54 PM, laredotornado wrote:
[snip]

A hackish solution:

String[] p = s.replaceAll("[AP]M", "$0X$0").split("X[AP]M");

Nice. As far as hackish, using "split" for this purpose at all is
hackish.

That type of split is the typical way in most modern languages
(though usually in a non regex flavor).
For functional languages, yes, but those modern languages don't
necessarily return an array. Ideally they would return "iterable" of
some sort.
[ SNIP ]

These days what's the difference? Both arrays and lists, in computing,
are commonly considered to support indexing, and both can be "iterated"
over one way or the other. As far as arrays go, consider what you can do
with Haskell arrays, or with array operations in APL or J, or with
slices in D...no "for" loops happening there.

AHS
 
Arne Vajhøj

On 3/27/2012 12:14 AM, Daniel Pitts wrote:
On 3/26/12 6:58 PM, Arne Vajhøj wrote:
On 3/26/2012 2:54 PM, laredotornado wrote:
[snip]

A hackish solution:

String[] p = s.replaceAll("[AP]M", "$0X$0").split("X[AP]M");

Nice. As far as hackish, using "split" for this purpose at all is
hackish.

That type of split is the typical way in most modern languages
(though usually in a non regex flavor).
For functional languages, yes, but those modern languages don't
necessarily return an array. Ideally they would return "iterable" of
some sort.
[ SNIP ]

These days what's the difference? Both arrays and lists, in computing,
are commonly considered to support indexing, and both can be "iterated"
over one way or the other. As far as arrays go, consider what you can do
with Haskell arrays, or with array operations in APL or J, or with
slices in D...no "for" loops happening there.

I think what Daniel wanted was a lazy not an eager split.

Instead of doing a full parse and return a data structure
(array or list) then just return an iterator with a pointer
to the start and then do the parsing when asked for next.

Arne
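
A lazy split along those lines could be sketched as an `Iterable` that holds a `Matcher` and only searches for the next token when asked. This is one possible shape, not a JDK facility: `LazyTokens` is a hypothetical class, and the pattern is borrowed from Robert's post earlier in the thread.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LazyTokens implements Iterable<String> {
    private static final Pattern TOKEN = Pattern.compile(
            "(\\S.*?(?:am|pm))(?:,\\s*)?", Pattern.CASE_INSENSITIVE);

    private final CharSequence text;

    LazyTokens(CharSequence text) {
        this.text = text;
    }

    @Override
    public Iterator<String> iterator() {
        final Matcher m = TOKEN.matcher(text);
        return new Iterator<String>() {
            private String pending;   // token found but not yet handed out
            private boolean done;     // true once the matcher is exhausted

            @Override
            public boolean hasNext() {
                // Parsing happens here, one token at a time, on demand.
                if (pending == null && !done) {
                    if (m.find()) {
                        pending = m.group(1);
                    } else {
                        done = true;
                    }
                }
                return pending != null;
            }

            @Override
            public String next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                String result = pending;
                pending = null;
                return result;
            }

            @Override
            public void remove() {
                throw new UnsupportedOperationException();
            }
        };
    }
}
```

A caller can then write `for (String t : new LazyTokens(input)) { ... }` and stop early without the rest of the input ever being scanned, and no array or list of all tokens is built.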
 
