Keeping the split token in a Java regular expression

Arved Sandstrom · Mar 28, 2012

On 3/27/12 2:21 PM, Arne Vajhøj wrote:
On 3/27/2012 12:14 AM, Daniel Pitts wrote:
On 3/26/12 6:58 PM, Arne Vajhøj wrote:
On 3/26/2012 2:54 PM, laredotornado wrote:
I'm using Java 6. I want to split a Java string on a regular
expression, but I would like to keep part of the string used to
split in the results. What I have are Strings like

Fri 7:30 PM, Sat 2 PM, Sun 2:30 PM

What I would like to do is split the expression wherever I have an
expression matching /(am|pm),?/i . Hopefully I got that right. In
the above example, I would like the results to be

Fri 7:30 PM
Sat 2 PM
Sun 2:30 PM

But with String.split, the split token is not kept within the
results. How would I write a Java parsing expression to do what I
want?

A hackish solution:

String[] p = s.replaceAll("[AP]M", "$0X$0").split("X[AP]M");
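
For anyone trying the one-liner above, here is a small sketch of what it actually returns for the sample input. The class and method names (HackSplitDemo, hackSplit, tidy) are made up for illustration. Note that the hack assumes the input never contains a literal "X" directly before "AM" or "PM", and that the separator comma stays at the front of every token after the first, so a cleanup step is probably needed.

```java
import java.util.ArrayList;
import java.util.List;

final class HackSplitDemo {
    // The hack as posted: duplicate each AM/PM around a marker, then split on marker + AM/PM.
    static String[] hackSplit(final String s) {
        return s.replaceAll("[AP]M", "$0X$0").split("X[AP]M");
    }

    // Hypothetical cleanup step: drop a leading comma and whitespace from each part.
    static List<String> tidy(final String[] parts) {
        final List<String> result = new ArrayList<String>();
        for (final String part : parts) {
            result.add(part.replaceFirst("^,\\s*", ""));
        }
        return result;
    }

    public static void main(final String[] args) {
        final String[] raw = hackSplit("Fri 7:30 PM, Sat 2 PM, Sun 2:30 PM");
        for (final String part : raw) {
            System.out.println("[" + part + "]");   // second and third parts start with ", "
        }
        System.out.println(tidy(raw));
    }
}
```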

Nice. As far as hackish, using "split" for this purpose at all is
hackish.

That type of split is the typical way in most modern languages
(though usually in a non regex flavor).
For functional languages, yes, but those modern languages don't
necessarily return an array. Ideally they would return "iterable" of
some sort.

[ SNIP ]

These days what's the difference? Both arrays and lists, in computing,
are commonly considered to support indexing, and both can be "iterated"
over one way or the other. As far as arrays go, consider what you can do
with Haskell arrays, or with array operations in APL or J, or with
slices in D...no "for" loops happening there.

I think what Daniel wanted was a lazy split, not an eager one.

Instead of doing a full parse and returning a data structure
(array or list), just return an iterator with a pointer
to the start and do the parsing when asked for the next element.

Arne

A generator, IOW.
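
A rough sketch of what such a lazy split could look like in Java 6 style, in case it helps the discussion. The class name LazySplit is invented here, and the pattern is borrowed from Robert's post elsewhere in this thread; nothing is parsed until the caller asks for the next token. The "done" flag matters because Matcher.find() restarts from the beginning of the input after a failed search.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LazySplit implements Iterable<String> {
    private static final Pattern TOKEN = Pattern.compile(
            "(\\S.*?[ap]m)(?:,\\s*)?", Pattern.CASE_INSENSITIVE);

    private final CharSequence text;

    LazySplit(final CharSequence text) { this.text = text; }

    public Iterator<String> iterator() {
        final Matcher m = TOKEN.matcher(text);
        return new Iterator<String>() {
            private String pending;   // token found but not yet handed out
            private boolean done;     // set once find() has failed

            public boolean hasNext() {
                if (pending == null && !done) {
                    if (m.find()) { pending = m.group(1); } else { done = true; }
                }
                return pending != null;
            }

            public String next() {
                if (!hasNext()) { throw new NoSuchElementException(); }
                final String token = pending;
                pending = null;
                return token;
            }

            public void remove() { throw new UnsupportedOperationException(); }
        };
    }

    public static void main(final String[] args) {
        for (final String token : new LazySplit("Fri 7:30 PM, Sat 2 PM, Sun 2:30 PM")) {
            System.out.println(token);
        }
    }
}
```

A caller that stops after the first token never pays for scanning the rest of the input, which is the point of the lazy variant.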

AHS

Gene Wirchenko · Mar 28, 2012

On Tue, 27 Mar 2012 14:29:33 -0700, Daniel Pitts

[snip]

At the same time, it is one's personal loss to ignore something because
of who said it or how it was said. Part of the problem is the jadedness

One must balance the loss of missing something with the loss of
spending time trying to uncurve a response.

[snip]

I just want to point out that while your intentions *may* be good, the
tone of your message comes off just as smug as what you're attempting to
decry. I'm not trying to stir up a flame war, but I'm hoping that you
can see the other side of this as well. Lew has been a long time
contributor to the Java newsgroups, and I have never found any of his
posts personally distasteful in any way. This is the internet, and some
slight thickness of skin is expected.

"slight". And that does mean that being rude is good.

^
I missed a "not" here.

I am not baiting him. I like the polite Lew. There is no reason
why people can not be polite on USENET. They just have to decide to
do so.

Sincerely,

Gene Wirchenko

Daniel Pitts · Mar 28, 2012

On 12-03-27 07:20 PM, Daniel Pitts wrote:
On 3/27/12 2:21 PM, Arne Vajhøj wrote:
That type of split is the typical way in most modern languages
(though usually in a non regex flavor).
For functional languages, yes, but those modern languages don't
necessarily return an array. Ideally they would return "iterable" of
some sort.
[ SNIP ]

These days what's the difference? Both arrays and lists, in computing,
are commonly considered to support indexing, and both can be "iterated"
over one way or the other. As far as arrays go, consider what you can do
with Haskell arrays, or with array operations in APL or J, or with
slices in D...no "for" loops happening there.

I think what Daniel wanted was a lazy split, not an eager one.

Instead of doing a full parse and returning a data structure
(array or list), just return an iterator with a pointer
to the start and do the parsing when asked for the next element.

Arne

A generator, IOW.

Basically, yes. That was what I was trying to get at. Calling split on
an unknown String (without using the limit param) is just asking for a
D.O.S. attack.
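
To illustrate the limit parameter mentioned above, a minimal sketch (the names SplitLimitDemo, firstFields and maxParts are hypothetical). String.split(regex, limit) with a positive limit returns at most that many elements; whatever is left of the input stays unsplit in the last element, so the size of the resulting array is bounded no matter how many separators an untrusted string contains.

```java
import java.util.Arrays;

final class SplitLimitDemo {
    // Returns at most maxParts elements; the unsplit remainder ends up in the last one.
    static String[] firstFields(final String line, final int maxParts) {
        return line.split(",", maxParts);
    }

    public static void main(final String[] args) {
        System.out.println(Arrays.toString(firstFields("a,b,c,d,e", 3)));
    }
}
```

For the input above with a limit of 3, the parts are "a", "b" and "c,d,e".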

Daniel Pitts · Mar 28, 2012

^
I missed a "not" here.

I had wondered ;-)

Stefan Ram · Mar 28, 2012

public static void main( final java.lang.String[] args )
{ split( "Fri 7:30 PM, Sat 2 PM, Sun 2:30 PM" ); }}

Thanks for the comments! I believe Jim's answer was most
close to what the OP asked for, and Robert is right with
most of his criticism.

As someone said regular expressions were too much overhead,
I tried a solution without regular expressions (with custom
pattern matching); it was not thoroughly tested, though:

final class Tracer
{ private int pos = 0;
  private boolean matched = false;
  private final boolean advance(){ ++this.pos; return false; }
  public final boolean reset()
  { this.pos = 0; this.matched = false; return false; }
  public final boolean matched(){ return this.matched; }
  public final boolean accept( final char c )
  { final char ch = java.lang.Character.toLowerCase( c );
    switch( pos )
    { case 0: /* the pattern is hardcoded below */
        return ch == 'a' || ch == 'p' ? this.advance(): this.reset();
      case 1:
        if( ch == 'm' ){ this.matched = true; return true; }
        else { --pos; return this.accept( c ); }
      default: this.reset(); return false; }}}

final class Splitter
{
  private final java.util.List<java.lang.CharSequence> target
  = new java.util.ArrayList<java.lang.CharSequence>();

  private final int comma
  ( final java.lang.CharSequence text, final int i, final int length )
  { final int j = i + 1;
    return j < length ? text.charAt( j ) == ',' ? j : i : j; }

  public final java.util.List<java.lang.CharSequence> split
  ( final java.lang.CharSequence text )
  { final Tracer tracer = new Tracer();
    final int length = text.length();
    int l = 0;
    for( int i = 0; i < length; ++i )
    { tracer.accept( text.charAt( i ));
      if( tracer.matched() )
      { i = comma( text, i, length );
        this.target.add( text.subSequence( l, i ));
        tracer.reset();
        l = i + 1; }}
    return target; }}

public final class Main
{
  public static void main( final java.lang.String[] args )
  {
    java.lang.System.out.println
    ( new Splitter().split
      ( "Fri 7:30 PM, Sat 2 PM, Sun 2:30 PM" )); }}

[Fri 7:30 PM, Sat 2 PM, Sun 2:30 PM]

Gene Wirchenko · Mar 28, 2012

I had wondered ;-)

I have noted over the years, that if there is one word that
people will miss in posts, it is "not".

Sincerely,

Gene Wirchenko

Robert Klemme · Mar 28, 2012

What do you find excellent about this? I find it has some deficiencies

- the separator is included in the match (which goes against the
requirements despite the thread subject)
- spaces after a separator comma are included in the next token as
leading text
- the method really does more than splitting (namely printing), so the
name does not reflect what's going on
- the Pattern is compiled on _every_ invocation of the method
- the method is unnecessarily restricted, argument type CharSequence is
sufficient

Test output for
"Fri 7:30 PM, Sat 2 PM, Sun 2:30 PM"
"Fri 8 PM, Sat 1, 3, and 5 PM"

Fri 7:30 PM,
Sat 2 PM,
Sun 2:30 PM

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    // Compiled once per class rather than on every method invocation.
    private static final Pattern SPLIT_PATTERN = Pattern.compile(
            "(\\S.*?[ap]m)(?:,\\s*)?", Pattern.CASE_INSENSITIVE);

    // Prints each token (group 1, i.e. without the separator) on its own line.
    public static void splitPrint(final CharSequence text) {
        for (final Matcher m = SPLIT_PATTERN.matcher(text); m.find();) {
            System.out.println(m.group(1));
        }
    }

    // Returns the tokens instead of printing them.
    public static List<String> split(final CharSequence text) {
        final List<String> result = new ArrayList<String>();

        for (final Matcher m = SPLIT_PATTERN.matcher(text); m.find();) {
            result.add(m.group(1));
        }

        return result;
    }

    public static void main(final java.lang.String[] args) {
        splitPrint("Fri 7:30 PM, Sat 2 PM, Sun 2:30 PM");
        System.out.println("---");
        splitPrint("Fri 8 PM, Sat 1, 3, and 5 PM");
        System.out.println("---");
    }
}

I had overlooked a fairly obvious improvement with regard to am/pm parsing.

I might even sneak a "\\s*" in between "pm)" and "(?:," to even catch
cases where there are spaces before the separator.
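
A sketch of that variant, with the extra "\\s*" placed between "pm)" and "(?:,". The class name SpaceTolerantSplit is invented for this example; the rest follows the code above. Without the extra "\\s*", an input like "Fri 7:30 PM , Sat 2 PM" would yield a second token that starts with the comma.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SpaceTolerantSplit {
    // Same pattern as before, plus \s* so that whitespace before the comma is consumed too.
    private static final Pattern SPLIT_PATTERN = Pattern.compile(
            "(\\S.*?[ap]m)\\s*(?:,\\s*)?", Pattern.CASE_INSENSITIVE);

    static List<String> split(final CharSequence text) {
        final List<String> result = new ArrayList<String>();
        for (final Matcher m = SPLIT_PATTERN.matcher(text); m.find();) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(final String[] args) {
        System.out.println(split("Fri 7:30 PM , Sat 2 PM"));
        System.out.println(split("Fri 8 PM, Sat 1, 3, and 5 PM"));
    }
}
```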

Kind regards

robert

Robert Klemme · Mar 28, 2012

Premature optimization. Regex parsing inside an inner loop *might* add
unacceptable overhead, however that should be determined via profiling.

That's not the only reason, because:

That's a better reason to factor it out.

I forgot to add another point: regular expressions tend to grow large
which makes methods which contain such a regexp string constant harder
to read.

And then of course there is another difference: with the Pattern in a
static variable you'll notice earlier (at class load time) if the
pattern is ill formatted as opposed to using ad hoc compilation which
comes to haunt you later on every method invocation.
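
A small sketch of that difference, using a deliberately broken pattern. All names here (PatternTiming, Eager, adHoc) are made up for the example. Strictly speaking the static variant fails when the class is initialized, which happens on first use: the JVM reports an ExceptionInInitializerError wrapping the PatternSyntaxException, whereas the ad hoc variant throws the PatternSyntaxException itself on each call.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

final class PatternTiming {
    // Pattern held in a static: compiled (and rejected) once, when Eager is initialized.
    static final class Eager {
        static final Pattern BAD = Pattern.compile("(unclosed");
        static boolean matches(final CharSequence s) { return BAD.matcher(s).matches(); }
    }

    // Ad hoc compilation: the same broken pattern is only noticed when this method runs.
    static boolean adHoc(final CharSequence s) {
        return Pattern.compile("(unclosed").matcher(s).matches();
    }

    public static void main(final String[] args) {
        try {
            Eager.matches("x");
        } catch (final ExceptionInInitializerError e) {
            System.out.println("static pattern: " + e.getCause().getClass().getName());
        }
        try {
            adHoc("x");
        } catch (final PatternSyntaxException e) {
            System.out.println("ad hoc pattern: " + e.getClass().getName());
        }
    }
}
```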

My personal philosophy for this kind of thing:
Correct first, easy second, fast third.

+1

Kind regards

robert

Robert Klemme · Mar 28, 2012

I have noted over the years, that if there is one word that
people will miss in posts, it is "not".

I don't remember the details but I once heard that people cannot
remember "not" - seems to be a psychological thing or a "feature" of the
mind. You kind of focus on the main message and then you forget to
store the negation as well.

Kind regards

robert

Robert Klemme · Mar 28, 2012

That's interesting. I've written my own Deterministic FSA to implement a
subset of regex functionality, and arbitrary lookbehind actually would
be an easy feature to add. Easier than zero-width matches (for example
word-boundaries).

The limitation for lookbehind seems to be quite common (Ruby's Oniguruma
has it as well). With arbitrary lookbehind you need a buffer which can
grow because you must basically operate on the whole string the whole
time. And, most modern regular expression engines are implemented as
NFAs - or better NFA with a lot of special logic stacked onto it. The
runtime overhead of two directions of backtracking might be considered
too big.
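
For the problem this thread started with, a fixed-length lookbehind is enough, so the limitation does not get in the way. A minimal sketch (the class name LookbehindSplit is hypothetical): the separator is a comma plus optional whitespace, but only where it is directly preceded by "am" or "pm", and since the lookbehind consumes nothing, the AM/PM stays in the token.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

final class LookbehindSplit {
    // (?<=[ap]m) is a two-character lookbehind, well within what java.util.regex accepts.
    private static final Pattern SEPARATOR = Pattern.compile(
            "(?<=[ap]m),\\s*", Pattern.CASE_INSENSITIVE);

    static String[] split(final CharSequence text) {
        return SEPARATOR.split(text);
    }

    public static void main(final String[] args) {
        System.out.println(Arrays.toString(split("Fri 7:30 PM, Sat 2 PM, Sun 2:30 PM")));
        System.out.println(Arrays.toString(split("Fri 8 PM, Sat 1, 3, and 5 PM")));
    }
}
```

Commas that do not follow AM/PM, as in "Sat 1, 3, and 5 PM", are left alone.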

Kind regards

robert

Daniel Pitts · Mar 28, 2012

That's not the only reason, because:

I forgot to add another point: regular expressions tend to grow large
which makes methods which contain such a regexp string constant harder
to read.

Right, I did concede that there are other great reasons to factor it
out. Performance isn't the first one I would pick ;-)

And then of course there is another difference: with the Pattern in a
static variable you'll notice earlier (at class load time) if the
pattern is ill formatted as opposed to using ad hoc compilation which
comes to haunt you later on every method invocation.

Actually, I know even earlier. I know at edit time, as my IDE will
highlight bad regex inside methods which take regex ;-)

Even so, it should be found at Unit Test time (which, granted, will be
around the same time whether it's per method or per class-load).

Just a thought.

Daniel Pitts · Mar 28, 2012

I don't remember the details but I once heard that people cannot
remember "not" - seems to be a psychological thing or a "feature" of the
mind. You kind of focus on the main message and then you forget to store
the negation as well.

I wonder if this is really a true phenomenon, or even if it is frequent
enough to contort your point to avoid negating the text of it.

If there is any chance that your point will be pulled out of context,
(such as with dubious reporters), then you may want to choose your words
in such a way that the "not" isn't elided.

However, in day-to-day conversation, I think some concepts are so
much easier to convey as what they are not, instead of what they are.
