Keeping the split token in a Java regular expression

Arved Sandstrom

On 3/27/12 2:21 PM, Arne Vajhøj wrote:
On 3/27/2012 12:14 AM, Daniel Pitts wrote:
On 3/26/12 6:58 PM, Arne Vajhøj wrote:
On 3/26/2012 2:54 PM, laredotornado wrote:
I'm using Java 6. I want to split a Java string on a regular
expression, but I would like to keep part of the string used to
split
in the results. What I have are Strings like

Fri 7:30 PM, Sat 2 PM, Sun 2:30 PM

What I would like to do is split the expression wherever I have an
expression matching /(am|pm),?/i . Hopefully I got that right. In
the above example, I would like the results to be

Fri 7:30 PM
Sat 2 PM
Sun 2:30 PM

But with String.split, the split token is not kept within the
results. How would I write a Java parsing expression to do what I
want?

A hackish solution:

String[] p = s.replaceAll("[AP]M", "$0X$0").split("X[AP]M");
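As an editorial aside, a similar result can be sketched without the placeholder trick by splitting on a zero-width lookbehind: the AM/PM text stays in each piece, while the optional comma and any following whitespace are consumed as the separator. The class name KeepTokenSplit and the exact pattern are assumptions made for illustration; they were not posted in the thread.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical helper, not from the thread: split *after* AM/PM so the token is kept.
final class KeepTokenSplit {
    // (?<=[ap]m) asserts that "am"/"pm" precedes the split point without consuming it;
    // ",?\\s*" then consumes the optional comma and trailing whitespace.
    private static final Pattern AFTER_MERIDIEM =
            Pattern.compile("(?<=[ap]m),?\\s*", Pattern.CASE_INSENSITIVE);

    static String[] split(final String s) {
        // Trailing empty strings are dropped by split(), so the empty match
        // at the very end of the input does not add an extra element.
        return AFTER_MERIDIEM.split(s);
    }

    public static void main(final String[] args) {
        System.out.println(Arrays.toString(split("Fri 7:30 PM, Sat 2 PM, Sun 2:30 PM")));
    }
}
```

Unlike the replaceAll/split one-liner, this variant also strips the leading ", " from the second and later pieces.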

Nice. As far as hackish, using "split" for this purpose at all is
hackish.

That type of split is the typical way in most modern languages
(though usually in a non regex flavor).
For functional languages, yes, but those modern languages don't
necessarily return an array. Ideally they would return "iterable" of
some sort.
[ SNIP ]

These days what's the difference? Both arrays and lists, in computing,
are commonly considered to support indexing, and both can be "iterated"
over one way or the other. As far as arrays go, consider what you can do
with Haskell arrays, or with array operations in APL or J, or with
slices in D...no "for" loops happening there.

I think what Daniel wanted was a lazy not an eager split.

Instead of doing a full parse and returning a data structure
(array or list), just return an iterator with a pointer
to the start, and do the parsing when asked for the next element.

Arne
A generator, IOW.

AHS
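A lazy split of the kind described here might look like the following sketch: an Iterator that holds a Matcher and only searches for the next piece when asked. The class name LazyTokens and the token pattern are assumptions for illustration; nobody in the thread posted this code.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical lazy splitter: nothing is parsed until hasNext()/next() is called.
final class LazyTokens implements Iterator<String> {
    // Assumed pattern: a token runs from a non-space character up to the first am/pm;
    // an optional comma plus whitespace after it is consumed but not returned.
    private static final Pattern TOKEN =
            Pattern.compile("(\\S.*?[ap]m)(?:,\\s*)?", Pattern.CASE_INSENSITIVE);

    private final Matcher matcher;
    private String pending;     // token found but not yet handed out
    private boolean exhausted;  // set once find() has failed, so we never search again

    LazyTokens(final CharSequence text) {
        this.matcher = TOKEN.matcher(text);
    }

    public boolean hasNext() {
        if (pending == null && !exhausted) {
            if (matcher.find()) {
                pending = matcher.group(1);
            } else {
                exhausted = true;
            }
        }
        return pending != null;
    }

    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        final String result = pending;
        pending = null;
        return result;
    }

    public void remove() {
        throw new UnsupportedOperationException();
    }

    public static void main(final String[] args) {
        final Iterator<String> it = new LazyTokens("Fri 7:30 PM, Sat 2 PM, Sun 2:30 PM");
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}
```

A caller that stops after the first element never pays for scanning the rest of the input.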
 
Gene Wirchenko

On Tue, 27 Mar 2012 14:29:33 -0700, Daniel Pitts

[snip]
At the same time, it is one's personal loss to ignore something because
of who said it or how it was said. Part of the problem is the jadedness

One must balance the loss of missing something with the loss of
spending time trying to uncurve a response.

[snip]
I just want to point out that while your intentions *may* be good, the
tone of your message comes off just as smug as what you're attempting to
decry. I'm not trying to stir up a flame war, but I'm hoping that you
can see the other side of this as well. Lew has been a long-time
contributor to the Java newsgroups, and I have never found any of his
posts personally distasteful in any way. This is the internet, and some
slight thickness of skin is expected.

"slight". And that does mean that being rude is good.
^
I missed a "not" here.
I am not baiting him. I like the polite Lew. There is no reason
why people can not be polite on USENET. They just have to decide to
do so.

Sincerely,

Gene Wirchenko
 
Daniel Pitts

On 12-03-27 07:20 PM, Daniel Pitts wrote:
On 3/27/12 2:21 PM, Arne Vajhøj wrote:
That type of split is the typical way in most modern languages
(though usually in a non regex flavor).
For functional languages, yes, but those modern languages don't
necessarily return an array. Ideally they would return "iterable" of
some sort.
[ SNIP ]

These days what's the difference? Both arrays and lists, in computing,
are commonly considered to support indexing, and both can be "iterated"
over one way or the other. As far as arrays go, consider what you can do
with Haskell arrays, or with array operations in APL or J, or with
slices in D...no "for" loops happening there.

I think what Daniel wanted was a lazy not an eager split.

Instead of doing a full parse and returning a data structure
(array or list), just return an iterator with a pointer
to the start, and do the parsing when asked for the next element.

Arne
A generator, IOW.

Basically, yes. That was what I was trying to get at. Calling split on
an unknown String (without using the limit param) is just asking for a
D.O.S. attack.
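For reference, the limit parameter mentioned here is the second argument of String.split(String, int); with a positive limit the returned array has at most that many elements, and the last one holds the unsplit remainder. A minimal sketch, in which the helper name BoundedSplit and the comma delimiter are assumptions:

```java
import java.util.Arrays;

// Hypothetical helper: cap the number of pieces produced from untrusted input.
final class BoundedSplit {
    static String[] splitAtMost(final String s, final int maxParts) {
        // With a positive limit the pattern is applied at most maxParts - 1 times.
        return s.split(",\\s*", maxParts);
    }

    public static void main(final String[] args) {
        System.out.println(Arrays.toString(
                splitAtMost("Fri 7:30 PM, Sat 2 PM, Sun 2:30 PM", 2)));
    }
}
```

With a limit of 2, the sample input yields "Fri 7:30 PM" followed by the untouched remainder "Sat 2 PM, Sun 2:30 PM".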
 
Stefan Ram

public static void main( final java.lang.String[] args )
{ split( "Fri 7:30 PM, Sat 2 PM, Sun 2:30 PM" ); }}

Thanks for the comments! I believe Jim's answer was most
close to what the OP asked for, and Robert is right with
most of his criticism.

As someone said regular expressions were too much overhead,
I tried a solution without regular expressions (with custom
pattern matching); it was not thoroughly tested, though:

final class Tracer
{ private int pos = 0;
private boolean matched = false;
private final boolean advance(){ ++this.pos; return false; }
public final boolean reset()
{ this.pos = 0; this.matched = false; return false; }
public final boolean matched(){ return this.matched; }
public final boolean accept( final char c )
{ final char ch = java.lang.Character.toLowerCase( c );
switch( pos )
{ case 0: /* the pattern is hardcoded below */
return ch == 'a' || ch == 'p' ? this.advance(): this.reset();
case 1:
if( ch == 'm' ){ this.matched = true; return true; }
else { --pos; return this.accept( c ); }
default: this.reset(); return false; }}}

final class Splitter
{
private final java.util.List<java.lang.CharSequence> target
= new java.util.ArrayList<java.lang.CharSequence>();

private final int comma
( final java.lang.CharSequence text, final int i, final int length )
{ final int j = i + 1;
return j < length ? text.charAt( j ) == ',' ? j : i : j; }

public final java.util.List<java.lang.CharSequence> split
( final java.lang.CharSequence text )
{ final Tracer tracer = new Tracer();
final int length = text.length();
int l = 0;
for( int i = 0; i < length; ++i )
{ tracer.accept( text.charAt( i ));
if( tracer.matched() )
{ i = comma( text, i, length );
this.target.add( text.subSequence( l, i ));
tracer.reset();
l = i + 1; }}
return target; }}

public final class Main
{
public static void main( final java.lang.String[] args )
{
java.lang.System.out.println
( new Splitter().split
( "Fri 7:30 PM, Sat 2 PM, Sun 2:30 PM" )); }}

[Fri 7:30 PM, Sat 2 PM, Sun 2:30 PM]
 
Robert Klemme

What do you find excellent about this? I find it has some deficiencies

- the separator is included in the match (which goes against the
requirements despite the thread subject)
- spaces after a separator comma are included in the next token as
leading text
- the method really does more than splitting (namely printing), so the
name does not reflect what's going on
- the Pattern is compiled on _every_ invocation of the method
- the method is unnecessarily restricted; argument type CharSequence is
sufficient

Test output for
"Fri 7:30 PM, Sat 2 PM, Sun 2:30 PM"
"Fri 8 PM, Sat 1, 3, and 5 PM"

Fri 7:30 PM,
Sat 2 PM,
Sun 2:30 PM

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    private static final Pattern SPLIT_PATTERN = Pattern.compile(
            "(\\S.*?[ap]m)(?:,\\s*)?", Pattern.CASE_INSENSITIVE);

    public static void splitPrint(final CharSequence text) {
        for (final Matcher m = SPLIT_PATTERN.matcher(text); m.find();) {
            System.out.println(m.group(1));
        }
    }

    public static List<String> split(final CharSequence text) {
        final List<String> result = new ArrayList<String>();

        for (final Matcher m = SPLIT_PATTERN.matcher(text); m.find();) {
            result.add(m.group(1));
        }

        return result;
    }

    public static void main(final java.lang.String[] args) {
        splitPrint("Fri 7:30 PM, Sat 2 PM, Sun 2:30 PM");
        System.out.println("---");
        splitPrint("Fri 8 PM, Sat 1, 3, and 5 PM");
        System.out.println("---");
    }
}

I had overlooked a fairly obvious improvement with regard to am/pm parsing.
I might even sneak a "\\s*" in between "pm)" and "(?:," to also catch
cases where there are spaces before the separator.

Kind regards

robert
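The "\\s*" tweak suggested in this post can be sketched as follows. The pattern is the one from the post with the extra "\\s*" inserted as described; the class name SpaceTolerantSplit and the sample input with a space before the comma are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical variant of the posted splitter that tolerates "PM ," as well as "PM,".
final class SpaceTolerantSplit {
    // Same as the posted pattern, plus \\s* between the captured token and the comma group.
    private static final Pattern SPLIT_PATTERN = Pattern.compile(
            "(\\S.*?[ap]m)\\s*(?:,\\s*)?", Pattern.CASE_INSENSITIVE);

    static List<String> split(final CharSequence text) {
        final List<String> result = new ArrayList<String>();
        for (final Matcher m = SPLIT_PATTERN.matcher(text); m.find();) {
            result.add(m.group(1)); // group 1 excludes the separator and surrounding spaces
        }
        return result;
    }

    public static void main(final String[] args) {
        System.out.println(split("Fri 7:30 PM , Sat 2 PM"));
    }
}
```

Without the extra "\\s*", the second match for that input would start at the comma and come out as ", Sat 2 PM".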
 
Robert Klemme

Premature optimization. Regex parsing inside an inner loop *might* add
unacceptable overhead, however that should be determined via profiling.

That's not the only reason, because:
That's a better reason to factor it out.

I forgot to add another point: regular expressions tend to grow large
which makes methods which contain such a regexp string constant harder
to read.

And then of course there is another difference: with the Pattern in a
static variable you'll notice earlier (at class load time) if the
pattern is ill formatted as opposed to using ad hoc compilation which
comes to haunt you later on every method invocation.
My personal philosophy for this kind of thing:
Correct first, easy second, fast third.

+1

Kind regards

robert
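The class-load-time point can be illustrated with a small sketch: a deliberately malformed pattern in a static field fails as soon as the holding class is initialized, rather than on some later method call. The names EagerPatternCheck, Holder and touchHolder are hypothetical, and the broken pattern is made up for the demonstration.

```java
import java.util.regex.Pattern;

// Hypothetical demo: a bad pattern in a static field surfaces at class initialization.
final class EagerPatternCheck {
    static final class Holder {
        // Deliberately malformed (unclosed group); compiled when Holder is initialized.
        static final Pattern BROKEN = Pattern.compile("([ap]m");
    }

    static String touchHolder() {
        try {
            // First access to the static field triggers Holder's initialization.
            return "compiled: " + Holder.BROKEN.pattern();
        } catch (ExceptionInInitializerError e) {
            // The JVM wraps the PatternSyntaxException thrown by the initializer.
            return "failed during class initialization: "
                    + e.getCause().getClass().getSimpleName();
        } catch (NoClassDefFoundError e) {
            // Later accesses, after initialization has already failed once.
            return "class unusable after failed initialization";
        }
    }

    public static void main(final String[] args) {
        System.out.println(touchHolder());
    }
}
```

In a real program nobody would catch these errors; the point is only that the failure appears on first use of the class, independent of which method eventually needs the pattern.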
 
Robert Klemme

I have noted over the years, that if there is one word that
people will miss in posts, it is "not".

I don't remember the details but I once heard that people cannot
remember "not" - seems to be a psychological thing or a "feature" of the
mind. You kind of focus on the main message and then you forget to
store the negation as well.

Kind regards

robert
 
Robert Klemme

That's interesting. I've written my own Deterministic FSA to implement a
subset of regex functionality, and arbitrary lookbehind actually would
be an easy feature to add. Easier than zero-width matches (for example
word-boundaries).

The limitation for lookbehind seems to be quite common (Ruby's Oniguruma
has it as well). With arbitrary lookbehind you need a buffer which can
grow because you must basically operate on the whole string the whole
time. And, most modern regular expression engines are implemented as
NFAs - or better NFA with a lot of special logic stacked onto it. The
runtime overhead of two directions of backtracking might be considered
too big.

Kind regards

robert
 
Daniel Pitts

That's not the only reason, because:


I forgot to add another point: regular expressions tend to grow large
which makes methods which contain such a regexp string constant harder
to read.
Right, I did concede that there are other great reasons to factor it
out. Performance isn't the first one I would pick ;-)
And then of course there is another difference: with the Pattern in a
static variable you'll notice earlier (at class load time) if the
pattern is ill formatted as opposed to using ad hoc compilation which
comes to haunt you later on every method invocation.
Actually, I know even earlier. I know at edit time, as my IDE will
highlight bad regex inside methods which take regex ;-)

Even so, it should be found at Unit Test time (which, granted, will be
around the same time whether it's per method or per class-load).

Just a thought.
 
Daniel Pitts

I don't remember the details but I once heard that people cannot
remember "not" - seems to be a psychological thing or a "feature" of the
mind. You kind of focus on the main message and then you forget to store
the negation as well.
I wonder if this is really a true phenomenon, or even if it is frequent
enough to contort your point to avoid negating the text of it.

If there is any chance that your point will be pulled out of context,
(such as with dubious reporters), then you may want to choose your words
in such a way that the "not" isn't elided.

However, on the day-to-day conversation, I think some concepts are so
much easier to convey as what they are not, instead of what they are.
 
