Bug in Regex Split

Roedy Green

I don't think this behaviour is defensible even if documented
somewhere.

Consider this:

Pattern spaceSplitter = Pattern.compile( " " );

now try it on strings with lead, embedded and trailing spaces.

lead spaces turn into "" each, but trailing spaces are ignored.

e.g.

.split( "..a.b..c.." ); (where . represents space )

gives:

""
""
"a"
"b"
""
"c"
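
A minimal sketch for reproducing this, assuming the input is two leading spaces, "a", one space, "b", two spaces, "c" and two trailing spaces. The class and method names (SplitRepro, splitOnSpace) are only illustrative and not from the original post:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitRepro {
    // Splits on a single literal space using the one-argument split.
    static String[] splitOnSpace(String input) {
        Pattern spaceSplitter = Pattern.compile(" ");
        return spaceSplitter.split(input);
    }

    public static void main(String[] args) {
        String[] parts = splitOnSpace("  a b  c  ");
        // Leading and embedded empty strings are kept, trailing ones are dropped.
        System.out.println(parts.length);           // 6
        System.out.println(Arrays.toString(parts)); // [, , a, b, , c]
    }
}
```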
 
skeptic

Roedy Green said:
I don't think this behaviour is defensible even if documented
somewhere.

Consider this:

Pattern spaceSplitter = Pattern.compile( " " );

now try it on strings with lead, embedded and trailing spaces.

lead spaces turn into "" each, but trailing spaces are ignored.

e.g.

.split( "..a.b..c.." ); (where . represents space )

gives:

""
""
"a"
"b"
""
"c"

Strictly speaking it's not a bug, as there is no formal regex
replacement/splitting specification in the Java documentation. I also
doubt one ever existed at all, because those Perl regexes are more
of an art than a science (real regular expressions aside).

Nevertheless, I agree that it does the wrong thing.
But that's not the real problem: I could just adapt my app to the
given behaviour.
The real problem is that the behaviour may silently change in the next
version of the JRE.
A macabre picture comes to mind, where some search/replace-type app
goes wild in batch mode after a JRE upgrade.

Regards
 
Dave Glasser

I don't think this behaviour is defensible even if documented
somewhere.

Consider this:

Pattern spaceSplitter = Pattern.compile( " " );

now try it on strings with lead, embedded and trailing spaces.

lead spaces turn into "" each, but trailing spaces are ignored.

e.g.

.split( "..a.b..c.." ); (where . represents space )

gives:

""
""
"a"
"b"
""
"c"

That does seem odd that it seems to treat leading and trailing
whitespace differently. I just ran this perl script that does the same
operation, however:

$_ = "  a b  c  ";
@outs = split(/ /);
foreach $elem (@outs) {
    print "\n-$elem-";
}



and the output is:

--
--
-a-
-b-
--
-c-

Which implies that perl follows the same rules. (Hopefully a perl
expert will read this and explain why.)

Do you want your output array to include a zero-length string wherever
there's a boundary between two space characters? It seems to me you'd
want to treat any amount of contiguous whitespace as a single
delimiter, and therefore use "\\s+" (one or more whitespace
characters) instead of " ".
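
A rough sketch of that suggestion, assuming the same sample input as above. The names WhitespaceRunSplit, splitOnRuns and splitTrimmed are made up for the example. Note that a leading run of whitespace still produces one leading empty string unless the input is trimmed first:

```java
import java.util.Arrays;

class WhitespaceRunSplit {
    // Treats any run of whitespace as a single delimiter.
    static String[] splitOnRuns(String input) {
        return input.split("\\s+");
    }

    // Same, but trims first so no leading empty string appears.
    static String[] splitTrimmed(String input) {
        return input.trim().split("\\s+");
    }

    public static void main(String[] args) {
        String input = "  a b  c  ";
        System.out.println(Arrays.toString(splitOnRuns(input)));  // [, a, b, c]
        System.out.println(Arrays.toString(splitTrimmed(input))); // [a, b, c]
    }
}
```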
 
Rene

Dave Glasser said:
GMT in comp.lang.java.programmer:
[snip]
lead spaces turn into "" each, but trailing spaces are ignored.

e.g.

.split( "..a.b..c.." ); (where . represents space )

gives:

""
""
"a"
"b"
""
"c"

That does seem odd that it seems to treat leading and trailing
whitespace differently. I just ran this perl script that does the same
operation, however:

$_ = "  a b  c  ";
@outs = split(/ /);
foreach $elem (@outs) {
    print "\n-$elem-";
}

It works if you add the flag "x". So try @outs = split(/ /x); and it will
give: (changed - to + so as not to break quoting at --<space>)

+ +
+ +
+a+
+ +
+b+
+ +
+ +
+c+
+ +
+ +

What /x does is to change behaviour with regard to comments. You can do
things like that in perl:

m{
\w+: # match a word
( # begin group
\s+ # match one or more whitespaces
\w+ # match another word
) # end group
\s* # match zero or more whitespaces
\d+ # match some digits
}x;

In this case the regexp is far more readable, but contains a *lot* of
whitespace and comment chars that are not part of the expression, so you
need to flag them as such. It seems that trailing spaces are generally
considered comments.

I didn't expect it either and I'm not a regexp master (had to take the
camel book off the shelf for that :) ). Skimming through the javadoc, I'd
expect the Pattern.COMMENTS flag to yield the same behaviour.
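
A small sketch for checking that expectation on the Java side. Per the javadoc, whitespace in the pattern is ignored under Pattern.COMMENTS, so " " should compile to an empty pattern and the input should be cut between every character. The class and method names are made up for the example, and exact handling of a leading empty piece may differ between JRE versions, so the output is not spelled out here:

```java
import java.util.regex.Pattern;

class CommentsFlagDemo {
    // With COMMENTS set, the single space in the pattern is ignored,
    // leaving an empty pattern that matches between all characters.
    static String[] splitWithComments(String input) {
        return Pattern.compile(" ", Pattern.COMMENTS).split(input);
    }

    public static void main(String[] args) {
        for (String piece : splitWithComments("  a b  c  ")) {
            // Each piece is printed between plus signs, as in the Perl run.
            System.out.println("+" + piece + "+");
        }
    }
}
```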
Which implies that perl follows the same rules. (Hopefully a perl
expert will read this and explain why.)

Here's my 2 cents but I'm not an expert on that matter.

CU

Rene
 
Filip Larsen

Roedy Green wrote
I don't think this behaviour is defensible even if documented
somewhere.

Consider this:

Pattern spaceSplitter = Pattern.compile( " " );

now try it on strings with lead, embedded and trailing spaces.

lead spaces turn into "" each, but trailing spaces are ignored.

e.g.

.split( "..a.b..c.." ); (where . represents space )

gives:

""
""
"a"
"b"
""
"c"

The call to split(s) corresponds to a call to split(s,0), where zero is the
limit parameter. For that call, zero means that trailing empty strings in
the result are removed (as the documentation clearly says). If called with
a negative limit, the result will contain trailing empty strings.

So, in your case a call to .split( "..a.b..c..", -1 ) (with . again standing
for a space) gives the expected "", "", "a", "b", "", "c", "", "".
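
To see the effect of the limit parameter side by side, here is a small sketch using the sample input from the original post. The class name SplitLimitDemo and the helper split are only for illustration:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitLimitDemo {
    // Thin wrapper around Pattern.split(CharSequence, int) for a single space.
    static String[] split(String input, int limit) {
        return Pattern.compile(" ").split(input, limit);
    }

    public static void main(String[] args) {
        String input = "  a b  c  ";
        // limit 0: trailing empty strings are discarded
        System.out.println(split(input, 0).length);   // 6
        // negative limit: every field is kept, including trailing empties
        System.out.println(split(input, -1).length);  // 8
        // positive limit n: at most n fields, the last one holds the rest
        System.out.println(split(input, 3).length);   // 3
        System.out.println(Arrays.toString(split(input, -1)));
    }
}
```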

Whether or not it is defensible for the .split(CharSequence) call to remove
trailing empty strings is a different issue. If that is the normal behaviour
of split in other regex implementations, then I think it is a good choice to
let split(CharSequence) be aligned with that.


Regards,
 
Roedy Green

Do you want your output array to include a zero-length string wherever
there's a boundary between two space characters?

In this particular case I am trying to break a phrase into words in
such a way that I can reconstruct it precisely the way I found it. So
lead, embedded and trailing blanks are significant. I store words
with an implied trailing blank on all but the first.

I could see split ignoring lead and trail blanks or treating strings
of embedded blanks as one, but not the asymmetric thing it does.
It should treat lead, trail and embedded blanks the same way.
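
For the exact-reconstruction requirement, one possible approach (a sketch, not necessarily what was used here) is to split with a negative limit, as described elsewhere in this thread, and rejoin with the same single-space delimiter. RoundTripSplit and its methods are hypothetical names:

```java
class RoundTripSplit {
    // Negative limit keeps leading, embedded and trailing empty fields.
    static String[] toWords(String phrase) {
        return phrase.split(" ", -1);
    }

    // Joining with the same delimiter restores the original phrase.
    static String fromWords(String[] words) {
        return String.join(" ", words);
    }

    public static void main(String[] args) {
        String phrase = "  a b  c  ";
        String rebuilt = fromWords(toWords(phrase));
        System.out.println(rebuilt.equals(phrase)); // true
    }
}
```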
 
Roedy Green

I always use trim() so the problem does not exist.

In my case I was processing filenames. The lead and trail blanks are
significant. I decided to simply treat these as errors and ask the
user to rename his files.
 
Phil...

If it needs to be exact and split() isn't doin it for ya
maybe try charAt() and make a loop that does
exactly what you want, doesn't seem hard
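
Something along these lines, as a rough sketch. ManualSplit and splitKeepingAll are made-up names, and this version treats leading, embedded and trailing delimiters all the same way:

```java
import java.util.ArrayList;
import java.util.List;

class ManualSplit {
    // Walks the string with charAt() and cuts a field at every delimiter,
    // so n delimiters always yield n + 1 fields (some possibly empty).
    static List<String> splitKeepingAll(String input, char delimiter) {
        List<String> pieces = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < input.length(); i++) {
            if (input.charAt(i) == delimiter) {
                pieces.add(input.substring(start, i));
                start = i + 1;
            }
        }
        // Whatever follows the last delimiter (possibly empty).
        pieces.add(input.substring(start));
        return pieces;
    }

    public static void main(String[] args) {
        System.out.println(splitKeepingAll("  a b  c  ", ' ').size()); // 8
    }
}
```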
 
Alan Moore

But that's not the real problem: I could just adapt my app to the
given behaviour.
The real problem is that the behaviour may silently change in the next
version of the JRE.
A macabre picture comes to mind, where some search/replace-type app
goes wild in batch mode after a JRE upgrade.

I doubt that will happen. The only reason to change its behavior
would be if someone showed that it was not consistent with Perl's
split function--which has happened once or twice. But, AFAIK, the
split() method is now fully Perl-compliant.
 
Roedy Green

I doubt that will happen. The only reason to change its behavior
would be if someone showed that it was not consistent with Perl's
split function--which has happened once or twice. But, AFAIK, the
split() method is now fully Perl-compliant.

And this strange behaviour is documented in the two-argument split,
so they can't change it now.
 
Roedy Green

I doubt that will happen. The only reason to change its behavior
would be if someone showed that it was not consistent with Perl's
split function--which has happened once or twice. But, AFAIK, the
split() method is now fully Perl-compliant.

It is not as strange as it first appears. In many languages lead and
embedded separators are significant, but trailing ones are not.

I think particularly back to OS JCL which was nuts on null positional
parameters.
 
Raymond DeCampo

Roedy said:
In my case I was processing filenames. The lead and trail blanks are
significant. I decided to simply treat these as errors and ask the
user to rename his files.

Roedy,

Why did you not just use Pattern.split(CharSequence,int) where you pass
a negative limit? According to the documentation it sounds like it does
what you want.

Ray
 
Roedy Green

Why did you not just use Pattern.split(CharSequence,int) where you pass
a negative limit? According to the documentation it sounds like it does
what you want.

Because I did not know about it at the time I complained.
 
