java.util.regex.Pattern.split issue

Ichiro

Hi there,

I wrote a test harness for java.util.regex.Pattern.split and found
that, at least from my point of view, it behaves inconsistently. See
program and output below.

In particular, a trailing delimiter does not generate an empty string
on the right of the delimiter, so I get

*******************
'a,' splits into:
'a'
*******************

while I expected

*******************
'a,' splits into:
'a'
''
*******************

Also, possibly even more bizarrely

*******************
',' splits into:
*******************

while (since the empty string splits into an empty string) I expected

*******************
',' splits into:
''
''
*******************

I tried to modify my pattern to also take beginning and end of string
into account, like so

pattern = Pattern.compile("[\\A\\z,]");

but this generated a PatternSyntaxException.
Can someone please suggest a way to achieve what I need?

Finally, is this the right newsgroup for this kind of question? There
sure is a lot of noise (=spam) around here...

Thanks much,
Ichiro


import java.util.regex.Pattern;

public class Main
{
    public static void main(String[] args)
    {
        String input;

        input = "test";
        printSplit(input);

        input = "";
        printSplit(input);

        input = "a,b";
        printSplit(input);

        input = "a,";
        printSplit(input);

        input = ",b";
        printSplit(input);

        input = ",";
        printSplit(input);

        input = "a,b,c";
        printSplit(input);

        input = "a,b,";
        printSplit(input);

        input = "a,,c";
        printSplit(input);

        input = ",b,c";
        printSplit(input);

        input = "a,,";
        printSplit(input);

        input = ",b,";
        printSplit(input);

        input = ",,c";
        printSplit(input);

        input = ",,";
        printSplit(input);
    }


    private static void printSplit(String input)
    {
        Pattern pattern;
        String[] output;

        pattern = Pattern.compile(",");
        output = pattern.split(input);

        System.out.println("'" + input + "' splits into:");
        for (String s : output)
        {
            System.out.println("'" + s + "'");
        }
        System.out.println("*******************");
    }
}


'test' splits into:
'test'
*******************
'' splits into:
''
*******************
'a,b' splits into:
'a'
'b'
*******************
'a,' splits into:
'a'
*******************
',b' splits into:
''
'b'
*******************
',' splits into:
*******************
'a,b,c' splits into:
'a'
'b'
'c'
*******************
'a,b,' splits into:
'a'
'b'
*******************
'a,,c' splits into:
'a'
''
'c'
*******************
',b,c' splits into:
''
'b'
'c'
*******************
'a,,' splits into:
'a'
*******************
',b,' splits into:
''
'b'
*******************
',,c' splits into:
''
''
'c'
*******************
',,' splits into:
*******************
 
Arne Vajhøj

Ichiro said:
Hi there,

I wrote a test harness for java.util.regex.Pattern.split and found
that, at least from my point of view, it behaves inconsistently. See
program and output below.

In particular, a trailing delimiter does not generate an empty string
on the right of the delimiter, so I get

*******************
'a,' splits into:
'a'
*******************

while I expected

*******************
'a,' splits into:
'a'
''
*******************

Also, possibly even more bizarrely

*******************
',' splits into:
*******************

while (since the empty string splits into an empty string) I expected

*******************
',' splits into:
''
''
*******************

http://www.j2ee.me/javase/6/docs/api/java/lang/String.html#split(java.lang.String)

<quote>
Trailing empty strings are therefore not included in the resulting array.
</quote>
I tried to modify my pattern to also take beginning and end of string
into account, like so

pattern = Pattern.compile("[\\A\\z,]");

but this generated a PatternSyntaxException.
Can someone please suggest a way to achieve what I need?

I would suggest either java.util.regex.Pattern or good old
StringTokenizer.
Finally, is this the right newsgroup for this kind of questions?

Sure is.
There
sure is a lot of noise (=spam) around here...

Trim your filters. It is usenet anno 2009.

Arne
 
Ichiro

I would suggest either java.util.regex.Pattern or good old
StringTokenizer.

Thank you Arne.
Please note that I was actually using java.util.regex.Pattern.split,
not String.split, and so I think I need to fine-tune my regular
expression for the delimiter to achieve what I want - unfortunately my
attempts have been unsuccessful so far.

Cheers,
Ichiro
 
Arne Vajhøj

Ichiro said:
Please note that I was actually using java.util.regex.Pattern.split,
not String.split, and so I think I need to fine-tune my regular
expression for the delimiter to achieve what I want - unfortunately my
attempts have been unsuccessful so far.

With Pattern I intended to use matcher not split.
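
A minimal sketch of what a Matcher-based split could look like, for readers following along. The class name MatcherSplitDemo and the helper splitKeepingEmpties are hypothetical, not part of the JDK, and the sketch assumes the delimiter pattern never matches the empty string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatcherSplitDemo {
    // Hypothetical helper: collects the text between delimiter matches,
    // keeping empty strings at the start, in the middle, and at the end.
    // Assumes the delimiter never matches the empty string.
    static List<String> splitKeepingEmpties(Pattern delimiter, String input) {
        List<String> parts = new ArrayList<>();
        Matcher m = delimiter.matcher(input);
        int start = 0;
        while (m.find()) {
            parts.add(input.substring(start, m.start()));
            start = m.end();
        }
        // Whatever follows the last delimiter, possibly the empty string.
        parts.add(input.substring(start));
        return parts;
    }

    public static void main(String[] args) {
        Pattern comma = Pattern.compile(",");
        for (String input : new String[] {"a,", ",", "a,,c"}) {
            System.out.println("'" + input + "' splits into:");
            for (String s : splitKeepingEmpties(comma, input)) {
                System.out.println("'" + s + "'");
            }
            System.out.println("*******************");
        }
    }
}
```

With this helper, 'a,' yields 'a' followed by '', and ',' yields two empty strings, which is the result asked for in the original post.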

Arne
 
Ichiro

With Pattern I intended to use matcher not split.

You mean you can see no other option than to roll my own "split"
functions using matcher.find() in a loop?
Seems strange that the power of regexp would not allow me to solve
this simple problem.

Thanks
 
Arne Vajhøj

Ichiro said:
You mean you can see no other option than to roll my own "split"
functions using matcher.find() in a loop?
Seems strange that the power of regexp would not allow me to solve
this simple problem.

The power of regex most certainly allows you to solve that.

But the simplicity of the split method does not.

But the split methods explicitly state in their documentation
that trailing empty strings are removed.

If you were able to get it working, then it would be
a bug that would need to be fixed.

Arne
 
Ichiro

Can someone please suggest a concrete way of solving the issue, if
possible with code?
As a hack, I tried to include beginning (\\A) and end (\\z) of string
as alternative delimiters (see original post) but I had no luck.
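
For context on the exception: \A and \z are boundary matchers, and java.util.regex rejects them inside a character class, so "[\\A\\z,]" fails to compile. Alternatives are written with | instead. The sketch below only checks whether each pattern compiles; the class name AnchorInClassDemo and the helper compiles are hypothetical. Even with a pattern that compiles, split(input) still drops trailing empty strings, so this route does not give the desired result on its own.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class AnchorInClassDemo {
    // Hypothetical helper: reports whether a regex compiles.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Anchors inside a character class: rejected with PatternSyntaxException.
        System.out.println(compiles("[\\A\\z,]"));  // prints false
        // Anchors as alternatives: valid syntax.
        System.out.println(compiles("\\A|\\z|,"));  // prints true
    }
}
```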

Thank you
 
Ichiro

Actually, cancel that. After RTFM a little closer, I found the
solution. It's not entirely clear to me why it works, but it does.

Pattern pattern = Pattern.compile(",");
String[] output = pattern.split(input, -1);
// instead of pattern.split(input)
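
As to why this works: per the Pattern.split documentation, split(input) behaves like split(input, 0), and a limit of zero discards trailing empty strings, while a negative limit applies the pattern as many times as possible and keeps them. A small sketch comparing the two, where the class name SplitLimitDemo and the helper countParts are hypothetical:

```java
import java.util.regex.Pattern;

class SplitLimitDemo {
    static final Pattern COMMA = Pattern.compile(",");

    // Hypothetical helper: number of pieces produced for a given limit.
    static int countParts(String input, int limit) {
        return COMMA.split(input, limit).length;
    }

    public static void main(String[] args) {
        for (String input : new String[] {"a,", ",", ",,", ""}) {
            System.out.println("'" + input + "': limit 0 -> "
                    + countParts(input, 0) + ", limit -1 -> "
                    + countParts(input, -1));
        }
        // Prints:
        // 'a,': limit 0 -> 1, limit -1 -> 2
        // ',': limit 0 -> 0, limit -1 -> 2
        // ',,': limit 0 -> 0, limit -1 -> 3
        // '': limit 0 -> 1, limit -1 -> 1
    }
}
```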

Thanks
 
Roedy Green

I wrote a test harness for java.util.regex.Pattern.split and found
that, at least from my point of view, it behaves inconsistently

There is a way around that gotcha. See
http://mindprod.com/jgloss/regex.html#SPLITTING
--
Roedy Green Canadian Mind Products
http://mindprod.com

"For reason that have a lot to do with US Government bureaucracy, we settled on the one issue everyone could agree on, which was weapons of mass destruction."
~ Paul Wolfowitz 2003-06, explaining how the Bush administration sold the Iraq war to a gullible public.
 
