java String split() does not work for delimiter "|" ?

chunji08 · Oct 12, 2007

Hi all,

I have such data in a flat text file,
"
106083|1791||7|73755|48|96|3||01/07/2005 13:04:48.979215 PST|||||t|f||
t|f|t|"
"

And such java code to read this line and split it by "|",

"
while ((( rd = in.readLine())!= null)) {
    String delimiter = new String("|");
    String[] t1 = rd.split(delimiter);
    String[] t2 = rd.split("|");
}
"

Either way, the split does not work! It splits the string per each
char. Does someone know why ?

Here is my jdk information on the linux box.
"
java version "1.6.0"
Java(TM) SE Runtime Environment (build 1.6.0-b105)
Java HotSpot(TM) Server VM (build 1.6.0-b105, mixed mode)
"

Thanks a lot for any tips.

Chun

Joshua Cranmer · Oct 12, 2007

chunji08 said:
And such java code to read this line and split it by "|",

`split' takes a regex as its argument, and '|' happens to be a special
operator in regex. Instead of "|", you want "\\|".

chunji08 said:
Either way, the split does not work! It splits the string per each
char. Does someone know why ?

Your regex specifies either the empty string or the empty string. Since
there is an empty string between each character, the string is split
between each character. It's what you told it to do.

For more information:
<http://java.sun.com/javase/6/docs/api/java/lang/String.html> and
<http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html>
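
A minimal sketch of the escaped split, assuming a shortened version of the sample line from the question. The class and method names are made up for illustration. It also shows one thing the sample data is likely to run into: with the one-argument split(), trailing empty fields are dropped, while a negative limit keeps them.

```java
import java.util.Arrays;

class PipeSplit {
    // Split on a literal vertical bar; the negative limit keeps trailing empty fields.
    static String[] splitKeepingEmpties(String line) {
        return line.split("\\|", -1);
    }

    // Split on a literal vertical bar; trailing empty fields are discarded.
    static String[] splitDroppingTrailing(String line) {
        return line.split("\\|");
    }

    public static void main(String[] args) {
        // Shortened, hypothetical version of the line in the question.
        String line = "106083|1791||7|t|f||";
        System.out.println(Arrays.toString(splitKeepingEmpties(line)));
        System.out.println(splitKeepingEmpties(line).length);   // 8 fields
        System.out.println(splitDroppingTrailing(line).length); // 6 fields
    }
}
```

Whether the trailing empty fields matter depends on how the file is consumed; if columns are addressed by position, the negative limit is probably the safer choice.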

chunji08 · Oct 12, 2007

Please ignore that. "\\|" works for me; I guess I use Perl too much.

-cji

RedGrittyBrick · Oct 12, 2007

chunji08 said:
Either way, the split does not work! It splits the string per each
char. Does someone know why ?

Because the argument to split() is a regex not a string.

In regexes, certain characters (metacharacters) have special meanings.
The vertical bar is such a metacharacter, representing alternation.

public class MetaChar {
    public static void main(String[] args) {
        String s = "oneXtwoYthreeXfour";
        // "X|Y" is an alternation: split wherever either X or Y occurs.
        String[] a = s.split("X|Y");
        for (String w : a)
            System.out.println(w);
    }
}

You have to "escape" the vertical bar if you want to treat it as an
ordinary character and not as a metacharacter.

http://www.regular-expressions.info/alternation.html
http://www.regular-expressions.info/characters.html
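
If the delimiter is not known in advance, escaping by hand gets error-prone. One option, sketched here with a hypothetical helper name, is to let Pattern.quote() turn whatever string you have into a literal pattern, so metacharacters such as '|', '.' or '*' lose their special meaning.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class LiteralSplit {
    // Treat the delimiter as plain text, whatever characters it contains.
    static String[] splitLiteral(String text, String delimiter) {
        return text.split(Pattern.quote(delimiter), -1);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitLiteral("a|b|c", "|")));
        System.out.println(Arrays.toString(splitLiteral("1.2.3", ".")));
        System.out.println(Arrays.toString(splitLiteral("x*y", "*")));
    }
}
```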

Roedy Green · Oct 13, 2007

chunji08 said:
Either way, the split does not work! It splits the string per each
char. Does someone know why ?

You mean the literal |, not the regex operator |. See
http://mindprod.com/jgloss/regex.html
on quoting.

Roedy Green · Oct 13, 2007

chunji08 said:
String delimiter = new String("|")

There is no need for new String.

See http://mindprod.com/jgloss/newbie.html

You can write that as:

String delimiter = "|";

but of course, as others pointed out, you meant:

String delimiter = "\\|";

smarty.ad4 · Aug 8, 2013

You can also do it like this:
StringTokenizer tokenizer = new StringTokenizer(content, "||");
while (tokenizer.hasMoreTokens()) {
    _log.info("tokenizer.nextToken() : " + tokenizer.nextToken());
}

smarty.ad4 · Aug 8, 2013

You can also do it like this:
StringTokenizer tokenizer = new StringTokenizer(content, "|");
while (tokenizer.hasMoreTokens()) {
    _log.info("tokenizer.nextToken() : " + tokenizer.nextToken());
}

Lew · Aug 8, 2013

smarty.ad4 said:
You can also do like this :
StringTokenizer tokenizer = new StringTokenizer(content, "|");
while(tokenizer.hasMoreTokens()){
_log.info("tokenizer.nextToken() : "+tokenizer.nextToken());
}

"StringTokenizer is a legacy class that is retained for compatibility reasons although
its use is discouraged in new code. It is recommended that anyone seeking this
functionality use the split method of String or the java.util.regex package instead."
http://docs.oracle.com/javase/7/docs/api/java/util/StringTokenizer.html

"Variable names should not start with underscore _ or dollar sign $ characters,
even though both are allowed."
http://www.oracle.com/technetwork/java/javase/documentation/codeconventions-135099.html#367
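
Apart from the legacy status, there is a behavioural difference that matters for data like the sample in the original question: StringTokenizer treats a run of delimiters as one separator, so empty fields vanish, whereas split() keeps them. A small sketch, with hypothetical class and method names and a shortened sample line:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

class EmptyFields {
    // StringTokenizer never returns empty tokens, so "||" yields no field.
    static List<String> withTokenizer(String line) {
        List<String> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line, "|");
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    // split() with a negative limit keeps every empty field.
    static List<String> withSplit(String line) {
        return Arrays.asList(line.split("\\|", -1));
    }

    public static void main(String[] args) {
        String line = "106083|1791||7";
        System.out.println(withTokenizer(line).size()); // 3: the empty field is lost
        System.out.println(withSplit(line).size());     // 4: the empty field is kept
    }
}
```

If column positions carry meaning, the tokenizer version silently shifts every field after an empty one.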

Kevin McMurtrie · Aug 9, 2013

Lew said:
"StringTokenizer is a legacy class that is retained for compatibility reasons although
its use is discouraged in new code. It is recommended that anyone seeking this
functionality use the split method of String or the java.util.regex package instead."
http://docs.oracle.com/javase/7/docs/api/java/util/StringTokenizer.html

Last time I checked, the performance of String.split() sucked. The
JavaDoc up to 1.6 even says it sucks. Hopefully they've fixed that
before calling a simple and effective tool like StringTokenizer "legacy."

Now if there was only a way to revert String.substring()'s performance
in Java 1.7, I might try Oracle's version of Java.

Arved Sandstrom · Aug 9, 2013

I had to check that because I didn't remember ever seeing the
Javadoc for String.split say that the performance sucked. Lo and
behold, I don't see that language.

What's the basis for assessing the suckage of Java String.split? Doing
millions of splits? And if the situation calls for industrial text
processing, why use Java anyway? It's not the first language I'd think
of for that purpose, it's cumbersome. And you can't ramp up your RAM?

I don't mind your comments about Java implementation performance; they
are useful to follow up on. I just wonder what kind of Java programs you
write where you find this kind of detail to be that important. Can't say
I've ever in 15+ years seen a Java SE or EE project be significantly
impacted by these considerations.

AHS

Eric Sosman · Aug 9, 2013

Couldn't you have waited for its sixth birthday?

Kevin McMurtrie · Aug 10, 2013

String.split() delegates to the Pattern class. The Pattern class
mentions that the form used in String is not efficient because it must
compile the regular expression on each use.

Let me test...

Java 1.6.0_51 on an old Mac gives me these relative times:
splitNanos= 5341045000
tokenizerNanos= 1934390000

I hacked in a copy of 1.7.0_40-ea and got:
splitNanos= 3299753000
tokenizerNanos= 1675745000

It's not HUGE, but I don't think you should deprecate a class that's 2
times faster than the replacement. String.split() is great for utility
use but the core code should use pre-compiled patterns or
StringTokenizer.
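
For reference, the "pre-compiled patterns" option could look roughly like this. It is only a sketch: the class and method names are hypothetical, and the separator set (space, tab, newline, semicolon) is an assumption chosen for illustration. The pattern is compiled once and reused on every call.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class PrecompiledSplit {
    // Compiled once; Pattern instances are immutable and safe to share between threads.
    private static final Pattern SEPARATORS = Pattern.compile("[ \t\n;]+");

    // Equivalent to input.split("[ \t\n;]+"), but without recompiling the regex.
    static String[] tokens(String input) {
        return SEPARATORS.split(input);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokens("alpha;beta\tgamma;;delta")));
    }
}
```

No timing claim is made here; any gain depends on the JDK version and on how often the call is made.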

Last time I checked, Oracle was still targeting big business. Asking to
double the datacenter could get a whole Engineering team fired.

import java.util.Random;
import java.util.StringTokenizer;

// Micro-benchmark comparing String.split() with StringTokenizer.
public class Str
{
    final char testChars[]=
        "\t\n;0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
        .toCharArray();
    final Random rnd= new Random();

    public static void main(String[] args)
    {
        final Str str= new Str();

        long splitNanos= 0;
        long tokenizerNanos= 0;

        for (int i= 0; i < 100; ++i)
        {
            final String line= str.randomAlphaNumerics();
            String formatBySplit= null, formatByTokenize= null;

            // Time 10000 runs of each approach on the same random line.
            final long startTime= System.nanoTime();
            for (int j= 0; j < 10000; ++j)
                formatBySplit= str.formatSplit(line);
            final long midTime= System.nanoTime();
            for (int j= 0; j < 10000; ++j)
                formatByTokenize= str.formatTokenized(line);
            final long endTime= System.nanoTime();

            splitNanos+= midTime - startTime;
            tokenizerNanos+= endTime - midTime;

            // Both approaches must produce the same output.
            if (!formatBySplit.equals(formatByTokenize))
                throw new RuntimeException("formatBySplit=" + formatBySplit +
                    " formatByTokenize=" + formatByTokenize);
        }

        System.out.println ("splitNanos= " + splitNanos);
        System.out.println ("tokenizerNanos= " + tokenizerNanos);
    }

    // Tokenize with String.split() and join the non-empty tokens with newlines.
    private String formatSplit (String input)
    {
        final String toks[]= input.split("[ \t\n;]+");
        final StringBuilder buf= new StringBuilder (input.length());

        for (String tok : toks)
        {
            if (tok.length() > 0)
            {
                if (buf.length() > 0)
                    buf.append('\n');
                buf.append(tok);
            }
        }
        return buf.toString();
    }

    // Tokenize with StringTokenizer and join the tokens with newlines.
    private String formatTokenized (String input)
    {
        final StringTokenizer tok= new StringTokenizer(input, " \t\n;", false);
        final StringBuilder buf= new StringBuilder (input.length());

        if (tok.hasMoreElements())
            buf.append(tok.nextElement());

        while (tok.hasMoreElements())
            buf.append('\n').append(tok.nextElement());

        return buf.toString();
    }

    // Build a random string of up to 199 characters drawn from testChars.
    private String randomAlphaNumerics ()
    {
        final char buf[]= new char[rnd.nextInt(200)];
        for (int i= 0; i < buf.length; ++i)
            buf[i]= testChars[rnd.nextInt(testChars.length)];
        return new String (buf);
    }
}

Michael Jung · Aug 10, 2013

I can confirm that this does matter in business code. We got a 10%-20%
performance boost by avoiding split for certain use cases that used it a
lot, not just in micro-optimizing tests. The numbers from Kevin are
about what we had (although I personally wouldn't show that many decimal
places that suggest a higher degree of accuracy than is actually
reasonable).

Michael

Joerg Meier · Aug 10, 2013

Kevin McMurtrie said:
String.split() delegates to the Pattern class. The Pattern class
mentions that the form used in String is not efficient because it must
compile the regular expression on each use.

There is really no way around that with .split(), short of some convoluted
internal caching system where the last x patterns compiled by .split() are
stored for y time. If you call a method with a String as a parameter twice,
how are you going to avoid having to compile the String to a Pattern other
than through that?

The .split syntax is convenient, but slow. There is really no sensible way
to speed it up while keeping the convenient method signature. Of course,
simply using Pattern is not terribly hard at all.

With all that being said: StringTokenizer obviously can only handle very
simple splitting due to the lack of regex support, and thus is naturally
faster, but if your splitting is simple enough not to need regex, it might
be simple enough to use indexOf, which is almost an order of magnitude
faster than even StringTokenizer.
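
A rough sketch of the indexOf approach for a single-character delimiter. The class and method names are hypothetical, and no speed figure is implied; measure on your own data. Unlike StringTokenizer, this version keeps empty fields, including a trailing one.

```java
import java.util.ArrayList;
import java.util.List;

class IndexOfSplit {
    // Walk the string with indexOf and cut out the text between delimiters.
    static List<String> split(String line, char delimiter) {
        List<String> fields = new ArrayList<>();
        int start = 0;
        int end;
        while ((end = line.indexOf(delimiter, start)) >= 0) {
            fields.add(line.substring(start, end));
            start = end + 1; // continue just past the delimiter
        }
        fields.add(line.substring(start)); // last field, possibly empty
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(split("106083|1791||7|", '|'));
    }
}
```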

Liebe Gruesse,
Joerg

Arved Sandstrom · Aug 11, 2013

I don't doubt that use of String.split is not always the optimal
approach. From the sounds of it it's not often the optimal approach. But
I'll bet that the large majority of the time using it is a "good enough"
approach, because very often that extra 10-20 percent speed bump isn't
actually needed.

Funny thing is, I can think of one ESB application of mine right now
that needs to process a high volume of messages, and each message is
composed of 10-20 lines each one of which may have multiple fields
delimited by slashes...and I've been using String.split without
problems. Having said that, this is a 24/7 "don't fail or shit rains
down from the heavens" application, so I might try swapping out
.split(), since it's not complicated logic and I know exactly what the
delimiter is.

But I wouldn't eschew String.split as a rule. I doubt most apps care.

AHS

Michael Jung · Aug 11, 2013

Arved Sandstrom said:
I don't doubt that use of String.split is not always the optimal
approach. From the sounds of it it's not often the optimal
approach. But I'll bet that the large majority of the time using it is
a "good enough" approach, because very often that extra 10-20 percent
speed bump isn't actually needed. [...]
But I wouldn't eschew String.split as a rule. I doubt most apps care.

I use split myself often enough. You can read my response as a case for
optimization surprises. The microbenchmark shows around a 200% boost
(3:10); the overall gain was 15%, even though the code in question made
up far less than 1% of the (user-level) code being run through (big "fat"
EE application).

Michael

Joerg Meier · Aug 11, 2013

Well, odds are, not many applications spend 25% of their CPU time doing
.split(), so I would say that your application speeding up that much is an
extreme edge case. What on Earth do you do that requires millions of
.split() calls per second, and why did you think that would even remotely
be a representative example?

Liebe Gruesse,
Joerg

Michael Jung · Aug 11, 2013

Odds are that the rest of the application was already highly
optimized. (I already said this was for certain use cases.) Whether this
is representative of something, I don't know, everybody has to judge for
himself what to do with split. But string manipulation is omnipresent in
many applications these days. This was just some light.

Michael
