Arved Sandstrom said:
I had to check that because I didn't remember ever seeing the
Javadoc for String.split say that the performance sucked. Lo and
behold, I don't see that language.
What's the basis for assessing the suckage of Java String.split? Doing
millions of splits? And if the situation calls for industrial text
processing, why use Java anyway? It's not the first language I'd think
of for that purpose, it's cumbersome. And you can't ramp up your RAM?
I don't mind your comments about Java implementation performance; they
are useful to follow up on. I just wonder what kind of Java programs you
write where you find this kind of detail to be that important. Can't say
I've ever in 15+ years seen a Java SE or EE project be significantly
impacted by these considerations.
AHS
String.split() delegates to the Pattern class. The Pattern class
mentions that the form used in String is not efficient because it must
compile the regular expression on each use.
Let me test...
Java 1.6.0_51 on an old Mac gives me these times (total nanoseconds
for 1,000,000 calls of each method):
splitNanos= 5341045000
tokenizerNanos= 1934390000
I hacked in a copy of 1.7.0_40-ea and got:
splitNanos= 3299753000
tokenizerNanos= 1675745000
It's not HUGE, but I don't think you should deprecate a class that's
about 2 times faster than its replacement. String.split() is great for
utility use, but core code should use pre-compiled patterns or
StringTokenizer.
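For reference, a pre-compiled pattern version might look roughly like
the sketch below. It is not part of the timings above; the names
PrecompiledSplit, DELIMS and format are made up for illustration. It
uses the same delimiter set as the test program and compiles the regex
once instead of on every call.

```java
import java.util.regex.Pattern;

public class PrecompiledSplit
{
    // Hypothetical name. Compiled once and reused by every call to format().
    private static final Pattern DELIMS= Pattern.compile("[ \t\n;]+");

    // Joins the non-empty tokens of input with newlines.
    static String format (String input)
    {
        final StringBuilder buf= new StringBuilder (input.length());
        for (String tok : DELIMS.split(input))
        {
            // A leading delimiter yields an empty first token; skip it.
            if (tok.length() > 0)
            {
                if (buf.length() > 0)
                    buf.append('\n');
                buf.append(tok);
            }
        }
        return buf.toString();
    }

    public static void main(String[] args)
    {
        System.out.println (format("a;b\tc  d"));
    }
}
```

Whether this closes the gap with StringTokenizer would have to be
measured; the sketch only removes the per-call compile step.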
Last time I checked, Oracle was still targeting big business. Asking to
double the datacenter could get a whole Engineering team fired.
import java.util.Random;
import java.util.StringTokenizer;

public class Str
{
    // Delimiters (tab, newline, semicolon) mixed in with alphanumerics
    final char testChars[]=
        "\t\n;0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
        .toCharArray();
    final Random rnd= new Random();

    public static void main(String[] args)
    {
        final Str str= new Str();
        long splitNanos= 0;
        long tokenizerNanos= 0;
        // 100 random lines, each formatted 10000 times by both methods
        for (int i= 0; i < 100; ++i)
        {
            final String line= str.randomAlphaNumerics();
            String formatBySplit= null, formatByTokenize= null;
            final long startTime= System.nanoTime();
            for (int j= 0; j < 10000; ++j)
                formatBySplit= str.formatSplit(line);
            final long midTime= System.nanoTime();
            for (int j= 0; j < 10000; ++j)
                formatByTokenize= str.formatTokenized(line);
            final long endTime= System.nanoTime();
            splitNanos+= midTime - startTime;
            tokenizerNanos+= endTime - midTime;
            // Sanity check: both methods must produce the same output
            if (!formatBySplit.equals(formatByTokenize))
                throw new RuntimeException("formatBySplit=" + formatBySplit +
                    " formatByTokenize=" +formatByTokenize);
        }
        System.out.println ("splitNanos= " + splitNanos);
        System.out.println ("tokenizerNanos= " + tokenizerNanos);
    }

    // Tokenizes with String.split(), which compiles the regex on each call
    private String formatSplit (String input)
    {
        final String toks[]= input.split("[ \t\n;]+");
        final StringBuilder buf= new StringBuilder (input.length());
        for (String tok : toks)
        {
            // A leading delimiter produces an empty first token; skip it
            if (tok.length() > 0)
            {
                if (buf.length() > 0)
                    buf.append('\n');
                buf.append(tok);
            }
        }
        return buf.toString();
    }

    // Tokenizes with StringTokenizer using the same delimiter characters
    private String formatTokenized (String input)
    {
        final StringTokenizer tok= new StringTokenizer(input, " \t\n;", false);
        final StringBuilder buf= new StringBuilder (input.length());
        if (tok.hasMoreElements())
            buf.append(tok.nextElement());
        while (tok.hasMoreElements())
            buf.append('\n').append(tok.nextElement());
        return buf.toString();
    }

    // Builds a random string of 0 to 199 characters drawn from testChars
    private String randomAlphaNumerics ()
    {
        final char buf[]= new char[rnd.nextInt(200)];
        for (int i= 0; i < buf.length; ++i)
            buf[i]= testChars[rnd.nextInt(testChars.length)];
        return new String (buf);
    }
}