java String split() does not work for delimiter "|" ?

chunji08

Hi all,

I have data like this in a flat text file:
"
106083|1791||7|73755|48|96|3||01/07/2005 13:04:48.979215 PST|||||t|f||
t|f|t|"
"

And Java code like this to read each line and split it on "|":

"
while ((( rd = in.readLine())!= null)) {
String delimiter = new String("|");
String[] t1 = rd.split(delimiter);
String[] t2 = rd.split("|");
}
"

Either way, the split does not work! It splits the string at every
character. Does anyone know why?

Here is my JDK information on the Linux box.
"
java version "1.6.0"
Java(TM) SE Runtime Environment (build 1.6.0-b105)
Java HotSpot(TM) Server VM (build 1.6.0-b105, mixed mode)
"


Thanks a lot for any tips.


Chun
 
Joshua Cranmer

`split' takes a regular expression, and '|' happens to be a special operator
in regex. Instead of "|", you want "\\|".

chunji08 said:
Either way, the split does not work! It splits the string at every
character. Does anyone know why?

Your regex specifies either the empty string or the empty string. Since
there is an empty string between each pair of characters, the string is
split between each character. It's what you told it to do.

For more information:
<http://java.sun.com/javase/6/docs/api/java/lang/String.html> and
<http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html>
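
As a rough illustration of the point above, here is a minimal sketch. The class name `PipeSplit`, its helper methods, and the shortened sample line are made up for this example. It also shows a detail that matters for data with empty fields at the end of a line: plain `split(regex)` drops trailing empty strings, while a negative limit keeps them.

```java
import java.util.Arrays;

public class PipeSplit {
    // Splits on a literal vertical bar; "\\|" escapes the regex metacharacter.
    static String[] splitOnPipe(String line) {
        return line.split("\\|");
    }

    // Same split, but a negative limit keeps trailing empty fields.
    static String[] splitOnPipeKeepEmpty(String line) {
        return line.split("\\|", -1);
    }

    public static void main(String[] args) {
        // Shortened, hypothetical version of the data line from the question.
        String line = "106083|1791||7|t|f||";

        // Unescaped "|": the regex matches the empty string, so the line
        // is split between every character.
        System.out.println(Arrays.toString(line.split("|")));

        // Escaped: six fields, the two trailing empty fields are dropped.
        System.out.println(Arrays.toString(splitOnPipe(line)));

        // Escaped with limit -1: all eight fields, including the empty ones.
        System.out.println(Arrays.toString(splitOnPipeKeepEmpty(line)));
    }
}
```

If the number of columns per line has to stay fixed, the `-1` variant is probably the one you want.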
 
chunji08

Please ignore that; "\\|" works for me. I guess I use Perl too much
:-)


-cji
 
RedGrittyBrick

chunji08 said:
Either way, the split does not work! It splits the string at every
character. Does anyone know why?

Because the argument to split() is a regex, not a plain string.

In regexes, certain characters (metacharacters) have special meanings.
The vertical bar is such a metacharacter, representing alternation.

public class MetaChar {
    public static void main(String[] args) {
        String s = "oneXtwoYthreeXfour";
        // "X|Y" is an alternation: split on either X or Y.
        String[] a = s.split("X|Y");
        for (String w : a)
            System.out.println(w);
    }
}

You have to "escape" the vertical bar if you want to treat it as an
ordinary character and not as a metacharacter.

http://www.regular-expressions.info/alternation.html
http://www.regular-expressions.info/characters.html
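
If you would rather not remember which characters need escaping, one option is to let `Pattern.quote` do it. This is only a sketch: the class name `LiteralSplit` and the helper `splitLiteral` are invented for the example, and a one-character class such as `"[|]"` is shown as another common way to get a literal bar.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class LiteralSplit {
    // Treats the delimiter as plain text, whatever characters it contains.
    static String[] splitLiteral(String input, String delimiter) {
        return input.split(Pattern.quote(delimiter));
    }

    public static void main(String[] args) {
        // Quoted delimiter: the bar is no longer an alternation operator.
        System.out.println(Arrays.toString(splitLiteral("one|two|three", "|")));

        // Works the same way for other metacharacters, for example the dot.
        System.out.println(Arrays.toString(splitLiteral("1.2.3", ".")));

        // Inside a character class the bar is also an ordinary character.
        System.out.println(Arrays.toString("one|two|three".split("[|]")));
    }
}
```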
 
smarty.ad4

You can also do it like this:
StringTokenizer tokenizer = new StringTokenizer(content, "||");
while (tokenizer.hasMoreTokens()) {
    _log.info("tokenizer.nextToken() : " + tokenizer.nextToken());
}
 
smarty.ad4

You can also do it like this:
StringTokenizer tokenizer = new StringTokenizer(content, "|");
while (tokenizer.hasMoreTokens()) {
    _log.info("tokenizer.nextToken() : " + tokenizer.nextToken());
}
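
One thing to watch with this approach: StringTokenizer treats its second argument as a set of delimiter characters (so "||" behaves like "|"), and it never returns empty tokens. For data with empty columns, like the line in the original question, that changes the field count. A small sketch of the difference follows; the class name `EmptyFields`, both helper methods, and the sample string are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class EmptyFields {
    // Collects tokens with StringTokenizer; consecutive delimiters are
    // collapsed, so empty fields disappear.
    static List<String> tokenize(String line) {
        List<String> out = new ArrayList<String>();
        StringTokenizer st = new StringTokenizer(line, "|");
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    // Splits on a literal bar and keeps every field, empty or not.
    static String[] split(String line) {
        return line.split("\\|", -1);
    }

    public static void main(String[] args) {
        String line = "1||7|t";
        System.out.println(tokenize(line).size()); // 3 tokens: 1, 7, t
        System.out.println(split(line).length);    // 4 fields: 1, (empty), 7, t
    }
}
```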
 
Lew

smarty.ad4 said:
You can also do it like this:
StringTokenizer tokenizer = new StringTokenizer(content, "|");
while (tokenizer.hasMoreTokens()) {
    _log.info("tokenizer.nextToken() : " + tokenizer.nextToken());
}

"StringTokenizer is a legacy class that is retained for compatibility reasons although
its use is discouraged in new code. It is recommended that anyone seeking this
functionality use the split method of String or the java.util.regex package instead."
http://docs.oracle.com/javase/7/docs/api/java/util/StringTokenizer.html

"Variable names should not start with underscore _ or dollar sign $ characters,
even though both are allowed."
http://www.oracle.com/technetwork/java/javase/documentation/codeconventions-135099.html#367
 
Kevin McMurtrie

Lew said:
"StringTokenizer is a legacy class that is retained for compatibility reasons although
its use is discouraged in new code. It is recommended that anyone seeking this
functionality use the split method of String or the java.util.regex package instead."
http://docs.oracle.com/javase/7/docs/api/java/util/StringTokenizer.html

Last time I checked, the performance of String.split() sucked. The
JavaDoc up to 1.6 even says it sucks. Hopefully they've fixed that
before calling a simple and effective tool like StringTokenizer "legacy."

Now if there was only a way to revert String.substring()'s performance
in Java 1.7, I might try Oracle's version of Java.
 
Arved Sandstrom

I had to check that, because I didn't remember ever seeing the
Javadoc for String.split say that the performance sucked. Lo and
behold, I don't see that language.

What's the basis for assessing the suckage of Java String.split? Doing
millions of splits? And if the situation calls for industrial text
processing, why use Java anyway? It's not the first language I'd think
of for that purpose, it's cumbersome. And you can't ramp up your RAM?

I don't mind your comments about Java implementation performance; they
are useful to follow up on. I just wonder what kind of Java programs you
write where you find this kind of detail to be that important. Can't say
I've ever in 15+ years seen a Java SE or EE project be significantly
impacted by these considerations.

AHS
 
Kevin McMurtrie

String.split() delegates to the Pattern class. The Pattern class
mentions that the form used in String is not efficient because it must
compile the regular expression on each use.

Let me test...

Java 1.6.0_51 on an old Mac gives me these relative times:
splitNanos= 5341045000
tokenizerNanos= 1934390000

I hacked in a copy of 1.7.0_40-ea and got:
splitNanos= 3299753000
tokenizerNanos= 1675745000


It's not HUGE, but I don't think you should deprecate a class that's 2
times faster than the replacement. String.split() is great for utility
use but the core code should use pre-compiled patterns or
StringTokenizer.

Last time I checked, Oracle was still targeting big business. Asking to
double the datacenter could get a whole Engineering team fired.



import java.util.Random;
import java.util.StringTokenizer;

public class Str
{
    // Delimiters (tab, newline, semicolon) mixed with alphanumerics for random input.
    final char testChars[]=
        "\t\n;0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
        .toCharArray();
    final Random rnd= new Random();

    public static void main(String[] args)
    {
        final Str str= new Str();

        long splitNanos= 0;
        long tokenizerNanos= 0;

        for (int i= 0; i < 100; ++i)
        {
            final String line= str.randomAlphaNumerics();
            String formatBySplit= null, formatByTokenize= null;

            // Time 10000 runs of each approach on the same random line.
            final long startTime= System.nanoTime();
            for (int j= 0; j < 10000; ++j)
                formatBySplit= str.formatSplit(line);
            final long midTime= System.nanoTime();
            for (int j= 0; j < 10000; ++j)
                formatByTokenize= str.formatTokenized(line);
            final long endTime= System.nanoTime();

            splitNanos+= midTime - startTime;
            tokenizerNanos+= endTime - midTime;

            // Sanity check: both approaches must produce the same output.
            if (!formatBySplit.equals(formatByTokenize))
                throw new RuntimeException("formatBySplit=" + formatBySplit +
                    " formatByTokenize=" + formatByTokenize);
        }

        System.out.println ("splitNanos= " + splitNanos);
        System.out.println ("tokenizerNanos= " + tokenizerNanos);
    }

    // Splits with String.split() and joins the non-empty tokens with newlines.
    private String formatSplit (String input)
    {
        final String toks[]= input.split("[ \t\n;]+");
        final StringBuilder buf= new StringBuilder (input.length());

        for (String tok : toks)
        {
            if (tok.length() > 0)
            {
                if (buf.length() > 0)
                    buf.append('\n');
                buf.append(tok);
            }
        }
        return buf.toString();
    }

    // Same output as formatSplit, produced with StringTokenizer.
    private String formatTokenized (String input)
    {
        final StringTokenizer tok= new StringTokenizer(input, " \t\n;", false);
        final StringBuilder buf= new StringBuilder (input.length());

        if (tok.hasMoreElements())
            buf.append(tok.nextElement());

        while (tok.hasMoreElements())
            buf.append('\n').append(tok.nextElement());

        return buf.toString();
    }

    // Builds a random string of 0 to 199 characters drawn from testChars.
    private String randomAlphaNumerics ()
    {
        final char buf[]= new char[rnd.nextInt(200)];
        for (int i= 0; i < buf.length; ++i)
            buf[i]= testChars[rnd.nextInt(testChars.length)];
        return new String (buf);
    }
}
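
For the "pre-compiled patterns" option mentioned above, a minimal sketch might look like the following. The class name `PrecompiledSplit`, the constant `DELIMS`, and the method name are assumptions for illustration; the regex is the one used in the benchmark. No timings are claimed here, since none were measured for this variant.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class PrecompiledSplit {
    // Compiled once and reused, instead of being recompiled on every call
    // as happens when a regex string is passed to String.split().
    private static final Pattern DELIMS = Pattern.compile("[ \t\n;]+");

    static String[] splitPrecompiled(String input) {
        return DELIMS.split(input);
    }

    public static void main(String[] args) {
        String line = "abc;def\tghi\n\njkl";
        // Same result as line.split("[ \t\n;]+"), without the recompilation.
        System.out.println(Arrays.toString(splitPrecompiled(line)));
    }
}
```

Pattern instances are immutable and safe to share between threads, so a static final field is a reasonable place to keep one.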
 
Michael Jung

I can confirm that this does matter in business code. We got a 10%-20%
performance boost by avoiding split for certain use cases that used it a
lot, not just in micro-optimizing tests. The numbers from Kevin are
about what we had (although I personally wouldn't show that many decimal
places that suggest a higher degree of accuracy than is actually
reasonable).

Michael
 
Joerg Meier

Kevin McMurtrie said:
String.split() delegates to the Pattern class. The Pattern class
mentions that the form used in String is not efficient because it must
compile the regular expression on each use.

There is really no way around that with .split(), short of some convoluted
internal caching system where the last x patterns compiled by .split() are
stored for y time. If you call a method with a String as a parameter twice,
how are you going to avoid compiling the String to a Pattern each time, other
than through that?

The .split syntax is convenient, but slow. There is really no sensible way
to speed it up while keeping the convenient method signature. Of course,
simply using Pattern is not terribly hard at all.

With all that being said: StringTokenizer obviously can only handle very
simple splitting due to the lack of regex support, and thus is naturally
faster, but if your splitting is simple enough not to need regex, it might
be simple enough to use indexOf, which is almost an order of magnitude faster
than even StringTokenizer.
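
A rough sketch of what an indexOf-based splitter could look like, assuming a single-character delimiter. The class and method names are hypothetical, and the speed comparison above is not something this snippet measures.

```java
import java.util.ArrayList;
import java.util.List;

public class IndexOfSplit {
    // Splits on one delimiter character with indexOf, no regex involved.
    // Empty fields are kept, including a trailing one.
    static List<String> splitOnChar(String input, char delimiter) {
        List<String> fields = new ArrayList<String>();
        int start = 0;
        int end;
        while ((end = input.indexOf(delimiter, start)) >= 0) {
            fields.add(input.substring(start, end));
            start = end + 1; // continue just past the delimiter
        }
        fields.add(input.substring(start)); // last field, possibly empty
        return fields;
    }

    public static void main(String[] args) {
        // Five fields: a, b, (empty), c, (empty)
        System.out.println(splitOnChar("a|b||c|", '|'));
    }
}
```

Unlike StringTokenizer, this keeps empty fields, which matches what `split("\\|", -1)` returns.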

Liebe Gruesse,
Joerg
 
Arved Sandstrom

I don't doubt that use of String.split is not always the optimal
approach. From the sound of it, it's not often the optimal approach. But
I'll bet that the large majority of the time using it is a "good enough"
approach, because very often that extra 10-20 percent speed bump isn't
actually needed.

Funny thing is, I can think of one ESB application of mine right now
that needs to process a high volume of messages, and each message is
composed of 10-20 lines each one of which may have multiple fields
delimited by slashes...and I've been using String.split without
problems. Having said that, this is a 24/7 "don't fail or shit rains
down from the heavens" application, so I might try swapping out
.split(), since it's not complicated logic and I know exactly what the
delimiter is.

But I wouldn't eschew String.split as a rule. I doubt most apps care.

AHS
 
Michael Jung

I use split myself often enough. You can read my response as a case of
optimization surprises. The microbenchmark shows around a 200% boost
(3:10) and the overall gain was 15%, yet the code in question made up
far less than 1% of the (user-level) code that was run (a big "fat"
EE application).

Michael
 
Joerg Meier

Well, odds are, not many applications spend 25% of their CPU time doing
.split(), so I would say that your application speeding up that much is an
extreme edge case. What on Earth do you do that requires millions of
.split() calls per second, and why did you think that would even remotely
be a representative example?

Liebe Gruesse,
Joerg
 
Michael Jung

Odds are that the rest of the application was already highly
optimized. (I already said this was for certain use cases.) Whether this
is representative of something, I don't know, everybody has to judge for
himself what to do with split. But string manipulation is omnipresent in
many applications these days. This was just some light.

Michael
 
