How to read a flat file quickly

tnorgd

Dear Group,

I have to read quite a big (around 1 GB) flat text file. It looks like a
spreadsheet whose columns (strings, doubles and integers) are separated by
whitespace. Every line has the same layout, and from each line I create
a single object with final fields corresponding to the data in the file.

I started with Scanner and its nextDouble() and nextInt() methods. Then I
found that it is faster to do it this way:
String[] entries = buffer.readLine().split("\\s+");
int data1 = Integer.parseInt(entries[0]);
// and so on, each data entry is parsed as above

Does anyone have experience with which way of parsing such a file might be
the fastest?

Best regards,
Dominik
 
Arne Vajhøj

I have to read quite a big (around 1 GB) flat text file. It looks like a
spreadsheet whose columns (strings, doubles and integers) are separated by
whitespace. Every line has the same layout, and from each line I create
a single object with final fields corresponding to the data in the file.

I started with Scanner and its nextDouble() and nextInt() methods. Then I
found that it is faster to do it this way:
String[] entries = buffer.readLine().split("\\s+");
int data1 = Integer.parseInt(entries[0]);
// and so on, each data entry is parsed as above

Does anyone have experience with which way of parsing such a file might be
the fastest?

BufferedReader and a custom parse (indexOf, substring etc.) instead
of regex would be my suggestion if you are willing to spend time
getting the code right to save some parsing time.
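
A minimal sketch of such a hand-written split, assuming fields are separated by runs of spaces or tabs. It scans with charAt and cuts fields with substring rather than indexOf, and the class name FieldSplitter, the method splitOnWhitespace and the three-column layout in main are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;

final class FieldSplitter {

    /** Splits a line on runs of spaces and tabs without using regex. */
    static List<String> splitOnWhitespace(String line) {
        List<String> fields = new ArrayList<>();
        int len = line.length();
        int i = 0;
        while (i < len) {
            // Skip any whitespace in front of the next field.
            while (i < len && isBlank(line.charAt(i))) {
                i++;
            }
            if (i == len) {
                break; // only trailing whitespace was left
            }
            int start = i;
            // Advance to the first whitespace character after the field.
            while (i < len && !isBlank(line.charAt(i))) {
                i++;
            }
            fields.add(line.substring(start, i));
        }
        return fields;
    }

    private static boolean isBlank(char c) {
        return c == ' ' || c == '\t';
    }

    public static void main(String[] args) {
        // Hypothetical layout: one string, one double, one integer per line.
        List<String> f = splitOnWhitespace("Johny17   3.25\t42");
        String name = f.get(0);
        double value = Double.parseDouble(f.get(1));
        int count = Integer.parseInt(f.get(2));
        System.out.println(name + " " + value + " " + count);
    }
}
```

Whether this beats split() on the real file is something only a measurement can tell.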

Arne
 
tnorgd

I am asking this question because I am porting C code to Java. The
original uses scanf(). My Java version uses String.split() and
then Integer.parseInt, Double.parseDouble, etc. The Java version is
around 5 times slower. I wonder whether that is because "Java is slow"
(a frequently used slogan), or whether I can make it better.

Dominik
 
Roedy Green

I have to read quite a big (around 1 GB) flat text file. It looks like a
spreadsheet whose columns (strings, doubles and integers) are separated by
whitespace. Every line has the same layout, and from each line I create
a single object with final fields corresponding to the data in the file.

The key is to read it buffered, with a sufficiently big buffer to get
the physical I/O out of the way.
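
As a rough sketch, the buffer size can be passed as the second argument to the BufferedReader constructor. The 1 MiB size, the UTF-8 charset and the names BigBufferRead and countLines are assumptions for illustration, not measured optima:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class BigBufferRead {

    // Assumption: 1 MiB of chars; the best size depends on the machine.
    static final int BUFFER_SIZE = 1 << 20;

    /** Reads the file line by line through a large buffer and counts the lines. */
    static long countLines(Path file) throws IOException {
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(Files.newInputStream(file), StandardCharsets.UTF_8),
                BUFFER_SIZE)) {
            long lines = 0;
            while (in.readLine() != null) {
                lines++; // a real program would parse the line here
            }
            return lines;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java BigBufferRead <file>");
            return;
        }
        System.out.println(countLines(Paths.get(args[0])) + " lines");
    }
}
```

Timing this loop on its own gives a baseline for how much of the total is pure reading.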

To split the line, you could use a regex split. You might get a tiny
bit more speed by analysing each line yourself, chugging along char by
char with charAt.

You could also use CSVReader, configured with space as the separator.
Fields could then not be separated by more than one space. See
http://mindprod.com/products1.html#CSV

--
Roedy Green Canadian Mind Products
http://mindprod.com

"It wasn’t the Exxon Valdez captain’s driving that caused the Alaskan oil spill. It was yours."
~ Greenpeace advertisement New York Times 1990-02-25
 
Eric Sosman

I am asking this question because I am porting C code to Java. The
original uses scanf(). My Java version uses String.split() and
then Integer.parseInt, Double.parseDouble, etc. The Java version is
around 5 times slower. I wonder whether that is because "Java is slow"
(a frequently used slogan), or whether I can make it better.

The code snippet you showed earlier suggests that you're
compiling a brand-new regex for each line you split, using
it once, and throwing it away. No wonder "Java is slow!"

There's also the question of whether dragging out all the
regex machinery mightn't be overkill for such a simple format;
you could probably gain some speed by just looking for white
space yourself instead of using aiming cannons at canaries.
But as a first step, try re-using a single Pattern instead of
compiling a new one for every line.
 
Daniel Pitts

I am asking this question because I am porting C code to Java. The
original uses scanf(). My Java version uses String.split() and
then Integer.parseInt, Double.parseDouble, etc. The Java version is
around 5 times slower. I wonder whether that is because "Java is slow"
(a frequently used slogan), or whether I can make it better.

Dominik

You can make it faster by not using split.
Look up StringTokenizer and StreamTokenizer (very different classes with
different uses).

They may be able to provide a faster implementation for you.

Otherwise, you can probably code one by hand that is faster, using
BufferedReader to read a line at a time and indexOf/substring to split
your String.
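
For the StringTokenizer route, a minimal sketch might look like the following. The class ParsedRow and its three-column layout (string, double, integer) are hypothetical stand-ins for the real record:

```java
import java.util.StringTokenizer;

final class ParsedRow {

    final String name;
    final double value;
    final int count;

    ParsedRow(String name, double value, int count) {
        this.name = name;
        this.value = value;
        this.count = count;
    }

    /** Parses one line; assumes exactly three whitespace-separated columns. */
    static ParsedRow parse(String line) {
        // The one-argument constructor splits on space, tab, newline,
        // carriage return and form feed, and never returns empty tokens.
        StringTokenizer st = new StringTokenizer(line);
        String name = st.nextToken();
        double value = Double.parseDouble(st.nextToken());
        int count = Integer.parseInt(st.nextToken());
        return new ParsedRow(name, value, count);
    }

    public static void main(String[] args) {
        ParsedRow r = parse("Johny17  3.25\t42");
        System.out.println(r.name + " " + r.value + " " + r.count);
    }
}
```

StringTokenizer works on a String that has already been read, so it would be combined with BufferedReader.readLine(), whereas StreamTokenizer reads from the Reader itself.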
 
tnorgd

OK, so I did some tests. Results are the following (for a part of my
data file):

1-A) Just to read lines:
while ((line = in.readLine()) != null);
takes 1.9 sec
1-B) readLine() + pattern.split(line) takes 7.0 sec

2) Just tokens (which does roughly what 1-A and 1-B do together):
while ((st.nextToken()) != StreamTokenizer.TT_EOF);
takes 6.6 sec

When I add parsing, e.g. Integer.parseInt() and Double.parseDouble(), in
both cases I end up at around 10 sec. Yes, apparently I have to do the
parsing myself in the StreamTokenizer case as well. My input contains
strings with digits (like "Johny17"), which are parsed into two
distinct tokens. So I had to switch off number parsing within
StreamTokenizer and do it on my own.
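
One way to switch off number parsing, sketched here under the assumption that every character above the space character belongs to a token, is to reset the syntax table and declare only word and whitespace characters. The names WordTokens and tokens are made up for illustration:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

final class WordTokens {

    /** Returns every whitespace-separated token as a plain string. */
    static List<String> tokens(Reader reader) throws IOException {
        StreamTokenizer st = new StreamTokenizer(reader);
        st.resetSyntax();           // drop the defaults, including number parsing
        st.wordChars(33, 255);      // digits, letters and punctuation all form words
        st.whitespaceChars(0, ' '); // control characters and space separate words
        List<String> out = new ArrayList<>();
        while (st.nextToken() != StreamTokenizer.TT_EOF) {
            if (st.ttype == StreamTokenizer.TT_WORD) {
                out.add(st.sval);
            }
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        List<String> t = tokens(new StringReader("Johny17 3.25 -42\n"));
        // "Johny17" stays one token; the numbers still need parseDouble/parseInt.
        System.out.println(t);
    }
}
```

With this setup nval is never filled in, so Integer.parseInt() and Double.parseDouble() are applied to sval afterwards.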

Some of you have suggested that I could gain some speed by:
A) increasing the buffer size: yes, around 10% effect
B) changing from split("\\s+") to a compiled pattern: this has almost
no effect.

Best regards,
Dominik
 
John B. Matthews

OK, so I did some tests. Results are the following (for a part of my
data file):

1-A) Just to read lines:
while ((line = in.readLine()) != null);
takes 1.9 sec
1-B) readLine() + pattern.split(line) takes 7.0 sec

2) Just tokens (which does roughly what 1-A and 1-B do together):
while ((st.nextToken()) != StreamTokenizer.TT_EOF);
takes 6.6 sec

When I add parsing, e.g. Integer.parseInt() and Double.parseDouble(), in
both cases I end up at around 10 sec. Yes, apparently I have to do the
parsing myself in the StreamTokenizer case as well. My input contains
strings with digits (like "Johny17"), which are parsed into two
distinct tokens. So I had to switch off number parsing within
StreamTokenizer and do it on my own.

Some of you have suggested that I could gain some speed by:
A) increasing the buffer size: yes, around 10% effect
B) changing from split("\\s+") to a compiled pattern: this has almost
no effect.

Indeed, compiling such a short pattern has minimal benefit, but Eric
Sosman's parser suggestion may be worth the effort. I liked Daniel
Pitts' StreamTokenizer idea well enough to try it. It might be better
for creating a Double array:

<console>
Warmup: 30

Size: 5
RegEx: 19
Compiled: 3
Parse: 5
Token: 24

Size: 50
RegEx: 28
Compiled: 29
Parse: 14
Token: 61

Size: 500
RegEx: 280
Compiled: 276
Parse: 139
Token: 591

Size: 5000
RegEx: 3042
Compiled: 3007
Parse: 2038
Token: 8000
</console>

<code>
package cli;

import java.io.IOException;
import java.io.Reader;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.regex.Pattern;

/** @author JBM*/
public class RCPTest {

    private static final Random random = new Random();

    public static void main(String[] args) {
        (new Warmup()).test(testString(1));
        System.out.println();
        for (int i = 1; i < 5; i++) {
            int padding = (int) Math.pow(10, i) / 2;
            System.out.println("Size: " + padding);
            String s = testString(padding);
            (new RegEx()).test(s);
            (new Compiled()).test(s);
            (new Parse()).test(s);
            (new Token()).test(s);
            System.out.println();
        }
    }

    private static String testString(int count) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) {
            sb.append(random.nextInt());
            sb.append(" ");
        }
        return sb.toString();
    }
}

abstract class Test {

    public static final int COUNT = 1000;

    public void test(String in) {
        long start = System.currentTimeMillis();
        for (int i = 0; i < COUNT; i++) {
            split(in);
        }
        System.out.println(name()
            + (System.currentTimeMillis() - start));
    }

    public abstract String[] split(String in);

    public abstract String name();
}

class Warmup extends Test {

    public String[] split(String in) {
        return (new RegEx()).split(in);
    }

    public String name() {
        return "Warmup: ";
    }
}

class RegEx extends Test {

    public String[] split(String in) {
        return in.split("\\s+");
    }

    public String name() {
        return "RegEx: ";
    }
}

class Compiled extends Test {

    private static final Pattern p = Pattern.compile("\\s+");

    public String[] split(String in) {
        return p.split(in);
    }

    public String name() {
        return "Compiled: ";
    }
}

class Parse extends Test {

    public String[] split(String in) {
        List<String> list = new ArrayList<String>();
        StringBuilder sb = new StringBuilder();
        int len = in.length();
        for (int i = 0; i < len; i++) {
            char c = in.charAt(i);
            if (c == ' ') {
                // End of a token: emit it and reset the buffer.
                list.add(sb.toString());
                sb.setLength(0);
            } else {
                sb.append(c);
            }
        }
        // Emit a final token that is not followed by a space, so its
        // last character is no longer dropped.
        if (sb.length() > 0) {
            list.add(sb.toString());
        }
        return list.toArray(new String[0]);
    }

    public String name() {
        return "Parse: ";
    }
}

class Token extends Test {

    public String[] split(String in) {
        Reader reader = new StringReader(in);
        StreamTokenizer tokens = new StreamTokenizer(reader);
        List<String> list = new ArrayList<String>();
        double d;
        try {
            int token = tokens.nextToken();
            while (token != StreamTokenizer.TT_EOF) {
                d = tokens.nval;
                list.add(Double.toString(d));
                token = tokens.nextToken();
            }
            return list.toArray(new String[0]);
        } catch (IOException ex) {
            ex.printStackTrace(System.err);
            return new String[0];
        }
    }

    public String name() {
        return "Token: ";
    }
}
</code>
 
John B. Matthews

Eric Sosman said:
... instead of using aiming cannons at canaries.

I often use aiming cannons to get the canary's range then open up with
the 16-inch gun. I know it's wasteful.
 
Eric Sosman

John said:
I often use aiming cannons to get the canary's range then open up with
the 16-inch gun. I know it's wasteful.

Too much in-flight editing, too little editorial
review ...

Myself, I don't even bother aiming at the damn
canaries. They go "cheep, cheep, cheep" and I just
set off a nice, non-directional hundred megaton bomb.
The only drawback is that I keep dropping dead from
noxious gases down in the mine. C'est la vie -- er,
la mort.
 
Andreas Leitgeb

Eric Sosman said:
[...]
Some of you have suggested that I could gain some speed by:
A) increasing the buffer size: yes, around 10% effect
B) changing from split("\\s+") to a compiled pattern: this has almost
no effect.
I admit surprise at (B) -- which just goes to show (again)
that opinion is inferior to measurement.

Two other possible explanations for surprise:

1) There may still be a bug in the test code, so that the
Pattern (accidentally) gets recompiled all the time.

2) The regex part just isn't relevant to the total runtime.
 
