How to read a flat file quickly

tnorgd · May 12, 2009

Dear Group,

I have to read quite a big (around 1GB) flat text file (it looks like a
spreadsheet with columns of strings, doubles and integers separated by
whitespace). Each line has exactly the same layout, and from each line I
create a single object with final fields corresponding to the data
from the file.

I started with Scanner and its nextDouble() and nextInt() methods. Then I
found that it is faster to do it this way:
String[] entries = buffer.readLine().split("\\s+");
int data1 = Integer.parseInt(entries[0]);
// and so on, each data entry is parsed as above

Does anyone have experience with which way of parsing such a file is
likely to be the fastest?

Best regards,
Dominik

Arne Vajhøj · May 12, 2009

BufferedReader and a custom parse (indexOf, substring etc.) instead
of regex would be my suggestion if you are willing to spend time
getting the code right to save some parsing time.

Arne
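
A minimal sketch of the kind of hand-written split suggested above, assuming fields are separated by runs of spaces or tabs. It scans with charAt and cuts fields with substring, with no regex involved. The class name IndexOfSplit and the sample line are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class IndexOfSplit {

    /** Splits a line on runs of spaces or tabs without using regex. */
    static List<String> fields(String line) {
        List<String> out = new ArrayList<String>();
        int len = line.length();
        int pos = 0;
        while (pos < len) {
            // Skip the whitespace that precedes the next field.
            while (pos < len && isBlank(line.charAt(pos))) {
                pos++;
            }
            if (pos == len) {
                break;
            }
            // Advance to the end of the field.
            int end = pos;
            while (end < len && !isBlank(line.charAt(end))) {
                end++;
            }
            out.add(line.substring(pos, end));
            pos = end;
        }
        return out;
    }

    private static boolean isBlank(char c) {
        return c == ' ' || c == '\t';
    }

    public static void main(String[] args) {
        List<String> f = fields("Johny17   42\t3.5");
        int count = Integer.parseInt(f.get(1));
        double value = Double.parseDouble(f.get(2));
        System.out.println(f.get(0) + " " + count + " " + value); // Johny17 42 3.5
    }
}
```

Whether this beats split() on a given file is something to measure rather than assume.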

tnorgd · May 13, 2009

I am asking because I am porting C code to Java. The original uses
scanf(). My Java version uses String.split() and then
Integer.parseInt, Double.parseDouble etc. The Java version is
around 5 times slower. I wonder whether that is "because Java is slow"
(a frequently used slogan), or whether I can make it better.

Dominik

Roedy Green · May 13, 2009

The key is to read it buffered with a sufficiently big buffer to get
the physical i/o out of the way.
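
As a rough illustration of the buffering advice, here is a sketch that passes an explicit buffer size to BufferedReader. The 1 Mi-char size is an assumption picked for the example, not a recommended value, and the class name BigBufferRead is hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

final class BigBufferRead {

    // Assumed buffer size (1 Mi chars); the best value depends on the machine.
    static final int BUFFER_CHARS = 1 << 20;

    /** Counts lines, reading through a BufferedReader with an explicit buffer size. */
    static long countLines(Reader source) {
        try (BufferedReader in = new BufferedReader(source, BUFFER_CHARS)) {
            long lines = 0;
            while (in.readLine() != null) {
                lines++;
            }
            return lines;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // For a real file, pass new FileReader(path) instead of a StringReader.
        System.out.println(countLines(new StringReader("a 1 2.0\nb 2 3.0\n"))); // 2
    }
}
```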

To split the line, you could use a regex split. You might get a tiny
bit more speed by analysing each line yourself, chugging along char by
char with charAt.

You could also use CSVReader, configuring space as the separator, but
then you could not have more than one space between fields. See
http://mindprod.com/products1.html#CSV

--
Roedy Green Canadian Mind Products
http://mindprod.com

"It wasn’t the Exxon Valdez captain’s driving that caused the Alaskan oil spill. It was yours."
~ Greenpeace advertisement New York Times 1990-02-25

Eric Sosman · May 13, 2009

The code snippet you showed earlier suggests that you're
compiling a brand-new regex for each line you split, using
it once, and throwing it away. No wonder "Java is slow!"

There's also the question of whether dragging out all the
regex machinery mightn't be overkill for such a simple format;
you could probably gain some speed by just looking for white
space yourself instead of using aiming cannons at canaries.
But as a first step, try re-using a single Pattern instead of
compiling a new one for every line.
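
A small sketch of the "single Pattern" idea, assuming a hypothetical three-column layout (string, int, double). The Row class and its field names are invented for the example; only Pattern.compile, Pattern.split, Integer.parseInt and Double.parseDouble are standard API.

```java
import java.util.regex.Pattern;

final class Row {

    // Compiled once and shared by every call to parse().
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Hypothetical column layout: one string, one int, one double per line.
    final String name;
    final int count;
    final double value;

    Row(String name, int count, double value) {
        this.name = name;
        this.count = count;
        this.value = value;
    }

    /** Builds one immutable Row from one line of the file. */
    static Row parse(String line) {
        // trim() avoids an empty first field when the line starts with whitespace.
        String[] f = WHITESPACE.split(line.trim());
        return new Row(f[0], Integer.parseInt(f[1]), Double.parseDouble(f[2]));
    }

    public static void main(String[] args) {
        Row r = parse("Johny17 42   3.5");
        System.out.println(r.name + " " + r.count + " " + r.value); // Johny17 42 3.5
    }
}
```

How much the reuse saves depends on the data, so it is worth timing both variants on the real file.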

Daniel Pitts · May 13, 2009

You can make it faster by not using split.
Look up StringTokenizer and StreamTokenizer (very different classes with
different uses)

They may be able to provide a faster implementation for you.

Otherwise, you can probably code one by hand that is faster, using
BufferedReader to read a line at a time and indexOf/substring to split
your String.
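
For comparison, a sketch of the StringTokenizer route, assuming the default delimiter set (space, tab, newline, carriage return, form feed) matches the file. The class name TokenizerSplit is hypothetical.

```java
import java.util.Arrays;
import java.util.StringTokenizer;

final class TokenizerSplit {

    /** Splits one line into fields using StringTokenizer's default delimiters. */
    static String[] fields(String line) {
        StringTokenizer st = new StringTokenizer(line);
        // countTokens() reports how many tokens remain, so the array can be sized up front.
        String[] out = new String[st.countTokens()];
        for (int i = 0; i < out.length; i++) {
            out[i] = st.nextToken();
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(fields("Johny17   42\t3.5"))); // [Johny17, 42, 3.5]
    }
}
```

Unlike split(), this never yields empty fields, since consecutive delimiters are skipped.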

tnorgd · May 14, 2009

OK, so I did some tests. Results are the following (for a part of my
data file):

1-A) Just to read lines:
while ((line = in.readLine()) != null);
takes 1.9 sec
1-B) readLine() + pattern.split(line) takes 7.0 sec

2) Just tokens (which does roughly what 1-A and 1-B do together):
while ((st.nextToken()) != StreamTokenizer.TT_EOF);
takes 6.6 sec

When I add parsing, e.g. Integer.parseInt() and Double.parseDouble(), I
end up at around 10 sec in both cases. Yes, apparently I have to do the
parsing myself in the StreamTokenizer case as well: my input contains
strings with digits (like "Johny17"), which are parsed into two
distinct tokens. So I had to switch off number parsing within
StreamTokenizer and do it on my own.
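
One way to switch off number parsing, sketched under the assumption that every token is delimited by whitespace: reset the syntax table and declare all printable characters to be word characters, so each token comes back as a string in sval. The class name WordOnlyTokenizer is hypothetical, and this is not necessarily the setup used for the timings above.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class WordOnlyTokenizer {

    /** Returns every whitespace-separated token as a String, with no number parsing. */
    static List<String> tokens(Reader source) {
        StreamTokenizer st = new StreamTokenizer(source);
        st.resetSyntax();            // drop the defaults, including number parsing
        st.wordChars('!', 255);      // treat every printable character as part of a word
        st.whitespaceChars(0, ' ');  // control characters and space separate tokens
        List<String> out = new ArrayList<String>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                out.add(st.sval);    // every token is a TT_WORD with this setup
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens(new StringReader("Johny17 42 3.5\n"))); // [Johny17, 42, 3.5]
    }
}
```

The numeric columns then still need Integer.parseInt() or Double.parseDouble(), as described above.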

Some of you have suggested that I could gain some speed by:
A) increasing the buffer size: yes, around 10% effect
B) changing from split("\\s+") to a compiled pattern: this has almost
no effect.

Best regards,
Dominik

John B. Matthews · May 14, 2009

Indeed, compiling such a short pattern has minimal benefit, but Eric
Sosman's parser suggestion may be worth the effort. I liked Daniel
Pitts' StreamTokenizer idea well enough to try it. It might be better
for creating a Double array:

<console>
Warmup: 30

Size: 5
RegEx: 19
Compiled: 3
Parse: 5
Token: 24

Size: 50
RegEx: 28
Compiled: 29
Parse: 14
Token: 61

Size: 500
RegEx: 280
Compiled: 276
Parse: 139
Token: 591

Size: 5000
RegEx: 3042
Compiled: 3007
Parse: 2038
Token: 8000
</console>

<code>
package cli;

import java.io.IOException;
import java.io.Reader;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.regex.Pattern;

/** @author JBM */
public class RCPTest {

    private static final Random random = new Random();

    public static void main(String[] args) {
        (new Warmup()).test(testString(1));
        System.out.println();
        for (int i = 1; i < 5; i++) {
            int padding = (int) Math.pow(10, i) / 2;
            System.out.println("Size: " + padding);
            String s = testString(padding);
            (new RegEx()).test(s);
            (new Compiled()).test(s);
            (new Parse()).test(s);
            (new Token()).test(s);
            System.out.println();
        }
    }

    private static String testString(int count) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) {
            sb.append(random.nextInt());
            sb.append(" ");
        }
        return sb.toString();
    }
}

abstract class Test {

    public static final int COUNT = 1000;

    public void test(String in) {
        long start = System.currentTimeMillis();
        for (int i = 0; i < COUNT; i++) {
            split(in);
        }
        System.out.println(name()
            + (System.currentTimeMillis() - start));
    }

    public abstract String[] split(String in);

    public abstract String name();
}

class Warmup extends Test {

    public String[] split(String in) {
        return (new RegEx()).split(in);
    }

    public String name() {
        return "Warmup: ";
    }
}

class RegEx extends Test {

    public String[] split(String in) {
        return in.split("\\s+");
    }

    public String name() {
        return "RegEx: ";
    }
}

class Compiled extends Test {

    private static final Pattern p = Pattern.compile("\\s+");

    public String[] split(String in) {
        return p.split(in);
    }

    public String name() {
        return "Compiled: ";
    }
}

class Parse extends Test {

    public String[] split(String in) {
        List<String> list = new ArrayList<String>();
        StringBuilder sb = new StringBuilder();
        int len = in.length();
        for (int i = 0; i < len; i++) {
            char c = in.charAt(i);
            if (c == ' ') {
                // End of a token: emit it and reset the buffer.
                list.add(sb.toString());
                sb.setLength(0);
            } else {
                sb.append(c);
            }
        }
        // Emit a final token that is not followed by a space.
        if (sb.length() > 0) {
            list.add(sb.toString());
        }
        return list.toArray(new String[0]);
    }

    public String name() {
        return "Parse: ";
    }
}

class Token extends Test {

    public String[] split(String in) {
        Reader reader = new StringReader(in);
        StreamTokenizer tokens = new StreamTokenizer(reader);
        List<String> list = new ArrayList<String>();
        double d;
        try {
            int token = tokens.nextToken();
            while (token != StreamTokenizer.TT_EOF) {
                d = tokens.nval;
                list.add(Double.toString(d));
                token = tokens.nextToken();
            }
            return list.toArray(new String[0]);
        } catch (IOException ex) {
            ex.printStackTrace(System.err);
            return new String[0];
        }
    }

    public String name() {
        return "Token: ";
    }
}
</code>

John B. Matthews · May 14, 2009

Eric Sosman said:
... instead of using aiming cannons at canaries.

I often use aiming cannons to get the canary's range then open up with
the 16-inch gun. I know it's wasteful.

Eric Sosman · May 14, 2009

John said:
I often use aiming cannons to get the canary's range then open up with
the 16-inch gun. I know it's wasteful.

Too much in-flight editing, too little editorial
review ...

Myself, I don't even bother aiming at the damn
canaries. They go "cheep, cheep, cheep" and I just
set off a nice, non-directional hundred megaton bomb.
The only drawback is that I keep dropping dead from
noxious gases down in the mine. C'est la vie -- er,
la mort.

Andreas Leitgeb · May 14, 2009

Eric Sosman said:
[...]
Some of you have suggested that I gain some speed by:
A) increasing buffer size: yes, around 10% effect
B) Changing from split("\\s+") to a compiled pattern: this has almost
no effect.

I admit surprise at (B) -- which just goes to show (again)
that opinion is inferior to measurement.

Two other possible explanations for surprise:

1) there may still be a bug in the test code, and the
Pattern (accidentally) gets re-compiled all the time.

2) the Regex part just isn't relevant to the total runtime.
