Efficient format for a huge amount of data

Gabriel Genellina

I have to pass a huge amount of data to a Java program. The source
program is not written in Java but I have control over both programs
and can arrange any suitable format at both ends.

The dataset is a sequence of records, all records having the same
structure. This structure is only known at runtime, and it's built on
simple types like string, integer, double, etc.

I could use an ASCII file to transfer data, like this:

"A string", 123, 4.567, "X"
"Another string", 89, 10.0, "Y"
"Third line", -1, 0.0, "Z"
.... many more lines, 100K or 1M ...

but AFAIK to parse it I have to use a BufferedReader + StringTokenizer
for each line + the various wrapper classes like Integer, Double... I
think this may be very slow for a large file.

Maybe a binary format is more efficient, but I don't know which could
be the best way, nor how to implement it.
I've considered using serialization, but since the source program is not
written in Java it may be hard to replicate the serialization format
exactly - btw, where is it documented? if documented at all...

Any ideas are welcome.
Thanks,

Gabriel Genellina
Softlab SRL
 
Marco Schmidt

Gabriel Genellina:

[...]
Maybe a binary format is more efficient, but I don't know which could
be the best way, nor how to implement it.

There are DataInputStream and DataOutputStream. Both have read and
write methods for the primitive types of Java and for Strings. Byte
order is big-endian, the valid ranges for the primitive types are
defined in the Java specs (e.g. char from 0 to 65535), and the format
of String serialization is described in the API docs of
readUTF/writeUTF.

So if an element were like the data you described above, an element
class could be:

class Element {
    String s;
    int i;
    float f;
    String s2;
}

And reading and writing could work like this:

Element read(DataInputStream in) throws IOException {
    Element elem = new Element();
    elem.s = in.readUTF();
    elem.i = in.readInt();
    elem.f = in.readFloat();
    elem.s2 = in.readUTF();
    return elem;
}

void write(DataOutputStream out, Element elem) throws IOException {
    out.writeUTF(elem.s);
    out.writeInt(elem.i);
    out.writeFloat(elem.f);
    out.writeUTF(elem.s2);
}

There is no single best way of doing persistent storage. Personally
I'd work with databases whenever it's feasible. I don't like self-made
binary formats like the above very much. You can't change things
easily, at least not if you have to convert existing data from binary
format A to B. Other people will have to study your format and write
and maintain dedicated code.

However, the format is more efficient (less space and faster to parse)
than ASCII text.
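To sketch how a whole sequence of such records might be handled, here is one way to round-trip many records; the RecordFile class name and the EOFException-based end-of-stream check are illustrative choices of mine, not anything prescribed above (a record-count header at the front of the file would be another option):

```java
import java.io.*;
import java.util.*;

public class RecordFile {

    static class Element {
        String s; int i; float f; String s2;
    }

    // Write a whole sequence of records, buffered for speed.
    static byte[] writeAll(List<Element> records) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(new BufferedOutputStream(bytes));
        for (Element e : records) {
            out.writeUTF(e.s);
            out.writeInt(e.i);
            out.writeFloat(e.f);
            out.writeUTF(e.s2);
        }
        out.flush();
        return bytes.toByteArray();
    }

    // Read records until the end of the stream; EOFException marks the end.
    static List<Element> readAll(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(
                new BufferedInputStream(new ByteArrayInputStream(data)));
        List<Element> records = new ArrayList<Element>();
        try {
            while (true) {
                Element e = new Element();
                e.s = in.readUTF();
                e.i = in.readInt();
                e.f = in.readFloat();
                e.s2 = in.readUTF();
                records.add(e);
            }
        } catch (EOFException endOfData) {
            // normal end of the record sequence
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        Element e = new Element();
        e.s = "A string"; e.i = 123; e.f = 4.567f; e.s2 = "X";
        List<Element> back = readAll(writeAll(Arrays.asList(e, e, e)));
        System.out.println(back.size() + " records, first int = " + back.get(0).i);
    }
}
```

The buffering matters: unbuffered DataOutputStream writes one small OS-level write per field, which is the usual cause of "binary I/O is slow" complaints.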

Regards,
Marco
 
Thomas Schodt

Marco said:
There are DataInputStream and DataOutputStream. Both have read and
write methods for the primitive types of Java and Strings. Byte order
is big endian

So be sure to use htons() / htonl() in the non-Java app before stuffing
the data on the stream.
 
Andrew Hobbs

Gabriel Genellina said:
I have to pass a huge amount of data to a Java program. The source
program is not written in Java but I have control over both programs
and can arrange any suitable format at both ends.

The dataset is a sequence of records, all records having the same
structure. This structure is only known at runtime, and it's built on
simple types like string, integer, double, etc.

I could use an ASCII file to transfer data, like this:

"A string", 123, 4.567, "X"
"Another string", 89, 10.0, "Y"
"Third line", -1, 0.0, "Z"
... many more lines, 100K or 1M ...

but AFAIK to parse it I have to use a BufferedReader + StringTokenizer
for each line + the various wrapper classes like Integer, Double... I
think this may be very slow for a large file.

How large are you talking about? 1 MByte is not a large file. And what do
you consider too slow? Have you tried that approach? I suspect you will
find it faster than you think. Alternatively, what about writing a parser
yourself? Look at each character in turn, using the commas as
delimiters.
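A minimal sketch of such a hand-rolled parser, assuming the four-field layout from the original post (the HandParser name and the quote handling are just illustrative):

```java
import java.util.*;

public class HandParser {

    // Split one line of the form: "A string", 123, 4.567, "X"
    // by scanning characters, using commas outside quotes as delimiters.
    static String[] splitFields(String line) {
        List<String> fields = new ArrayList<String>();
        int start = 0;
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;
            } else if (c == ',' && !inQuotes) {
                fields.add(clean(line.substring(start, i)));
                start = i + 1;
            }
        }
        fields.add(clean(line.substring(start)));
        return fields.toArray(new String[0]);
    }

    // Trim whitespace and strip surrounding double quotes, if any.
    static String clean(String s) {
        s = s.trim();
        if (s.length() >= 2 && s.startsWith("\"") && s.endsWith("\""))
            s = s.substring(1, s.length() - 1);
        return s;
    }

    public static void main(String[] args) {
        String[] f = splitFields("\"A string\", 123, 4.567, \"X\"");
        int i = Integer.parseInt(f[1]);
        double d = Double.parseDouble(f[2]);
        System.out.println(f[0] + "|" + i + "|" + d + "|" + f[3]);
    }
}
```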

We wrote our own parser and reading a 1 MByte file off disc, parsing it into
floats and strings and then drawing the 3D structure that it represents
takes a fraction of a second. If you want to see what I mean then log onto
www.metasense.com.au and try the free trial version. Click on the Chemistry
and then the DNA folder and try out some of those molecules. The largest is
almost 1 M in size and it loads and displays on my machine in about 1/2
second. It might take longer for you depending upon the speed of your
connection.

Cheers

Andrew

--
********************************************************
Andrew Hobbs PhD

MetaSense Pty Ltd - www.metasense.com.au
12 Ashover Grove
Carine W.A.
Australia 6020

61 8 9246 2026
(e-mail address removed)

*********************************************************
 
Christian Holm

but AFAIK to parse it I have to use a BufferedReader + StringTokenizer
for each line + the various wrapper classes like Integer, Double... I
think this may be very slow for a large file.

I wouldn't worry too much about speed. I've written something very similar,
and was able to parse a 600 MB text file using the method above in about a
minute. Your case may be a bit more time-consuming, but it will probably
still be fast enough.

Christian
 
Thomas Weidenfeller

Gabriel said:
I have to pass a huge amount of data [...]
... many more lines, 100K or 1M ...

1M is not a huge amount of data. I eat that for breakfast - twice :)
but AFAIK to parse it I have to use a BufferedReader + StringTokenizer
for each line + the various wrapper classes like Integer, Double... I
think this may be very slow for a large file.

Try it. Slow is a relative term, but I don't think you will get in
trouble here.
Maybe a binary format is more efficient, but I don't know which could
be the best way, nor how to implement it.

A ByteBuffer might be the fastest.
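A sketch of what the ByteBuffer route could look like, using a length prefix for strings; the record layout here is an assumption for illustration, not a fixed format:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class ByteBufferDemo {

    // Pack one record (string, int, float) into a ByteBuffer and read it back.
    static String roundTrip() {
        ByteBuffer buf = ByteBuffer.allocate(64); // big-endian by default
        byte[] s = "A string".getBytes(StandardCharsets.UTF_8);
        buf.putShort((short) s.length); // length prefix for the string
        buf.put(s);
        buf.putInt(123);
        buf.putFloat(4.567f);
        buf.flip(); // switch from writing to reading

        byte[] raw = new byte[buf.getShort()];
        buf.get(raw);
        String str = new String(raw, StandardCharsets.UTF_8);
        int i = buf.getInt();
        float f = buf.getFloat();
        return str + ", " + i + ", " + f;
    }

    public static void main(String[] args) {
        System.out.println(roundTrip());
    }
}
```

Since the buffer is big-endian by default, the same bytes can be produced by DataOutputStream or by a C program writing in network byte order.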
I've considered using serialization, but since the source program is not
written in Java it may be hard to replicate the serialization format
exactly - btw, where is it documented? if documented at all...

AFAIR the low-level details are documented in the
Data[Output|Input]Stream or Object[Input|Output]Stream API
documentation. There is also some spec. on Sun's Java web site.

/Thomas
 
nos

Gabriel Genellina said:
I have to pass a huge amount of data to a Java program. The source
program is not written in Java but I have control over both programs
and can arrange any suitable format at both ends.

The dataset is a sequence of records, all records having the same
structure. This structure is only known at runtime, and it's built on
simple types like string, integer, double, etc.

I could use an ASCII file to transfer data, like this:

"A string", 123, 4.567, "X"
"Another string", 89, 10.0, "Y"
"Third line", -1, 0.0, "Z"
... many more lines, 100K or 1M ...

but AFAIK to parse it I have to use a BufferedReader + StringTokenizer
for each line + the various wrapper classes like Integer, Double... I
think this may be very slow for a large file.

Maybe a binary format is more efficient, but I don't know which could
be the best way, nor how to implement it.
I've considered using serialization, but since the source program is not
written in Java it may be hard to replicate the serialization format
exactly - btw, where is it documented? if documented at all...

Any ideas are welcome.
Thanks,

Gabriel Genellina
Softlab SRL

I would put one value per line. This avoids tokenizing and
the file size doesn't change much.
 
William Brogden

Gabriel Genellina said:
I have to pass a huge amount of data to a Java program. The source
program is not written in Java but I have control over both programs
and can arrange any suitable format at both ends.

The dataset is a sequence of records, all records having the same
structure. This structure is only known at runtime, and it's built on
simple types like string, integer, double, etc.

I could use an ASCII file to transfer data, like this:

"A string", 123, 4.567, "X"
"Another string", 89, 10.0, "Y"
"Third line", -1, 0.0, "Z"
... many more lines, 100K or 1M ...

but AFAIK to parse it I have to use a BufferedReader + StringTokenizer
for each line + the various wrapper classes like Integer, Double... I
think this may be very slow for a large file.

A StreamTokenizer would be much more flexible and you would only need to
create one.
Using the flag to set end-of-line as a token would let you tell when each
line ends.
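A sketch of that approach against the sample lines from the original post (the tokenize helper is illustrative; note that StreamTokenizer reports numbers as doubles via nval and quoted strings via sval):

```java
import java.io.*;
import java.util.*;

public class TokenizerDemo {

    // One StreamTokenizer for the whole stream; eolIsSignificant(true)
    // makes it report end-of-line as a token so we can tell lines apart.
    static List<String> tokenize(String data) throws IOException {
        StreamTokenizer st = new StreamTokenizer(new StringReader(data));
        st.eolIsSignificant(true);
        List<String> out = new ArrayList<String>();
        for (int t = st.nextToken(); t != StreamTokenizer.TT_EOF; t = st.nextToken()) {
            if (t == StreamTokenizer.TT_NUMBER) out.add(Double.toString(st.nval));
            else if (t == StreamTokenizer.TT_EOL) out.add("<EOL>");
            else if (t == '"') out.add(st.sval);
            // anything else (the commas) is just skipped
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(tokenize("\"A string\", 123, 4.567, \"X\"\n"));
    }
}
```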
Bill
 
Chris

I doubt that speed will be an issue for you.

I've been working on some address handling software for a mate,
comma-delimited records, file-size usually around the 3-4Mb mark,
using BufferedReader and StringTokenizer for parsing - it generally
takes a minute or so to process (and it looks like the in-memory
processing I'm doing is considerably more complex than your
requirements).

Try it and see!

- sarge
 
Jon A. Cruz

Thomas said:
So be sure to use htons() / htonl() in the non-Java app before stuffing
the data on the stream.

Actually, try not to use them.

Instead use explicit byte math to get values out in an explicit order.

Since most networked applications use 'network byte order' which is
big-endian, go ahead and use that.

To give you the rough idea:

void write32( char* dst, uint32_t u )
{
    *dst++ = (u >> 24) & 0x0ff;
    *dst++ = (u >> 16) & 0x0ff;
    *dst++ = (u >> 8) & 0x0ff;
    *dst++ = (u >> 0) & 0x0ff;
}
 
Jon A. Cruz

Gabriel said:
I could use an ASCII file to transfer data, like this:

Probably not, since an "ASCII" file would be limited to 7-bit data, and
would lose things. It's very important, especially in the Java world, to
remember that "ASCII" is *not* a synonym for "plain text".

Most of the MS Windows documentation uses "ANSI" as a term for 8-bit
text. "ASCII" is much more limited, and is an actual encoding in Java's
character conversions. You'll hit a lot of subtle errors telling Java
applications that you want "ASCII" data when that's not really what you need.

"A string", 123, 4.567, "X"
"Another string", 89, 10.0, "Y"
"Third line", -1, 0.0, "Z"
... many more lines, 100K or 1M ...

but AFAIK to parse it I have to use a BufferedReader + StringTokenizer
for each line + the various wrapper classes like Integer, Double... I
think this may be very slow for a large file.

As long as you wrap IO in one of the buffered types, speed probably
won't be a problem on only 1MB.


HOWEVER... there's another gotcha. Readers use some encoding to convert
from 8-bit encodings to internal Java strings which are UTF-16. You'll
probably want to be very explicit on the encoding used. UTF-8 is
probably very good for your needs.
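A minimal sketch of being explicit about the encoding (the helper name is illustrative):

```java
import java.io.*;

public class ExplicitEncoding {

    // Read text with an explicit encoding instead of the platform default.
    static String firstLine(byte[] data) {
        try {
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(new ByteArrayInputStream(data), "UTF-8"));
            return in.readLine();
        } catch (IOException cannotHappenOnAByteArray) {
            throw new RuntimeException(cannotHappenOnAByteArray);
        }
    }

    public static void main(String[] args) throws IOException {
        // A line with a non-ASCII character survives the round trip
        // because both sides agree on UTF-8.
        byte[] data = "\"caf\u00e9\", 1, 2.0, \"Z\"\n".getBytes("UTF-8");
        System.out.println(firstLine(data));
    }
}
```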
 
Gabriel Genellina

Andrew Hobbs said:
How large are you talking about? 1 MByte is not a large file. And what do
you consider too slow? Have you tried that approach? I suspect you will
find it faster than you think. Alternatively, what about writing a parser
yourself? Look at each character in turn, using the commas as
delimiters.

Sorry, I meant between 100000 and 1 million lines, not 1MB file size.
My test file (ASCII format) is about 200 MB.
Reading the ASCII file was too slow - I'll try other ways as suggested
by other people here.
 
Scott Ellsworth

I have to pass a huge amount of data to a Java program. The source
program is not written in Java but I have control over both programs
and can arrange any suitable format at both ends.

Depends on just how much the "huge amount" ends up being, and how you
intend to use it.

I parse a data file containing matrix data for a simple lapack test. It
has an x, a y, and a double value for matrices up to 500 by 500. This
uses no tokenizers, just reading the line, splitting on the space, and
parsing the data. This 1.1M file is read in 1.357 seconds.

In a different project, I parse 10M XML files using a JDOM-based parser
in 5 seconds or so, though these are all string data without a
string-double conversion.

For both of these, it was important for me to have a format that a human
could read, and that a junior programmer could write a correct parser
for in a very short time, so I used a pure text format.

The nio package has memory mapped files, auto-endian converting byte
buffers, and other tools that make a binary representation easier to
handle.
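A sketch of the memory-mapped route for a column of doubles (the file layout and names here are my own assumptions, not anything from the thread):

```java
import java.io.*;
import java.nio.DoubleBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedDemo {

    // Write big-endian doubles to a temp file, then read them back
    // through a memory-mapped DoubleBuffer view of the file.
    static double[] readBack(double[] values) throws IOException {
        File f = File.createTempFile("records", ".bin");
        f.deleteOnExit();

        DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(f)));
        for (double d : values) out.writeDouble(d);
        out.close();

        FileInputStream fin = new FileInputStream(f);
        FileChannel ch = fin.getChannel();
        MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        DoubleBuffer doubles = map.asDoubleBuffer(); // big-endian, like DataOutputStream
        double[] result = new double[doubles.remaining()];
        doubles.get(result);
        fin.close();
        return result;
    }

    public static void main(String[] args) throws IOException {
        for (double d : readBack(new double[] {1.0, 2.5, -3.75}))
            System.out.println(d);
    }
}
```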

The key question is where your time is likely to be spent. If you have
a lot of data that has to come off the disk quickly, then a binary
format will minimize wire time. If that file needs to be curated,
parsed, read in by other languages, then human readability might become
dominant. If you only need a small subset of the data, you might be
best served by a relational database, as those are very good at
searching gigabytes of data to extract the 50k or so you wanted.

Scott
(e-mail address removed)
Java, Cocoa, WebObjects and Database consulting for the life sciences
 
Jon A. Cruz

Gabriel said:
Sorry, I meant between 100000 and 1 million lines, not 1MB file size.
My test file (ASCII format) is about 200 MB.
Reading the ASCII file was too slow - I'll try other ways as suggested
by other people here.

Again, "ASCII" is not correct.

Among other things, Java can use "ASCII" as the encoding during
conversion, but you will lose half of all possible byte values.

Not safe.
 
A. Craig West

Actually, try not to use them.
Instead use explicit byte math to get values out in an explicit order.
Since most networked applications use 'network byte order' which is
big-endian, go ahead and use that.

I'm not sure what you have against htons() and htonl(), seeing as they are
commonly available macros that convert data from the host-specific byte order
to network byte order, which is exactly what is needed. That's the whole POINT
of htons() and htonl(). While you could expand out the macros yourself (as
you did in your example), if you are doing any significant amount of data
at all you will end up writing your own anyway, so you might as well use
the common ones.
Now if it should happen that the non-java app isn't written in C or C++, then
I can see where using htons() and htonl() could be a problem...
 
Jon A. Cruz

A. Craig West said:
I'm not sure what you have against htons() and htonl(), seeing as they are
commonly available macros that convert data from the host-specific byte order
to network byte order, which is exactly what is needed.

Well, that's a lot of what I have against them.

:)

That they are macros and that they *convert* the endianness of data.

If one accidentally ends up calling them twice on the same data, then
you just undid your fixing of the data.

And, yes, I've encountered actual bugs where people had done that.


Another problem with them is that they are not guaranteed as to what
sizes they operate on. Depending on the platform and the age of the
compiler, things can be defined "interestingly".

Most modern compilers will have switched to stdint types, but that
wasn't always the case.

That's the whole POINT
of htons() and htonl().

Actually, not quite.

The whole point of them was to prep certain data for simple direct
networking support.

Most man pages describe them as "These routines are most often used
in conjunction with Internet addresses and ports as returned by
gethostent() and getservent()."


While you could expand out the macros yourself (like
you did in your example) if you are doing any significant amount of data at
all you will end up writing your own anyways, so you might as well use the
common ones.

Well, it comes down to differences in the usage also.

If you use those, then you still have to marshal the values you end up with.

Now, there are two general approaches at that point.

First, one could take the result of that call and store it in a
temporary variable. Then one could write out data by pointing to the
address of that temporary variable and writing the given number of bytes.

Second, one could take that temporary result and then send it out (or
copy it over) a single byte at a time, the way I had things listed in
that pseudocode.


Both of those have drawbacks.

For the first case, things are just "bad". That is, the code (either
writing or memcpy'ing) will have to access the internals of a variable
directly. Since that's to be avoided at all costs for structs, making an
exception for primitives makes the code inconsistent. And it leaves
things fragile in that if a maintenance programmer doesn't understand
all the subtleties of when to peek at memory and when not to, a mistake
is easy to make. Additionally, an extra temporary variable is needed to
access the guts of.

For the second case, the htonl call is unneeded, and again we have a
superfluous temporary variable.



Of course, there are two general options for IO in this manner. Either
write things directly, or marshal the bytes first before sending.

Sometimes it might be nice to have a function that writes directly. In
those cases something might be "int writeU32( int fh, uint32 u )".

For other cases, having a macro that marshals the given value into a
buffer with the proper byte order and also updating the pointer by the
number of bytes stored is nice.

In any case, using those instead of htons/htonl themselves also tends to
make the code more readable:


uint32_t tmp32;
uint16_t tmp16;

tmp32 = htonl( bar.field1 );
result = write( fh, &tmp32, sizeof(tmp32) );
tmp16 = htons( bar.field2 );
result = write( fh, &tmp16, sizeof(tmp16) );
tmp32 = htonl( bar.field3 );
result = write( fh, &tmp32, sizeof(tmp32) );


becomes

result = writeU32( fh, bar.field1 );
result = writeU16( fh, bar.field2 );
result = writeU32( fh, bar.field3 );


and


uint32_t tmp32;
uint16_t tmp16;

tmp32 = htonl( bar.field1 );
marshal( p, &tmp32, sizeof(tmp32) );
tmp16 = htons( bar.field2 );
marshal( p, &tmp16, sizeof(tmp16) );
tmp32 = htonl( bar.field3 );
marshal( p, &tmp32, sizeof(tmp32) );


becomes

marshalU32( p, bar.field1 );
marshalU16( p, bar.field2 );
marshalU32( p, bar.field3 );


Much clearer.


(Of course, remember the error checking for the routines using fh)
 
Jon Skeet

Jon A. Cruz said:
Again, "ASCII" is not correct.

Among other things, Java can use "ASCII" as the encoding during
conversion, but you will lose half of all possible byte values.

Maybe I'm missing something here, but it looks perfectly possible to me
that Gabriel's test file *was* in ASCII format. Perhaps he knows (for
whatever reason) that his data will never go out of ASCII, or *at
least* knows that his *test* data is all within ASCII.
 
Dimitri Maziuk

Gabriel Genellina sez:
I have to pass a huge amount of data to a Java program.

The dataset is a sequence of records, all records having the same
structure. This structure is only known at runtime, and it's built on
simple types like string, integer, double, etc.

What do you mean, at runtime?
Maybe a binary format is more efficient, but I don't know which could
be the best way, nor how to implement it.

Binary format is more efficient, but if your data doesn't come from
a Java program it may be too hard to do.

Give JFlex a try: I've written JFlex lexers that parse 100+Mb files in
seconds.

Dima
 
Jon A. Cruz

Jon said:
Maybe I'm missing something here, but it looks perfectly possible to me
that Gabriel's test file *was* in ASCII format. Perhaps he knows (for
whatever reason) that his data will never go out of ASCII, or *at
least* knows that his *test* data is all within ASCII.

In that case, it *was* in UTF-8 also. :)

However, from the context of the entire thread, it seems quite clear
that he's using "ASCII" as a synonym for "Plain text". Among other
things, his initial post contrasts "I could use an ASCII file..." to
"Maybe a binary format is more efficient...".

And a key clue is his next phrase "but I don't know which could
be the best way". This really goes to show he's comparing "ASCII file"
to "binary format".
 
Gabriel Genellina

Jon Skeet said:
Maybe I'm missing something here, but it looks perfectly possible to me
that Gabriel's test file *was* in ASCII format. Perhaps he knows (for
whatever reason) that his data will never go out of ASCII, or *at
least* knows that his *test* data is all within ASCII.

Both were true... I should have written "plain text file" instead of
ASCII file, sorry - I came from the dark ages, before MIME and Unicode
were born...
And in fact my test file is just ASCII - its contents were randomly
generated using just uppercase A-Z letters plus spaces.
 
