Efficient format for a huge amount of data

Gabriel Genellina

I have to pass a huge amount of data to a Java program. The source
program is not written in Java but I have control over both programs
and can arrange any suitable format at both ends.

The dataset is a sequence of records, all records having the same
structure. This structure is only known at runtime, and it's built on
simple types like string, integer, double, etc.

I could use an ASCII file to transfer data, like this:

"A string", 123, 4.567, "X"
"Another string", 89, 10.0, "Y"
"Third line", -1, 0.0, "Z"
.... many more lines, 100K or 1M ...

but AFAIK to parse it I have to use a BufferedReader + StringTokenizer
for each line + the various wrapper classes like Integer, Double... I
think this may be very slow for a large file.

Maybe a binary format is more efficient, but I don't know which could
be the best way, nor how to implement it.
I've considered using serialization, but since the source program is not
written in Java it may be hard to replicate the serialization format
exactly - btw, where is it documented? if documented at all...

Any ideas are welcome.
Thanks,

Gabriel Genellina
Softlab SRL
 
Marco Schmidt

Gabriel Genellina:

[...]
Maybe a binary format is more efficient, but I don't know which could
be the best way, nor how to implement it.

There are DataInputStream and DataOutputStream. Both have read and
write methods for the primitive types of Java and for Strings. Byte
order is big-endian, the valid ranges for the primitive types are
defined in the Java specs (e.g. char from 0 to 65535), and the format
of String serialization is described in the API docs of
readUTF/writeUTF.

So if an element were like the data you described above, an element
class could be:

class Element {
    String s;
    int i;
    float f;
    String s2;
}

And reading and writing could work like this:

Element read(DataInputStream in) throws IOException {
    Element elem = new Element();
    elem.s = in.readUTF();
    elem.i = in.readInt();
    elem.f = in.readFloat();
    elem.s2 = in.readUTF();
    return elem;
}

void write(DataOutputStream out, Element elem) throws IOException {
    out.writeUTF(elem.s);
    out.writeInt(elem.i);
    out.writeFloat(elem.f);
    out.writeUTF(elem.s2);
}

There is no single best way of doing persistent storage. Personally
I'd work with databases whenever it's feasible. I don't like self-made
binary formats like the above very much. You can't change things
easily, at least not if you have to convert existing data from binary
format A to B. Other people will have to study your format and write
and maintain dedicated code.

However, the format is more efficient (less space and faster to parse)
than ASCII text.
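To sketch how a whole sequence of such records might be handled, here is one way to round-trip many records; the RecordFile class name and the EOFException-based end-of-stream check are illustrative choices of mine, not anything prescribed above (a record-count header at the front of the file would be another option):

```java
import java.io.*;
import java.util.*;

public class RecordFile {

    static class Element {
        String s; int i; float f; String s2;
    }

    // Write a whole sequence of records, buffered for speed.
    static byte[] writeAll(List<Element> records) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(new BufferedOutputStream(bytes));
        for (Element e : records) {
            out.writeUTF(e.s);
            out.writeInt(e.i);
            out.writeFloat(e.f);
            out.writeUTF(e.s2);
        }
        out.flush();
        return bytes.toByteArray();
    }

    // Read records until the end of the stream; EOFException marks the end.
    static List<Element> readAll(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(
                new BufferedInputStream(new ByteArrayInputStream(data)));
        List<Element> records = new ArrayList<Element>();
        try {
            while (true) {
                Element e = new Element();
                e.s = in.readUTF();
                e.i = in.readInt();
                e.f = in.readFloat();
                e.s2 = in.readUTF();
                records.add(e);
            }
        } catch (EOFException endOfData) {
            // normal end of the record sequence
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        Element e = new Element();
        e.s = "A string"; e.i = 123; e.f = 4.567f; e.s2 = "X";
        List<Element> back = readAll(writeAll(Arrays.asList(e, e, e)));
        System.out.println(back.size() + " records, first int = " + back.get(0).i);
    }
}
```

The buffering matters: unbuffered DataOutputStream writes one small OS-level write per field, which is the usual cause of "binary I/O is slow" complaints.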

Regards,
Marco
 
Thomas Schodt

Marco said:
There are DataInputStream and DataOutputStream. Both have read and
write methods for the primitive types of Java and Strings. Byte order
is big endian

So be sure to use htons() / htonl() in the non-Java app before stuffing
the data on the stream.
 
Andrew Hobbs

Gabriel Genellina said:
I have to pass a huge amount of data to a Java program. The source
program is not written in Java but I have control over both programs
and can arrange any suitable format at both ends.

The dataset is a sequence of records, all records having the same
structure. This structure is only known at runtime, and it's built on
simple types like string, integer, double, etc.

I could use an ASCII file to transfer data, like this:

"A string", 123, 4.567, "X"
"Another string", 89, 10.0, "Y"
"Third line", -1, 0.0, "Z"
... many more lines, 100K or 1M ...

but AFAIK to parse it I have to use a BufferedReader + StringTokenizer
for each line + the various wrapper classes like Integer, Double... I
think this may be very slow for a large file.

How large are you talking about? 1 MByte is not a large file. And what do
you consider too slow? Have you tried that approach? I suspect you will
find it faster than you think. Alternatively, what about writing a parser
yourself? Look at each character in turn, using the commas as
delimiters.
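A minimal sketch of such a hand-rolled parser, assuming the four-field layout from the original post (the HandParser name and the quote handling are just illustrative):

```java
import java.util.*;

public class HandParser {

    // Split one line of the form: "A string", 123, 4.567, "X"
    // by scanning characters, using commas outside quotes as delimiters.
    static String[] splitFields(String line) {
        List<String> fields = new ArrayList<String>();
        int start = 0;
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;
            } else if (c == ',' && !inQuotes) {
                fields.add(clean(line.substring(start, i)));
                start = i + 1;
            }
        }
        fields.add(clean(line.substring(start)));
        return fields.toArray(new String[0]);
    }

    // Trim whitespace and strip surrounding double quotes, if any.
    static String clean(String s) {
        s = s.trim();
        if (s.length() >= 2 && s.startsWith("\"") && s.endsWith("\""))
            s = s.substring(1, s.length() - 1);
        return s;
    }

    public static void main(String[] args) {
        String[] f = splitFields("\"A string\", 123, 4.567, \"X\"");
        int i = Integer.parseInt(f[1]);
        double d = Double.parseDouble(f[2]);
        System.out.println(f[0] + "|" + i + "|" + d + "|" + f[3]);
    }
}
```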

We wrote our own parser and reading a 1 MByte file off disc, parsing it into
floats and strings and then drawing the 3D structure that it represents
takes a fraction of a second. If you want to see what I mean then log onto
www.metasense.com.au and try the free trial version. Click on the Chemistry
and then the DNA folder and try out some of those molecules. The largest is
almost 1 M in size and it loads and displays on my machine in about 1/2
second. It might take longer for you depending upon the speed of your
connection.

Cheers

Andrew

--
********************************************************
Andrew Hobbs PhD

MetaSense Pty Ltd - www.metasense.com.au
12 Ashover Grove
Carine W.A.
Australia 6020

61 8 9246 2026
(e-mail address removed)

*********************************************************
 
Christian Holm

but AFAIK to parse it I have to use a BufferedReader + StringTokenizer
for each line + the various wrapper classes like Integer, Double... I
think this may be very slow for a large file.

I wouldn't worry too much about speed. I've written something very similar,
and was able to parse a 600 MB text file using the method above in about a
minute. Your case may be a bit more time-consuming, but it will probably
still be fast enough.

Christian
 
Thomas Weidenfeller

Gabriel said:
I have to pass a huge amount of data [...]
... many more lines, 100K or 1M ...

1M is not a huge amount of data. I eat that for breakfast - twice :)
but AFAIK to parse it I have to use a BufferedReader + StringTokenizer
for each line + the various wrapper classes like Integer, Double... I
think this may be very slow for a large file.

Try it. Slow is a relative term, but I don't think you will get in
trouble here.
Maybe a binary format is more efficient, but I don't know which could
be the best way, nor how to implement it.

A ByteBuffer might be the fastest.
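A sketch of what the ByteBuffer route could look like, using a length prefix for strings; the record layout here is an assumption for illustration, not a fixed format:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class ByteBufferDemo {

    // Pack one record (string, int, float) into a ByteBuffer and read it back.
    static String roundTrip() {
        ByteBuffer buf = ByteBuffer.allocate(64); // big-endian by default
        byte[] s = "A string".getBytes(StandardCharsets.UTF_8);
        buf.putShort((short) s.length); // length prefix for the string
        buf.put(s);
        buf.putInt(123);
        buf.putFloat(4.567f);
        buf.flip(); // switch from writing to reading

        byte[] raw = new byte[buf.getShort()];
        buf.get(raw);
        String str = new String(raw, StandardCharsets.UTF_8);
        int i = buf.getInt();
        float f = buf.getFloat();
        return str + ", " + i + ", " + f;
    }

    public static void main(String[] args) {
        System.out.println(roundTrip());
    }
}
```

Since the buffer is big-endian by default, the same bytes can be produced by DataOutputStream or by a C program writing in network byte order.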
I've considered using serialization, but since the source program is not
written in Java it may be hard to replicate the serialization format
exactly - btw, where is it documented? if documented at all...

AFAIR the low-level details are documented in the
Data[Output|Input]Stream or Object[Input|Output]Stream API
documentation. There is also some spec. on Sun's Java web site.

/Thomas
 
nos

Gabriel Genellina said:
I have to pass a huge amount of data to a Java program. The source
program is not written in Java but I have control over both programs
and can arrange any suitable format at both ends.

The dataset is a sequence of records, all records having the same
structure. This structure is only known at runtime, and it's built on
simple types like string, integer, double, etc.

I could use an ASCII file to transfer data, like this:

"A string", 123, 4.567, "X"
"Another string", 89, 10.0, "Y"
"Third line", -1, 0.0, "Z"
... many more lines, 100K or 1M ...

but AFAIK to parse it I have to use a BufferedReader + StringTokenizer
for each line + the various wrapper classes like Integer, Double... I
think this may be very slow for a large file.

Maybe a binary format is more efficient, but I don't know which could
be the best way, nor how to implement it.
I've considered using serialization, but since the source program is not
written in Java it may be hard to replicate the serialization format
exactly - btw, where is it documented? if documented at all...

Any ideas are welcome.
Thanks,

Gabriel Genellina
Softlab SRL

I would put one value per line. This avoids tokenizing and
the file size doesn't change much.
 
William Brogden

Gabriel Genellina said:
I have to pass a huge amount of data to a Java program. The source
program is not written in Java but I have control over both programs
and can arrange any suitable format at both ends.

The dataset is a sequence of records, all records having the same
structure. This structure is only known at runtime, and it's built on
simple types like string, integer, double, etc.

I could use an ASCII file to transfer data, like this:

"A string", 123, 4.567, "X"
"Another string", 89, 10.0, "Y"
"Third line", -1, 0.0, "Z"
... many more lines, 100K or 1M ...

but AFAIK to parse it I have to use a BufferedReader + StringTokenizer
for each line + the various wrapper classes like Integer, Double... I
think this may be very slow for a large file.

A StreamTokenizer would be much more flexible and you would only need to
create one.
Using the flag to set end-of-line as a token would let you tell when each
line ends.
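A sketch of that approach against the sample lines from the original post (the tokenize helper is illustrative; note that StreamTokenizer reports numbers as doubles via nval and quoted strings via sval):

```java
import java.io.*;
import java.util.*;

public class TokenizerDemo {

    // One StreamTokenizer for the whole stream; eolIsSignificant(true)
    // makes it report end-of-line as a token so we can tell lines apart.
    static List<String> tokenize(String data) throws IOException {
        StreamTokenizer st = new StreamTokenizer(new StringReader(data));
        st.eolIsSignificant(true);
        List<String> out = new ArrayList<String>();
        for (int t = st.nextToken(); t != StreamTokenizer.TT_EOF; t = st.nextToken()) {
            if (t == StreamTokenizer.TT_NUMBER) out.add(Double.toString(st.nval));
            else if (t == StreamTokenizer.TT_EOL) out.add("<EOL>");
            else if (t == '"') out.add(st.sval);
            // anything else (the commas) is just skipped
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(tokenize("\"A string\", 123, 4.567, \"X\"\n"));
    }
}
```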
Bill
 
Chris

I doubt that speed will be an issue for you.

I've been working on some address handling software for a mate,
comma-delimited records, file-size usually around the 3-4Mb mark,
using BufferedReader and StringTokenizer for parsing - it generally
takes a minute or so to process (and it looks like the in-memory
processing I'm doing is considerably more complex than your
requirements).

Try it and see!

- sarge
 
Jon A. Cruz

Thomas said:
So be sure to use htons() / htonl() in the non-Java app before stuffing
the data on the stream.

Actually, try not to use them.

Instead use explicit byte math to get values out in an explicit order.

Since most networked applications use 'network byte order' which is
big-endian, go ahead and use that.

To give you the rough idea:

void write32( char* dst, uint32_t u )
{
    *dst++ = (u >> 24) & 0x0ff;
    *dst++ = (u >> 16) & 0x0ff;
    *dst++ = (u >> 8) & 0x0ff;
    *dst++ = (u >> 0) & 0x0ff;
}
 
Jon A. Cruz

Gabriel said:
I could use an ASCII file to transfer data, like this:

Probably not, since an "ASCII" file would be limited to 7-bit data, and
would lose things. It's very important, especially in the Java world, to
remember that "ASCII" is *not* a synonym for "plain text".

Most of the MS Windows documentation uses "ANSI" as a term for 8-bit
text. "ASCII" is much more limited, and is an actual encoding in Java's
character conversions. You'll hit a lot of subtle errors telling Java
applications that you want "ASCII" data when that's not really what you need.

"A string", 123, 4.567, "X"
"Another string", 89, 10.0, "Y"
"Third line", -1, 0.0, "Z"
... many more lines, 100K or 1M ...

but AFAIK to parse it I have to use a BufferedReader + StringTokenizer
for each line + the various wrapper classes like Integer, Double... I
think this may be very slow for a large file.

As long as you wrap IO in one of the buffered types, speed probably
won't be a problem on only 1MB.


HOWEVER... there's another gotcha. Readers use some encoding to convert
from 8-bit encodings to internal Java strings which are UTF-16. You'll
probably want to be very explicit on the encoding used. UTF-8 is
probably very good for your needs.
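A minimal sketch of being explicit about the encoding (the helper name is illustrative):

```java
import java.io.*;

public class ExplicitEncoding {

    // Read text with an explicit encoding instead of the platform default.
    static String firstLine(byte[] data) {
        try {
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(new ByteArrayInputStream(data), "UTF-8"));
            return in.readLine();
        } catch (IOException cannotHappenOnAByteArray) {
            throw new RuntimeException(cannotHappenOnAByteArray);
        }
    }

    public static void main(String[] args) throws IOException {
        // A line with a non-ASCII character survives the round trip
        // because both sides agree on UTF-8.
        byte[] data = "\"caf\u00e9\", 1, 2.0, \"Z\"\n".getBytes("UTF-8");
        System.out.println(firstLine(data));
    }
}
```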
 
Gabriel Genellina

Andrew Hobbs said:
How large are you talking about? 1 MByte is not a large file. And what do
you consider too slow? Have you tried that approach? I suspect you will
find it faster than you think. Alternatively, what about writing a parser
yourself? Look at each character in turn, using the commas as
delimiters.

Sorry, I meant between 100000 and 1 million lines, not 1MB file size.
My test file (ASCII format) is about 200 MB.
Reading the ASCII file was too slow - I'll try other ways as suggested
by other people here.
 
Scott Ellsworth

I have to pass a huge amount of data to a Java program. The source
program is not written in Java but I have control over both programs
and can arrange any suitable format at both ends.

Depends on just how much the "huge amount" ends up being, and how you
intend to use it.

I parse a data file containing matrix data for a simple lapack test. It
has an x, a y, and a double value for matrices up to 500 by 500. This
uses no tokenizers, just reading the line, splitting on the space, and
parsing the data. This 1.1M file is read in 1.357 seconds.

In a different project, I parse 10M XML files using a JDOM-based parser
in 5 seconds or so, though these are all string data without a
string-double conversion.

For both of these, it was important for me to have a format that a human
could read, and that a junior programmer could write a correct parser
for in a very short time, so I used a pure text format.

The nio package has memory mapped files, auto-endian converting byte
buffers, and other tools that make a binary representation easier to
handle.
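A sketch of the memory-mapped route for a column of doubles (the file layout and names here are my own assumptions, not anything from the thread):

```java
import java.io.*;
import java.nio.DoubleBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedDemo {

    // Write big-endian doubles to a temp file, then read them back
    // through a memory-mapped DoubleBuffer view of the file.
    static double[] readBack(double[] values) throws IOException {
        File f = File.createTempFile("records", ".bin");
        f.deleteOnExit();

        DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(f)));
        for (double d : values) out.writeDouble(d);
        out.close();

        FileInputStream fin = new FileInputStream(f);
        FileChannel ch = fin.getChannel();
        MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        DoubleBuffer doubles = map.asDoubleBuffer(); // big-endian, like DataOutputStream
        double[] result = new double[doubles.remaining()];
        doubles.get(result);
        fin.close();
        return result;
    }

    public static void main(String[] args) throws IOException {
        for (double d : readBack(new double[] {1.0, 2.5, -3.75}))
            System.out.println(d);
    }
}
```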

The key question is where your time is likely to be spent. If you have
a lot of data that has to come off the disk quickly, then a binary
format will minimize wire time. If that file needs to be curated,
parsed, read in by other languages, then human readability might become
dominant. If you only need a small subset of the data, you might be
best served by a relational database, as those are very good at
searching gigabytes of data to extract the 50k or so you wanted.

Scott
(e-mail address removed)
Java, Cocoa, WebObjects and Database consulting for the life sciences
 
Jon A. Cruz

Gabriel said:
Sorry, I meant between 100000 and 1 million lines, not 1MB file size.
My test file (ASCII format) is about 200 MB.
Reading the ASCII file was too slow - I'll try other ways as suggested
by other people here.

Again, "ASCII" is not correct.

Among other things, Java can use "ASCII" as the encoding during
conversion, but you will lose half of all possible byte values.

Not safe.
 
A. Craig West

Actually, try not to use them.
Instead use explicit byte math to get values out in an explicit order.
Since most networked applications use 'network byte order' which is
big-endian, go ahead and use that.

I'm not sure what you have against htons() and htonl(), seeing as they are
commonly available macros that convert data from the host-specific byte order
to network byte order, which is exactly what is needed. That's the whole POINT
of htons() and htonl(). While you could expand out the macros yourself (as
you did in your example), if you are doing any significant amount of data
at all you will end up writing your own anyway, so you might as well use
the common ones.
Now if it should happen that the non-java app isn't written in C or C++, then
I can see where using htons() and htonl() could be a problem...
 
Jon A. Cruz

A. Craig West said:
I'm not sure what you have against htons() and htonl(), seeing as they are
commonly available macros that convert data from the host-specific byte order
to network byte order, which is exactly what is needed.

Well, that's a lot of what I have against them.

:)

That they are macros and that they *convert* the endianness of data.

If one accidentally ends up calling them twice on the same data, then
you just undid your fixing of the data.

And, yes, I've encountered actual bugs where people had done that.


Another problem with them is that they are not guaranteed as to what
sizes they operate on. Depending on the platform and the age of the
compiler, things can be defined "interestingly".

Most modern compilers will have switched to stdint types, but that
wasn't always the case.

That's the whole POINT
of htons() and htonl().

Actually, not quite.

The whole point of them was to prep certain data for simple direct
networking support.

Most man pages describe them as "These routines are most often used
in conjunction with Internet addresses and ports as returned by
gethostent() and getservent()."


While you could expand out the macros yourself (like
you did in your example) if you are doing any significant amount of data at
all you will end up writing your own anyways, so you might as well use the
common ones.

Well, it comes down to differences in the usage also.

If you use those, then you still have to marshal the values you end up with.

Now, there are two general approaches at that point.

First, one could take the result of that call and store it in a
temporary variable. Then one could write out data by pointing to the
address of that temporary variable and writing the given number of bytes.

Second, one could take that temporary result and then send it out (or
copy it over) a single byte at a time, the way I had things listed in
that pseudocode.


Both of those have drawbacks.

For the first case, things are just "bad". That is, the code (either
writing or memcpy'ing) will have to access the internals of a variable
directly. Since that's to be avoided at all costs for structs, making an
exception for primitives makes the code inconsistent. And it leaves
things fragile in that if a maintenance programmer doesn't understand
all the subtleties of when to peek at memory and when not to, a mistake
is easy to make. Additionally, an extra temporary variable is needed to
access the guts of.

For the second case, the htonl call is unneeded, and again we have a
superfluous temporary variable.



Of course, there are two general options for IO in this manner. Either
write things directly, or marshal the bytes first before sending.

Sometimes it might be nice to have a function that writes directly. In
those cases something might be "int writeU32( int fh, uint32 u )".

For other cases, having a macro that marshals the given value into a
buffer with the proper byte order and also updating the pointer by the
number of bytes stored is nice.

In any case, using those instead of htons/htonl themselves also tends to
make the code more readable:


uint32_t tmp32;
uint16_t tmp16;

tmp32 = htonl( bar.field1 );
result = write( fh, &tmp32, sizeof(tmp32) );
tmp16 = htons( bar.field2 );
result = write( fh, &tmp16, sizeof(tmp16) );
tmp32 = htonl( bar.field3 );
result = write( fh, &tmp32, sizeof(tmp32) );


becomes

result = writeU32( fh, bar.field1 );
result = writeU16( fh, bar.field2 );
result = writeU32( fh, bar.field3 );


and


uint32_t tmp32;
uint16_t tmp16;

tmp32 = htonl( bar.field1 );
marshal( p, &tmp32, sizeof(tmp32) );
tmp16 = htons( bar.field2 );
marshal( p, &tmp16, sizeof(tmp16) );
tmp32 = htonl( bar.field3 );
marshal( p, &tmp32, sizeof(tmp32) );


becomes

marshalU32( p, bar.field1 );
marshalU16( p, bar.field2 );
marshalU32( p, bar.field3 );


Much clearer.


(Of course, remember the error checking for the routines using fh)
 
Jon Skeet

Jon A. Cruz said:
Again, "ASCII" is not correct.

Among other things, Java can use "ASCII" as the encoding during
conversion, but you will lose half of all possible byte values.

Maybe I'm missing something here, but it looks perfectly possible to me
that Gabriel's test file *was* in ASCII format. Perhaps he knows (for
whatever reason) that his data will never go out of ASCII, or *at
least* knows that his *test* data is all within ASCII.
 
Dimitri Maziuk

Gabriel Genellina sez:
I have to pass a huge amount of data to a Java program.

The dataset is a sequence of records, all records having the same
structure. This structure is only known at runtime, and it's built on
simple types like string, integer, double, etc.

What do you mean, at runtime?
Maybe a binary format is more efficient, but I don't know which could
be the best way, nor how to implement it.

Binary format is more efficient, but if your data doesn't come from
a Java program it may be too hard to do.

Give JFlex a try: I've written JFlex lexers that parse 100+Mb files in
seconds.

Dima
 
Jon A. Cruz

Jon said:
Maybe I'm missing something here, but it looks perfectly possible to me
that Gabriel's test file *was* in ASCII format. Perhaps he knows (for
whatever reason) that his data will never go out of ASCII, or *at
least* knows that his *test* data is all within ASCII.

In that case, it *was* in UTF-8 also. :)

However, from the context of the entire thread, it seems quite clear
that he's using "ASCII" as a synonym for "Plain text". Among other
things, his initial post contrasts "I could use an ASCII file..." to
"Maybe a binary format is more efficient...".

And a key clue is his next phrase "but I don't know which could
be the best way". This really goes to show he's comparing "ASCII file"
to "binary format".
 
Gabriel Genellina

Jon Skeet said:
Maybe I'm missing something here, but it looks perfectly possible to me
that Gabriel's test file *was* in ASCII format. Perhaps he knows (for
whatever reason) that his data will never go out of ASCII, or *at
least* knows that his *test* data is all within ASCII.

Both were true... I should have written "plain text file" instead of
ASCII file, sorry - I came from the dark ages, before MIME and Unicode
were born...
And in fact my test file is just ASCII - its contents were randomly
generated using just uppercase A-Z letters plus spaces.
 
