Efficient format for huge amount of data

Discussion in 'Java' started by Gabriel Genellina, Jan 20, 2004.

  1. I have to pass a huge amount of data to a Java program. The source
    program is not written in Java but I have control over both programs
    and can arrange any suitable format at both ends.

    The dataset is a sequence of records, all records having the same
    structure. This structure is only known at runtime, and it's built on
    simple types like string, integer, double, etc.

    I could use an ASCII file to transfer data, like this:

    "A string", 123, 4.567, "X"
    "Another string", 89, 10.0, "Y"
    "Third line", -1, 0.0, "Z"
    .... many more lines, 100K or 1M ...

    but AFAIK to parse it I have to use a BufferedReader + StringTokenizer
    for each line + the various wrapper classes like Integer, Double... I
    think this may be very slow for a large file.

    Maybe a binary format is more efficient, but I don't know which could
    be the best way, nor how to implement it.
    I've considered using Java serialization, but since the source
    program is not written in Java it may be hard to replicate the
    serialization format exactly - btw, where is it documented? If
    documented at all...

    Any ideas are welcome.
    Thanks,

    Gabriel Genellina
    Softlab SRL
     
    Gabriel Genellina, Jan 20, 2004
    #1

  2. Gabriel Genellina:

    [...]

    >Maybe a binary format is more efficient, but I don't know which could
    >be the best way, nor how to implement it.


    There are DataInputStream and DataOutputStream. Both have read and
    write methods for the primitive types of Java and for Strings. Byte
    order is big endian, the valid intervals for the primitive types are
    defined in the Java specs (e.g. char from 0 to 65535), and the format
    of String serialization is described in the API docs of read/writeUTF.

    So if an element would be like the data you described above, an
    element class could be:

    class Element {
        String s;
        int i;
        float f;
        String s2;
    }

    And reading and writing could work like that:

    Element read(DataInputStream in) throws IOException {
        Element elem = new Element();
        elem.s = in.readUTF();
        elem.i = in.readInt();
        elem.f = in.readFloat();
        elem.s2 = in.readUTF();
        return elem;
    }

    void write(DataOutputStream out, Element elem) throws IOException {
        out.writeUTF(elem.s);
        out.writeInt(elem.i);
        out.writeFloat(elem.f);
        out.writeUTF(elem.s2);
    }

    There is no single best way of doing persistent storage. Personally
    I'd work with databases whenever it's feasible. I don't like self-made
    binary formats like the above very much. You can't change things
    easily, at least not if you have to convert existing data from binary
    format A to B. Other people will have to study your format and write
    and maintain dedicated code.

    However, the format is more efficient (less space and faster to parse)
    than ASCII text.

    Regards,
    Marco
    --
    Please reply in the newsgroup, not by email!
    Java programming tips: http://jiu.sourceforge.net/javatips.html
    Other Java pages: http://www.geocities.com/marcoschmidt.geo/java.html
     
    Marco Schmidt, Jan 20, 2004
    #2

  3. Marco Schmidt wrote:

    > There are DataInputStream and DataOutputStream. Both have read and
    > write method for the primitive types of Java and Strings. Byte order
    > is big endian


    So be sure to use htons() / htonl() in the non-Java app before stuffing
    the data on the stream.
     
    Thomas Schodt, Jan 20, 2004
    #3
  4. "Gabriel Genellina" <> wrote in message
    news:...
    <snip>
    > but AFAIK to parse it I have to use a BufferedReader + StringTokenizer
    > for each line + the various wrapper classes like Integer, Double... I
    > think this may be very slow for a large file.


    How large are you talking about? 1 MByte is not a large file. And
    what do you consider too slow? Have you tried that approach? I
    suspect you will find it faster than you think. Alternatively, what
    about writing a parser yourself, looking at each character in turn
    and using the commas as delimiters?
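A character-by-character parser along those lines might look like this minimal sketch (the class name, the quote handling, and the trimming are illustrative assumptions, not code from the thread):

```java
import java.util.ArrayList;
import java.util.List;

public class LineParser {
    // Split one record line on commas, honouring double quotes,
    // without StringTokenizer. Layout follows the example:
    // "A string", 123, 4.567, "X"
    public static List<String> parseLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;              // toggle quoted state, drop the quote
            } else if (c == ',' && !inQuotes) {
                fields.add(cur.toString().trim()); // comma outside quotes ends a field
                cur.setLength(0);
            } else {
                cur.append(c);
            }
        }
        fields.add(cur.toString().trim());         // last field has no trailing comma
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(parseLine("\"A string\", 123, 4.567, \"X\""));
    }
}
```

From there, Integer.parseInt and Double.parseDouble on the numeric fields return primitives directly, avoiding the wrapper-object overhead.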

    We wrote our own parser and reading a 1 MByte file off disc, parsing it into
    floats and strings and then drawing the 3D structure that it represents
    takes a fraction of a second. If you want to see what I mean then log onto
    www.metasense.com.au and try the free trial version. Click on the Chemistry
    and then the DNA folder and try out some of those molecules. The largest is
    almost 1 M in size and it loads and displays on my machine in about 1/2
    second. It might take longer for you depending upon the speed of your
    connection.

    Cheers

    Andrew

    --
    ********************************************************
    Andrew Hobbs PhD

    MetaSense Pty Ltd - www.metasense.com.au
    12 Ashover Grove
    Carine W.A.
    Australia 6020

    61 8 9246 2026


    *********************************************************



     
    Andrew Hobbs, Jan 20, 2004
    #4
  5. "Gabriel Genellina" <> wrote in message
    news:...
    <snip>
    > but AFAIK to parse it I have to use a BufferedReader + StringTokenizer
    > for each line + the various wrapper classes like Integer, Double... I
    > think this may be very slow for a large file.


    I wouldn't worry too much about speed. I've written something very
    similar, and was able to parse a 600 MB text file using the method
    above in about a minute. Your case may be a bit more time-consuming,
    but it will probably still be fast enough.

    Christian
     
    Christian Holm, Jan 20, 2004
    #5
  6. Gabriel Genellina wrote:
    > I have to pass a huge amount of data

    [...]
    > ... many more lines, 100K or 1M ...


    1M is not a huge amount of data. I eat that for breakfast - twice :)

    > but AFAIK to parse it I have to use a BufferedReader + StringTokenizer
    > for each line + the various wrapper classes like Integer, Double... I
    > think this may be very slow for a large file.


    Try it. Slow is a relative term, but I don't think you will get in
    trouble here.

    > Maybe a binary format is more efficient, but I don't know which could
    > be the best way, nor how to implement it.


    A ByteBuffer might be the fastest.
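As a rough illustration of the ByteBuffer idea (the int/float pair is borrowed from the example record; this is a sketch of the packing, not a full record reader):

```java
import java.nio.ByteBuffer;

public class ByteBufferDemo {
    // Pack one int and one float into a buffer and read them back.
    // ByteBuffer is big-endian by default, matching DataOutputStream.
    public static boolean roundTripOk(int i, float f) {
        ByteBuffer buf = ByteBuffer.allocate(8);
        buf.putInt(i).putFloat(f);
        buf.flip();                       // switch from writing to reading
        return buf.getInt() == i && buf.getFloat() == f;
    }

    public static void main(String[] args) {
        System.out.println(roundTripOk(123, 4.567f)); // prints true
    }
}
```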

    > I've considered using Serialized, but since the source program is not
    > written in Java it may be hard to replicate exactly the Serialized
    > format - btw, where is it documented? if documented at all...


    AFAIR the low-level details are documented in the
    Data[Output|Input]Stream or Object[Input|Output]Stream API
    documentation. There is also a separate serialization spec on Sun's
    Java web site.

    /Thomas
     
    Thomas Weidenfeller, Jan 20, 2004
    #6
  7. "Gabriel Genellina" <> wrote in message
    news:...
    <snip>
    > I could use an ASCII file to transfer data, like this:
    >
    > "A string", 123, 4.567, "X"
    > "Another string", 89, 10.0, "Y"
    > "Third line", -1, 0.0, "Z"
    > ... many more lines, 100K or 1M ...


    I would put one value per line. This avoids tokenizing and
    the file size doesn't change much.
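Reading that layout back needs nothing but readLine() calls, roughly like this (the four-field record is assumed from the original post; the Object[] holder is just for illustration):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class OneValuePerLine {
    // One field per line: string, int, double, string.
    public static Object[] readRecord(BufferedReader in) throws IOException {
        String s  = in.readLine();
        int    i  = Integer.parseInt(in.readLine());
        double d  = Double.parseDouble(in.readLine());
        String s2 = in.readLine();
        return new Object[] { s, i, d, s2 };
    }

    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(
                new StringReader("A string\n123\n4.567\nX\n"));
        Object[] rec = readRecord(in);
        System.out.println(rec[0]); // prints A string
    }
}
```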
     
    nos, Jan 20, 2004
    #7
  8. "Gabriel Genellina" <> wrote in message
    news:...
    <snip>
    > but AFAIK to parse it I have to use a BufferedReader + StringTokenizer
    > for each line + the various wrapper classes like Integer, Double... I
    > think this may be very slow for a large file.


    A StreamTokenizer would be much more flexible and you would only need to
    create one.
    Using the flag to set end-of-line as a token would let you tell when each
    line ends.
    Bill
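The StreamTokenizer approach might be sketched like this (whitespace-separated fields rather than commas, to lean on the default syntax table; the counting logic is only for demonstration):

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;

public class TokenizerDemo {
    // Count the numeric tokens on the first line. eolIsSignificant(true)
    // makes the tokenizer report end-of-line as its own TT_EOL token,
    // which marks the record boundary.
    public static int countNumbersOnFirstLine(String data) throws IOException {
        StreamTokenizer st = new StreamTokenizer(new StringReader(data));
        st.eolIsSignificant(true);
        int numbers = 0;
        int tok;
        while ((tok = st.nextToken()) != StreamTokenizer.TT_EOF) {
            if (tok == StreamTokenizer.TT_EOL) break;    // first record done
            if (tok == StreamTokenizer.TT_NUMBER) numbers++;
        }
        return numbers;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(countNumbersOnFirstLine(
                "\"A string\" 123 4.567\n\"Another string\" 89 10.0\n"));
    }
}
```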

     
    William Brogden, Jan 20, 2004
    #8
  9. I doubt that speed will be an issue for you.

    I've been working on some address handling software for a mate,
    comma-delimited records, file-size usually around the 3-4Mb mark,
    using BufferedReader and StringTokenizer for parsing - it generally
    takes a minute or so to process (and it looks like the in-memory
    processing I'm doing is considerably more complex than your
    requirements).

    Try it and see!

    - sarge
     
    Chris, Jan 20, 2004
    #9
  10. Thomas Schodt wrote:
    >
    > So be sure to use htons() / htonl() in the non-Java app before stuffing
    > the data on the stream.


    Actually, try not to use them.

    Instead use explicit byte math to get values out in an explicit order.

    Since most networked applications use 'network byte order' which is
    big-endian, go ahead and use that.

    to give you the rough idea:

    void write32( char* dst, uint32 u )
    {
        *dst++ = (u >> 24) & 0x0ff;
        *dst++ = (u >> 16) & 0x0ff;
        *dst++ = (u >> 8) & 0x0ff;
        *dst++ = (u >> 0) & 0x0ff;
    }
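On the Java side, the matching big-endian reassembly uses the same explicit byte math (the method name here is made up; DataInputStream.readInt does effectively the same thing internally):

```java
public class ByteMath {
    // Rebuild a 32-bit value from four big-endian bytes.
    // The & 0xff masks undo Java's sign extension of byte.
    public static int readU32(byte[] src, int off) {
        return ((src[off]     & 0xff) << 24)
             | ((src[off + 1] & 0xff) << 16)
             | ((src[off + 2] & 0xff) <<  8)
             |  (src[off + 3] & 0xff);
    }

    public static void main(String[] args) {
        byte[] b = { 0x00, 0x00, 0x01, 0x02 };
        System.out.println(readU32(b, 0)); // prints 258
    }
}
```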
     
    Jon A. Cruz, Jan 20, 2004
    #10
  11. Gabriel Genellina wrote:
    >
    > I could use an ASCII file to transfer data, like this:
    >


    Probably not, since an "ASCII" file would be limited to 7-bit data,
    and would lose things. It's very important, especially in the Java
    world, to remember that "ASCII" is *not* a synonym for "plain text".

    Most of the MS Windows documentation uses "ANSI" as a term for 8-bit
    text. "ASCII" is much more limited, and is an actual encoding in
    Java's character conversions. You'll hit a lot of subtle errors
    telling Java applications that you want "ASCII" data when that's not
    really what you need.


    > "A string", 123, 4.567, "X"
    > "Another string", 89, 10.0, "Y"
    > "Third line", -1, 0.0, "Z"
    > ... many more lines, 100K or 1M ...
    >
    > but AFAIK to parse it I have to use a BufferedReader + StringTokenizer
    > for each line + the various wrapper classes like Integer, Double... I
    > think this may be very slow for a large file.


    As long as you wrap IO in one of the buffered types, speed probably
    won't be a problem on only 1MB.


    HOWEVER... there's another gotcha. Readers use some encoding to convert
    from 8-bit encodings to internal Java strings which are UTF-16. You'll
    probably want to be very explicit on the encoding used. UTF-8 is
    probably very good for your needs.
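Being explicit might look like this sketch (StandardCharsets is a convenience from later JDKs; on the JDKs of the day you would pass the charset name "UTF-8" as a string):

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class ExplicitEncoding {
    // Wrap a byte stream with an explicitly named encoding instead of
    // relying on the platform default.
    public static String firstLine(byte[] utf8Bytes) throws IOException {
        BufferedReader in = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(utf8Bytes),
                                      StandardCharsets.UTF_8));
        return in.readLine();
    }

    public static void main(String[] args) throws IOException {
        // \u00e9 is e-acute, a character plain ASCII cannot carry.
        byte[] data = "h\u00e9llo\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(firstLine(data));
    }
}
```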
     
    Jon A. Cruz, Jan 20, 2004
    #11
  12. "Andrew Hobbs" <> wrote in message news:<400cf5b6$0$1745$>...

    > > The dataset is a sequence of records, all records having the same
    > > structure. This structure is only known at runtime, and it's built on
    > > simple types like string, integer, double, etc.
    > >
    > > I could use an ASCII file to transfer data, like this:
    > >
    > > "A string", 123, 4.567, "X"
    > > "Another string", 89, 10.0, "Y"
    > > "Third line", -1, 0.0, "Z"
    > > ... many more lines, 100K or 1M ...
    > >

    >
    > How large are you talking about. 1 Mbyte is not a large file. And what do
    > you consider too slow? Have you tried that approach. I suspect you will
    > find it faster than you think. Alternatively what about writing a parser
    > yourself. Look at each character in turn and using the commas as
    > delimiters.


    Sorry, I meant between 100000 and 1 million lines, not 1MB file size.
    My test file (ASCII format) is about 200 MB.
    Reading the ASCII file was too slow - I'll try other ways as suggested
    by other people here.
     
    Gabriel Genellina, Jan 20, 2004
    #12
  13. In article <>,
    (Gabriel Genellina) wrote:

    > I have to pass a huge amount of data to a Java program. The source
    > program is not written in Java but I have control over both programs
    > and can arrange any suitable format at both ends.


    Depends on just how much the "huge amount" ends up being, and how you
    intend to use it.

    I parse a data file containing matrix data for a simple lapack test. It
    has an x, a y, and a double value for matrices up to 500 by 500. This
    uses no tokenizers, just reading the line, splitting on the space, and
    parsing the data. This 1.1M file is read in 1.357 seconds.

    In a different project, I parse 10M XML files using a JDOM-based parser
    in 5 seconds or so, though these are all string data without a
    string-double conversion.

    For both of these, it was important for me to have a format that a human
    could read, and that a junior programmer could write a correct parser
    for in a very short time, so I used a pure text format.

    The nio package has memory mapped files, auto-endian converting byte
    buffers, and other tools that make a binary representation easier to
    handle.
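A sketch of the memory-mapped route (java.nio.file here postdates this thread; the file contents, a run of big-endian doubles, are invented for illustration):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;

public class MappedDemo {
    // Map the whole file and sum `count` doubles straight out of the
    // mapping; no explicit read() calls or intermediate copies.
    public static double sumDoubles(Path file, int count) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r");
             FileChannel ch = raf.getChannel()) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            double sum = 0;
            for (int i = 0; i < count; i++) {
                sum += buf.getDouble();   // big-endian by default
            }
            return sum;
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("records", ".bin");
        ByteBuffer out = ByteBuffer.allocate(16);
        out.putDouble(1.5).putDouble(2.5);
        Files.write(tmp, out.array());    // array() is the full 16 bytes
        System.out.println(sumDoubles(tmp, 2)); // prints 4.0
    }
}
```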

    The key question is where your time is likely to be spent. If you have
    a lot of data that has to come off the disk quickly, then a binary
    format will minimize wire time. If that file needs to be curated,
    parsed, read in by other languages, then human readability might become
    dominant. If you only need a small subset of the data, you might be
    best served by a relational database, as those are very good at
    searching gigabytes of data to extract the 50k or so you wanted.

    Scott

    Java, Cocoa, WebObjects and Database consulting for the life sciences
     
    Scott Ellsworth, Jan 20, 2004
    #13
  14. Gabriel Genellina wrote:
    > Sorry, I meant between 100000 and 1 million lines, not 1MB file size.
    > My test file (ASCII format) is about 200 MB.
    > Reading the ASCII file was too slow - I'll try other ways as suggested
    > by other people here.


    Again, "ASCII" is not correct.

    Among other things, Java can use "ASCII" as the encoding during
    conversion, but you will lose half of all possible byte values.

    Not safe.
     
    Jon A. Cruz, Jan 20, 2004
    #14
  15. Jon A. Cruz <> wrote:
    > Thomas Schodt wrote:
    >>
    >> So be sure to use htons() / htonl() in the non-Java app before stuffing
    >> the data on the stream.


    > Actually, try not to use them.


    > Instead use explicit byte math to get values out in an explicit order.


    > Since most networked applications use 'network byte order' which is
    > big-endian, go ahead and use that.


    I'm not sure what you have against htons() and htonl(), seeing as they
    are commonly available macros that convert data from the host-specific
    byte order to network byte order, which is exactly what is needed.
    That's the whole POINT of htons() and htonl(). While you could expand
    out the macros yourself (like you did in your example), if you are
    doing any significant amount of data at all you will end up writing
    your own anyway, so you might as well use the common ones.
    Now if it should happen that the non-Java app isn't written in C or
    C++, then I can see where using htons() and htonl() could be a
    problem...

    --
    Craig West Ph: (416) 666-1645 | It's not a bug,
    | It's a feature...
     
    A. Craig West, Jan 21, 2004
    #15
  16. A. Craig West wrote:
    >
    > I'm not sure what you have against htons() and htonl(), seeing as they are
    > commonly available macros that convert data from the host-specific byte order
    > to network byte order, which is exactly what is needed.


    Well, that's a lot of what I have against them.

    :)

    That they are macros and they *convert* the endianness of data.

    If one accidentally ends up calling them twice on the same data, then
    you just undid your fixing of the data.

    And, yes, I've encountered actual bugs where people had done that.


    Another problem with them is that they are not guaranteed as to what
    sizes they operate on. Depending on the platform and the age of the
    compiler, things can be defined "interestingly".

    Most modern compilers will have switched to stdint types, but that
    wasn't always the case.


    > That's the whole POINT
    > of htons() and htonl().


    Actually, not quite.

    The whole point of them was to prep certain data for simple direct
    networking support.

    Most man pages describe them as "These routines are most often used
    in conjunction with Internet addresses and ports as returned by
    gethostent() and getservent()."



    > While you could expand out the macros yourself (like
    > you did in your example) if you are doing any significant amount of data at
    > all you will end up writing your own anyways, so you might as well use the
    > common ones.


    Well, it comes down to differences in the usage also.

    If you use those, then you still have to marshal the values you end up with.

    Now, there are two general approaches at that point.

    First, one could take the result of that call and store it in a
    temporary variable. Then one could write out data by pointing to the
    address of that temporary variable and writing the given number of
    bytes.

    Second, one could take that temporary result and then send it out (or
    copy it over) a single byte at a time, the way I had things listed in
    that pseudocode.


    Both of those have drawbacks.

    For the first case, things are just "bad". That is, the code (either
    writing or memcpy'ing) will have to access the internals of a variable
    directly. Since that's to be avoided at all costs for structs, making an
    exception for primitives makes the code inconsistent. And it leaves
    things fragile in that if a maintenance programmer doesn't understand
    all the subtleties of when to peek at memory and when not to, a mistake
    is easy to make. Additionally, an extra temporary variable is needed to
    access the guts of.

    For the second case, the htonl call is unneeded, and again we have a
    superfluous temporary variable.



    Of course, there are two general options for IO in this manner. Either
    write things directly, or marshal the bytes first before sending.

    Sometimes it might be nice to have a function that writes directly. In
    those cases something might be "int writeU32( int fh, uint32 u )".

    For other cases, having a macro that marshals the given value into a
    buffer with the proper byte order and also updating the pointer by the
    number of bytes stored is nice.

    In any case, using those instead of htons/htonl themselves also tends to
    make the code more readable:


    uint32_t tmp32;
    uint16_t tmp16;

    tmp32 = htonl( bar.field1 );
    result = write( fh, &tmp32, sizeof(tmp32) );
    tmp16 = htons( bar.field2 );
    result = write( fh, &tmp16, sizeof(tmp16) );
    tmp32 = htonl( bar.field3 );
    result = write( fh, &tmp32, sizeof(tmp32) );


    becomes

    result = writeU32( fh, bar.field1 );
    result = writeU16( fh, bar.field2 );
    result = writeU32( fh, bar.field3 );


    and


    uint32_t tmp32;
    uint16_t tmp16;

    tmp32 = htonl( bar.field1 );
    marshal( p, &tmp32, sizeof(tmp32) );
    tmp16 = htons( bar.field2 );
    marshal( p, &tmp16, sizeof(tmp16) );
    tmp32 = htonl( bar.field3 );
    marshal( p, &tmp32, sizeof(tmp32) );


    becomes

    marshalU32( p, bar.field1 );
    marshalU16( p, bar.field2 );
    marshalU32( p, bar.field3 );


    Much clearer.


    (Of course, remember the error checking for the routines using fh)
     
    Jon A. Cruz, Jan 21, 2004
    #16
  17. Jon A. Cruz <> wrote:
    > Gabriel Genellina wrote:
    > > Sorry, I meant between 100000 and 1 million lines, not 1MB file size.
    > > My test file (ASCII format) is about 200 MB.
    > > Reading the ASCII file was too slow - I'll try other ways as suggested
    > > by other people here.

    >
    > Again, "ASCII" is not correct.
    >
    > Among other things, Java can use "ASCII" as then coding during
    > conversion, but you will lose 50% of all possible data.


    Maybe I'm missing something here, but it looks perfectly possible to me
    that Gabriel's test file *was* in ASCII format. Perhaps he knows (for
    whatever reason) that his data will never go out of ASCII, or *at
    least* knows that his *test* data is all within ASCII.

    --
    Jon Skeet - <>
    http://www.pobox.com/~skeet
    If replying to the group, please do not mail me too
     
    Jon Skeet, Jan 21, 2004
    #17
  18. Gabriel Genellina sez:
    > I have to pass a huge amount of data to a Java program.
    >
    > The dataset is a sequence of records, all records having the same
    > structure. This structure is only known at runtime, and it's built on
    > simple types like string, integer, double, etc.


    What do you mean, at runtime?

    > Maybe a binary format is more efficient, but I don't know which could
    > be the best way, nor how to implement it.


    Binary format is more efficient, but if your data doesn't come from
    a Java program it may be too hard to do.

    Give JFlex a try: I've written JFlex lexers that parse 100+ MB files
    in seconds.

    Dima
    --
    Q276304 - Error Message: Your Password Must Be at Least 18770 Characters
    and Cannot Repeat Any of Your Previous 30689 Passwords -- RISKS 21.37
     
    Dimitri Maziuk, Jan 21, 2004
    #18
  19. Jon Skeet wrote:
    >
    > Maybe I'm missing something here, but it looks perfectly possible to me
    > that Gabriel's test file *was* in ASCII format. Perhaps he knows (for
    > whatever reason) that his data will never go out of ASCII, or *at
    > least* knows that his *test* data is all within ASCII.
    >


    In that case, it *was* in UTF-8 also. :)

    However, from the context of the entire thread, it seems quite clear
    that he's using "ASCII" as a synonym for "Plain text". Among other
    things, his initial post contrasts "I could use an ASCII file..." to
    "Maybe a binary format is more efficient...".

    And a key clue is his next phrase "but I don't know which could
    be the best way". This really goes to show he's comparing "ASCII file"
    to "binary format".
     
    Jon A. Cruz, Jan 21, 2004
    #19
  20. Jon Skeet <> wrote in message news:<MPG.1a788f1b87f7cc4c989723@10.1.1.42>...

    > > > Reading the ASCII file was too slow - I'll try other ways as suggested
    > > > by other people here.

    > >
    > > Again, "ASCII" is not correct.
    > >
    > > Among other things, Java can use "ASCII" as then coding during
    > > conversion, but you will lose 50% of all possible data.

    >
    > Maybe I'm missing something here, but it looks perfectly possible to me
    > that Gabriel's test file *was* in ASCII format. Perhaps he knows (for
    > whatever reason) that his data will never go out of ASCII, or *at
    > least* knows that his *test* data is all within ASCII.


    Both were true... I should have written "plain text file" instead of
    ASCII file, sorry - I came from the dark ages, before MIME and Unicode
    were born...
    And in fact my test file is just ASCII - its contents were randomly
    generated using just uppercase A-Z letters plus spaces.
     
    Gabriel Genellina, Jan 22, 2004
    #20
