Lots of Data

Discussion in 'Java' started by RJ, Jul 8, 2003.

  1. RJ

    RJ Guest

    I've got the following problem. I'm reading tab-separated text files
    into an array of objects, each of which represents one line. Each object
    contains another array where the individual tab-separated fields are
    stored. When I try to read more than 40000 rows with more than 80 columns,
    I get an OutOfMemoryError.
    What can I do to have quick access to that data and save memory so
    that my application will also work with lots of data?
     
    RJ, Jul 8, 2003
    #1

  2. Marco Schmidt

    Steve Horsley wrote:

    >One thing that I find works well here is to share all common Strings that
    >you read from a file. So that if you find many instances of the word "yes"
    >in the file, you simply store a reference to the existing string "yes". I
    >do this by checking every string I retrieve against a HashMap (storing the
    >String as both key and value). If the string is already in the HashMap, I
    >use that one instead.


    Isn't that what intern() is for?

    Regards,
    Marco
    --
    Please reply in the newsgroup, not by email!
    Java programming tips: http://jiu.sourceforge.net/javatips.html
    Other Java pages: http://www.geocities.com/marcoschmidt.geo/java.html
     
    Marco Schmidt, Jul 8, 2003
    #2

  3. Thomas Weidenfeller

    RJ writes:
    > What can I do to have quick access to that data and save memory so
    > that my application will also work with lots of data?


    E.g. use a database instead. Or compress, or cleverly encode your data
    if you really need it in memory.

    /Thomas
     
    Thomas Weidenfeller, Jul 8, 2003
    #3
  4. Jon A. Cruz

    Jon A. Cruz Guest

    Marco Schmidt wrote:
    >
    > Isn't that what intern() is for?
    >


    Kinda...

    except intern() has some problems, like keeping the strings locked in
    memory forever.
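
    A minimal sketch of the HashMap-based pooling alternative (class and
    method names here are illustrative, not from the thread):

    <code>
    import java.util.HashMap;
    import java.util.Map;

    // Shares identical column values so that a string like "yes" is kept
    // only once, without relying on String.intern(), which pins strings
    // in the VM-internal pool.
    public class StringPool {
        private final Map pool = new HashMap();

        public String canonical(String s) {
            String existing = (String) pool.get(s);
            if (existing != null) {
                return existing;   // reuse the copy we already hold
            }
            pool.put(s, s);        // first occurrence of this value
            return s;
        }
    }
    </code>

    While parsing, you would then store row[i] = pool.canonical(token)
    instead of the freshly created token, and the pool can be discarded
    once loading is finished.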
     
    Jon A. Cruz, Jul 8, 2003
    #4
  5. dhek bhun kho

    Thomas Weidenfeller wrote on Tue, 08 Jul 2003 13:20:19 +0000:

    > (RJ) writes:
    >> What can I do to have quick access to that data and save memory so
    >> that my application will also work with lots of data?

    >
    > E.g. use a database instead. Or compress, or cleverly encode your data
    > if you really need it in memory.
    >
    > /Thomas


    If the data is stored with an explicit ordering, then you can get away with
    implementing some form of cache and accessing the file as a
    java.io.RandomAccessFile.

    It takes a little more thought to get the coding done, but it reduces the
    memory usage, and random access is fast. 40,000 lines of text isn't that much.

    You can do a binary search if the data is sorted, or use something like
    Knuth-Morris-Pratt to search through the file for a specific key (if the
    keys aren't too small).

    I did try KMP once, and it was quite easy to code; search times were
    relatively small for a single key (<10 ms on a cheap ATA-100 IDE drive, in a
    file of about 50,000 variable-length records).
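
    For the sorted-file case, a rough sketch of the binary search over
    java.io.RandomAccessFile might look like this. The fixed record length
    and key layout are assumptions for illustration; with variable-length
    lines you would need a small index of line offsets instead.

    <code>
    import java.io.IOException;
    import java.io.RandomAccessFile;

    // Binary search in a file of fixed-length records, sorted by the key
    // that starts each record. RECORD_LENGTH is an assumed value.
    public class RecordSearch {
        static final int RECORD_LENGTH = 128;

        // Returns the record starting with the given key, or null.
        public static String find(RandomAccessFile file, String key)
                throws IOException {
            long low = 0;
            long high = file.length() / RECORD_LENGTH - 1;
            byte[] buf = new byte[RECORD_LENGTH];
            while (low <= high) {
                long mid = (low + high) / 2;
                file.seek(mid * RECORD_LENGTH);
                file.readFully(buf);
                String record = new String(buf, "ISO-8859-1");
                int cmp = record.substring(0, key.length()).compareTo(key);
                if (cmp == 0) return record;
                if (cmp < 0) low = mid + 1; else high = mid - 1;
            }
            return null;
        }
    }
    </code>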

    Greets
    Bhun.
     
    dhek bhun kho, Jul 9, 2003
    #5
  6. Thomas Weidenfeller

    dhek bhun kho writes:
    > If the data is stored with an explicit ordering, then you can get away with
    > implementing some form of cache and accessing the file as a
    > java.io.RandomAccessFile.
    >
    > It takes a little more thought to get the coding done, but it reduces the
    > memory usage, and random access is fast. 40,000 lines of text isn't that much.


    I know. I just didn't want to tell the original poster that his idea of
    first reading in everything into memory is a !@#$%^&* idea. Instead I
    wanted to provide an off-the-shelf solution. Of course, writing your
    own little special-purpose "database" and search code is also an
    option.

    I see this "files? what the heck are files? Give me memory!" attitude
    more and more often, also in this group. E.g. recently someone posted
    his source code to copy the contents of one directory to another
    directory. For each file, the code was along the lines of:

    * get length of file (which is a "long")

    * cast length to "int" (which can still be a rather large number)

    * allocate byte array of that (truncated) length

    * read the complete file (or what remains after the truncation)
    into the array

    * write the array to the destination file

    There isn't much more that one can do wrong here, is there? I guess it is
    a generation thing. I am not that old :) but I had to work with memory
    sizes that today we all use up for a simple String before the first cup
    of tea in the morning.
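
    For contrast, a minimal sketch of a copy loop that avoids both the cast
    and the full in-memory buffer (this is not the code from that posting):

    <code>
    import java.io.*;

    public class FileCopy {
        // Copies a file of any length using a small fixed-size buffer,
        // instead of reading the whole (possibly truncated) file into
        // one byte array.
        public static void copy(File src, File dst) throws IOException {
            InputStream in = new FileInputStream(src);
            try {
                OutputStream out = new FileOutputStream(dst);
                try {
                    byte[] buffer = new byte[8192];
                    int read;
                    while ((read = in.read(buffer)) != -1) {
                        out.write(buffer, 0, read);
                    }
                } finally {
                    out.close();
                }
            } finally {
                in.close();
            }
        }
    }
    </code>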

    /Thomas
     
    Thomas Weidenfeller, Jul 9, 2003
    #6
  7. dhek bhun kho

    Thomas Weidenfeller wrote on Wed, 09 Jul 2003 07:06:15 +0000:

    > dhek bhun kho writes:
    >> If the data is stored with an explicit ordering, then you can get away with
    >> implementing some form of cache and accessing the file as a
    >> java.io.RandomAccessFile.
    >>
    >> It takes a little more thought to get the coding done, but it reduces the
    >> memory usage, and random access is fast. 40,000 lines of text isn't that much.

    >
    > I know. I just didn't want to tell the original poster that his idea of
    > first reading in everything into memory is a !@#$%^&* idea. Instead I
    > wanted to provide an off-the-shelf solution. Of course, writing your
    > own little special-purpose "database" and search code is also an
    > option.


    HSQLDB (at sourceforge) provides JDBC access to a non-networked database
    server if you want a quick fix.
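
    Getting started with an in-process HSQLDB looks roughly like this (driver
    class name and JDBC URL as documented by HSQLDB; the table layout and
    database file name are made up for the example, so check the docs of the
    version you actually use):

    <code>
    import java.sql.*;

    public class CsvToHsqldb {
        public static void main(String[] args) throws Exception {
            Class.forName("org.hsqldb.jdbcDriver");
            Connection con =
                DriverManager.getConnection("jdbc:hsqldb:file:rowsdb", "sa", "");
            Statement st = con.createStatement();
            st.execute("CREATE TABLE rows (id INTEGER PRIMARY KEY, "
                + "col1 VARCHAR(255), col2 VARCHAR(255))");

            // One INSERT per line of the tab-separated file.
            PreparedStatement ins =
                con.prepareStatement("INSERT INTO rows VALUES (?, ?, ?)");
            ins.setInt(1, 1);
            ins.setString(2, "first field");
            ins.setString(3, "second field");
            ins.executeUpdate();

            // Later lookups are plain SQL instead of scanning arrays in memory.
            ResultSet rs = st.executeQuery("SELECT col2 FROM rows WHERE id = 1");
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
            con.close();
        }
    }
    </code>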

    [--the following is quite off-topic--]

    :D Just a quick question: Does memory mapping put the whole file into
    memory, or does it just keep a part of the file memory resident? (I haven't
    tried this yet.) Otherwise memory mapping the file would be a good
    solution. The nio.*-interfaces are a bit awkward at times to me.

    A bonus of the off-the-shelf solutions is that, on the one hand, they work
    well to get things done initially. On the other hand they tend to drive the
    hardware and environment requirements up to stupendous amounts of RAM and
    software just to get it done. It's a bit strange to use a real
    database server just because the programmer does not know how to manipulate
    large files.

    And in some cases the speed gain is actually fake. For example, I code on a
    512 MB RAM machine, but I really can't require users to have the same
    environment. And once the OS itself starts to swap things in and out of
    memory, you are actually experiencing a much worse slowdown in EVERYTHING
    you do with the computer.

    Like a few weeks ago there was a posting about how to code a servlet
    doing a MIME/multipart upload. It's not that hard once you understand what
    is actually coming over the connection (just dump the data coming in
    and "you save a nickel!"). I have the code lying around somewhere...

    I just dislike the fact that there is this general attitude on usenet,
    that once you post something that's not 100% working, your 'reputation'
    is totally bummed out if somebody googles you. It wasn't always like this.
    It's so irritating that there is almost this requirement to post using an
    alias, and never post anything that might be wrong. I really can't
    remember signing an SLA contract with anybody.

    > I see this "files? what the heck are files? Give me memory!" attitude
    > more and more often, also in this group. E.g. recently someone posted
    > his source code to copy the contents of one directory to another
    > directory. For each file, the code was along the lines of:
    >
    > * get length of file (which is a "long")
    >
    > * cast length to "int" (which can still be a rather large number)
    >
    > * allocate byte array of that (truncated) length
    >
    > * read the complete file (or what remains after the truncation)
    > into the array
    >
    > * write the array to the destination file
    >
    > There isn't much more that one can do wrong here, is there? I guess it is
    > a generation thing. I am not that old :) but I had to work with memory
    > sizes that today we all use up for a simple String before the first cup
    > of tea in the morning.


    It has nothing to do with generation differences. I am 24, turning
    25 next week; it's more like a dependency syndrome. Some major vendor
    is turning the majority of us into API addicts. In the end the real knowledge
    of being able to solve a problem is lost.

    APIs are fun and a real blessing in most cases, but once you cross
    over the line where you can't solve a problem without using very high-level
    APIs: what good are you as a programmer? (Don't flame me about this, I
    don't consider myself the ueberprogrammer at all; some of those folks are
    hanging around in the top 1000 rankings of www.topcoder.com.)

    I wouldn't consider myself any better than an Office XP user that talks to
    his machine. (Just imagine everyone at an office department talking to
    their machine; "We saved one nickel! We saved one nickel!" I would apply
    for a permanent position at a sanatorium immediately.)

    >
    > /Thomas


    Off-topic:

    My wishlist for the future of the development of the jdk is:

    1) faster startup time of the JRE
    2) better memory management of the JRE
    3) being able to tweak the initial environment of the JRE
    (anybody noticed that the JRE 1.4 now fires up several extra threads for
    the filesystem preferences thing?)
    4) reduced memory usage of Swing
    5) overall speed improvement of Swing
    6) improved HTML support in Swing
    7) modularized on-demand installation of the JRE

    It's definitely not generics or overloaded return types (gosh). I don't see
    it being beneficial to the JRE to keep adding new language constructs. While
    it may be more correct, the programming you need to do can be done using the
    current means.

    The API is nearing a googol in the number of classes, while less time is
    spent on things an average end-user really wants: speed and reduced
    memory usage, better (default) multimedia support and integration with the
    operating system. A side-effect of adding things to the API is that you
    can't remove them afterwards, since people expect the stuff to be present.

    Take for example generics: programmers really want it, because it
    makes the language more 'advanced'. But it's actually just a feature to
    ease our laziness. A good editor/IDE with template support can do exactly
    the same. Does the end-user notice the fact that you are using generics?
    No. He's still looking at the sometimes brain-dead and slow-to-start user
    interface.

    Greets
    Bhun.
     
    dhek bhun kho, Jul 9, 2003
    #7
  8. Memory-mapped I/O (Re: Lots of Data)

    dhek bhun kho writes:
    > :D Just a quick question: Does memory mapping put the whole file into
    > memory, or does it just keep a part of the file memory resident? (I haven't
    > tried this yet.)


    It depends on the JVM implementation, which in turn depends on the OS
    supporting it, and also on hardware for an efficient implementation. For
    cross-platform reasons the Java API spec does not guarantee that you
    get real memory-mapped I/O.

    If it is done as it should be done, then only the parts of the file
    that you really access are transferred into memory (in some page-size
    chunks). Everything else is just some meta data in the memory
    management unit and the OS. But MappedByteBuffer also has the load()
    method, which worries me very much, because it is supposed to load the
    complete data into main memory at once.

    On Unix, memory-mapped I/O is "just" an application of the normal
    virtual memory management. It is the same mechanism that is used to
    page in and out programs and program data, and it is highly optimized.
    For memory-mapped I/O the file is associated with some virtual address
    space - just some meta data is set up. When this virtual memory is
    accessed at some time, it triggers a page-fault. That page-fault
    triggers the loading of that page from the file into main memory, and
    the access to the data is then satisfied from that page of main
    memory.

    What I would like to see for the nio FileChannel is a method to test if
    real memory-mapped I/O is supported, and also some method/constant
    which tells me the used page size (which is OS dependent).
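
    For reference, mapping one region of a file via FileChannel looks roughly
    like this (a sketch only; the file name and region size are made up, and
    whether pages really are faulted in lazily depends on the JVM and OS as
    described above):

    <code>
    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class MapRegion {
        public static void main(String[] args) throws Exception {
            RandomAccessFile raf = new RandomAccessFile("data.txt", "r");
            FileChannel channel = raf.getChannel();

            // Map only the first 64 KB (or the whole file if it is smaller).
            long regionSize = Math.min(channel.size(), 64 * 1024);
            MappedByteBuffer buf =
                channel.map(FileChannel.MapMode.READ_ONLY, 0, regionSize);

            // Touch a byte; only the pages actually read should be paged in.
            System.out.println("first byte: " + buf.get(0));

            channel.close();
            raf.close();
        }
    }
    </code>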

    > Otherwise memory mapping the file would be a good
    > solution.


    Well, if you have to sequentially search through your file, then sooner
    or later all parts of the file will be paged into main memory (and some
    might be paged out if main memory gets low). So you "only" gain some
    speed because of the more efficient I/O operations. But, if you have a
    more clever file structure (e.g. an index at the beginning of the
    file), and record sizes which are multiples or integer parts of the
    page size, you can gain considerable speed in file access, especially
    for large files.

    > The nio.*-interfaces are a bit awkward at times to me.


    Indeed, they are. E.g. I am still trying to find out if it is "better"
    to map the complete file at once, or to create new MappedByteBuffers
    for each region I want to access.

    > I just dislike the fact that there is this general attitude on usenet,
    > that once you post something that's not 100% working, your 'reputation'
    > is totally bummed out if somebody googles you.


    I don't have a reputation, so I don't have to care ;-) I am more
    worried about the whining that starts when you tell people the sad
    truth. E.g. that something they are trying is just stupid.

    > My wishlist for the future of the development of the jdk is:
    >
    > 1) faster startup time of the JRE
    > 2) better memory management of the JRE


    Please start with Swing :)

    > Take for example generics: programmers really want it,


    Don't count me in on that one. I got burned by C++ (yes, I know, Java
    generics are not templates, they are much better, yada, yada, yada, but I
    still don't want them).

    /Thomas
     
    Thomas Weidenfeller, Jul 9, 2003
    #8
  9. Quasi-databases: Serialised data (Was Re: Lots of Data)

    Hey Thomas (or anybody else)

    Thomas Weidenfeller wrote on Wed, 09 Jul 2003 07:06:15 +0000:

    > dhek bhun kho writes:
    >> If the data is stored with an explicit ordering, then you can get away with
    >> implementing some form of cache and accessing the file as a
    >> java.io.RandomAccessFile.
    >>
    >> It takes a little more thought to get the coding done, but it reduces the
    >> memory usage, and random access is fast. 40,000 lines of text isn't that much.

    >
    > I know. I just didn't want to tell the original poster that his idea of
    > first reading in everything into memory is a !@#$%^&* idea. Instead I
    > wanted to provide an off-the-shelf solution. Of course, writing your
    > own little special-purpose "database" and search code is also an
    > option.


    I was talking to somebody on JDC about multiple instances of serialised
    data in a single file, and was wondering if anybody could provide some
    input:
    http://forum.java.sun.com/thread.jsp?forum=4&thread=420695&start=0&range=30#1868195

    > into the previously allocated space. And a small
    > buffer could be used to copy the rest of the file backward.


    I see that deletion does not have to be much of an issue per se.

    But the thing is: we are talking about serialised data, and thus variable
    length 'records'.

    Another problem is modification. There is no way to guarantee that a
    modified object will fit back into its own 'bucket'. That is, when using
    the approach of storing multiple serialised objects in a single file,
    every modification of a serialised object implies a deletion and an addition.

    The only thing that is easy to do with a single file holding
    multiple objects is adding. And even that requires you to do a hack by
    adding a deletion flag.

    Searching this file for a specific instance will also be awkward. For one
    thing, if you want to do fast searching, you would have to resort to
    pattern matching, as the data is not stored in any particular ordering and
    has variable length.

    The more I think of it, the nastier it gets. Deletion is but the start of the
    problem. I am not saying it won't work. But all the extra work to get it
    working, just to have an aesthetically nice solution (a single file), makes me
    think it is the result of not thinking clearly.

    Maybe somebody has done it like this already, I really don't know. I
    haven't seen an implementation like this yet. I thought about what dub
    said. I'd like to know what anybody has to say about the following:

    Code a class that will handle the save/delete/find methods.

    <code>
    public abstract class PersistenceManager {
        // Returns the manager instance responsible for the given object.
        public static PersistenceManager getInstance(Object object);

        public abstract void save(Serializable object) throws NotLocked;
        public abstract void delete(Serializable object) throws NotLocked;
        public abstract void lock(Serializable object) throws AlreadyLocked, ObjectNotFound;
        public abstract void unlock(Serializable object) throws NotLocked;
        public abstract Serializable find(Class serializableClass, Comparator query);
        public abstract Serializable find(Class serializableClass, String md5hash);
    }
    </code>

    It's a singleton so it should be able to handle concurrent access to the
    backing store. This way the PersistenceManager can do some form of
    rudimentary caching (LRU, FIFO, whatever) and row locking in a single VM.

    The basic idea is clear, I hope. I'm not totally sure about the lock and
    unlock methods (deadlocks/starvation).

    A simple scenario would be something like this:
    <code>
    File baseDirectory = ...; // like the namespaces in SQL
    ...
    public void save(Serializable object)
        throws NotLocked
    {
        File classDirectory;

        classDirectory = new File(baseDirectory, object.getClass().getName());
        if (!classDirectory.exists()) classDirectory.mkdirs();
        // .. calculate key ..
        // .. check if this object has been locked, lock the key ..
        // .. serialise and save the object ..
        // .. unlock ..
    }
    </code>

    To get back on-topic: the major benefit of doing it like this, with a
    single file per object instance, is:

    - deletion is a sequence like lock/delete file/unlock.
    - saving is lock/create file/unlock.
    - sequential search should not incur extra overhead (but is not synchronized)

    One catch I can see now is that it requires the Serializable class to
    have some kind of unique id, meaning all objects should be unique:

    This way everything is still accessible using random access methods, but
    with the filesystem acting as 'one big file'. On the other hand, if I am
    correct, the OS itself usually does the caching of file access, so you don't
    need to go to great lengths to do caching yourself (hopefully).

    And what I find most important: memory usage of this approach should be
    really low, since it will only use the memory of a single object plus
    some serialisation overhead.

    I am just not sure what to do about the indexing of the serialised objects. I
    am planning to use the filenames as the key to an entry. Should I extend
    the Serializable contract to enforce the presence of a key value? I can't
    rely on the hashCode() method. Or should I do something like an MD5/SHA
    hash on the byte representation of the serialised object? That should
    provide unique key values, shouldn't it? These would be consistent across
    several executions of the application.
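
    Deriving such a key could look roughly like this: hash the bytes produced
    by ObjectOutputStream. This is just a sketch of the idea; note that the
    serialised form (and therefore the hash) changes whenever the object's
    contents or class change.

    <code>
    import java.io.ByteArrayOutputStream;
    import java.io.ObjectOutputStream;
    import java.io.Serializable;
    import java.security.MessageDigest;

    public class SerialKey {
        // Returns the MD5 hash of the object's serialised form as a hex string,
        // usable as a file name in the per-class directory.
        public static String md5Key(Serializable object) throws Exception {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            ObjectOutputStream out = new ObjectOutputStream(bytes);
            out.writeObject(object);
            out.close();

            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(bytes.toByteArray());

            StringBuffer hex = new StringBuffer();
            for (int i = 0; i < digest.length; i++) {
                int b = digest[i] & 0xff;
                if (b < 16) hex.append('0');
                hex.append(Integer.toHexString(b));
            }
            return hex.toString();
        }
    }
    </code>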

    I am planning on implementing this; can anybody tell me beforehand whether
    this is doomed to fail?

    Greets
    Bhun.
     
    dhek bhun kho, Jul 10, 2003
    #9
