Interesting data storage problem

X

Xaonon

I'm writing a program which will, in part, need to store and eventually
write out to files a fair amount of data, on the rough order of 10 MB. This
data consists of a series of ~10000 numbered records, each containing a few
kB of text on average and a handful of integers. I intend to save it in XML
format. Writing the data structure to hold these records is trivial.
Assume further that I don't have access to any sort of database; any file
I/O I'm going to have to do more or less by hand.

Now, the question is: exactly how should I save all of this? Performance is
an issue, so I'm leery of saving the whole thing in one big file and loading
it all in one gulp. Ten megs seems like a lot of memory for a program that
isn't displaying much more than plain ol' *text*... and only a small subset
of the records will be displayed at any given time, so it seems that I
should be able to get away with only loading them as needed. On the other
end of the spectrum I could save each record in its own file, but I know
that some filesystems have trouble with directories containing thousands of
files.

I'm very new to Java---I'm writing this program mostly just to teach Java to
myself---so please excuse me if there's some obvious solution I'm missing.
I'd be grateful for any suggestions.
 
M

Matt Humphrey

Xaonon said:
I'm writing a program which will, in part, need to store and eventually
write out to files a fair amount of data, on the rough order of 10 MB. This
data consists of a series of ~10000 numbered records, each containing a few
kB of text on average and a handful of integers. I intend to save it in XML
format. Writing the data structure to hold these records is trivial.
Assume further that I don't have access to any sort of database; any file
I/O I'm going to have to do more or less by hand.

Now, the question is: exactly how should I save all of this? Performance is
an issue, so I'm leery of saving the whole thing in one big file and loading
it all in one gulp. Ten megs seems like a lot of memory for a program that
isn't displaying much more than plain ol' *text*... and only a small subset
of the records will be displayed at any given time, so it seems that I
should be able to get away with only loading them as needed. On the other
end of the spectrum I could save each record in its own file, but I know
that some filesystems have trouble with directories containing thousands of
files.

I'm very new to Java---I'm writing this program mostly just to teach Java to
myself---so please excuse me if there's some obvious solution I'm missing.
I'd be grateful for any suggestions.

Well, you really have two goals. One is the actual storage but the other is
to teach yourself Java and that may be well served by using something
appropriate for a real project. Although you say you don't have access to
any databases, there are some nice free ones and this could be a good
opportunity for you to learn to JDBC. However, to simply address the
problem as stated:

I would use 2 RandomAccessFiles with the first as a kind of index or record
map and the second as the varying-length data. The first file would consist
of fixed-length records that contain any key information along with your
extra numbers and two other numbers. Of these two extra ints, the first is
the length (in bytes) of the record's associated text block. The second
number is the block's location in the second file. That is, the
varying-length data is stacked end-to-end in the second file at positions
specifically recorded in the first file.

By knowing the record size it is possible to jump to any record number in
the first file. (E.g. Record index * Record Size). You can then read out
the record data along with the extra data position and size. You can then
read the extra data from the second file (or not if the size is 0 or
something.)

You havn't said whether you need to update the data or if you just create it
once and read repeatedly or if you create it and append (etc), but not edit.
Deleting records is largely a matter of marking them deleted, chaining them
in a stack and re-allocating them later. Changing the size of the
variable-length text is more complex. Actually, if you can ensure that the
varying length portion always has a maximum amount (e.g. 2K or something),
you can make your life much easier by using only one file and having each
fixed-length record contain it's own fixed-length area. You'll be wasting
space, but you're only talking 10MB and you would still have to trade
between internal or external fragmentation.

I hope that gives you some ideas.
Cheers,
Matt Humphrey (e-mail address removed) http://www.iviz.com/
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,769
Messages
2,569,580
Members
45,054
Latest member
TrimKetoBoost

Latest Threads

Top