Design of a pipelined architecture/framework for handling large data sets

nish

I am running into a problem that I suspect other Java developers have faced
before, but I am finding it hard to articulate in keywords that will get me
the right answers from Google, so here goes:

1. I am using the Eclipse IDE with multiple Java projects, each one sourced
from a CVS repository on an external server in the local LAN.
2. Almost all of these projects handle big data sets (read: 100 MB - 500 MB
of XML and text files), which is basically data crawled from the web. Each
project transforms the data in some way and then passes it along for other
projects to act on (roughly the kind of stage sketched below). Some of the
data is in single big files and some of it is in hundreds of small files
inside a single directory.
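
To make that concrete, here is roughly what each project amounts to today.
The interface and class names are just made up for illustration, not an
existing framework:

import java.io.File;
import java.util.List;

/**
 * Rough sketch of the stage each project implements informally: take a
 * dataset on disk (one big XML/text file or a directory of many small
 * files), transform it, and write the result where the next project
 * picks it up.
 */
public interface PipelineStage {

    /** Transform everything under input and write the results under output. */
    void process(File input, File output) throws Exception;
}

class PipelineRunner {

    /** Runs stages in order, feeding each stage's output directory to the next one. */
    static void run(List<PipelineStage> stages, File initialInput, File workDir) throws Exception {
        File current = initialInput;
        for (int i = 0; i < stages.size(); i++) {
            File stageOutput = new File(workDir, "stage-" + i);
            stageOutput.mkdirs();
            stages.get(i).process(current, stageOutput);
            current = stageOutput;
        }
    }
}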

Basically what I am looking for is a better way to handle this data.
Currently, putting the data in CVS is not very efficient, plus there needs
to be some central lookup for all the data. I guess this is partly a Java
design question and partly ignorance on my part about the right tools for
the job.
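
For the central lookup part, the simplest thing I can picture is a shared
catalog file that maps logical dataset names to locations on a network
share, so projects stop hard-coding paths. The file name and keys below are
just placeholders:

import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

/**
 * Minimal sketch of a central dataset lookup: a shared properties file maps
 * logical dataset names to directories or files on a network share, e.g.
 *
 *   crawl-2008-05=/mnt/share/data/crawl-2008-05
 *   transformed-xml=/mnt/share/data/transformed-xml
 */
public class DatasetCatalog {

    private final Properties index = new Properties();

    public DatasetCatalog(String catalogFile) throws IOException {
        FileInputStream in = new FileInputStream(catalogFile);
        try {
            index.load(in);
        } finally {
            in.close();
        }
    }

    /** Resolve a logical dataset name to its location, failing loudly if unknown. */
    public String locate(String datasetName) {
        String location = index.getProperty(datasetName);
        if (location == null) {
            throw new IllegalArgumentException("Unknown dataset: " + datasetName);
        }
        return location;
    }
}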

Thanks for any help.
 
nish

Other issues I could think of:

3. I should be able to specify how a data set is archived. For example, for
some large data sets I don't want the data revisioned in CVS because it is
not going to change; for others I might want it checked into CVS so that it
gets revisioned (something like the per-dataset flag sketched below).
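
One way I imagine expressing that is a per-dataset descriptor in the same
catalog, with a flag for how it is stored. The entry format and names are
invented for illustration:

/**
 * Sketch of a per-dataset archiving policy: each catalog entry says whether
 * the data should be checked into CVS (small, changing data) or just kept
 * read-only on the file share (large, frozen crawl output). An entry might
 * look like:
 *
 *   crawl-2008-05=/mnt/share/data/crawl-2008-05, read-only
 *   lookup-tables=/mnt/share/data/lookup-tables, cvs
 */
public class DatasetDescriptor {

    public enum Storage { CVS_VERSIONED, SHARED_READ_ONLY }

    private final String name;
    private final String location;
    private final Storage storage;

    public DatasetDescriptor(String name, String location, Storage storage) {
        this.name = name;
        this.location = location;
        this.storage = storage;
    }

    /** Parse one catalog value such as "/mnt/share/data/crawl-2008-05, read-only". */
    public static DatasetDescriptor fromProperty(String name, String value) {
        String[] parts = value.split(",");
        String location = parts[0].trim();
        Storage storage = parts.length > 1 && parts[1].trim().equalsIgnoreCase("cvs")
                ? Storage.CVS_VERSIONED
                : Storage.SHARED_READ_ONLY;
        return new DatasetDescriptor(name, location, storage);
    }

    public boolean shouldBeVersioned() {
        return storage == Storage.CVS_VERSIONED;
    }

    public String getName()     { return name; }
    public String getLocation() { return location; }
}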
 
