Accessing content of huge files

Alex Molochnikov

I work for a seismic data processing company and am faced with the
task of efficiently accessing file-based data in an environment where
a 200 GB file is considered medium-sized.

The data is spatial (2D), with hundreds of millions of data points that
need to be plotted at various levels of resolution. For low resolution,
only a small subset of the data, enough to fill a limited number of
screen pixels, should be used (and therefore loaded from the file). For
high resolution, all data points that belong to a given small X-Y region
should be loaded.

So these are the two extremes between which the application must
operate, and since this is an interactive app, the efficiency of
accessing data is of paramount importance.

I am looking for algorithms (better yet, LGPL-ed Java code) that can
handle this task.

Thanks for any clues.

Alex Molochnikov
Kelman Technologies Inc.
 
Andreas Leitgeb

Alex Molochnikov said:
The data is spatial (2D), with hundreds of millions of data points that
need to be plotted at various levels of resolution. For low resolution,
only a small subset of the data, enough to fill a limited number of
screen pixels, should be used (and therefore loaded from the file). For
high resolution, all data points that belong to a given small X-Y region
should be loaded.

It depends on how the data points are actually stored in the file.
For example, the values could be decimal ASCII strings or binary
representations. The file could be pre-sorted or indexed to make the
relevant data easier to find, and each data point could be a
variable-length or fixed-size record.
 
Blueparty

Alex said:
I work for a seismic data processing company and am faced with the
task of efficiently accessing file-based data in an environment where
a 200 GB file is considered medium-sized.

The data is spatial (2D), with hundreds of millions of data points that
need to be plotted at various levels of resolution. For low resolution,
only a small subset of the data, enough to fill a limited number of
screen pixels, should be used (and therefore loaded from the file). For
high resolution, all data points that belong to a given small X-Y region
should be loaded.

So these are the two extremes between which the application must
operate, and since this is an interactive app, the efficiency of
accessing data is of paramount importance.

I think that you need to maintain some kind of index for each search
criterion. I mean a number of files with pointers (file offsets) to the
actual data. Let's call it metadata. Maybe you can have a database that
does not contain the actual data, but the metadata only.
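
A minimal sketch of one such index, under assumptions that are not stated in this thread: the data file is sorted so that all records of one grid tile are contiguous, tiles are stored in row-major order, and the index holds the first record number of every tile. The names TileIndex and Run are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical tile index. Assumes the data file is sorted so that all
 * records of one grid tile are contiguous, with tiles in row-major order.
 */
final class TileIndex {
    /** A contiguous run of records in the data file. */
    static final class Run {
        final long firstRecord;
        final long recordCount;

        Run(long firstRecord, long recordCount) {
            this.firstRecord = firstRecord;
            this.recordCount = recordCount;
        }
    }

    private final double minX, minY, tileSize;
    private final int cols, rows;
    // start[t] = first record of tile t; start[cols * rows] = total record count.
    private final long[] start;

    TileIndex(double minX, double minY, double tileSize, int cols, int rows, long[] start) {
        if (start.length != cols * rows + 1) {
            throw new IllegalArgumentException("start must have cols * rows + 1 entries");
        }
        this.minX = minX;
        this.minY = minY;
        this.tileSize = tileSize;
        this.cols = cols;
        this.rows = rows;
        this.start = start.clone();
    }

    /** Returns the record runs of every tile that overlaps the query rectangle. */
    List<Run> query(double x0, double y0, double x1, double y1) {
        int c0 = clamp((int) Math.floor((x0 - minX) / tileSize), cols);
        int c1 = clamp((int) Math.floor((x1 - minX) / tileSize), cols);
        int r0 = clamp((int) Math.floor((y0 - minY) / tileSize), rows);
        int r1 = clamp((int) Math.floor((y1 - minY) / tileSize), rows);
        List<Run> runs = new ArrayList<>();
        for (int r = r0; r <= r1; r++) {
            // Tiles c0..c1 of one row are adjacent in the file, so they form one run.
            int first = r * cols + c0;
            int last = r * cols + c1;
            long count = start[last + 1] - start[first];
            if (count > 0) {
                runs.add(new Run(start[first], count));
            }
        }
        return runs;
    }

    // Queries reaching outside the grid are clamped to the nearest edge tile.
    private static int clamp(int i, int n) {
        return Math.max(0, Math.min(n - 1, i));
    }
}
```

Each run converts to a byte offset by multiplying firstRecord by the record size. The runs are a superset of the query rectangle, so the caller would still filter individual points after loading them.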

DG
 
Patricia Shanahan

Blueparty said:
I think that you need to maintain some kind of index for each search
criterion. I mean a number of files with pointers (file offsets) to the
actual data. Let's call it metadata. Maybe you can have a database that
does not contain the actual data, but the metadata only.

Also, it may be useful to pre-extract, into a separate file, a subset of
the points, some of which will be plotted at low resolution.
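
A minimal sketch of such a pre-extraction pass, assuming fixed-size 12-byte records (three 4-byte floats: x, y, value) and plain every-Nth-record sampling. Both the record layout and the class name Overview are assumptions for illustration; stride sampling is only one simple choice and may drop local extremes that a real overview would want to keep.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

final class Overview {
    /** Assumed record size: three 4-byte floats (x, y, value). */
    static final int RECORD_BYTES = 12;

    /** Copies every stride-th record of the full file into the overview file; returns the number written. */
    static long build(Path full, Path overview, int stride) throws IOException {
        if (stride < 1) {
            throw new IllegalArgumentException("stride must be >= 1");
        }
        byte[] rec = new byte[RECORD_BYTES];
        long index = 0;
        long written = 0;
        try (InputStream in = new BufferedInputStream(Files.newInputStream(full), 1 << 20);
             OutputStream out = new BufferedOutputStream(Files.newOutputStream(overview), 1 << 20)) {
            // Stop at end of file or at a trailing partial record.
            while (in.readNBytes(rec, 0, RECORD_BYTES) == RECORD_BYTES) {
                if (index % stride == 0) {
                    out.write(rec);
                    written++;
                }
                index++;
            }
        }
        return written;
    }
}
```

Running this once per zoom level (for example with strides of 10, 100 and 1000) would give a small set of overview files, so a low-resolution plot never has to touch the full file.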

Patricia
 
Alex Molochnikov

Patricia, DG:

Thank you for the response. I was leaning towards this approach
(indexing of spatial regions within the file, and prepping it for
various resolutions).

It is nice to see that I am not totally off-the-wall with these ideas.

However, apparently there is no freebie code floating around for me to
catch a ride on.

Alex.
 
Alex Molochnikov

Andreas said:
It depends on how the data points are actually stored in the file.
For example, the values could be decimal ASCII strings or binary
representations. The file could be pre-sorted or indexed to make the
relevant data easier to find, and each data point could be a
variable-length or fixed-size record.

It is up to me to choose the appropriate format. In fact, part of my
task is to decide what format to choose.

Most likely, the file will be pre-sorted, and binary with 4-byte
floating point data values.
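
A minimal sketch of random access into a file of that kind, assuming (hypothetically) that each record is three big-endian 4-byte floats (x, y, value) and that the file has no header. The class name PointFile and the exact layout are illustrative only. With fixed-size records, the byte offset of record i is i times the record size, and it has to be computed as a long because offsets in a 200 GB file do not fit in an int.

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical layout: each record is three big-endian 4-byte floats (x, y, value). */
final class PointFile implements AutoCloseable {
    static final int RECORD_BYTES = 3 * Float.BYTES;

    private final FileChannel channel;

    PointFile(Path path) throws IOException {
        this.channel = FileChannel.open(path, StandardOpenOption.READ);
    }

    /** Number of complete records, derived from the file size. */
    long recordCount() throws IOException {
        return channel.size() / RECORD_BYTES;
    }

    /** Reads count records starting at record first; returns x, y, value triples. */
    float[] read(long first, int count) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(count * RECORD_BYTES);
        long pos = first * RECORD_BYTES; // long arithmetic on purpose
        while (buf.hasRemaining()) {
            // Positional read: does not move the channel's own position.
            int n = channel.read(buf, pos + buf.position());
            if (n < 0) {
                throw new EOFException("record range extends past end of file");
            }
        }
        buf.flip();
        float[] out = new float[count * 3];
        buf.asFloatBuffer().get(out); // ByteBuffer is big-endian by default
        return out;
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }
}
```

Because the file is pre-sorted, a region that maps to a contiguous range of records can be fetched with a single call such as read(firstRecord, recordCount).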

I know how to implement the code for accessing such a file, but was
hoping to piggyback on other people's contributions.

Thanks,
Alex
 
Patricia Shanahan

Alex said:
It is up to me to choose the appropriate format. In fact, part of my
task is to decide what format to choose.

Most likely, the file will be pre-sorted, and binary with 4-byte
floating point data values.

I know how to implement the code for accessing such a file, but was
hoping to piggyback on other people's contributions.

The most useful piece that I think already exists is two-dimensional
indexing. If you are not already familiar with the topic,
http://en.wikipedia.org/wiki/Kd-tree may be a good starting point: it
introduces some of the terminology and links to related ideas.
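
To make the idea concrete, here is a minimal in-memory sketch of a 2-d tree with a rectangular range query. It is illustrative only: the class name KdTree2D is hypothetical, the build step re-sorts at every level for brevity, and a data set of the size discussed here would need a disk-based variant rather than an array held in memory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Minimal 2-d tree stored implicitly in an array: the node of a range is its middle element. */
final class KdTree2D {
    private final float[][] pts; // each entry is {x, y}

    KdTree2D(float[][] points) {
        pts = points.clone();
        build(0, pts.length, 0);
    }

    // Sort the range on the current axis so the median becomes the node,
    // then build each half on the other axis.
    private void build(int lo, int hi, int axis) {
        if (hi - lo <= 1) {
            return;
        }
        Arrays.sort(pts, lo, hi, Comparator.comparingDouble(p -> p[axis]));
        int mid = (lo + hi) >>> 1;
        build(lo, mid, 1 - axis);
        build(mid + 1, hi, 1 - axis);
    }

    /** Returns all points with x0 <= x <= x1 and y0 <= y <= y1. */
    List<float[]> range(float x0, float y0, float x1, float y1) {
        List<float[]> out = new ArrayList<>();
        search(0, pts.length, 0, new float[] {x0, y0}, new float[] {x1, y1}, out);
        return out;
    }

    private void search(int lo, int hi, int axis, float[] min, float[] max, List<float[]> out) {
        if (lo >= hi) {
            return;
        }
        int mid = (lo + hi) >>> 1;
        float[] p = pts[mid];
        if (p[0] >= min[0] && p[0] <= max[0] && p[1] >= min[1] && p[1] <= max[1]) {
            out.add(p);
        }
        // The left half holds coordinates <= p[axis], the right half coordinates >= p[axis],
        // so a half is skipped whenever the query rectangle lies entirely on the other side.
        if (min[axis] <= p[axis]) {
            search(lo, mid, 1 - axis, min, max, out);
        }
        if (max[axis] >= p[axis]) {
            search(mid + 1, hi, 1 - axis, min, max, out);
        }
    }
}
```

The pruning in search is what makes a small X-Y window cheap to query: subtrees that cannot intersect the window are never visited.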

Very often, searching for the name of an algorithm or data structure
preceded by "java" finds Java code implementing it.

Patricia
 
Roedy Green

Alex Molochnikov said:
The data is spatial (2D), with hundreds of millions of data points that
need to be plotted at various levels of resolution. For low resolution,
only a small subset of the data, enough to fill a limited number of
screen pixels, should be used (and therefore loaded from the file). For
high resolution, all data points that belong to a given small X-Y region
should be loaded.

Back in the 80s, such databases were handled with special-purpose
hardware, for example from Silicon Graphics. You might google
for "mapping software".
 
