Accessing content of huge files

Alex Molochnikov · Jun 26, 2008

I work for a seismic data processing company and am faced with the
task of efficiently accessing file-based data in an environment where
a 200 GB file is considered medium-sized.

The data is spatial (2D), with hundreds of millions of data points that
need to be plotted at various levels of resolution. For low resolution,
only the small subset of the data that fits into a limited number of
screen pixels should be used (and therefore loaded from the file). For
high resolution, all data points that belong to a given small X-Y
region should be loaded.

So these are the two extremes between which the application must
operate, and since this is an interactive app, the efficiency of
accessing data is of paramount importance.

I am looking for algorithms (better yet, LGPL-ed Java code) that can
handle this task.

Thanks for any clues.

Alex Molochnikov
Kelman Technologies Inc.

Andreas Leitgeb · Jun 27, 2008

Alex Molochnikov said:
The data is spatial (2D), with hundreds of millions of data points that
need to be plotted at various levels of resolution. For low resolution,
only the small subset of the data that fits into a limited number of
screen pixels should be used (and therefore loaded from the file). For
high resolution, all data points that belong to a given small X-Y
region should be loaded.

It depends on how the data points are actually stored in the file.
For example, the values could be decimal ASCII strings or binary
representations. The file could be pre-sorted or indexed to make the
relevant data easier to find, and each data point could be a
variable-length or a fixed-size record.

Blueparty · Jun 27, 2008

Alex said:
I work for a seismic data processing company and am faced with the
task of efficiently accessing file-based data in an environment where
a 200 GB file is considered medium-sized.

The data is spatial (2D), with hundreds of millions of data points that
need to be plotted at various levels of resolution. For low resolution,
only the small subset of the data that fits into a limited number of
screen pixels should be used (and therefore loaded from the file). For
high resolution, all data points that belong to a given small X-Y
region should be loaded.

So these are the two extremes between which the application must
operate, and since this is an interactive app, the efficiency of
accessing data is of paramount importance.

I think that you need to maintain some kind of index for each of the
different criteria. I mean a number of files with pointers (file
offsets) into the actual data. Let's call it metadata. Maybe you could
have a database that contains only the metadata, not the actual data.
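A minimal sketch of what such an offset index could look like. Everything
concrete here is an assumption, not something stated in the thread: each
record is three big-endian 4-byte floats (x, y, value), the data file is
already sorted by tile number on a regular grid of square tiles, and the
names TileIndex, buildOffsets and readTile are hypothetical.

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class TileIndex {
    // Assumed record layout: x, y, value as 4-byte floats.
    static final int RECORD_BYTES = 3 * Float.BYTES;

    // Tile number of a point on a grid of square tiles (coordinates assumed >= 0).
    static int tileOf(float x, float y, float tileSize, int tilesPerRow) {
        return (int) (y / tileSize) * tilesPerRow + (int) (x / tileSize);
    }

    // One pass over a data file sorted by tile number.
    // start[t] is the byte offset of tile t's first record;
    // start[tileCount] is the end of the data.
    static long[] buildOffsets(Path data, float tileSize, int tilesPerRow,
                               int tileCount) throws IOException {
        long[] start = new long[tileCount + 1];
        long records = Files.size(data) / RECORD_BYTES;
        int next = 0; // next tile whose start offset is still unknown
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(data)))) {
            for (long i = 0; i < records; i++) {
                float x = in.readFloat();
                float y = in.readFloat();
                in.readFloat(); // the value itself is not needed for the index
                int t = tileOf(x, y, tileSize, tilesPerRow);
                while (next <= t) start[next++] = i * RECORD_BYTES;
            }
        }
        while (next <= tileCount) start[next++] = records * RECORD_BYTES;
        return start;
    }

    // Reads only the bytes of one tile; returns x, y, value triples.
    static float[] readTile(Path data, long[] start, int tile) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate((int) (start[tile + 1] - start[tile]));
        try (FileChannel ch = FileChannel.open(data, StandardOpenOption.READ)) {
            while (buf.hasRemaining()
                    && ch.read(buf, start[tile] + buf.position()) >= 0) {
                // keep reading until the tile is complete or the file ends
            }
        }
        buf.flip();
        float[] out = new float[buf.remaining() / Float.BYTES];
        buf.asFloatBuffer().get(out);
        return out;
    }
}
```

The offsets array is the "metadata": it is tiny compared with the data
and could just as well live in a separate file or a database table.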

DG

Patricia Shanahan · Jun 27, 2008

Blueparty said:
I think that you need to maintain some kind of index for each of the
different criteria. I mean a number of files with pointers (file
offsets) into the actual data. Let's call it metadata. Maybe you could
have a database that contains only the metadata, not the actual data.

Also, it may be useful to pre-extract, into a separate file, a subset
of the points from which the ones plotted at low resolution are drawn.
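One way to sketch the pre-extraction step, assuming (this is not from the
thread) that the samples sit on a regular row-major grid of big-endian
4-byte floats. The class name Overview and the stride parameter are
hypothetical; the method simply keeps every stride-th sample of every
stride-th row and writes it to a second, much smaller file.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class Overview {
    // Streams src once and writes the decimated grid to dst.
    // Returns {outWidth, outHeight} of the overview grid.
    static int[] decimate(Path src, Path dst, int width, int height,
                          int stride) throws IOException {
        int outW = (width + stride - 1) / stride;
        int outH = (height + stride - 1) / stride;
        try (DataInputStream in = new DataInputStream(
                 new BufferedInputStream(Files.newInputStream(src)));
             DataOutputStream out = new DataOutputStream(
                 new BufferedOutputStream(Files.newOutputStream(dst)))) {
            float[] row = new float[width];
            for (int r = 0; r < height; r++) {
                // Only one row is held in memory at a time.
                for (int c = 0; c < width; c++) row[c] = in.readFloat();
                if (r % stride != 0) continue; // skip rows not in the subset
                for (int c = 0; c < width; c += stride) out.writeFloat(row[c]);
            }
        }
        return new int[] {outW, outH};
    }
}
```

Running this repeatedly with strides of, say, 2, 4, 8 and so on would give
a set of overview files, so a zoomed-out plot never has to touch the
full-resolution file. Plain decimation can miss narrow features; keeping a
minimum and maximum per block is a common alternative.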

Patricia

Alex Molochnikov · Jun 27, 2008

Patricia, DG:

Thank you for the response. I was leaning towards this approach
(indexing of spatial regions within the file, and prepping it for
various resolutions).

It is nice to see that I am not totally off-the-wall with these ideas.

However, apparently there is no freebie code floating around for me to
catch a ride on.

Alex.

Alex Molochnikov · Jun 27, 2008

Andreas said:
It depends on how the data points are actually stored in the file.
For example, the values could be decimal ASCII strings or binary
representations. The file could be pre-sorted or indexed to make the
relevant data easier to find, and each data point could be a
variable-length or a fixed-size record.

It is up to me to choose the appropriate format. In fact, part of my
task is to decide what format to choose.

Most likely, the file will be pre-sorted and binary, with 4-byte
floating-point data values.
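With fixed-size binary values, the position of any sample can be computed
rather than searched for. A rough sketch, assuming a row-major grid of
big-endian 4-byte floats (the grid layout, the class name GridFile and the
method readWindow are illustrative assumptions, not the actual format):

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class GridFile {
    // Reads a rows x cols window whose top-left sample is (row0, col0)
    // from a grid that is `width` samples wide. Only the bytes of the
    // window are read, one positional read per window row.
    static float[] readWindow(Path file, int width, int row0, int col0,
                              int rows, int cols) throws IOException {
        float[] out = new float[rows * cols];
        ByteBuffer buf = ByteBuffer.allocate(cols * Float.BYTES); // big-endian by default
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            for (int r = 0; r < rows; r++) {
                // long arithmetic: offsets in a 200 GB file overflow an int
                long pos = ((long) (row0 + r) * width + col0) * Float.BYTES;
                buf.clear();
                while (buf.hasRemaining()) {
                    if (ch.read(buf, pos + buf.position()) < 0) {
                        throw new EOFException("window extends past end of file");
                    }
                }
                buf.flip();
                buf.asFloatBuffer().get(out, r * cols, cols);
            }
        }
        return out;
    }
}
```

For example, readWindow(file, 4, 1, 1, 2, 2) on a 4 x 4 grid returns the
four samples in the middle of the grid. Storing the grid in square tiles
instead of whole rows would make tall, narrow windows cheaper, at the
cost of a slightly more involved offset calculation.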

I know how to implement the code for accessing such a file, but was
hoping to piggyback on other people's contributions.

Thanks,
Alex

Patricia Shanahan · Jun 27, 2008

Alex said:
It is up to me to choose the appropriate format. In fact, part of my
task is to decide what format to choose.

Most likely, the file will be pre-sorted and binary, with 4-byte
floating-point data values.

I know how to implement the code for accessing such a file, but was
hoping to piggyback on other people's contributions.

The most useful piece that I think already exists is two-dimensional
indexing. If you are not already familiar with the topic,
http://en.wikipedia.org/wiki/Kd-tree may be a good starting point: it
gives some of the terminology and links to related ideas.
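To give a flavour of the idea, here is a small in-memory k-d tree with a
rectangular range query. It is only an illustration: the class and method
names are made up, the tree is stored implicitly in a sorted array, and a
real index for hundreds of millions of points would have to live on disk
and refer to file offsets rather than hold the points themselves.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class KdTree {
    // pts[i] = {x, y}; the median of each subrange acts as the tree node.
    private final float[][] pts;

    KdTree(float[][] points) {
        pts = points.clone(); // shallow copy, the caller's ordering is kept
        build(0, pts.length, 0);
    }

    // Sorts [lo, hi) on the current axis, then recurses on both halves
    // around the median with the other axis.
    private void build(int lo, int hi, int axis) {
        if (hi - lo <= 1) return;
        Arrays.sort(pts, lo, hi, (a, b) -> Float.compare(a[axis], b[axis]));
        int mid = (lo + hi) >>> 1;
        build(lo, mid, 1 - axis);
        build(mid + 1, hi, 1 - axis);
    }

    // Returns all points with minX <= x <= maxX and minY <= y <= maxY.
    List<float[]> query(float minX, float minY, float maxX, float maxY) {
        List<float[]> out = new ArrayList<>();
        search(0, pts.length, 0,
               new float[] {minX, minY}, new float[] {maxX, maxY}, out);
        return out;
    }

    private void search(int lo, int hi, int axis,
                        float[] min, float[] max, List<float[]> out) {
        if (lo >= hi) return;
        int mid = (lo + hi) >>> 1;
        float[] p = pts[mid];
        if (p[0] >= min[0] && p[0] <= max[0]
                && p[1] >= min[1] && p[1] <= max[1]) {
            out.add(p);
        }
        // Descend only into the halves that can overlap the query rectangle.
        if (min[axis] <= p[axis]) search(lo, mid, 1 - axis, min, max, out);
        if (max[axis] >= p[axis]) search(mid + 1, hi, 1 - axis, min, max, out);
    }
}
```

A query for a small X-Y region visits only the branches whose half-plane
overlaps the rectangle, which is what keeps high-resolution zooms cheap.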

Very often, searching for the name of an algorithm or data structure
preceded by "java" finds Java code implementing it.

Patricia

Roedy Green · Jun 28, 2008

Alex Molochnikov said:
The data is spatial (2D), with hundreds of millions of data points that
need to be plotted at various levels of resolution. For low resolution,
only the small subset of the data that fits into a limited number of
screen pixels should be used (and therefore loaded from the file). For
high resolution, all data points that belong to a given small X-Y
region should be loaded.

Back in the 80s, such databases were handled with special-purpose
hardware, stuff from Silicon Graphics for example. You might google
for "mapping software".
