Accessing content of huge files

Discussion in 'Java' started by Alex Molochnikov, Jun 26, 2008.

  1. I work for a seismic data processing company and am confronted with the
    task of efficiently accessing file-based data in an environment where
    a 200 GB file is considered medium-sized.

    The data is spatial (2D), with hundreds of millions of data points that
    need to be plotted at various levels of resolution. For low resolution,
    only a small subset of all data that fits into a limited number of screen
    pixels should be used (and therefore loaded from the file). For high
    resolutions, all data points that belong to the given small X-Y region
    should be loaded.

    So these are the two extremes between which the application must
    operate, and since this is an interactive app, the efficiency of
    accessing data is of paramount importance.

    I am looking for algorithms (better yet, LGPL-ed Java code) that can
    handle this task.

    Thanks for any clues.

    Alex Molochnikov
    Kelman Technologies Inc.
    Alex Molochnikov, Jun 26, 2008
    #1

  2. Alex Molochnikov wrote:
    > The data is spatial (2D), with hundreds of millions of data points that
    > need to be plotted at various levels of resolution. For low resolution,
    > only a small subset of all data that fits into a limited number of screen
    > pixels should be used (and therefore loaded from the file). For high
    > resolutions, all data points that belong to the given small X-Y region
    > should be loaded.


    It depends on how the data points are actually stored in the file.
    e.g. it could be decimal ascii-strings, or binary representations.
    It could be pre-sorted or indexed to ease finding the relevant data;
    it could be variable-length or fixed-size records for each data point.
    Andreas Leitgeb, Jun 27, 2008
    #2

  3. Alex Molochnikov wrote:
    > I work for a seismic data processing company and am confronted with the
    > task of efficiently accessing file-based data in an environment where
    > a 200 GB file is considered medium-sized.
    >
    > The data is spatial (2D), with hundreds of millions of data points that
    > need to be plotted at various levels of resolution. For low resolution,
    > only a small subset of all data that fits into a limited number of screen
    > pixels should be used (and therefore loaded from the file). For high
    > resolutions, all data points that belong to the given small X-Y region
    > should be loaded.
    >
    > So these are the two extremes between which the application must
    > operate, and since this is an interactive app, the efficiency of
    > accessing data is of paramount importance.
    >


    I think that you need to maintain some kind of indexes for different
    criteria. I mean a number of files with pointers (file offsets) to
    actual data. Let's call it metadata. Maybe you can have a database that
    does not contain the actual data, but metadata only.

    DG
    Blueparty, Jun 27, 2008
    #3
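    The "metadata only" index suggested above can be sketched as a regular
    grid of tiles, each holding the byte offset and record count of its
    points in the big data file. This is only an illustration, not code from
    the thread: the class name TileIndex, the fixed tile size, and the
    assumption that the file is sorted so each tile's records are contiguous
    are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory tile index: one (offset, count) entry per grid tile. */
public class TileIndex {
    private final double minX, minY, tileSize;
    private final int cols, rows;
    private final long[] offset; // byte offset of the tile's first record in the data file
    private final int[] count;   // number of records stored for the tile (0 = empty)

    public TileIndex(double minX, double minY, double tileSize, int cols, int rows) {
        this.minX = minX;
        this.minY = minY;
        this.tileSize = tileSize;
        this.cols = cols;
        this.rows = rows;
        this.offset = new long[cols * rows];
        this.count = new int[cols * rows];
    }

    /** Registers where the records of tile (col, row) live in the data file. */
    public void put(int col, int row, long byteOffset, int recordCount) {
        offset[row * cols + col] = byteOffset;
        count[row * cols + col] = recordCount;
    }

    /** Returns {offset, count} for every non-empty tile overlapping the query rectangle. */
    public List<long[]> query(double x0, double y0, double x1, double y1) {
        List<long[]> hits = new ArrayList<>();
        // A rectangle entirely outside the indexed area matches nothing.
        if (x1 < minX || y1 < minY
                || x0 >= minX + cols * tileSize || y0 >= minY + rows * tileSize) {
            return hits;
        }
        int c0 = clamp((int) Math.floor((x0 - minX) / tileSize), cols);
        int c1 = clamp((int) Math.floor((x1 - minX) / tileSize), cols);
        int r0 = clamp((int) Math.floor((y0 - minY) / tileSize), rows);
        int r1 = clamp((int) Math.floor((y1 - minY) / tileSize), rows);
        for (int r = r0; r <= r1; r++) {
            for (int c = c0; c <= c1; c++) {
                int i = r * cols + c;
                if (count[i] > 0) {
                    hits.add(new long[] {offset[i], count[i]});
                }
            }
        }
        return hits;
    }

    private static int clamp(int v, int n) {
        return Math.max(0, Math.min(n - 1, v));
    }
}
```

    The viewer would then seek to each returned offset and read only those
    records. The index itself is small enough to keep in memory or in a
    separate file, which is the point of storing metadata apart from the data.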
  4. Blueparty wrote:
    > Alex Molochnikov wrote:
    >> I work for a seismic data processing company and am confronted with the
    >> task of efficiently accessing file-based data in an environment where
    >> a 200 GB file is considered medium-sized.
    >>
    >> The data is spatial (2D), with hundreds of millions of data points that
    >> need to be plotted at various levels of resolution. For low resolution,
    >> only a small subset of all data that fits into a limited number of screen
    >> pixels should be used (and therefore loaded from the file). For high
    >> resolutions, all data points that belong to the given small X-Y region
    >> should be loaded.
    >>
    >> So these are the two extremes between which the application must
    >> operate, and since this is an interactive app, the efficiency of
    >> accessing data is of paramount importance.
    >>

    >
    > I think that you need to maintain some kind of indexes for different
    > criteria. I mean a number of files with pointers (file offsets) to
    > actual data. Let's call it metadata. Maybe you can have a database that
    > does not contain the actual data, but metadata only.


    Also, it may be useful to pre-extract, into a separate file, a subset of
    the points, some of which will be plotted at low resolution.

    Patricia
    Patricia Shanahan, Jun 27, 2008
    #4
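    One way to read the pre-extraction idea is a set of overview levels,
    each keeping every n-th sample of the full data. The sketch below assumes
    the data can be treated as a regular row-major grid (the thread does not
    say so) and works on an in-memory array for brevity; in practice each
    level would be written to its own file. The names Overview, subsample,
    and stepFor are made up for illustration.

```java
/** Hypothetical helper for building low-resolution overview levels of a grid. */
public class Overview {

    /** Keeps every step-th sample in each dimension of a row-major grid. */
    public static float[] subsample(float[] grid, int width, int height, int step) {
        int w = (width + step - 1) / step;   // overview width, rounded up
        int h = (height + step - 1) / step;  // overview height, rounded up
        float[] out = new float[w * h];
        for (int r = 0; r < h; r++) {
            for (int c = 0; c < w; c++) {
                out[r * w + c] = grid[(r * step) * width + c * step];
            }
        }
        return out;
    }

    /** Picks the smallest power-of-two step whose overview fits the screen. */
    public static int stepFor(int width, int height, int screenW, int screenH) {
        int step = 1;
        while ((width + step - 1) / step > screenW
                || (height + step - 1) / step > screenH) {
            step *= 2;
        }
        return step;
    }
}
```

    Plain subsampling can miss narrow peaks; storing a per-block minimum and
    maximum instead is a common alternative, at the cost of two values per
    overview sample.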
  5. Patricia, DG:

    Thank you for the response. I was leaning towards this approach
    (indexing of spatial regions within the file, and prepping it for
    various resolutions).

    It is nice to see that I am not totally off-the-wall with these ideas.

    However, apparently there is no freebie code floating around for me to
    catch a ride on.

    Alex.

    Patricia Shanahan wrote:
    > Blueparty wrote:
    >> Alex Molochnikov wrote:
    >>> I work for a seismic data processing company and am confronted with the
    >>> task of efficiently accessing file-based data in an environment where
    >>> a 200 GB file is considered medium-sized.
    >>>
    >>> The data is spatial (2D), with hundreds of millions of data points that
    >>> need to be plotted at various levels of resolution. For low resolution,
    >>> only a small subset of all data that fits into a limited number of screen
    >>> pixels should be used (and therefore loaded from the file). For high
    >>> resolutions, all data points that belong to the given small X-Y region
    >>> should be loaded.
    >>>
    >>> So these are the two extremes between which the application must
    >>> operate, and since this is an interactive app, the efficiency of
    >>> accessing data is of paramount importance.
    >>>

    >>
    >> I think that you need to maintain some kind of indexes for different
    >> criteria. I mean a number of files with pointers (file offsets) to
    >> actual data. Let's call it metadata. Maybe you can have a database that
    >> does not contain the actual data, but metadata only.

    >
    > Also, it may be useful to pre-extract, into a separate file, a subset of
    > the points, some of which will be plotted at low resolution.
    >
    > Patricia
    >
    Alex Molochnikov, Jun 27, 2008
    #5
  6. Andreas Leitgeb wrote:
    > Alex Molochnikov wrote:
    >> The data is spatial (2D), with hundreds of millions of data points that
    >> need to be plotted at various levels of resolution. For low resolution,
    >> only a small subset of all data that fits into a limited number of screen
    >> pixels should be used (and therefore loaded from the file). For high
    >> resolutions, all data points that belong to the given small X-Y region
    >> should be loaded.

    >
    > It depends on how the data points are actually stored in the file.
    > e.g. it could be decimal ascii-strings, or binary representations.
    > It could be pre-sorted or indexed to ease finding the relevant data;
    > it could be variable-length or fixed-size records for each data point.
    >


    It is up to me to choose the appropriate format. In fact, part of my
    task is to decide what format to choose.

    Most likely, the file will be pre-sorted, and binary with 4-byte
    floating point data values.

    I know how to implement the code for accessing such a file, but was
    hoping to piggyback on other people's contributions.

    Thanks,
    Alex
    Alex Molochnikov, Jun 27, 2008
    #6
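    With fixed-size binary records, the position of any sample can be
    computed directly, so a small window can be read without touching the
    rest of the file. The following is a minimal sketch under assumptions not
    stated in the thread: a headerless, row-major grid of big-endian 4-byte
    floats. GridFile, offsetOf, and readWindow are illustrative names.

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical reader for a headerless row-major grid of 4-byte floats. */
public class GridFile {

    /** Byte offset of sample (row, col); long arithmetic stays valid far past 2 GB. */
    public static long offsetOf(long row, long col, long width) {
        return (row * width + col) * Float.BYTES;
    }

    /** Reads the window of h rows and w columns starting at (row0, col0), one read per row. */
    public static float[] readWindow(Path file, long width, long row0, long col0,
                                     int h, int w) throws IOException {
        float[] out = new float[h * w];
        ByteBuffer buf = ByteBuffer.allocate(w * Float.BYTES); // big-endian by default
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            for (int r = 0; r < h; r++) {
                buf.clear();
                long pos = offsetOf(row0 + r, col0, width);
                // A positional read may return fewer bytes than requested, so loop.
                while (buf.hasRemaining()) {
                    int n = ch.read(buf, pos + buf.position());
                    if (n < 0) {
                        throw new EOFException("window extends past end of file");
                    }
                }
                buf.flip();
                buf.asFloatBuffer().get(out, r * w, w);
            }
        }
        return out;
    }
}
```

    Reading one contiguous run per row keeps the number of seeks equal to the
    window height. Storing the grid in square tiles rather than full rows
    would reduce that further for small windows.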
  7. Alex Molochnikov wrote:
    > Andreas Leitgeb wrote:
    >> Alex Molochnikov wrote:
    >>> The data is spatial (2D), with hundreds of millions of data points
    >>> that need to be plotted at various levels of resolution. For low
    >>> resolution, only a small subset of all data that fits into a limited
    >>> number of screen pixels should be used (and therefore loaded from the
    >>> file). For high resolutions, all data points that belong to the given
    >>> small X-Y region should be loaded.

    >>
    >> It depends on how the data points are actually stored in the file.
    >> e.g. it could be decimal ascii-strings, or binary representations.
    >> It could be pre-sorted or indexed to ease finding the relevant data;
    >> it could be variable-length or fixed-size records for each data point.
    >>

    >
    > It is up to me to choose the appropriate format. In fact, part of my
    > task is to decide what format to choose.
    >
    > Most likely, the file will be pre-sorted, and binary with 4-byte
    > floating point data values.
    >
    > I know how to implement the code for accessing such a file, but was
    > hoping to piggyback on other people's contributions.


    The most useful piece that I think already exists is two-dimensional
    indexing. If you are not already familiar with the topic,
    http://en.wikipedia.org/wiki/Kd-tree may be a good starting point giving
    some of the terminology and links to related ideas.

    Very often, searching for the name of an algorithm or data structure
    preceded by "java" finds Java code implementing it.

    Patricia
    Patricia Shanahan, Jun 27, 2008
    #7
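    For readers new to the kd-tree idea, here is a small in-memory sketch of
    a 2D kd-tree with a rectangular range query. It is meant only to show the
    principle: a structure for hundreds of millions of points would have to
    live on disk and be built far more carefully. The class name KdTree2D and
    the double[]{x, y} point layout are assumptions for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Minimal in-memory 2D kd-tree supporting rectangular range queries. */
public class KdTree2D {
    private static final class Node {
        final double[] p; // point as {x, y}
        Node left, right;
        Node(double[] p) { this.p = p; }
    }

    private final Node root;

    public KdTree2D(double[][] points) {
        // Shallow copy so the caller's array order is left untouched.
        root = build(points.clone(), 0, points.length, 0);
    }

    /** Splits on x at even depths and on y at odd depths, using the median point. */
    private static Node build(double[][] pts, int lo, int hi, int depth) {
        if (lo >= hi) return null;
        final int axis = depth % 2;
        Arrays.sort(pts, lo, hi, (a, b) -> Double.compare(a[axis], b[axis]));
        int mid = (lo + hi) >>> 1;
        Node n = new Node(pts[mid]);
        n.left = build(pts, lo, mid, depth + 1);
        n.right = build(pts, mid + 1, hi, depth + 1);
        return n;
    }

    /** Returns all points with x0 <= x <= x1 and y0 <= y <= y1. */
    public List<double[]> range(double x0, double y0, double x1, double y1) {
        List<double[]> out = new ArrayList<>();
        search(root, 0, new double[] {x0, y0}, new double[] {x1, y1}, out);
        return out;
    }

    private static void search(Node n, int depth, double[] min, double[] max,
                               List<double[]> out) {
        if (n == null) return;
        int axis = depth % 2;
        if (n.p[0] >= min[0] && n.p[0] <= max[0]
                && n.p[1] >= min[1] && n.p[1] <= max[1]) {
            out.add(n.p);
        }
        // Descend only into subtrees that can still intersect the rectangle.
        if (min[axis] <= n.p[axis]) search(n.left, depth + 1, min, max, out);
        if (max[axis] >= n.p[axis]) search(n.right, depth + 1, min, max, out);
    }
}
```

    Re-sorting at every level keeps the build code short at the price of
    extra work; a median-selection step would be the usual refinement.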
  8. On Thu, 26 Jun 2008 12:10:37 -0600, Alex Molochnikov
    wrote, quoted or indirectly quoted someone who said:

    >The data is spatial (2D), with hundreds of millions of data points that
    >need to be plotted at various levels of resolution. For low resolution,
    >only a small subset of all data that fits into a limited number of screen
    >pixels should be used (and therefore loaded from the file). For high
    >resolutions, all data points that belong to the given small X-Y region
    >should be loaded.


    Back in the 80s, such databases were handled with special-purpose
    hardware, stuff from Silicon Graphics for example. You might google
    for "mapping software".
    --

    Roedy Green Canadian Mind Products
    The Java Glossary
    http://mindprod.com
    Roedy Green, Jun 28, 2008
    #8