Data Storage Issue (Basic Issue)

Discussion in 'Java' started by Srini, May 9, 2008.

  1. Srini

    Srini Guest

    Which is more efficient? Why??
    Storing or Reading a file in database or on to disk in a web server
    environment? considering all things like network, connections, memory
    are ideal.

    My knowledge says that using database is more efficient but few my
    pierce argue that storing in disk is more efficient when concurrent
    people accessing it...
     
    Srini, May 9, 2008
    #1
    1. Advertising

  2. Srini

    Tom Anderson Guest

    On Fri, 9 May 2008, Srini wrote:

    > Which is more efficient? Why?? Storing or Reading a file in database or
    > on to disk in a web server environment? considering all things like
    > network, connections, memory are ideal.
    >
    > My knowledge says that using database is more efficient but few my
    > pierce argue that storing in disk is more efficient when concurrent
    > people accessing it...


    Databases typically store their data in a file [1]. That means that a
    file-based solution can always be at least as fast as using a database,
    because it can just do what the database does.

    The problem is that to make a file-based solution that's as fast as a good
    database and also provides things like transactionality, you may have to
    write code that's as complex as a database. Which is not good.

    If you're mostly reading your data, so you don't have to worry about
    concurrency and transactionality, and you have a straightforward
    organisation (like having fixed-size records which you can refer to by
    index in a sequence), then you can write a simple file-based
    implementation that should be faster than a database, because it avoids
    the overhead and complexity.

    There's nothing in the java libraries, that i'm aware of, for doing this
    kind of non-database structured file access. There are things for some
    texual formats, like XML and properties files (remember those?!), but
    nothing like DBM or COBOL's record-oriented files. There are third-party
    libraries, though - see Berkeley DB, Java Edition:

    http://www.oracle.com/technology/products/berkeley-db/je/index.html

    and JDBM:

    http://jdbm.sourceforge.net/

    It's also not that hard to write your own fixed-size record manager, and
    not that hard to layer variable-sized records on top of such a thing.

    There's also an excellent trick for using the unix filesystem as a
    database by storing data in symbolic links: the path of a symlink is
    actually an arbitrary text string, so you can store information, rather
    than an actual path, in it. Gives you hierarchically organised,
    string-keyed records of up to a kilobyte (YMMV) without any actual file
    IO!

    A performance question for the wise: i hacked up a little fixed-size
    record manager, and wrote two backends, one using RandomAccessFile, and
    one using a NIO MappedByteBuffer. For both, i provided a way to flush to
    disk after each write - with RandomAccessFile, via getFD().sync(), and
    with MappedByteBuffer with force(). Timings to do a batch of reads and
    writes (100 000 operations, 75% reads, on 10 000 records of 256 bytes
    each; a different random pattern each time, on a machine doing nothing but
    this and playing MP3s):

    Implementation Flush? Time (ms)

    RandomAccessFile no 733
    RandomAccessFile yes 20659
    MappedByteBuffer no 63
    MappedByteBuffer yes 33087

    The mapped file is an order of magnitude faster without flushing, but 50%
    slower with. Any idea why?

    tom

    [1] Okay, so seriously heavyweight ones use disk extents/partitions and
    bypass the filesystem; how much of a difference does that make?

    --
    .... but when you spin it it looks like a dancing foetus!
     
    Tom Anderson, May 9, 2008
    #2
    1. Advertising

  3. Srini

    Tom Anderson Guest

    On Fri, 9 May 2008, Lew wrote:

    > Tom Anderson wrote:
    >
    >> Databases typically store their data in a file [1]. ... [1] Okay, so
    >> seriously heavyweight ones use disk extents/partitions and bypass the
    >> filesystem; how much of a difference does that make?

    >
    > Anecdotally, on a large-scale test, above five-to-one.


    Yikes! That's kind of damning for filesystems. I know they do a lot of
    stuff that a raw disk doesn't but still. Wow.

    tom

    --
    Finals make a man mean; let's fusc up and write!
     
    Tom Anderson, May 10, 2008
    #3
  4. Lew wrote:
    > Eric Sosman wrote:
    >> No. "Suggestive," perhaps, or even "cautionary." But for
    >> my own part I wouldn't confer "very indicative" on a lab result
    >> that cannot be reproduced by an independent experimenter. A
    >> citation of a full report or published paper or something of that
    >> nature might change my mind, but at the moment ... no.

    >
    > OK, that's fine. It was something that convinced me that it worked in
    > that case because I could interact with people who conducted the test.
    > You choose to reject the anecdote because I have no credibility with you.
    >
    > I'm sure the result can be reproduced by an independent experimenter, if
    > they were authorized to use the various proprietary parts.
    >
    > I would not be surprised if more objective results are available to you
    > with a quick Google. Given the amount of time Oracle and others have
    > offered direct disk-management modules, I would be surprised if such
    > results were not available to you. Do let us know what you find, won't
    > you?
    >


    The Transaction Processing Performance Council, http://www.tpc.org,
    creates a series of benchmarks that are strongly dependent on database
    performance, and archives results. It should be possible to work out how
    the disks are managed, at least from the full disclosure reports.

    If anyone is interested in researching this, TPC may be a starting point.

    Patricia
     
    Patricia Shanahan, May 10, 2008
    #4
  5. Srini

    Roedy Green Guest

    On Fri, 9 May 2008 07:37:29 -0700 (PDT), Srini <>
    wrote, quoted or indirectly quoted someone who said :

    >Storing or Reading a file in database or on to disk in a web server
    >environment? considering all things like network, connections, memory
    >are ideal.


    It doesn't matter since the reading and writing are not
    interchangeable.

    Reading is faster since it can be cached with no physical disk i/o.

    In an SQL database with writing you have to do actual physical i/o
    PLUS disk i/o to handle transaction logging and rollback
    --

    Roedy Green Canadian Mind Products
    The Java Glossary
    http://mindprod.com
     
    Roedy Green, May 11, 2008
    #5
  6. Srini

    Tom Anderson Guest

    On Sun, 11 May 2008, Eric Sosman wrote:

    > Lew wrote:
    >> Eric Sosman wrote:
    >>> No. "Suggestive," perhaps, or even "cautionary." But for
    >>> my own part I wouldn't confer "very indicative" on a lab result
    >>> that cannot be reproduced by an independent experimenter. A
    >>> citation of a full report or published paper or something of that
    >>> nature might change my mind, but at the moment ... no.

    >>
    >> OK, that's fine. It was something that convinced me that it worked in that
    >> case because I could interact with people who conducted the test. You
    >> choose to reject the anecdote because I have no credibility with you.

    >
    > No, no, no, Lew: Your credibility is just fine (or my credulity; take
    > it either way). All I'm saying is that the information you've presented
    > is insufficient to support the inference Tom Anderson drew from it
    > ("kind of damning for filesystems"), and that he should consider being a
    > little less hasty in drawing sweeping conclusions from sketchy and
    > incomplete reports.


    I'm not saying filesystems are worthless. Just that when it comes to
    shunting data on and off a disk, the fact that a different method can work
    five times faster than a filesystem means that if you're in the business
    of shunting data on and off a disk as fast as possible, a filesystem is
    probably not a great idea.

    Of course, if there was some particular reason why there was such a big
    difference in this case that doesn't apply in all or most cases, then fair
    enough, i'm over-generalising. I can't think of what that might be,
    though.

    > If someone told you there was a "five-to-one difference" between
    > ArrayList and plain array, would you conclude that this was "kind of
    > damning" for Collections? Of course not. I'm urging Tom not to make
    > the analogous error.


    If someone (credible) told me that there was a five-to-one difference in
    times for an ArrayList.get and an array[], then, if we were in a context
    where performance was important, i certainly would take that as kind of
    damning for collections. I think that's perfectly reasonable.

    tom

    --
    It's almost over now.
     
    Tom Anderson, May 11, 2008
    #6
  7. Lew wrote:
    > From what I've been reading and hearing from DBAs (disclaimer: I am but
    > a lowly programmer myself, not a DBA), journaling file syatems are not
    > the key to safety in big-iron DMBSes like PostgreSQL and Oracle.


    I don't think PostgreSQL is big-iron.

    Oracle, DB2 and to some extent Sybase and SQLServer are
    rather dominant in that area.

    Arne
     
    Arne Vajhøj, May 12, 2008
    #7
  8. Lew wrote:
    > Tom Anderson wrote:
    >> Databases typically store their data in a file [1]. ...
    >> [1] Okay, so seriously heavyweight ones use disk extents/partitions
    >> and bypass the filesystem; how much of a difference does that make?

    >
    > Anecdotally, on a large-scale test, above five-to-one.


    I doubt that.

    There are obviously overhead in the file system, but 400%
    overhead in writing to a pre allocated file isntead of a raw
    partition does not make any sense.

    (if the file needs to be extended during write, then it sounds
    very plausible, but that is not a fair comparison)

    My impression is that raw partitions was not preferred for
    performance but to ensure that there were no file systems
    cached not being flushed to the plates.

    Arne
     
    Arne Vajhøj, May 12, 2008
    #8
  9. Srini

    Arne Vajhøj Guest

    Srini wrote:
    > Which is more efficient? Why??
    > Storing or Reading a file in database or on to disk in a web server
    > environment? considering all things like network, connections, memory
    > are ideal.
    >
    > My knowledge says that using database is more efficient but few my
    > pierce argue that storing in disk is more efficient when concurrent
    > people accessing it...


    There is a myth among many web developers that storing files
    in database kills performance.

    And my guess is that ASP accessing an Access 95 database on
    a Pentium running NT 4 possible did.

    But things are different today. With some good hardware and
    a good database, then you should be able to get good
    performance out if storing files in the database.

    And it is so much easier to ensure data integrity and
    administer the database solution.

    I once did some experiments on the topic. The results are
    attached below.

    They are ASP.NET against MySQL, but technologies should
    not matter much.

    Arne

    ==================================================

    File (1 threads): 1,9 get per second
    File (10 threads): 2,2 get per second
    File with web app cache (1 threads): 6,6 get per second
    File with web app cache (10 threads): 16,1 get per second
    File directly by web server (1 threads): 17,6 get per second
    File directly by web server (10 threads): 20,1 get per second
    Database (1 threads): 10,7 get per second
    Database (10 threads): 11,8 get per second
    Database with web app cache (1 threads): 15,4 get per second
    Database with web app cache (10 threads): 19,6 get per second
    File (1 threads): 1,9 get per second
    File (10 threads): 2,2 get per second
    File with web app cache (1 threads): 16,7 get per second
    File with web app cache (10 threads): 19,4 get per second
    File directly by web server (1 threads): 17,1 get per second
    File directly by web server (10 threads): 20 get per second
    Database (1 threads): 10,7 get per second
    Database (10 threads): 11,9 get per second
    Database with web app cache (1 threads): 17,3 get per second
    Database with web app cache (10 threads): 19,6 get per second
    File (1 threads): 1,9 get per second
    File (10 threads): 2,1 get per second
    File with web app cache (1 threads): 16,8 get per second
    File with web app cache (10 threads): 17,8 get per second
    File directly by web server (1 threads): 17,6 get per second
    File directly by web server (10 threads): 20,1 get per second
    Database (1 threads): 10,7 get per second
    Database (10 threads): 11,6 get per second
    Database with web app cache (1 threads): 17,1 get per second
    Database with web app cache (10 threads): 19,5 get per second
     
    Arne Vajhøj, May 12, 2008
    #9
  10. Lew wrote:
    > Tom Anderson wrote:
    >> If someone (credible) told me that there was a five-to-one difference
    >> in times for an ArrayList.get and an array[], then, if we were in a
    >> context where performance was important, i certainly would take that
    >> as kind of damning for collections. I think that's perfectly reasonable.

    >
    > Eric is correct that you should not generalize from a single data point,
    > especially when full details are not available for that one case. I
    > presented the anecdote not to prove anything but to provide evidence, at
    > least that the matter is worth pursuing more objectively.
    >
    > My comment that I lack credibility was not about personality but about
    > evidence and what makes it valid. I provided no hard data, nor enough
    > information about the anecdote to make it something on which you can
    > base conclusions. By hard standards, I indeed do lack credibility to
    > assert that direct-to-disk is five times faster. OTOH, it is enough to
    > share one experience that indicates that the claim is potentially
    > supportable, just maybe.


    It depends on how much effort you're willing to put in. A major reason
    for databases running faster when off the file system, for example, is
    doing things like putting SQL tables into a continuous range of
    cylinders, and putting the lowest level of the primary index on track
    zero of those same cylinders. Then, to find a given record, you go
    through the high-level parts of the index tree, which are probably
    cached, anyway; move the access arm to the correct cylinder; read the
    low-level index record from track zero (perhaps it is replicated as many
    times as it will fit on the track, so that it comes up as fast as
    possible); and then read the data without moving the access arm again.
    That gives you indexed access only imperceptibly slower than hashed.

    Much-larger-than-RAM sorts can use similar tricks, such as merging
    RAM-sorted strings from the top half of the disk to the bottom half and
    then back again, until the final merge to the target.

    IBM mainframes have used methods like this since the mid 1960s.

    If, on the other hand, you're just doing the equivalent of reading a
    FileInputStream or writing a FileOutputStream, stick with the file
    system. Moving it off will accomplish almost nothing.

    --
    John W. Kennedy
    "Give up vows and dogmas, and fixed things, and you may grow like
    That. ...you may come to think a blow bad, because it hurts, and not
    because it humiliates. You may come to think murder wrong, because it
    is violent, and not because it is unjust."
    -- G. K. Chesterton. "The Ball and the Cross"
     
    John W Kennedy, May 14, 2008
    #10
  11. Srini

    Mark Space Guest

    Eric Sosman wrote:
    > Lew wrote:
    >> In this case, no. If you read up on Oracle's direct-disk management
    >> it is not dumb, nor is its file-system interaction. None of the big
    >> players are stupid with file-system access, be they Oracle, DB2,
    >> Postgres or whomever. [...]

    >
    > I see you've not encountered Lotus "databases" ...
    >


    Just out of morbid curiosity, does anyone know where MS SQL Server falls
    on this spectrum?
     
    Mark Space, May 15, 2008
    #11
  12. Mark Space wrote:
    > Eric Sosman wrote:
    >> Lew wrote:
    >>> In this case, no. If you read up on Oracle's direct-disk management
    >>> it is not dumb, nor is its file-system interaction. None of the big
    >>> players are stupid with file-system access, be they Oracle, DB2,
    >>> Postgres or whomever. [...]

    >>
    >> I see you've not encountered Lotus "databases" ...

    >
    > Just out of morbid curiosity, does anyone know where MS SQL Server falls
    > on this spectrum?


    Oracle DB, IBM DB2, MS SQLServer, Sybase ASE all support databases
    both in file system and on raw partitions.

    I don't thing raw partitions are much used for MS SQLServer. It is
    in many ways a *nix thingy.

    Arne
     
    Arne Vajhøj, Jun 1, 2008
    #12
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Karl
    Replies:
    1
    Views:
    499
    Craig Deelsnyder
    Sep 6, 2004
  2. Curt tabor

    Application[] storage issue

    Curt tabor, Nov 4, 2005, in forum: ASP .Net
    Replies:
    2
    Views:
    411
    Curt tabor
    Nov 4, 2005
  3. Engineer
    Replies:
    6
    Views:
    678
    Jeremy Bowers
    May 1, 2005
  4. Replies:
    0
    Views:
    477
  5. sarathy
    Replies:
    2
    Views:
    696
    sarathy
    Jul 17, 2006
Loading...

Share This Page