Data Storage Issue (Basic Issue)

Srini · May 9, 2008

Which is more efficient? Why??
Storing or Reading a file in database or on to disk in a web server
environment? considering all things like network, connections, memory
are ideal.

My knowledge says that using database is more efficient but few my
pierce argue that storing in disk is more efficient when concurrent
people accessing it...

Tom Anderson · May 9, 2008

Which is more efficient? Why?? Storing or Reading a file in database or
on to disk in a web server environment? considering all things like
network, connections, memory are ideal.

My knowledge says that using database is more efficient but few my
pierce argue that storing in disk is more efficient when concurrent
people accessing it...

Databases typically store their data in a file [1]. That means that a
file-based solution can always be at least as fast as using a database,
because it can just do what the database does.

The problem is that to make a file-based solution that's as fast as a good
database and also provides things like transactionality, you may have to
write code that's as complex as a database. Which is not good.

If you're mostly reading your data, so you don't have to worry about
concurrency and transactionality, and you have a straightforward
organisation (like having fixed-size records which you can refer to by
index in a sequence), then you can write a simple file-based
implementation that should be faster than a database, because it avoids
the overhead and complexity.

There's nothing in the java libraries, that i'm aware of, for doing this
kind of non-database structured file access. There are things for some
texual formats, like XML and properties files (remember those?!), but
nothing like DBM or COBOL's record-oriented files. There are third-party
libraries, though - see Berkeley DB, Java Edition:

http://www.oracle.com/technology/products/berkeley-db/je/index.html

and JDBM:

http://jdbm.sourceforge.net/

It's also not that hard to write your own fixed-size record manager, and
not that hard to layer variable-sized records on top of such a thing.

There's also an excellent trick for using the unix filesystem as a
database by storing data in symbolic links: the path of a symlink is
actually an arbitrary text string, so you can store information, rather
than an actual path, in it. Gives you hierarchically organised,
string-keyed records of up to a kilobyte (YMMV) without any actual file
IO!

A performance question for the wise: i hacked up a little fixed-size
record manager, and wrote two backends, one using RandomAccessFile, and
one using a NIO MappedByteBuffer. For both, i provided a way to flush to
disk after each write - with RandomAccessFile, via getFD().sync(), and
with MappedByteBuffer with force(). Timings to do a batch of reads and
writes (100 000 operations, 75% reads, on 10 000 records of 256 bytes
each; a different random pattern each time, on a machine doing nothing but
this and playing MP3s):

Implementation Flush? Time (ms)

RandomAccessFile no 733
RandomAccessFile yes 20659
MappedByteBuffer no 63
MappedByteBuffer yes 33087

The mapped file is an order of magnitude faster without flushing, but 50%
slower with. Any idea why?

tom

[1] Okay, so seriously heavyweight ones use disk extents/partitions and
bypass the filesystem; how much of a difference does that make?

Tom Anderson · May 10, 2008

Tom said:
Tom said:

Databases typically store their data in a file [1]. ... [1] Okay, so
seriously heavyweight ones use disk extents/partitions and bypass the
filesystem; how much of a difference does that make?

Click to expand...

Anecdotally, on a large-scale test, above five-to-one.

Yikes! That's kind of damning for filesystems. I know they do a lot of
stuff that a raw disk doesn't but still. Wow.

tom

Patricia Shanahan · May 10, 2008

Lew said:
OK, that's fine. It was something that convinced me that it worked in
that case because I could interact with people who conducted the test.
You choose to reject the anecdote because I have no credibility with you.

I'm sure the result can be reproduced by an independent experimenter, if
they were authorized to use the various proprietary parts.

I would not be surprised if more objective results are available to you
with a quick Google. Given the amount of time Oracle and others have
offered direct disk-management modules, I would be surprised if such
results were not available to you. Do let us know what you find, won't
you?

The Transaction Processing Performance Council, http://www.tpc.org,
creates a series of benchmarks that are strongly dependent on database
performance, and archives results. It should be possible to work out how
the disks are managed, at least from the full disclosure reports.

If anyone is interested in researching this, TPC may be a starting point.

Patricia

Roedy Green · May 10, 2008

Storing or Reading a file in database or on to disk in a web server
environment? considering all things like network, connections, memory
are ideal.

It doesn't matter since the reading and writing are not
interchangeable.

Reading is faster since it can be cached with no physical disk i/o.

In an SQL database with writing you have to do actual physical i/o
PLUS disk i/o to handle transaction logging and rollback

Tom Anderson · May 11, 2008

No, no, no, Lew: Your credibility is just fine (or my credulity; take
it either way). All I'm saying is that the information you've presented
is insufficient to support the inference Tom Anderson drew from it
("kind of damning for filesystems"), and that he should consider being a
little less hasty in drawing sweeping conclusions from sketchy and
incomplete reports.

I'm not saying filesystems are worthless. Just that when it comes to
shunting data on and off a disk, the fact that a different method can work
five times faster than a filesystem means that if you're in the business
of shunting data on and off a disk as fast as possible, a filesystem is
probably not a great idea.

Of course, if there was some particular reason why there was such a big
difference in this case that doesn't apply in all or most cases, then fair
enough, i'm over-generalising. I can't think of what that might be,
though.

If someone told you there was a "five-to-one difference" between
ArrayList and plain array, would you conclude that this was "kind of
damning" for Collections? Of course not. I'm urging Tom not to make
the analogous error.

If someone (credible) told me that there was a five-to-one difference in
times for an ArrayList.get and an array[], then, if we were in a context
where performance was important, i certainly would take that as kind of
damning for collections. I think that's perfectly reasonable.

tom

Arne VajhÃ¸j · May 11, 2008

Lew said:
From what I've been reading and hearing from DBAs (disclaimer: I am but
a lowly programmer myself, not a DBA), journaling file syatems are not
the key to safety in big-iron DMBSes like PostgreSQL and Oracle.

I don't think PostgreSQL is big-iron.

Oracle, DB2 and to some extent Sybase and SQLServer are
rather dominant in that area.

Arne

Arne VajhÃ¸j · May 11, 2008

Lew said:
Tom said:

Databases typically store their data in a file [1]. ...
[1] Okay, so seriously heavyweight ones use disk extents/partitions
and bypass the filesystem; how much of a difference does that make?

Click to expand...

Anecdotally, on a large-scale test, above five-to-one.

I doubt that.

There are obviously overhead in the file system, but 400%
overhead in writing to a pre allocated file isntead of a raw
partition does not make any sense.

(if the file needs to be extended during write, then it sounds
very plausible, but that is not a fair comparison)

My impression is that raw partitions was not preferred for
performance but to ensure that there were no file systems
cached not being flushed to the plates.

Arne

Arne Vajhøj · May 11, 2008

Srini said:
Which is more efficient? Why??
Storing or Reading a file in database or on to disk in a web server
environment? considering all things like network, connections, memory
are ideal.

My knowledge says that using database is more efficient but few my
pierce argue that storing in disk is more efficient when concurrent
people accessing it...

There is a myth among many web developers that storing files
in database kills performance.

And my guess is that ASP accessing an Access 95 database on
a Pentium running NT 4 possible did.

But things are different today. With some good hardware and
a good database, then you should be able to get good
performance out if storing files in the database.

And it is so much easier to ensure data integrity and
administer the database solution.

I once did some experiments on the topic. The results are
attached below.

They are ASP.NET against MySQL, but technologies should
not matter much.

Arne

==================================================

File (1 threads): 1,9 get per second
File (10 threads): 2,2 get per second
File with web app cache (1 threads): 6,6 get per second
File with web app cache (10 threads): 16,1 get per second
File directly by web server (1 threads): 17,6 get per second
File directly by web server (10 threads): 20,1 get per second
Database (1 threads): 10,7 get per second
Database (10 threads): 11,8 get per second
Database with web app cache (1 threads): 15,4 get per second
Database with web app cache (10 threads): 19,6 get per second
File (1 threads): 1,9 get per second
File (10 threads): 2,2 get per second
File with web app cache (1 threads): 16,7 get per second
File with web app cache (10 threads): 19,4 get per second
File directly by web server (1 threads): 17,1 get per second
File directly by web server (10 threads): 20 get per second
Database (1 threads): 10,7 get per second
Database (10 threads): 11,9 get per second
Database with web app cache (1 threads): 17,3 get per second
Database with web app cache (10 threads): 19,6 get per second
File (1 threads): 1,9 get per second
File (10 threads): 2,1 get per second
File with web app cache (1 threads): 16,8 get per second
File with web app cache (10 threads): 17,8 get per second
File directly by web server (1 threads): 17,6 get per second
File directly by web server (10 threads): 20,1 get per second
Database (1 threads): 10,7 get per second
Database (10 threads): 11,6 get per second
Database with web app cache (1 threads): 17,1 get per second
Database with web app cache (10 threads): 19,5 get per second

John W Kennedy · May 14, 2008

Lew said:
Tom said:

If someone (credible) told me that there was a five-to-one difference
in times for an ArrayList.get and an array[], then, if we were in a
context where performance was important, i certainly would take that
as kind of damning for collections. I think that's perfectly reasonable.

Click to expand...

Eric is correct that you should not generalize from a single data point,
especially when full details are not available for that one case. I
presented the anecdote not to prove anything but to provide evidence, at
least that the matter is worth pursuing more objectively.

My comment that I lack credibility was not about personality but about
evidence and what makes it valid. I provided no hard data, nor enough
information about the anecdote to make it something on which you can
base conclusions. By hard standards, I indeed do lack credibility to
assert that direct-to-disk is five times faster. OTOH, it is enough to
share one experience that indicates that the claim is potentially
supportable, just maybe.

It depends on how much effort you're willing to put in. A major reason
for databases running faster when off the file system, for example, is
doing things like putting SQL tables into a continuous range of
cylinders, and putting the lowest level of the primary index on track
zero of those same cylinders. Then, to find a given record, you go
through the high-level parts of the index tree, which are probably
cached, anyway; move the access arm to the correct cylinder; read the
low-level index record from track zero (perhaps it is replicated as many
times as it will fit on the track, so that it comes up as fast as
possible); and then read the data without moving the access arm again.
That gives you indexed access only imperceptibly slower than hashed.

Much-larger-than-RAM sorts can use similar tricks, such as merging
RAM-sorted strings from the top half of the disk to the bottom half and
then back again, until the final merge to the target.

IBM mainframes have used methods like this since the mid 1960s.

If, on the other hand, you're just doing the equivalent of reading a
FileInputStream or writing a FileOutputStream, stick with the file
system. Moving it off will accomplish almost nothing.

--
John W. Kennedy
"Give up vows and dogmas, and fixed things, and you may grow like
That. ...you may come to think a blow bad, because it hurts, and not
because it humiliates. You may come to think murder wrong, because it
is violent, and not because it is unjust."
-- G. K. Chesterton. "The Ball and the Cross"

Mark Space · May 14, 2008

Eric said:
Lew said:

In this case, no. If you read up on Oracle's direct-disk management
it is not dumb, nor is its file-system interaction. None of the big
players are stupid with file-system access, be they Oracle, DB2,
Postgres or whomever. [...]

Click to expand...

I see you've not encountered Lotus "databases" ...

Just out of morbid curiosity, does anyone know where MS SQL Server falls
on this spectrum?

Arne VajhÃ¸j · May 31, 2008

Mark said:
Eric said:

Lew said:

In this case, no. If you read up on Oracle's direct-disk management
it is not dumb, nor is its file-system interaction. None of the big
players are stupid with file-system access, be they Oracle, DB2,
Postgres or whomever. [...]

Click to expand...

I see you've not encountered Lotus "databases" ...

Click to expand...

Just out of morbid curiosity, does anyone know where MS SQL Server falls
on this spectrum?

Oracle DB, IBM DB2, MS SQLServer, Sybase ASE all support databases
both in file system and on raw partitions.

I don't thing raw partitions are much used for MS SQLServer. It is
in many ways a *nix thingy.

Arne

How to Make CSV Contact Files Work Seamlessly Across All Smartphones?	0	Sep 17, 2025
Raspberry Pi Open Source PLC Communication Wonder LECPython, and Example of Communication with Omron PLC	0	Oct 9, 2024
Expert Guide to Convert MBOX to PST File Manually in 2025	7	Dec 1, 2024
Data Structure Issue	5	Jul 4, 2007
key/value store optimized for disk storage	23	May 2, 2012
Solutions for data storage?	5	Jan 19, 2005
Storage of symmetric boolean matrix	9	Jun 27, 2009
I'm tempted to quit out of frustration	1	Aug 13, 2023

Data Storage Issue (Basic Issue)

Srini

Tom Anderson

Tom Anderson

Patricia Shanahan

Roedy Green

Tom Anderson

Arne VajhÃ¸j

Arne VajhÃ¸j

Arne Vajhøj

John W Kennedy

Mark Space

Arne VajhÃ¸j

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads