should I use a database or a flat file?

James Dinkel · Apr 1, 2008

I need to store some information with my ruby program and I am not sure
on what would be the best method. I'm mostly concerned about what would
be the most efficient use of cpu resources.

Basically, I will have a list of names each belonging to one of 5
categories. Sort of like this:

Cat1
-name1
-name2
-name3
-etc...

Cat2
-name4
-name5
-name6
-etc...

Cat3
-name7
-name8
-name9
-etc...

There will be hundreds of names, evenly divided between the categories.
But each name will go in only one category, there is no relation between
categories or anything like that. All the information will be
completely rewritten once a day and then read several times throughout
the day.

My choices for storage are an sqlite database (using ActiveRecord), a
flat text file of my own design, a YAML file, or an XML file.

Robert Klemme · Apr 1, 2008

2008/4/1, James Dinkel said:
I need to store some information with my ruby program and I am not sure
on what would be the best method. I'm mostly concerned about what would
be the most efficient use of cpu resources.

Basically, I will have a list of names each belonging to one of 5
categories. Sort of like this:

Cat1
-name1
-name2
-name3
-etc...

Cat2
-name4
-name5
-name6
-etc...

Cat3
-name7
-name8
-name9
-etc...

There will be hundreds of names, evenly divided between the categories.

That's not much. I'd probably use XML - but that also depends on what
generates the data and what needs to be able to read it. You can
generate and read it efficiently (using a stream parser, for
example, but that seems unnecessary for only hundreds of names).

But ultimately it depends on what you want to do with the data. In
some cases a DB might be a better choice, for example if your volume
is going to increase dramatically.

But each name will go in only one category, there is no relation between
categories or anything like that. All the information will be
completely rewritten once a day and then read several times throughout
the day.

My choices for storage are an sqlite database (using ActiveRecord), a
flat text file of my own design, a YAML file, or an XML file.

YAML is another nice alternative because it is human readable. And
you can use Marshal if producer and consumer of the data are Ruby
programs.

Kind regards

robert

Lionel Bouton · Apr 1, 2008

IMHO databases are best when you have concurrent access to data being
modified regularly and want to enforce constraints during concurrent
write accesses.

In your case, the data is mostly static and constraints are easily
handled outside the storage layer (you overwrite all data with another
consistent version in one pass). I'd advise using the simplest storage
method, which is probably a YAML dump of an object holding all this data.

Marshal.dump/load is an option too. It may be faster than YAML if this
matters to you (I've not benchmarked it, so you'd better do so if you need
fast read/write). It's not human-readable, which can be a drawback when
debugging.
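
For anyone who wants to measure this rather than guess, a minimal sketch using Ruby's standard Benchmark module might look like the following. The data layout (a hash mapping category names to arrays of names), the sizes, and the repetition count are assumptions made for illustration, not the OP's actual data, and the timings will vary by machine and Ruby version.

```ruby
require 'yaml'
require 'benchmark'

# Hypothetical data: 5 categories with 200 names each.
data = {}
5.times do |c|
  data["Cat#{c + 1}"] = (1..200).map { |n| "name#{c * 200 + n}" }
end

yaml_text    = YAML.dump(data)
marshal_blob = Marshal.dump(data)

# Both formats should round-trip the plain hash unchanged.
yaml_copy    = YAML.load(yaml_text)
marshal_copy = Marshal.load(marshal_blob)

runs = 20 # arbitrary repetition count, chosen only to get measurable times
Benchmark.bm(13) do |bm|
  bm.report("YAML dump")    { runs.times { YAML.dump(data) } }
  bm.report("YAML load")    { runs.times { YAML.load(yaml_text) } }
  bm.report("Marshal dump") { runs.times { Marshal.dump(data) } }
  bm.report("Marshal load") { runs.times { Marshal.load(marshal_blob) } }
end
```

Keeping the stored object a plain hash of arrays and strings means either format can be swapped in later without touching the rest of the script.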

That was the code/integration complexity side of your problem.

For the performance side of the problem:

If you dump your data to a temporary file and then rename it to
overwrite the final destination, you can use a neat hack for long-running
processes needing fresh data: you can design a little cache that
checks the mtime of the backing store (the final destination) on read
accesses and reloads it when it changes.
mtime checks are cheap and simple to code, and if the need arises for
really high throughput you can minimize them by adding TTL logic.
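
The write-then-rename idea and the mtime-checking cache could be sketched roughly as below. The names `write_atomically` and `FileCache`, the temporary-file naming scheme, and the file name `names.yml` are all hypothetical, and the sketch assumes the temporary file lives on the same filesystem as the destination (which is what makes the rename atomic on POSIX systems).

```ruby
require 'yaml'
require 'tmpdir'

# Write to a temporary file next to the destination, then rename it over
# the destination, so a reader sees either the old file or the new one.
def write_atomically(path, data)
  tmp = "#{path}.tmp.#{Process.pid}"
  File.open(tmp, "w") { |io| io.write(YAML.dump(data)) }
  File.rename(tmp, path)
end

# Hypothetical cache: re-reads the file only when its mtime has changed.
class FileCache
  attr_reader :loads # number of times the file was actually parsed

  def initialize(path)
    @path  = path
    @mtime = nil
    @data  = nil
    @loads = 0
  end

  def data
    mtime = File.mtime(@path)
    if mtime != @mtime
      @data  = YAML.load(File.read(@path))
      @mtime = mtime
      @loads += 1
    end
    @data
  end
end

# Usage with a throwaway directory.
dir  = Dir.mktmpdir
path = File.join(dir, "names.yml")
write_atomically(path, "Cat1" => %w[name1 name2])

cache = FileCache.new(path)
cache.data # first access parses the file
cache.data # second access only checks the mtime
```

One caveat of this sketch: if the file is rewritten twice within the filesystem's timestamp resolution, the second change can go unnoticed. With a once-a-day rewrite that should not matter.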

Lionel

James Dinkel · Apr 1, 2008

But ultimately it depends on what you want to do with the data.

yeah, it's kinda hard to describe without just posting my entire script,
which I doubt people will want to read.

The data will be accessed by one ruby script, running on one computer.
The data will be read in, then the file closed and done for a couple
hours. So no concurrent access, no relations, no keeping the connection
open for extended periods of time, which is why I thought a database
would probably be overkill and just add overhead.

But I didn't know if maybe reading a file into memory would take more
effort than reading entries from a database. Also, I was a little off
on the numbers, I meant to say that there are hundreds of names per
category, so total names could be over a thousand. That size will
likely never ever change beyond +/- 100 at the most.

Thanks for the info. I'm really a newb at this, so any thoughts on
storing data using any of these methods is helpful.

James.

Zundra Daniel · Apr 1, 2008

Seems like the type of problem that's perfect for YAML.

Todd Benson · Apr 1, 2008

I'm going to slightly disagree with Lionel -- and also Robert -- on
this one. First of all, a database is not necessarily just for
concurrency. It's for data integrity and allows the ability to build
reports on that data that you can trust because of the strict nature
of the underlying data store (I'm talking about RDBMS, but I've kept
my eyes open about OO databases as well; stay away from Pick,
though!!).

Here's the problem with relational databases, though (RDBMSs): it's
hard to model a hierarchy (which you can pull off somewhat clumsily
with XML).

If you are not going to do serious queries and inserts on the db, and
your data isn't complex, then a flat file approach might work. It
works, after all, for software builds. I strongly recommend against
it in higher languages, though, even for small apps. And, no, I am
not a database vendor.

I always tell people they should learn SQL, but nowadays I'm getting a
cold shoulder, especially with OO people

The other important thing that I've noticed about data and storage is:
what do you want to do with it and how often? Store it, query it (and
how), add to it, move it around, archive it, etc. These are important
factors to consider.

Todd

Kyle Schmitt · Apr 1, 2008

Oh wait, Lionel already suggested that.

Kyle Schmitt · Apr 1, 2008

Don't forget: you could put the data into a hash, and marshal it to
disk. Not a DB, but better than a flat file!

Lionel Bouton · Apr 1, 2008

Todd said:
I'm going to slightly disagree with Lionel -- and also Robert -- on
this one. First of all, a database is not necessarily just for
concurrency. It's for data integrity

Yes I agree (as explained below, concurrency is what I consider the main
problem to solve to enforce data integrity). That said, if you write your
data in one pass as the OP does, you don't need data integrity in the storage
layer... rename is atomic: you either renamed the temp file to its
final position before a crash or not.

The problem is partial updates where you need to maintain consistency.
And off the top of my head the only problems with partial updates are:
- concurrent accesses (most common, counting both concurrent read and
write accesses),
- crashes (fortunately less common and can even be addressed by backups
in many cases).

This is why I disagree with people wanting to push all the consistency
logic into the application layer on database-backed applications with
concurrent access (as is often advocated for Rails). It's simply not
doable without recoding the whole concurrent access manager and
log-based/MVCC/... crash resistance of the database in the application
layer (good luck with that).

Lionel.

Todd Benson · Apr 1, 2008

Maybe we are talking about different things. By data integrity, I
mean you can be certain not just that the data was entered correctly,
but also that it coincides with the relationships present. In a
modified version of the OP's model, for example...

Cat1
-name1
-name2
-name3
-etc...

Cat2
-name4
-name5
-name6
-etc...

Cat3
-name1
-name2
-name3
etc...

Note the same names, but in different categories.

Now, surely, you can say, "Well, the application logic will take care
of that ambiguity." But I say we should continue to separate
application logic from data logic.

I'm no CS guy, so I don't know the correct terms for this, but I do
see the potential pratfalls.

There certainly is a time and place for this, but I've found its
usefulness generally not that beneficial.

Todd

Todd Benson · Apr 1, 2008

Sorry Lionel; missed the OP's "But each name will go in only one
category". I do still think it wouldn't be that bad to use a DB.

Todd

Joel VanderWerf · Apr 1, 2008

James said:
I need to store some information with my ruby program and I am not sure
on what would be the best method. I'm mostly concerned about what would
be the most efficient use of cpu resources.

One option is FSDB[1] (file-system database), with one file per
"category", and each file stored as YAML. This scales as well as your
file system scales, is always human-readable, and should be fairly
efficient. (It's thread and process safe too, not that it matters for
your app.)

For example:

require 'fsdb'
require 'yaml'

db = FSDB::Database.new "~/tmp/my_data"
db.formats = [FSDB::YAML_FORMAT] + db.formats

3.times do |i|
  db["Cat#{i}.yml"] = %w{
    name1
    name2
    name3
  }
end

path = "Cat1.yml"

puts "Here's the object:"
puts "=================="
p db[path]
puts "=================="
puts

puts "Here's the file:"
puts "=================="
puts File.read(File.join(db.dir, path))
puts "=================="
puts

and this is the output:

Here's the object:
==================
["name1", "name2", "name3"]
==================

Here's the file:
==================
---
- name1
- name2
- name3
==================

The dir structure looks like this:

[~/tmp] ls my_data
Cat0.yml Cat1.yml Cat2.yml

[1] http://redshift.sourceforge.net/fsdb

Shawn Anderson · Apr 1, 2008

I was thinking that maybe the OP could use something like KirbyBase.
I've used it before, and it allows the code to stay very portable because
kirbybase is just ruby code.

You can locate it here:
http://rubyforge.org/projects/kirbybase

/Shawn

Robert Klemme · Apr 1, 2008

Exactly. With regard to all that we've learned about the issue at
hand a DB seems overkill here. KISS

Totally agree - but this is another story.

Maybe we are talking about different things. By data integrity, I
mean you can be certain not just that the data was entered correctly,
but also that it coincides with the relationships present. In a
modified version of the OP's model, for example...

Now, surely, you can say, "Well, the application logic will take care
of that ambiguity." But I say we should continue to separate
application logic from data logic.

But the consistency needs to be /somewhere/ and if no database is
needed then enforcing it in app logic is certainly ok.

I'm no CS guy, so I don't know the correct terms for this, but I do
see the potential pratfalls.

There certainly is a time and place for this, but I've found its
usefulness generally not that beneficial.

What is "this" in this paragraph?

Generally I do not think we're far away - if at all. Given the scale
of the problem and the apparent lack of future extension with regard
to size, complexity and concurrency a simple solution suffices IMHO.
Of course it's good to know the options - that's why we discuss here.

Kind regards

robert

A modified version of the script since the other posting did not seem
to make it into usenet. This one has consistency check as originally
required:

#!/usr/bin/env ruby

require 'set'
require 'yaml'

# Names grouped by category; a name may belong to only one category.
class CatNames
  def self.load(file_name)
    # On Ruby 3.1+ YAML.load refuses custom classes by default, so use
    # unsafe_load where it exists (only for files you wrote yourself).
    File.open(file_name) do |io|
      YAML.respond_to?(:unsafe_load) ? YAML.unsafe_load(io) : YAML.load(io)
    end
  end

  def save(file_name)
    File.open(file_name, "w") {|io| YAML.dump(self, io)}
  end

  def initialize
    @cat = {}   # category => Set of names
    @all = {}   # name => the Set it lives in
  end

  def add(cat, name)
    raise "Consistency Error" if @all[name]
    s = (@cat[cat] ||= Set.new)
    s << name
    @all[name] = s
  end

  def remove(cat, name)
    c = @cat[cat] and c.delete name
    @all.delete name
  end

  def clear
    @cat.clear
    @all.clear
  end

  def size
    @cat.inject(0) {|sum,(name,set)| sum + set.size}
  end
end

t = Time.now

d = CatNames.new

1000.times do |i|
  d.add("cat#{i % 10}", "name#{i}")
end

puts d.size

tt = Time.now
printf "%6.3f %s\n", tt-t, "create"
t = tt

d.save "test.yaml"

tt = Time.now
printf "%6.3f %s\n", tt-t, "write"
t = tt

d2 = CatNames.load "test.yaml"

tt = Time.now
printf "%6.3f %s\n", tt-t, "load"
t = tt

# Adding an existing name must trip the consistency check.
begin
  d2.add "foo", "name0"
rescue => e
  puts e
end

James Dinkel · Apr 1, 2008

Wow, this has been a very good discussion. Feel free to keep
discussing, but, being as I'm the OP, I just thought I would let you
know that I think I will go with YAML for this case.

Todd Benson · Apr 1, 2008

Exactly. With regard to all that we've learned about the issue at
hand a DB seems overkill here. KISS

I admit, I tend to like using a sledgehammer to turn a machine screw,
but in that respect, I'm usually thinking of scalability and data
integrity.

When I said "there's a time and place for this", "this" was referring
to the various forms of flat file storage.

With this particular situation, I would probably go with YAML, and
migrate to a database if need be (which shouldn't be that hard,
depending on how deeply nested the data is).

Todd

Phillip Gawlowski · Apr 1, 2008

Todd Benson wrote:
| On Tue, Apr 1, 2008 at 2:35 PM, Robert Klemme
|
|> Exactly. With regard to all that we've learned about the issue at
|> hand a DB seems overkill here. KISS
|
| I admit, I tend to like using a sledgehammer to turn a machine screw,
| but in that respect, I'm usually thinking of scalability and data
| integrity.

Why use a sledgehammer, when you can use the surgeon's knife, SQLite?

That's the RDBMS I'd use, if I were using a SQL DB in this situation.

It doesn't always have to be PostgreSQL or Oracle.

-- Phillip Gawlowski

Todd Benson · Apr 1, 2008

Todd Benson wrote:
| On Tue, Apr 1, 2008 at 2:35 PM, Robert Klemme
|
|> Exactly. With regard to all that we've learned about the issue at
|> hand a DB seems overkill here. KISS
|
| I admit, I tend to like using a sledgehammer to turn a machine screw,
| but in that respect, I'm usually thinking of scalability and data
| integrity.

Why use a sledge hammer, when you can use the surgeon's knife SQLite?

Well, "sledgehammer" was for humor. A better analogy of my approach
would be this darn overly large swiss army knife that doesn't always
fit comfortably in my pocket, but I wear it any way just in case.

My only problem with SQLite is lack of foreign key constraints.

cheers,
Todd

Phillip Gawlowski · Apr 1, 2008

Todd Benson wrote:

|
| Well, "sledgehammer" was for humor.

Duly recognized, just ignored to stretch the metaphor to its breaking
point.

I didn't mean to imply that the analogy was devoid of humor.

| A better analogy of my approach would be this darn overly large
| swiss army knife that doesn't always fit comfortably in my pocket,
| but I wear it any way just in case.

Well, a Leatherman would be my cultural weapon of choice.</discworld
reference>

| My only problem with SQLite is lack of foreign key constraints.

Which, last I heard, is in the works.

However, the zero-config approach suits very well for rapid development,
and, at least, prototyping.

And with ORM tools like Sequel or Og, details like the specific DB
become less of a concern, too.

-- Phillip Gawlowski

Todd Benson · Apr 1, 2008

On Tue, Apr 1, 2008 at 5:29 PM, Phillip Gawlowski

Well, a Leatherman would be my cultural weapon of choice.</discworld
reference>

Right there with you, in real life. I like the swiss army reference
because those things can be so huge! Leatherman, absolutely. Slim
and useful

Todd
