When to use a db?

Moya Pilot

Dumb question, but I am reading in a constant stream of lines via
fastercsv and pushing them into an array. Each line is either a new
object or will update some information in an existing object. 98% of
all lines will be an update to an object. I will also be doing a lot of
lookups on the objects to see if X exists somewhere. I don't have too
much experience with this stuff, but I have read and been told by a
couple of people that if I am just iterating repeatedly, updating, and
doing a bunch of lookups on objects, then perhaps a db would be more
efficient/faster.

I know there isn't a surefire threshold that states when you are
dealing with X amount of data you should use a database, but I always
assumed that if I am using an array it would be faster since it is in
memory. I am not opposed to using something minimal like SQLite; I just
wasn't sure if things would be easier/faster/more efficient with a db.
Object persistence is a plus but not a necessity for me; if my script
dies and the array is lost, it isn't a huge deal.

Will using a SQLite DB be faster, easier, or more efficient for me than
just doing all my iterations through an array?

Thanks
 
Brian Candler

Moya said:
Will using a SQLite DB be faster, easier, or more efficient for me than
just doing all my iterations through an array?

Unless your data structure won't fit in RAM, you'll be better off
using a Hash (so you can locate the object you're interested in straight
away, rather than iterating through an Array).
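
For example, keyed on whatever you look records up by (username here,
with a made-up Record class, just to illustrate):

users = {}    # username => record object

# instead of scanning: users_array.find { |r| r.username == name }
record = users[name]              # constant-time lookup by key
users[name] ||= Record.new(name)  # add it only if it isn't there yet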
 
Kevin Hopkins


I recently had a project here at work where I had to take close to 120k
products, format them from a csv into a set of necessary attribute
changes, and then write them to a .tab file. I experimented with a few
different methodologies for manipulating the data and iterating over it,
and by far the fastest way was just to create a db, pump the data into
it via the csv, and read from the model and write to a csv.
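
For what it's worth, the loading step can be as simple as something like
this with the sqlite3 gem (table and column names made up for the
example):

require 'csv'
require 'sqlite3'

db = SQLite3::Database.new('products.db')
db.execute <<-SQL
  CREATE TABLE IF NOT EXISTS products (
    sku   TEXT PRIMARY KEY,
    name  TEXT,
    price TEXT
  )
SQL

# Pump the csv straight into the table, then do the lookups and
# attribute changes with SQL instead of in-memory iteration.
db.transaction do
  CSV.foreach('products.csv') do |sku, name, price|
    db.execute('INSERT OR REPLACE INTO products VALUES (?, ?, ?)',
               [sku, name, price])
  end
end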

I was initially working with segments that were only about 1500 records
in size, and they performed faster in RAM than through a model. It seems
once I broke the 10k mark of records, performance increased about 50% by
switching to the model-based approach.

I think, codewise, looking up whether "X exists" would probably be
easier using the model than trying to find it in the csv, as well as
faster with the db indexes, but it all depends on how many records you
have.

If you want to reply with how many records/attributes you have, or maybe
just the size of the csv on the filesystem, I can give a better
appraisal, but it really is up to you. If you're already storing it in
RAM and it's working fine, don't change it. But if you want to squeeze
some more time out of the project, the database may be worth a shot.

Hope this helps, I know it's a long-winded answer...

--Kevin
 
Seebs

My answer used to be "oh, use flat files and the like, they're simpler".

Then I actually used sqlite for some stuff.

Now my answer is pretty much "if you expect to have to ever look up records
by more than one attribute, start in SQL." It's worked really well for me.

-s
 
Kyle Schmitt

Unless your data structure won't fit in RAM, you'll be better off
using a Hash (so you can locate the object you're interested in straight
away, rather than iterating through an Array).

Actually, if your data won't fit in RAM, you still probably want to do
it in a hash. Just use BDB or GDBM. Those are more or less just fast
on-disk hashes. Before everything was SQL (back when dinosaurs roamed
the earth), that's what folks called a database.
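
Something like this, say (GDBM has shipped with Ruby's stdlib for ages,
though on very recent Rubies it's a separate gem):

require 'gdbm'

# GDBM keys and values are plain strings, so serialize anything fancier
GDBM.open('users.db') do |db|
  db['jim joe'] = 'firefox,flash 10,java,windows'
  puts db['jim joe']             # fast lookup by key, straight off disk
  puts db.has_key?('someone')    # no full scan needed to answer this
end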

--Kyle
 
Seebs

Actually, if your data won't fit in RAM, you still probably want to do
it in a hash. Just use BDB or GDBM. Those are more or less just fast
on-disk hashes. Before everything was SQL (back when dinosaurs roamed
the earth), that's what folks called a database.

I'm a big fan of those tools, but what I've found is that the moment I
first want to iterate through the table looking for something by a value
other than its key, I have lost -- I would have been better off with sqlite.
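
With the sqlite3 gem that's just a couple of indexes (schema invented
for the example):

require 'sqlite3'

db = SQLite3::Database.new('users.db')
db.execute('CREATE TABLE IF NOT EXISTS users (name TEXT, email TEXT, zip TEXT)')

# One index per column you want to search by -- no full-table scans
db.execute('CREATE INDEX IF NOT EXISTS users_email ON users (email)')
db.execute('CREATE INDEX IF NOT EXISTS users_zip   ON users (zip)')

db.execute('SELECT name FROM users WHERE zip = ?', ['12345'])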

-s
 
Moya Pilot

Thank you all for your comments, I really appreciate it. I have to
admit I am new to Ruby and fairly new to programming in any language and
am still confused about a lot of things.

To explain my needs a little better, I am reading in thousands of lines
a day via a tcp connection. Most lines will be single-attribute updates
to an object, but the lines I am reading look like this:
NAME, jim joe, 123-456-7890, 123 elm st, zip
PREFS, firefox, flash 10, java, windows, jim joe
FAVORITES, facebook, jim joe, ebay, google

In short, I am reading them in via fastercsv and using an if statement
to see if the username already exists in the array; if so, I update the
object found with the needed data, and if not, I add the new data as a
new object in the array.

require 'csv'   # fastercsv became the standard csv library in 1.9

# "lines" is the raw csv text read off the tcp connection
User = Struct.new(:a, :b, :c, :d, :e, :f, :g, :h, :i, :j, :k, :l, :m,
                  :n, :o, :p, :q, :r)
array = []

CSV.parse(lines) do |row|
  if row[0] == 'NAME'
    if array.find { |foobar| foobar.b == row[1] }
      puts "User already added"
    else
      array << User.new(*row[0..5])
    end
  elsif row[0] == 'PREFS'
    foobar = array.find { |u| u.b == row[5] }
    # This is where I need some other help: I need to add this row's
    # contents [1..4] to the matching object in the array, so I think
    # something like foobar[6], foobar[7], foobar[8], foobar[9] = *row[1..4]
  elsif row[0] == 'FAVORITES'
    foobar = array.find { |u| u.b == row[2] }
    # Same problem here: I need to add row[1], row[3] and row[4] to the
    # matching object, so something like
    # foobar[12], foobar[14], foobar[15] = row[1], row[3], row[4]
  end
end

I have been re-reading some more on hashes, but I am still confused
about the differences between hashes and arrays. If it is simply just
giving a name to something like array[0], I am not sure how that speeds
things up. I am also still confused about how to efficiently add the
csv-parsed rows into an object in the array. Is array.find my best bet
if using an array? Would hashes work, or would BDB, GDBM, or SQLite be
better suited to this?

Thanks again I appreciate all your help.
 
David Masover

To explain my needs a little better, I am reading in thousands of lines
a day via a tcp connection. Most lines will be single-attribute updates
to an object, but the lines I am reading look like this:
NAME, jim joe, 123-456-7890, 123 elm st, zip
PREFS, firefox, flash 10, java, windows, jim joe
FAVORITES, facebook, jim joe, ebay, google

In other words, with a naive implementation, you'd be reading in the
entire document, parsing it, making the change, and writing it all back
out. A slightly more optimized form would be buffering a bit -- read the
entire document in, cache something useful in RAM, then write it back
out once a minute or so, at the risk of losing a minute of updates if
you crash.

In short, I am reading them in via fastercsv and using an if statement
to see if the username already exists in the array; if so, I update the
object found with the needed data, and if not, I add the new data as a
new object in the array.

That is quite literally the textbook example of when a hash or a set makes
sense. Let's take a simplified example, maybe an address book. Let's pretend I
magically have an array like this:

addresses = [
  ['Joe', '(e-mail address removed)'],
  ['John Smith', '(e-mail address removed)'],
  ['Alice', '(e-mail address removed)'],
  ['Bob', '(e-mail address removed)'],
  ['Eve', '(e-mail address removed)']
]

That's reasonably small, so it's reasonably quick to do. Basically, you could
do something like this:

def put(username, email)
  found = false
  addresses.each do |record|
    if record[0] == username
      record[1] = email
      found = true
      break
    end
  end
  addresses << [username, email] unless found
end

Basically, you want to update the user's email address in the database, or
create a new record if the user isn't there already. Your code is probably a
bit more elegant, but Array's "find" method is going to be doing the same kind
of thing under the hood anyway.

Now think about this. How long is that going to take? It's proportionate to
the number of users in the database, and how long their names are. The more
users you have, the longer it will take.

Think about it -- if you're adding a new record, that means you just read
through the ENTIRE DATABASE just to find out they're not there.

put('charlie', '(e-mail address removed)') # needs to loop through 5 items
put('superman', '(e-mail address removed)') # needs to loop through 6 items
...

The more data you have, the slower it gets, and it gets slower
proportionately. That means that adding a new record to an array of a million
records will take at least a thousand times as long as it takes to add a new
record to an array of a thousand records -- and maybe longer, because of all
those string comparisons.

Say you want to look something up. It's no better:

def get(username)
  addresses.each do |record|
    return record[1] if record[0] == username
  end
  nil
end

Again, if the user isn't there, you have to look at the ENTIRE DATABASE to
find out they aren't.

This is about the least efficient way you could possibly ever do this, so it's
important that you understand _why_ this is so inefficient.

Now, here's what it looks like as a hash:

addresses = {
  'Joe' => '(e-mail address removed)',
  'John Smith' => '(e-mail address removed)',
  'Alice' => '(e-mail address removed)',
  'Bob' => '(e-mail address removed)',
  'Eve' => '(e-mail address removed)'
}

Now let's write those methods again:

def put(username, email)
  addresses[username] = email
end

def get(username)
  addresses[username]
end

It's not immediately obvious that these are faster. Maybe Ruby is doing some
magic behind the scenes that makes this just as slow?

But that's not the case. If you're curious, look up hash tables, but in the
meantime, take my word for it: This runs, on average, in _constant_ time.
That means it will take exactly the same amount of time to put an address into
a hash with a million entries as it would to put that address into a hash with
ten entries.
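
If you'd rather not take my word for it, here's a rough sketch of a
benchmark you can run yourself (exact numbers will vary by machine and
Ruby version):

require 'benchmark'

pairs   = (1..1_000_000).map { |i| ["user#{i}", "user#{i}@example.com"] }
as_hash = {}
pairs.each { |name, email| as_hash[name] = email }

Benchmark.bm(12) do |x|
  # Worst case for the array: the user isn't there, so every record
  # gets compared before we find that out.
  x.report('array find')  { pairs.find { |r| r[0] == 'nobody' } }
  x.report('hash lookup') { as_hash['nobody'] }
end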

Plus, it's less code anyway, once you understand it.

So in short, it's much, MUCH more than simply just giving a name to
something like array[0].

But even if that's the only reason, it's still worthwhile. Readable code is a
Good Thing. Using bare integers as array offsets is something I almost never
do -- it's not always bad, but it's almost always ugly.

Would hashes work, or would BDB, GDBM, or SQLite be better suited to
this?

I would tend towards SQLite with a decent ORM on top of it, maybe DataMapper.

But almost certainly yes.

Flat files like CSV are not at all good at doing random updates, which is what
you just described. CSV is great for looking at your data in a spreadsheet, if
your data reasonably fits in a spreadsheet. Other flat file formats can be
much simpler to work with, especially for a config file, or to make things
human readable.

But as a database, they're slow, and they get slower the more you put in them.
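
To give you a flavor of the SQLite route (using the plain sqlite3 gem
here; DataMapper would wrap this in model classes, and the schema is
invented for the example):

require 'sqlite3'

db = SQLite3::Database.new('users.db')
db.execute <<-SQL
  CREATE TABLE IF NOT EXISTS users (
    username TEXT PRIMARY KEY,
    phone    TEXT,
    address  TEXT,
    browser  TEXT
  )
SQL

# Add the user (replacing any existing row with that username)...
db.execute('INSERT OR REPLACE INTO users (username, phone) VALUES (?, ?)',
           ['jim joe', '123-456-7890'])

# ...update a single attribute in place...
db.execute('UPDATE users SET browser = ? WHERE username = ?',
           ['firefox', 'jim joe'])

# ...and look a user up without scanning everything.
p db.get_first_row('SELECT * FROM users WHERE username = ?', ['jim joe'])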
 
