To explain my needs a little better, I am reading in thousands of lines
a day via a TCP connection. Most lines will be single-attribute updates
to an object, but the lines I am reading look like this:
NAME, jim joe, 123-456-7890, 123 elm st, zip
PREFS, firefox, flash 10, java, windows, jim joe
FAVORITES, facebook, jim joe, ebay, google
In other words, with a naive implementation you'd be reading in the entire
document, parsing it, making the change, and writing it all back out. A
slightly more optimized form would be buffering a bit -- read the entire
document in, cache something useful in RAM, then write it back out once a
minute or so, at the risk of losing a minute of updates if you crash.
In short, I am reading them in via FasterCSV and using an if statement to
check whether the username already exists in the array. If so, I update
the object found with the needed data; if not, I add the data as a new
object in the array.
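That read-and-update loop might be sketched like this -- the column positions for the username are hypothetical (the sample lines above put it in different places per line type), so adjust to the real format:

```ruby
require 'csv'

# In-memory index keyed by username; each value collects that user's rows
# by line type, so a later line for the same user overwrites the old one.
records = {}

lines = [
  'NAME,jim joe,123-456-7890,123 elm st,12345',
  'PREFS,firefox,flash 10,java,windows,jim joe',
  'NAME,jim joe,123-456-7890,456 oak ave,12345'  # later update, same user
]

lines.each do |line|
  row  = CSV.parse_line(line)
  type = row.first
  # Hypothetical: NAME rows carry the username in column 1,
  # other row types carry it in the last column.
  username = type == 'NAME' ? row[1] : row.last
  (records[username] ||= {})[type] = row[1..-1]
end
```

Keying on the username this way already avoids scanning an array -- which is the point made below.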
That is quite literally the textbook example of when a hash or a set makes
sense. Let's take a simplified example, maybe an address book. Let's pretend I
magically have an array like this:
addresses = [
  ['Joe', '(e-mail address removed)'],
  ['John Smith', '(e-mail address removed)'],
  ['Alice', '(e-mail address removed)'],
  ['Bob', '(e-mail address removed)'],
  ['Eve', '(e-mail address removed)']
]
That's reasonably small, so it's reasonably quick to do. Basically, you could
do something like this:
def put(username, email)
  found = false
  addresses.each do |record|
    if record[0] == username
      record[1] = email
      found = true
      break
    end
  end
  addresses << [username, email] unless found
end
Basically, you want to update the user's email address in the database, or
create a new record if the user isn't there already. Your code is probably a
bit more elegant, but Array's "find" method is going to be doing the same kind
of thing under the hood anyway.
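For instance, the same scan can be written more compactly with Array#assoc, which does exactly that linear search for the first sub-array whose first element matches (the array is passed in explicitly here so the method can see it; the names are just illustrative):

```ruby
# Array#assoc scans for the first element array whose first item == username.
def put(addresses, username, email)
  record = addresses.assoc(username)
  if record
    record[1] = email                # update the existing record in place
  else
    addresses << [username, email]   # not found: append a new record
  end
end

book = [['Joe', 'old@example.com']]
put(book, 'Joe', 'new@example.com')  # updates the existing record
put(book, 'Eve', 'eve@example.com')  # appends a new one
```

Tidier, but it's still a linear scan -- the cost is the same.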
Now think about this. How long is that going to take? It's proportional to
the number of users in the database (and how long their names are). The more
users you have, the longer it will take.
Think about it -- if you're adding a new record, that means you just read
through the ENTIRE DATABASE just to find out they're not there.
put('charlie', '(e-mail address removed)') # needs to loop through 5 items
put('superman', '(e-mail address removed)') # needs to loop through 6 items
...
The more data you have, the slower it gets, and it gets slower
proportionately. That means that adding a new record to an array of a million
records will take at least a thousand times as long as it takes to add a new
record to an array of a thousand records -- and maybe longer, because of all
those string comparisons.
Say you want to look something up. It's no better:
def get(username)
  addresses.each do |record|
    return record[1] if record[0] == username
  end
  nil
end
Again, if the user isn't there, you have to look at the ENTIRE DATABASE to
find out they aren't.
This is about the least efficient way you could possibly ever do this, so it's
important that you understand _why_ this is so inefficient.
Now, here's what it looks like as a hash:
addresses = {
  'Joe' => '(e-mail address removed)',
  'John Smith' => '(e-mail address removed)',
  'Alice' => '(e-mail address removed)',
  'Bob' => '(e-mail address removed)',
  'Eve' => '(e-mail address removed)'
}
Now let's write those methods again:
def put(username, email)
  addresses[username] = email
end

def get(username)
  addresses[username]
end
It's not immediately obvious that these are faster. Maybe Ruby is doing some
magic behind the scenes that makes this just as slow?
But that's not the case. If you're curious, look up hash tables, but in the
meantime, take my word for it: this runs, on average, in _constant_ time.
That means it will take about the same amount of time to put an address into
a hash with a million entries as it would to put that address into a hash with
ten entries.
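You can see the difference with a quick (and unscientific) benchmark; the sizes, key names, and iteration count here are arbitrary:

```ruby
require 'benchmark'

small_a = Array.new(1_000)   { |i| ["user#{i}", "mail#{i}"] }
large_a = Array.new(100_000) { |i| ["user#{i}", "mail#{i}"] }
small_h = small_a.to_h
large_h = large_a.to_h

# Look up a key that isn't there -- the worst case for the array,
# since it has to scan every record before giving up.
n = 50
Benchmark.bm(12) do |bm|
  bm.report('array 1k')   { n.times { small_a.find { |r| r[0] == 'nope' } } }
  bm.report('array 100k') { n.times { large_a.find { |r| r[0] == 'nope' } } }
  bm.report('hash 1k')    { n.times { small_h['nope'] } }
  bm.report('hash 100k')  { n.times { large_h['nope'] } }
end
```

The array timings grow roughly 100x with the data; the hash timings barely move.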
Plus, it's less code anyway, once you understand it.
So in short, it's much, MUCH more than simply giving a name to something
like array[0].
But even if that's the only reason, it's still worthwhile. Readable code is a
Good Thing. Using bare integers as array offsets is something I almost never
do -- it's not always bad, but it's almost always ugly.
Would hashes work, or would BDB, GDBM or SQLite be better suited
to this?
I would tend towards SQLite with a decent ORM on top of it, maybe DataMapper.
But almost certainly yes.
Flat files like CSV are not at all good at doing random updates, which is what
you just described. CSV is great for looking at your data in a spreadsheet, if
your data reasonably fits in a spreadsheet. Other flat file formats can be
much simpler to work with, especially for a config file, or to make things
human readable.
But as a database, they're slow, and they get slower the more you put in them.