Using hash to see if object's attributes have changed

Bryan

When a user submits a request to update an object in my web app, I
make the changes in the DB, along w/ who last updated it and when. I
only want to update the updated/updatedBy columns in the DB if the
data has actually changed however.

I'm thinking of having the object in question be able to return a list
of its values that constitute its state. Then I can take a hash of
that list as the object exists in the database before the request, and
then on the object that the user has made changes to. If they are not
equal, the user has changed the object.

I imagine it working something like this:

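import hashlib

class Obj(object):
    pass

def getValues(obj):
    return [obj.a, obj.b, obj.c]

foo = Obj()
foo.a = foo.b = foo.c = 1
# compare hex digests; the hash objects themselves never compare equal
stateBefore = hashlib.sha1(str(getValues(foo)).encode()).hexdigest()
foo.b = 'changed'
stateNow = hashlib.sha1(str(getValues(foo)).encode()).hexdigest()
assert stateBefore != stateNow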


I originally thought about running the hash on the __dict__ attribute,
but there may be things in there that don't actually constitute the
object's state as far as the database is concerned, so I thought it
better to have each object be responsible for returning a list of
values that constitute its state as far as the DB is concerned.

I would appreciate any insight into why this is a good/bad idea given
your past experiences.
 
Robert Kern

> When a user submits a request to update an object in my web app, I
> make the changes in the DB, along w/ who last updated it and when. I
> only want to update the updated/updatedBy columns in the DB if the
> data has actually changed however.
>
> I'm thinking of having the object in question be able to return a list
> of its values that constitute its state. Then I can take a hash of
> that list as the object exists in the database before the request, and
> then on the object that the user has made changes to. If they are not
> equal, the user has changed the object.

It *might* work, but probably won't be robust, especially as you are relying on
the string representation. You would be much better off using an ORM, which will
do all of this for you. This is exactly what they are for.

They usually determine whether attributes have changed by instrumentation rather
than inspection. If you still don't want to use a full ORM, you should at least
emulate that strategy. Add descriptors to your classes for each attribute you
want to map to a column. On __set__, they should compare the new value to the
old, and set a "dirty" flag if the attribute changes value. Or just
implement __setattr__ on your classes to do a similar check.
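
A minimal sketch of the descriptor idea, assuming Python 3.6+ (for __set_name__). The names TrackedAttribute, Record and dirty are made up for illustration and do not come from any ORM:

```python
class TrackedAttribute:
    """Descriptor that flags its owner as dirty when the value really changes."""

    _MISSING = object()  # sentinel: the attribute has never been set

    def __set_name__(self, owner, name):
        # The real value is stored on the instance under a private name.
        self.storage = "_" + name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        return getattr(instance, self.storage)

    def __set__(self, instance, value):
        old = getattr(instance, self.storage, self._MISSING)
        if old is self._MISSING or old != value:
            instance.dirty = True
        setattr(instance, self.storage, value)


class Record:
    a = TrackedAttribute()
    b = TrackedAttribute()

    def __init__(self, a, b):
        self.a = a
        self.b = b
        self.dirty = False  # a freshly built object starts out clean


rec = Record(1, 2)
rec.a = 1              # same value: the flag stays False
unchanged = rec.dirty
rec.b = "changed"      # different value: the flag flips
changed = rec.dirty
```

Because the comparison happens inside __set__, setting an attribute to the value it already holds leaves the flag alone.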

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 
Bryan

> It *might* work, but probably won't be robust, especially as you are relying on
> the string representation. You would be much better off using an ORM, which will
> do all of this for you. This is exactly what they are for.
>
> They usually determine whether attributes have changed by instrumentation rather
> than inspection. If you still don't want to use a full ORM, you should at least
> emulate that strategy. Add descriptors to your classes for each attribute you
> want to map to a column. On __set__, they should compare the new value to the
> old, and set a "dirty" flag if the attribute changes value. Or just
> implement __setattr__ on your classes to do a similar check.

I am using sqlalchemy, and it does a pretty good job of this. The
problem is that it considers an object changed whenever an attribute
gets set after the object is loaded, even if the attribute is being
set to the same value.

Another thing I could do is when applying the user's changes to the
object that was loaded from the db, only apply the change if the value
is actually different. In that case I could use the ORM's isDirty()
test that keeps track of what was set since being loaded. Then the
object doesn't know anything about its state or isDirty() tests, only
the controller would, which I like as it is less intrusive, but it
also is dependent on an ORM.
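
The "only apply the change if the value is actually different" step could look roughly like this in the controller. apply_changes and Item are hypothetical names, and Item is only a stand-in for a mapped class:

```python
def apply_changes(obj, updates):
    """Set only the attributes whose values differ; return the changed names."""
    changed = []
    for name, new_value in updates.items():
        if getattr(obj, name) != new_value:
            setattr(obj, name, new_value)
            changed.append(name)
    return changed


class Item:  # stand-in for an ORM-mapped class
    def __init__(self, a, b, c):
        self.a, self.b, self.c = a, b, c


item = Item(1, 1, 1)
no_op = apply_changes(item, {"a": 1, "b": 1})          # nothing differs
real = apply_changes(item, {"a": 1, "b": "changed"})   # only b differs
```

One thing to watch: values arriving from a web form are usually strings, so "1" != 1 would count as a change unless the input is converted to the column's type first.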

Instrumentation would require a bit more work. I would have to watch
for changes using set/get and __dict__ access. Also, I would have to
be able to say, "Has this object changed since time x". Imagine
loading an object from the db. Depending on the ORM implementation,
it may instantiate an object, then set all its attributes from the DB
row, which would trigger my isDirty instrumentation. So it would look
dirty even fresh out of the DB w/ no changes. So I would have to be
able to manually clear the isDirty flag. Not that any of this is
impossible, just more complex.
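
For comparison, a rough sketch of the __setattr__ route with a manual reset, in plain Python with no ORM involved. Tracked, _fields, is_dirty and mark_clean are illustrative names only:

```python
class Tracked:
    _fields = ("a", "b", "c")  # attributes that map to DB columns

    def __init__(self):
        # Bypass our own __setattr__ while creating the bookkeeping set.
        object.__setattr__(self, "_dirty", set())

    def __setattr__(self, name, value):
        if name in self._fields:
            missing = object()
            if getattr(self, name, missing) != value:
                self._dirty.add(name)
        object.__setattr__(self, name, value)

    @property
    def is_dirty(self):
        return bool(self._dirty)

    def mark_clean(self):
        """Call once loading has finished, or after a successful save."""
        self._dirty.clear()


obj = Tracked()
obj.a = obj.b = obj.c = 1      # simulate the loader populating the object
dirty_after_load = obj.is_dirty
obj.mark_clean()               # "has this object changed since time x"
obj.b = 1                      # same value: stays clean
still_clean = not obj.is_dirty
obj.b = "changed"
dirty_fields = sorted(obj._dirty)
```

This does not catch writes made directly through obj.__dict__, and it assumes __init__ runs, which may not hold if an ORM builds instances some other way.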
 
Steven D'Aprano

> When a user submits a request to update an object in my web app, I make
> the changes in the DB, along w/ who last updated it and when. I only
> want to update the updated/updatedBy columns in the DB if the data has
> actually changed however.
>
> I'm thinking of having the object in question be able to return a list
> of its values that constitute its state. Then I can take a hash of that
> list as the object exists in the database before the request,

Storing the entire object instead of the hash is not likely to be *that*
much more expensive. We're probably talking about [handwaves] a few dozen
bytes versus a few more dozen bytes -- trivial in the grand scheme of
things.

So give the object a __ne__ method, store a copy of the object, and do
this:

if current_object != existing_object:
    update(...)
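
A small sketch of that comparison, assuming the state is the three attributes a, b and c from the original post. Thing and _state are hypothetical names:

```python
import copy


class Thing:
    def __init__(self, a, b, c):
        self.a, self.b, self.c = a, b, c

    def _state(self):
        # Only the values that matter to the database.
        return (self.a, self.b, self.c)

    def __eq__(self, other):
        if not isinstance(other, Thing):
            return NotImplemented
        return self._state() == other._state()

    # Python 3 derives __ne__ from __eq__; Python 2 needs it spelled out.
    def __ne__(self, other):
        result = self.__eq__(other)
        return result if result is NotImplemented else not result


existing_object = Thing(1, 1, 1)
current_object = copy.copy(existing_object)  # working copy the user edits
same = not (current_object != existing_object)
current_object.b = "changed"
differs = current_object != existing_object
```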


> and then
> on the object that the user has made changes to. If they are not equal,
> the user has changed the object.

If all you care about is a flag that says whether the state has changed
or not, why don't you add a flag "changed" to the object and update it as
needed?

if current_object.changed:
    update(...)
    current_object.changed = False


That would require all attributes be turned into properties, but that
shouldn't be hard. Or use a datestamp instead of a flag:

if current_object.last_changed > database_last_changed:
    update(...)

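> I imagine it working something like this:
>
> import hashlib
>
> class Obj(object):
>     pass
>
> def getValues(obj):
>     return [obj.a, obj.b, obj.c]
>
> foo = Obj()
> foo.a = foo.b = foo.c = 1
> # compare hex digests; the hash objects themselves never compare equal
> stateBefore = hashlib.sha1(str(getValues(foo)).encode()).hexdigest()
> foo.b = 'changed'
> stateNow = hashlib.sha1(str(getValues(foo)).encode()).hexdigest()
> assert stateBefore != stateNow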


You probably don't need a cryptographically strong hash. Just add a
__hash__(self) method to your class:


class MyObject(object):  # or whatever it is called
    def __hash__(self):
        # hash only the values that make up the DB state
        t = (self.a, self.b, self.c)
        return hash(t)

stateNow = hash(foo)



In fact, why bother with hashing it? Just store the tuple itself, or a
serialized version of it, and compare that.
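
A minimal sketch of comparing the stored values directly. get_values and the tags attribute are invented for the example; the deepcopy is there on the assumption that some values may be mutable:

```python
import copy


class Obj:
    pass


def get_values(obj):
    """Return the DB-relevant state as a tuple (hypothetical helper)."""
    return (obj.a, obj.b, obj.tags)


foo = Obj()
foo.a, foo.b, foo.tags = 1, 1, ["x"]

# deepcopy so that later in-place edits to mutable values (like the list)
# do not silently alter the snapshot as well.
state_before = copy.deepcopy(get_values(foo))

foo.tags.append("y")  # in-place change, no attribute assignment involved
has_changed = get_values(foo) != state_before
```

A snapshot like this also notices in-place mutation, which a check that only watches attribute assignment would miss.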

> I originally thought about running the hash on the __dict__ attribute,
> but there may be things in there that don't actually constitute the
> object's state as far as the database is concerned, so I thought it
> better to have each object be responsible for returning a list of values
> that constitute its state as far as the DB is concerned.
>
> I would appreciate any insight into why this is a good/bad idea given
> your past experiences.


Call me paranoid if you like, but I fear collisions. Even
cryptographically strong hashes aren't collision-free (mathematically,
they can't be). Even though the chances of a collision might only be one
in a trillion-trillion-trillion, some user might be unlucky and stumble
across such a collision, leading to a bug that might cause loss of data.
As negligible as the risk is, why take that chance if there are ways of
detecting changes that are just as good and probably faster?

Hash functions have their uses, but I don't think that this is one of
them.
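
To make the collision worry concrete for the built-in hash(): the following relies on a CPython implementation detail (hash(-1) is reserved internally, so -1 hashes to -2) and is not guaranteed on other interpreters:

```python
# Two different states whose built-in hashes coincide under CPython.
before = (1, -1, 1)
after = (1, -2, 1)

values_differ = before != after               # the data really changed
hashes_equal = hash(before) == hash(after)    # but the hashes do not show it
```

Comparing the tuples themselves reports the change; comparing only their hashes would miss it.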
 
