three column dataset - additions and deletions

draeath

I'm going to be writing a utility that will be pulling three fields from
a MySQL table. I've already got a sample dataset - there's a long int
(which is a db key), a short string, and a looong string. Many rows.

As it is, I receive this data from the DB interface as a rather large tuple
of tuples. I plan on hashing the long string field (both for convenience
and security) and storing the set in a pickle.

The idea is that this script will run periodically, pulling the table
and comparing the data gathered on that run to the data stored by the
previous run, acting on any changes, and storing the current data back
(to be referenced against on the next invocation).

I figure it will be easy enough to determine changed hashes for a given
key. What I'm unclear on is the best type of structure to keep this
data in, given that I need to modify the data after it comes in
(replacing that long string with, say, an MD5 from hashlib) and need
to act both on "new" rows (rows that don't exist in the 'old' data) and
on deleted rows (rows that only exist in the 'old' data).
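
To make that concrete, here's roughly the comparison I picture (a rough
sketch - the dict layout and the key/name/acl field names are just made
up at this point):

    import hashlib

    def snapshot(rows):
        # rows: the tuple of (key, name, acl) tuples from the DB interface;
        # map key -> (name, md5-of-acl) so the long string is never kept
        return {key: (name, hashlib.md5(acl.encode()).hexdigest())
                for key, name, acl in rows}

    old = snapshot(old_rows)   # loaded from the previous run
    new = snapshot(new_rows)   # pulled fresh from MySQL this run

    added   = new.keys() - old.keys()
    deleted = old.keys() - new.keys()
    changed = {k for k in new.keys() & old.keys() if new[k] != old[k]}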

Keep in mind that I'm a newbie here, and I'm probably not aware of
most of the different ways to store such things. I shouldn't have any
problems with the logic itself - I just know enough to know I don't know
the best ways of doing things :)

Any suggestions? I'm not asking for code or handholding, but some objects
or datatypes to look into would be very helpful at this early stage.

Thanks!
 
Tim Harig

The idea is that this script will run periodically, pulling the table
and comparing the data gathered on that run to the data stored by the
previous run, acting on any changes, and storing the current data back
(to be referenced against on the next invocation).

So, basically, you want to store a local copy of the data and sync it to
the original.
I figure it will be easy enough to determine changed hashes for a given
key. What I'm unclear on is the best type of structure to keep this
data in, given that I need to modify the data after it comes in
(replacing that long string with, say, an MD5 from hashlib) and need
to act both on "new" rows (rows that don't exist in the 'old' data) and
on deleted rows (rows that only exist in the 'old' data).

You need to differentiate between the in-memory data model and the storage
model. Since this data comes from a database in the first place, I would
dump it to an sqlite3 database from the beginning. You can use this to
store and modify the values as you receive them from the database.
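
For instance, a minimal sketch (the table and column names here are just
placeholders for whatever your data actually looks like):

    import hashlib
    import sqlite3

    con = sqlite3.connect("local_copy.db")
    con.execute("""create table if not exists users (
                       id   integer primary key,
                       name text,
                       hash text)""")

    # rows is the tuple of tuples pulled from MySQL
    con.executemany("insert or replace into users values (?, ?, ?)",
                    ((key, name, hashlib.md5(acl.encode()).hexdigest())
                     for key, name, acl in rows))
    con.commit()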

If you are looking for in-memory structures, then you haven't really
provided us with enough information on the significance and organization of
the data.
 
draeath

On Thu, 02 Dec 2010 22:55:53 +0000, Tim Harig wrote:

Thanks for taking the time to check in on this, Tim!
So, basically, you want to store a local copy of the data and sync it to
the original.
In a way. I only need to store one copy of the data, and make note of
changes between it and the current data.
You need to differentiate between the in-memory data model and the
storage model. Since this data comes from a database in the first
place, I would dump it to an sqlite3 database from the beginning. You
can use this to store and modify the values as you receive them from
the database.
I thought of doing that, but given that I only need to store a single
instance of the data, a simple pickle will do the job nicely (am I
correct in reading that it can save/load any Python object?)
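
Something like this is all I picture needing (a rough sketch; the file
name is arbitrary):

    import pickle

    # save the current snapshot (say, a dict of key -> (name, hash))
    with open("snapshot.pickle", "wb") as f:
        pickle.dump(current, f)

    # ...and on the next run, load the previous one back
    with open("snapshot.pickle", "rb") as f:
        previous = pickle.load(f)
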
If you are looking for in-memory structures, then you haven't really
provided us with enough information on the significance and organization
of the data.
The data columns:
Long Int, String (under 30 chars), String (over 100 chars)
The rows can scale up to hundreds, perhaps thousands.

The integer is the database key, the shorter string is a user name, and
the longer string is an access control definition. The whole idea of this
script is to check, daily, for any added or removed users - or any
altered access control definition.

I realize this could likely all be done from inside the database itself -
but altering the DB itself is not an option (as the product vendor is
very touchy about that, and altering it can void our support agreement).
 
Tim Harig

On Thu, 02 Dec 2010 22:55:53 +0000, Tim Harig wrote:

Thanks for taking the time to check in on this, Tim!

In a way. I only need to store one copy of the data, and make note of
changes between it and the current data.

Perhaps I am missing a subtlety that makes those statements different.
I thought of doing that, but given that I only need to store a single
instance of the data, a simple pickle will do the job nicely (am I
correct in reading that it can save/load any Python object?)

So, you start by dumping the data from the remote server into an sqlite3
database table. What you end up with is a record-for-record copy of the
original query (plus any other metadata that you want to add). Then,
when the data changes, you apply those same changes to your local table
(or just regenerate it, since you seem to be pulling all of the information
from the server anyway). The result is a *single instance of the data*.

Why you would want to replace this with a pickle of a nested set of tuples
or a homebrew on-disk data structure is beyond me. Using sqlite3 is almost
certainly faster and more functional than anything you are going to create
without some serious work.
The data columns:
Long Int, String (under 30 chars), String (over 100 chars)
The rows can scale up to hundreds, perhaps thousands.

Then those are the columns that you create for your local table.
The integer is the database key, the shorter string is a user name, and
the longer string is an access control definition. The whole idea of this
script is to check, daily, for any added or removed users - or any
altered access control definition.

The question is how you are going to use this information once you have
mirrored it locally. Most likely, from your description, you just need to
access it as a local read-only data store. Now compare the differences in
how you would access the data:

pickle method:
1. You have to load the entire pickle into memory.
2. Unless you add some kind of tree or indexing mechanism, you will have
   to walk through an average of half of the records to find a matching
   id. If you do use a more advanced mechanism, you have to create the
   code that inserts and locates the data.

sqlite3 method:
1. You open the file using the sqlite3 connector, which does not have to
   read all of the data into memory.
2. You use a select query to get just the record for the id that you are
   looking for. sqlite3 already provides optimized lookup and indexing
   capability, as well as modification operations, etc. (most likely
   written in C).
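
For example, a minimal sketch (assuming the same hypothetical table as
above, with some_id being whatever key you are after):

    import sqlite3

    con = sqlite3.connect("local_copy.db")
    # only the matching row is read; the whole table never enters memory
    row = con.execute("select name, hash from users where id = ?",
                      (some_id,)).fetchone()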

As an added bonus, you don't have to worry about locking issues to keep
clients from accessing the datastore and receiving an inconsistent copy
while you are making your periodic updates to the database.

Summary: the pickle method is reinventing the wheel. You can do what has
already been done for you with the sqlite3 module (and the library
behind it); getting anything near the same functionality yourself is
going to require considerable effort, and you are probably going to
have to write C to get equivalent performance.

Which seems like a better option to you?
I realize this could likely all be done from inside the database itself -
but altering the DB itself is not an option (as the product vendor is
very touchy about that, and altering it can void our support agreement).

Altering the remote database is not an option, but I am talking about
modifying only your local copy. If you can rewrite your pickle file,
then you can modify the sqlite3 file.
 
MRAB

Perhaps I am missing a subtlety that makes those statements different.
[snip]
I take the difference to be this:

The first statement says there are 2 copies, the local one and the
original one, with the local one kept in synch with the original one.

The second statement says there's 1 copy, plus the changes which have
been made to it.
 
Tim Harig

Perhaps I am missing a subtlety that makes those statements different.
[snip]
I take the difference to be this:

The first statement says there are 2 copies, the local one and the
original one, with the local one kept in synch with the original one.

The second statement says there's 1 copy, plus the changes which have
been made to it.

Okay, so you keep one local copy of the data pristine with what is on the
server; then you want to keep what is effectively a set of local changes
from what is on the server, like a patch, so that when you want to access
the data locally you can generate a local copy by applying the patch to
the pristine server data. Then you want to be able to pull down and
incorporate changes from the server, something like what "update" does
in an SCM. The information is still changed on a record-by-record basis.

Does that sound correct?

In that case, I would use the same basic scheme that I suggested before,
but add a column to the table to track the changes for each field that
might be locally modified. NULL would of course mark fields that have not
changed. You could store the changes literally or as some kind of
diff-style delta that could be applied against the pristine copy. Updating
from the server works as before, but you need to decide what happens if
there is a conflict between changes on the server and local changes.
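
A minimal sketch of what that might look like (hypothetical names, one
change-tracking column per modifiable field):

    import sqlite3

    con = sqlite3.connect("local_copy.db")
    con.execute("""create table if not exists users (
                       id          integer primary key,
                       name        text,
                       acl         text,
                       name_change text,   -- null means unchanged locally
                       acl_change  text)""")

    # the effective local value is the change if present, else the pristine one
    row = con.execute("""select id,
                                coalesce(name_change, name),
                                coalesce(acl_change, acl)
                         from users where id = ?""", (42,)).fetchone()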
 
John Nagle

On Thu, 02 Dec 2010 22:55:53 +0000, Tim Harig wrote:

Thanks for taking the time to check in on this, Tim!
I realize this could likely all be done from inside the database itself -
but altering the DB itself is not an option (as the product vendor is
very touchy about that, and altering it can void our support agreement).

A local database is probably the way to go. You're already using
MySQL, so you know how to do that. You can use MySQL or SQLite on
a local machine for your local database, while also talking
to the remote MySQL database.

Locally, you probably want to store the key, the short string,
the MD5 of the long string, and the long string. When you get an
update, put it in a temporary table, then compare that table with
your permanent table. (The comparison is one line of SQL.)
What you do with the differences is your problem.
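
For example, with sqlite3 locally it might look like this (a sketch;
"incoming" and "permanent" are made-up table names):

    import sqlite3

    con = sqlite3.connect("local_copy.db")
    # keys present in the fresh pull but not the permanent table (additions)
    added = con.execute("select id from incoming except "
                        "select id from permanent").fetchall()
    # keys that disappeared from the source (deletions)
    deleted = con.execute("select id from permanent except "
                          "select id from incoming").fetchall()
    # keys whose hash changed
    changed = con.execute("select p.id from permanent p "
                          "join incoming i on p.id = i.id "
                          "where p.hash <> i.hash").fetchall()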

I have a system running which does something like this. Every
three hours, it fetches PhishTank's database of a few hundred
thousand phishing sites, and compares it to my local copy.
Another system of mine reads the daily updates to SEC filings
and updates my local database. This is all routine database
stuff.

If you have to work with big, persistent data sets, use a
real database. That's what they are for, and they already have
good algorithms for the hard stuff. Storing some local data
structure with "pickle" is probably not the right approach.

John Nagle
 
Dennis Lee Bieber

I figure it will be easy enough to determine changed hashes for a given
key. What I'm unclear on is what the best type of structure to keep this

Since it sounds like you have to recompute the hash for each record
retrieved from MySQL, hashing seems rather futile...

    Retrieve record from MySQL
    Generate hash from the long string
    Find matching record in local copy
    If local hash does not match:
        update local copy with string from MySQL and new hash

Is that really going to be faster than

    Retrieve record from MySQL
    Find matching record in local copy
    If local string does not match MySQL string:
        update local copy with MySQL string

Any suggestions? I'm not asking for code or handholding, but some objects
or datatypes to look into would be very helpful at this early stage.

Pity you can't modify the MySQL database... Add a hash column, using
whatever hashing functions MySQL supports (if I read my old reference
properly, MD5)...

Modified records would be those

WHERE previous_hash IS NULL
OR previous_hash != MD5(long_string)

and after processing you run updates

SET previous_hash = MD5(long_string)


Otherwise you are probably better off using a local database
(SQLite3?) storing the fields of interest. No matter what you do, your
processing is going to read EVERY record from the MySQL database, using
the key to find the matching record in the local database (if not found,
add a new record -- if found, retrieve and compare the long string from
MySQL with the one from the local database; on difference, update the
local database record).
 
draeath

The only reason I want the hash is that I don't want a copy of this
string lying around. I also don't need to know what it is; I just need
to know if it's different. Think of this as a tripwire - if someone's
user access level is changed, we find out.

I still think using a separate database (sqlite or not) is entirely
unnecessary. The task simply isn't at a large enough scale to justify
going to that extreme. That's like cutting a string with a chainsaw.
Granted, the next time I'm at my workstation I'll just give it a go and
stick with whatever turns out "best".

The alternatives are what I was looking for, and you all did provide
them :)

So here's the goal, since it seems it's still unclear:

Cron will call my script on an interval (probably once a day... so yeah).
The script will determine if users were added, removed, or changed. If
so, an email will be constructed with generic details "users bjoe, jdoe
created; user mwallace deleted; user psouter access level changed" and
sent to the MUA/MTA/whatever for delivery to myself and the other admins.
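
The notification end would be something like this (a rough sketch -
the addresses are placeholders, and I'm assuming a local MTA is
listening):

    import smtplib
    from email.message import EmailMessage

    msg = EmailMessage()
    msg["Subject"] = "user access changes"
    msg["From"] = "audit@example.com"
    msg["To"] = "admins@example.com"
    msg.set_content("users bjoe, jdoe created; user mwallace deleted; "
                    "user psouter access level changed")

    # hand it off to the local MTA for delivery
    with smtplib.SMTP("localhost") as s:
        s.send_message(msg)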
 
Dennis Lee Bieber

The only reason I want the hash is that I don't want a copy of this
string laying around. I also don't need to know what it is, I just need
to know if it's different. Think of this as a tripwire - if someone's
user access level is changed, we find out.

Then I suggest you don't bother retrieving it from the MySQL side...
If all you want is a hash, let MySQL generate it for you:

    select key_field, user_field, MD5(string_field) from ... order by key_field

I still think using a separate database (sqlite or not) is entirely
unnecessary. The task simply isn't at a large enough scale to justify

It's probably still simpler... Your main processing loop would be a
standard sort/merge algorithm (with the sorting being done by the
relevant DBMS engines)...

mysqlcur.execute("select <as above>")
localcur.execute("select key_field, user_field, hash from... order
by key_field")

mrec = mysqlcur.fetchone()
lrec = localcur.fetchone()

while True:
if (mrec and not lrec) or (mrec.key < lrec.key):
#ran out of local records (or local gap), anything in MySQL is
NEW
#log NEW record
localcur2.execute("insert into ...", (mrec.key, mrec.user,
mrec.hash))
#note second cursor for inserts while reading
mrec = mysqlcur.fetchone()
elif (not mrec and lrec) or (mrec.key > lrec.key):
#ran out of MySQL or remote gap, local is deleted record
#log DELETED record
localcur2.execute("delete from ... where key = ?", (lrec.key,))
lrec = localcur.fetchone()
elif mrec.key = lrec.key:
if mrec.hash != lrec.hash:
#changed data in MySQL
#log CHANGE
localcur2.execute("update ... set hash = ? where key = ?",
(lrec.key, mrec.hash))
mrec = mysqlcur.fetchone()
lrec = localcur.fetchone()
if not mrec and not lrec: break
localcon.commit()
 
