Using Python to delta-load files into a central DB

Chris Nethery

Hello everyone,

I have a challenging issue I need to overcome and was hoping I might gain
some insights from this group.

I am trying to speed up the process I am using, which is as follows:

1) I have roughly 700 files that are modified throughout the day by users
within a separate application

2) As modifications are made to the files, I use a polling service and mimic
the lock-file strategy used by the separate software application

3) I generate a single 'load' file and bulk insert it into a load table

4) I update/insert/delete from the load table

This is just too time-consuming, in my opinion.

At present, users of the separate application can run recalculation
functions that modify all 700 files at once, causing my code to take the
whole ball of wax, rather than just the data that has changed.

What I would like to do is spawn separate processes and load only the delta
data. The data must be 100% reliable, so I'm leery of using something like
difflib. I also want to make sure that my code scales since the number of
files is ever-increasing.
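Roughly what I have in mind, in sketch form (compute_delta is just a
placeholder for the per-file delta logic, and the glob pattern and worker
count are made up):

    # Sketch only: spawn worker processes with multiprocessing.Pool and hand
    # each worker one file. compute_delta is a placeholder for the real
    # per-file delta logic, and the glob pattern is made up.
    import glob
    from multiprocessing import Pool

    def compute_delta(path):
        # placeholder: parse the file, compare against the previous snapshot,
        # and return only the rows that need to be loaded/removed
        return path, [], []

    if __name__ == '__main__':
        files = glob.glob('exports/*.txt')      # wherever the ~700 files live
        with Pool(processes=8) as pool:         # worker count is a tuning knob
            for path, inserts, deletes in pool.imap_unordered(compute_delta, files):
                # apply the per-file delta to the load table here
                print(path, len(inserts), 'inserts', len(deletes), 'deletes')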

I would be grateful for any feedback you could provide.


Thank you,

Chris Nethery
 
Gabriel Genellina

> At present, users of the separate application can run recalculation
> functions that modify all 700 files at once, causing my code to take the
> whole ball of wax, rather than just the data that has changed.

Are they text files, or what?
What kind of modifications? Some lines changed/deleted/added? A column
recalculated along the whole file?
> What I would like to do is spawn separate processes and load only the
> delta data. The data must be 100% reliable, so I'm leery of using
> something like difflib. I also want to make sure that my code scales
> since the number of files is ever-increasing.

Why don't you like difflib? AFAIK it has no known bugs.
 
Chris Nethery

Gabriel,

Thank you for your reply.

Yes, they are tab-delimited text files that will change very little
throughout the day.

But, this is messy, antiquated 80s junk, nonetheless.

Rows are designated either by a row type or they contain a comment. Each
row type has an identity value, but the 'comment' rows do not. The comment
rows, however, are logically associated with the last occurring row type.
When I generate my bulk insert file, I add the identity of the last
occurring row type to the comment rows, and generate and populate an
additional identity column in order to retain the order of the comments.
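In sketch form, the comment handling works roughly like this (the layout
below is made up for illustration; the real rules for spotting a comment
row are messier):

    # Sketch only, with a made-up layout: assume lines are tab-delimited,
    # typed rows carry their identity in the first field, and comment rows
    # leave that field empty. The real detection rules are different.
    def tag_comment_rows(lines):
        """Give each comment row the identity of the last typed row, plus a
        sequence number so the comment order survives the bulk insert."""
        tagged = []
        last_identity = None
        comment_seq = 0
        for line in lines:
            fields = line.rstrip('\n').split('\t')
            if fields[0]:                  # typed row: remember its identity
                last_identity = fields[0]
                comment_seq = 0
                tagged.append((last_identity, 0, fields))
            else:                          # comment row: inherit the identity
                comment_seq += 1
                tagged.append((last_identity, comment_seq, fields))
        return tagged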

Generally rows will either be added or changed, but sometimes rows will be
removed. Typically, only 1-5 new rows will be added to a file in a given
day, but users sometimes make manual corrections/deletions to older rows and
sometimes certain column values are recalculated.

Did I mention that the header contains another implied hierarchy?
Fortunately, I can just ignore it and strip it off.


Thank you,

Chris Nethery
 
Gabriel Genellina

> Yes, they are tab-delimited text files that will change very little
> throughout the day.
> But, this is messy, antiquated 80s junk, nonetheless.
Ugh...

> Rows are designated either by a row type or they contain a comment. Each
> row type has an identity value, but the 'comment' rows do not. The
> comment rows, however, are logically associated with the last occurring
> row type. When I generate my bulk insert file, I add the identity of the
> last occurring row type to the comment rows, and generate and populate an
> additional identity column in order to retain the order of the comments.
> Generally rows will either be added or changed, but sometimes rows will
> be removed. Typically, only 1-5 new rows will be added to a file in a
> given day, but users sometimes make manual corrections/deletions to older
> rows and sometimes certain column values are recalculated.

http://tgolden.sc.sabren.com/python/win32_how_do_i/watch_directory_for_changes.html
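A stripped-down sketch of the technique on that page, assuming pywin32 on a
Windows box (the watched path is a placeholder):

    # Roughly the technique from that page: block until Windows reports a
    # change under the watched directory, then queue the file for processing.
    # Requires the pywin32 package; the path below is a placeholder.
    import win32file, win32con

    FILE_LIST_DIRECTORY = 0x0001
    handle = win32file.CreateFile(
        r'C:\path\to\watch', FILE_LIST_DIRECTORY,
        win32con.FILE_SHARE_READ | win32con.FILE_SHARE_WRITE,
        None, win32con.OPEN_EXISTING,
        win32con.FILE_FLAG_BACKUP_SEMANTICS, None)

    while True:
        changes = win32file.ReadDirectoryChangesW(
            handle, 8192, True,            # True = include subdirectories
            win32con.FILE_NOTIFY_CHANGE_LAST_WRITE |
            win32con.FILE_NOTIFY_CHANGE_FILE_NAME,
            None, None)
        for action, filename in changes:
            print(action, filename)        # hand this file to the delta loader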

You could keep a copy of all files - let's say, as they were yesterday.
When you want to process the changes, iterate over all files and see if
they are newer than your copy. You could use the filecmp module:
http://docs.python.org/lib/module-filecmp.html
For each modified file: load it, and process the comments adding the
associated row type and the identity. Just do the same with the
"yesterday" file. (I assume they're not so big that you can keep both in
memory). You have then two lists of lines; then, use the functions in
module difflib to detect the changed lines; based on those results,
generate your database inserts/deletes/updates.

This way you will not process the unchanged files, and inside each file,
you will ignore unchanged lines. At least in principle it should be faster
than redoing everything from scratch each time...
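A rough, untested sketch of that approach (directory names and the final
load step are placeholders):

    # Find files that differ from yesterday's snapshot with filecmp, diff
    # them line-by-line with difflib, then emit the changed rows.
    import difflib, filecmp, os, shutil

    LIVE_DIR = 'live'            # where the application writes the files
    SNAPSHOT_DIR = 'snapshot'    # yesterday's copies

    def changed_files():
        for name in os.listdir(LIVE_DIR):
            live = os.path.join(LIVE_DIR, name)
            snap = os.path.join(SNAPSHOT_DIR, name)
            # shallow=False compares contents, not just the os.stat signature
            if not os.path.exists(snap) or not filecmp.cmp(live, snap, shallow=False):
                yield name

    def line_delta(old_lines, new_lines):
        deleted, inserted = [], []
        matcher = difflib.SequenceMatcher(None, old_lines, new_lines)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag in ('replace', 'delete'):
                deleted.extend(old_lines[i1:i2])
            if tag in ('replace', 'insert'):
                inserted.extend(new_lines[j1:j2])
        return deleted, inserted

    for name in changed_files():
        live = os.path.join(LIVE_DIR, name)
        snap = os.path.join(SNAPSHOT_DIR, name)
        new_lines = open(live).readlines()
        old_lines = open(snap).readlines() if os.path.exists(snap) else []
        deleted, inserted = line_delta(old_lines, new_lines)
        # placeholder: turn deleted/inserted into the DB deletes/inserts/updates
        print(name, len(deleted), 'rows removed,', len(inserted), 'rows added')
        shutil.copy2(live, snap)  # refresh the snapshot for next time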
> Did I mention that the header contains another implied hierarchy?
> Fortunately, I can just ignore it and strip it off.

good - I imagine it's enough work as it is now...
 
Chris Nethery

Gabriel,

I think that would work well. Also, thank you for suggesting the use of
filecmp. I have never used this module, but it looks like a much better
solution than what I had been doing previously--using os.stat and performing
a DB lookup in order to verify that the filename and timestamp existed in a
'file update' table. Also, if the only limitation of difflib is that both
files must reside in memory, I should be fine: the largest of these files
is just over 200k, and if memory serves me right, I can't use more than
4MB anyway. And, if I spawn separate processes for generating the delta
files, I should be able to speed things up even more.

Thanks again for your help!


Best Regards,

Christopher Nethery
 
