File Auditing - Fast DB import and data manipulation

Discussion in 'Ruby' started by bcorrigan78@gmail.com, Mar 20, 2006.

  1. Guest

    Hey Folks,

    I'm looking for some suggestions here. I need to process 200
    gzip-compressed files each day that contain comma-delimited information
    in the following format:
    HostName, FileName, DirName, Modified_On

    Each file contains upwards of 200K rows. I need to compare this
    information to the information received the day before to look for
    files that have changed.

    My current plan was to:
    ##
    for each compressed file:
      for each line (uncompressed):
        import the line into MySQL
      end
    end
    run several comparison queries to find changes; save the changes to a table
    automatically review the changes based on rules; whatever is left over
    is unauthorized
    ##

    The MySQL import is SLOOOOW. It's taking 10 minutes per file.
    Extrapolating, that's 2000 minutes, or about 33 hours each day, just to
    do the import. Unfortunately, earth days only have 24 hours.

    So, I need some way to compare today's file to yesterday's and see the
    changes. Is there a good way to do this using the text files and skipping
    the import process? I'm worried that this will slow down the comparison
    process, but I'd like to try it....
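
    Roughly, I was picturing something like this (completely untested; the
    file names are made up, and it assumes a changed Modified_On means a
    changed file):

    ##
    require 'zlib'

    YESTERDAY = 'files_2006-03-19.csv.gz'   # made-up names
    TODAY     = 'files_2006-03-20.csv.gz'

    # Read a gzipped CSV line by line and index Modified_On by
    # "host,dir,file", so memory use stays bounded per file.
    def load_index(path)
      index = {}
      Zlib::GzipReader.open(path) do |gz|
        gz.each_line do |line|
          host, file, dir, modified = line.chomp.split(',', 4)
          index["#{host},#{dir},#{file}"] = modified
        end
      end
      index
    end

    old_idx = load_index(YESTERDAY)
    new_idx = load_index(TODAY)

    # A file counts as changed if its key is new or its timestamp differs.
    new_idx.each do |key, modified|
      if !old_idx.key?(key)
        puts "NEW:     #{key}"
      elsif old_idx[key] != modified
        puts "CHANGED: #{key} (#{old_idx[key]} -> #{modified})"
      end
    end
    ##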

    Thanks :)

    Brian
     
    bcorrigan78@gmail.com, Mar 20, 2006
    #1

  2. Corey Lawson Guest

    Well, if you're doing it on *nix, use the 'diff' command. On Windows,
    use 'fc', and write some other code to handle the differences. Loading
    the files into the database is not the best way forward. Use 'gunzip'
    on either system (on Windows you'll need to download gunzip from
    GnuWin32, http://gnuwin32.sourceforge.net/packages.html; heck, might as
    well download the package with diff and grep in it as well).

    Wait, this has come up before, and sparked quite the flamefest each time!

    With moderately clever code, especially on *nix, you should even be
    able to do this in a shell script with minimal lines of code,
    something like this rough sh sketch:

    prev=""
    for f in *.gz; do
      # arg, gotta gunzip the file first...
      gunzip -c "$f" > "$f.txt"
      if [ -n "$prev" ]; then
        # keep only the lines that differ from the previous file
        diff "$prev" "$f.txt" | grep '^>' > "$f.diff"
      fi
      prev="$f.txt"
    done

    The databases I've dealt with (Oracle, SQL Server, Access, Postgres)
    don't have a good way to do this in SQL or stored procedure language,
    because they don't really support recursion at the SQL level. Plus,
    it's just too expensive for the database to allocate table space (temp
    or not) to load a couple of files, run all the queries that you would
    need to run to spot the differences and extract them meaningfully, and
    then go on to the next file.

    Otherwise, another quick-and-dirty way to do it is to set up the import
    table with a multi-column PK or unique constraint, add a field to store
    the file name (but keep it out of the constraint), and then just bulk
    import each file. If you updated the filename field after the previous
    load, then all the records that were successfully added this time will
    still have NULL for a filename. Depending on the database, the rows that
    failed to be added due to constraint violations might be in the error
    logs, or they may just not matter at all. Run any reports you need on
    the new records before updating the filename field.

    Lather-rinse-repeat.
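
    A rough sketch of that approach (untested; I'm assuming Ruby with the
    mysql2 gem, and the database, table, and login details are placeholders;
    the unique key includes modified_on, so a changed timestamp shows up as
    a new row):

    require 'mysql2'

    db = Mysql2::Client.new(host: 'localhost', username: 'audit',
                            password: 'secret', database: 'audit_db',
                            local_infile: true)

    # Multi-column unique key on the file identity; source_file stays
    # outside the key so it can mark which batch each row came from.
    db.query(<<~SQL)
      CREATE TABLE IF NOT EXISTS file_audit (
        host_name   VARCHAR(255),
        file_name   VARCHAR(255),
        dir_name    VARCHAR(255),
        modified_on VARCHAR(64),
        source_file VARCHAR(255),
        UNIQUE KEY uq_file (host_name, dir_name, file_name, modified_on)
      )
    SQL

    Dir.glob('*.txt').sort.each do |txt|
      # IGNORE skips rows that violate the unique key, so only new or
      # changed rows land in the table, with source_file still NULL.
      db.query(<<~SQL)
        LOAD DATA LOCAL INFILE '#{txt}' IGNORE INTO TABLE file_audit
        FIELDS TERMINATED BY ','
        (host_name, file_name, dir_name, modified_on)
      SQL

      # ...run your reports on WHERE source_file IS NULL here...

      db.query("UPDATE file_audit SET source_file = '#{txt}' " \
               "WHERE source_file IS NULL")
    end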

    For the SQL Server geeks, doing the "loop over the files in a
    directory" is possible via DTS (sqldts.com has sample VBScript code to
    do it...), but it's a sucky pain in the ass to do. But maybe it's
    easier to do in SS2005.

     
    Corey Lawson, Mar 21, 2006
    #2

  3. Vance Heron Guest

    You can use the 'mysqlimport' utility to do a bulk load
    of the data.
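
    Something along these lines (untested; the credentials and the
    'file_audit' database and table names are placeholders, and mysqlimport
    picks the target table from the text file's base name):

    Dir.glob('*.gz').sort.each do |gz|
      # uncompress, then let mysqlimport bulk-load instead of inserting
      # row by row
      system("gunzip -c #{gz} > file_audit.txt") or abort "gunzip failed: #{gz}"
      system("mysqlimport --local --fields-terminated-by=',' " \
             "--user=audit --password=secret file_audit file_audit.txt") or
        abort "mysqlimport failed: #{gz}"
    end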

    Vance

     
    Vance Heron, Mar 26, 2006
    #3
