File Auditing - Fast DB import and data manipulation

Discussion in 'Ruby' started by bcorrigan78@gmail.com, Mar 20, 2006.

  1. Guest

    Hey Folks,

    I'm looking for some suggestions here. I need to process 200
    gzip-compressed files each day that contain comma-delimited information
    in the following format:
    HostName, FileName, DirName, Modified_On

    Each file contains upwards of 200K rows. I need to compare this
    information to the information received the day before to look for
    files that have changed.

    My current plan was to:
    ##
    for each compressed file:
      for each line (uncompressed):
        import the line into MySQL
      end
    end
    run several comparison queries to find changes; save the changes to a table
    automatically review the changes based on rules; whatever is left over
    is unauthorized
    ##

    The MySQL import is SLOOOOW. It's taking 10 minutes per file.
    Extrapolating, that's 2000 minutes, or about 33 hours each day, just to
    do the import. Unfortunately, earth days only have 24 hours.

    So, I need some way to compare today's file to yesterday's and see the
    changes. Is there a good way to do this using the text files and skipping
    the import process? I'm worried that this will slow down the comparison
    process, but I'd like to try it....
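
    Roughly, I was picturing something like this (completely untested; the
    file names are made up, and it assumes a changed Modified_On means a
    changed file):

    ##
    require 'zlib'

    YESTERDAY = 'files_2006-03-19.csv.gz'   # made-up names
    TODAY     = 'files_2006-03-20.csv.gz'

    # Read a gzipped CSV line by line and index Modified_On by
    # "host,dir,file", so memory use stays bounded per file.
    def load_index(path)
      index = {}
      Zlib::GzipReader.open(path) do |gz|
        gz.each_line do |line|
          host, file, dir, modified = line.chomp.split(',', 4)
          index["#{host},#{dir},#{file}"] = modified
        end
      end
      index
    end

    old_idx = load_index(YESTERDAY)
    new_idx = load_index(TODAY)

    # A file counts as changed if its key is new or its timestamp differs.
    new_idx.each do |key, modified|
      if !old_idx.key?(key)
        puts "NEW:     #{key}"
      elsif old_idx[key] != modified
        puts "CHANGED: #{key} (#{old_idx[key]} -> #{modified})"
      end
    end
    ##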

    Thanks :)

    Brian
     
    bcorrigan78@gmail.com, Mar 20, 2006
    #1

  2. Corey Lawson Guest

    Well, if you're doing it on *nix, use the 'diff' command. On Windows,
    use 'fc', and write some other code to handle the differences. Loading
    the files into the database is not the best way forward. Use 'gunzip'
    on either system (on Windows you'll need to download gunzip from
    GnuWin32, http://gnuwin32.sourceforge.net/packages.html; heck, might as
    well download the package with diff and grep in it as well).

    Wait, this has come up before, and sparked quite the flamefest each time!

    With moderately clever code, especially on *nix, you should even be
    able to do this in a shell script with minimal lines of code,
    something like this rough sh sketch:

    prev=""
    for f in *.gz; do
      # arg, gotta gunzip the file first...
      gunzip -c "$f" > "$f.txt"
      if [ -n "$prev" ]; then
        # keep only the lines that differ from the previous file
        diff "$prev" "$f.txt" | grep '^>' > "$f.diff"
      fi
      prev="$f.txt"
    done

    The databases I've dealt with (Oracle, SQL Server, Access, Postgres)
    don't have a good way to do this in SQL or stored procedure language,
    because they don't really support recursion at the SQL level. Plus,
    it's just too expensive for the database to allocate table space (temp
    or not) to load a couple of files, run all the queries that you would
    need to run to spot the differences and extract them meaningfully, and
    then go on to the next file.

    Otherwise, another quick-and-dirty way to do it is to set up the import
    table with a multi-column PK or unique constraint, add a field to store
    the file name (but keep it out of the constraint), and then just bulk
    import each file. If you updated the filename field after the previous
    load, then all the records that were successfully added this time will
    still have NULL for a filename. Depending on the database, the rows that
    failed to be added due to constraint violations might be in the error
    logs, or they may just not matter at all. Run any reports you need on
    the new records before updating the filename field.

    Lather-rinse-repeat.
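
    A rough sketch of that approach (untested; I'm assuming Ruby with the
    mysql2 gem, and the database, table, and login details are placeholders;
    the unique key includes modified_on, so a changed timestamp shows up as
    a new row):

    require 'mysql2'

    db = Mysql2::Client.new(host: 'localhost', username: 'audit',
                            password: 'secret', database: 'audit_db',
                            local_infile: true)

    # Multi-column unique key on the file identity; source_file stays
    # outside the key so it can mark which batch each row came from.
    db.query(<<~SQL)
      CREATE TABLE IF NOT EXISTS file_audit (
        host_name   VARCHAR(255),
        file_name   VARCHAR(255),
        dir_name    VARCHAR(255),
        modified_on VARCHAR(64),
        source_file VARCHAR(255),
        UNIQUE KEY uq_file (host_name, dir_name, file_name, modified_on)
      )
    SQL

    Dir.glob('*.txt').sort.each do |txt|
      # IGNORE skips rows that violate the unique key, so only new or
      # changed rows land in the table, with source_file still NULL.
      db.query(<<~SQL)
        LOAD DATA LOCAL INFILE '#{txt}' IGNORE INTO TABLE file_audit
        FIELDS TERMINATED BY ','
        (host_name, file_name, dir_name, modified_on)
      SQL

      # ...run your reports on WHERE source_file IS NULL here...

      db.query("UPDATE file_audit SET source_file = '#{txt}' " \
               "WHERE source_file IS NULL")
    end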

    For the SQL Server geeks, doing the "loop over the files in a
    directory" is possible via DTS (sqldts.com has sample VBScript code to
    do it...), but it's a sucky pain in the ass to do. But maybe it's
    easier to do in SS2005.

     
    Corey Lawson, Mar 21, 2006
    #2

  3. Vance Heron Guest

    You can use the 'mysqlimport' utility to do a bulk load
    of the data.
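
    Something along these lines (untested; the credentials and the
    'file_audit' database and table names are placeholders, and mysqlimport
    picks the target table from the text file's base name):

    Dir.glob('*.gz').sort.each do |gz|
      # uncompress, then let mysqlimport bulk-load instead of inserting
      # row by row
      system("gunzip -c #{gz} > file_audit.txt") or abort "gunzip failed: #{gz}"
      system("mysqlimport --local --fields-terminated-by=',' " \
             "--user=audit --password=secret file_audit file_audit.txt") or
        abort "mysqlimport failed: #{gz}"
    end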

    Vance

     
    Vance Heron, Mar 26, 2006
    #3
