comparing two arrays, too slow

Discussion in 'Ruby' started by joe chesak, Aug 6, 2010.

  1. joe chesak

    joe chesak Guest


    I don't have access to the database, but every day I get a csv dump of all
    live customer listings.
    There are around 15,000 live listings.
    A record contains 3 fields: id, name, description
    Every day some new accounts are created while others are deleted.
    Every day I want to compare yesterday's dump with today's dump and print
    the full record of each terminated record and each new record.
    I am achieving this with the script below.
    Those last two lines make the script run very slowly.

    Is there a more elegant way to compare two
    arrays|hashes|sets|FasterCSV tables while carrying a few fields along
    for the ride?


    WORKING CODE:
    require 'rubygems'
    require 'fastercsv'

    sundays_dump = FasterCSV.read("./sunday.csv")
    mondays_dump = FasterCSV.read("./monday.csv")

    sundays_ids = sundays_dump.collect {|row| row[1]}
    mondays_ids = mondays_dump.collect {|row| row[1]}

    newaccount_ids = mondays_ids - sundays_ids
    terminated_ids = sundays_ids - mondays_ids

    sundays_dump.each {|row| puts 'delete,'+row[0]+','+row[1] if terminated_ids.include? row[1]}
    mondays_dump.each {|row| puts 'create,'+row[0]+','+row[1] if newaccount_ids.include? row[1]}


    TEST DATA:
    sunday.csv
    id,name,otherstuff
    1,larry,bald
    2,curly,tall
    3,moe,meanie

    monday.csv
    id,name,otherstuff
    1,larry,bald
    4,shemp,greasy

    OUTPUT:
    delete,2,curly
    delete,3,moe
    create,4,shemp
     
    joe chesak, Aug 6, 2010
    #1

  2. Rather than go through both sets of data, I'd do something like this:

    terminated_ids.each do |row|
      puts "delete,#{sundays_dump[row][0]},#{sundays_dump[row][1]}"
    end

    Same thing for the new account (substituting monday, of course)

    This way you're only outputting the changed records.

    Matt
     
    Matthew K. Williams, Aug 6, 2010
    #2

  3. Kirk Haines

    Kirk Haines Guest

    The answer to "it's too slow" is often "don't do that, then."

    If comparing arrays is too slow, don't compare arrays; do something
    that's fast instead. Looking up data in a Hash is fast. And since your
    data appears to be keyed by unique ids (though your variable names say
    ids while the array lookups in row[1] actually use the names, which is
    confusing and might be a bug), it is easy to stuff that data into a
    hash for later lookup.

    -----
    require 'rubygems'
    require 'fastercsv'

    def extract_data(dumpfile)
      dump = FasterCSV.read(dumpfile)
      ids = []
      data = {}
      dump.each do |row|
        ids << row[0]
        data[row[0]] = row
      end
      [ids, data]
    end

    sunday_ids, sunday_data = extract_data("./sunday.csv")
    monday_ids, monday_data = extract_data("./monday.csv")

    newaccount_ids = monday_ids - sunday_ids
    terminated_ids = sunday_ids - monday_ids

    terminated_ids.each do |id|
      row = sunday_data[id]
      puts "delete,#{id},#{row[1]}"
    end

    newaccount_ids.each do |id|
      row = monday_data[id]
      puts "create,#{id},#{row[1]}"
    end
    -----

    If this were my script, I'd further clean it up so that all of your
    logic is in a simple library. Then either create a generic executable
    that takes the file names as arguments, or write your scripts with
    hard coded file names, but that just calls into your library with
    those names. It'd be much cleaner unless you are absolutely sure that
    this code will never be more than a one-off hack.
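    On a modern Ruby (1.9+), where FasterCSV ships as the standard csv
    library, Kirk's library-plus-executable suggestion might be sketched
    like this (the DumpDiff name and file layout are invented for
    illustration, not code from the thread):

    ```ruby
    # dump_diff.rb -- hypothetical library wrapping the logic above
    require 'csv'  # on Ruby 1.9+ the standard csv library is FasterCSV

    module DumpDiff
      # Returns [ids, id => row hash] for a CSV dump keyed on its first column.
      def self.extract_data(dumpfile)
        ids = []
        data = {}
        CSV.read(dumpfile).each do |row|
          ids << row[0]
          data[row[0]] = row
        end
        [ids, data]
      end

      # Yields ["delete" or "create", id, name] for each changed record.
      def self.diff(old_file, new_file)
        old_ids, old_data = extract_data(old_file)
        new_ids, new_data = extract_data(new_file)
        (old_ids - new_ids).each { |id| yield 'delete', id, old_data[id][1] }
        (new_ids - old_ids).each { |id| yield 'create', id, new_data[id][1] }
      end
    end

    # Generic executable: ruby dump_diff.rb sunday.csv monday.csv
    if __FILE__ == $0 && ARGV.length == 2
      DumpDiff.diff(ARGV[0], ARGV[1]) { |op, id, name| puts "#{op},#{id},#{name}" }
    end
    ```

    A hard-coded daily script then shrinks to a one-line call into the
    library with its two file names.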


    Kirk Haines
     
    Kirk Haines, Aug 6, 2010
    #3
  4. Here's an idea... although, Matthew's suggestion is really the better
    way to go (though I think his code won't quite work... you'll have to
    save away array indexes somewhere to make that approach effective).
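    Saving those array indexes might look like this (a sketch, not code
    from the thread; the changed_records name is invented, and csv stands
    in for fastercsv on Ruby 1.9+):

    ```ruby
    require 'csv'

    # Build id => array-index lookups once, so each terminated or new id
    # can be resolved to its full row without rescanning the dump.
    def changed_records(old_file, new_file)
      old_dump = CSV.read(old_file)
      new_dump = CSV.read(new_file)

      old_index = {}
      old_dump.each_with_index { |row, i| old_index[row[0]] = i }
      new_index = {}
      new_dump.each_with_index { |row, i| new_index[row[0]] = i }

      changes = []
      (old_index.keys - new_index.keys).each do |id|
        row = old_dump[old_index[id]]
        changes << "delete,#{row[0]},#{row[1]}"
      end
      (new_index.keys - old_index.keys).each do |id|
        row = new_dump[new_index[id]]
        changes << "create,#{row[0]},#{row[1]}"
      end
      changes
    end
    ```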

    Anyway, it should be much faster if you use Sets instead of arrays for
    terminated_ids and newaccount_ids. Like this:

    require 'rubygems'
    require 'fastercsv'
    require 'set'

    sundays_dump = FasterCSV.read("./sunday.csv")
    mondays_dump = FasterCSV.read("./monday.csv")

    sundays_ids = sundays_dump.collect {|row| row[1]}
    mondays_ids = mondays_dump.collect {|row| row[1]}

    newaccount_ids = Set[*mondays_ids - sundays_ids]
    terminated_ids = Set[*sundays_ids - mondays_ids]

    sundays_dump.each {|row| puts 'delete,'+row[0]+','+row[1] if terminated_ids.include? row[1]}
    mondays_dump.each {|row| puts 'create,'+row[0]+','+row[1] if newaccount_ids.include? row[1]}


    Turning sundays_ids and mondays_ids into Sets might also make it a tad
    faster yet, but I think that the above code has achieved the lion's
    share of the potential speedup.
     
    Caleb Clausen, Aug 6, 2010
    #4
  5. Another option: sort them by id, then walk through them together. If the
    current id on both list A and list B is the same, then skip forward on
    both. If the current id on list A is less than the current id on list B,
    then it exists only in B (so report this, and skip forward on A). And
    vice versa.

    The advantage of this is that it works with huge files - you can read
    them one line at a time instead of reading them all into RAM. And there
    are tools which can do an external sort efficiently, if they're not
    already sorted that way.
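    The walk described above might be sketched like this (not Brian's code;
    diff_sorted is an invented name, ids are compared as strings, and both
    streams must be sorted the same way, the same caveat join has):

    ```ruby
    # Walk two id-sorted CSV streams in lockstep, holding only one record
    # from each in memory at a time.
    def diff_sorted(old_io, new_io)
      changes = []
      a, b = old_io.gets, new_io.gets
      until a.nil? && b.nil?
        a_id = a && a.split(',', 2).first
        b_id = b && b.split(',', 2).first
        if b.nil? || (!a.nil? && a_id < b_id)
          changes << "delete,#{a.chomp}"   # id only in the old file
          a = old_io.gets
        elsif a.nil? || b_id < a_id
          changes << "create,#{b.chomp}"   # id only in the new file
          b = new_io.gets
        else
          a, b = old_io.gets, new_io.gets  # same id in both: unchanged
        end
      end
      changes
    end
    ```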

    To avoid coding this, just use the unix shell command 'join' to do the
    work for you (which you can invoke from IO.popen if you like). Just
    beware that it's very fussy about having the files correctly sorted
    first.

    $ join -t, -j1 -v1 sunday.csv monday.csv
    2,curly,tall
    3,moe,meanie
    $ join -t, -j1 -v2 sunday.csv monday.csv
    4,shemp,greasy
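    Driving join from IO.popen, as suggested above, might look like this
    (a sketch with an invented join_diff name; it assumes a unix join in
    PATH and files already sorted on the first field):

    ```ruby
    # Shell out to join for each direction of the diff, then tag the
    # unpairable lines the same way the pure-Ruby versions do.
    def join_diff(old_file, new_file)
      deleted = IO.popen(['join', '-t,', '-j1', '-v1', old_file, new_file], &:readlines).map(&:chomp)
      created = IO.popen(['join', '-t,', '-j1', '-v2', old_file, new_file], &:readlines).map(&:chomp)
      deleted.map { |l| "delete,#{l}" } + created.map { |l| "create,#{l}" }
    end
    ```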
     
    Brian Candler, Aug 6, 2010
    #5
  6. joe chesak

    joe chesak Guest


    Guys! Thanks for the great responses, and clever perspectives on this
    problem. I have tried out all of them--got 3 of 4 to work on my real
    dataset.

    Caleb: I actually had tried to apply Set to this problem and failed because
    I thought I should be able to point it at one column of a multi-column csv
    file, thinking of a .csv file almost as a full-fledged database table.
    Your minor tweak to my code shows just how easy it is to choose between
    Array and Set.

    Brian: That's a raw and fast solution. I will keep this in mind if my
    datasets get too large.

    Kirk: Thanks for coding that all the way through. Great idea to build the
    id list and the data hash in one sweep.


    Joe Chesak
     
    joe chesak, Aug 9, 2010
    #6