comparing two arrays, too slow

joe chesak

I don't have access to the database, but every day I get a csv dump of all
live customer listings.
There are around 15,000 live listings.
A record contains 3 fields: id, name, description
Every day some new accounts are created while others are deleted.
Every day I want to compare yesterday's dump with today's dump and print
the full record of each terminated account and each new account.
I am achieving this with the script below.
The last two lines make the script run very slowly.

Is there a more elegant way to compare two
arrays|hashes|sets|FasterCSVtables while carrying a few fields along for the
ride?


WORKING CODE:
require 'rubygems'
require 'fastercsv'

sundays_dump = FasterCSV.read("./sunday.csv")
mondays_dump = FasterCSV.read("./monday.csv")

sundays_ids = sundays_dump.collect {|row| row[1]}
mondays_ids = mondays_dump.collect {|row| row[1]}

newaccount_ids = mondays_ids - sundays_ids
terminated_ids = sundays_ids - mondays_ids

sundays_dump.each {|row| puts 'delete,'+row[0]+','+row[1] if terminated_ids.include? row[1]}
mondays_dump.each {|row| puts 'create,'+row[0]+','+row[1] if newaccount_ids.include? row[1]}


TEST DATA:
sunday.csv
id,name,otherstuff
1,larry,bald
2,curly,tall
3,moe,meanie

monday.csv
id,name,otherstuff
1,larry,bald
4,shemp,greasy

OUTPUT:
delete,2,curly
delete,3,moe
create,4,shemp
 
Matthew K. Williams

joe chesak wrote:
> Is there a more elegant way to compare two
> arrays|hashes|sets|FasterCSVtables while carrying a few fields along
> for the ride?

Rather than go through both sets of data, I'd do something like this:

terminated_ids.each do |row|
  puts "delete,#{sundays_dump[row][0]},#{sundays_dump[row][1]}"
end

Same thing for the new accounts (substituting Monday, of course).

This way you're only outputting the changed records.

Matt
 
Kirk Haines

joe chesak wrote:
> The last two lines make the script run very slowly.
>
> Is there a more elegant way to compare two
> arrays|hashes|sets|FasterCSVtables while carrying a few fields along
> for the ride?

The answer to "it's too slow" is often "don't do that, then."

If comparing arrays is too slow, don't compare arrays. Instead, do
something that's fast. Looking up data in a Hash is fast. And since
your data appears to be keyed by unique ids (though note that your
variable names say "ids" while the row[1] lookups actually grab the
names, which is confusing and might be a bug), it is easy to stuff
that data into a hash for later lookup.

-----
require 'rubygems'
require 'fastercsv'

# Read a dump once, returning both the ordered ids and an id => row hash.
def extract_data(dumpfile)
  dump = FasterCSV.read(dumpfile)
  ids = []
  data = {}
  dump.each do |row|
    ids << row[0]
    data[row[0]] = row
  end

  [ids, data]
end

sunday_ids, sunday_data = extract_data("./sunday.csv")
monday_ids, monday_data = extract_data("./monday.csv")

newaccount_ids = monday_ids - sunday_ids
terminated_ids = sunday_ids - monday_ids

terminated_ids.each do |id|
  row = sunday_data[id]
  puts "delete,#{id},#{row[1]}"
end

newaccount_ids.each do |id|
  row = monday_data[id]
  puts "create,#{id},#{row[1]}"
end
-----

If this were my script, I'd further clean it up so that all of the
logic lives in a simple library. Then either create a generic
executable that takes the file names as arguments, or write small
scripts with hard-coded file names that just call into that library.
It'd be much cleaner, unless you are absolutely sure that this code
will never be more than a one-off hack.
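
For instance, a minimal sketch of that split (DumpDiff and its method
names are made up for illustration, not anything from the thread):

-----
require 'rubygems'
require 'fastercsv'

# A tiny library holding all of the diff logic.
module DumpDiff
  # Read a CSV dump once, returning the ids and an id => row hash.
  def self.extract_data(dumpfile)
    ids = []
    data = {}
    FasterCSV.read(dumpfile).each do |row|
      ids << row[0]
      data[row[0]] = row
    end
    [ids, data]
  end

  # Print create/delete lines for the differences between two dumps.
  def self.report(old_file, new_file, out = $stdout)
    old_ids, old_data = extract_data(old_file)
    new_ids, new_data = extract_data(new_file)

    (old_ids - new_ids).each {|id| out.puts "delete,#{id},#{old_data[id][1]}"}
    (new_ids - old_ids).each {|id| out.puts "create,#{id},#{new_data[id][1]}"}
  end
end

# Generic executable: file names come from the command line, e.g.
#   ruby dump_diff.rb sunday.csv monday.csv
DumpDiff.report(ARGV[0], ARGV[1]) if $0 == __FILE__
-----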


Kirk Haines
 
Caleb Clausen

joe chesak wrote:
> Is there a more elegant way to compare two
> arrays|hashes|sets|FasterCSVtables while carrying a few fields along
> for the ride?

Here's an idea... although Matthew's suggestion is really the better
way to go (though I think his code won't quite work... you'll have to
save the array indexes away somewhere to make that approach effective).

Anyway, it should be much faster if you use Sets instead of arrays for
terminated_ids and newaccount_ids. Like this:

require 'rubygems'
require 'fastercsv'
require 'set'

sundays_dump = FasterCSV.read("./sunday.csv")
mondays_dump = FasterCSV.read("./monday.csv")

sundays_ids = sundays_dump.collect {|row| row[1]}
mondays_ids = mondays_dump.collect {|row| row[1]}

newaccount_ids = Set[*mondays_ids - sundays_ids]
terminated_ids = Set[*sundays_ids - mondays_ids]

sundays_dump.each {|row| puts 'delete,'+row[0]+','+row[1] if terminated_ids.include? row[1]}
mondays_dump.each {|row| puts 'create,'+row[0]+','+row[1] if newaccount_ids.include? row[1]}


Turning sundays_ids and mondays_ids into Sets might make it a tad
faster yet, but I think the above code has already achieved the lion's
share of the potential speedup.
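
If you want that last bit too, a minimal sketch of the extra step
(assuming the ids fit comfortably in memory):

require 'set'

# Build the id collections as Sets up front, so the set differences
# and the include? lookups all use hashing instead of array scans.
sundays_ids = Set.new(sundays_dump.collect {|row| row[1]})
mondays_ids = Set.new(mondays_dump.collect {|row| row[1]})

newaccount_ids = mondays_ids - sundays_ids
terminated_ids = sundays_ids - mondays_ids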
 
Brian Candler

Another option: sort them by id, then walk through them together. If the
current id on both list A and list B is the same, then skip forward on
both. If the current id on list A is less than the current id on list B,
then it exists only in A (so report it, and skip forward on A). And
vice versa.

The advantage of this is that it works with huge files - you can read
them one line at a time instead of reading them all into RAM. And there
are tools which can do an external sort efficiently, if they're not
already sorted that way.
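
A minimal sketch of that walk (assuming headerless files already sorted
by id, comparing ids as strings the way sort and join do; diff_sorted
is just an illustrative name):

def diff_sorted(old_path, new_path)
  File.open(old_path) do |old_f|
    File.open(new_path) do |new_f|
      a, b = old_f.gets, new_f.gets
      while a || b
        a_id = a && a.split(',', 2).first
        b_id = b && b.split(',', 2).first
        if b.nil? || (a && a_id < b_id)
          puts "delete,#{a.chomp}"      # only in the old dump
          a = old_f.gets
        elsif a.nil? || b_id < a_id
          puts "create,#{b.chomp}"      # only in the new dump
          b = new_f.gets
        else
          a, b = old_f.gets, new_f.gets # in both: unchanged, skip
        end
      end
    end
  end
end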

To avoid coding this, just use the unix shell command 'join' to do the
work for you (which you can invoke from IO.popen if you like). Just
beware that it's very fussy about having the files correctly sorted
first.

$ join -t, -j1 -v1 sunday.csv monday.csv
2,curly,tall
3,moe,meanie
$ join -t, -j1 -v2 sunday.csv monday.csv
4,shemp,greasy
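
And a sketch of driving that from Ruby via IO.popen, as mentioned
above (assuming headerless dumps; the .sorted file names are made up,
and each file is sorted first because join insists on it):

%x{sort -t, -k1,1 -o sunday.sorted sunday.csv}
%x{sort -t, -k1,1 -o monday.sorted monday.csv}

# -v1: records only in sunday (terminated); -v2: only in monday (new).
IO.popen("join -t, -j1 -v1 sunday.sorted monday.sorted") do |io|
  io.each_line {|line| puts "delete,#{line.chomp}"}
end
IO.popen("join -t, -j1 -v2 sunday.sorted monday.sorted") do |io|
  io.each_line {|line| puts "create,#{line.chomp}"}
end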
 
joe chesak

Guys! Thanks for the great responses and clever perspectives on this
problem. I tried all of them and got three of the four working on my
real dataset.

Caleb: I had actually tried to apply Set to this problem and failed,
because I thought I should be able to point it at one column of a
multi-column csv file... thinking of a .csv file almost as a
full-fledged database table. Your minor tweak to my code shows just
how easy it is to choose between Array and Set.

Brian: That's a raw and fast solution. I will keep it in mind if my
datasets get too large.

Kirk: Thanks for coding that all the way through. Great idea to build
the id index and the data hash in one sweep.


Joe Chesak
 
