merging partial, incomplete data files

Benjamin Thomas · Oct 21, 2009

Hello all,

I write scripts for fun and am still a noob.

I'd like to use ruby to solve a problem for me but find it difficult
this time so I thought I'd ask for help.

It's very basic in fact.

I have 2 csv files I'd like to merge such as :

File1 File2
----- -----
ref;qty ref;price
A;10 A;100
B;20 D;150
C;30 C;200
E;5 B;75

outputs to File3 =>

File3
-----
ref;qty;price;total
A;10;100;1000
B;20;75;1500
C;30;200;6000
D;"missing_data";150;"missing_data"
E;5;"missing_data";"missing_data"

Here is the code I have so far:

###########################################################################
def process_file(file)
processed = []
opened = File.open(file, "r").readlines

opened.each {|line| processed << line.strip.split(";")}
return processed
end

arr1 = process_file("file1")
arr2 = process_file("file2")

p arr1, arr2
###########################################################################

==> [["ref", "qty"], ["A", "10"], ["B", "20"], ["C", "30"], ["E", "5"]]
==> [["ref", "price"], ["A", "100"], ["D", "150"], ["C", "200"], ["B",
"75"]]

----
So I have 2 arrays which is great but I'm not sure on how to go about
merging all this since data is sometimes missing from file1, some other
times from file2. Of course those 2 files are more complex (and
bigger!), and I'd like to setup a mechanism that would allow me to
select which columns to filter and merge, but that's the basic idea
anyway.

Thanks for your input and direction.

Robert Klemme · Oct 21, 2009

Hello all,

I write scripts for fun and am still a noob.

I'd like to use ruby to solve a problem for me but find it difficult
this time so I thought I'd ask for help.

It's very basic in fact.

I have 2 csv files I'd like to merge such as :

File1 File2
----- -----
ref;qty ref;price
A;10 A;100
B;20 D;150
C;30 C;200
E;5 B;75

outputs to File3 =>

File3
-----
ref;qty;price;total
A;10;100;1000
B;20;75;1500
C;30;200;6000
D;"missing_data";150;"missing_data"
E;5;"missing_data";"missing_data"

Here is the code I have so far:

###########################################################################
def process_file(file)
processed = []
opened = File.open(file, "r").readlines

opened.each {|line| processed << line.strip.split(";")}
return processed
end

arr1 = process_file("file1")
arr2 = process_file("file2")

p arr1, arr2
###########################################################################

==> [["ref", "qty"], ["A", "10"], ["B", "20"], ["C", "30"], ["E", "5"]]
==> [["ref", "price"], ["A", "100"], ["D", "150"], ["C", "200"], ["B",
"75"]]

----
So I have 2 arrays which is great but I'm not sure on how to go about
merging all this since data is sometimes missing from file1, some other
times from file2. Of course those 2 files are more complex (and
bigger!), and I'd like to setup a mechanism that would allow me to
select which columns to filter and merge, but that's the basic idea
anyway.

Thanks for your input and direction.

I'd say, you should look into classes Hash and Set. If you have both
files in a Hash each you can do

keys = f1.keys.to_set | f2.keys

Then you can do

keys.each do |key|
# fetch from f1 and f2
end

Kind regards

robert

James Edward Gray II · Oct 21, 2009

Hello all,
Hello.

I write scripts for fun and am still a noob.

I don't think I wrote code as good as you when I was new.

I have 2 csv files I'd like to merge such as :

File1 File2
----- -----
ref;qty ref;price
A;10 A;100
B;20 D;150
C;30 C;200
E;5 B;75

outputs to File3 =>

File3
-----
ref;qty;price;total
A;10;100;1000
B;20;75;1500
C;30;200;6000
D;"missing_data";150;"missing_data"
E;5;"missing_data";"missing_data"

So I have 2 arrays which is great but I'm not sure on how to go about
merging all this since data is sometimes missing from file1, some
other
times from file2.

It's a tricky question.

There are a few ways I can't of to do things.

First, you could load one giant Hash (indexed by ref) of Arrays or
Hashes with the data as you read it, then just dump the Hash to disk.
This is the easiest, but it requires you to have all data in memory at
once. Are the files small enough that you can do that?

James Edward Gray II

Gyepi SAM · Oct 21, 2009

Hi Benjamin,

This can be done in ruby, and my inline responses will hopefully point you in
the right direction. However, depending on the size of the data or the amount
of available memory, doing this in ruby may not be the best choice.
If I were doing this, I would probably import both files into a database
like sqlite3, using a script, of course, and generate the output with sql.
The front end script would accept parameters for the input files, the columns
to output (or a list of columns to suppress, if any) and produce the desired
results.

Rather than using generating two arrays, I would read one file
into a hash, keyed by the common key, 'ref', then read the second file
and match up its record, if any, in in other file, and output the combined
set. You'd want to keep track of which lines from file1 have a match so you
can output all the non-matching values. However, I would strongly suggest
that rather than doing the work yourself, generate code to script sqlite
or other database system.

HTH

-Gyepi

Benjamin Thomas · Oct 22, 2009

James said:
First, you could load one giant Hash (indexed by ref) of Arrays or
Hashes with the data as you read it, then just dump the Hash to disk.
This is the easiest, but it requires you to have all data in memory at
once. Are the files small enough that you can do that?

Yes, I don't think that would be a problem. This is what I have so far :

#############################################################################
def process_file(file)
header = []
processed = []
opened = File.open(file, "r").readlines # store in
array -- index=line

opened.each {|line| processed << line.strip.split(";")} # remove
separator and "\n"
header = processed.shift
return header, processed # the array
is clean
end

def build_hash(header, arr, col_pos)
hash = {}
arr.size.times {|counter|
hash["#{arr[counter][0]}_#{header[col_pos]}"]=arr[counter][col_pos] }
return hash
end

hdr1, arr1 = process_file("file1")
hdr2, arr2 = process_file("file2")

p hdr1, arr1
puts
p hdr2, arr2

hash1 = build_hash(hdr1, arr1, 0)
hash2 = build_hash(hdr1, arr1, 1)

hash3 = build_hash(hdr2, arr2, 0)
hash4 = build_hash(hdr2, arr2, 1)
puts

p hash1, hash2
puts
p hash3, hash4

puts
hash = {}
merge1 = hash1.merge(hash2)
merge2 = hash3.merge(hash4)
hash = merge1.merge(merge2)

p hdr1 + hdr2
p hash
#########################################################################
Outputs to :
["ref", "qty"]
[["A", "10"], ["B", "20"], ["C", "30"], ["E", "5"]]

["ref", "price"]
[["A", "100"], ["D", "150"], ["C", "200"], ["B", "75"]]

{"A_ref"=>"A", "C_ref"=>"C", "B_ref"=>"B", "E_ref"=>"E"}
{"A_qty"=>"10", "C_qty"=>"30", "B_qty"=>"20", "E_qty"=>"5"}

{"A_ref"=>"A", "C_ref"=>"C", "B_ref"=>"B", "D_ref"=>"D"}
{"B_price"=>"75", "A_price"=>"100", "D_price"=>"150", "C_price"=>"200"}

["ref", "qty", "ref", "price"]
{"B_price"=>"75", "A_ref"=>"A", "A_price"=>"100", "C_ref"=>"C",
"D_ref"=>"D", "B_ref"=>"B", "A_qty"=>"10", "D_price"=>"150",
"C_price"=>"200", "C_qty"=>"30", "E_ref"=>"E", "B_qty"=>"20",
"E_qty"=>"5"}
############################################################################

Robert said:
I'd say, you should look into classes Hash and Set. If you have both
files in a Hash each you can do...

Thanks, I will look into "Set" which I don't know about.

Gyepi said:
However, I would strongly suggest that rather than doing the work yourself, > generate code to script sqlite or other database system.

Yes, I have thought about that. I would guess it would be the most
efficient solution in terms of ressources and speed. However I don't
mind taking the time and making an effort to understand data
manipulation.

Rather than using generating two arrays, I would read one file
into a hash, keyed by the common key, 'ref', then read the second file
and match up its record, if any, in in other file, and output the
combined

Is what I've done similar to what you had in mind ?

Thanks all for the input.

Merge/append CSV files with different headers	0	Mar 24, 2014
Gathering Source files	1	Apr 21, 2005
Translater + module + tkinter	1	Feb 16, 2023
Saving a compiled source file with the "require"d files	3	Apr 21, 2005
Comparing Two Files line by line and word by word	8	Jan 11, 2006
A script to separate out file names from the path?	12	Dec 11, 2006
A data transformation framework. A presentation inviting commentary.	0	Aug 21, 2013
Puzzle about multi definitions	2	Mar 21, 2007

merging partial, incomplete data files

Benjamin Thomas

Robert Klemme

James Edward Gray II

Gyepi SAM

Benjamin Thomas

Ask a Question

Similar Threads

Staff online

Members online

Forum statistics

Latest Threads