merging partial, incomplete data files

B

Benjamin Thomas

Hello all,

I write scripts for fun and am still a noob.

I'd like to use ruby to solve a problem for me but find it difficult
this time so I thought I'd ask for help.

It's very basic in fact.

I have 2 csv files I'd like to merge such as :

File1 File2
----- -----
ref;qty ref;price
A;10 A;100
B;20 D;150
C;30 C;200
E;5 B;75

outputs to File3 =>

File3
-----
ref;qty;price;total
A;10;100;1000
B;20;75;1500
C;30;200;6000
D;"missing_data";150;"missing_data"
E;5;"missing_data";"missing_data"

Here is the code I have so far:

###########################################################################
def process_file(file)
processed = []
opened = File.open(file, "r").readlines

opened.each {|line| processed << line.strip.split(";")}
return processed
end

arr1 = process_file("file1")
arr2 = process_file("file2")

p arr1, arr2
###########################################################################

==> [["ref", "qty"], ["A", "10"], ["B", "20"], ["C", "30"], ["E", "5"]]
==> [["ref", "price"], ["A", "100"], ["D", "150"], ["C", "200"], ["B",
"75"]]

----
So I have 2 arrays which is great but I'm not sure on how to go about
merging all this since data is sometimes missing from file1, some other
times from file2. Of course those 2 files are more complex (and
bigger!), and I'd like to setup a mechanism that would allow me to
select which columns to filter and merge, but that's the basic idea
anyway.

Thanks for your input and direction.
 
R

Robert Klemme

Hello all,

I write scripts for fun and am still a noob.

I'd like to use ruby to solve a problem for me but find it difficult
this time so I thought I'd ask for help.

It's very basic in fact.

I have 2 csv files I'd like to merge such as :

File1 File2
----- -----
ref;qty ref;price
A;10 A;100
B;20 D;150
C;30 C;200
E;5 B;75

outputs to File3 =>

File3
-----
ref;qty;price;total
A;10;100;1000
B;20;75;1500
C;30;200;6000
D;"missing_data";150;"missing_data"
E;5;"missing_data";"missing_data"

Here is the code I have so far:

###########################################################################
def process_file(file)
processed = []
opened = File.open(file, "r").readlines

opened.each {|line| processed << line.strip.split(";")}
return processed
end

arr1 = process_file("file1")
arr2 = process_file("file2")

p arr1, arr2
###########################################################################

==> [["ref", "qty"], ["A", "10"], ["B", "20"], ["C", "30"], ["E", "5"]]
==> [["ref", "price"], ["A", "100"], ["D", "150"], ["C", "200"], ["B",
"75"]]

----
So I have 2 arrays which is great but I'm not sure on how to go about
merging all this since data is sometimes missing from file1, some other
times from file2. Of course those 2 files are more complex (and
bigger!), and I'd like to setup a mechanism that would allow me to
select which columns to filter and merge, but that's the basic idea
anyway.

Thanks for your input and direction.

I'd say, you should look into classes Hash and Set. If you have both
files in a Hash each you can do

keys = f1.keys.to_set | f2.keys

Then you can do

keys.each do |key|
# fetch from f1 and f2
end

Kind regards

robert
 
J

James Edward Gray II

Hello all,
Hello.

I write scripts for fun and am still a noob.

I don't think I wrote code as good as you when I was new. ;)
I have 2 csv files I'd like to merge such as :

File1 File2
----- -----
ref;qty ref;price
A;10 A;100
B;20 D;150
C;30 C;200
E;5 B;75

outputs to File3 =>

File3
-----
ref;qty;price;total
A;10;100;1000
B;20;75;1500
C;30;200;6000
D;"missing_data";150;"missing_data"
E;5;"missing_data";"missing_data"
So I have 2 arrays which is great but I'm not sure on how to go about
merging all this since data is sometimes missing from file1, some
other
times from file2.

It's a tricky question.

There are a few ways I can't of to do things.

First, you could load one giant Hash (indexed by ref) of Arrays or
Hashes with the data as you read it, then just dump the Hash to disk.
This is the easiest, but it requires you to have all data in memory at
once. Are the files small enough that you can do that?

James Edward Gray II
 
G

Gyepi SAM

Hi Benjamin,

This can be done in ruby, and my inline responses will hopefully point you in
the right direction. However, depending on the size of the data or the amount
of available memory, doing this in ruby may not be the best choice.
If I were doing this, I would probably import both files into a database
like sqlite3, using a script, of course, and generate the output with sql.
The front end script would accept parameters for the input files, the columns
to output (or a list of columns to suppress, if any) and produce the desired
results.

Rather than using generating two arrays, I would read one file
into a hash, keyed by the common key, 'ref', then read the second file
and match up its record, if any, in in other file, and output the combined
set. You'd want to keep track of which lines from file1 have a match so you
can output all the non-matching values. However, I would strongly suggest
that rather than doing the work yourself, generate code to script sqlite
or other database system.

HTH

-Gyepi
 
B

Benjamin Thomas

James said:
First, you could load one giant Hash (indexed by ref) of Arrays or
Hashes with the data as you read it, then just dump the Hash to disk.
This is the easiest, but it requires you to have all data in memory at
once. Are the files small enough that you can do that?

Yes, I don't think that would be a problem. This is what I have so far :

#############################################################################
def process_file(file)
header = []
processed = []
opened = File.open(file, "r").readlines # store in
array -- index=line

opened.each {|line| processed << line.strip.split(";")} # remove
separator and "\n"
header = processed.shift
return header, processed # the array
is clean
end

def build_hash(header, arr, col_pos)
hash = {}
arr.size.times {|counter|
hash["#{arr[counter][0]}_#{header[col_pos]}"]=arr[counter][col_pos] }
return hash
end

hdr1, arr1 = process_file("file1")
hdr2, arr2 = process_file("file2")

p hdr1, arr1
puts
p hdr2, arr2

hash1 = build_hash(hdr1, arr1, 0)
hash2 = build_hash(hdr1, arr1, 1)

hash3 = build_hash(hdr2, arr2, 0)
hash4 = build_hash(hdr2, arr2, 1)
puts

p hash1, hash2
puts
p hash3, hash4

puts
hash = {}
merge1 = hash1.merge(hash2)
merge2 = hash3.merge(hash4)
hash = merge1.merge(merge2)

p hdr1 + hdr2
p hash
#########################################################################
Outputs to :
["ref", "qty"]
[["A", "10"], ["B", "20"], ["C", "30"], ["E", "5"]]

["ref", "price"]
[["A", "100"], ["D", "150"], ["C", "200"], ["B", "75"]]

{"A_ref"=>"A", "C_ref"=>"C", "B_ref"=>"B", "E_ref"=>"E"}
{"A_qty"=>"10", "C_qty"=>"30", "B_qty"=>"20", "E_qty"=>"5"}

{"A_ref"=>"A", "C_ref"=>"C", "B_ref"=>"B", "D_ref"=>"D"}
{"B_price"=>"75", "A_price"=>"100", "D_price"=>"150", "C_price"=>"200"}

["ref", "qty", "ref", "price"]
{"B_price"=>"75", "A_ref"=>"A", "A_price"=>"100", "C_ref"=>"C",
"D_ref"=>"D", "B_ref"=>"B", "A_qty"=>"10", "D_price"=>"150",
"C_price"=>"200", "C_qty"=>"30", "E_ref"=>"E", "B_qty"=>"20",
"E_qty"=>"5"}
############################################################################
Robert said:
I'd say, you should look into classes Hash and Set. If you have both
files in a Hash each you can do...

Thanks, I will look into "Set" which I don't know about.

Gyepi said:
However, I would strongly suggest that rather than doing the work yourself, > generate code to script sqlite or other database system.

Yes, I have thought about that. I would guess it would be the most
efficient solution in terms of ressources and speed. However I don't
mind taking the time and making an effort to understand data
manipulation.
Rather than using generating two arrays, I would read one file
into a hash, keyed by the common key, 'ref', then read the second file
and match up its record, if any, in in other file, and output the
combined

Is what I've done similar to what you had in mind ?

Thanks all for the input.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Staff online

Members online

Forum statistics

Threads
473,769
Messages
2,569,582
Members
45,071
Latest member
MetabolicSolutionsKeto

Latest Threads

Top