merging partial, incomplete data files

Discussion in 'Ruby' started by Benjamin Thomas, Oct 21, 2009.

  1. Hello all,

    I write scripts for fun and am still a noob.

    I'd like to use ruby to solve a problem for me but find it difficult
    this time so I thought I'd ask for help.

    It's very basic in fact.

    I have 2 csv files I'd like to merge such as :

    File1 File2
    ----- -----
    ref;qty ref;price
    A;10 A;100
    B;20 D;150
    C;30 C;200
    E;5 B;75

    outputs to File3 =>

    File3
    -----
    ref;qty;price;total
    A;10;100;1000
    B;20;75;1500
    C;30;200;6000
    D;"missing_data";150;"missing_data"
    E;5;"missing_data";"missing_data"

    Here is the code I have so far:

    ###########################################################################
    def process_file(file)
    processed = []
    opened = File.open(file, "r").readlines

    opened.each {|line| processed << line.strip.split(";")}
    return processed
    end

    arr1 = process_file("file1")
    arr2 = process_file("file2")

    p arr1, arr2
    ###########################################################################

    ==> [["ref", "qty"], ["A", "10"], ["B", "20"], ["C", "30"], ["E", "5"]]
    ==> [["ref", "price"], ["A", "100"], ["D", "150"], ["C", "200"], ["B",
    "75"]]

    ----
    So I have 2 arrays which is great but I'm not sure on how to go about
    merging all this since data is sometimes missing from file1, some other
    times from file2. Of course those 2 files are more complex (and
    bigger!), and I'd like to setup a mechanism that would allow me to
    select which columns to filter and merge, but that's the basic idea
    anyway.

    Thanks for your input and direction.
    --
    Posted via http://www.ruby-forum.com/.
    Benjamin Thomas, Oct 21, 2009
    #1
    1. Advertising

  2. On 10/21/2009 11:20 PM, Benjamin Thomas wrote:
    > Hello all,
    >
    > I write scripts for fun and am still a noob.
    >
    > I'd like to use ruby to solve a problem for me but find it difficult
    > this time so I thought I'd ask for help.
    >
    > It's very basic in fact.
    >
    > I have 2 csv files I'd like to merge such as :
    >
    > File1 File2
    > ----- -----
    > ref;qty ref;price
    > A;10 A;100
    > B;20 D;150
    > C;30 C;200
    > E;5 B;75
    >
    > outputs to File3 =>
    >
    > File3
    > -----
    > ref;qty;price;total
    > A;10;100;1000
    > B;20;75;1500
    > C;30;200;6000
    > D;"missing_data";150;"missing_data"
    > E;5;"missing_data";"missing_data"
    >
    > Here is the code I have so far:
    >
    > ###########################################################################
    > def process_file(file)
    > processed = []
    > opened = File.open(file, "r").readlines
    >
    > opened.each {|line| processed << line.strip.split(";")}
    > return processed
    > end
    >
    > arr1 = process_file("file1")
    > arr2 = process_file("file2")
    >
    > p arr1, arr2
    > ###########################################################################
    >
    > ==> [["ref", "qty"], ["A", "10"], ["B", "20"], ["C", "30"], ["E", "5"]]
    > ==> [["ref", "price"], ["A", "100"], ["D", "150"], ["C", "200"], ["B",
    > "75"]]
    >
    > ----
    > So I have 2 arrays which is great but I'm not sure on how to go about
    > merging all this since data is sometimes missing from file1, some other
    > times from file2. Of course those 2 files are more complex (and
    > bigger!), and I'd like to setup a mechanism that would allow me to
    > select which columns to filter and merge, but that's the basic idea
    > anyway.
    >
    > Thanks for your input and direction.


    I'd say, you should look into classes Hash and Set. If you have both
    files in a Hash each you can do

    keys = f1.keys.to_set | f2.keys

    Then you can do

    keys.each do |key|
    # fetch from f1 and f2
    end

    Kind regards

    robert

    --
    remember.guy do |as, often| as.you_can - without end
    http://blog.rubybestpractices.com/
    Robert Klemme, Oct 21, 2009
    #2
    1. Advertising

  3. On Oct 21, 2009, at 4:20 PM, Benjamin Thomas wrote:

    > Hello all,


    Hello.

    > I write scripts for fun and am still a noob.


    I don't think I wrote code as good as you when I was new. ;)

    > I have 2 csv files I'd like to merge such as :
    >
    > File1 File2
    > ----- -----
    > ref;qty ref;price
    > A;10 A;100
    > B;20 D;150
    > C;30 C;200
    > E;5 B;75
    >
    > outputs to File3 =>
    >
    > File3
    > -----
    > ref;qty;price;total
    > A;10;100;1000
    > B;20;75;1500
    > C;30;200;6000
    > D;"missing_data";150;"missing_data"
    > E;5;"missing_data";"missing_data"


    > So I have 2 arrays which is great but I'm not sure on how to go about
    > merging all this since data is sometimes missing from file1, some
    > other
    > times from file2.


    It's a tricky question.

    There are a few ways I can't of to do things.

    First, you could load one giant Hash (indexed by ref) of Arrays or
    Hashes with the data as you read it, then just dump the Hash to disk.
    This is the easiest, but it requires you to have all data in memory at
    once. Are the files small enough that you can do that?

    James Edward Gray II
    James Edward Gray II, Oct 21, 2009
    #3
  4. Benjamin Thomas

    Gyepi SAM Guest

    Hi Benjamin,

    This can be done in ruby, and my inline responses will hopefully point you in
    the right direction. However, depending on the size of the data or the amount
    of available memory, doing this in ruby may not be the best choice.
    If I were doing this, I would probably import both files into a database
    like sqlite3, using a script, of course, and generate the output with sql.
    The front end script would accept parameters for the input files, the columns
    to output (or a list of columns to suppress, if any) and produce the desired
    results.

    Rather than using generating two arrays, I would read one file
    into a hash, keyed by the common key, 'ref', then read the second file
    and match up its record, if any, in in other file, and output the combined
    set. You'd want to keep track of which lines from file1 have a match so you
    can output all the non-matching values. However, I would strongly suggest
    that rather than doing the work yourself, generate code to script sqlite
    or other database system.

    HTH

    -Gyepi

    On Thu, Oct 22, 2009 at 06:20:24AM +0900, Benjamin Thomas wrote:
    > I have 2 csv files I'd like to merge such as :
    >
    > File1 File2
    > ----- -----
    > ref;qty ref;price
    > A;10 A;100
    > B;20 D;150
    > C;30 C;200
    > E;5 B;75
    >
    > outputs to File3 =>
    >
    > File3
    > -----
    > ref;qty;price;total
    > A;10;100;1000
    > B;20;75;1500
    > C;30;200;6000
    > D;"missing_data";150;"missing_data"
    > E;5;"missing_data";"missing_data"
    >
    > Here is the code I have so far:
    >
    > ###########################################################################
    > def process_file(file)
    > processed = []
    > opened = File.open(file, "r").readlines
    >
    > opened.each {|line| processed << line.strip.split(";")}
    > return processed
    > end
    >
    > arr1 = process_file("file1")
    > arr2 = process_file("file2")
    >
    > p arr1, arr2
    > ###########################################################################
    >
    > ==> [["ref", "qty"], ["A", "10"], ["B", "20"], ["C", "30"], ["E", "5"]]
    > ==> [["ref", "price"], ["A", "100"], ["D", "150"], ["C", "200"], ["B",
    > "75"]]
    >
    > ----
    > So I have 2 arrays which is great but I'm not sure on how to go about
    > merging all this since data is sometimes missing from file1, some other
    > times from file2. Of course those 2 files are more complex (and
    > bigger!), and I'd like to setup a mechanism that would allow me to
    > select which columns to filter and merge, but that's the basic idea
    > anyway.
    Gyepi SAM, Oct 21, 2009
    #4
  5. James Edward Gray II wrote:
    > First, you could load one giant Hash (indexed by ref) of Arrays or
    > Hashes with the data as you read it, then just dump the Hash to disk.
    > This is the easiest, but it requires you to have all data in memory at
    > once. Are the files small enough that you can do that?


    Yes, I don't think that would be a problem. This is what I have so far :

    #############################################################################
    def process_file(file)
    header = []
    processed = []
    opened = File.open(file, "r").readlines # store in
    array -- index=line

    opened.each {|line| processed << line.strip.split(";")} # remove
    separator and "\n"
    header = processed.shift
    return header, processed # the array
    is clean
    end

    def build_hash(header, arr, col_pos)
    hash = {}
    arr.size.times {|counter|
    hash["#{arr[counter][0]}_#{header[col_pos]}"]=arr[counter][col_pos] }
    return hash
    end

    hdr1, arr1 = process_file("file1")
    hdr2, arr2 = process_file("file2")

    p hdr1, arr1
    puts
    p hdr2, arr2

    hash1 = build_hash(hdr1, arr1, 0)
    hash2 = build_hash(hdr1, arr1, 1)

    hash3 = build_hash(hdr2, arr2, 0)
    hash4 = build_hash(hdr2, arr2, 1)
    puts

    p hash1, hash2
    puts
    p hash3, hash4

    puts
    hash = {}
    merge1 = hash1.merge(hash2)
    merge2 = hash3.merge(hash4)
    hash = merge1.merge(merge2)

    p hdr1 + hdr2
    p hash
    #########################################################################
    Outputs to :
    ["ref", "qty"]
    [["A", "10"], ["B", "20"], ["C", "30"], ["E", "5"]]

    ["ref", "price"]
    [["A", "100"], ["D", "150"], ["C", "200"], ["B", "75"]]

    {"A_ref"=>"A", "C_ref"=>"C", "B_ref"=>"B", "E_ref"=>"E"}
    {"A_qty"=>"10", "C_qty"=>"30", "B_qty"=>"20", "E_qty"=>"5"}

    {"A_ref"=>"A", "C_ref"=>"C", "B_ref"=>"B", "D_ref"=>"D"}
    {"B_price"=>"75", "A_price"=>"100", "D_price"=>"150", "C_price"=>"200"}

    ["ref", "qty", "ref", "price"]
    {"B_price"=>"75", "A_ref"=>"A", "A_price"=>"100", "C_ref"=>"C",
    "D_ref"=>"D", "B_ref"=>"B", "A_qty"=>"10", "D_price"=>"150",
    "C_price"=>"200", "C_qty"=>"30", "E_ref"=>"E", "B_qty"=>"20",
    "E_qty"=>"5"}
    ############################################################################
    Robert Klemme wrote:
    > I'd say, you should look into classes Hash and Set. If you have both
    > files in a Hash each you can do...


    Thanks, I will look into "Set" which I don't know about.

    Gyepi SAM wrote:
    > However, I would strongly suggest that rather than doing the work yourself, > generate code to script sqlite or other database system.


    Yes, I have thought about that. I would guess it would be the most
    efficient solution in terms of ressources and speed. However I don't
    mind taking the time and making an effort to understand data
    manipulation.

    > Rather than using generating two arrays, I would read one file
    > into a hash, keyed by the common key, 'ref', then read the second file
    > and match up its record, if any, in in other file, and output the
    > combined


    Is what I've done similar to what you had in mind ?

    Thanks all for the input.
    --
    Posted via http://www.ruby-forum.com/.
    Benjamin Thomas, Oct 22, 2009
    #5
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Billy
    Replies:
    2
    Views:
    490
    Billy
    Feb 1, 2006
  2. Joe
    Replies:
    0
    Views:
    393
  3. Peter A. Schott
    Replies:
    1
    Views:
    470
    Martin Franklin
    Jan 18, 2006
  4. Thomas Heller
    Replies:
    13
    Views:
    842
    Michele Simionato
    Feb 8, 2007
  5. J. Clifford Dyer

    Re: Partial 1.0 - Partial classes for Python

    J. Clifford Dyer, Feb 8, 2007, in forum: Python
    Replies:
    0
    Views:
    509
    J. Clifford Dyer
    Feb 8, 2007
Loading...

Share This Page