File Merge help request from Newbie

Snoopy Dog

First let me say that I am an absolute Newbie to Ruby. So please be
tolerant of my newbie question.

My situation is this. I am gathering financial data, and am about to
change data suppliers. I want to "merge" the files from both suppliers
to have as much data history as possible. I have the data in ASCII
format in a comma delimited file.

I have the data in the following structure:
c:\data\1Original\abc.csv - a new data file
c:\data\2Processed\abc.csv - the historical file and my processing
reference
Each file has the same file structure of:
Symbol, Date, Open, High, Low, Close, Volume

I already have a process that references the files in the
c:\data\processed directory structure.

Currently I have figured out how to walk the directory tree and copy any
NEW files into the Processed directory. I am hung up on the merging of
the files into the processed directory.

Sample files to demonstrate:
c:\data\1Original\abc.csv (new data)
abc, 20060901, 1.5, 2.1, 1.4, 1.9, 123456
abc, 20060902. 1.9, 2.3, 1.8, 2.3, 147454

c:\data\2Processed\abc.csv (historical)
abc, 20010101, 2.1, 2.5, 2.0, 2.45, 254677
abc, 20010102. 2.4, 2.6, 2.4, 2.5, 333444
.......
abc, 20060901, 1.5, 2.1, 1.4, 1.9, 123456

I need to create
c:\data\2Processed\abc.csv (historical)
abc, 20010101, 2.1, 2.5, 2.0, 2.45, 254677
abc, 20010102. 2.4, 2.6, 2.4, 2.5, 333444
.......
abc, 20060901, 1.5, 2.1, 1.4, 1.9, 123456
abc, 20060902. 1.9, 2.3, 1.8, 2.3, 147454

So, I am stuck on how to read the files in and merge them.
Here is my thought process:
1. Read the files into arrays (of rows)
2. Check the dates of the rows
3. Output the early dates from the historical file
4. Output the common data from either file (probably historical as
already in it)
5. Output new data from new file

So, the code I have so far is this...

puts 'start'
require 'find'
require 'ftools'

dir1original = 'c:/Data/1Original/'
dir2processed = 'c:/Data/2Processed/'

puts 'Here'
Find.find(dir1original) { |path| puts path}

Find.find(dir1original) do |path|
  puts 'The current item is ' + path
  if File.file? path
    puts path + ' is a file'
  end
end

puts 'create log files'
# Set up Log files and Specific output files
runlogfile = 'c:/Data/runlog.txt'
open(runlogfile, "w") { |f| f << "Runlog of StepOneIncrement\n"}
puts 'Created runlog file'
open('c:/Data/Exist1not2.txt', "w") {|f| f << "List of files from Original not in Processed\n"}
puts 'Created Exist1not2'
open('c:/Data/Exist2not1.txt', "w") {|f| f << "List of files from Processed not in Original\n"}
puts 'Created Exist2not1'




# Walk the Original Directory Tree and check for files and matches
Find.find(dir1original) do |path|
  if File.file? path
    second = path.gsub(dir1original,dir2processed)
    if File.file? second
      puts 'Found'
      if File.size(path) != File.size(second)
        puts 'Not same size'
        #Now we will have to look at the data
        puts open(path) { |f| f.read(20)}
        puts open(second) { |f| f.read(20)}
        #search out parsdate for possibly parsing the date data

        #need help here on
        # read files into an array
        # date based calculations
        # merging the files
      else
        puts 'Complete Match'
        # if file.cmp(path, second)
      end
    else
      filename = path.gsub(dir1original, '')
      puts filename + ' Not Found'
      # an alternate method to get the file name
      puts File.basename(path) + ' Not Found'
      puts File.basename(path, ".csv") + ' Not Found'
      open('c:/Data/Exist1not2.txt', "a") {|f| f << filename +"\n"}
      File.copy(path,second)
    end
  end
end

So, some help on the arrays would be GREATLY Appreciated.

Snoopy
 
Pit Capitain

Snoopy said:
(... merging data of two files ...)

Snoopy, if your data is as well structured as you've shown (minus a few
typos), merging the data of two files should simply be:

data = File.readlines(path)
additional_data = File.readlines(second)

data.concat(additional_data)
data.uniq!
data.sort!

# write data to destination file
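
A minimal sketch of the whole round trip, including the write step left as a comment above. The helper name merge_files and the sample rows are made up for illustration, and it assumes rows are formatted identically in both files, since uniq! only drops lines that match exactly.

```ruby
require 'tmpdir'

# Hypothetical helper: merge the lines of new_path into hist_path,
# dropping exact duplicate lines and sorting the result.
def merge_files(new_path, hist_path)
  data = File.readlines(hist_path)
  data.concat(File.readlines(new_path))
  data.uniq!
  data.sort!
  # Each line still carries its own "\n", so join them unchanged
  File.open(hist_path, 'w') { |f| f.write(data.join) }
  data
end

# Demonstration with throwaway files and made-up rows
merged = Dir.mktmpdir do |dir|
  new_path  = File.join(dir, 'new.csv')
  hist_path = File.join(dir, 'hist.csv')
  File.write(hist_path, "abc, 20010101, 2.1, 2.5, 2.0, 2.45, 254677\n" \
                        "abc, 20060901, 1.5, 2.1, 1.4, 1.9, 123456\n")
  File.write(new_path,  "abc, 20060901, 1.5, 2.1, 1.4, 1.9, 123456\n" \
                        "abc, 20060902, 1.9, 2.3, 1.8, 2.3, 147454\n")
  merge_files(new_path, hist_path)
end

puts merged
```

Sorting the raw lines gives date order only because every row starts with the same symbol and the dates are written as YYYYMMDD.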

Regards,
Pit
 
William James

Snoopy said:
(... merging data of two files ...)

dir1 = '1-Original/'
dir2 = '2-Processed/'

# Read a CSV file into an array of rows (arrays of fields); [] if missing
def file_to_a(s)
  File.exist?(s) ?
    IO.readlines(s).map { |x| x.chomp.split(/\s*,\s*/) } : []
end

Dir[dir1 + "*.csv"].each { |full_name|
  bare_name = full_name[ %r{[^/]*$} ]
  # Array#| is a set union, so rows present in both files appear once
  ary = file_to_a(full_name) | file_to_a(dir2 + bare_name)
  File.open(dir2 + bare_name, 'w') { |f|
    f.puts ary.sort.map { |x| x.join(", ") } }
}
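
The | operator used above is Array's set union: it joins both arrays and drops duplicate elements. Because each row is first split on /\s*,\s*/, rows that differ only in the spacing around the commas count as duplicates. A tiny illustration with made-up rows (hist_rows, new_rows, and union are names chosen for this sketch):

```ruby
# Made-up rows; note the missing space after the second comma in one of them
hist_rows = ["abc, 20060901, 1.5", "abc, 20060902,1.9"].map { |x| x.split(/\s*,\s*/) }
new_rows  = ["abc, 20060902, 1.9", "abc, 20060903, 2.0"].map { |x| x.split(/\s*,\s*/) }

# Set union: the 20060902 row is kept only once
union = hist_rows | new_rows

puts union.sort.map { |x| x.join(", ") }
```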
 
Snoopy Dog

Pit said:
Snoopy, if your data is as well structured as you've shown (minus a few
typos), merging the data of two files should simply be:

data = File.readlines(path)
additional_data = File.readlines(second)

data.concat(additional_data)
data.uniq!
data.sort!

# write data to destination file

Regards,
Pit

Thank You Pit.

I implemented this in minutes following your example.

The data is very well structured; unfortunately, it is not
identical between suppliers. Sometimes there are differences in the
price or volume fields.

So now I get unique records, but some have the same date. Since I use
the data as a time/price series, I can't have duplicate dates.
Unfortunately I don't have time to work on this tonight, but will keep
at it.

Thanks.
Snoopy
 
Snoopy Dog

William said:
(snip)

Thanks William James.

As I am new to Ruby, and regular expressions, I implemented Pit's
method. Since I have some issues with the data (values, not structure),
both yours and Pit's methods have the same data issue (DATE not
unique).

I think I will be able to use your regular expressions to split up my
incoming data and find a way to use the unique feature on the date
value.

Will post my results, when I get a chance to work on it.

Thanks again.
Snoopy
 
Snoopy Dog

Paul said:
Snoopy Dog wrote: snip snip

Please do this. Just show the example data, and say what result you
want, and in what form. Be specific. Someone will then solve the problem
on their own, which Ruby allows us to do faster than by analyzing your
code.

Paul,

I didn't really want someone else to "solve" it for me. I want to
learn. I thought by laying out the problem, sample input data, and
showing the desired result, and my thought process on how to get there,
the good folks (like yourself) on the forum could point me in the right
direction (which they have).

As I am still learning Ruby, I need to learn the constructs of the
language, and how to use them. Both forum suggestions have greatly
aided my project and understanding of Ruby. Now I am thinking about
some new ways to approach the DATE issue.

So a new question here: How do I find out what methods exist for an
object?

eg: For the File object, how can I find out about the open, file?,
close, and other methods (as well as their syntax).
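
For the methods question, Ruby objects and classes can list their own methods, and the ri command-line tool shows the documentation for any of them (for example, ri File.open). A small sketch; the variable names are just for illustration, and on current Ruby versions the method names come back as symbols:

```ruby
# Class-level methods, such as File.open and File.file?
class_methods = File.methods.sort

# Instance methods defined on File itself (not inherited from IO)
own_instance_methods = File.instance_methods(false).sort

# Narrow a long list down with grep
puts class_methods.grep(/open|file\?/).inspect

# Ask directly whether a method is available
puts File.respond_to?(:file?)
puts File.instance_methods.include?(:close)
```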

Also what references should I be looking at other than this forum?

Thanks in advance.
 
Sam Gentle

As I am new to Ruby, and regular expressions, I implemented Pit's
method. Since I have some issues with the data (values, not structure),
both yours and Pit's methods have the same data issue (DATE not
unique).

I think I will be able to use your regular expressions to split up my
incoming data and find a way to use the unique feature on the date
value.

I'd do something like this:

require 'enumerator'

data = File.readlines(path)
additional_data = File.readlines(second)
data.concat(additional_data)

data.sort!
mergeddata = [data[0]]
data.each_cons(2) { |x1, x2|
  mergeddata.push(x2) unless x1.split(/,/)[1] == x2.split(/,/)[1]
}

Sam
 
Snoopy Dog

Sam Gentle wrote:
snip
I'd do something like this:

require 'enumerator'

data = File.readlines(path)
additional_data = File.readlines(second)
data.concat(additional_data)

data.sort!
mergeddata = [data[0]]
data.each_cons(2) { |x1, x2|
  mergeddata.push(x2) unless x1.split(/,/)[1] == x2.split(/,/)[1]
}

Sam

Sam, that works great. Unfortunately I still don't understand it all
yet.

mergeddata = [data[0]] - creates a new array from data, but why do we
need the subscript??

data.each_cons(2)... - what does the each_cons(2) do... I understand
the split, but don't know the .push(x2)

I will go do some more reading to see if I can figure them out.

Thanks
Snoopy
 
Snoopy Dog

Snoopy said:
Sam Gentle wrote:
snip ... snip snip

I will go do some more reading to see if I can figure them out.

Thanks
Snoopy

OK, I think I have figured out most of it... Amazing how the
documentation can help out when you see the code in action.

mergeddata = [data[0]] - creates a new array with just one element in it!

the mergeddata.push - pushes elements on to the array (appends).

and the data.each_cons(2) - I ASSUME that this takes two elements at a
time from the data array.

How it stays in sync I don't know.

This looks like it will do everything that I need. I will do a bit more
testing and then throw it on to the live data.

THANKS
Snoopy
 
Pit Capitain

Snoopy said:
Sam said:
require 'enumerator'

data = File.readlines(path)
additional_data = File.readlines(second)
data.concat(additional_data)

data.sort!
mergeddata = [data[0]]
data.each_cons(2) { |x1, x2|
  mergeddata.push(x2) unless x1.split(/,/)[1] == x2.split(/,/)[1]
}

Sam, that works great.

Snoopy, you mentioned in one of your posts that the suppliers might
deliver inconsistent data, for example different volumes for the same
date, and Sam's solution guarantees that you get only one row for each
date. You should be aware, though, that it randomly chooses this one
row. For some data, it could be the row of the first file, for other
data it could be the row of the second file. If you want to prefer one
of the suppliers over the other, you have to implement a slightly
different algorithm. The problem is that Ruby's sort isn't a stable sort.
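
One way to make the choice deterministic is to tag each row with a supplier rank, sort by date first and rank second, and then keep the first row for each date. Note that with a plain sort of whole lines, two rows for the same date are ordered by the text that follows the date, so which supplier wins can vary from date to date. This is only a sketch: the sample rows and the names historical, new_data, tagged, and merged are made up, and it assumes one symbol per file with the date in the second field.

```ruby
# Made-up sample rows; the volume differs for 20060901
historical = ["abc, 20060901, 1.5, 2.1, 1.4, 1.9, 123456\n"]
new_data   = ["abc, 20060901, 1.5, 2.1, 1.4, 1.9, 100000\n",
              "abc, 20060902, 1.9, 2.3, 1.8, 2.3, 147454\n"]

# Tag rows with a rank: 0 for the preferred supplier, 1 for the other
tagged = historical.map { |row| [row, 0] } + new_data.map { |row| [row, 1] }

# Sort by date, then by rank, so the preferred row comes first for each date
sorted = tagged.sort_by { |row, rank| [row.split(',')[1].strip, rank] }

# Keep only the first row seen for each date
merged = []
last_date = nil
sorted.each do |row, _rank|
  date = row.split(',')[1].strip
  next if date == last_date
  merged << row
  last_date = date
end

puts merged
```

Swapping the two rank values would prefer the new supplier instead.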

Regards,
Pit
 
MonkeeSage

Snoopy said:
OK, I think I have figured out most of it... Amazing how the
documentation can help out when you see the code in action.

mergeddata = [data[0]] - creates a new array with just one element in it!

the mergeddata.push - pushes elements on to the array (appends).

You got it. It could also be written:

mergeddata = Array.new(1, data.at(0))
....
mergeddata << x2

(All of these are synonyms for the way it was written, which is why I
mention them.)

and the data.each_cons(2) - I ASSUME that this takes two elements at a
time from the data array.

Kind of, sort of...see the docs:
http://ruby-doc.org/core/classes/Enumerable.html#M002115
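
A small sketch of what each_cons(2) yields, using made-up rows: it slides a two-element window over the array one step at a time, so every row except the first shows up exactly once as the second element of a pair. That is why mergeddata has to start out holding data[0].

```ruby
# Made-up rows; the first two share a date
rows = ["abc, 20060901, 1.5\n",
        "abc, 20060901, 1.6\n",
        "abc, 20060902, 1.9\n"]

# The overlapping pairs: [rows[0], rows[1]] and [rows[1], rows[2]]
pairs = rows.each_cons(2).to_a

# Same idea as the merge loop: keep a row only if its date differs
# from the date of the row just before it
kept = [rows[0]]
rows.each_cons(2) do |prev, cur|
  kept.push(cur) unless prev.split(',')[1] == cur.split(',')[1]
end

puts kept
```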


As to the problem, you could also do something like this, which is a
little more verbose than the other solutions, but is (I think) easier
to understand:

data1 = File.readlines(file1) # historical
data2 = File.readlines(file2) # new

# dates are from index 5-12 in the row string
# like your example data, change as needed
dates1 = data1.collect { |row| row[5..12] }
dates2 = data2.collect { |row| row[5..12] }

i = 0
while i < dates2.size
  # drop rows from the new data whose date already exists in the historical data
  if dates1.include?(dates2[i])
    dates2.delete_at(i)
    data2.delete_at(i)
  else
    i += 1
  end
end

out_data = (data1 + data2).sort


Also, you could use a little trick with Hash; just index the rows in a
hash by their date, then when you hit a duplicate date, you'll just
overwrite the previous value indexed by that date (change the order of
reading in file1 and file2 to keep historical rows rather than new
ones, the current order keeps new rows):

hash = {}
data = File.readlines(file1) +
File.readlines(file2)
data.each { |row|
date = row[5..12]
hash[date] = row
}
data = hash.values.sort

Regards,
Jordan
 
Snoopy Dog

Pit said:
Snoopy said:
data.each_cons(2) { |x1, x2|
  mergeddata.push(x2) unless x1.split(/,/)[1] == x2.split(/,/)[1]
}

Sam, that works great.

Snoopy, you mentioned in one of your posts that the suppliers might
deliver inconsistent data, for example different volumes for the same
date, and Sam's solution guarantees that you get only one row for each
date. You should be aware, though, that it randomly chooses this one
row. For some data, it could be the row of the first file, for other
data it could be the row of the second file. If you want to prefer one
of the suppliers over the others, you have to implement a slightly
different algorithm. The problem is that Ruby's sort isn't a stable
sort.

Regards,
Pit

Pit,

Thanks for mentioning that. I assumed that the sort kept them in order,
and I used the push(x1) instead of push(x2) after a few of my tests.
That way I kept the historical data. Since my sample data tests are
small, I was just lucky not to have them out of the order I expected
them.

Now looking at Jordan's code, I think I will use (a variant) of it to
control what I keep for historical data.

Thanks again Pit.

Snoopy
 
Snoopy Dog

Jordan Callicoat wrote:
..snip snip
Also, you could use a little trick with Hash; just index the rows in a
hash by their date, then when you hit a duplicate date, you'll just
overwrite the previous value indexed by that date (change the order of
reading in file1 and file2 to keep historical rows rather than new
ones, the current order keeps new rows):

hash = {}
data = File.readlines(file1) +
File.readlines(file2)
data.each { |row|
date = row[5..12]
hash[date] = row
}
data = hash.values.sort

Regards,
Jordan

Jordan,

Thanks for the suggestion. I am implementing the hash idea you
provided. That way I can keep my historical data (for common dates) and
just grab new data for new dates.

Now just a little tweak. The symbol data is not always just 3
characters.
I am currently using a regular expression to split out the values, so I
get the date from the split.

#Using Jordan's methodology
hash = {}
data = File.readlines(path) + File.readlines(second)
data.each { |row|
  (sym, date, open, high, low, close, vol) = row.split(/,/)
  hash[date] = row
}
data = hash.values.sort
open(second, 'w') { |f| f.puts data}


Works fine, but I really don't care about the information past the date
field.
Additionally, I have another set of files that have a column after the
vol, and I am not sure how to handle it in the regular expression.

I just want to do:
(symbol, date, ignore_the_rest) = row.split(/,/) for just the first
two columns. I am off to read more on regular expressions.

Thanks
Snoopy
 
MonkeeSage

Snoopy said:
Works fine, but I really don't care about the information past the date
field.
Additionally, I have another set of files that have a column after the
vol, and I am not sure how to handle it in the regular expression.

I just want to do:
(symbol, date, ignore_the_rest) = row.split(/,/) for just the first
two columns. I am off to read more on regular expressions.

Hi there,

The split method just returns an array of every item on either side of
the delimiter...

p row.split(/,/)
# => ["abc", " 20060901", " 1.5", " 2.1", " 1.4", " 1.9", " 123456\n"]

You can assign it to a variable:

a = row.split(/,/)
a[0]

You can index it anonymously:

row.split(/,/)[0]

Or unpack some (or all) members of it:

symbol, date = row.split(/,/)[0..1] # .. is a range operator

And so on.

I think you want something like the last example, but unless you need
the symbol too, you can just use: date = row.split(/,/)[1] An extra
column at the end of the rows won't affect anything.

Regards,
Jordan
 
MonkeeSage

Actually, there's no reason to use a regexp for a delimiter here. A
string works fine:

date = row.split(',')[1]
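
If you want the "symbol, date, ignore the rest" form from the question, split also takes a limit argument that caps the number of pieces; everything after the second comma stays in the last piece. A sketch with a made-up row (the longer symbol and the extra trailing column are hypothetical):

```ruby
# Made-up row with a four-letter symbol and an extra column at the end
row = "abcd, 20060901, 1.5, 2.1, 1.4, 1.9, 123456, extra\n"

# Limit of 3: symbol, date, and one piece holding the rest of the line
symbol, date, rest = row.split(',', 3)

# The pieces keep the blank that followed each comma, so strip it
date = date.strip

puts symbol
puts date
puts rest
```

Stripping the date also makes it a safer hash key, since the key no longer depends on the spacing each supplier uses.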

Regards,
Jordan
 
Snoopy Dog

Jordan said:
Actually, there's no reason to use a regexp for a delimiter here. A
string works fine:

date = row.split(',')[1]

Regards,
Jordan

Jordan,

Thanks. That is quicker than me finding the proper references online.

Thanks to you and all the other folks on the forum who have helped out.

I am now able to do my required processing task, and hopefully have
learned enough that I will be able to implement a few more "nice to
have" tasks soon.

When I run into more stumbling (learning) blocks along the way, I will
know where to look for EXCELLENT help.

Thanks again to you and all the forum.

Snoopy
 
MonkeeSage

Snoopy said:
Thanks to you and all the other folks on the forum who have helped out.

I'm sure I speak for everyone when I say that we're glad to help out.
I've been using ruby for 3 years or so, but I still learn something new
every day (literally! especially from people like Mauricio Fernandez,
among others). Just keep hacking and keep a copy of the core docs on
your nightstand, and don't hesitate to ask when you have questions!
Welcome to the ease and beauty of ruby. :)

Regards,
Jordan
 
