alphanumeric data clean up

Charles L. Snyder · Oct 25, 2008

Hi

I have a .csv file with terrible mixed text and numeric data:

Case ID,DATE,SERIAL,NAME,Gender,Code1,Code2
A822DEA04BA7814B011074A096C687B7,2008-09-05
10:28:38.47,1013726,Smith,Male,550.92 RED INGUINAL HERNIA,49500
Inguinal Hernia Reducible >6 Months < 5 Years
BEC7E6B8B4F0C53544BA9D2524A3746B,2008-09-04
10:29:47.81,1035331,Bailey,Male,752.51 UNDESCENDED TESTIS,54692 LAPS
ORCHIOPEXY F/INTRA-ABDL TSTIS
.....hundreds more lines...

I need to extract ->
date as: mo/day/Year
Code1 as: numeric part only
Code2 as: numeric part only
eg.,

DATE,SERIAL,Code1,Code2
9/5/2008,1013726,550.92,49500
9/4/2008,1035331,752.51,54692

I have tried
f = File.open("datadump.csv", "r")
v = File.open("results.txt", "w")

f.each_line do |z|
myarray=[]
myarray=z.chomp.split(",")
myarray.each do |g|
g = g.gsub(/\D/,'')
puts g
end
end

.... no good
(and it doesn't address the date conversion)..

Any help appreciated!

clsnyder

Craig Demyanovich · Oct 25, 2008

[Note: parts of this message were removed to make it a legal post.]

FasterCSV [ http://fastercsv.rubyforge.org/ ] should help you tokenize data
from the file. Once you have that, you should be able to munge the values as
you see fit.

Regards,
Craig

James Gray · Oct 25, 2008

DATE,SERIAL,Code1,Code2
9/5/2008,1013726,550.92,49500
9/4/2008,1035331,752.51,54692

How's this:

$ cat in.csv
Case ID,DATE,SERIAL,NAME,Gender,Code1,Code2
A822DEA04BA7814B011074A096C687B7,2008-09-05
10:28:38.47,1013726,Smith,Male,550.92 RED INGUINAL HERNIA,49500
Inguinal Hernia Reducible >6 Months < 5 Years
BEC7E6B8B4F0C53544BA9D2524A3746B,2008-09-04
10:29:47.81,1035331,Bailey,Male,752.51 UNDESCENDED TESTIS,54692 LAPS
ORCHIOPEXY F/INTRA-ABDL TSTIS
Galactica:~/Desktop$ cat filter_csv.rb
#!/usr/bin/env ruby -wKU

require "date"

require "rubygems"
require "faster_csv"

FCSV.filter

headers => true, :return_headers => true) do |row|
# remove unwanted columns
row.headers.each do |h|
row.delete(h) unless %w[DATE SERIAL Code1 Code2].include? h
end

unless row.header_row?
# convert the DATE field
row["DATE"] = Date.strptime(row["DATE"]).strftime("%m/%d/%Y")
# convert CodeN fields
1.upto(2) do |n|
f = "Code#{n}"
row[f] = row[f].to_s[/\d+(?:\.\d+)?/]
end
end
end
Galactica:~/Desktop$ ruby filter_csv.rb in.csv > out.csv
Galactica:~/Desktop$ cat out.csv
DATE,SERIAL,Code1,Code2
09/05/2008,1013726,550.92,49500
09/04/2008,1035331,752.51,54692

Hope that helps.

James Edward Gray II

William James · Oct 26, 2008

Charles said:
Case ID,DATE,SERIAL,NAME,Gender,Code1,Code2
A822DEA04BA7814B011074A096C687B7,2008-09-05
10:28:38.47,1013726,Smith,Male,550.92 RED INGUINAL HERNIA,49500
Inguinal Hernia Reducible >6 Months < 5 Years
BEC7E6B8B4F0C53544BA9D2524A3746B,2008-09-04
10:29:47.81,1035331,Bailey,Male,752.51 UNDESCENDED TESTIS,54692 LAPS
ORCHIOPEXY F/INTRA-ABDL TSTIS

DATA.each_line{|line|
ary = line.split(',').values_at(1,2,5,6).
map{|s| s[ /^\S*/ ] }
ary[0] = ary[0].split("-").map{|s| s.to_i}.
values_at(1,2,0).join("/")
puts ary.join(",")
}

__END__
A822DEA04BA7814B011074A096C687B7,2008-09-05
10:28:38.47,1013726,Smith,Male,550.92 RED INGUINAL HERNIA,49500
Inguinal Hernia Reducible >6 Months < 5 Years
BEC7E6B8B4F0C53544BA9D2524A3746B,2008-09-04
10:29:47.81,1035331,Bailey,Male,752.51 UNDESCENDED TESTIS,54692 LAPS
ORCHIOPEXY F/INTRA-ABDL TSTIS

--- output ---
9/5/2008,1013726,550.92,49500
9/4/2008,1035331,752.51,54692

Charles L. Snyder · Oct 28, 2008

Those work perfectly

Thanks again for the help!

cls

alphanumeric data clean up

Charles L. Snyder

Craig Demyanovich

James Gray

William James

Charles L. Snyder

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads