Splitting a CSV file into 40,000 line chunks


Drew Olson

All -

I've written a script to split a .csv file into smaller .csv files of
40,000 lines each. The intent here is to break the file down enough
that Excel does not have issues reading each chunk. My code takes a
filename from the command line and breaks it down like so:

infile -> xyz.csv

output -> xyz_part_1.csv
xyz_part_2.csv
etc...

My code is working, but I don't find it very "rubyish". In particular,
I hate having my index and counter variables, and I don't like that I
had to declare my header variable outside of the loop. Bear in mind
that I cannot do something like "rows = CSV.open(infile)" because Ruby
will raise an error, as the input file is too big (250 MB). Any advice
on making the code nicer is appreciated. The current code is as
follows:

require 'csv'

infile = ARGV[0] if ARGV[0] != nil

counter = 1
index = 0
header = ""
writer = CSV.open(infile.gsub(/\./, "_part_" + counter.to_s + "."), 'w')

CSV.open(infile, 'r') do |row|
  if index != 0 && index % 40000 == 0
    writer.close
    counter += 1
    writer = CSV.open(infile.gsub(/\./, "_part_" + counter.to_s + "."), 'w')
    writer << header
  end
  header = row if index == 0
  writer << row
  index += 1
end

writer.close
 

Simon Strandgaard

Drew Olson wrote:
My code is working but I don't find it very "rubyish". In particular, I
hate having my index and counter counters and I don't like that I had to
declare my header variable outside of the loop. Bear in mind here that I
can not do something like "rows = CSV.open(infile)" because ruby will
yell and error as the input file is too big (250 mb). Any advice on
making the code nicer is appreciated. The current code is as follows:


File.open("test", "w+") do |f|
  20.times do |i|
    f.write "abc|def|ghi#{i}\n"
  end
end

File.open("test", "r") do |fin|
  part = 0
  while row = fin.gets
    File.open("test_part#{part}", "w+") do |f|
      5.times do
        f.write(row)
        break unless row = fin.gets
      end
    end
    part += 1
  end
end
 

Will Jessop

Paul said:
Why are you using CSV for this? You aren't parsing the lines into fields, so
the fact that they contain CSV content has no bearing on the present task.
Your goal is to split the input file into groups of lines delimited by
linefeeds, not fields delimited by commas.

Why not simply read lines from the input file and write them to a series of
output files, until the input file is exhausted?

Because the CSV library understands CSV data with embedded newlines:

[will@localhost ~]$ cat csvfile.csv
some, field,"new
line"
other,field,here
[will@localhost ~]$ cat test.rb
require 'csv'

CSV.open('csvfile.csv', 'r') do |row|
p row
end
[will@localhost ~]$ ruby test.rb
["some", " field", "new\nline"]
["other", "field", "here"]

will.
 

ChrisH

Nice, Simon.

But the 'while row = fin.gets' is skipping a row each time around, and
you don't handle the header. My "fix":

Simon Strandgaard wrote:
File.open("test", "w+") do |f|
  20.times do |i|
    f.write "abc|def|ghi#{i}\n"
  end
end

File.open("test", "r") do |fin|
  part = 0
  header = fin.gets
  row = fin.gets
  while row
    File.open("test_part#{part}", "w+") do |f|
      f.write(header)
      5.times do
        f.write(row)
        break unless row = fin.gets
      end
    end
    part += 1
  end
end


Cheers
Chris
 

ChrisH

James said:
Because the sample code is also moving the headers to the new files.

James Edward Gray II

Also, since he's feeding Excel, this is most likely on Windows rather
than *NIX.

Cheers
Chris
 

Simon Strandgaard

ChrisH wrote:
Nice Simon

But the 'while row = fin.gets' is skipping a row each time around and
you don't handle the header. My "fix":

I usually mess up in my loops like this :-(
Thanks :)
 

James Edward Gray II

Any advice on making the code nicer is appreciated.

I'll let you decide if this is any better:

Firefly:~/Desktop$ ls
gen_csv.rb split_csv.rb
Firefly:~/Desktop$ ruby gen_csv.rb
Firefly:~/Desktop$ ls
data.csv gen_csv.rb split_csv.rb
Firefly:~/Desktop$ cat data.csv
rec_no,whatever
0,"blah, blah, blah, ..."
1,"blah, blah, blah, ..."
2,"blah, blah, blah, ..."
3,"blah, blah, blah, ..."
4,"blah, blah, blah, ..."
5,"blah, blah, blah, ..."
6,"blah, blah, blah, ..."
7,"blah, blah, blah, ..."
8,"blah, blah, blah, ..."
9,"blah, blah, blah, ..."
10,"blah, blah, blah, ..."
11,"blah, blah, blah, ..."
12,"blah, blah, blah, ..."
13,"blah, blah, blah, ..."
14,"blah, blah, blah, ..."
15,"blah, blah, blah, ..."
16,"blah, blah, blah, ..."
17,"blah, blah, blah, ..."
18,"blah, blah, blah, ..."
19,"blah, blah, blah, ..."
Firefly:~/Desktop$ ruby split_csv.rb data.csv
Firefly:~/Desktop$ ls
data.csv data_part_2.csv data_part_4.csv gen_csv.rb
data_part_1.csv data_part_3.csv data_part_5.csv split_csv.rb
Firefly:~/Desktop$ cat data_part_1.csv
rec_no,whatever
0,"blah, blah, blah, ..."
1,"blah, blah, blah, ..."
2,"blah, blah, blah, ..."
3,"blah, blah, blah, ..."
Firefly:~/Desktop$ cat data_part_3.csv
rec_no,whatever
8,"blah, blah, blah, ..."
9,"blah, blah, blah, ..."
10,"blah, blah, blah, ..."
11,"blah, blah, blah, ..."
Firefly:~/Desktop$ cat split_csv.rb
#!/usr/bin/env ruby -w

require "rubygems"
require "faster_csv"

original_file = ARGV.shift
counter = 1

FCSV.open(original_file.sub(".", "_part_#{counter}."), "w") do |out|
  FCSV.open(original_file, :headers => true).each_with_index do |row, index|
    if (index % 4).zero?
      if index.nonzero?
        counter += 1
        out.reopen(original_file.sub(".", "_part_#{counter}."), "w")
      end
      out << row.headers
    end
    out << row
  end
end

__END__

James Edward Gray II
 

Drew Olson

Thanks for all the responses. As noted in a post above, I am trying to
place the headers at the beginning of each split file. Should have
mentioned that earlier. Also, regarding the above responses, I am
working on a Windows system, hence not being able to use nice shell
scripts.

Anyway, it seems that my logic was on the right track; the use of ugly
counters just made my inner rubyist upset. Thanks for all the
feedback. I will take a look at reading/writing the files using the
File object as opposed to the CSV object. In fact, I didn't even think
about that. In previous scripts I had been accessing the CSV fields
individually but, as that is unnecessary here, I will toy around with
the performance gains of using the File object.

Speaking of performance, the code I've described seems quite slow.
This is understandable given that the input file has 1 million+
records, but I was wondering if I would see a performance gain by
using File objects as opposed to CSV objects. Any other tips on
increasing the performance of this code would also be helpful.

Again, thanks for all the responses. I am continually impressed by the
attitude and intelligence of this community.
 

James Edward Gray II

Speaking of performance, using the code I've described seemed to be
quite slow. This is understandable based on the fact that the input
file has 1 million+ records, however I was wondering if I would see
performance gain by using File objects as opposed to CSV objects.
Also, any other tips on increasing the performance of this code would
be helpful.

CSV is known to be pretty slow, thus FasterCSV. ;)
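For a rough sense of that gap on a current Ruby's stdlib, a sketch like this can be used (illustrative only; file names and sizes are made up, and timings will vary by machine and data):

```ruby
require "csv"
require "benchmark"
require "tempfile"

# Build a small stand-in for the big input file.
sample = Tempfile.new(["sample", ".csv"])
10_000.times { |i| sample.puts %Q{#{i},"blah, blah, blah"} }
sample.close

Benchmark.bm(6) do |bm|
  # Plain line-by-line reading: no parsing at all.
  bm.report("gets") { File.foreach(sample.path) { |line| line } }
  # Full CSV parsing of every record.
  bm.report("csv")  { CSV.foreach(sample.path) { |row| row } }
end

# Both walks see the same 10,000 records; the difference is parsing cost.
line_count = File.foreach(sample.path).count
row_count  = CSV.foreach(sample.path).count
```

The line-reading pass typically wins by a wide margin, which is exactly the trade-off being debated in this thread: speed versus correct handling of quoted fields.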

James Edward Gray II
 

James Edward Gray II

Paul said:
A plain-text CSV file uses linefeeds as record delimiters. A program
that uses "readline" or "gets" splits the records just as a sane CSV
parser would. And IMHO a CSV file should never, ever have linefeeds
embedded in fields.

Your opinion doesn't make you right on this one. The CSV RFC clearly
defines handling for carriage-returns and linefeeds. They certainly
are allowed in fields. Here is a link to the document, in case you
want to read up:

http://www.ietf.org/rfc/rfc4180.txt

Not to use a CSV parser on this task would be shooting yourself in
the foot. The result using a simple File object would be broken and,
much worse, it might look OK for a while. You just can't be sure you
are never going to split a CSV file that has an embedded linefeed in
it (especially since that's perfectly legal), and when you do you
will be responsible for destroying data. There's just no reason for
that.

I know you're a don't-use-a-library guy and you know I disagree.
This is the reason why. The edge cases will get you every time.
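To make that failure mode concrete, Will's earlier example can be reduced to a few lines (using CSV.parse from the current stdlib):

```ruby
require "csv"

# One logical record with an embedded newline, then a plain record.
data = %Q{some,field,"new\nline"\nother,field,here\n}

# Naive line splitting cuts the quoted field in half...
naive = data.split("\n")   # 3 "records", one of them garbage
# ...while a CSV parser keeps it as one record with the newline intact.
rows = CSV.parse(data)     # 2 records
```

Split the file on the wrong physical line and the chunk boundary lands inside a quoted field, corrupting two output files at once.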

James Edward Gray II
 

Drew Olson

James -

I'll give FasterCSV a try when I get home from work and out from
behind this proxy. Here's another question: in some cases I need to
sort the file before splitting it (in this case, sorting by the 4th
cell in each row). However, the current file I'm trying to sort and
split is around 76 MB, and Ruby fails when trying to store the CSV as
an array. The code and output are below. How else can I go about this?

Code:

require 'csv'

if ARGV[0] != nil
  infile = ARGV[0]
else
  puts "Please enter a file to split"
  exit
end

puts "loading file"

rows = CSV.read(infile)

puts "file loaded"

rows.sort! { |x, y| x[3] <=> y[3] }

counter = 1
header = ""
writer = CSV.open(infile.gsub(/\./, "_part_" + counter.to_s + "."), 'w')

rows.each_index do |i|
  if i != 0 && i % 40000 == 0
    writer.close
    counter += 1
    writer = CSV.open(infile.gsub(/\./, "_part_" + counter.to_s + "."), 'w')
    writer << header
  end
  header = rows[i] if i == 0
  writer << rows[i]
end

writer.close

Output:

loading file
The exception unknown software exception (0xc00000fd) occured in the
application at location.....etc
 

James Edward Gray II

Here's another question: in some cases I need to sort the file before
splitting it (in this case sorting by the 4th cell in each row).
However, the current file I'm trying to sort and split is around 76 MB
and ruby fails when trying to store the CSV as an array. The code and
output are below. How else can I go about this?

Hmm, that's a good question.

Perhaps you can collect just the key values in an Array and then use
those to reorder the lines bit by bit. That's not going to be fast
with any library helping you, but I don't have a better idea.
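That idea can be sketched as a two-pass job: hold only [key, byte offset] pairs in memory, sort those, then seek back to copy each row out in order. The sketch below assumes no embedded newlines in the data (a CSV-aware index would have to account for them), and the file names are made up for the example:

```ruby
require "csv"

# Tiny stand-in for the 76 MB input: a header plus rows keyed by the 4th cell.
File.open("data.csv", "w") do |f|
  f.puts "name,a,b,key"
  [3, 1, 2].each { |k| f.puts "row#{k},x,y,#{k}" }
end

# Pass 1: remember only [sort_key, byte_offset] for each data row.
index = []
File.open("data.csv") do |f|
  f.gets # skip the header
  until f.eof?
    offset = f.pos
    line   = f.gets
    index << [CSV.parse_line(line)[3], offset]
  end
end

index.sort_by!(&:first)

# Pass 2: seek to each remembered offset and copy the rows out in key order.
File.open("data.csv") do |f|
  File.open("data_sorted.csv", "w") do |out|
    out.write(f.gets) # header first
    index.each do |_key, offset|
      f.seek(offset)
      out.write(f.gets)
    end
  end
end
```

The index for a million rows is just a million small pairs, which fits in memory comfortably even when the full parsed rows do not; the cost is the random seeking in pass 2.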

James Edward Gray II
 

Reid Thompson

Drew Olson wrote:
Reid -

I also have cygwin installed on my box, so if there is a simple way to
do this in Unix I'd love to know. My Unix is rusty (and that's being
generous).
rthompso@jhereg:~$ cat splitandhead
#!/bin/bash

fname=$1
split -l 4000 -a 6 -d $fname $fname

for file in `ls ${fname}0*`
do
    sed -e '1i\Col1 Col2 Col3 Col4\' $file > ${file}.tmp
    mv ${file}.tmp $file
done



rthompso@jhereg:~$ head testsed
this is a test
this is a test
this is a test
this is a test
this is a test
this is a test
this is a test
this is a test
this is a test
this is a test
rthompso@jhereg:~$ wc testsed
40140 160560 602100 testsed
rthompso@jhereg:~$ time ./splitandhead testsed

real 0m0.499s
user 0m0.140s
sys 0m0.092s
rthompso@jhereg:~$ head -2 testsed0000*
==> testsed000000 <==
Col1 Col2 Col3 Col4
this is a test

==> testsed000001 <==
Col1 Col2 Col3 Col4
this is a test

==> testsed000002 <==
Col1 Col2 Col3 Col4
this is a test

==> testsed000003 <==
Col1 Col2 Col3 Col4
this is a test

==> testsed000004 <==
Col1 Col2 Col3 Col4
this is a test

==> testsed000005 <==
Col1 Col2 Col3 Col4
this is a test

==> testsed000006 <==
Col1 Col2 Col3 Col4
this is a test

==> testsed000007 <==
Col1 Col2 Col3 Col4
this is a test

==> testsed000008 <==
Col1 Col2 Col3 Col4
this is a test

==> testsed000009 <==
Col1 Col2 Col3 Col4
this is a test

==> testsed000010 <==
Col1 Col2 Col3 Col4
this is a test
rthompso@jhereg:~$
 

Reid Thompson

Hmm, that's a good question.

Perhaps you can collect just the key values in an Array and then use
those to reorder the lines bit by bit. That's not going to be fast
with any library helping you, but I don't have a better idea.

James Edward Gray II
sort -n +4 out
or
sort +4 out

(That's sort's legacy +POS syntax for skipping four fields; with a
modern GNU sort the equivalent would be "sort -n -k5" or "sort -k5".)


rthompso@jhereg:~$ head -50 out
this is a test 10202
this is a test 23990
this is a test 11056
this is a test 9606
this is a test 28590
this is a test 18264
this is a test 12902
this is a test 12856
this is a test 27571
this is a test 5495
this is a test 15965
this is a test 22229
this is a test 18865
this is a test 31339
this is a test 21913
this is a test 406
this is a test 8602
this is a test 5329
this is a test 10048
this is a test 6458
this is a test 20069
this is a test 19771
this is a test 21844
this is a test 24719
this is a test 30894
this is a test 25239
this is a test 9900
this is a test 1727
this is a test 12042
this is a test 20832
this is a test 23735
this is a test 28768
this is a test 10283
this is a test 390
this is a test 10480
this is a test 1337
this is a test 2745
this is a test 26398
this is a test 32288
this is a test 3797
this is a test 22251
this is a test 458
this is a test 14679
this is a test 29642
this is a test 19943
this is a test 26342
this is a test 24232
rthompso@jhereg:~$ sort +4 out
this is a test 0
this is a test 10
this is a test 10
this is a test 100
this is a test 100
this is a test 100
this is a test 1000
this is a test 1000
this is a test 10001
this is a test 10001
this is a test 10005
this is a test 10006
this is a test 10006
this is a test 10007
this is a test 10007
this is a test 10008
this is a test 10008
this is a test 10009
this is a test 10009
this is a test 1001
this is a test 10010
this is a test 10011
this is a test 10012
this is a test 10012
this is a test 10012
this is a test 10012
this is a test 10012
this is a test 10013
this is a test 10014
this is a test 10014
this is a test 10015
this is a test 10017
this is a test 10017
this is a test 10017
this is a test 10018
this is a test 10019
this is a test 1002
this is a test 1002
this is a test 10021
this is a test 10026
this is a test 10026
-------------------- OR
rthompso@jhereg:~$ sort -n +4 out
this is a test 0
this is a test 2
this is a test 3
this is a test 3
this is a test 4
this is a test 7
this is a test 10
this is a test 10
this is a test 12
this is a test 13
this is a test 14
this is a test 14
this is a test 15
this is a test 16
this is a test 16
this is a test 20
this is a test 21
this is a test 24
this is a test 25
this is a test 28
this is a test 29
this is a test 29
this is a test 29
this is a test 30
this is a test 30
this is a test 31
this is a test 32
this is a test 32
this is a test 34
this is a test 35
this is a test 35
this is a test 35
this is a test 36
this is a test 36
this is a test 37
this is a test 37
this is a test 38
this is a test 38
this is a test 41
this is a test 41
this is a test 41
this is a test 41
this is a test 41
this is a test 41
this is a test 42
this is a test 42
this is a test 42
this is a test 43
this is a test 45
this is a test 46
this is a test 46
this is a test 47
this is a test 47
 

Mike Harris

Drew said:
All -

I've written a script to split a .csv file into smaller .csv files of
40,000 lines each. The intent here is to break the file down enough so
that excel does not have issues reading each chunk. My code takes a
filename from the command line and breaks it down as so:

infile -> xyz.csv

output -> xyz_part_1.csv
xyz_part_2.csv
etc...

My code is working but I don't find it very "rubyish". In particular, I
hate having my index and counter counters and I don't like that I had to
declare my header variable outside of the loop. Bear in mind here that I
can not do something like "rows = CSV.open(infile)" because ruby will
yell and error as the input file is too big (250 mb). Any advice on
making the code nicer is appreciated. The current code is as follows:

require 'csv'

infile = ARGV[0] if ARGV[0] != nil

counter = 1
index = 0
header = ""
writer = CSV.open(infile.gsub(/\./,"_part_"+counter.to_s+"."),'w')

CSV.open(infile, 'r') do |row|
if(index != 0 && index%40000 == 0)
writer.close
counter+=1
writer = CSV.open(infile.gsub(/\./,"_part_"+counter.to_s+"."),'w')
writer << header
end
if (index == 0)
header = row
end
writer << row
index += 1
end

writer.close()
I will ignore the CSV issue, not because it isn't important, but
simply because I'm not familiar with the CSV parser, and this example
sufficiently represents the concept.

For maximum elegance, I would write the code this way. It uses the
helper methods File#write_fresh, File#to_a and Enumerable#chunks (i.e.
not in the stdlib), all of which I've written at one time or another.
Mentally sub in the appropriate code as desired.

File.to_a('xyz.csv').chunks(40000).each_with_index do |chunk, i|
  File.write_fresh("xyz_part_#{i + 1}", chunk.join("\n"))
end

File.to_a returns an array of lines.
Enumerable#chunks divides an Enumerable into groups of the given size;
a 100k array would yield two 40k chunks and a 20k chunk.
File#write_fresh creates the file if it doesn't exist, truncates any
existing file, and writes the 2nd argument to the file.

This version is much prettier than the corresponding version without the
helper methods, but it is also clearer. It is obvious at a glance what
it does. The same can't be said for the version without helper methods.
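For reference, rough stand-ins for those helpers are easy to write on a current Ruby (Enumerable#each_slice does the chunking; the method names follow Mike's post and are not part of the standard library). One wrinkle: File.readlines keeps each line's newline, so joining chunks with an empty string avoids doubling the line endings:

```ruby
# Hypothetical implementations of the helpers Mike describes.
module Enumerable
  def chunks(size)
    each_slice(size).to_a
  end
end

def File.to_a(name)
  File.readlines(name)
end

def File.write_fresh(name, contents)
  # Mode "w" creates the file if missing and truncates an existing one.
  File.open(name, "w") { |f| f.write(contents) }
end

# Sample input so the one-liner below has something to split.
File.write("xyz.csv", (0...7).map { |i| "row#{i}\n" }.join)

File.to_a("xyz.csv").chunks(3).each_with_index do |chunk, i|
  File.write_fresh("xyz_part_#{i + 1}", chunk.join)
end
```

Note that this approach still slurps the whole file into memory, so it sidesteps elegance problems but not the 250 MB problem from the original post.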
 

Edwin Fine

I am perplexed by CSV.open. In IO and File, open returns something that
quacks like an IO object. You can then call gets, puts, read, write and
so on. The CSV open seems to return an array (or gives you a row at a
time). This makes it very hard to write code that can take advantage of
duck typing, e.g.

def get_batch(f, count)
  lines = nil
  count.times do
    break unless line = f.gets
    (lines ||= []) << line
  end
  lines
end

get_batch(File.open("foo"), 40000) # No problem
get_batch(CSV.open("foo", "r"), 40000) # Error

I wonder why it was not patterned more closely on IO? Any thoughts?
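For what it's worth, on current Ruby versions CSV instances alias gets to shift, so the duck typing largely works today; the remaining difference is what each gets returns (raw strings versus parsed row arrays). A quick check, using a throwaway file:

```ruby
require "csv"
require "tempfile"

# Edwin's helper, unchanged: anything responding to gets will do.
def get_batch(f, count)
  lines = nil
  count.times do
    break unless line = f.gets
    (lines ||= []) << line
  end
  lines
end

file = Tempfile.new(["foo", ".csv"])
file.write("a,b\nc,d\ne,f\n")
file.close

plain  = get_batch(File.open(file.path), 2)      # raw lines, newlines kept
parsed = get_batch(CSV.open(file.path, "r"), 2)  # parsed field arrays
```

So the interfaces converge on gets/readline/each, but callers still need to know which element type they are batching.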
 
