Splitting a CSV file into 40,000 line chunks


Drew Olson

All -

I've written a script to split a .csv file into smaller .csv files of
40,000 lines each. The intent here is to break the file down enough
that Excel does not have issues reading each chunk. My code takes a
filename from the command line and breaks it down like so:

infile -> xyz.csv

output -> xyz_part_1.csv
xyz_part_2.csv
etc...

My code is working, but I don't find it very "rubyish". In particular,
I hate having my index and counter variables, and I don't like that I
had to declare my header variable outside of the loop. Bear in mind
that I cannot do something like "rows = CSV.open(infile)" because Ruby
will raise an error, as the input file is too big (250 MB). Any advice
on making the code nicer is appreciated. The current code is as
follows:

require 'csv'

infile = ARGV[0] if ARGV[0] != nil

counter = 1
index = 0
header = ""
writer = CSV.open(infile.gsub(/\./, "_part_" + counter.to_s + "."), 'w')

CSV.open(infile, 'r') do |row|
  if index != 0 && index % 40000 == 0
    writer.close
    counter += 1
    writer = CSV.open(infile.gsub(/\./, "_part_" + counter.to_s + "."), 'w')
    writer << header
  end
  header = row if index == 0
  writer << row
  index += 1
end

writer.close
 

Simon Strandgaard

Drew Olson wrote:
My code is working but I don't find it very "rubyish". In particular, I
hate having my index and counter counters and I don't like that I had to
declare my header variable outside of the loop. Bear in mind here that I
can not do something like "rows = CSV.open(infile)" because ruby will
yell and error as the input file is too big (250 mb). Any advice on
making the code nicer is appreciated. The current code is as follows:


File.open("test", "w+") do |f|
  20.times do |i|
    f.write "abc|def|ghi#{i}\n"
  end
end

File.open("test", "r") do |fin|
  part = 0
  while row = fin.gets
    File.open("test_part#{part}", "w+") do |f|
      5.times do
        f.write(row)
        break unless row = fin.gets
      end
    end
    part += 1
  end
end
 

Will Jessop

Paul said:
Why are you using CSV for this? You aren't parsing the lines into fields, so
the fact that they contain CSV content has no bearing on the present task.
Your goal is to split the input file into groups of lines delimited by
linefeeds, not fields delimited by commas.

Why not simply read lines from the input file and write them to a series of
output files, until the input file is exhausted?

Because the CSV library understands CSV data with embedded newlines:

[will@localhost ~]$ cat csvfile.csv
some, field,"new
line"
other,field,here
[will@localhost ~]$ cat test.rb
require 'csv'

CSV.open('csvfile.csv', 'r') do |row|
p row
end
[will@localhost ~]$ ruby test.rb
["some", " field", "new\nline"]
["other", "field", "here"]

will.
 

ChrisH

Nice, Simon.

But the 'while row = fin.gets' is skipping a row each time around, and
you don't handle the header. My "fix":

Simon Strandgaard wrote:
File.open("test", "w+") do |f|
  20.times do |i|
    f.write "abc|def|ghi#{i}\n"
  end
end

File.open("test", "r") do |fin|
  part = 0
  header = fin.gets
  row = fin.gets
  while row
    File.open("test_part#{part}", "w+") do |f|
      f.write(header)
      5.times do
        f.write(row)
        break unless row = fin.gets
      end
    end
    part += 1
  end
end


Cheers
Chris
 

ChrisH

James said:
Because the sample code is also moving the headers to the new files.

James Edward Gray II

Also, since he's feeding Excel, this is most likely on Windows rather
than *NIX.

Cheers
Chris
 

Simon Strandgaard

ChrisH wrote:
Nice Simon

But the 'while row = fin.gets' is skipping a row each time around and
you don't handle the header. My "fix":

I usually mess up in my loops like this :-(
Thanks :)
 

James Edward Gray II

Any advice on making the code nicer is appreciated.

I'll let you decide if this is any better:

Firefly:~/Desktop$ ls
gen_csv.rb split_csv.rb
Firefly:~/Desktop$ ruby gen_csv.rb
Firefly:~/Desktop$ ls
data.csv gen_csv.rb split_csv.rb
Firefly:~/Desktop$ cat data.csv
rec_no,whatever
0,"blah, blah, blah, ..."
1,"blah, blah, blah, ..."
2,"blah, blah, blah, ..."
3,"blah, blah, blah, ..."
4,"blah, blah, blah, ..."
5,"blah, blah, blah, ..."
6,"blah, blah, blah, ..."
7,"blah, blah, blah, ..."
8,"blah, blah, blah, ..."
9,"blah, blah, blah, ..."
10,"blah, blah, blah, ..."
11,"blah, blah, blah, ..."
12,"blah, blah, blah, ..."
13,"blah, blah, blah, ..."
14,"blah, blah, blah, ..."
15,"blah, blah, blah, ..."
16,"blah, blah, blah, ..."
17,"blah, blah, blah, ..."
18,"blah, blah, blah, ..."
19,"blah, blah, blah, ..."
Firefly:~/Desktop$ ruby split_csv.rb data.csv
Firefly:~/Desktop$ ls
data.csv data_part_2.csv data_part_4.csv gen_csv.rb
data_part_1.csv data_part_3.csv data_part_5.csv split_csv.rb
Firefly:~/Desktop$ cat data_part_1.csv
rec_no,whatever
0,"blah, blah, blah, ..."
1,"blah, blah, blah, ..."
2,"blah, blah, blah, ..."
3,"blah, blah, blah, ..."
Firefly:~/Desktop$ cat data_part_3.csv
rec_no,whatever
8,"blah, blah, blah, ..."
9,"blah, blah, blah, ..."
10,"blah, blah, blah, ..."
11,"blah, blah, blah, ..."
Firefly:~/Desktop$ cat split_csv.rb
#!/usr/bin/env ruby -w

require "rubygems"
require "faster_csv"

original_file = ARGV.shift
counter = 1

FCSV.open(original_file.sub(".", "_part_#{counter}."), "w") do |out|
  FCSV.open(original_file, :headers => true).each_with_index do |row, index|
    if (index % 4).zero?
      if index.nonzero?
        counter += 1
        out.reopen(original_file.sub(".", "_part_#{counter}."), "w")
      end
      out << row.headers
    end
    out << row
  end
end

__END__

James Edward Gray II
 

Drew Olson

Thanks for all the responses. As noted in a post above, I am trying to
place the headers at the beginning of each split file. Should have
mentioned that earlier. Also, regarding the above responses, I am
working on a Windows system, hence not being able to use nice shell
scripts.

Anyway, it seems that my logic was on the right track; the use of ugly
counters just made my inner rubyist upset. Thanks for all the
feedback. I will take a look at reading/writing the files using the
File object as opposed to the CSV object. In fact, I didn't even think
about that. In previous scripts I had been accessing the CSV fields
individually but, as that is unnecessary here, I will toy around with
the performance gains of using the File object.

Speaking of performance, the code I've described seems quite slow.
This is understandable given that the input file has 1 million+
records, but I was wondering if I would see a performance gain by
using File objects as opposed to CSV objects. Any other tips on
increasing the performance of this code would also be helpful.

Again, thanks for all the responses. I am continually impressed by the
attitude and intelligence of this community.
 

James Edward Gray II

Speaking of performance, using the code I've described seemed to be
quite slow. This is understandable based on the fact that the input
file has 1 million+ records, however I was wondering if I would see
performance gain by using File objects as opposed to CSV objects.
Also, any other tips on increasing the performance of this code would
be helpful.

CSV is known to be pretty slow, thus FasterCSV. ;)
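For a rough sense of that gap on a current Ruby's stdlib, a sketch like this can be used (illustrative only; file names and sizes are made up, and timings will vary by machine and data):

```ruby
require "csv"
require "benchmark"
require "tempfile"

# Build a small stand-in for the big input file.
sample = Tempfile.new(["sample", ".csv"])
10_000.times { |i| sample.puts %Q{#{i},"blah, blah, blah"} }
sample.close

Benchmark.bm(6) do |bm|
  # Plain line-by-line reading: no parsing at all.
  bm.report("gets") { File.foreach(sample.path) { |line| line } }
  # Full CSV parsing of every record.
  bm.report("csv")  { CSV.foreach(sample.path) { |row| row } }
end

# Both walks see the same 10,000 records; the difference is parsing cost.
line_count = File.foreach(sample.path).count
row_count  = CSV.foreach(sample.path).count
```

The line-reading pass typically wins by a wide margin, which is exactly the trade-off being debated in this thread: speed versus correct handling of quoted fields.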

James Edward Gray II
 

James Edward Gray II

Paul said:
A plain-text CSV file uses linefeeds as record delimiters. A program
that uses "readline" or "gets" splits the records just as a sane CSV
parser would. And IMHO a CSV file should never, ever have linefeeds
embedded in fields.

Your opinion doesn't make you right on this one. The CSV RFC clearly
defines handling for carriage-returns and linefeeds. They certainly
are allowed in fields. Here is a link to the document, in case you
want to read up:

http://www.ietf.org/rfc/rfc4180.txt

Not to use a CSV parser on this task would be shooting yourself in
the foot. The result using a simple File object would be broken and,
much worse, it might look OK for a while. You just can't be sure you
are never going to split a CSV file that has an embedded linefeed in
it (especially since that's perfectly legal), and when you do you
will be responsible for destroying data. There's just no reason for
that.

I know you're a don't-use-a-library guy and you know I disagree.
This is the reason why. The edge cases will get you every time.
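To make that failure mode concrete, Will's earlier example can be reduced to a few lines (using CSV.parse from the current stdlib):

```ruby
require "csv"

# One logical record with an embedded newline, then a plain record.
data = %Q{some,field,"new\nline"\nother,field,here\n}

# Naive line splitting cuts the quoted field in half...
naive = data.split("\n")   # 3 "records", one of them garbage
# ...while a CSV parser keeps it as one record with the newline intact.
rows = CSV.parse(data)     # 2 records
```

Split the file on the wrong physical line and the chunk boundary lands inside a quoted field, corrupting two output files at once.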

James Edward Gray II
 

Drew Olson

James -

I'll give FasterCSV a try when I get home from work and out from
behind this proxy. Here's another question: in some cases I need to
sort the file before splitting it (in this case, sorting by the 4th
cell in each row). However, the current file I'm trying to sort and
split is around 76 MB, and Ruby fails when trying to store the CSV as
an array. The code and output are below. How else can I go about this?

Code:

require 'csv'

if ARGV[0] != nil
  infile = ARGV[0]
else
  puts "Please enter a file to split"
  exit
end

puts "loading file"

rows = CSV.read(infile)

puts "file loaded"

rows.sort! { |x, y| x[3] <=> y[3] }

counter = 1
header = ""
writer = CSV.open(infile.gsub(/\./, "_part_" + counter.to_s + "."), 'w')

rows.each_index do |i|
  if i != 0 && i % 40000 == 0
    writer.close
    counter += 1
    writer = CSV.open(infile.gsub(/\./, "_part_" + counter.to_s + "."), 'w')
    writer << header
  end
  header = rows[i] if i == 0
  writer << rows[i]
end

writer.close

Output:

loading file
The exception unknown software exception (0xc00000fd) occured in the
application at location.....etc
 

James Edward Gray II

Here's another question: in some cases I need to sort the file before
splitting it (in this case sorting by the 4th cell in each row).
However, the current file I'm trying to sort and split is around 76 MB
and ruby fails when trying to store the CSV as an array. The code and
output are below. How else can I go about this?

Hmm, that's a good question.

Perhaps you can collect just the key values in an Array and then use
those to reorder the lines bit by bit. That's not going to be fast
with any library helping you, but I don't have a better idea.
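That idea can be sketched as a two-pass job: hold only [key, byte offset] pairs in memory, sort those, then seek back to copy each row out in order. The sketch below assumes no embedded newlines in the data (a CSV-aware index would have to account for them), and the file names are made up for the example:

```ruby
require "csv"

# Tiny stand-in for the 76 MB input: a header plus rows keyed by the 4th cell.
File.open("data.csv", "w") do |f|
  f.puts "name,a,b,key"
  [3, 1, 2].each { |k| f.puts "row#{k},x,y,#{k}" }
end

# Pass 1: remember only [sort_key, byte_offset] for each data row.
index = []
File.open("data.csv") do |f|
  f.gets # skip the header
  until f.eof?
    offset = f.pos
    line   = f.gets
    index << [CSV.parse_line(line)[3], offset]
  end
end

index.sort_by!(&:first)

# Pass 2: seek to each remembered offset and copy the rows out in key order.
File.open("data.csv") do |f|
  File.open("data_sorted.csv", "w") do |out|
    out.write(f.gets) # header first
    index.each do |_key, offset|
      f.seek(offset)
      out.write(f.gets)
    end
  end
end
```

The index for a million rows is just a million small pairs, which fits in memory comfortably even when the full parsed rows do not; the cost is the random seeking in pass 2.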

James Edward Gray II
 

Reid Thompson

Drew Olson wrote:
Reid -

I also have cygwin installed on my box, so if there is a simple way to
do this in Unix I'd love to know. My Unix is rusty (and that's being
generous).
rthompso@jhereg:~$ cat splitandhead
#!/bin/bash

fname=$1
split -l 4000 -a 6 -d $fname $fname

for file in `ls ${fname}0*`
do
    sed -e '1i\Col1 Col2 Col3 Col4\' $file > ${file}.tmp
    mv ${file}.tmp $file
done



rthompso@jhereg:~$ head testsed
this is a test
this is a test
this is a test
this is a test
this is a test
this is a test
this is a test
this is a test
this is a test
this is a test
rthompso@jhereg:~$ wc testsed
40140 160560 602100 testsed
rthompso@jhereg:~$ time ./splitandhead testsed

real 0m0.499s
user 0m0.140s
sys 0m0.092s
rthompso@jhereg:~$ head -2 testsed0000*
==> testsed000000 <==
Col1 Col2 Col3 Col4
this is a test

==> testsed000001 <==
Col1 Col2 Col3 Col4
this is a test

==> testsed000002 <==
Col1 Col2 Col3 Col4
this is a test

==> testsed000003 <==
Col1 Col2 Col3 Col4
this is a test

==> testsed000004 <==
Col1 Col2 Col3 Col4
this is a test

==> testsed000005 <==
Col1 Col2 Col3 Col4
this is a test

==> testsed000006 <==
Col1 Col2 Col3 Col4
this is a test

==> testsed000007 <==
Col1 Col2 Col3 Col4
this is a test

==> testsed000008 <==
Col1 Col2 Col3 Col4
this is a test

==> testsed000009 <==
Col1 Col2 Col3 Col4
this is a test

==> testsed000010 <==
Col1 Col2 Col3 Col4
this is a test
rthompso@jhereg:~$
 

Reid Thompson

Hmm, that's a good question.

Perhaps you can collect just the key values in an Array and then use
those to reorder the lines bit by bit. That's not going to be fast
with any library helping you, but I don't have a better idea.

James Edward Gray II
sort -n +4 out
or
sort +4 out

(That's sort's legacy +POS syntax for skipping four fields; with a
modern GNU sort the equivalent would be "sort -n -k5" or "sort -k5".)


rthompso@jhereg:~$ head -50 out
this is a test 10202
this is a test 23990
this is a test 11056
this is a test 9606
this is a test 28590
this is a test 18264
this is a test 12902
this is a test 12856
this is a test 27571
this is a test 5495
this is a test 15965
this is a test 22229
this is a test 18865
this is a test 31339
this is a test 21913
this is a test 406
this is a test 8602
this is a test 5329
this is a test 10048
this is a test 6458
this is a test 20069
this is a test 19771
this is a test 21844
this is a test 24719
this is a test 30894
this is a test 25239
this is a test 9900
this is a test 1727
this is a test 12042
this is a test 20832
this is a test 23735
this is a test 28768
this is a test 10283
this is a test 390
this is a test 10480
this is a test 1337
this is a test 2745
this is a test 26398
this is a test 32288
this is a test 3797
this is a test 22251
this is a test 458
this is a test 14679
this is a test 29642
this is a test 19943
this is a test 26342
this is a test 24232
rthompso@jhereg:~$ sort +4 out
this is a test 0
this is a test 10
this is a test 10
this is a test 100
this is a test 100
this is a test 100
this is a test 1000
this is a test 1000
this is a test 10001
this is a test 10001
this is a test 10005
this is a test 10006
this is a test 10006
this is a test 10007
this is a test 10007
this is a test 10008
this is a test 10008
this is a test 10009
this is a test 10009
this is a test 1001
this is a test 10010
this is a test 10011
this is a test 10012
this is a test 10012
this is a test 10012
this is a test 10012
this is a test 10012
this is a test 10013
this is a test 10014
this is a test 10014
this is a test 10015
this is a test 10017
this is a test 10017
this is a test 10017
this is a test 10018
this is a test 10019
this is a test 1002
this is a test 1002
this is a test 10021
this is a test 10026
this is a test 10026
-------------------- OR
rthompso@jhereg:~$ sort -n +4 out
this is a test 0
this is a test 2
this is a test 3
this is a test 3
this is a test 4
this is a test 7
this is a test 10
this is a test 10
this is a test 12
this is a test 13
this is a test 14
this is a test 14
this is a test 15
this is a test 16
this is a test 16
this is a test 20
this is a test 21
this is a test 24
this is a test 25
this is a test 28
this is a test 29
this is a test 29
this is a test 29
this is a test 30
this is a test 30
this is a test 31
this is a test 32
this is a test 32
this is a test 34
this is a test 35
this is a test 35
this is a test 35
this is a test 36
this is a test 36
this is a test 37
this is a test 37
this is a test 38
this is a test 38
this is a test 41
this is a test 41
this is a test 41
this is a test 41
this is a test 41
this is a test 41
this is a test 42
this is a test 42
this is a test 42
this is a test 43
this is a test 45
this is a test 46
this is a test 46
this is a test 47
this is a test 47
 

Mike Harris

Drew said:
All -

I've written a script to split a .csv file into smaller .csv files of
40,000 lines each. The intent here is to break the file down enough so
that excel does not have issues reading each chunk. My code takes a
filename from the command line and breaks it down as so:

infile -> xyz.csv

output -> xyz_part_1.csv
xyz_part_2.csv
etc...

My code is working but I don't find it very "rubyish". In particular, I
hate having my index and counter counters and I don't like that I had to
declare my header variable outside of the loop. Bear in mind here that I
can not do something like "rows = CSV.open(infile)" because ruby will
yell and error as the input file is too big (250 mb). Any advice on
making the code nicer is appreciated. The current code is as follows:

require 'csv'

infile = ARGV[0] if ARGV[0] != nil

counter = 1
index = 0
header = ""
writer = CSV.open(infile.gsub(/\./,"_part_"+counter.to_s+"."),'w')

CSV.open(infile, 'r') do |row|
if(index != 0 && index%40000 == 0)
writer.close
counter+=1
writer = CSV.open(infile.gsub(/\./,"_part_"+counter.to_s+"."),'w')
writer << header
end
if (index == 0)
header = row
end
writer << row
index += 1
end

writer.close()
I will ignore the CSV issue, not because it isn't important, but
simply because I'm not familiar with the CSV parser, and this example
sufficiently represents the concept.

For maximum elegance, I would write the code this way. It uses the
helper methods File#write_fresh, File#to_a and Enumerable#chunks (i.e.
not in the stdlib), all of which I've written at one time or another.
Mentally sub in the appropriate code as desired.

File.to_a('xyz.csv').chunks(40000).each_with_index do |chunk, i|
  File.write_fresh("xyz_part_#{i + 1}", chunk.join("\n"))
end

File.to_a returns an array of lines.
Enumerable#chunks divides an Enumerable into groups of the given size;
a 100k array would yield two 40k chunks and a 20k chunk.
File#write_fresh creates the file if it doesn't exist, truncates any
existing file, and writes the 2nd argument to the file.

This version is much prettier than the corresponding version without the
helper methods, but it is also clearer. It is obvious at a glance what
it does. The same can't be said for the version without helper methods.
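For reference, rough stand-ins for those helpers are easy to write on a current Ruby (Enumerable#each_slice does the chunking; the method names follow Mike's post and are not part of the standard library). One wrinkle: File.readlines keeps each line's newline, so joining chunks with an empty string avoids doubling the line endings:

```ruby
# Hypothetical implementations of the helpers Mike describes.
module Enumerable
  def chunks(size)
    each_slice(size).to_a
  end
end

def File.to_a(name)
  File.readlines(name)
end

def File.write_fresh(name, contents)
  # Mode "w" creates the file if missing and truncates an existing one.
  File.open(name, "w") { |f| f.write(contents) }
end

# Sample input so the one-liner below has something to split.
File.write("xyz.csv", (0...7).map { |i| "row#{i}\n" }.join)

File.to_a("xyz.csv").chunks(3).each_with_index do |chunk, i|
  File.write_fresh("xyz_part_#{i + 1}", chunk.join)
end
```

Note that this approach still slurps the whole file into memory, so it sidesteps elegance problems but not the 250 MB problem from the original post.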
 

Edwin Fine

I am perplexed by CSV.open. In IO and File, open returns something that
quacks like an IO object. You can then call gets, puts, read, write and
so on. The CSV open seems to return an array (or gives you a row at a
time). This makes it very hard to write code that can take advantage of
duck typing, e.g.

def get_batch(f, count)
  lines = nil
  count.times do
    break unless line = f.gets
    (lines ||= []) << line
  end
  lines
end

get_batch(File.open("foo"), 40000) # No problem
get_batch(CSV.open("foo", "r"), 40000) # Error

I wonder why it was not patterned more closely on IO? Any thoughts?
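For what it's worth, on current Ruby versions CSV instances alias gets to shift, so the duck typing largely works today; the remaining difference is what each gets returns (raw strings versus parsed row arrays). A quick check, using a throwaway file:

```ruby
require "csv"
require "tempfile"

# Edwin's helper, unchanged: anything responding to gets will do.
def get_batch(f, count)
  lines = nil
  count.times do
    break unless line = f.gets
    (lines ||= []) << line
  end
  lines
end

file = Tempfile.new(["foo", ".csv"])
file.write("a,b\nc,d\ne,f\n")
file.close

plain  = get_batch(File.open(file.path), 2)      # raw lines, newlines kept
parsed = get_batch(CSV.open(file.path, "r"), 2)  # parsed field arrays
```

So the interfaces converge on gets/readline/each, but callers still need to know which element type they are batching.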
 
