Splitting a CSV file into 40,000 line chunks

James Edward Gray II

I am perplexed by CSV.open.

Me too. ;)
I wonder why it was not patterned more closely on IO? Any thoughts?

The author does not believe CSV objects should be IO like. If you
search the archives for my messages about including FasterCSV in the
standard library you will run into our debates on this subject.

FasterCSV's open() behaves the way you expect.
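For instance, a minimal sketch (the file names are made up, and the require may be "fastercsv" rather than "faster_csv" depending on how you installed it):

require 'faster_csv'

# open() hands you an IO-like CSV object you can read from...
FasterCSV.open("data.csv", "r") do |csv|
  first = csv.shift            # read one row, a bit like gets()
  csv.each { |row| p row }     # ...or iterate over the rest
end

# ...or write to, much as you would with File.open().
FasterCSV.open("out.csv", "w") do |csv|
  csv << ["a", "b", "c"]
end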

James Edward Gray II
 
M. Edward (Ed) Borasky

Paul said:
James Edward Gray II wrote:
[snip]

Hmmmm ... seems like a bunch of folks are going to a bunch of trouble
to baby a tool that can only handle a CSV file with less than 65536
rows. Excel is broken ... Excel sucks at math ... Excel can't handle
lots of rows and lots of columns ... Excel costs money. Excel was a
great wonder in 1986 -- 20 years later it's lame beyond words.

I'm not sure about the column count limitations, but I know almost any
database will deal with a CSV file with more than 65536 rows. There are
some awfully good free databases. Did I mention how bad Excel is at
math? Sometimes you *don't* get what you pay for. :)
 
James Edward Gray II

Your own code ... er, excuse me, your own library ... will meet your
requirements exactly, it won't cover cases that are not relevant to the
problem at hand, it will be much faster overall than existing solutions,
and you will learn things about Ruby that you would not if you used
someone else's library.

Now you're guilty of a new sin: encouraging people to reinvent the
wheel. You just can't win, can you? ;)

Different problems require different levels of paranoia. Sometimes a
little code will get you over the hump, but you may be making some
trade-offs when you don't use a robust library. Sometimes those are
even good trade-offs, like sacrificing edge case handling to gain
some speed. Sometimes it's even part of the goal to avoid the
library, like when I built FasterCSV to address some needs CSV wasn't
meeting. As soon as things start getting serious though, *I* usually
feel safer reaching for the library.

The people reading this list have now seen us debate the issue and will
be able to make well-informed decisions about what they think is best.
On the other hand, if your data does not exploit this CSV trait (few
real-world CSV databases embed linefeeds)...

Really? How do they handle data with newlines in it?

Which "CSV databases" are you referring to here?

James Edward Gray II
 
Louis J Scoras

Hmm, that's a good question.

Perhaps you can collect just the key values in an Array and then use
those to reorder the lines bit by bit. That's not going to be fast
with any library helping you, but I don't have a better idea.

James Edward Gray II

Indeed. That problem is difficult in general because you need to have
the whole set of elements in memory before you can begin sorting them.
As James pointed out, you might be able to use some sort of
memoization technique to track only the bits relevant to sorting. The
problem is you'll also need some way to get back to the original
record.

Depending on how you end up parsing the records, you might be able
to store the file position of the start of the record and the record
length.

Records -> [sort_key, file.pos, record.length]

Then sort those arrays if you can fit them all in memory. Finally,
you can use the offsets for random access to grab the records and
stick them into the new files as you've been doing.
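Something like this untested two-pass sketch (the file names, the 4th-column key, and the naive comma split are all assumptions, and it ignores headers and embedded linefeeds):

# Pass 1: build a small in-memory index of [key, offset, length].
index = []
File.open("big.csv", "r") do |f|
  until f.eof?
    pos  = f.pos
    line = f.gets
    key  = line.split(",")[3]          # 4th cell, naive split
    index << [key, pos, f.pos - pos]   # record length in bytes
  end
end

index.sort!    # Array#sort compares element by element, so the key wins

# Pass 2: seek back to each record in sorted order, 40,000 per output file.
out = nil
File.open("big.csv", "r") do |f|
  index.each_with_index do |(key, pos, len), i|
    if i % 40_000 == 0
      out.close if out
      out = File.open("part_#{i / 40_000}.csv", "w")
    end
    f.pos = pos
    out.write(f.read(len))
  end
end
out.close if out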

Basically, you're looking at a complicated Schwartzian transform.
Whether it will work depends on how big your records are. If they are
fairly large, you might be able to pull it off; however, if they're
small and the problem is only that there are too many records, you'll
still have a problem.


In that case, you might want to just shove them in an RDBMS and let it
sort it for you.
 
Thomas Mueller

Drew said:
I'll give FasterCSV a try when I get home from work and out from behind
this proxy. Here's another question: in some cases I need to sort the
file before splitting it (in this case sorting by the 4th cell in each
row). However, the current file I'm trying to sort and split is around
76 MB and ruby fails when trying to store the CSV as an array. The code
and output are below. How else can I go about this?

You could import it into MS Access, sort there and export again. Maybe
you could even do all the splitting up in Access directly, but I don't
know too much about Access to help with that.
And I guess using Access would make it harder to automate this whole
task, in case that's required.

That's not very "rubyish", of course :)

Thomas
 
James Edward Gray II

Also, there is a hidden assumption in your position -- that
libraries, ipso facto, represent robust methods.
For the newbies, however, it might matter. They might think library
contents differ from ordinary code.

I sure hope they think that! I know I do.

There's no faster way to find bugs than to bundle up some code and
turn it loose on the world. That leads to more robust code. This is
the reason open source development works so well.

If one of us patches a library, everyone benefits. It's like having
a few hundred extra programmers on your staff.

Yes, I realize I'm over generalizing there. There will always be
poorly supported or weak libraries, but someone just forks or
replaces those eventually.
Linefeeds are escaped as though in a normal quoted string. This is how
I have always dealt with embedded linefeeds, which is why I was
ignorant of the specification's language on this (an explanation, not
an excuse).

So a linefeed is \n and then we need to escape the \ so that is \\, I
assume. Interesting.

I would argue that is not CSV, but it's certainly debatable. My
reasoning is that you either need to post process the CSV parsed data
to restore it or use a custom parser that understands CSV plus your
escaping rules.
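For what it's worth, a sketch of that post-processing step (untested; it assumes the only escapes in play are \\ and \n, which is narrower than what MySQL itself would emit):

require 'faster_csv'

# Undo backslash escapes inside an already-parsed field.
def unescape(field)
  field.gsub(/\\(.)/) { $1 == "n" ? "\n" : $1 }
end

row = FasterCSV.parse_line('a,"b\nc",d')
row = row.map { |field| field && unescape(field) }
# => ["a", "b\nc", "d"]   -- the \n is now a real linefeed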
MySQL, the database I am most familiar with, uses this method for
import or export of comma- or tab-separated plain-text data. Within
MySQL's own database protocol, linefeeds really are linefeeds, but an
imported or exported plain-text table has them escaped within fields.

Wild. I use MySQL every day. Guess I've never dumped a CSV of
linefeed-containing data with it though. (I generally walk the
database myself with a Ruby script and dump with FasterCSV.)
It just takes longer if all the database handling (not just record
parsing) must use the same state machine that field parsing must use.

I don't understand this comment. Like most databases, MySQL does not
use CSV internally.
It's very simple, really. Once you allow the record separator inside a
field, you give up any chance to parse records quickly.

Have you heard of the FasterCSV library? ;) It's pretty zippy.
But parsing will necessarily be slow, character by character, the
entire database scan must use an intelligent parser (no splitting
records on linefeeds as I have been doing), and the state machine
needs a few extra states.

You don't really have to parse CSV character by character. FasterCSV
does most of its parsing with a single highly optimized (to avoid
backtracking) regular expression and a few tricks.

Basically you can read line by line and divide into fields. If you
have an unclosed field at the end of the line, you hit an embedded
linefeed. You then just pull and append the next line and continue
eating fields.
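A toy illustration of that trick (this is not FasterCSV's actual code; the quote-counting test is a simplification and the file name is made up):

require 'faster_csv'

rows = []
File.open("data.csv", "r") do |f|
  while line = f.gets
    # An odd number of quote characters means a quoted field is still
    # open, so the logical row continues on the next physical line.
    while line.count('"') % 2 == 1 and more = f.gets
      line << more
    end
    rows << FasterCSV.parse_line(line)
  end
end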

The standard CSV library does not do this and that is one of two big
reasons it is so slow.

James Edward Gray II
 
M. Edward (Ed) Borasky

Thomas said:
You could import it into MS Access, sort there and export again. Maybe
you could even do all the splitting up in Access directly, but I don't
know too much about Access to help with that.
And I guess using Access would make it harder to automate this whole
task, in case that's required.

That's not very "rubyish", of course :)

Thomas
Well ... *databases* are highly Ruby-ish! Every major database,
including, I'm guessing, MS Abscess*, has a simple Ruby interface.
Reading a huge file into memory is the wrong approach.

* Abscess -- An abscess is a localized collection of pus in any part of
the body, caused by an infection.

http://www.nlm.nih.gov/medlineplus/ency/article/001353.htm
 
M. Edward (Ed) Borasky

Drew said:
James -

I'll give FasterCSV a try when I get home from work and out from behind
this proxy. Here's another question: in some cases I need to sort the
file before splitting it (in this case sorting by the 4th cell in each
row). However, the current file I'm trying to sort and split is around
76 MB and ruby fails when trying to store the CSV as an array. The code
and output are below. How else can I go about this?
On Linux (or Cygwin, since you're on Windows):

$ sort -k 4 -t , <inputfile> > <outputfile>

That will probably put the header line at the bottom of the file or
perhaps the middle, but that's pretty easy to fix.

$ cp <a-file-with-just-the-header-line> <outputfile>
$ grep -v <some-field-in-the-header-and-not-in-the-data> <inputfile> |
sort -k 4 -t , >> <outputfile>

**However**:

If the file is truly CSV (numeric values unquoted, date and time stamps
in official format for your locale, and all text fields quoted) there is
actually a way you can treat it like a table in a database with ODBC.
Open your little ODBC data sources gizmo and create a "File DSN". Use
the "text" driver. It will let you configure it so that the "database"
is the directory where your CSV file lives and your CSV file is a table
in that database. Then anything that can connect to an ODBC Data Source
Name will be able to query your CSV file.

If the sort is numeric, add a "-n" to the sort command above. If it's
descending, add "-r" and if it's both numeric and descending add "-nr".
"man sort" is your friend.
 
Jamey Cribbs

I'm coming to this party really late, so I hope I don't come across as
shamelessly plugging KirbyBase, but, you might want to try it for this.

If you are simply trying to take a large csv file, sort it by one of its
fields, and split it up into smaller files that each contain 40,000
records, I think it might work.

Here's some code (not tested, could be incorrect) off the top of my head:


require 'kirbybase'

db = KirbyBase.new

tbl = db.create_table(:foo, :field1, :String, :field2, :Integer,
:field3, :String, ...)

tbl.import_csv(name_of_csv_file)

rec_count = tbl.total_recs
last_recno_written_out = 0

while rec_count > 0
recs = tbl.select { |r| r.recno > last_recno_written_out and r.recno <
last_recno_written_out + 40000 }.sort(:field4)

........ here is where you put the code to write these 40,000 recs to
a csv output file .............

last_recno_written_out = recs.last.recno

rec_count = rec_count - 40000
end


KirbyBase will even use FasterCSV for its csv stuff if you have it
installed. :)


Anyway, hope this helps. If I have totally misunderstood the request,
feel free to ignore!

Jamey Cribbs
 
Edwin Fine

Paul said:
/ ...


This is an experience with which I am becoming familiar. Someone
requests a solution to a problem. Someone else offers the option of a
library to solve the problem. Then the original problem fades into the
background, replaced by discussion of the library's problems.

This same pattern has repeated itself about four times in the past
fortnight, in just this one newsgroup.
/snip/

Ummm, I am not sure exactly how to interpret the above post, but I see
my name quoted there, so I feel compelled to clarify what I was thinking
in making my original post. I had just written a small Ruby program that
would satisfy the OP's stated problem, but using IO/File. While I was
doing this, more posts appeared, which alerted me to the possibility
that I would have to cater for newlines in the input. "Oh well", I
thought, "I'll just replace every use of IO with CSV, and that will
be that." BZZZ! Wrong! Thank you for playing. I couldn't drop in CSV
instead of IO? WTF???

This is where my perplexity came in. Matz himself has regularly and
clearly stated that he designed Ruby along the Principle Of Least
Surprise (or LOLA, Law of Least Astonishment). Well, I was grievously
surprised and astonished when CSV#open behaved differently to every open
I have used in any language. All the other opens that I know of return
the concept of a handle/object, or some *thing* that can then be
beseeched to bring forth the contents of the actual I/O "device", one
element at a time, or all at once. The CSV#open skips this step and goes
straight from open to bringing forth entities, and thereby breaks
possible compatibility with IO/File. IMHO, this is inelegant design.

I have written many, many libraries (not in Ruby) and know how important
it is to present to your users a consistent, clean, orthogonal,
robust, reliable set of interfaces and implementations. That's why it is
inadvisable to release a very early version of a library to a large
audience of end users until it has proven itself in battle, as it were.
Otherwise, you face the prospect of having to change the interface to be
less surprising (*and* keep a backward-compatible, deprecated one) and
re-releasing it to possibly hundreds of users.

The bottom line is, although I am perfectly capable of doing so, I don't
WANT to reinvent the wheel. I really like reusing good, dependable,
predictable code. I haven't had time to look at FasterCSV yet, but I
predict that I will like what I see, because to my mind, from the works
of his I have seen, the author does his best to embody the "Tao" of Ruby
(LOLA). (Although you can never accurately describe the Tao of anything,
I know...)

Well, that's my 2c worth :)
 
Jamey Cribbs

Jamey said:
I'm coming to this party really late, so I hope I don't come across as
shamelessly plugging KirbyBase, but, you might want to try it for this.

If you are simply trying to take a large csv file, sort it by one of
its fields, and split it up into smaller files that each contain
40,000 records, I think it might work.

Here's some code (not tested, could be incorrect) off the top of my head:


require 'kirbybase'

db = KirbyBase.new

tbl = db.create_table(:foo, :field1, :String, :field2, :Integer,
:field3, :String, ...)

tbl.import_csv(name_of_csv_file)

rec_count = tbl.total_recs
last_recno_written_out = 0

while rec_count > 0
recs = tbl.select { |r| r.recno > last_recno_written_out and r.recno
< last_recno_written_out + 40000 }.sort(:field4)

........ here is where you put the code to write these 40,000 recs to
a csv output file .............

last_recno_written_out = recs.last.recno

rec_count = rec_count - 40000
end

I realized this morning that the solution I posted last night won't work
because you need the whole dataset sorted *before* you start splitting
it up into 40,000 record files. Oops!

Anyway, in an attempt to recover gracefully from my mistake and also to
give me the opportunity to shamelessly plug another one of my libraries,
I present the following proposed solution that is totally untested and
probably full of holes:

require 'mongoose'
db = Mongoose::Database.new

db.create_table(:foo) do |tbl|
  tbl.add_column(:field1, :string)
  tbl.add_column(:field2, :string)
  tbl.add_column(:field3, :integer)
  tbl.add_indexed_column(:field4, :string)
 
M. Edward (Ed) Borasky

Paul said:
Nice, informative post. There are a lot of issues here, primarily the fact
that the database under discussion is too big to hold in memory, and it is
also too big to fit into Excel in one chunk, which appears to be its
destination.

Most people have begun to drift toward suggesting a database approach,
rather than anything that involves direct manipulation of the database in
Ruby. Because of the size of the database and because sorting the records
is one goal, I have to agree.
I haven't "begun to drift" -- I'll flat out say, "Use a %^$&%^$( database!"
 
Mike Harris

Louis said:
Hmm, that's a good question.

Perhaps you can collect just the key values in an Array and then use
those to reorder the lines bit by bit. That's not going to be fast
with any library helping you, but I don't have a better idea.

James Edward Gray II


Indeed. That problem is difficult in general because you need to have
the whole set of elements in memory before you can begin sorting them.
As James pointed out, you might be able to use some sort of
memoization technique to track only the bits relevant to sorting. The
problem is you'll also need some way to get back to the original
record.

Depending on how you end up parsing the records, you might be able
to store the file position of the start of the record and the record
length.

Records -> [sort_key, file.pos, record.length]

Then sort those arrays if you can fit them all in memory. Finally,
you can use the offsets for random access to grab the records and
stick them into the new files as you've been doing.

Basically, you're looking at a complicated Schwartzian transform.
Whether it will work depends on how big your records are. If they are
fairly large, you might be able to pull it off; however, if they're
small and the problem is only that there are too many records, you'll
still have a problem.


In that case, you might want to just shove them in an RDBMS and let it
sort it for you.
Let's say you want to sort by the foo column:

1. Read in all the foo values and sort them.
2. Get every 40,000th value from the list.
3. Now, upon reading any row, you can determine what page it should go on.
4. Read the file, get the rows for the first N pages, ignoring the rest
   of the rows, where N is a number that won't run you out of memory.
5. Create the files for those rows.
6. Remove references to the rows you read in.
7. Repeat with the next N pages until finished.
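An untested sketch of those steps (the file name, the 4th-column key, and the pages-per-batch constant are assumptions; duplicate keys at a boundary can push a page a little past 40,000 rows):

require 'faster_csv'

# Steps 1-2: collect and sort just the keys, then mark every 40,000th one.
keys = []
FasterCSV.foreach("big.csv") { |row| keys << row[3] }
keys.sort!

boundaries = []
(0...keys.size).step(40_000) { |i| boundaries << keys[i] }

# Step 3: a row belongs to the last page whose boundary key is <= its key.
def page_for(key, boundaries)
  page = 0
  boundaries.each_with_index { |b, i| page = i if key >= b }
  page
end

# Steps 4-7: keep only the rows for one batch of pages per pass over the file.
pages_per_batch = 5
0.step(boundaries.size - 1, pages_per_batch) do |first_page|
  batch = Hash.new { |h, k| h[k] = [] }
  FasterCSV.foreach("big.csv") do |row|
    p = page_for(row[3], boundaries)
    batch[p] << row if p >= first_page && p < first_page + pages_per_batch
  end
  batch.keys.sort.each do |p|
    FasterCSV.open("page_#{p}.csv", "w") do |out|
      batch[p].sort_by { |row| row[3] }.each { |row| out << row }
    end
  end
end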
 
Drew Olson

Thanks everyone for ALL the replies. Lots of interesting things to think
about. I'll take a look at using a database approach for this and I'm
looking at FasterCSV now. Also, some very good insight related to
building code from scratch and using libraries.

Another great example of the ruby community at work IMO.
 
