Fastest CSV parsing?

William James · Aug 16, 2007

This is the best I've come up with so far. It should handle any CSV record
(i.e., fields may contain commas, double quotes, and newlines).

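class String
  def csv
    if include? '"'
      # Append a comma so that every field, including the last, ends with one.
      ary =
        "#{chomp},".scan( /\G"([^"]*(?:""[^"]*)*)",|\G([^,"]*),/ )
      # Anything left over after the last match means a malformed record.
      raise "Bad csv record:\n#{self}" if $' != ""
      # a[0] is a quoted field (doubled quotes unescaped); a[1] is unquoted.
      ary.map{|a| a[1] || a[0].gsub(/""/,'"') }
    else
      ary = chomp.split( /,/, -1)
      ## "".csv ought to be [""], not [], just as
      ## ",".csv is ["",""].
      if [] == ary
        [""]
      else
        ary
      end
    end
  end
end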
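
For illustration, here is a minimal sketch of how the method above behaves on a few records. It is rewritten as a standalone method so it can be run without reopening String; the name csv_fields is hypothetical and not from the post, while the regex is the one shown above.

```ruby
# Hypothetical standalone equivalent of the String#csv method above.
def csv_fields(record)
  line = record.chomp
  unless line.include?('"')
    # Fast path: no quotes, so a plain split is enough.
    fields = line.split(/,/, -1)
    return fields.empty? ? [""] : fields
  end
  rows = "#{line},".scan(/\G"([^"]*(?:""[^"]*)*)",|\G([^,"]*),/)
  # $' is nil when nothing matched and non-empty when input is left over.
  raise ArgumentError, "Bad csv record:\n#{record}" if $' != ""
  rows.map { |quoted, plain| plain || quoted.gsub('""', '"') }
end

p csv_fields("a,b,c\n")                    # => ["a", "b", "c"]
p csv_fields(%q{1,"x, y","say ""hi"""})    # => ["1", "x, y", "say \"hi\""]
p csv_fields(%Q{"line one\nline two",z})   # => ["line one\nline two", "z"]
p csv_fields("")                           # => [""]
p csv_fields(",")                          # => ["", ""]
```

A record with an unterminated quote, such as `"abc`, raises instead of returning a partial result.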

James Edward Gray II · Aug 16, 2007

You are pretty much rewriting FasterCSV here. Why do that when we
could just use it instead?

James Edward Gray II

William James · Aug 17, 2007

James said:
You are pretty much rewriting FasterCSV here. Why do that when we
could just use it instead?

That is a dishonest comment.

What if someone had said to you when you released "FasterCSV":
"You are pretty much rewriting CSV here. Why do that when we
could just use it instead?"

Parsing CSV isn't very difficult.
"FasterCSV" is too slow and far too large. People don't need
to be installing it on their systems when a few lines of code
will do the job.

Why do you want people to be dependent on your slow, bloated
code? You perhaps think that if there is an alternative,
you won't be paid any more money.

David Mullet · Aug 17, 2007

William said:
What if someone had said to you when you released "FasterCSV":
"You are pretty much rewriting CSV here. Why do that when we
could just use it instead?"

Point made. However...

Why do you want people to be dependent on your slow, bloated
code? You perhaps think that if there is an alternative,
you won't be paid any more money.

From JEG2's own blog post:

"If your number one concern when working with CSV data in Ruby is raw
speed, you might want to know that FasterCSV is no longer the fastest
option."

http://blog.grayproductions.net/articles/2007/04/16/no-longer-the-fastest-game-in-town

Your code may or may not be faster -- you've offered no comparison.
Regardless, I doubt that JEG2 was trying to stifle your efforts; just
suggesting that you may want to avoid reinventing the wheel.
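
A comparison of the kind asked for here could be sketched with Ruby's Benchmark module. This is only an outline under assumptions: the sample line, the iteration count, and the method name regex_fields are made up for illustration, and the csv library is timed only if it can be loaded (since Ruby 1.9 the standard csv library is based on FasterCSV). No timings are claimed; results will vary by machine and Ruby version.

```ruby
require "benchmark"
begin
  require "csv"
rescue LoadError
  # csv is not available here; only the regex parser will be timed.
end

# Hypothetical compact form of the regex approach from the first post.
def regex_fields(record)
  rows = "#{record.chomp},".scan(/\G"([^"]*(?:""[^"]*)*)",|\G([^,"]*),/)
  raise ArgumentError, "Bad csv record" if $' != ""
  rows.map { |quoted, plain| plain || quoted.gsub('""', '"') }
end

line = %q{1,"Smith, John","He said ""hi""",42}  # made-up sample record
n = 10_000                                       # arbitrary iteration count

Benchmark.bm(12) do |bm|
  bm.report("regex scan")  { n.times { regex_fields(line) } }
  bm.report("CSV library") { n.times { CSV.parse_line(line) } } if defined?(CSV)
end
```

A fair test would use real data with a realistic mix of quoted and unquoted fields, since the regex method takes a cheaper path when a record contains no quotes.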

David

Tim Hunter · Aug 17, 2007

William said:
Why do you want people to be dependent on your slow, bloated
code? You perhaps think that if there is an alternative,
you won't be paid any more money.

Hmmm...and here I was thinking that FasterCSV was free software. Have
you identified some way that James is making money from it? Have you
identified some way that using FasterCSV is hurtful?

William, there's no need to be so angry. We're all here to help each other.

James Edward Gray II · Aug 17, 2007

You are pretty much rewriting FasterCSV here. Why do that when we
could just use it instead?

William said:
That is a dishonest comment.

Not honest? I guess I'm not sure how you meant that.

FasterCSV's parser uses a very similar regular expression. Quoting
from the source:

  # prebuild Regexps for faster parsing
  @parsers = {
    :leading_fields =>
      /\A(?:#{Regexp.escape(@col_sep)})+/,     # for empty leading fields
    :csv_row        =>
      ### The Primary Parser ###
      / \G(?:^|#{Regexp.escape(@col_sep)})     # anchor the match
        (?: "((?>[^"]*)(?>""[^"]*)*)"          # find quoted fields
            |                                  # ... or ...
            ([^"#{Regexp.escape(@col_sep)}]*)  # unquoted fields
            )/x,
      ### End Primary Parser ###
    :line_end       =>
      /#{Regexp.escape(@row_sep)}\z/           # safer than chomp!()
  }

I felt they were similar enough to say you were recreating it. I can
live with it if you don't agree though.

What if someone had said to you when you released "FasterCSV":
"You are pretty much rewriting CSV here. Why do that when we
could just use it instead?"

They did. I said it was too slow and I didn't care for the
interface, though some do prefer it. Pretty much what you just said
to me, so I look forward to using your EvenFasterCSV library on my
next project.

Parsing CSV isn't very difficult.

Yeah, it's not too tough.

I'm a little bothered by how your solution makes me slurp the data
into a String though. Today I was working with a CSV file with over
35,000 records in it, so I'm not too comfortable with that. You
might consider adding a little code to ease that.
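
One way to avoid slurping could look like the sketch below: read the input line by line and hand over one logical record at a time. The helper name each_csv_record is hypothetical. It assumes that a record is complete once it contains an even number of double quotes, which holds for well-formed CSV because quotes come in opening/closing or escaped pairs.

```ruby
require "stringio"

# Hypothetical helper: yield one logical CSV record at a time from any IO,
# joining physical lines while a quoted field is still open.
def each_csv_record(io)
  return enum_for(:each_csv_record, io) unless block_given?
  buffer = +""
  io.each_line do |line|
    buffer << line
    # An odd number of quotes so far means a quoted field spans lines.
    next if buffer.count('"').odd?
    yield buffer
    buffer = +""
  end
  raise ArgumentError, "Unterminated quoted field:\n#{buffer}" unless buffer.empty?
end

# StringIO stands in for a File here; File.open(path) would work the same way.
io = StringIO.new(%Q{a,b\n"multi\nline",c\nlast,row\n})
each_csv_record(io) { |record| p record }
```

Each yielded string can then be passed to a per-record parser such as the csv method above, so only one record is held in memory at a time.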

Also, I really prefer to work with CSV by headers, instead of column
indices. That's easier and more robust, in my opinion. You might
want to add some code for that too.
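
Header-based access can be sketched in a few lines once records are already split into fields. The sample rows below are made up, and the first row is assumed to be the header row.

```ruby
# Made-up sample data: the first row holds the column names.
rows = [
  ["name", "email"],
  ["Ann", "ann@example.com"],
  ["Bob", "bob@example.com"],
]

headers, *data = rows
# Pair each field with its header so lookups no longer depend on column order.
records = data.map { |fields| headers.zip(fields).to_h }

records.each { |record| puts record["email"] }
```

If the columns are later reordered in the file, code that looks up record["email"] keeps working, whereas code that reads fields[1] does not.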

Of course, then we're just getting closer and closer to FasterCSV, so
maybe not...

"FasterCSV" is too slow and far too large.

FasterCSV is mostly interface code to make the user experience as
nice as possible. There's also a lot of documentation in there. The
core parser is still way smaller than the standard library's parser.

James Edward Gray II

Peña, Botp · Aug 17, 2007

David said:
http://blog.grayproductions.net/articles/2007/04/16/no-longer-the-fastest-game-in-town

And JEG gave a valuable hint on producing a fast scanner (be it for scanning CSV or whatever): use the humble and underestimated StringScanner...

kind regards -botp
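
A minimal sketch of record scanning with StringScanner, from Ruby's strscan library, might look like this. The method name scan_fields is hypothetical, the comma separator is hard-coded, and this is not the code from the linked blog post; whether it is faster than the regex scan approach would need to be measured.

```ruby
require "strscan"

# Hypothetical helper: split one CSV record into fields with StringScanner.
def scan_fields(record)
  ss = StringScanner.new(record.chomp)
  fields = []
  loop do
    if ss.scan(/"([^"]*(?:""[^"]*)*)"/)
      # Quoted field: group 1 is the content, with doubled quotes to unescape.
      fields << ss[1].gsub('""', '"')
    else
      # Unquoted field: may be empty.
      fields << ss.scan(/[^,"]*/)
    end
    break if ss.eos?
    # After a field, only a separator is allowed.
    ss.scan(/,/) or raise ArgumentError, "Bad csv record at offset #{ss.pos}"
  end
  fields
end

p scan_fields(%q{1,"x, y","say ""hi"""})  # => ["1", "x, y", "say \"hi\""]
p scan_fields(",")                        # => ["", ""]
```

The scanner keeps its own position, so each pattern is tried only at the current offset and no trailing comma has to be appended to the record.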

Bertram Scharpf · Aug 17, 2007

Hi,

On Friday, 17 Aug 2007, 09:44:58 +0900, William James wrote:

That is a dishonest comment.

Coding is a kind of sport to me. Besides, it is not my
decision what you do with your time.

I often recode things that are already written and
well proven. Sometimes my code is better, sometimes I learn
from comparing it. Sometimes it just gives me new ideas.

So go on and do what you like; I think that's still the main
purpose of open source. Though it might not have been a
truly valuable contribution to this list, I had fun
reading your solution.

Bertram

bilyy · Aug 20, 2007

Just a pointer to yet another CSV parsing regex: http://snippets.dzone.com/posts/show/4430

Cheers,

b.
