Fastest CSV parsing?

William James

This is the best I've come up with so far. It should handle any CSV
record (i.e., fields may contain commas, double quotes, and newlines).

class String
  def csv
    if include? '"'
      ary =
        "#{chomp},".scan( /\G"([^"]*(?:""[^"]*)*)",|\G([^,"]*),/ )
      raise "Bad csv record:\n#{self}" if $' != ""
      ary.map{|a| a[1] || a[0].gsub(/""/,'"') }
    else
      ary = chomp.split( /,/, -1)
      ## "".csv ought to be [""], not [], just as
      ## ",".csv is ["",""].
      if [] == ary
        [""]
      else
        ary
      end
    end
  end
end
 
James Edward Gray II

You are pretty much rewriting FasterCSV here. Why do that when we
could just use it instead?

James Edward Gray II
 
William James

James said:
You are pretty much rewriting FasterCSV here. Why do that when we
could just use it instead?


That is a dishonest comment.

What if someone had said to you when you released "FasterCSV":
"You are pretty much rewriting CSV here. Why do that when we
could just use it instead?"

Parsing CSV isn't very difficult.
"FasterCSV" is too slow and far too large. People don't need
to be installing it on their systems when a few lines of code
will do the job.

Why do you want people to be dependent on your slow, bloated
code? You perhaps think that if there is an alternative,
you won't be paid any more money.
 
David Mullet

William said:
What if someone had said to you when you released "FasterCSV":
"You are pretty much rewriting CSV here. Why do that when we
could just use it instead?"

Point made. However...

William said:
Why do you want people to be dependent on your slow, bloated
code? You perhaps think that if there is an alternative,
you won't be paid any more money.

From JEG2's own blog post:

"If your number one concern when working with CSV data in Ruby is raw
speed, you might want to know that FasterCSV is no longer the fastest
option."

http://blog.grayproductions.net/articles/2007/04/16/no-longer-the-fastest-game-in-town

Your code may or may not be faster -- you've offered no comparison.
Regardless, I doubt that JEG2 was trying to stifle your efforts; just
suggesting that you may want to avoid reinventing the wheel.
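For anyone who wants the comparison asked for above, here is a minimal sketch of how it could be measured. It assumes Ruby 1.9 or later, where the standard csv library is the former FasterCSV code, so CSV.parse_line stands in for FasterCSV. regex_csv is a hypothetical standalone restatement of the String#csv method from the first post, and the sample line and iteration count are arbitrary. No timings are claimed here; run it on your own data.

```ruby
require 'benchmark'
require 'csv'

# Hypothetical standalone copy of the regex approach from the first post,
# written as a plain method so that String is not reopened.
def regex_csv(record)
  record = record.chomp
  unless record.include?('"')
    fields = record.split(',', -1)
    return fields.empty? ? [""] : fields
  end
  fields = "#{record},".scan(/\G"([^"]*(?:""[^"]*)*)",|\G([^,"]*),/)
  # $' holds whatever followed the last successful match; anything left
  # over means the record could not be consumed completely.
  raise "Bad csv record:\n#{record}" if $' != ""
  fields.map { |quoted, plain| plain || quoted.gsub('""', '"') }
end

# Arbitrary sample record and repetition count (assumptions for this sketch).
line = %Q{1,"Smith, John","He said ""hi""",42\n}
n = 10_000

Benchmark.bm(8) do |bm|
  bm.report("regex:") { n.times { regex_csv(line) } }
  bm.report("CSV:")   { n.times { CSV.parse_line(line) } }
end
```

Note that the two parsers do not agree on every input (the csv library returns nil for unquoted empty fields, for example), so a fair comparison should use records that both handle identically.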

David
 
Tim Hunter

William said:
Why do you want people to be dependent on your slow, bloated
code? You perhaps think that if there is an alternative,
you won't be paid any more money.

Hmmm...and here I was thinking that FasterCSV was free software. Have
you identified some way that James is making money from it? Have you
identified some way that using FasterCSV is hurtful?

William, there's no need to be so angry. We're all here to help each other.
 
James Edward Gray II

William said:
That is a dishonest comment.

Not honest? I guess I'm not sure how you meant that.

FasterCSV's parser uses a very similar regular expression. Quoting
from the source:

# prebuild Regexps for faster parsing
@parsers = {
  :leading_fields =>
    /\A(?:#{Regexp.escape(@col_sep)})+/,     # for empty leading fields
  :csv_row =>
    ### The Primary Parser ###
    / \G(?:^|#{Regexp.escape(@col_sep)})     # anchor the match
      (?: "((?>[^"]*)(?>""[^"]*)*)"          # find quoted fields
          |                                  # ... or ...
          ([^"#{Regexp.escape(@col_sep)}]*)  # unquoted fields
          )/x,
    ### End Primary Parser ###
  :line_end =>
    /#{Regexp.escape(@row_sep)}\z/           # safer than chomp!()
}

I felt they were similar enough to say you were recreating it. I can
live with it if you don't agree though.

William said:
What if someone had said to you when you released "FasterCSV":
"You are pretty much rewriting CSV here. Why do that when we
could just use it instead?"

They did. I said it was too slow and I didn't care for the
interface, though some do prefer it. Pretty much what you just said
to me, so I look forward to using your EvenFasterCSV library on my
next project.

William said:
Parsing CSV isn't very difficult.

Yeah, it's not too tough.

I'm a little bothered by how your solution makes me slurp the data
into a String though. Today I was working with a CSV file with over
35,000 records in it, so I'm not too comfortable with that. You
might consider adding a little code to ease that.
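One way to avoid slurping could look like the sketch below. It is not code from either library: each_csv_record is a hypothetical helper, and it assumes well-formed CSV in which quotes inside a field are doubled, so that a complete record always contains an even number of double quotes.

```ruby
require 'stringio'

# Hypothetical helper: yields one complete raw record at a time, so only
# the current record is held in memory. An odd number of double quotes in
# the buffer means a quoted field is still open, so the next line belongs
# to the same record.
def each_csv_record(io)
  buffer = +""
  io.each_line do |line|
    buffer << line
    next if buffer.count('"').odd?
    yield buffer
    buffer = +""
  end
  raise "Unterminated quoted field:\n#{buffer}" unless buffer.empty?
end

# StringIO stands in for a large file here; File.open(path) would work the
# same way because both respond to each_line.
data = StringIO.new(%Q{a,b\n"multi\nline",c\n"say ""hi""",d\n})
each_csv_record(data) { |record| p record }
```

Each yielded string can then be handed to a single-record parser such as the csv method from the first post.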

Also, I really prefer to work with CSV by headers, instead of column
indices. That's easier and more robust, in my opinion. You might
want to add some code for that too.
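Header access can be layered on top of any parser that returns arrays of fields. A minimal sketch, assuming the first record is the header row; rows_by_header is a hypothetical name and the sample data is made up for illustration.

```ruby
# Hypothetical helper: takes already-parsed records (arrays of fields),
# treats the first one as the header row, and returns one Hash per
# remaining record, keyed by header.
def rows_by_header(records)
  headers, *rows = records
  rows.map { |fields| headers.zip(fields).to_h }
end

parsed = [
  ["name", "city"],
  ["Ann",  "Oslo"],
  ["Bob",  "Lima"]
]

# Fields are looked up by name, so reordering the columns in the file
# does not break this code.
rows_by_header(parsed).each { |row| puts row["city"] }
```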

Of course, then we're just getting closer and closer to FasterCSV, so
maybe not...

William said:
"FasterCSV" is too slow and far too large.

FasterCSV is mostly interface code to make the user experience as
nice as possible. There's also a lot of documentation in there. The
core parser is still way smaller than the standard library's parser.

James Edward Gray II
 
Peña, Botp

On Behalf Of David Mullet:
# http://blog.grayproductions.net/articles/2007/04/16/no-longer-the-fastest-game-in-town

and jeg gave a valuable hint on producing a fast scanner (be it
scanning csv or whatever), by using the humble and underestimated
StringScanner...

kind regards -botp
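Ruby's standard strscan library provides StringScanner. The following is a minimal sketch of scanning one CSV record with it; it is not the code from the blog post or from FasterCSV, and scan_csv is a hypothetical name chosen for illustration.

```ruby
require 'strscan'

# Hypothetical helper: walks one record with StringScanner, consuming
# either a quoted field or an unquoted field at each step, then the
# separating comma.
def scan_csv(record)
  scanner = StringScanner.new(record.chomp)
  fields = []
  loop do
    if scanner.scan(/"((?:[^"]|"")*)"/)
      fields << scanner[1].gsub('""', '"')  # unescape doubled quotes
    else
      fields << scanner.scan(/[^,"]*/)       # may be the empty string
    end
    break if scanner.eos?
    # Anything other than a comma here means the record is malformed.
    scanner.scan(/,/) or raise "Bad csv record:\n#{record}"
  end
  fields
end

p scan_csv(%Q{1,"Smith, John","He said ""hi""",42\n})
```

Because StringScanner keeps its own position and only matches at that position, there is no need for the \G anchor or for appending a trailing comma to the record.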
 
Bertram Scharpf

Hi,

On Friday, 17 Aug 2007, 09:44:58 +0900, William James wrote:
That is a dishonest comment.

Coding is a kind of sport to me. Besides that, it is not my
decision what you do with your time.

I often recode things that are already written and
well proven. Sometimes my code is better, sometimes I learn
from comparing it. Sometimes it just brings me new ideas.

So go on and do what you like; I think that's still the main
purpose of open source. Though it might not have been a
truly valuable contribution to this list, I had fun
reading your solution.

Bertram
 
bilyy

Just a pointer to yet another CSV parsing regex: http://snippets.dzone.com/posts/show/4430

Cheers,

b.
 
