CSV parsing by Paul Hsieh and Michael B. Allen?


Ramon F Herrera

http://groups.google.com/group/comp...w+to+best+parse+a+CSV&rnum=1#4e38340aa824bee0
http://tinyurl.com/29q4kf

Michael & Paul (or anyone who can help):

I have been looking for a solid implementation of CSV parsing and
found the thread above, which includes pointers to programs written by
both of you.

In my first attempt at building them, I failed miserably, and I
noticed that both implementations seem to be written for the Microsoft
side of the world, with perhaps some allowance for a hybrid
environment (Cygwin or something?).

I would like both programs (each one is valuable and very
useful in its own right) to build and compile neatly under Linux and
all other *ix OSs. In the past, I have handed the job of porting
things like this (OSS programs of varying levels of "buildability")
to the sure-fire method of GNU configure, which (almost) never fails to
compile, build and install. No hassles.

Before I embark on the task of finding the programmer in question, I
would like to know the current status of both programs. Where is the
latest source that is closest to Unix (and hopefully devoid of any
Windowcisms)?

Any tips and suggestions are most welcome.

TIA,

-Ramon F Herrera


http://groups.google.com/group/comp.unix.programmer/browse_frm/thread/5862e14b3ea250e5/#
http://tinyurl.com/2rknnd
 

websnarf

Michael & Paul (or anyone who can help):

I have been looking for a solid implementation of CSV parsing and
found the thread above, which includes pointers to programs written by
both of you.

In my first attempt at building them, I failed miserably, and I
noticed that both implementations seem to be written for the Microsoft
side of the world, with perhaps some allowance for a hybrid
environment (Cygwin or something?).

I would like both programs (each one is valuable and very
useful in its own right) to build and compile neatly under Linux and
all other *ix OSs. In the past, I have handed the job of porting
things like this (OSS programs of varying levels of "buildability")
to the sure-fire method of GNU configure, which (almost) never fails to
compile, build and install. No hassles.

Before I embark on the task of finding the programmer in question, I
would like to know the current status of both programs.

The pointer to my program should be up to date and functioning. The
makefile is meant for WATCOM C/C++. However, creating your own
makefile for *nix should be totally trivial. You just need to compile
and link the modules bstrlib.o and csvparse.o from their corresponding
C files. The official Bstrlib library also contains this parser in
the examples archive, which includes a makefile for Linux.
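For anyone attempting the port, a minimal Unix makefile along the lines described might look like the following sketch. The source file names are taken from the post; the output name `csvtest` is made up for illustration:

```make
# Hypothetical makefile for building Paul's parser on Linux/*nix.
# Assumes bstrlib.c, bstrlib.h, and csvparse.c are in the current
# directory; the target name "csvtest" is a guess.
CC     = cc
CFLAGS = -O2 -Wall

csvtest: csvparse.o bstrlib.o
	$(CC) $(CFLAGS) -o $@ csvparse.o bstrlib.o

csvparse.o: csvparse.c bstrlib.h
bstrlib.o: bstrlib.c bstrlib.h

clean:
	rm -f csvtest *.o
```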

The first version of my parser was actually based on the logic of
Michael's parser, but significantly augmented for correctness.
However, even then it was far from perfect. Last year I did a total
rewrite of the program to match the correct CSV grammar that I
derived. It appears to be bug free, it is fast, and it is easy to
understand. I don't have any idea if Michael's parser is correct
these days (it used to fail at reading quoted CRs), but it certainly
was not at the time he originally wrote it. I think his is also
limited in the number of columns that can be correctly read.
[...] Where is the latest source which is closer to Unix (and hopefully
devoid of any Windowcism)?

For mine, only the makefile has "Windowcisms" (actually DOSisms).
The .c and .h sources are portable.
 

Malcolm McLean

Ramon F Herrera said:
http://groups.google.com/group/comp...w+to+best+parse+a+CSV&rnum=1#4e38340aa824bee0
http://tinyurl.com/29q4kf

Michael & Paul (or anyone who can help):

I have been looking for a solid implementation of CSV parsing and
found the thread above, which includes pointers to programs written by
both of you.

In my first attempt at building them, I failed miserably, and I
noticed that both implementations seem to be written for the Microsoft
side of the world, with perhaps some allowance for a hybrid
environment (Cygwin or something?).

I would like both programs (each one is valuable and very
useful in its own right) to build and compile neatly under Linux and
all other *ix OSs. In the past, I have handed the job of porting
things like this (OSS programs of varying levels of "buildability")
to the sure-fire method of GNU configure, which (almost) never fails to
compile, build and install. No hassles.

Before I embark on the task of finding the programmer in question, I
would like to know the current status of both programs. Where is the
latest source that is closest to Unix (and hopefully devoid of any
Windowcisms)?

Any tips and suggestions are most welcome.
If you don't like Paul's you can try mine. It's on my website under Fuzzy
Logic Trees. However, I think Paul has made a more thorough job of his. Mine
was just to get data into the learning tool.
Mine uses NaN to signal missing data fields, so there are two inevitably
non-portable lines to generate and test NaNs.
 

Michael B Allen

Wow I don't frequent usenet much anymore but happen to today and see my
name come up!

Anyway, last I checked mine compiled fine on a number of platforms. I
recommend just downloading the "libmba" library that it is part of and
running 'make'. Otherwise, if you're trying to compile just the csv.c
file by itself you would need to strip out the error macros.
I don't have any idea if Michael's parser is correct these days (it
used to fail at reading quoted CRs), but it certainly was not at the
time he originally wrote it.

AFAIK mine is perfectly compliant. Mine handles quoted newlines
or carriage-return+newline. I don't know about a stand-alone
carriage-return. That would be pretty strange input but I think it would
handle that as well.

Mike
 

Roland Pibinger

I recommend just downloading the "libmba" library that it is
part of and running 'make'.

BTW, what is the overall status of the "libmba" library. It looks very
promising. Is it production-ready? Is it maintained?
 

Robert Gamble

http://groups.google.com/group/comp.lang.c/browse_frm/thread/86a3ddf0...
http://tinyurl.com/29q4kf

Michael & Paul (or anyone who can help):

I have been looking for a solid implementation of CSV parsing and
found the thread above, which includes pointers to programs written by
both of you.

In my first attempt at building them, I failed miserably, and I
noticed that both implementations seem to be written for the Microsoft
side of the world, with perhaps some allowance for a hybrid
environment (Cygwin or something?).

I would like both programs (each one is valuable and very
useful in its own right) to build and compile neatly under Linux and
all other *ix OSs. In the past, I have handed the job of porting
things like this (OSS programs of varying levels of "buildability")
to the sure-fire method of GNU configure, which (almost) never fails to
compile, build and install. No hassles.

Before I embark on the task of finding the programmer in question, I
would like to know the current status of both programs. Where is the
latest source that is closest to Unix (and hopefully devoid of any
Windowcisms)?

Any tips and suggestions are most welcome.

I have written what I believe is a solid CSV parser that is written in
portable ANSI C89 available at http://sourceforge.net/projects/libcsv/.
It can both write and parse CSV files, handles carriage returns and/or
linefeeds, is fast, well-tested, and actively maintained. It doesn't
require any special libraries or extensions, and can be compiled as a
library or as an object file to link directly into your program. Perl
and Ruby interfaces also exist for it. It uses a simple callback
mechanism for parsing which is easier to use than any other parser
interface I've seen so far.

Robert Gamble
 

Ben Pfaff

Ramon F Herrera said:
I have been looking for a solid implementation of CSV parsing and
found the thread above, which includes pointers to programs written by
both of you.

_The Practice of Programming_ by Kernighan and Pike uses CSV
parsing as an extended example. You might find their code and
their discussion useful and informative.
 

Michael B Allen

BTW, what is the overall status of the "libmba" library. It looks very
promising. Is it production-ready? Is it maintained?

It is no longer maintained, no. What we use now is an evolved version
that's no longer OSS. But I don't recall finding any major bugs since
the fork so, by induction, I think libmba should be pretty solid.
I just don't have time for OSS stuff much anymore.

There were some modules that I never really used (cfg and daemon) and as
such they're probably crap. Some modules I use but would like to rewrite
(linkedlist and varray implementations seem inelegant in hindsight). Some
modules I think are exceptionally good (suba is wildly useful). The
hashmap implementation is, I think, one of the best you'll find on the
net. It's small, fast as hell, handles a large range of input and has been
pounded in production (at least the version we use now has). The
csv implementation is very solid mostly because of that flamewar I had
with Paul (Thanks Paul). I get positive feedback about the diff module
once in a while but I don't think we actually use it here at the moment.

Note that I specifically organized things so that it would be possible
to pluck out one or a few modules, so that people can eliminate the
library dependency and just incorporate the code into their project
directly (the MIT license is one step down from public domain). In many
cases doing so means simply removing those MSGNO macros and changing
allocator_{alloc,realloc,free} to malloc, realloc and free. I know of
a number of projects that have done that.

Mike
 

websnarf

_The Practice of Programming_ by Kernighan and Pike uses CSV
parsing as an extended example. You might find their code and
their discussion useful and informative.

A brief search revealed the following:

http://cm.bell-labs.com/cm/cs/tpop/csvgetline2.c

(This is source code from "The Practice of Programming" book.) From
main he calls straight to csvgetline, which then scans the input until
an end-of-line condition is read, ignoring quotes at this level.
Therefore quoted CRs and LFs are not processed correctly. This is
basically where Michael B. Allen started his parser, but at least he
had everything in one tight switch statement.

As further criticism, notice that he makes no attempt to guard against
overflow from the line:

maxline *= 2;

The problem with this will become readily apparent to anyone working
on a modern 64-bit x86 system (one that uses 32-bit ints) who wants to
work with enormous files. maxline might easily wrap around if a
single entry were more than 4GB in size. It makes sense if you don't
want to support a single entry of that size, but rather than having a
controlled failure condition, the program just goes off into UB-land.

The program also uses writable statics. This is a no-no for anyone
trying to write modular code. This is not a "practice" anyone should
be following. Multi-threading is useful, and you can't do that if you
are writing to statics. Also, one of the records could itself be
CSV-encoded data, so writable statics preclude calling the parser
recursively.

In short this book should be retitled: "Never ever program like
this." Or "How to cost the industry time and long term headaches".
Or "Programming without foresight".
 

CBFalconer

.... snip about "Practice of Programming" ...
In short this book should be retitled: "Never ever program like
this." Or "How to cost the industry time and long term headaches".
Or "Programming without foresight".

You are complaining largely because he doesn't treat cr/lf as
specials. They may well not exist in some text systems. They
remain useful to delimit actual lines. There is a large world out
there besides PCs.
 

websnarf

(e-mail address removed) wrote:

... snip about "Practice of Programming" ...


You are complaining largely because he doesn't treat cr/lf as
specials.

What the hell are you talking about? That isn't the issue (learn to
read, dammit!) Their code actually *DOES* seem to do something about
the various possible CR+LF combinations. They don't deal with
*QUOTED* CR/LFs properly.
[...] They may well not exist in some text systems. They
remain useful to delimit actual lines. There is a large world out
there besides PCs.

That response doesn't even make sense. If they had screwed up the
CR+LF handling, as you thought I was accusing them of, then the
conclusion that there is a large world out there besides PCs would be
an argument *against* their erroneous implementation.

Either way, the code they wrote is such an obvious hack job. They
did not start with a grammar that they mapped to or anything like
that. They just had some poor idea about how CSV files were structured
and did a programmatic breakdown that cannot be properly mapped to a
correct CSV grammar. This is truly an example of really bad code.
 

Richard Bos

What the hell are you talking about? That isn't the issue (learn to
read damnit!) Their code actually *DOES* seem to do something about
the various possible CR+LF combinations. They don't deal with
*QUOTED* CR/LFs properly.

Since quoted line endings within CSV files are an abomination, anyway -
according to some people legal, but even so a legal abomination - I do
not see this as a particularly bad failure.
Either way, the code they wrote is such an obvious hack job. They
did not start with a grammar that they mapped to or anything like
that. They just had some poor idea about how CSV files were structured
and did a programmatic breakdown that cannot be properly mapped to a
correct CSV grammar. This is truly an example of really bad code.

Because, after all is said and done, we all have a lot more respect for
the coding prowess of Paul "C is inferior" Hsieh than for that of
Kernighan and Pike, don't we?

Richard
 

websnarf

Since quoted line endings within CSV files are an abomination, anyway -
according to some people legal, but even so a legal abomination - I do
not see this as a particularly bad failure.

An abomination? Listen, you retard, CSV is an ingenious file format
*IF* it is completely and correctly parseable. You can literally put
an encoded CSV file as a single entry *IN* a CSV file. That makes it
useful as a recursive serialized output format. If you can't parse
something as simple as an embedded CR+LF, then you won't be able to
support this. If you think that's esoteric, what if you just want to
do something as simple as storing mail messages?
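For a concrete illustration of the recursion argument, here is a made-up CSV file whose second record carries an entire two-row CSV table, newlines intact and quotes doubled, as a single field:

```
id,payload
1,"name,age
Ann,34
Bob,29"
2,"x,""quoted"",y"
```

A reader that splits on raw line endings first sees five records here; a quote-aware parser sees three (a header and two data rows).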

Listen. These people have the arrogance and gall to call this book
"Practice of Programming". They clearly omitted the *design* part of
programming this parser. So they are not espousing anything
approximating *GOOD* programming. I don't have this book, so if it's
meant as a twisted parody, or a farcical send-up of just how badly
people do program, well, fine. But then the title of the book is
misleading.
Because, after all is said and done, we all have a lot more respect for
the coding prowess of Paul "C is inferior" Hsieh than for that of
Kernighan and Pike, don't we?

Shoot the messenger much? They wrote a book, they have a reputation,
and they have written embarrassingly bad code. This has nothing to do
with any inferiority complex you might have.
 

Richard Heathfield

(e-mail address removed) said:
An abomination? Listen you retard, CSV is an ingenious file format
*IF* it is completely and correctly parseable.

If you cannot make your point without insulting your opponent, then it's
probable that you don't have much of a point.

You owe Richard Bos an apology.
 

CBFalconer

Richard said:
Since quoted line endings within CSV files are an abomination,
anyway - according to some people legal, but even so a legal
abomination - I do not see this as a particularly bad failure.


Because, after all is said and done, we all have a lot more respect
for the coding prowess of Paul "C is inferior" Hsieh than for that
of Kernighan and Pike, don't we?

Indubitably. After all, he manages to use the assumption
"sizeof(int) >= 32" regularly. He has also mastered reading (but
unfortunately not absorption). He seems to have missed the
importance of escaped characters.

--
If you want to post a followup via groups.google.com, ensure
you quote enough for the article to make sense. Google is only
an interface to Usenet; it's not Usenet itself. Don't assume
your readers can, or ever will, see any previous articles.
More details at: <http://cfaj.freeshell.org/google/>
 

Francine.Neary

Indubitably. After all, he manages to use the assumption
"sizeof(int) >= 32" regularly. He has also mastered reading (but
unfortunately not absorption). He seems to have missed the
importance of escaped characters.

What kind of monster machine was he using for debugging where that
assumption was valid?
 

Keith Thompson

Listen. These people have the arrogance and gall to call this book
"Practice of Programming". They clearly omitted the *design* part of
programming this parser. So they are not espousing anything
approximating *GOOD* programming.

Uh huh ...
I don't have this book,
[snip]

I see.

If you obtain a copy of the book and read it (try a library), I might
consider paying attention to your opinion.

You might also want to consider contacting the authors. The book's
web page, <http://cm.bell-labs.com/cm/cs/tpop/>, includes links to
their personal home pages; you should be able to contact them either
directly or through their publisher. If you do choose to contact
them, you might consider being less rude than you are on Usenet.
 

Chris Dollin

Keith said:
Listen. These people have the arrogance and gall to call this book
"Practice of Programming". They clearly omitted the *design* part of
programming this parser. So they are not espousing anything
approximating *GOOD* programming.

Uh huh ...
I don't have this book,
[snip]

I see.

If you obtain a copy of the book and read it (try a library), I might
consider paying attention to your opinion.

Given websnarf's tirade, I took the trouble to reread chapter 4 of tPoP
on the train over the weekend. Almost everything I've seen him complain
about [in the code] is addressed in the text, either directly or indirectly.
One can disagree with some of their decisions, but they /are/ decisions,
with contexts and reasons and tradeoffs, not mindless just-do-this
prescriptions.

tPoP is not perfect, but it's a useful and illuminating book nevertheless.
 

Michael B Allen

Given websnarf's tirade, I took the trouble to reread chapter 4 of tPoP
on the train over the weekend. Almost everything I've seen him complain
about [in the code] is addressed in the text, either directly or indirectly.
One can disagree with some of their decisions, but they /are/ decisions,
with contexts and reasons and tradeoffs, not mindless just-do-this
prescriptions.

tPoP is not perfect, but it's a useful and illuminating book nevertheless.

I hate to encourage bad behavior but usenet karma be damned I have to
agree with websnarf about this.

I have the greatest respect for Kernighan and Pike. C and UNIX are
still great application platforms today and I think they will be for a
long time. The simplicity and elegance of C and UNIX is quite frankly
unmatched by anything.

However, they're not doing anyone any good with this code (at least the
code in the link posted by websnarf [1]). In fact I think they're probably
doing more harm than good. Modern code must be reentrant or it's basically
useless. The cited example desperately needs an object to hold the state
of the parse. And for a CSV parser a state machine is clearly the only
way to go. The code may be good from a language lawyer perspective
but the overall organisation and design is just awful. For the sake of C
programmers everywhere I hope the rest of the book isn't anything like it.

Mike

[1] http://cm.bell-labs.com/cm/cs/tpop/csvgetline2.c
 
