How to best parse a CSV data file and do a lookup in C?

Johnny Google

Here is an example of the type of data from a file I will have:

Apple,4322,3435,4653,6543,4652
Banana,6934,5423,6753,6531
Carrot,3454,4534,3434,1111,9120,5453
Cheese,4411,5522,6622,6641

The first position is the info (the product) I want to retrieve for the
corresponding code. Assume that the codes are unique for each product
and that all of a product's codes are on one line.

So - I know the code is '9120' and I want to read the file line by line
and build an array for each line, separating on the commas.

Is there a simple way to do this using a library or something already
created for reading simple data files?

Trying to convert from my perl code to C ...

In perl this would be about 10 or so lines...


$prod_code = '9120';

find_product($prod_code);

sub find_product {
    $find_code = shift;
    foreach $line (read_file($datafile)) {
        @data = split /,/, $line;    # split the line into an array
        foreach $code (@data) {
            # find a match from the list of elements
            if ($code == $find_code) {
                return $data[0];     # return the first element in the list
            } # end if
        } # end foreach
    } # end foreach
} # end sub


I would imagine that this is something that is done often enough that
there is a library or some already-built solution? But I am willing to
learn how to do this in longhand C... I just didn't want to re-invent
the wheel.

BTW, I could change the file format to suit a better solution in C if
there is one (e.g. Apple=1234,5677,3432, using an = after the first
element).

Thanks,

John
 
Robert Gamble

Here is an example of the type of data from a file I will have:

Apple,4322,3435,4653,6543,4652
Banana,6934,5423,6753,6531
Carrot,3454,4534,3434,1111,9120,5453
Cheese,4411,5522,6622,6641

The first position is the info (the product) I want to retrieve for the
corresponding code. Assume that the codes are unique for each product
and that all of a product's codes are on one line.

So - I know the code is '9120' and I want to read the file line by line
and build an array for each line, separating on the commas.

Is there a simple way to do this using a library or something already
created for reading simple data files?

Trying to convert from my perl code to C ...

In perl this would be about 10 or so lines...


$prod_code = '9120';

find_product($prod_code);

sub find_product {
    $find_code = shift;
    foreach $line (read_file($datafile)) {
        @data = split /,/, $line;    # split the line into an array
        foreach $code (@data) {
            # find a match from the list of elements
            if ($code == $find_code) {
                return $data[0];     # return the first element in the list
            } # end if
        } # end foreach
    } # end foreach
} # end sub

<OT>
This is about as inefficient as you can get. You can simply read the file
once and use a hash to find a product code quickly.

Something like this to read the data:

my %product_hash;

while (defined($_ = <FILE>)) {    # assuming FILE is open
    my ($product, @data) = split /,/;
    foreach my $code (@data) {
        $product_hash{$code} = $product;
    }
}

To find a product from a code you would just do:

$product = $product_hash{$code};

This is much faster, uses the built-in hash functionality, and you only
have to read through the file once.
</OT>
I would imagine that this is something that is done often enough that
there is a library or some already-built solution? But I am willing to
learn how to do this in longhand C... I just didn't want to re-invent
the wheel.

The concept is the same as above: write a routine to parse the data
(trivial) and use a hash to store the information. When you want to
find a product from a product code, look it up in the hash. C doesn't
supply hashing functionality like Perl does, so you can either write your
own, which is pretty straightforward (most books on C cover the basics), or
use one of the many freely available hash implementations on the internet,
including a couple of libraries written by people who frequent this group,
such as hashlib by CBFalconer at:

http://cbfalconer.home.att.net/download/
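
For illustration, here is a minimal sketch of the hash approach in C.
This is not hashlib -- the table size, hash function, and names
(add_code, find_product) are made up for the example, and error
handling is minimal:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 211                /* small prime, plenty for this data */

struct entry {
    char *code;                       /* product code, e.g. "9120"   */
    char *product;                    /* product name, e.g. "Carrot" */
    struct entry *next;               /* collision chain             */
};

static struct entry *table[TABLE_SIZE];

/* Bernstein-style string hash; any reasonable string hash will do. */
static unsigned hash(const char *s)
{
    unsigned h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h % TABLE_SIZE;
}

/* malloc'd copy of a string (strdup is not Standard C). */
static char *dupstr(const char *s)
{
    char *p = malloc(strlen(s) + 1);
    if (p)
        strcpy(p, s);
    return p;
}

/* Record that 'code' belongs to 'product'; returns 0 on failure. */
static int add_code(const char *code, const char *product)
{
    unsigned h = hash(code);
    struct entry *e = malloc(sizeof *e);
    if (e == NULL)
        return 0;
    e->code = dupstr(code);
    e->product = dupstr(product);
    e->next = table[h];
    table[h] = e;
    return e->code != NULL && e->product != NULL;
}

/* Return the product for 'code', or NULL if it isn't in the table. */
static const char *find_product(const char *code)
{
    struct entry *e;
    for (e = table[hash(code)]; e != NULL; e = e->next)
        if (strcmp(e->code, code) == 0)
            return e->product;
    return NULL;
}

int main(void)
{
    const char *p;

    /* In the real program these pairs would come from parsing the file. */
    add_code("4322", "Apple");
    add_code("1111", "Carrot");
    add_code("9120", "Carrot");

    p = find_product("9120");
    printf("%s\n", p ? p : "not found");    /* prints "Carrot" */
    return 0;
}

In the real program the add_code calls would come from a loop that reads
the file line by line and tokenizes each line (as discussed below); after
that, every lookup is a single find_product call regardless of file size.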
BTW, I could change the file format to suit a better solution in C if
there is one (e.g. Apple=1234,5677,3432, using an = after the first
element).

Thanks,

John

Rob Gamble
 
Johnny Google

I don't need the advice on the perl side ... it was purely for
demonstrative purposes, to show the concept; the solution I need is in
C. BTW, when there are only a handful of records, the difference between
your perl solution and the one I typed on the fly is negligible. Even
if the file was longer, I can't see it taking any measurable amount of
time more to read until found than to read the entire file into a hash;
I've done it both ways over the years.

But the point here is I need to do this in C - not in perl - and not
another concept.

Thanks,

John
 
Johnny Google

The main point here is I need help with either an already-created
library which handles data files like this, or some C equivalent to

@data = split /,/, $line;


Since I think I know how to read the contents, loop through the
array in C, and do the strcmp to test if it matches, I just don't
know how to break the line into an array by telling it what
the delimiter is (as in the above split example).

Thanks,

John
 
Robert Gamble

The main point here is I need help with either an already-created
library which handles data files like this, or some C equivalent to

@data = split /,/, $line;


Since I think I know how to read the contents, loop through the
array in C, and do the strcmp to test if it matches, I just don't
know how to break the line into an array by telling it what
the delimiter is (as in the above split example).

Thanks,

John

The only thing provided by Standard C for this purpose is strtok, which
will probably work fine for your needs. Take a look at that and come back
if you have any questions about it.

Rob Gamble
 
Robert Gamble

The only thing provided by Standard C for this purpose is strtok, which
will probably work fine for your needs. Take a look at that and come back
if you have any questions about it.

Rob Gamble

Didn't have time to post this earlier:

The prototype for strtok is in string.h and looks like:

char *strtok(char *s, const char *delim);

Here is a sample program demonstrating its usage on your data:

#include <stdio.h>
#include <string.h>

int main (void) {
    char example[] = "Apple,4322,3435,4653,6543,4652";
    char *ptr;

    ptr = strtok(example, ",");
    printf("Product: %s\n", ptr);

    while ((ptr = strtok(NULL, ",")) != NULL)
        printf("Next token: %s\n", ptr);

    return 0;
}
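
On the sample line this prints "Product: Apple" followed by a
"Next token:" line for each of the five codes.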

There are some things to note about strtok:
1. It modifies the first argument and thus cannot be used on constant
strings. If you don't want your data munged, make a copy and pass the
copy to strtok.
2. The function keeps static internal state between calls and is thus not
re-entrant.

<OT>
If you are using a POSIX system you can use the thread-safe version,
strtok_r (this is not Standard C) if you need to.
</OT>

Rob Gamble
 
Jens.Toerring

Johnny Google said:
Here is an example of the type of data from a file I will have:

The first position is the info (the product) I want to retrieve for the
corresponding code. Assume that the codes are unique for each product
and that all of a product's codes are on one line.
So - I know the code is '9120' and I want to read the file line by line
and build an array for each line, separating on the commas.
Is there a simple way to do this using a library or something already
created for reading simple data files?
Trying to convert from my perl code to C ...
In perl this would be about 10 or so lines...

$prod_code = '9120';

sub find_product {
    $find_code = shift;
    foreach $line (read_file($datafile)) {
        @data = split /,/, $line;    # split the line into an array
        foreach $code (@data) {
            # find a match from the list of elements
            if ($code == $find_code) {
                return $data[0];     # return the first element in the list
            } # end if
        } # end foreach
    } # end foreach
} # end sub

Here's a program that should do the trick (well, modulo reading a
whole line and building an array of product IDs, but that's not
what seems to be needed when I go by your Perl code). One restriction
is that the maximum length of the items you try to read must be
known in advance (to be adjusted via the MAX_FIELD_WIDTH macro).
A second restriction is that it won't work correctly in all cases
if the last line doesn't end with a newline. A third restriction is
that no additional whitespace is supposed to appear anywhere in
the input file.

-----------8<----------------------------------------

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_FIELD_WIDTH 128

int main( int argc, char *argv[ ] )
{
    FILE *fp;
    char prod_info[ MAX_FIELD_WIDTH + 1 ];
    char buf[ MAX_FIELD_WIDTH + 1 ];
    char fmt[ 30 ];
    char c;
    int prod_info_comes_next = 1;

    if ( argc < 3 )
    {
        fprintf( stderr, "Missing argument(s), need file name "
                 "and product ID\n" );
        return EXIT_FAILURE;
    }

    if ( ( fp = fopen( argv[ 1 ], "r" ) ) == NULL )
    {
        fprintf( stderr, "Can't open file '%s'\n", argv[ 1 ] );
        return EXIT_FAILURE;
    }

    sprintf( fmt, "%%%d[^,\n]%%c", MAX_FIELD_WIDTH );

    while ( fscanf( fp, fmt, buf, &c ) == 2 )
    {
        if ( c != '\n' && c != ',' )
            break;

        if ( prod_info_comes_next )
            strcpy( prod_info, buf );
        else if ( ! strcmp( buf, argv[ 2 ] ) )
        {
            printf( "Found product: %s\n", prod_info );
            return EXIT_SUCCESS;
        }

        prod_info_comes_next = c == '\n' ? 1 : 0;
    }

    if ( ! feof( fp ) )
        fprintf( stderr, "Invalid input or read error\n" );
    else
        fprintf( stderr, "Product ID %s not found\n", argv[ 2 ] );

    fclose( fp );
    return EXIT_FAILURE;
}

-----------8<----------------------------------------

If you want to change that into a function to be used in a larger
program and you want to return the 'prod_info' array, make sure
that you either change its type to "static char" (otherwise the
memory for it will vanish the moment you leave the function) or
make it a char pointer and allocate memory for it in the function
(which the calling function then has to deallocate).
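
To illustrate the two options (a toy example, not part of the program
above; the function names are made up):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Variant 1: static array.  Simple, but every call overwrites the
   previous result, and the function is not re-entrant. */
const char *get_name_static( void )
{
    static char name[ 32 ];
    strcpy( name, "Carrot" );
    return name;
}

/* Variant 2: allocated array.  Every call returns fresh memory,
   which the caller must free(). */
char *get_name_alloc( void )
{
    char *name = malloc( 32 );
    if ( name != NULL )
        strcpy( name, "Carrot" );
    return name;
}

int main( void )
{
    char *p = get_name_alloc( );
    printf( "%s %s\n", get_name_static( ), p ? p : "(alloc failed)" );
    free( p );
    return 0;
}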

The only slightly tricky bit is the format string for fscanf().
With a value of 128 for MAX_FIELD_WIDTH it will be "%128[^,\n]%c",
making fscanf() read a maximum of 128 characters as long as they
are neither a comma nor a '\n', plus one more character that must
be a comma or a newline.
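
(For example, with the data shown at the top of the thread saved as
'data.txt', running the compiled program as './a.out data.txt 9120'
should print "Found product: Carrot".)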

Regards, Jens
 
John Bode

Johnny Google said:
Here is an example of the type of data from a file I will have:

Apple,4322,3435,4653,6543,4652
Banana,6934,5423,6753,6531
Carrot,3454,4534,3434,1111,9120,5453
Cheese,4411,5522,6622,6641

The first position is the info (the product) I want to retrieve for the
corresponding code. Assume that the codes are unique for each product
and that all of a product's codes are on one line.

So - I know the code is '9120' and I want to read the file line by line
and build an array for each line, separating on the commas.

Is there a simple way to do this using a library or something already
created for reading simple data files?

Trying to convert from my perl code to C ...

In perl this would be about 10 or so lines...


$prod_code = '9120';

find_product($prod_code);

sub find_product {
    $find_code = shift;
    foreach $line (read_file($datafile)) {
        @data = split /,/, $line;    # split the line into an array
        foreach $code (@data) {
            # find a match from the list of elements
            if ($code == $find_code) {
                return $data[0];     # return the first element in the list
            } # end if
        } # end foreach
    } # end foreach
} # end sub


I would imagine that this is something that is done often enough that
there is a library or some already-built solution? But I am willing to
learn how to do this in longhand C... I just didn't want to re-invent
the wheel.

BTW, I could change the file format to suit a better solution in C if
there is one (e.g. Apple=1234,5677,3432, using an = after the first
element).

Thanks,

John


There are library functions to help with this, but even so it's going
to take more than 10 lines. Perl abstracts out a lot of stuff.

If you're going to load the file contents into memory before
searching, you can arrange it in a manner that's much more efficient
to search, either as hashes or trees. If you're going to be searching
on disk, then you'd really want to restructure how the data are stored
(basically implement the tree or hash on disk), but that's probably a
bit beyond the scope of what you're looking for.

So, assuming we're searching the disk directly, here's the C
equivalent of the above code. I always define functions before their
first reference; you don't have to do this, but you need to have at
least a function prototype in scope before it is referenced. THIS
CODE CONTAINS LITTLE TO NO ERROR CHECKING AND IS UNTESTED:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

/*
** Read next line from file.  We're making the always
** dangerous assumption that the target buffer is large
** enough to hold the longest possible line.
*/
int getNextLine(FILE *stream, char *buffer, size_t bufsize)
{
    int r;
    char fmt_string[16];

    /*
    ** Build a format like "%132[^\n]" -- read at most bufsize-1
    ** characters up to (but not including) the newline, then let
    ** the trailing "\n" directive eat the newline itself.
    */
    sprintf(fmt_string, "%%%d[^\n]\n", (int)(bufsize - 1));
    r = fscanf(stream, fmt_string, buffer);
    if (r == EOF)
    {
        fprintf(stderr, "End of file reached\n");
        return 0;
    }
    else if (r == 0)
    {
        fprintf(stderr, "Unable to read input line\n");
        return 0;
    }

    return 1;
}

/*
** Split the input line into a name and array of codes.
** Again, we are going to make a dangerous assumption
** that our buffers are always big enough to handle
** everything.
*/
int split(char *buffer, char *product, int *codes, size_t numcodes)
{
    char *token;
    size_t i = 0;

    token = strtok(buffer, ",");
    if (!token)
    {
        fprintf(stderr, "Badly formed input buffer\n");
        return 0;
    }

    strcpy(product, token);

    while (i < numcodes && (token = strtok(NULL, ",")))
    {
        /*
        ** We're also making the assumption that all codes are
        ** strictly numeric, and that none of the file entries
        ** contain non-numeric characters.  Read up on strtol()
        ** for how to detect bad input.
        */
        codes[i++] = strtol(token, NULL, 10);
    }

    /*
    ** Zero any unused slots so match_code can safely scan the
    ** whole array.
    */
    while (i < numcodes)
        codes[i++] = 0;

    return 1;
}

/*
** Find the given code in the code list.
*/
int match_code(int *codelist, int find_code, size_t numcodes)
{
    size_t i;
    for (i = 0; i < numcodes; i++)
    {
        if (codelist[i] == find_code)
        {
            return 1;
        }
    }

    return 0;
}

/*
** Search the input file for the product code.
*/
int search_file(FILE *stream, int find_code, char *match)
{
    char inbuf[133];
    int codes[30];
    size_t linenum = 0;

    while (getNextLine(stream, inbuf, sizeof inbuf))
    {
        linenum++;
        if (!split(inbuf, match, codes, sizeof codes / sizeof *codes))
        {
            fprintf(stderr,
                    "Warning: Could not parse input line %lu: \"%s\"\n",
                    (unsigned long)linenum, inbuf);
        }
        else
        {
            if (match_code(codes, find_code, sizeof codes / sizeof *codes))
            {
                return 1;
            }
        }
    }
    return 0;
}

/*
** Main driver.
*/
int main (int argc, char **argv)
{
    char filename[256];
    FILE *stream;
    int find_code = 0;
    char product[31];

    if (argc >= 3)
    {
        strcpy(filename, argv[1]);
        find_code = strtol(argv[2], NULL, 10);
    }
    else
    {
        printf("Gimme the file name: ");
        fflush(stdout);
        scanf("%255s", filename);
        printf("Gimme the product code: ");
        fflush(stdout);
        scanf("%d", &find_code);
    }

    stream = fopen(filename, "r");
    if (stream)
    {
        if (search_file(stream, find_code, product))
        {
            printf("Matching product for code %d: %s\n", find_code,
                   product);
        }
        fclose(stream);
    }

    return 0;
}
 
Paul Hsieh

Ok, Johnny Google. I've written a usable CSV parser here:

http://www.pobox.com/~qed/bcsv.zip

It can, in fact, parse the complete CSV standard (which includes
quoting, so you can put commas, leading/trailing spaces, and in fact
quotes themselves in any given field) and thus should easily be able
to handle the case you are concerned with.

Using strtok() is rarely ever the right solution. strtok() includes a
hidden side effect which makes using it in a re-entrant way
impossible. Correct CSV parsing, in fact, does *NOT* reduce to a
simple line-by-line pass followed by strtok with a "," separator
(though that may match the sample data given by the OP.)

You should use strcspn() instead -- it essentially provides the same
functionality as strtok() without its negative side effects (strcspn()
does not affect re-entrancy).
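
For example, here is a minimal sketch of splitting on commas with
strcspn(), leaving the source string untouched (illustrative code,
not taken from bcsv):

#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *line = "Carrot,3454,4534,3434,1111,9120,5453";
    const char *p = line;

    /* strcspn returns the length of the initial span containing no
       delimiter, so we can walk field by field without writing to
       the string. */
    for (;;) {
        size_t len = strcspn(p, ",");
        printf("field: %.*s\n", (int)len, p);
        if (p[len] == '\0')
            break;
        p += len + 1;    /* step past the comma */
    }
    return 0;
}

Note that this handles only the unquoted case; quoted fields need the
full state machine.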
 
Robert Gamble

Ok, Johnny Google. I've written a usable CSV parser here:

http://www.pobox.com/~qed/bcsv.zip

It can, in fact, parse the complete CSV standard (which includes
quoting, so you can put commas, leading/trailing spaces, and in fact
quotes themselves in any given field) and thus should easily be able
to handle the case you are concerned with.


Using strtok() is rarely ever the right solution. strtok() includes a
hidden side effect which makes using it in a re-entrant way
impossible. Correct CSV parsing, in fact, does *NOT* reduce to a
simple line-by-line pass followed by strtok with a "," separator
(though that may match the sample data given by the OP.)

It may rarely be the right solution, but it will work fine here. I don't
know what is "hidden" about the side effect, as it is well documented and I
pointed out the gotchas.

True CSV parsing is admittedly more complicated, but based on the sample
data and the code provided by the OP, strtok is a fine solution.
You should use strcspn() instead -- it essentially provides the same
functionality as strtok() without its negative side effects (strcspn()
does not affect re-entrancy).

Not even close. The strcspn function can be used to write a strtok-like
function, but it does not provide the same functionality by itself and is
certainly not a replacement.

Rob Gamble
 
pete

Robert said:
It may rarely be the right solution, but it will work fine here.
I don't know what is "hidden" about the side effect, as it is well
documented and I pointed out the gotchas.

The standard doesn't say anything special about strtok
in terms of reentrancy.

7.1.4 Use of library functions
[#4] The functions in the standard library are not
guaranteed to be reentrant and may modify objects with
static storage duration.
Not even close. The strcspn function can be used to write a strtok-like
function, but it does not provide the same functionality by itself and is
certainly not a replacement.

strchr can be used to write strcspn- and strspn-like functions.
A strchr-like function can be written in the C language proper.
A re-entrant version of a strtok-like function can be written
with an extra parameter that points to the object
which takes the place of the static object in strtok:

char *str_tok_r(char *s1, const char *s2, char **s3);
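
A sketch of such a function against that prototype (intended to behave
like the POSIX strtok_r; like strtok, it writes '\0' into s1):

#include <stdio.h>
#include <string.h>

char *str_tok_r(char *s1, const char *s2, char **s3)
{
    char *start, *end;

    if (s1 == NULL)
        s1 = *s3;                      /* resume where we left off */
    start = s1 + strspn(s1, s2);       /* skip leading delimiters  */
    if (*start == '\0')
        return NULL;                   /* nothing left             */
    end = start + strcspn(start, s2);  /* find end of this token   */
    if (*end != '\0')
        *end++ = '\0';                 /* terminate the token      */
    *s3 = end;                         /* remember resume point    */
    return start;
}

int main(void)
{
    char line[] = "Banana,6934,5423,6753,6531";
    char *state, *tok;

    for (tok = str_tok_r(line, ",", &state); tok != NULL;
         tok = str_tok_r(NULL, ",", &state))
        printf("token: %s\n", tok);
    return 0;
}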
 
Richard Bos

pete said:
The standard doesn't say anything special about strtok
in terms of reentrancy.

What it does say about strtok(), though, makes it harder to use
re-entrantly, and impossible to use in functions which must themselves
be recursive. For example, if your string parsing function uses
strtok(), it cannot call itself to parse a sub-string.

Richard
 
Randy Howard

Ok, Johnny Google. I've written a usable CSV parser here:

http://www.pobox.com/~qed/bcsv.zip

It can, in fact, parse the complete CSV standard (which includes
quoting, so you can put commas, leading/trailing spaces, and in fact
quotes themselves in any given field) and thus should easily be able
to handle the case you are concerned with.

Is there actually a "standard" for CSV? I was under the impression that
there was a wide variety of "similar" implementations by various vendors,
with subtle differences in formatting, which made it difficult to have
a one-size-fits-all CSV implementation.

http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm

Not trying to argue, just curious how your implementation attempts to
deal with variations in CSV usage.
 
websnarf

Randy said:
Paul Hsieh says...

Is there actually a "standard" for CSV? I was under the impression
that there was a wide variety of "similar" implementations by various
vendors, with subtle differences in formatting, which made it
difficult to have a one-size-fits-all CSV implementation.

http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm

Not trying to argue, just curious how your implementation attempts to
deal with variations in CSV usage.

I started with the partial standard (csv.txt) written by Robert J.
Lynch, originally from www.wotsit.org, then scoured the web for example
csv files to test against. I noted the holes, and appended this
information to that file -- it covers every anomaly I am aware of
and coincides with StarOffice's StarCalc implementation for every file
I tried. The csv.txt file is included in the zip archive I originally
gave (http://www.pobox.com/~qed/bcsv.zip).

The site you quote
(http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm) appears to
specify nearly the same standard I implemented. They seem to prefer LF
to CR as the line separator; however, I found the exact opposite -- CR
is the delimiter I saw used in many sample files. So my
parser/specification requires CR as the delimiter, while immediately
following LFs (the DOS/Win convention) are ignored unless quoted. I
would also add that UTF-8 clearly requires no change from any binary
octet ASCII-based parser, but UTF-16 requires special encoding. With
UTF-16, you need to first make a choice about the octet encoding (i.e.,
pick an endianness) as normal, but then quote that properly as binary
data within the CSV format (remembering that comma, quote and the CR LF
characters may show up in this octet stream as data within a single
field.) Of course you could just do the whole thing in
Unicode/wchar_t to begin with, but technically there's no functional
advantage to doing that over UTF-8 parsing (remembering that, even
amongst illegal Unicode code points, UTF-8 is a superset of both UTF-16
and Unicode itself.)

As to the potential inability for there to be a truly universal
standard, I would just say that I downloaded many CSV files from the
web with interesting encodings (my parser reads all of them) and tried
to be exhaustive in my handling of anomalies in choosing my
implementation. I did not find that there were actually conflicting
encodings outside of well-known bugs of some programs like Borland
Paradox (which cannot even correctly parse some of its own output.) To
work around the Paradox problem, I just made my parser *survive* a
Paradox-encoded file, rather than try to definitively extract its data
as originally intended (as I recall, there was no way to do this in a
provably correct way.)
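
To make the quoting rule concrete, here is a minimal sketch of a
quote-aware field reader. This is illustrative only -- it is not bcsv,
and it sidesteps the CR/LF discussion above by treating '\n' as the
row terminator:

#include <stdio.h>

/* Read one CSV field from fp into buf.  A field wrapped in double
   quotes may contain commas, newlines, and doubled quotes ("" for ").
   Returns the character that ended the field: ',', '\n', or EOF. */
int read_field(FILE *fp, char *buf, size_t size)
{
    size_t n = 0;
    int c = getc(fp);

    if (c == '"') {                          /* quoted field */
        while ((c = getc(fp)) != EOF) {
            if (c == '"') {
                c = getc(fp);
                if (c != '"')                /* lone quote: field ends */
                    break;
                /* doubled quote: keep a single '"' as data */
            }
            if (n + 1 < size)
                buf[n++] = (char)c;
        }
    } else {                                 /* unquoted field */
        while (c != EOF && c != ',' && c != '\n') {
            if (n + 1 < size)
                buf[n++] = (char)c;
            c = getc(fp);
        }
    }
    buf[n] = '\0';
    return c;
}

int main(void)
{
    char field[64];
    int d;
    FILE *fp = tmpfile();

    if (fp == NULL)
        return 1;
    fputs("\"Smith, John\",4322,\"say \"\"hi\"\"\"\n", fp);
    rewind(fp);

    do {
        d = read_field(fp, field, sizeof field);
        printf("[%s]\n", field);   /* [Smith, John] [4322] [say "hi"] */
    } while (d == ',');
    return 0;
}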
 
Michael B Allen

I started with the partial standard (csv.txt) written by Robert J. Lynch,
originally from www.wotsit.org, then scoured the web for example csv
files to test against. I noted the holes, and appended this information
to that file -- it covers every anomaly I am aware of and coincides
with StarOffice's StarCalc implementation for every file I tried. The
csv.txt file is included in the zip archive I originally gave
(http://www.pobox.com/~qed/bcsv.zip).

Which I believe was derived from my (much simpler) csv_row_parse
function here:

http://www.ioplex.com/~miallen/libmba/dl/src/csv.c
http://www.ioplex.com/~miallen/libmba/dl/docs/ref/csv.html#row_parse
The site you quote
(http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm) appears to
specify nearly the same standard I implemented. They seem to prefer LF
to CR as the line separator; however, I found the exact opposite -- CR
is the delimiter I saw used in many sample files. So my
parser/specification requires CR as the delimiter, while immediately
following LFs (the DOS/Win convention) are ignored unless quoted. I

Mmm. I chose LF. When combined with the CSV_TRIM flag it clips the CR
though. I suppose I should adjust my code to accept either.
would also add that UTF-8 clearly requires no change from any binary
octet ASCII-based parser

Although one might be conscientious about possible signedness errors. I
always use unsigned char throughout.
, but UTF-16 requires special encoding. With
UTF-16, you need to first make a choice about the octet encoding (i.e.,
pick an endianness) as normal, but then quote that properly as binary
data within the CSV format (remembering that comma, quote and the CR LF
characters may show up in this octet stream as data within a single
field.) Of course you could just do the whole thing in
Unicode/wchar_t to begin with, but technically there's no functional
advantage to doing that over UTF-8 parsing (remembering that, even
amongst illegal Unicode code points, UTF-8 is a superset of both UTF-16
and Unicode itself.)

I chose to provide an equivalent wchar_t function and there is an
advantage to doing that -- Windows uses wchar_t for Unicode and does
not support UTF-8 very well.

Mike
 
Paul Hsieh

Michael B Allen said:

You misstate the history. I was *intending* on basing bcsv on your
original source code, but did *NOT* because I wrote the code before
you made your URL public knowledge. Then when you made your code
available, I looked at it, but rejected it because it obviously did
not parse CSV files correctly. In that thread someone posted that
these files are used for generic databases, thus it does not make
sense to have fixed limits on the row or column count in any way. I
transformed my solution into a completely incremental one. My solution
is based on first deriving a CSV standard, implementing it, and
checking it against a reasonable (and included) test suite (as well as
a few other files I didn't include.)

Your solution and mine are very different and accomplish different
things -- mine can read and parse *ANY* CSV file, yours can only read
CSV files that have some maximum number of columns and/or rows that
you need to establish before you start parsing the CSV data
(presumably only CSV files that you output).

You echo James Antill's error regarding the "simplicity" of your
solution. Your solution was *WRONG* (I was not able to check James'
since his library only compiles with gnutools). I don't know if your
current solution is correct, but I have no motivation to examine this
(if you've screwed up the \n \r thing, then I guess the answer is no).
My more complicated solution is that way for the sole reason of
making it *CORRECT* -- as far as I know it reads *ALL* CSV files (and
generates results that are comparable to StarOffice). James Antill
thought that the only reason I made my code complicated was for
performance (owing to the fact that bcsv kicked the living daylights
out of both your solution and his; a quick examination of your current
version shows that this almost certainly continues to be the case) but
this is patently untrue. It's complicated because it *HAS* to be in
order to be correct while having reasonable performance.

I'll leave it to you to discover how I was (and very likely still am)
able to beat your performance (it's not really related to the
complexity of my solution). Remember, your original point was to show
that you could outperform a dynamic string library solution (which
James and I implemented) with a raw char * solution. I have no
worries that you will ever be able to substantiate such a preposterous
claim: see http://bstring.sf.net/features.html
Mmm. I chose LF. When combined with the CSV_TRIM flag it clips the CR
though. I suppose I should adjust my code to accept either.

Trimming is not optional; you have to do it. That's something I never
understood about your solution.
Although one might be conscientious about possible signedness errors. I
always use unsigned char throughout.

The only escape, delimiter, or whitespace characters are "positive"
-- the other characters are just raw data. Dealing with "signedness"
is up to the UTF-8 consumer, and thus does not involve the CSV parser
at all.
I chose to provide an equivalent wchar_t function and there is an
advantage to doing that -- Windows uses wchar_t for Unicode and does
not support UTF-8 very well.

I am not aware of any CSV files in existence that have been encoded in
wchar_t or UTF-16. UTF-8/UTF-16 parsers/encoders are trivial to
implement (far simpler than a CSV parser). So I have no idea why you
would do this unless your intention was to establish some sort of
alternative pseudo-CSV file format. Does MS-Excel or StarOffice read
such CSV output?
 
Michael B Allen

Your solution and mine are very different and accomplish different
things -- mine can read and parse *ANY* CSV file, yours can only read
CSV files that have some maximum number of columns and/or rows that you
need to establish before you start parsing the CSV data (presumably only
CSV files that you output).

Well, usually tabular data is fixed width. I really don't see a huge
benefit to being able to parse an arbitrary number of columns. Yes, it is
a plus, but then again it forces you to allocate memory that needs to be
freed by the user. My csv module doesn't allocate memory at all, which
makes the implementation simpler and easier for the user. I could
also just use 'unsigned char *row[16383];' if I really didn't know how
many columns I was getting.
[bcsv is] complicated because it *HAS* to be in
order to be correct while having reasonable performance.

Baaaaalonie! You dumped the FSM switch for case labels with gotos
and used non-portable macros instead of the standard character class
functions. The whole thing was optimized for pure speed and as a result it
was waaaay more complicated. No one would ever want to have to debug that
code. I'm not going to be so bold as to claim that my code is "correct"
but if it's not (doubt it), I'm very confident I can fix it easily with
minimal changes without breaking something else.
I'll leave it to you to discover how I was (and very likely still am)
able to beat your performance (it's not really related to the complexity
of my solution).
Balonie.

Trimming is not optional; you have to do it. That's something I never
understood about your solution.

Huh? Why is trimming not optional?
I am not aware of any CSV files in existence that have been encoded in
wchar_t or UTF-16. UTF-8/UTF-16 parsers/encoders are trivial to
implement (far simpler than a CSV parser). So I have no idea why you
would do this unless your intention was to establish some sort of
alternative pseudo-CSV file format. Does MS-Excel or StarOffice read
such CSV output?

I wasn't suggesting the file was in a particular encoding. Ultimately on
Windows everything is (should be) wchar_t, so a regular char function is
of limited use. My csv module doesn't care about files or how they're
encoded. It's very reasonable to imagine someone working on Windows
converting input buffers to wchar_t and then wanting to parse them as
CSV from there. You don't even know that the whole file is CSV -- it
might just be a fragment of a more complex file.

Mike
 
Paul Hsieh

Michael B Allen said:
Well, usually tabular data is fixed width.

It's fixed *AFTER* you read the data, not before.
[...] I really don't see a huge benefit to being able to parse an
arbitrary number of columns.

It's called universality, portability, interoperability. Whatever you
want to call it. Otherwise, you cannot advertise a capability of
parsing CSV files, you can only advertise a capability of parsing
*SOME* CSV files, in which case you might as well go back to your
earlier solution which assumed every carriage return terminated a row
(it was even "simpler"; it also could only read *SOME* CSV files).
[...] Yes, it is
a plus, but then again it forces you to allocate memory that needs to be
freed by the user.

A plus? I only see it as the difference between correct and incorrect.
Containers should scale to the size of their data, where the data size
is at the option of the data creator. Everything else is just a
buffer overflow or arbitrary truncation.
[...] My csv module doesn't allocate memory at all, which
makes the implementation simpler and easier for the user. I could
also just use 'unsigned char *row[16383];' if I really didn't know how
many columns I was getting.

But that wouldn't work for a CSV file that contains 16384 columns. It
also uses far *MORE* net memory than my solution on average.
[bcsv is] complicated because it *HAS* to be in
order to be correct while having reasonable performance.

Baaaaalonie! You dumped the FSM switch for case labels with gotos

This has no negative impact on code complexity. There are labels
instead of case statements, gotos instead of breaks/continues, and I
get to drop the two outer wrapping control mechanisms. I also drop
irrelevant state variable updating.
and used non-portable macros instead of the standard character class
functions.

This has no impact on complexity. The CSV file format is clearly a
binary one that requires 8-bit bytes. Using the
generality of the stuff in limits.h to make the code more portable to
systems that don't use such representations of chars doesn't help in
any way, since CSV data is only defined on its specific embodiment.
[...] The whole thing was optimized for pure speed and as a result it
was waaaay more complicated.

1. It was waaaay more complicated than your *FIRST* version, which
assumed that *EVERY* line corresponded exactly to one row/field, and
used (f)getc to read the data (and yet you still assumed you would
outperform my solution.) Your current code is in between that point
and my code, but you don't provide any kind of assurance of fitness of
your code at all.
2. The performance advantage is solely due to the for-switch() loop
removal. If you knew anything about performance, after having a year
to look at this you would have understood this by now. My splitting
out into so many states allows me to leverage better direct looping,
but I take a corresponding hit on branch mispredictions since I run
around into so many modes. I didn't do this for my health -- it's how
I solved the correctness problem.
3. My archive comes with a test suite. Does yours? What attempt have
you made to verify that your code is correct? Can your parser even
pass the test suite that I provide? When you're done with that,
try: http://tinyurl.com/62kku
4. The "speed optimization" was just reading blocks instead of using
fgetc() and not using the typical for-switch() trap, nothing else I
did in the code matters with respect to performance. Because of using
Bstrlib's bstreams, dropping fgetc() didn't cost me any complexity,
and dropping for-switch just doesn't. What complexity is there is
reflected solely in the complexity of a correct CSV parser.
[...] No one would ever want to have to debug that code.

1. You don't have to if it's correct.
2. You exaggerate the difference in complexity between my code and your
*CURRENT* code. In particular, although I have become quite familiar
with the CSV format, I am not able to determine whether or not your
code is correct by direct examination. I provide far more commenting
in the inside of my loop.
3. I *DID* debug that code. Because each state is clearly labelled,
and because of the other commenting, it was not hard to solve the
problems at each stage of the debugging. Looking back at it again
after a year, it's no less comprehensible.
[...] I'm not going to be so bold as to claim that my code is "correct"

Really? Because I am of mine. Because I tested it, I have a test
suite, and I made an honest and serious effort to make it correct. If,
after such an effort, I can't make fewer than 400 lines of code work
correctly, then there is something seriously wrong.
but if it's not (doubt it), I'm very confident I can fix it easily with
minimal changes without breaking something else.

The more important point is that you don't know. I'm sure it works on
the 2 test cases you tried with your own CSV generator, without
bothering to obtain a specification.
[...] Why is trimming not optional?

Because otherwise you cannot make it work faithfully in the following
scenario:

Original Data -> CSVencoder -> file -> CSVdecoder -> Original Data.

Where the original data is completely unmolested for any pair of
encoders/decoders. This scenario is kind of important to some people
(like anyone who saves their data to CSV files).
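
To make the round-trip requirement concrete, here is a sketch of the
writer side (illustrative, not bcsv's actual code): if the reader trims
unquoted leading/trailing spaces, then a writer that wants such spaces
preserved must quote those fields, and the pair round-trips.

#include <stdio.h>
#include <string.h>

/* Write one field so a conforming reader recovers it exactly: quote
   it whenever it contains a comma, quote, or newline, or has leading
   or trailing spaces; double any embedded quotes. */
void write_field(FILE *fp, const char *s)
{
    size_t len = strlen(s);

    if (strpbrk(s, ",\"\n") != NULL ||
        (len > 0 && (s[0] == ' ' || s[len - 1] == ' '))) {
        putc('"', fp);
        for ( ; *s; s++) {
            if (*s == '"')
                putc('"', fp);       /* "" escapes a quote */
            putc(*s, fp);
        }
        putc('"', fp);
    } else {
        fputs(s, fp);
    }
}

int main(void)
{
    write_field(stdout, " padded ");     /* emitted quoted */
    putc(',', stdout);
    write_field(stdout, "plain");        /* emitted as-is  */
    putc('\n', stdout);
    return 0;
}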
I wasn't suggesting the file was in a particular encoding. Ultimately on
Windows everything is (should be) wchar_t, so a regular char function is
of limited use. My csv module doesn't care about files or how they're
encoded. It's very reasonable to imagine someone working on Windows
converting input buffers to wchar_t and then wanting to parse them as
CSV from there. You don't even know that the whole file is CSV -- it
might just be a fragment of a more complex file.

Ok, well, I viewed CSV as a file format. You clearly think it's
something else, and have some reason for tying your data
representation to Windows.
 
Randy Howard

I started with the partial standard (csv.txt) written by Robert J.
Lynch originally from www.wotsit.org, then scoured the web for example
csv files to test against.

[much snipped]

Thank you, and a reasonable approach. I get so used to the "chapter and
verse" use of "standard" in this group. I suspect you can understand why
I asked.
I noted the holes, and appended this
information to that file -- it covers every anomaly I am aware of
and coincides with StarOffice's StarCalc implementation for every file
I tried.
The csv.txt file is included in the zip archive I originally
gave (http://www.pobox.com/~qed/bcsv.zip).

Have you considered sending your updated version for use on wotsit.org?
 
websnarf

Have you considered sending your updated version for use on
wotsit.org?

Yeah, but it's not really in a publish-worthy form. If I find some time
to look at it more extensively, I might do so.
 
