reading a csv file

Samier

hi,

I'm trying to assign the values of a csv file to a two-dimensional array.
The csv file has multiple lines and each line in the file gets its own
line in the array. I read some posts on that topic in this
group and came up with the following idea for a function

------
int read_in_file (FILE *csvfile, char *b[MAXI][MAXI] )
{
    char zeile[MAXI] ;
    int i = 0, j = 1 ;

    while (fgets(zeile, MAXI, csvfile) != NULL)
    {
        b[i][0] = strtok (zeile, ",") ;
        while ( (b[i][j] = strtok (NULL, ",")) != NULL )
        {
            printf ("b[%d][%d] = %s ", i, j, b[i][j]) ;
            j++ ;
        }
        printf("\n\n") ;
        j = 1 ;
        i++ ;
        printf ("\n In function -->\t%s\n", b[1][1]) ;
    }
    return 0 ;
}
---

Actually this works for every line itself just fine. But as soon as a
second line is read the old information is lost and presumably
overwritten by the new line. This is shown by the last printf statement
for every line greater than 1.

Before I try a solution with malloc to make the information somehow
permanent I would like to know what this group thinks would be a
better solution.

Thanks in advance
sake.
 
Eric Sosman


Your diagnosis of the problem is correct: The pointers
returned by strtok() and assigned to b[i][j] point into the
zeile[] array. When zeile[] is overwritten by the next input
line, the old pointers in b[][] still point to their old
positions in zeile[], but the contents of those positions
have been changed. Even worse: When the function returns, the
zeile[] array ceases to exist, so all the pointers in b[][]
will become invalid.

What you must do is copy the data, putting it somewhere
that will not disappear when the function returns, and where
it will not be overwritten by new input lines. There are lots
of ways to do this; two that come to mind are

- Each time strtok() locates another field within zeile[],
malloc() enough space to hold the field (don't forget the
final '\0'), copy the field into the new space, and set
b[i][j] to point to that new space.

- Each time you read a new line, malloc() enough space to
hold the entire line (again, remember the '\0') and copy
the line from zeile[] to the new space. Then use strtok()
on the new space instead of on zeile[], so the b[i][j]
pointers will point into the long-lived new space instead
of into the short-lived zeile[].
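
The second approach might look roughly like the sketch below. The helper name split_line_copy() and its parameters are made up for illustration; it assumes plain comma-separated fields with no quoting, and it hands the heap copy back to the caller, who has to free() it once the field pointers are no longer needed.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copies `line` to the heap and splits the copy in
 * place. fields[] receives pointers into the copy; *copy_out receives the
 * copy itself so the caller can free() it later. Returns the number of
 * fields stored, or -1 if malloc() fails. */
int split_line_copy(const char *line, char *fields[], int max_fields,
                    char **copy_out)
{
    size_t len = strlen(line) + 1;          /* +1 for the final '\0' */
    char *copy = malloc(len);
    char *tok;
    int n = 0;

    if (copy == NULL)
        return -1;
    memcpy(copy, line, len);

    /* ",\n" also drops the newline that fgets() leaves on the last field */
    for (tok = strtok(copy, ",\n"); tok != NULL && n < max_fields;
         tok = strtok(NULL, ",\n"))
        fields[n++] = tok;

    *copy_out = copy;
    return n;
}
```

Because every field of a line lives in one allocation, releasing a line takes a single free() of the pointer returned through copy_out.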

A few other observations, not related to the management of
the memory:

- fgets() will stop when it has read and stored the '\n' at
the end of a line, *or* when it runs out of room -- which
could happen if a line longer than MAXI-1 characters
appears. You might want to make sure that fgets() did not
stop prematurely by searching for an '\n' in zeile[] --
the strchr() function is an easy way to do this.

- fgets() stores the '\n', so the rightmost field in the
line will include that '\n'. If the line was "A,B,C\n",
the three fields will be "A", "B", and "C\n", and I
imagine you would prefer the last field to be "C" with
no newline. One way to get rid of the '\n' is to store
a '\0' at its location (just after you've made sure it's
actually there; see above). Another way is to use ",\n"
instead of "," as the second argument to strtok().

- strtok() will not work well for some kinds of CSV data.
It will separate "A,B,C" into "A", "B", "C", but if the
input is "X,,Y" strtok() will return "X" and "Y" with no
indication of the empty field between them. As an extreme
case, it will divide "A,,,,,,,,,,,,,,,,,,,,,,,,,Z" into
"A" and "Z" and nothing else.

- strtok() is also ignorant of quoting conventions that many
CSV variants use. If the input is

0140364749,"Lad, A Dog","Terhune, Albert Payson"

strtok() will give the five fields "0140364749", "\"Lad",
" A Dog\"", "\"Terhune", and " Albert Payson\"". If you
want to handle quoted fields (and empty fields), strtok()
is not the right tool for the job.

- Your function always returns zero, with no indication of
success or failure. When fgets() returns NULL, it might
be because it has read all the way to end-of-file, or it
might be because an I/O error occurred. You can use the
feof() and/or ferror() functions to discover what happened.

- Your function also gives the caller no clue about how many
lines were read, nor about how many fields were extracted
from each. The caller might fill the b[][] array with NULL
values before calling the function, and then discover the
actual data "dimensions" by detecting NULLs afterward, but
since your function knows how many lines were read (it's the
value of `i' after the outer loop finishes), it would be
helpful to pass that information back to the caller instead
of making the caller rediscover it. As for the field count
in each line, perhaps your function could provide the NULL
values (from b[i][j] through b[i][MAXI-1]) rather than
requiring every caller to do it.

- The inner loop simply begs to be written as a `for' instead
of as a `while'. The outer loop might benefit from the
same change, too.
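
The fgets() points above could be combined along these lines. This is only a sketch: read_line() and its return codes are invented for the example, not taken from the original function.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: reads one line into buf and strips the '\n'.
 * Returns 1 on success, 0 at end-of-file, -1 on an I/O error and
 * -2 if the line did not fit into buf. */
int read_line(FILE *fp, char *buf, size_t size)
{
    char *nl;

    if (fgets(buf, (int)size, fp) == NULL)
        return ferror(fp) ? -1 : 0;     /* tell an error from plain EOF */

    nl = strchr(buf, '\n');
    if (nl != NULL) {
        *nl = '\0';                     /* overwrite the newline */
        return 1;
    }

    /* No newline stored: either the file's last line lacks one, or the
     * line was too long. (A last line that exactly fills the buffer is
     * reported as too long by this simple check.) */
    return feof(fp) ? 1 : -2;
}
```

A caller can loop while read_line() returns 1, count the iterations to learn how many lines were read, and inspect the final return value to see why the loop stopped.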

I hope this helps.
 
Default User

Samier said:
Before I try a solution with malloc to make the information somehow
permanent I would like to know what this group thinks would be a
better solution.

Yes, to use the code you've shown you'll need to copy over the data.
The strtok() function works by manipulating the original string,
punching in null characters where the separators were found. What you
end up with is a chopped up version of the original string. Then the
next line writes over the buffer. Usually new, smaller strings are
created with malloc, and the strings returned by strtok() are copied to them.

One big concern is that you don't have a general CSV reader. It might
work for your particular problem. The biggest reason that strtok() is
unsuitable for the task comes with empty fields. In a CSV file, those
are represented by consecutive commas:

1,2,,3,4

The strtok() function conflates consecutive delimiters. You would only
get four fields from the above, rather than five (one empty).
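
A small check makes this visible. The helper name count_tokens() is made up for the example. For the line 1,2,,3,4 a CSV reader should see five fields, one of them empty, but strtok() reports only four tokens.

```c
#include <string.h>

/* Hypothetical helper: counts the tokens strtok() finds in s.
 * s must be writable because strtok() modifies it. */
int count_tokens(char *s, const char *delims)
{
    int n = 0;
    char *tok;

    for (tok = strtok(s, delims); tok != NULL; tok = strtok(NULL, delims))
        n++;
    return n;
}
```

Called on a writable copy of "1,2,,3,4" with "," as the delimiter string, it returns 4: the empty field between the two adjacent commas is silently skipped.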

The second problem has to do with fields that contain delimiters. Those
are disambiguated through the use of quotes:

one, "two, three", four

This might or might not be a problem.

If you aren't particularly looking to write your own, but just need
one, there are some out there. Here's one I found. I have not tried
this in any way, so caveats apply.

<http://www.ioplex.com/~miallen/libmba/dl/src/csv.c>




Brian
 
Jens Thoms Toerring

Samier said:
I'm trying to assign the values of a csv file to a two-dimensional array.
The csv file has multiple lines and each line in the file gets its own
line in the array. I read some posts on that topic in this
group and came up with the following idea for a function

It already looks rather strange that you use only 'MAXI' for
the line's maximum length when you on the other hand assume
that there could be as many as 'MAXI' entries in a single
line (or why is the second dimension of the 'b' array also
'MAXI'?). Since all the entries will be separated by commas
you would need at least '2 * MAXI' chars for a line that has
'MAXI' entries, even if each entry should consist of only a
single char...

int i = 0, j = 1 ;
while (fgets(zeile, MAXI, csvfile) != NULL)
{
    b[i][0] = strtok (zeile, ",") ;
    while ( (b[i][j] = strtok (NULL, ",")) != NULL )
    {
        printf ("b[%d][%d] = %s ", i, j, b[i][j]) ;
        j++ ;
    }
    printf("\n\n") ;
    j = 1 ;
    i++ ;
    printf ("\n In function -->\t%s\n", b[1][1]) ;
}
return 0 ;
}
---

Actually this works for every line itself just fine. But as soon as a
second line is read the old information is lost and presumably
overwritten by the new line. This is shown by last printf statement
for every line greater than 1.

And the reason is simply that your 'b' array is an array of pointers
and all that you do is set these pointers to positions within the
'zeile' array. The moment the next line is read in the old content
of 'zeile' is overwritten and the pointers into it you just had set
don't make sense anymore - they now point somewhere into the new
content of 'zeile' and everything of the line you had read in
before is forgotten.

And, probably worse, the moment you leave the function the 'zeile'
array goes out of scope (vanishes) and then the pointers you spent
a lot of work setting up suddenly point to memory that doesn't
belong to you anymore! Thus, once you get out of this function
whatever you set the elements of 'b' to is garbage - and you're not
allowed to use them for anything at all. Using the analogy of
pointers with house numbers you end up with a list of house
numbers in streets that don't exist anymore. What you need is
copies of the houses, and for these you need enough room, not
just a piece of paper.

Before I try a solution with malloc to make the information somehow
permanent I would like to know what this group thinks would be a
better solution.

You could, of course, use a three-dimensional char array (not just
an array of char pointers) with enough room for the longest entry
you expect to read from the file. I.e. if the longest possible entry
is MAX_LEN chars long you would need an array like

char b[ MAXI ][ MAXI ][ MAX_LEN ];

Then you would have to use a function like strcpy() to copy the
strings from the 'zeile' array into the memory belonging to the
(three-dimensional) array (and then you should check carefully
if your assumption about the maximum length of the entries was
indeed correct).
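
A minimal sketch of such a checked copy follows. The values chosen for MAXI and MAX_LEN and the helper name store_field() are assumptions made for the example only.

```c
#include <string.h>

#define MAXI    10   /* assumed: maximum number of rows and of fields per row */
#define MAX_LEN 32   /* assumed: room per field, including the '\0' */

/* Hypothetical helper: copies one field into the table. Returns 0 on
 * success, or -1 if an index is out of range or the field does not fit
 * (nothing is stored in that case). */
int store_field(char b[MAXI][MAXI][MAX_LEN], int i, int j, const char *field)
{
    if (i < 0 || i >= MAXI || j < 0 || j >= MAXI)
        return -1;
    if (strlen(field) >= MAX_LEN)       /* no room left for the '\0' */
        return -1;
    strcpy(b[i][j], field);             /* safe: length checked above */
    return 0;
}
```

With these sizes the table occupies 10 * 10 * 32 = 3200 bytes, and a field that is too long is rejected instead of overflowing its slot.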

That method has its limitations - you can only define arrays
of limited sizes. I have no idea how large 'MAXI' is and how
long the entries in that csv file might be. If 'MAXI' and the
maximum length of entries in the csv file are say 10 then you
will need just 1 kB of memory and things will work fine. But if
it's instead 100 then you already would need 1 MB and that might
be too much (if I remember correctly the standard only guarantees
that an object can be as large as 32767 bytes in C89 and 65535
bytes in C99). Then there's no way around using dynamic memory
allocation.

If you don't also want the sizes of the 2 dimensions of 'b' to be
adapted automatically to what's really in the file (which might
be the most elegant solution) it's relatively simple. You then
need to replace the pointer assignments

b[i][0] = strtok (zeile, ",") ;
while ( (b[i][j] = strtok (NULL, ",")) != NULL )

with a memory allocation plus a strcpy() call - something similar
to the following

while ( i < MAXI && fgets( zeile, sizeof zeile, csvfile ) != NULL )
{
    char *tmp;
    int j = 1;

    tmp = strtok( zeile, "," );
    if ( tmp == NULL )    /* line without any field, skip it */
        continue;
    if ( ! ( b[ i ][ 0 ] = malloc( strlen( tmp ) + 1 ) ) )
    {
        /* Release all memory already allocated and bail out */
    }
    strcpy( b[ i ][ 0 ], tmp );

    while ( j < MAXI && ( tmp = strtok( NULL, "," ) ) != NULL )
    {
        if ( ! ( b[ i ][ j ] = malloc( strlen( tmp ) + 1 ) ) )
        {
            /* Release memory and bail out */
        }
        strcpy( b[ i ][ j ], tmp );
        j++;
    }

    if ( j < MAXI )
        b[ i ][ j ] = NULL;    /* marks the end of this row */

    i++;
}

if ( i < MAXI )
    b[ i ][ 0 ] = NULL;        /* marks the end of the rows */

Some notes: you must check that there are not more entries in a
line than there are elements in a row of the 'b' array and not
more lines than the 'b' array has rows (such a check is already
missing in your original program). And the element following the
last valid one in a row must be set to NULL (unless there were
exactly 'MAXI' entries in that line) since that's the only way to
figure out later how many valid elements there are in a row. The
same holds for the first element of the first row that didn't get
assigned data, to keep track of how many valid rows there are in
the array.

If you allocate memory this way it's then also your responsibility
to release the allocated memory again when it's not needed
anymore, i.e. when you're done with the 'b' array. You need then
to do something like

for ( i = 0; i < MAXI && b[ i ][ 0 ] != NULL; i++ )
    for ( j = 0; j < MAXI && b[ i ][ j ] != NULL; j++ )
        free( b[ i ][ j ] );

Regards, Jens
 
Rich Webb

Yes, to use the code you've shown you'll need to copy over the data.
The strtok() function works by manipulating the original string,
punching in null characters where the separators were found. What you
end up with is a chopped up version of the original string. Then the
next line writes over the buffer. Usually new, smaller strings are
created with malloc, and the strings returned by strtok() are copied to them.

One big concern is that you don't have a general CSV reader. It might
work for your particular problem. The biggest reason that strtok() is
unsuitable for the task comes with empty fields. In a CSV file, those
are represented by consecutive commas:

1,2,,3,4

The strtok() function conflates consecutive delimiters.

This is only the case if the delimiters (here, assuming that "," is
passed each time as the second argument) occur before any non-delimiter
characters. That is, ",,,a,,b,c" and "a,,b,c" each return a pointer to
'a', over-write the comma following 'a' with '\0' and internally
maintain a pointer to the second comma after 'a'. The next call (with
NULL as the first argument) over-writes that second comma with '\0' and
returns a pointer to it (now an empty string), keeping for itself a
pointer to 'b'.

As you say, strtok() isn't by itself the answer to all parsing issues.
 
Dr Malcolm McLean

Before I try a solution with malloc to make the information somehow
permanent I would like to know what this group thinks would be a
better solution.

You're not really programming in C until you're using dynamic memory.
Malloc is essential to give generality to functions, allowing them to
operate on datasets of unknown size. A csv reader is a classic
example. Whilst often you will know the maximum dimensions in advance,
hardcoding these into the program makes it clumsy. It wastes memory,
and it makes it hard to know how many matrix entries are in fact
valid. Also, once you have a general csv file loader, you can cut and
paste it into any program that reads csv files.

To make your program work you would have to declare

static char zeile[MAXLINES][MAXLINELENGTH];

(define the MAXLINES and MAXLINELENGTH symbols yourself).

This creates a 2d buffer big enough to store the characters in the CSV
file. You then use strtok to break it up. However another call to the
csv loading function will overwrite the buffer. It may be the best answer
as an intermediate step to learning how to use malloc(), but it's not
a good way of writing the finished program.
 
chrisbazley

Before I try a solution with malloc to make the information somehow
permanent I would like to know what this group thinks would be a
better solution.

There have been some excellent replies already but I thought I would
throw my own thoughts into the mix. You shouldn't force a particular
memory allocation method on someone writing a program which uses your
function, which you would effectively do by calling malloc inside it.
They might be using their own heap implementation, or no heap at all.

Whenever I am writing a highly-generic function which requires an
output buffer of unknown size, I follow the precedent of the standard
library function 'snprintf'. This provides a straightforward and
flexible interface that everybody understands.

The great thing about 'snprintf' is that the caller can use it in many
ways:

1) Allocate a buffer large enough for typical output. Call the
function to fill the buffer. If buffer overflow occurred then throw
away the buffer and report an error to the user.

2) Call the function with buffer size 0 to find out the required size.
Allocate a buffer of the correct size. Call again to fill the buffer.

3) Allocate a buffer large enough for typical output. Call the
function to fill the buffer. If buffer overflow occurred then throw
away the buffer, allocate one of the correct size, and call again to
fill it. Otherwise shrink the buffer to the right size.

The first usage is simplest and would be fine for many applications,
but arbitrary fixed limits can be irritating. The second usage gets
rid of the fixed limit and provides predictable performance but means
that you always have to parse the input twice. The third method is an
ideal compromise in my opinion, because it avoids the cost of two
passes in most cases, whilst still coping with arbitrary-sized output.
It is especially efficient if you use an object with 'auto' storage
class as the initial buffer because you may be able to avoid
allocating any heap blocks.
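
As an illustration of that calling convention applied to CSV data, here is a minimal sketch. The function csv_get_field() is hypothetical, handles only plain comma-separated fields without quoting, and mirrors the way 'snprintf' reports the length it needed.

```c
#include <string.h>

/* Hypothetical snprintf-style helper: copies field number `index`
 * (0-based) of `line` into buf, writing at most size-1 characters plus
 * a '\0'. Returns the full length of the field, or -1 if the line has
 * no such field. A return value >= size means the copy was truncated.
 * buf may be NULL when size is 0. */
int csv_get_field(const char *line, int index, char *buf, size_t size)
{
    size_t len;

    if (index < 0)
        return -1;
    while (index-- > 0) {
        line = strchr(line, ',');
        if (line == NULL)
            return -1;                  /* fewer fields than requested */
        line++;                         /* step over the comma */
    }

    len = strcspn(line, ",\n");         /* field ends at ',' '\n' or '\0' */
    if (size > 0) {
        size_t n = len < size - 1 ? len : size - 1;
        memcpy(buf, line, n);
        buf[n] = '\0';
    }
    return (int)len;
}
```

Usage 2 then amounts to calling csv_get_field(line, k, NULL, 0), allocating the returned length plus one, and calling again. Usage 3 starts with a small 'auto' array and falls back to an allocated buffer only when the return value is not smaller than the array's size.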

HTH,
 
Phil Carmody

Joe Wright said:
First, .csv (Comma Separated Values) formats are popular but not
really standardized.

Also beware that the escaping/quoting rules are completely mental.
If you want certain characters in a field, then you need to quote
the whole field, and if you want to have quote characters, then
you have to reduplicate them. Or something like that.

Phil
 
Eric Sosman

[...]
The strtok() function conflates consecutive delimiters.

This is only the case if the delimiters (here, assuming that "," is
passed each time as the second argument) occur before any non-delimiter
characters. That is, ",,,a,,b,c" and "a,,b,c" each return a pointer to
'a', over-write the comma following 'a' with '\0' and internally
maintain a pointer to the second comma after 'a'. The next call (with
NULL as the first argument) over-writes that second comma with '\0' and
returns a pointer to it (now an empty string), keeping for itself a
pointer to 'b'.

Would you care to place a small wager? (In other words,
you're wrong, R-O-N-G, wrong. With "," as its second argument,
strtok[*] breaks "a,,b,c" into the three tokens "a", "b", "c",
with nothing, not even an empty string, between "a" and "b".)

[*] The function described in the C Standard, not something
one of our local loonies might cook up.
 
Rich Webb

[...]
The strtok() function conflates consecutive delimiters.

This is only the case if the delimiters (here, assuming that "," is
passed each time as the second argument) occur before any non-delimiter
characters. That is, ",,,a,,b,c" and "a,,b,c" each return a pointer to
'a', over-write the comma following 'a' with '\0' and internally
maintain a pointer to the second comma after 'a'. The next call (with
NULL as the first argument) over-writes that second comma with '\0' and
returns a pointer to it (now an empty string), keeping for itself a
pointer to 'b'.

Would you care to place a small wager? (In other words,
you're wrong, R-O-N-G, wrong. With "," as its second argument,
strtok[*] breaks "a,,b,c" into the three tokens "a", "b", "c",
with nothing, not even an empty string, between "a" and "b".)

[*] The function described in the C Standard, not something
one of our local loonies might cook up.

Actually, pulling out the chapter and verse instead of my <cough>
somewhat faulty memory, you're quite correct.

I'll admit, though, to being quite amused by the hair-on-fire reply.
 
Andrew Poelstra

On 22 Mar 2010 20:58:30 GMT, "Default User" wrote:
[...]
The strtok() function conflates consecutive delimiters.

This is only the case if the delimiters (here, assuming that "," is
passed each time as the second argument) occur before any non-delimiter
characters. That is, ",,,a,,b,c" and "a,,b,c" each return a pointer to
'a', over-write the comma following 'a' with '\0' and internally
maintain a pointer to the second comma after 'a'. The next call (with
NULL as the first argument) over-writes that second comma with '\0' and
returns a pointer to it (now an empty string), keeping for itself a
pointer to 'b'.

Would you care to place a small wager? (In other words,
you're wrong, R-O-N-G, wrong. With "," as its second argument,
strtok[*] breaks "a,,b,c" into the three tokens "a", "b", "c",
with nothing, not even an empty string, between "a" and "b".)

[*] The function described in the C Standard, not something
one of our local loonies might cook up.

Actually, pulling out the chapter and verse instead of my <cough>
somewhat faulty memory, you're quite correct.

I'll admit, though, to being quite amused by the hair-on-fire reply.

Edward Nilges has been around here lately and I'm sure Eric
is just a little edgy.
 
Default User

Rich said:
On Tue, 23 Mar 2010 08:34:59 -0400, Eric Sosman
With "," as its second argument,
strtok[*] breaks "a,,b,c" into the three tokens "a", "b", "c",
with nothing, not even an empty string, between "a" and "b".)

[*] The function described in the C Standard, not something
one of our local loonies might cook up.

Actually, pulling out the chapter and verse instead of my <cough>
somewhat faulty memory, you're quite correct.

The conflation of adjacent delimiters can be good or bad, depending on
what you're doing. In this case, not too helpful. If you're breaking
lines of text into words, then it can be useful.
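
For the CSV case, a splitter that keeps empty fields is not hard to write without strtok(). The following is only a sketch with a made-up name, split_keep_empty(); it still ignores quoting.

```c
#include <string.h>

/* Hypothetical helper: splits s in place at every ',' and keeps empty
 * fields. Stores at most max_fields pointers in fields[] and returns
 * the number stored. */
int split_keep_empty(char *s, char *fields[], int max_fields)
{
    int n = 0;

    while (n < max_fields) {
        size_t len = strcspn(s, ",");   /* length of the current field */

        fields[n++] = s;
        if (s[len] == '\0')             /* that was the last field */
            break;
        s[len] = '\0';                  /* end the field at the comma */
        s += len + 1;                   /* go on after the comma */
    }
    return n;
}
```

Applied to a writable copy of "1,2,,3,4" it stores five pointers, and the third one points to an empty string.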



Brian
 
Richard Harnden

Also beware that the escaping/quoting rules are completely mental.
If you want certain characters in a field, then you need to quote
the whole field, and if you want to have quote characters, then
you have to reduplicate them. Or something like that.

Yeah, you have to double-up each quote you actually want. The whole
field is quote-delimited too.

Quoted strings in csv are kind of like a comma-separated list - only
without the commas: everywhere there should have been a comma is where a
quote would actually go.

for eg:

If you wanted: one " two "" three """

quote every sub-string and put a comma where the quotes really go:
"one "," two ",""," three ","","",""

lose the commas, and:
"one "" two """" three """""""

What could be simpler?

It would be easier if it were possible to escape the quotes that don't
delimit the field. csv doesn't have any concept of escaping, unfortunately.
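
Decoding goes the other way: inside a quoted field, two quotes stand for one, and a single quote ends the field. A minimal sketch follows; the function name unquote_field() is invented here, and real csv dialects differ in the details.

```c
#include <string.h>   /* for NULL */

/* Hypothetical helper: decodes one quoted csv field. src must point at
 * the opening '"'. The decoded text, with each doubled quote collapsed
 * to a single one, is written to dst, which must have room for at least
 * strlen(src) bytes. Returns a pointer just past the closing quote, or
 * NULL if the field is not quoted or never closed. */
const char *unquote_field(const char *src, char *dst)
{
    if (*src != '"')
        return NULL;
    src++;                              /* skip the opening quote */

    while (*src != '\0') {
        if (*src == '"') {
            if (src[1] == '"') {        /* "" stands for one literal quote */
                *dst++ = '"';
                src += 2;
                continue;
            }
            *dst = '\0';                /* a single quote closes the field */
            return src + 1;
        }
        *dst++ = *src++;
    }
    return NULL;                        /* ran off the end: no closing quote */
}
```

Fed the encoded field from the example above, it gives back the text that was wanted in the first place, and the returned pointer shows where to look for the next comma.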
 
Samier

What great replies, thanks a lot to all.
Not only do I now have a clue how to solve the problem, but I also know
about the limitations of strtok and got some useful hints about the
general function design.

Have fun
sake
 
