Question regarding fgets and new lines


Eric Sosman

CBFalconer said:
Why not? If you malloc something, you know you need to free it
when no longer needed. If you use ggets, you know you need to free
the line when no longer needed. This is not a massive mental
leap. Meanwhile you don't have to worry about buffer sizes, etc.

FWIW, I took a somewhat different tack in my own gets()
replacement (I guess everybody writes one, sooner or later).
Mine follows the precedent of things like getenv(): the returned
pointer is only valid until the next call, when the buffer it
points to may be overwritten and/or moved or freed.

This approach has some disadvantages: for example, it would
be a pain to make it thread-friendly. On the other hand, it
localizes all the memory management inside the function, and the
signature `char *getline(FILE*)' is simple enough that even I can
remember it. (The older and feebler my gray cells get, the more
I value simplicity ...)

I don't have a convenient place to post the code, but I'll be
happy to mail it to anyone who's interested.
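Eric's code isn't posted in the thread, but a minimal sketch of a getenv()-style reader with that signature might look like this (the name, growth strategy, and exact error behavior are illustrative guesses, not his actual implementation):

```c
#include <stdio.h>
#include <stdlib.h>

/* getenv()-style reader: the returned pointer is valid only until
 * the next call, which may overwrite, move, or free the buffer.
 * Returns NULL on EOF, I/O error, or allocation failure; any
 * partial line read before a failure is lost (but not leaked). */
char *getline_static(FILE *fp)
{
    static char *buf = NULL;
    static size_t cap = 0;
    size_t len = 0;
    int ch;

    while ((ch = getc(fp)) != EOF && ch != '\n') {
        if (len + 2 > cap) {                /* room for ch and '\0' */
            size_t ncap = cap ? cap * 2 : 64;
            char *nbuf = realloc(buf, ncap);
            if (nbuf == NULL)
                return NULL;
            buf = nbuf;
            cap = ncap;
        }
        buf[len++] = (char)ch;
    }
    if (ch == EOF && len == 0)
        return NULL;                        /* nothing left (or error) */
    if (cap == 0) {                         /* first line was empty */
        buf = malloc(cap = 64);
        if (buf == NULL)
            return NULL;
    }
    buf[len] = '\0';
    return buf;
}
```

As with getenv(), a caller must copy the line somewhere else if it needs to survive the next call.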
 

Richard Heathfield

Eric Sosman said:

I don't have a convenient place to post the code, but I'll be
happy to mail it to anyone who's interested.

I'll be glad to host it for you if you wish. Email works, I think. (If not,
please let me know!!)
 

Richard Tobin

Because responsibilities become unclear. Simple rules like 'whoever
allocates something must deallocate it' don't work any more.

That's not the simple rule. The simple rule is "whoever allocates
something must deallocate it, oh and by the way we've got to have some
way for the user to allocate this object whose size he doesn't know,
maybe he should pass in a size and we'll return null if it's not big
enough, or maybe we'll have another function telling him how big it's
going to be...".

-- Richard
 

CBFalconer

Eric said:
.... snip ...

FWIW, I took a somewhat different tack in my own gets()
replacement (I guess everybody writes one, sooner or later).
Mine follows the precedent of things like getenv(): the returned
pointer is only valid until the next call, when the buffer it
points to may be overwritten and/or moved or freed.

This approach has some disadvantages: for example, it would
be a pain to make it thread-friendly. On the other hand, it
localizes all the memory management inside the function, and the
signature `char *getline(FILE*)' is simple enough that even I can
remember it. (The older and feebler my gray cells get, the more
I value simplicity ...)

When I designed ggets I considered that signature, but I could see
no way of returning appropriate errors for FILE problems, EOF,
and memory allocation problems alike.

I second the motion about easily remembered signatures.
 

Ben Bacarisse

Bill Reid said:
I need to read in a comma separated file, and for this I was going to
use fgets. I was reading about it at http://www.cplusplus.com/ref/ and
I noticed that the document said:

"Reads characters from stream and stores them in string until (num -1)
characters have been read or a newline or EOF character is reached,
whichever comes first."

My question is that if it stops at a new line character (LF?) then how
does one read a file with multiple new line characters?

Another question. The syntax is:

char * fgets (char * string , int num , FILE * stream);

but you have to allot a size for the string before this. Would you just
use the same num as used in the fgets? So char stringexample[num] ?
OK, I've read the other responses to this and they were...shall we
say, regrettable? Except for "pathological" cases, here's all you
need to do:

#define LINEMAX 512

char csv_line[LINEMAX];
FILE *csv_fptr;

<get or create a string here that is the path to the CSV file>

if ((csv_fptr = fopen(csv_filepath, "r")) != NULL) {

    while (fgets(csv_line, LINEMAX, csv_fptr) != NULL) {

        <you can parse out the data from each csv_line right here>

    }

    fclose(csv_fptr);
}

else printf("\nCouldn't open %s", csv_filepath);

And you're done!

I may be missing your point, but CSV files are not "line oriented" so
this does not seem to be the obvious solution. Do you consider CSV
records with embedded newlines to be "pathological" and can thus be
ignored (like you do for long lines) or are you saying that the loop
pattern above can be extended to deal with these easily?

When I last had to do this (not in C so I won't post the code) it
seemed easier in the long run to use a small state machine as
another poster has suggested.
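The state machine Ben alludes to can be quite small. One illustrative sketch (hypothetical name, and deliberately simplified: it tracks only whether we are inside a quoted field, and leaves doubled "" escapes to the caller):

```c
#include <stdio.h>

/* Read one CSV record -- which may span several physical lines --
 * into buf (size bytes, always '\0'-terminated, silently truncating
 * overlong records).  The only state is whether we are currently
 * inside a quoted field.  Returns the record length, or -1 at end
 * of input. */
long read_csv_record(FILE *fp, char *buf, size_t size)
{
    int ch, in_quotes = 0;
    size_t len = 0;

    while ((ch = getc(fp)) != EOF) {
        if (ch == '"')
            in_quotes = !in_quotes;   /* entering/leaving a quoted field */
        else if (ch == '\n' && !in_quotes)
            break;                    /* a real record boundary */
        if (len + 1 < size)
            buf[len++] = (char)ch;
    }
    buf[len] = '\0';
    return (ch == EOF && len == 0) ? -1 : (long)len;
}
```

An embedded newline inside quotes is simply stored like any other character; only a newline outside quotes ends the record.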
 

websnarf

Roland said:
but you have to allot a size for the string before this. Would you just
use the same num as used in the fgets? So char stringexample[num] ?

Somehow you are just supposed to know the length. You have to guess --
usually you just overestimate or something like that. If it's too small
then you get truncated results.

Not necessarily. You only need to know if you are done (if the line is
entirely read) or not.

Right, but if you read a '\0' from your input then knowing this is not
as easy as it sounds.
[...] If not, read again until the rest of the line
is read. Your code basically becomes a loop. Just assume that the
buffer is always too small to read the line in one pass.

If your code is a loop, first of all, how/where are you storing each
iteration, and second of all, why not write a raw loop around fgetc() in
the first place?
Live with, not against, your limits.

WTF? First of all, the perception that you should use fgets() because
you are limited to using just that is completely bogus. The C language
is general enough that suffering through the weaknesses of fgets() is
completely unnecessary. Second of all, there is a name for people who
simply accept limitations without question; they are called sheep.
 

websnarf

Roland said:
"The storage has been allocated within fggets ... Freeing of assigned
storage is the callers responsibility".

This programming style is not used by the Standard C library

True, if you ignore the existence of calloc, realloc and malloc
themselves (or even fopen).
[...] (and other well-known libraries).

With the exception of scientific/numeric (and some crypto) libraries
this is almost certainly false. I would claim that any ADT library for
C that provides containers uses this paradigm.
[...] I'd be reluctant to use it in my programs.

There are other grounds for being unsatisfied with that code, but
merely its use of implicit allocation is not one of them. If you are
looking for a completely clean, maximally flexible and portable line
input library you can get one here:

http://www.pobox.com/~qed/userInput.html

If you have some clever way of encoding the input incrementally as you
go without storing to memory (such as crypto-hashing passwords) then
you can even avoid the use of malloc if you want.
 

websnarf

Keith said:
Exactly. For any resource, there needs to be a way to allocate it and
a way to release it. For raw chunks of memory, the allocation and
deallocation routines are "malloc" and "free". For stdio streams,
they're called "fopen" and "fclose". For the ggets interface (if I
understand it correctly), they're called "ggets" and "free".

It might not have been a bad idea to have a special purpose
deallocation, say "ggets_release"; it would be a simple wrapper around
"free", but it would leave room for more complex actions in a future
version. But I don't think it's really necessary.

Actually, renaming "ggets()" to getsalloc() would be the most
consistent. After all, you feed the results of (m/c/re)alloc to free to
reclaim the memory, so you can just classify getsalloc() into that same
category and say that anything you obtained from something ending in
"-alloc" should be sent to free, with no increase in mental load at
all.

An auxiliary free-function like fclose() is usually supplied
when the contents of the data you are handling are (for practical
purposes) opaque.

If he wanted a custom free function, he might as well do tricky things
like hiding the length (which would then cost nothing to obtain) just
before the actual char * pointer, and provide other clever manipulation
functions that could leverage this information. Of course, you can see
where this naturally would end up going.
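The hide-the-length trick hinted at above might be sketched like this (all names hypothetical; the point is that the length lives in a hidden header just before the pointer the caller sees):

```c
#include <stdlib.h>
#include <string.h>

/* Allocate a copy of s with its length stashed in a hidden header
 * just before the returned pointer. */
char *lenalloc_strdup(const char *s)
{
    size_t len = strlen(s);
    size_t *p = malloc(sizeof *p + len + 1);
    if (p == NULL)
        return NULL;
    *p = len;                          /* hidden length header */
    memcpy(p + 1, s, len + 1);
    return (char *)(p + 1);            /* caller never sees the header */
}

size_t lenalloc_len(const char *str)   /* O(1), no strlen() needed */
{
    const size_t *p = (const size_t *)(const void *)str;
    return p[-1];
}

void lenalloc_free(char *str)          /* the matching custom free */
{
    if (str != NULL)
        free((size_t *)(void *)str - 1);
}
```

The cost is that plain free() no longer works on the returned pointer, which is exactly why such a scheme drags a matching free-function along with it.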
 

Eric Sosman

CBFalconer said:
Eric Sosman wrote:

... snip ...



When I designed ggets I considered that signature, but I could see
no way of returning appropriate errors for FILE problems, EOF,
and memory allocation problems alike.

Not sure what you mean by "FILE problems," but yes: the
ultra-simple signature loses the ability to distinguish between
EOF and realloc() failure. A caller who cares can write

    while ((buff = getline(stream)) != NULL) {
        /* process the line */
    }
    if (feof(stream)) {
        /* drained the entire file */
    } else if (ferror(stream)) {
        /* I/O error */
    } else {
        /* ran out of memory */
    }

But even that's not a panacea: A further shortcoming of my
method is that when it returns NULL to indicate I/O error or
realloc() failure, it thereby "loses" (but doesn't "leak")
any characters that may already have been read before the
event. That hasn't turned out to be a problem, because it's
never seemed worth while to try to parse the partial line in
the face of the failure; that is, instead of the above I
usually find myself writing

    while ((buff = getline(stream)) != NULL) {
        /* process the line */
    }
    if (! feof(stream)) {
        /* some kind of failure: terminate with regrets */
    }

I second the motion about easily remembered signatures.

It can, of course, be overdone. Sometimes I wonder whether
I'm running afoul of Einstein's famous remark that things should
be made as simple as possible, but no simpler.
 

Bill Reid

Ben Bacarisse said:
Bill Reid said:
I need to read in a comma separated file, and for this I was going to
use fgets. I was reading about it at http://www.cplusplus.com/ref/ and
I noticed that the document said:

"Reads characters from stream and stores them in string until (num -1)
characters have been read or a newline or EOF character is reached,
whichever comes first."

My question is that if it stops at a new line character (LF?) then how
does one read a file with multiple new line characters?

Another question. The syntax is:

char * fgets (char * string , int num , FILE * stream);

but you have to allot a size for the string before this. Would you just
use the same num as used in the fgets? So char stringexample[num] ?
OK, I've read the other responses to this and they were...shall we
say, regrettable? Except for "pathological" cases, here's all you
need to do:

#define LINEMAX 512

char csv_line[LINEMAX];
FILE *csv_fptr;

<get or create a string here that is the path to the CSV file>

if ((csv_fptr = fopen(csv_filepath, "r")) != NULL) {

    while (fgets(csv_line, LINEMAX, csv_fptr) != NULL) {

        <you can parse out the data from each csv_line right here>

    }

    fclose(csv_fptr);
}

else printf("\nCouldn't open %s", csv_filepath);

And you're done!

I may be missing your point,

There's no "maybe" about it...
but CSV files are not "line oriented"

Except the millions that are, and was the whole idea originally behind
a "CSV" file in the first place...you may be thinking of a "CSV" file
format that co-opted the name but not the spirit or the actual format
(I'm aware of these files), but the OP gave no indication that that
was the type of file he was dealing with...
so
this does not seem to be the obvious solution.

Of course not, because it works, is fast and simple, and is
blindingly obvious...therefore, it must be "wrong"...
Do you consider CSV
records with embedded newlines to be "pathological"

Sure, whatever, they aren't what I typically process and most
likely not what the OP wants to process, so if there are no "embedded
newlines" in the particular CSV files that I or anybody else wants to
process, who cares what name I call those that do? How about
"irrelevant", is that better? "Misnamed"? "Deceptive"?
and can thus be
ignored (like you do for long lines)

Oh, I'm ignoring even more than that, that's me all over, if it doesn't
apply, I just "ignore" it, I'm "funny" that way...
or are you saying that the loop
pattern above can be extended to deal with these easily?
Since fgets() terminates on a newline, you might be better off using
fgetc() instead, and then you have to chew up extra cycles counting
the commas rather than assuming you've got a full record for each
line...I do use other processing loops that employ fgetc() for truly
"pathological" (known malformatted, foreign, suspect) CSV files and
employ "sanity checks" prior to actually reading them, usually in
careful interactive context...

But since I usually know the maximum size and data types for the
files I am dealing with (many times because I wrote them myself!),
why would I slow down and complicate my program for something
I know for a fact I will not encounter?
When I last had to do this (not in C so I won't post the code) it
seemed easier in the long run to use a small state machine as
another poster has suggested.
I don't know exactly what you "had to do" so I can't comment. Since
you seem to be hung up on "embedded newlines" and "long lines" you
were clearly dealing with different data types than I am, and probably
the OP. Like most posters in this thread, you chose to answer a
question that wasn't asked, and then STILL failed to provide an answer!
 

Chris Torek

[much snippage]

...you may be thinking of a "CSV" file format that co-opted the
name ...

So, you are talking about CSV files, and he is talking about CSV
files, which are completely different things?

"Not Claw, Claw!"
 

Bill Reid

Chris Torek said:
[much snippage]

...you may be thinking of a "CSV" file format that co-opted the
name ...

So, you are talking about CSV files, and he is talking about CSV
files, which are completely different things?

"Not Claw, Claw!"

Exactly!

He's most certainly talking about just randomly exporting nonsense
from his weird Uncle Billy's "Excel" program as a *.csv file, such as:

123.45,678.9,135.45,"Hi there
this is a new line","Hi again sucker",680.24
791.23,579.35,,,,

Note carefully, however, that like all "CSV" files since the dawn
of man trying to portably export data by studiously avoiding binary
formats, it IS actually "line-oriented"; there were two rows of a
maximum of six columns each in the Excel file. The only reason it
"exported" as three lines was because of the "embedded newline" in
the cell at the fourth column of the first row; other than that, it follows
the grand tradition of a "field" being delimited by a comma (or
whatever) and the "record" being delimited by a newline.

If I wanted to, I guess I could have inserted the complete text for
"War and Peace" into that column, and thus over-flowed my
512-character buffer by a factor of about a million. For that
matter, I could have inserted the complete text for "Paradise
Lost" in the cell at the first column of the second row, even though
I put a float number in that column in the first row, and what the hell
would I do to "process" THAT data? I don't even think a
"state machine" would save my bacon at that point...

I will leave those problems to people oh so much smarter than I.
I will just continue to humbly bumble through "CSV" files that list
crap like the average home price for the last 30 years for the
top 40 metropolitan areas in the USA, and pray to God some
joker didn't insert the text for the "Kama Sutra" in the column for "1999"
at the row for "Miami/Dade County, FLA"...
 

Ben Bacarisse

Bill Reid said:
Chris Torek said:
[much snippage]
... but CSV files are not "line oriented"

...you may be thinking of a "CSV" file format that co-opted the
name ...

So, you are talking about CSV files, and he is talking about CSV
files, which are completely different things?

"Not Claw, Claw!"

Exactly!

He's most certainly talking about just randomly exporting nonsense
from his weird Uncle Billy's "Excel" program as a *.csv file, such as:

123.45,678.9,135.45,"Hi there
this is a new line","Hi again sucker",680.24
791.23,579.35,,,,

I was not aware that there was any other kind of CSV file but I will
accept that you know more about them than I do. However it was not clear
from the OP's post whether he meant "your" kind of CSV files or what I
will call, for want of a more formal definition, RFC 4180 CSV files.

I have obviously upset you, and for that I am sorry, but please don't
try to suggest that I am a fan of this dreadful format. (Or that I
have a weird uncle Billy in Redmond.)
 

Ben Bacarisse

Bill Reid said:
Ben Bacarisse said:
Bill Reid said:
I need to read in a comma separated file, and for this I was going to
use fgets. I was reading about it at http://www.cplusplus.com/ref/ and
I noticed that the document said:

"Reads characters from stream and stores them in string until (num -1)
characters have been read or a newline or EOF character is reached,
whichever comes first."

My question is that if it stops at a new line character (LF?) then how
does one read a file with multiple new line characters?

Another question. The syntax is:

char * fgets (char * string , int num , FILE * stream);

but you have to allot a size for the string before this. Would you just
use the same num as used in the fgets? So char stringexample[num] ?

OK, I've read the other responses to this and they were...shall we
say, regrettable? Except for "pathological" cases, here's all you
need to do:

#define LINEMAX 512

char csv_line[LINEMAX];
FILE *csv_fptr;

<get or create a string here that is the path to the CSV file>

if ((csv_fptr = fopen(csv_filepath, "r")) != NULL) {

    while (fgets(csv_line, LINEMAX, csv_fptr) != NULL) {

        <you can parse out the data from each csv_line right here>

    }

    fclose(csv_fptr);
}

else printf("\nCouldn't open %s", csv_filepath);

And you're done!

I may be missing your point,

There's no "maybe" about it...
but CSV files are not "line oriented"

Except the millions that are, and was the whole idea originally behind
a "CSV" file in the first place...you may be thinking of a "CSV" file
format that co-opted the name but not the spirit or the actual format
(I'm aware of these files), but the OP gave no indication that that
was the type of file he was dealing with...

Nor any that it was not. Your CSV files are "clean" and the ones I've
had to parse all had embedded newlines in them so we came to the question
from different angles.

I don't know exactly what you "had to do" so I can't comment. Since
you seem to be hung up on "embedded newlines" and "long lines" you
were clearly dealing with different data types than I am, and probably
the OP. Like most posters in this thread, you chose to answer a
question that wasn't asked, and then STILL failed to provide an
answer!

Actually, no, I did not even answer a question that was not asked -- I
just commented and, you are quite right, I added little value. If the
OP gets back to say that his/her CSV files do have embedded newlines I'll
translate my old code and post it.
 

Robert Gamble

Bill said:
Chris Torek said:
[much snippage]
... but CSV files are not "line oriented"

...you may be thinking of a "CSV" file format that co-opted the
name ...

So, you are talking about CSV files, and he is talking about CSV
files, which are completely different things?

"Not Claw, Claw!"

Exactly!

He's most certainly talking about just randomly exporting nonsense
from his weird Uncle Billy's "Excel" program as a *.csv file, such as:

123.45,678.9,135.45,"Hi there
this is a new line","Hi again sucker",680.24
791.23,579.35,,,,

Note carefully, however, that like all "CSV" files since the dawn
of man trying to portably export data by studiously avoiding binary
formats, it IS actually "line-oriented";

Nonsense, CSV is a binary format and always has been. You cannot
properly parse a CSV file by assuming that each line is its own record.
there were two rows of a
maximum of six columns each in the Excel file. The only reason it
"exported" as three lines was because of the "embedded newline" in
the cell at the fourth column of the first row; other than that, it follows
the grand tradition of a "field" being delimited by a comma (or
whatever) and the "record" being delimited by a newline.

I am not aware of this grand tradition you speak of and I have
processed numerous CSV files from multiple sources and applications
over the last few years, all of which handled embedded commas and
newlines. You may be confusing CSV with comma-delimited files, a text
format in which all commas are field separators, records are delimited
with newlines, and commas and newlines cannot exist in the actual data.
Comma-delimited files are significantly easier to process than real
CSV files but suffer from obvious disadvantages that are shared with
any character-delimited format.

Check out http://en.wikipedia.org/wiki/Comma-separated_values and
http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm for the details of
the CSV format and how it is used by real-world applications.
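For the simpler comma-delimited format Robert describes (no quoting, so commas and newlines cannot appear in the data), splitting a record really is straightforward; an illustrative sketch:

```c
#include <string.h>

/* Split a comma-delimited record in place: each comma becomes '\0'
 * and fields[] points at the start of each field.  Unlike strtok(),
 * this preserves empty fields.  Returns the number of fields. */
size_t split_fields(char *line, char *fields[], size_t max)
{
    size_t n = 0;
    char *p = line;

    while (n < max) {
        fields[n++] = p;
        p = strchr(p, ',');
        if (p == NULL)
            break;
        *p++ = '\0';
    }
    return n;
}
```

Feed it one fgets() line at a time (after stripping the trailing newline) and you have a comma-delimited reader; it is exactly quoted fields that break this and demand a real CSV parser.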
If I wanted to, I guess I could have inserted the complete text for
"War and Peace" into that column, and thus over-flowed my
512-character buffer by a factor of about a million. For that
matter, I could have inserted the complete text for "Paradise
Lost" in the cell at the first column of the second row, even though
I put a float number in that column in the first row, and what the hell
would I do to "process" THAT data? I don't even think a
"state machine" would save my bacon at that point...

It's not an overly simple task but it can be done without extraordinary
effort. Excel does it, and I have written a real CSV parser in pure
ANSI C which I recently made available at
http://sourceforge.net/projects/libcsv.

Robert Gamble
 

Keith Thompson

CBFalconer said:
When I designed ggets I considered that signature, but I could see
no way of returning appropriate errors for FILE problems, EOF,
and memory allocation problems alike.

I second the motion about easily remembered signatures.

Well, you *could* return as many unique error conditions as you like
with a simple char* return value:

const char *const gl_EOF = "getline: EOF";
const char *const gl_io_error = "getline: I/O error";
const char *const gl_malloc_failed = "getline: malloc failed";
...

Each error code is a unique pointer value. One problem is that
there's no good way to detect success other than by comparing the
result to each possible error code (a set that can easily change in
later revisions of the function); this can be alleviated somewhat by
putting all the error codes into an array, but it's still inconvenient
for the user. You could provide an auxiliary function that tells you
whether the char* value returned by getline() points to an error
message or not.

Another drawback is that you can't return both error information and
actual data.

But if the user forgets to check the result, the value returned looks
like an error message which could make it easier to track down bugs.

Another approach is to implement something resembling errno, but that
can make it difficult to tie an error to a specific call.
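The auxiliary predicate Keith mentions might be sketched like this (the error codes follow his example; the function name and the array are illustrative):

```c
#include <stddef.h>

/* Each error "code" is a unique pointer value that doubles as a
 * printable message. */
const char *const gl_EOF           = "getline: EOF";
const char *const gl_io_error      = "getline: I/O error";
const char *const gl_malloc_failed = "getline: malloc failed";

/* Auxiliary predicate: does this return value point at one of the
 * known error codes?  Compares pointers, never string contents, so
 * real data that happens to match an error message is still data. */
int gl_is_error(const char *result)
{
    const char *const codes[] = { gl_EOF, gl_io_error, gl_malloc_failed };
    size_t i;

    for (i = 0; i < sizeof codes / sizeof codes[0]; i++)
        if (result == codes[i])
            return 1;
    return 0;
}
```

Keeping the codes in one array means later revisions can add failure modes without breaking callers that use the predicate.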
 

Eric Sosman

Keith said:
Well, you *could* return as many unique error conditions as you like
with a simple char* return value:

const char *const gl_EOF = "getline: EOF";
const char *const gl_io_error = "getline: I/O error";
const char *const gl_malloc_failed = "getline: malloc failed";
...

Given the availability of feof() and ferror(), and the
sub-rosa knowledge that the only other possible failure mode
is NULL from realloc(), such a dodge didn't seem necessary.
Purists might argue (with some justification) that the sub-rosa
knowledge is a Bad Thing; my feeling was that enumerating all
three "exceptional" possibilities in the documentation (thus
making them part of the "interface contract") was good enough.

If I'd felt it desirable to return more information about
the failure modes, I'd probably have abandoned the approach of
overloading both success and all those failure modes onto a
single returned value. Instead, I'd have returned the "payload"
in one place and a "status" in another -- almost all (Open)VMS
facilities worked this way, and reasonably well. This approach
would also have allowed me to return the partial line payload
that preceded an error, instead of just discarding what had
been read, and that would have been a Good Thing. (Principle:
low-level routines shouldn't make higher-level decisions.) But
it didn't "feel" worth while to clutter the interface to preserve
information I really couldn't see myself making use of.

Different people design different interfaces for the same
task! Different people decompose the same task in different ways!
Ultimately, it comes down to what might be called "taste" (there's
just no point in arguing with Gus), or to put it on a more respectable
footing it comes down to a guess about the likely usage scenarios for
the new facility. There might (or might not) be the germ of a thesis
topic for someone who wants to make a study of how different programmers
approach similar problems: were you corrupted by an early exposure to
Forth, was your mother frightened by a COBOL compiler?
[...]
Another approach is to implement something resembling errno, but that
can make it difficult to tie an error to a specific call.

Usually, when I want to preserve more information than I "tasted"
was appropriate for getline(), I'll have the function return a status
code and pass the payload through an additional argument:

#define GETLINE_OK 0
#define GETLINE_EOF (-1)
#define GETLINE_ERR (-2)
#define GETLINE_NOMEM (-3)
int getline(FILE *stream, char **bufptr);

.... and the caller would write

char *line;
int status;
while ((status = getline(stream, &line)) == GETLINE_OK)

Sometimes it's convenient to make the "status" value be NULL
for success or else a `const char*' pointing to an error message:

const char *status;
if ((status = somefunc(args)) != NULL) {
    fprintf(stderr, "somefunc: %s\n", status);
    exit(EXIT_FAILURE);
}

"There are nine and sixty ways of constructing tribal lays,
And every single one of them is right!" -- Rudyard Kipling
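An illustrative implementation consistent with the GETLINE_* interface above (not Eric's actual code; the buffer-doubling and the exact EOF/error split are assumptions) might read:

```c
#include <stdio.h>
#include <stdlib.h>

#define GETLINE_OK     0
#define GETLINE_EOF   (-1)
#define GETLINE_ERR   (-2)
#define GETLINE_NOMEM (-3)

/* Read one line into a malloc'd buffer stored through bufptr; the
 * caller frees it.  The status and the payload travel separately,
 * so a partial line could also be returned on failure if desired. */
int getline_status(FILE *stream, char **bufptr)
{
    size_t cap = 64, len = 0;
    char *buf = malloc(cap);
    int ch;

    if (buf == NULL)
        return GETLINE_NOMEM;
    while ((ch = getc(stream)) != EOF && ch != '\n') {
        if (len + 2 > cap) {               /* room for ch and '\0' */
            char *nbuf = realloc(buf, cap * 2);
            if (nbuf == NULL) {
                free(buf);
                return GETLINE_NOMEM;
            }
            buf = nbuf;
            cap *= 2;
        }
        buf[len++] = (char)ch;
    }
    if (ch == EOF && len == 0) {
        free(buf);
        return ferror(stream) ? GETLINE_ERR : GETLINE_EOF;
    }
    buf[len] = '\0';
    *bufptr = buf;
    return GETLINE_OK;
}
```

A caller loops while the status is GETLINE_OK and then switches on the final status, exactly as in the while-loop sketched earlier in the post.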
 

CBFalconer

Eric said:
.... snip ...

If I'd felt it desirable to return more information about
the failure modes, I'd probably have abandoned the approach of
overloading both success and all those failure modes onto a
single returned value. Instead, I'd have returned the "payload"
in one place and a "status" in another -- almost all (Open)VMS
facilities worked this way, and reasonably well. This approach
would also have allowed me to return the partial line payload
that preceded an error, instead of just discarding what had
been read, and that would have been a Good Thing. (Principle:
low-level routines shouldn't make higher-level decisions.) But
it didn't "feel" worth while to clutter the interface to preserve
information I really couldn't see myself making use of.

Precisely what ggets does.
 

Richard Bos

Because responsibilities become unclear. Simple rules like 'whoever
allocates something must deallocate it' don't work any more.

Sure they do. The rule "*alloc() are the only three functions which
allocate something" doesn't work any more; that's all.
Ok, that's symmetric.


That's unsymmetric. The user can easily forget the 'free'.

Then the user had better do his homework.
It's all about style. Maybe someone can tell the story why strdup was
excluded from the C Standard (I'm not a C historian and don't want to
become one).

Then _you_ had better do your homework. Start with the Rationale, as
indicated upthread.

Richard
 
