fgets() and embedded null characters


David Mathog

Every so often one of my fgets() based programs encounters
an input file containing embedded nulls. fgets is happy to
read these but the embedded nulls subsequently cause problems
elsewhere in the program. Since fgets() doesn't return
the number of characters read it is pretty tough to handle
the embedded nulls once they are in the buffer.

So two questions:

1. Why did the folks who wrote fgets() have a successful
read return a pointer to the storage buffer (which the
calling routine already knew in any case) instead of the
number of characters read (which often cannot be determined
after the fact if there are embedded nulls in the input)?

2. Can somebody please supply a pointer to a function
written in ANSI C that:

A) reads from a stream (like fgets)
B) stores to a preallocated buffer (like fgets)
C) accepts the size of the buffer (like fgets)
D) returns the number of characters read (unlike fgets)
E) sets read status, ideally in an integer combining
status bits more or less like these:
1 EOF
2 LINETOOBIG (instead of having to check the last byte)
4 READERROR (any other kind of READ error)
(read status = 1 with a nonzero returned length would
not be an error, it just indicates that all input data
has been consumed.)

If need be I can roll my own from fgetc, but I'd rather not reinvent
this wheel.

Thanks,

David Mathog
(e-mail address removed)
 

Eric Sosman

David said:
Every so often one of my fgets() based programs encounters
an input file containing embedded nulls. fgets is happy to
read these but the embedded nulls subsequently cause problems
elsewhere in the program. Since fgets() doesn't return
the number of characters read it is pretty tough to handle
the embedded nulls once they are in the buffer.

As an aside, a file containing '\0' characters is not
suitable for reading with a text stream. Section 7.19.2
paragraph 2 describes the "expected form" of a text stream:
printing characters and a small group of control characters,
plus a few other conventions. If you write a '\0' to a
text stream it's not guaranteed that you can read it back,
not even if you use getc().

If the data can include '\0' (more generally, if it
doesn't follow the expected conventions for text), you can
use a binary stream. But then one must question the wisdom
of using fgets(), which is specifically designed for textual
input in units of lines. The Standard doesn't prohibit using
fgets() with a binary stream, but fread() might be better.
So two questions:

1. Why did the folks who wrote fgets() have a successful
read return a pointer to the storage buffer (which the
calling routine already knew in any case) instead of the
number of characters read (which often cannot be determined
after the fact if there are embedded nulls in the input)?

"It was just one of those things,
Just one of those crazy flings,
One weird design to raise Hell with strings,
Just one of those things."

.... and there are plenty of other examples of library functions
that echo back what you already know instead of telling you
something useful. The folks who invented fgets() (and gets(),
and strcat(), and ...) lacked our twenty-twenty hindsight.
2. Can somebody please supply a pointer to a function
written in ANSI C that:

A) reads from a stream (like fgets)
B) stores to a preallocated buffer (like fgets)
C) accepts the size of the buffer (like fgets)
D) returns the number of characters read (unlike fgets)
E) sets read status, ideally in an integer combining
status bits more or less like these:
1 EOF
2 LINETOOBIG (instead of having to check the last byte)
4 READERROR (any other kind of READ error)
(read status = 1 with a nonzero returned length would
not be an error, it just indicates that all input data
has been consumed.)

If need be I can roll my own from fgetc, but I'd rather not reinvent
this wheel.

I don't know of a function with quite this specification,
although somebody may have written one (it seems everybody
eventually writes himself an fgets() replacement). If you
wind up rolling your own, I'd suggest getc() instead of fgetc().
Also, while conditions A-D seem entirely reasonable, point E
seems more involved than it needs to be: it would seem that
most calls would need to be accompanied by a bunch of bit-testing,
increasing the "clunkiness" of the interface. Note that the
feof() and ferror() functions can already discriminate cases
E1 and E4; is it really worth while to call out E2 separately?
Absent dynamic allocation you need *some* way of discriminating
between "line too long" and "line that just fits but ends with
EOF instead of newline," but perhaps a simple convention about
using or not using the last spot in the buffer might handle it
with a slimmer interface.
 

CBFalconer

David said:
Every so often one of my fgets() based programs encounters
an input file containing embedded nulls. fgets is happy to
read these but the embedded nulls subsequently cause problems
elsewhere in the program. Since fgets() doesn't return
the number of characters read it is pretty tough to handle
the embedded nulls once they are in the buffer.

So two questions:

1. Why did the folks who wrote fgets() have a successful
read return a pointer to the storage buffer (which the
calling routine already knew in any case) instead of the
number of characters read (which often cannot be determined
after the fact if there are embedded nulls in the input)?

Because somebody wrote it that way about 30 years ago, and a change
would break all sorts of existing code.
2. Can somebody please supply a pointer to a function
written in ANSI C that:

A) reads from a stream (like fgets)
B) stores to a preallocated buffer (like fgets)
C) accepts the size of the buffer (like fgets)
D) returns the number of characters read (unlike fgets)
E) sets read status, ideally in an integer combining
status bits more or less like these:
1 EOF
2 LINETOOBIG (instead of having to check the last byte)
4 READERROR (any other kind of READ error)
(read status = 1 with a nonzero returned length would
not be an error, it just indicates that all input data
has been consumed.)

If need be I can roll my own from fgetc, but I'd rather not
reinvent this wheel.

Bingo. Except you would be well advised to use getc rather than
fgetc. BTW, if your files have '\0' (nul, not null) chars in them,
they are not textfiles, and you will need to face the non-portable
treatment of line endings.
 

David Mathog

Eric said:
As an aside, a file containing '\0' characters is not
suitable for reading with a text stream. Section 7.19.2
paragraph 2 describes the "expected form" of a text stream:
printing characters and a small group of control characters,
plus a few other conventions. If you write a '\0' to a
text stream it's not guaranteed that you can read it back,
not even if you use getc().

Sure. Unfortunately in the real world I sometimes encounter
files that do contain embedded null characters but are
otherwise normal text files.

Both responses so far said to use getc instead of fgetc,
is that for speed?

Here's a first pass at this function. Before everybody jumps
on the name please note that super_fgets()
doesn't imply that it is better than fgets(), just that it does more.
And no I have not tested it very thoroughly yet.

Ideally it would read at an even lower level than (f)getc so that the
secondary tests for EOF vs. read error wouldn't be necessary.
It has two warning fields: SFG_CRLF, indicating the presence of
a CRLF (vs. a LF) and SFG_EMBEDDED_NULL. It does not correct these,
just warns that they exist. The test for the trailing \r is nearly
free but the test for embedded NULL will slow things down a bit.
However, I think that costs less than scanning for embedded nulls
after this routine returns, since each character is already in a
CPU register.


/* super_fgets() status bits, put in a header file */
#define SFG_EOF 1 /* input terminated by End of File */
#define SFG_EOL 2 /* input terminated by End of line (\n) */
#define SFG_CRLF 4 /* input terminated by CRLF (\r\n) \r remains! */
#define SFG_EMBEDDED_NULL 8 /* embedded NULL characters are present */
#define SFG_BUFFER_OVERFLOW 16 /* input buffer full */
#define SFG_READERROR 32 /* unrecoverable read error */

/* super_fgets is implemented at the getc level. It does the following:
A: reads from a stream (like fgets)
B: accepts a preallocated buffer (like fgets)
C: accepts the size of that preallocated buffer (like fgets)
D: terminates the characters read with a '\0' in all cases
(unlike fgets on a read that won't fit into the buffer)
Input is terminated by either EOL (\n) or EOF.
E: sets the position of the terminating null =
number of characters read (size_t)
F: sets a status integer where the bits are as
defined in the table far above (SFG_*)


Limitations: not a drop in fgets() replacement!

*/

void super_fgets(char *string, size_t size, FILE *stream,
                 size_t *cterm, unsigned int *status)
{
    size_t icterm;        /* internal cterm value */
    unsigned int istatus; /* internal status value */
    size_t lastslot;      /* the last character cell in the buffer */
    int readthis;         /* the character which was read */

    icterm = 0;
    istatus = 0;
    lastslot = size - 1;

    while (1) {

        if (icterm == lastslot) {
            istatus |= SFG_BUFFER_OVERFLOW;
            break;
        }

        readthis = fgetc(stream);

        if (readthis == EOF) {
            /* either the end of the file or a
               read error, figure out which */
            if (feof(stream)) { istatus |= SFG_EOF; }
            else              { istatus |= SFG_READERROR; }
            break;
        }

        if (readthis == '\n') {
            /* LF is a line terminator, return what has been read so far,
               NOTE, the \n is NOT returned!!! On \r\n terminated input
               files the trailing \r may be present, check and
               signal that too. */
            istatus |= SFG_EOL;
            if ((icterm > 0) && (string[icterm - 1] == '\r')) istatus |= SFG_CRLF;
            break;
        }

        /* warn about embedded null characters */
        if (readthis == '\0') istatus |= SFG_EMBEDDED_NULL;

        string[icterm] = readthis;
        icterm++;
    }

    string[icterm] = '\0';
    *status = istatus;
    *cterm = icterm;
    return;
}


Regards,

David Mathog
(e-mail address removed)
 

Dave Vandervies

[much snippage]
Here's a first pass at [a fgets replacement]

....and some of my comments after a first pass at reading it.

It has two warning fields: SFG_CRLF, indicating the presence of
a CRLF (vs. a LF)

The way you state it makes it seem as if you're aware of this, but it's
worth explicitly noting that if you're reading a well-formed text file
that was opened correctly (that is, not in binary mode), you won't see
CRLF line endings.


#define SFG_BUFFER_OVERFLOW 16 /* input buffer full */

A different name (possibly SFG_BUFFER_FULL) would be more accurate for
this one, since you don't actually overflow the buffer (unless you're
given a too-small size).


D: terminates the characters read with a '\0' in all cases
(unlike fgets on a read that won't fit into the buffer)

fgets does do this; it reads at most (size-1) bytes and always gives
back a '\0'-terminated string.


F: sets a status integer where the bits are as
defined in the table far above (SFG_*)

void super_fgets(char *string, size_t size, FILE *stream,
size_t *cterm, unsigned int *status){

It would probably be better to return the status instead of storing it
through a pointer you get. That would allow the common idiom of doing
something like:
--------
if((ret=super_fgets(buf,sizeof buf,stdin,&last))!=SFG_EOL)
{
/*Something's not quite right, deal with it*/
}
--------
(or even:
--------
if(super_fgets(buf,sizeof buf,stdin,&last)!=SFG_EOL)
{
/*This trivial sample program isn't worth doing proper error-handling
in, as that would clutter the program and obscure the real point
*/
exit(EXIT_FAILURE);
}
--------
) rather than having to do (no harder, really, but less common and
therefore less immediately recognizable)
--------
super_fgets(buf,sizeof buf,stdin,&last,&ret);
if(ret!=SFG_EOL)
{
/*Something's not quite right, deal with it*/
}
--------

while(1){

if(icterm == lastslot){
istatus |= SFG_BUFFER_OVERFLOW;
break;
}

My first reaction when I see something like this is that the check for
whether to break should be folded into the loop condition, since things
are often clearer that way, but it's not immediately clear whether that
can be done here without making readability worse rather than better.


/* LF is a line terminator, return what has been read so far,
NOTE, the \n is NOT returned!!!

Which is "better" depends more on your own preferences than anything
else, since either way you're providing enough information to trivially
reconstruct the other, but my preference would be to leave a '\n'
that's read from the stream in the buffer rather than removing it, so
that what the caller gets is exactly what was read from the stream.
(This makes read-and-dump simpler, and if you're already doing more
than read-and-dump with it it's trivial to add removing the newline,
if you don't want it, to what you're doing.)


dave
 

Flash Gordon

David said:
Sure. Unfortunately in the real world I sometimes encounter
files that do contain embedded null characters but are
otherwise normal text files.

Both responses so far said to use getc instead of fgetc,
is that for speed?

Yes. The idea is that getc will be a macro that can evaluate its
parameter more than once, whereas fgetc (even if implemented as a macro
as well as a function) has to behave like a function. So getc can take
more shortcuts and be more efficient.
Here's a first pass at this function. Before everybody jumps
on the name please note that super_fgets()
doesn't imply that it is better than fgets(), just that it does more.
And no I have not tested it very thoroughly yet.

Ideally it would read at an even lower level than (f)getc so that the
secondary tests for EOF vs. read error wouldn't be necessary.

You could use fread and buffer stuff yourself, but then people could not
mix calls to your super_fgets and the standard functions. The same
applies to non-standard lower level functions such as the read function
provided as an extension on some implementations. So using getc is
probably the best way.

Anyway, I would be very surprised if the test for whether it was an end
of file or a read error will be a significant factor in the performance.
It has two warning fields: SFG_CRLF, indicating the presence of
a CRLF (vs. a LF)

No chance of old Mac text files (I think it is) with LFCR for the line
termination?
and SFG_EMBEDDED_NULL. It does not correct these,
just warns that they exist. The test for the trailing \r is nearly
free but the test for embedded NULL will slow things down a bit.
However, I think that costs less than scanning for embedded nulls
after this routine returns, since each character is already in a
CPU register.

I agree that, for your purpose, this is probably the way to go.
/* super_fgets() status bits, put in a header file */
#define SFG_EOF 1 /* input terminated by End of File */
#define SFG_EOL 2 /* input terminated by End of line (\n) */

I would not bother with a status code for reaching the end of line as
this is the normal situation. Unless you are worried about whether the
last line in the file ends with a newline.
#define SFG_CRLF 4 /* input terminated by CRLF (\r\n) \r remains! */
#define SFG_EMBEDDED_NULL 8 /* embedded NULL characters are present */
#define SFG_BUFFER_OVERFLOW 16 /* input buffer full */
#define SFG_READERROR 32 /* unrecoverable read error */

/* super_fgets is implemented at the getc level. It does the following:
A: reads from a stream (like fgets)
B: accepts a preallocated buffer (like fgets)
C: accepts the size of that preallocated buffer (like fgets)
D: terminates the characters read with a '\0' in all cases
(unlike fgets on a read that won't fit into the buffer)
Input is terminated by either EOL (\n) or EOF.
E: sets the position of the terminating null =
number of characters read (size_t)
F: sets a status integer where the bits are as
defined in the table far above (SFG_*)


Limitations: not a drop in fgets() replacement!

*/

void super_fgets(char *string, size_t size, FILE *stream,
size_t *cterm, unsigned int *status){

I would make the function return the status rather than passing a
pointer to the status variable.
size_t icterm; /* internal cterm value */
unsigned int istatus; /* internal status value */

I would not bother with having these as local variable. I would just
work with *cterm (and *status if you want to return status that way).
The compiler should be able to handle optimising the accesses.
size_t lastslot; /* the last character cell in the buffer */

I would not bother with this variable either.
int readthis; /* the character which was read */

icterm = 0;
istatus = 0;
lastslot = size-1;

while(1){

Why not use a for so that initialisation and increment can be
encapsulated at one point?
if(icterm == lastslot){
istatus |= SFG_BUFFER_OVERFLOW;
break;
}
readthis=fgetc(stream);

if(readthis == EOF){
/* either the end of the file or a
read error, figure out which */
if(feof(stream)){ istatus |= SFG_EOF; }
else { istatus |= SFG_READERROR; }
break;
}

if(readthis == '\n'){

Why separate if statements, especially as they are mutually exclusive?
either "else if" or doing something with a switch would be clearer in my
opinion.
/* LF is a line terminator, return what has been read so far,
NOTE, the \n is NOT returned!!! On \r\n terminated input
files the trailing \r may be present, check and
signal that too. */

istatus |= SFG_EOL;
if( (icterm>0) && (string[icterm-1]=='\r')) istatus |= SFG_CRLF;
break;
}

/* warn about embedded null characters */
if(readthis == '\0')istatus |= SFG_EMBEDDED_NULL;

string[icterm] = readthis;
icterm++;

}
string[icterm]='\0';
*status = istatus;
*cterm = icterm;
return;

}

Starting from yours I would as a first hack change it to:

unsigned int super_fgets(char *string, size_t size, FILE *stream,
size_t *cterm)
{
int readthis; /* the character which was read */
unsigned int status = 0;

for (*cterm = 0; *cterm < size-1; ++*cterm) {

readthis = fgetc(stream);

switch (readthis) {
case EOF:
/* either the end of the file or a
read error, figure out which */

string[*cterm] = '\0';

if (feof(stream))
return status | SFG_EOF;
else
return status | SFG_READERROR;

case '\n':
/* LF is a line terminator, return what has been read so
far.
NOTE, the \n is NOT returned!!! On \r\n terminated input
files the trailing \r may be present, check and
signal that too. */

string[*cterm] = '\0';

if ((*cterm > 0) && (string[*cterm - 1] == '\r'))
status |= SFG_CRLF;

return status | SFG_EOL;

case '\0':
status |= SFG_EMBEDDED_NULL;
/* fall through to default case */

default:
string[*cterm] = readthis;
break;

}
}

string[*cterm] = '\0';
return status | SFG_BUFFER_OVERFLOW;
}

This is completely untested, but I think a bit tidier than yours. If I
was designing from scratch, and not doing it late at night, I might do
it differently.
 

Flash Gordon

Dave said:
[much snippage]
Here's a first pass at [a fgets replacement]

...and some of my comments after a first pass at reading it.
It has two warning fields: SFG_CRLF, indicating the presence of
a CRLF (vs. a LF)

The way you state it makes it seem as if you're aware of this, but it's
worth explicitly noting that if you're reading a well-formed text file
that was opened correctly (that is, not in binary mode), you won't see
CRLF line endings.

That only applies on a DOS/Windows type system. On Unix, even with the
file opened as a text stream, the CR will still be left on since Unix
uses just an LF to indicate the end of line.

However, in this case, since the file may have nul characters, I would
read it as binary anyway. For the types of SW I deal with I would treat
LF and CRLF identically, but that may not be appropriate for the OP.

<snip>
 

Dave Vandervies

Flash Gordon wrote:

That only applies on a DOS/Windows type system. On Unix, even with the
file opened as a text stream, the CR will still be left on since Unix
uses just an LF to indicate the end of line.

If it's on a unix system and has a CR, it's not a well-formed text file.
(In that case, it's most likely an incorrectly imported file from
another system.)

(I believe older MacOS systems used CR-only as their line delimiter.
A MacOS program opening such a file in text mode would get the appropriate
translation to '\n' for end-of-line done for it by the library; copying
the file to a Unixish (or DosWindowsish) system without translating
appropriately would give you something other than a well-formed text
file, just as copying between Unixish and DosWindowsish systems without
translating line-break conventions would.)


dave
 

CBFalconer

David said:
Sure. Unfortunately in the real world I sometimes encounter
files that do contain embedded null characters but are
otherwise normal text files.

Then they are not text files. They are binary files, and should be
so treated.
Both responses so far said to use getc instead of fgetc,
is that for speed?

Yes. getc can be a macro, and can operate directly on the system
buffers, and thus avoid the overhead of a system function call.
Here's a first pass at this function. Before everybody jumps
on the name please note that super_fgets()
doesn't imply that it is better than fgets(), just that it does more.
And no I have not tested it very thoroughly yet.

Ideally it would read at an even lower level than (f)getc so that the
secondary tests for EOF vs. read error wouldn't be necessary.
It has two warning fields: SFG_CRLF, indicating the presence of
a CRLF (vs. a LF) and SFG_EMBEDDED_NULL. It does not correct these,
just warns that they exist. The test for the trailing \r is nearly
free but the test for embedded NULL will slow things down a bit.
However, I think that costs less than scanning for embedded nulls
after this routine returns, since each character is already in a
CPU register.

/* super_fgets() status bits, put in a header file */
#define SFG_EOF 1 /* input terminated by End of File */
#define SFG_EOL 2 /* input terminated by End of line (\n) */
#define SFG_CRLF 4 /* input terminated by CRLF (\r\n) \r remains! */
#define SFG_EMBEDDED_NULL 8 /* embedded NULL characters are present */
#define SFG_BUFFER_OVERFLOW 16 /* input buffer full */
#define SFG_READERROR 32 /* unrecoverable read error */

Better to define an enumeration, so those values and names will
show up in a debugger.

typedef enum sfgRESULT {SFGOK, SFGEOF, SFGEOL, SFGCRLF=4,
SFGNULL=8, ..... } SFGRESULT;
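A filled-out version of that enumeration might read as below (the names past SFGNULL are invented to match the bit masks above; note that an OR-ed combination of bits won't display as a single enumerator name in a debugger, only the individual values will):

```c
typedef enum sfgRESULT {
    SFGOK   = 0,   /* no condition set */
    SFGEOF  = 1,   /* input terminated by end of file */
    SFGEOL  = 2,   /* input terminated by end of line */
    SFGCRLF = 4,   /* CRLF line ending seen */
    SFGNULL = 8,   /* embedded nul characters present */
    SFGFULL = 16,  /* invented name: buffer filled */
    SFGERR  = 32   /* invented name: unrecoverable read error */
} SFGRESULT;
```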
/* super_fgets is implemented at the getc level. It does the following:
A: reads from a stream (like fgets)
B: accepts a preallocated buffer (like fgets)
C: accepts the size of that preallocated buffer (like fgets)
D: terminates the characters read with a '\0' in all cases
(unlike fgets on a read that won't fit into the buffer)

fgets always terminates with a '\0'. It omits the \n when the full
line doesn't fit the buffer.
Input is terminated by either EOL (\n) or EOF.

Bad idea. EOF doesn't fit into a char. That's why getc etc. return
int. What about a Mac file, which terminates lines with \r and no
\n? What about systems that don't terminate lines with anything?
E: sets the position of the terminating null =
number of characters read (size_t)
F: sets a status integer where the bits are as
defined in the table far above (SFG_*)

Limitations: not a drop in fgets() replacement!

Too complex an interface. You yourself will forget how to call it
in a short while. Hell, I can never remember whether strcpy copies
to or from the first parameter. KISS. See:

<http://cbfalconer.home.att.net/download/ggets.zip>

where I managed to make the interface simple enough for even me to
remember it. Then explain why your routine is a significant
advance on fread. Quickly now, does the file parameter come first
or last? Why?
 

TTroy

Dave said:
[much snippage]
Here's a first pass at [a fgets replacement]

...and some of my comments after a first pass at reading it.

It has two warning fields: SFG_CRLF, indicating the presence of
a CRLF (vs. a LF)

The way you state it makes it seem as if you're aware of this, but it's
worth explicitly noting that if you're reading a well-formed text file
that was opened correctly (that is, not in binary mode), you won't see
CRLF line endings.


#define SFG_BUFFER_OVERFLOW 16 /* input buffer full */

A different name (possibly SFG_BUFFER_FULL) would be more accurate for
this one, since you don't actually overflow the buffer (unless you're
given a too-small size).


D: terminates the characters read with a '\0' in all cases
(unlike fgets on a read that won't fit into the buffer)

fgets does do this; it reads at most (size-1) bytes and always gives
back a '\0'-terminated string.

ITYM almost always, because when fgets returns NULL, the string is not
guaranteed to be nul terminated (much like gets or scanf() with %s).
 

websnarf

David said:
Every so often one of my fgets() based programs encounters
an input file containing embedded nulls. fgets is happy to
read these but the embedded nulls subsequently cause problems
elsewhere in the program. Since fgets() doesn't return
the number of characters read it is pretty tough to handle
the embedded nulls once they are in the buffer.

Some time ago, I was involved in a flamefest in this newsgroup that
essentially claimed that your situation did not exist in the real
world. That is to say, they think you are lying.
So two questions:

1. Why did the folks who wrote fgets() have a successful
read return a pointer to the storage buffer (which the
calling routine already knew in any case) instead of the
number of characters read (which often cannot be determined
after the fact if there are embedded nulls in the input)?

The C language designers were "hackers". It did the job they wanted it
to do at the time. They had no consideration for future ramifications.
In fact they claim they had no expectation that C would be used
outside of Unix and some utilities for said OS. gets() is of course
far worse, but thinking about that will help you understand just how
little forethought was put into the design of the C language.

Thankfully in the 3 decades since they first designed this language,
the ANSI C committee have fixed all the warts of the language and no
longer do we have to endure the embarrassment that is the C language
library and ... oh wait, sorry no, that was just a dream I had,
nevermind.
2. Can somebody please supply a pointer to a function
written in ANSI C that:

A) reads from a stream (like fgets)
B) stores to a preallocated buffer (like fgets)
C) accepts the size of the buffer (like fgets)
D) returns the number of characters read (unlike fgets)
E) sets read status, ideally in an integer combining
status bits more or less like these:
1 EOF
2 LINETOOBIG (instead of having to check the last byte)
4 READERROR (any other kind of READ error)
(read status = 1 with a nonzero returned length would
not be an error, it just indicates that all input data
has been consumed.)

If need be I can roll my own from fgetc, but I'd rather not reinvent
this wheel.

Ask, and ye shall receive:

http://www.azillionmonkeys.com/qed/userInput.html

It doesn't exactly do the things you ask for, but rather is a general
enough framework for you to easily write your own callbacks and
wrappers with exactly the behaviors you desire.
 

Keith Thompson

Some time ago, I was involved in a flamefest in this newsgroup that
essentially claimed that your situation did not exist in the real
world. That is to say, they think you are lying.

I don't recall any such discussion here. Can you provide a citation?
Could you have misinterpreted something?

It doesn't surprise me that a program using fgets() would have
problems reading a file with embedded nuls. (Arguably such a file is
not a text file, and you therefore shouldn't be using fgets() to read
it, but determining that before you try to read the file can be
difficult.)
 

Flash Gordon

Dave said:
If it's on a unix system and has a CR, it's not a well-formed text file.
(In that case, it's most likely an incorrectly imported file from
another system.)

The OP has already said that the files may contain null characters and
so are badly formed, therefore it is entirely possible that the CRs in
the file are another sign of badly formed files.
(I believe older MacOS systems used CR-only as their line delimiter.

Possibly. I'm sure there is one system which used LFCR as well.
A MacOS program opening such a file in text mode would get the appropriate
translation to '\n' for end-of-line done for it by the library; copying
the file to a Unixish (or DosWindowsish) system without translating
appropriately would give you something other than a well-formed text
file, just as copying between Unixish and DosWindowsish systems without
translating line-break conventions would.)

I've made a lot of use of FTP in binary and text modes between *nix and
Windows so I know what is meant to be done. However, the OP knows he is
dealing with badly formed text files, so we cannot infer what system he
is on based on what the files contain. In his position I would open the
files in binary mode and handle line termination myself.
 

Chris Croughton

The OP has already said that the files may contain null characters and
so are badly formed, therefore it is entirely possible that the CRs in
the file are another sign of badly formed files.

Have you considered that the files may be on a shared filesystem,
accessed by both Unix and Win/DOS? Since NFS, for example, has no idea
of what is accessing the files or whether they are supposed to be 'text'
or 'binary' it can't do any translation.

Real World(tm) programs which are well-written will take account of
that, and where they can they will handle all common line endings (CR,
LF, CRLF, possibly LFCR) transparently (Vim, for instance, will try to
recognise the type of file, and will allow the user to change it if it
gets it wrong).
Possibly. I'm sure there is one system which used LFCR as well.


I've made a lot of use of FTP in binary and text modes between *nix and
Windows so I know what is meant to be done. However, the OP knows he is
dealing with badly formed text files, so we cannot infer what system he
is on based on what the files contain. In his position I would open the
files in binary mode and handle line termination myself.

With the NUL characters it's messy, because any string containing them
will be bound to have problems elsewhere. For just Win/DOS and *ix
files the easiest thing is to open in text mode, where CRLF will either
be replaced by \n automatically or will give \r\n, and detect the latter
and replace it. If input is in a loop it's easy enough to detect CR as
a line ending as well. Note that in the following I (a) define CR and
LF as the explicit hex ASCII values; (b) don't return a flag as to which
line ending was present; (c) don't differentiate between EOF and error.
Those things, if wanted, are left as an exercise for the reader. I've
included the test program, a hex dump of my test file, and the output...

#include <stdio.h>
#include <ctype.h>

#define CR 0x0D
#define LF 0x0A

/**
* Get a line from the input stream *fp, up to size-1 characters nul
* terminated. Handles line terminators LF, CR, CRLF and LFCR,
* treats them all as \n. Returns number of characters in the
* buffer, not including trailing NUL (0x00) character (returns zero
* for EOF at start of line).
* @param fp file pointer to open input stream.
* @param buff pointer to buffer.
* @param size size of input buffer, including trailing nul.
* @return number of characters read.
*/
size_t getLine(FILE *fp, char *buff, size_t size)
{
    size_t n = 0;
    int c;
    while (n+1 < size && (c = getc(fp)) != EOF)
    {
        if (c == CR)
        {
            if ((c = getc(fp)) != LF && c != EOF)
                ungetc(c, fp);
            buff[n++] = '\n';
            break;
        }
        else if (c == LF)
        {
            if ((c = getc(fp)) != CR && c != EOF)
                ungetc(c, fp);
            buff[n++] = '\n';
            break;
        }
        buff[n++] = c;
    }
    buff[n] = '\0';
    return n;
}

int main(int argc, char **argv)
{
    int i;
    for (i = 1; i < argc; i++)
    {
        FILE *fp = fopen(argv[i], "r");
        if (fp)
        {
            char buff[16];
            int n;
            while ((n = getLine(fp, buff, 16)) > 0)
            {
                char *p = buff;
                printf("%4d:", n);
                while (*p)
                    if (isprint(*p))
                        printf(" %c", *p++);
                    else
                        printf(" 0x%.2X", *p++);
                printf("\n");
            }
            fclose(fp);
        }
    }
    return 0;
}

$ cc -pedantic -W -Wall getline.c

$ xdump /tmp/test
00000000: 4C 69 6E 65 20 74 65 72 6D 69 6E 61 74 65 64 20 |Line terminated |
00000010: 62 79 20 4C 46 0A 54 65 72 6D 20 77 69 74 68 20 |by LF.Term with |
00000020: 43 52 0D 54 65 72 6D 20 77 69 74 68 20 43 52 4C |CR.Term with CRL|
00000030: 46 0D 0A 54 65 72 6D 20 77 69 74 68 20 4C 46 43 |F..Term with LFC|
00000040: 52 0A 0D 4C 61 73 74 20 6C 69 6E 65 0A -- -- -- |R..Last line.___|

$ ./a.out /tmp/test
15: L i n e t e r m i n a t e d
7: b y L F 0x0A
13: T e r m w i t h C R 0x0A
15: T e r m w i t h C R L F 0x0A
15: T e r m w i t h L F C R 0x0A
10: L a s t l i n e 0x0A

(My getLine() also doesn't object to embedded NUL characters, although
the main program above will treat them as end of string as written and
vim won't let me insert them into a file and I couldn't be bothered to
fire up a hex editor to do it).

Chris C
 
W

Walter Roberson

:Have you considered that the files may be on a shared filesystem,
:accessed by both Unix and Win/DOS? Since NFS, for example, has no idea
:of what is accessing the files or whether they are supposed to be 'text'
:or 'binary' it can't do any translation.

NFS was also historically prone to dropping a block of nulls into
the middle of what was expected to be a text file, especially mailboxes
(locking issues...)
 
D

David Mathog

Flash said:
Yes. The idea is that getc will be a macro that can evaluate its
parameter more than once where as fgetc (even if implemented as a macro
as well as a function) has to behave like a function. So getc can take
more shortcuts and be more efficient.

Tested a few platforms and found only one where getc was faster
than fgetc (gcc 3.2.2 on Solaris 8). In all other cases they ran
at the same speed.

You could use fread and buffer stuff yourself, but then people could not
mix calls to your super_fgetc and the standard functions.

Hmm. Don't really want to read past the EOL with fread, which means
using it 1 char at a time like fgetc. Tried that and it ran at less
than half the speed of fgetc (gcc 3.3.2 on linux). Not surprising since
it is not usually called that way. The reason
I tried it is that when pure binary files were fed through
a pipe on windows xp to a tiny fgetc testbed they terminated
with an EOF only a few bytes into the program:

testfgetc <drivers.cab

where the loop being tested was:

while(fgetc(stdin) != EOF){}

So I tried replacing that with this loop:

while(fread(&readchar,1,1,stdin) !=0){}

and it terminated at the exact same place. A little googling
found some references to ^Z in the input stream having this effect.
Great. So there is apparently an intrinsic problem passing
binary data through Windows XP pipes. On linux and Solaris
both forms happily read through a pure binary file in a
pipe without throwing a premature EOF.

The pipe was eliminated on Windows by using an explicit:

fin=fopen(filename,"rb")

at which point fgetc and getc and fread all were able to read the binary
file correctly. Unfortunately at that point fgetc/getc stopped treating
CR LF as a line terminator and returned both characters instead of
just a single "\n". That's ugly enough on Windows but should
be truly hideous indeed for some of the text file formats provided by
RMS on VMS.

I'm not going to worry about an fgets() that can
read a "line" containing arbitrary binary characters in all situations
on all platforms. For now I just need one that can handle embedded
null characters in files which are otherwise valid text files. The one
posted here can apparently do that on both Windows and linux/Solaris.

Thanks,

David Mathog
(e-mail address removed)
 
K

Keith Thompson

David Mathog said:
Hmm. Don't really want to read past the EOL with fread, which means
using it 1 char at a time like fgetc. Tried that and it ran at less
than half the speed of fgetc (gcc 3.3.2 on linux). Not surprising since
it is not usually called that way. The reason
I tried it is that when pure binary files were fed through
a pipe on windows xp to a tiny fgetc testbed they terminated
with an EOF only a few bytes into the program:

testfgetc <drivers.cab

where the loop being tested was:

while(fgetc(stdin) != EOF){}

So I tried replacing that with this loop:

while(fread(&readchar,1,1,stdin) !=0){}

and it terminated at the exact same place. A little googling
found some references to ^Z in the input stream having this effect.
Great. So there is apparently an intrinsic problem passing
binary data through Windows XP pipes. On linux and Solaris
both forms happily read through a pure binary file in a
pipe without throwing a premature EOF.

Sure, because stdin is a text stream, not a binary stream. If you
want to read binary data on stdin, you *might* be able to use
freopen(). It's implementation-defined whether this is allowed (and
I may be missing something else, since I've never tried this).

It works on Unix-like systems because they don't make a strong
distinction between text and binary files. EOF, for either text or
binary files, is marked by the end of the file, not by any special
character.
 
K

Kevin D. Quitt

Possibly. I'm sure there is one system which used LFCR as well.

Ever wonder why CRLF is traditional? It's because the time it took for
the print head to actually return to the left margin was usually longer
than the time it took to feed the paper up a line, so CRLF saved time.
 
W

Walter Roberson

:Sure, because stdin is a text stream, not a binary stream. If you
:want to read binary data on stdin, you *might* be able to use
:freopen(). It's implementation-defined whether this is allowed

freopen() silently ignores failures to close the existing file,
and always opens the new file provided that appropriate access
exists (and that the file exists, or can be created, as appropriate).
freopen() does a full close() first.

I suspect you may have been thinking of fdopen() instead of freopen().
 
W

Walter Roberson

:Ever wonder why CRLF is traditional? It's because the time it took for
:the print head to actually returned to the left margin usually was longer
:than the time it took to feed the paper up a line, so CRLF saved time.

But CR followed by a printable character was supposed to return to the
margin and then print the new character at the beginning of the line.
Therefore the mechanism that implemented that had to have a look-ahead --
and that being the case, LFCR could have worked just as well.
 
