getc and "large" bytes

vippstar

Assuming all the values of int are in the range of unsigned char, what
happens if getc returns EOF?
Is it possible that EOF was the value of the byte read?
Does that mean that code aiming for maximum portability needs to check
both feof() and ferror()?
(For example, if both feof() and ferror() return 0 for the stream when
getc() returned EOF, consider EOF a valid byte read.)
To me, that seems to be the case, but maybe the standard says this is
incorrect.

As always, all replies appreciated.
 
vippstar

(e-mail address removed) said:


The int type must be able to represent values in the range INT_MIN to -1,
none of which values are in the range of unsigned char (which, lacking a
sign bit, cannot represent negative values).
I'm talking about the case that both int and unsigned char are 16
bits, and to be honest I'm still not convinced that this is false.
 
Richard Tobin

Assuming all the values of int are in the range of unsigned char, what
happens if getc returns EOF?

If int and char are the same size, and all possible unsigned char
values can be read, then it is possible that getc() will attempt to
convert to an int a value which cannot be represented as one. This is
implementation-defined. Assuming it works in the usual way, it may
return a negative integer equal to EOF.
Does that mean that code aiming for maximum portability needs to check
for both feof() and ferror()?

Yes, but it seems to me that undefined behaviour is involved.

For maximum portability, don't use machines like that :)

-- Richard
 
Ben Pfaff

Would you please elaborate?

-1 is in the range of int.
-1 is not in the range of unsigned char.
Therefore it is not true that all the values of int are in the
range of unsigned char.
 
Bartc

Ben Pfaff said:
-1 is in the range of int.
-1 is not in the range of unsigned char.
Therefore it is not true that all the values of int are in the
range of unsigned char.

The OP mentioned an example where both might be 16 bits. So -1 in one could
be 0xFFFF in the other, causing ambiguity in the (I think unlikely) event of
reading a 16-bit character 0xFFFF from a file with 16-bit encoding.

(How would such a character size read standard 8-bit files? By
zero-extending to 16?)
 
Keith Thompson

I'm talking about the case that both int and unsigned char are 16
bits, and to be honest I'm still not convinced that this is false.

Your underlying point is right; you just stated it incorrectly. The
problem occurs when not all values of unsigned char are in the range
of int.

The value returned by getc() is either the next character from the
input stream, interpreted as an unsigned char and converted to int, or
the value EOF (which must be negative and is typically -1).

On most systems, all values of type unsigned char can be converted to
int without changing their numeric value.

If both int and unsigned char are 16 bits, then (a) the conversion
from unsigned char to int is implementation-defined for values
numerically greater than INT_MAX, and (b) some valid unsigned char
value might be converted to the value EOF.

You can work around (b) by checking feof() and ferror() after getc()
returns EOF. If both are false, then you can assume that you read a
legitimate character (say, 0xFFFF) that happened to be converted to EOF
(or that there's a bug in the implementation's feof() or ferror()
function, which might be almost as likely). Most programmers don't
bother to worry about this possibility. As a result, some code will
likely break if ported to such a system (most likely a DSP, which
probably has a freestanding implementation anyway and thus needn't
support <stdio.h> at all) *if* it happens to read such a character.
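
For illustration, the feof()/ferror() check might be packaged along
these lines (read_byte is just a name used here, not anything from the
standard):

#include <stdio.h>

/* Read one byte from fp, distinguishing a genuine end-of-file or
   error from a byte whose converted value happens to equal EOF.
   Returns 1 and stores the byte in *out on success, 0 otherwise. */
static int read_byte(FILE *fp, int *out)
{
    int c = getc(fp);
    if (c == EOF && (feof(fp) || ferror(fp)))
        return 0;          /* real end-of-file or read error */
    *out = c;              /* possibly a legitimate byte equal to EOF */
    return 1;
}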

Point (a), the implementation-definedness of the conversion, could be a more
serious problem. Given this problem, I can't think of a way to write
*really* portable code to read from a file.

fread() is likely to copy the input directly into an array of
characters, and thus probably won't run into the same problem -- but
fread() is defined to work by calling fgetc(), so the standard doesn't
guarantee that you won't run into exactly the same problem.
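
As a sketch, reading with fread() into an unsigned char buffer avoids
relying on an in-band sentinel at all, since the result is reported as
a count (read_block is a made-up name):

#include <stdio.h>

/* Read up to n bytes into buf. fread() reports how many elements it
   stored, so end of input is detected by the count, not by a
   sentinel value. */
static size_t read_block(FILE *fp, unsigned char *buf, size_t n)
{
    size_t got = fread(buf, 1, n, fp);
    if (got < n && ferror(fp)) {
        /* handle a read error; feof(fp) alone just means end of file */
    }
    return got;
}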

In my opinion, it would be reasonable for the standard to require
INT_MAX >= UCHAR_MAX for all hosted implementations.
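
A program that wants to refuse to build on such an implementation
could check the limits at compile time, for example:

#include <limits.h>

#if UCHAR_MAX > INT_MAX
#error "unsigned char values may not survive conversion to int; getc()/EOF handling needs special care"
#endif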
 
Keith Thompson

Bartc said:
The OP mentioned an example where both might be 16 bits. So -1 in one could
be 0xFFFF in the other, causing ambiguity in the (I think unlikely) event of
reading a 16-bit character 0xFFFF from a file with 16-bit encoding.

No, -1 and 0xFFFF are two different values. It's possible that one of
those values is the result of converting the other.
(How would such a character size read standard 8-bit files? By
zero-extending to 16?)

It would be implementation-defined, or perhaps undefined.

For such an implementation to see an 8-bit file, the file would have
to have been copied to the system, or at least made visible somehow.
Such copying might necessarily involve some sort of conversion. The
conversion is outside the scope of C.
 
Keith Thompson

Eric Sosman said:
It seems to me that the behavior required of getc() places
far-reaching requirements on implementations where `int' and
`char' have the same width. Here are a few:

1) Since `unsigned char' can represent 2**N distinct values
and all of these must be distinguishable when converted to `int',
it follows that `int' must also have 2**N distinct values. Thus,
signed-magnitude and ones' complement representations are ruled
out, and INT_MIN must have its most negative possible value
(that is, INT_MIN == -INT_MAX - 1, all-bits-set cannot be a trap
representation).
[...]

How do you conclude that all 2**N distinct values of type unsigned
char must be distinguishable when converted to int? The result of the
conversion is implementation-defined. If, for example, int has the
range -32768 .. +32767, and unsigned char has the range 0 .. 65535, I
see nothing in the standard that forbids converting all unsigned char
values greater than 32767 to 32767 (saturation). It would break
stdio, but I'm not convinced that that would make it non-conforming
(particularly for a freestanding implementation that needn't provide
stdio).
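
To make the hypothetical concrete, such a saturating conversion would
behave like this (this models what the implementation might do
internally; it is not anything the standard prescribes):

#include <limits.h>

/* One implementation-defined possibility: out-of-range values clamp
   to INT_MAX instead of wrapping. */
static int convert_saturating(unsigned char c)
{
    return c > INT_MAX ? INT_MAX : (int)c;
}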
 
Richard Tobin

Keith Thompson said:
In my opinion, it would be reasonable for the standard to require
INT_MAX >= UCHAR_MAX for all hosted implementations.

An implementation with, say, 16-bit ints and chars is still likely to
have 8-bit data on disk and most other input sources. In which case
fgetc() could read 8-bit values, and have no problem. At least, I
don't know of anything in the standard that prevents this.

One can imagine a future implementation that uses UTF-32-encoded
Unicode characters, and has 32-bit chars. In that case there is no
problem with text (because Unicode in fact only goes up to about
2^20), but binary data would still have the problem.

-- Richard
 
Richard Tobin

The OP mentioned an example where both might be 16 bits. So -1 in one could
be 0xFFFF in the other, causing ambiguity in the (I think unlikely) event of
reading a 16-bit character 0xFFFF from a file with 16-bit encoding.

There is a problem with chars and ints of equal size, but the OP
expressed it wrongly: he talked about "all the values of int [being]
in the range of unsigned char" - which can't happen, because negative
ints aren't in the range of unsigned char. The right way to put it is
that some of the values of unsigned char are not representable as int.

-- Richard
 
Bartc

Keith Thompson said:
No, -1 and 0xFFFF are two different values. It's possible that one of
those values is the result of converting the other.

I don't understand. In a 16-bit system where all 65536 bit patterns might
represent characters, what bit pattern would you use to signal EOF?

(Reading Eric's first post:
An implication of (1) for the programmer is that yes, there
will be a legitimate `unsigned char' value that maps to EOF
when converted to `int'.

this seems to suggest that yes, an ambiguity can occur.)
 
Keith Thompson

Bartc said:
I don't understand. In a 16-bit system where all 65536 bit patterns might
represent characters, what bit pattern would you use to signal EOF?

Bit patterns are not values. A value is an *interpretation* of a bit
pattern; the interpretation is done with respect to a specified type.

For example, an object of type float with the value 123.0 and an
object of type unsigned int with the value 0x42f60000 might happen to
contain the same bit pattern, but they have distinct values because
those bit patterns (representations) are interpreted as having
different types.
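
A small demonstration of the distinction, assuming a 32-bit unsigned
int and IEEE 754 single-precision float (the combination that makes
123.0 come out as 0x42f60000):

#include <stdio.h>
#include <string.h>

int main(void)
{
    float f = 123.0f;
    unsigned int u;

    /* Copy the representation (the bit pattern), not the value. */
    memcpy(&u, &f, sizeof u);
    printf("float value %.1f has bit pattern 0x%08x\n", f, u);
    return 0;
}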
(Reading Eric's first post: [...] this seems to suggest that yes, an
ambiguity can occur.)

Yes.
 
Walter Roberson

Keith Thompson said:
Bit patterns are not values. A value is an *interpretation* of a bit
pattern; the interpretation is done with respect to a specified type.

Not in C: in C, a bit pattern is a *representation* of a value.
A machine doesn't have to use real bits (binary digits) as long as
the operators produce the right -values-.

(Though, I'd want to have another look over the wording on
floating point representations, as I seem to recall that that
wording could be interpreted as requiring Real Bits (SM).)
 
lawrence.jones

Richard Tobin said:
An implementation with, say, 16-bit ints and chars is still likely to
have 8-bit data on disk and most other input sources. In which case
fgetc() could read 8-bit values, and have no problem. At least, I
don't know of anything in the standard that prevents this.

Writing a byte with fputc() and then reading it back with fgetc() must
produce the same value. That won't happen if you only write or read
half the bits.
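
That round trip can be sketched with tmpfile(); on an implementation
where UCHAR_MAX <= INT_MAX this should report no mismatches:

#include <stdio.h>
#include <limits.h>

int main(void)
{
    FILE *fp = tmpfile();      /* opened in binary update mode */
    unsigned int v;
    int c;

    if (fp == NULL)
        return 1;
    for (v = 0; v <= UCHAR_MAX; v++)   /* write every byte value */
        fputc((int)v, fp);
    rewind(fp);
    for (v = 0; v <= UCHAR_MAX; v++) { /* each must read back unchanged */
        c = fgetc(fp);
        if (c != (int)v)
            printf("mismatch: wrote %u, read %d\n", v, c);
    }
    fclose(fp);
    return 0;
}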

-- Larry Jones

It must be sad being a species with so little imagination. -- Calvin
 
Richard Tobin

An implementation with, say, 16-bit ints and chars is still likely to
have 8-bit data on disk and most other input sources. In which case
fgetc() could read 8-bit values, and have no problem. At least, I
don't know of anything in the standard that prevents this.

Writing a byte with fputc() and then reading it back with fgetc() must
produce the same value. That won't happen if you only write or read
half the bits.

Are all possible unsigned char values required to be characters that
can be written and read? If char was 16 bits, could putchar(999)
always produce an i/o error?

-- Richard
 
Ian Collins

Jack said:
I've looked at a large number of posts in this thread, and I'm a bit
puzzled. I have actually done a little work with a DSP where all the
integer types were 32-bit, and am still doing a lot of work on a
platform where char, int, and short are all 16 bits.

I just want to ask all of you a few questions:

1. How many of you have actually ever worked on an implementation
where CHAR_BIT was greater than 8? Show of hands, please.
Yes (Long ago and far away, a DSP)
2. How many of you have actually ever worked on a hosted
implementation where CHAR_BIT was greater than 8, that fully supported
binary streams?
No.

3. How many of you have even heard of hosted environments, with full
support for binary streams, where CHAR_BIT is greater than 8?
No.
 
vippstar

Are all possible unsigned char values required to be characters that
can be written and read? If char was 16 bits, could putchar(999)
always produce an i/o error?
That looks like a good idea. Or the implementation could guarantee that
INT_MAX >= UCHAR_MAX. Mr Thompson suggested that the standard require
this, but I think any implementation that has the problem I can't
express right (as noted by others) would exist solely to make the
programmer's life difficult.
 
