The need for int to capture getchar()'s EOF

Luke Wu

Whenever one runs across programs that use the return value from
getchar() to read input, it's almost always accepted into an
int-defined variable.

I've read explanations on this and have always used this form.
Recently, someone asked me why an int was needed, since -1 (usually
EOF) can be safely represented in a signed char value. In fact, any
negative number can be safely represented in a signed char value, and
wouldn't conflict with the 0-127 positive values.


I ran the following program under two different environments and it worked
fine:

int main()
{
    signed char c;

    while ((c = getchar()) != EOF)
        putchar(c);

    return 0;
}

If one is only expecting input in the form of a 7-bit character set
(i.e., US-ASCII), is it safe to use signed char? I'm asking because I
use C to program little, resource poor, 8bit microcontrollers, and
often have to implement 7-bit ASCII based communications. Using
unsigned char instead of int could improve timing/ROM/RAM. But is
there something I'm missing (lurking bug)?
Help will be greatly appreciated. Thanks
 

Eric Sosman

Luke said:
Whenever one runs across programs that use the return value from
getchar() to read input, it's almost always accepted into an
int-defined variable.

I've read explanations on this and have always used this form.
Recently, someone asked me why an int was needed, since -1 (usually
EOF) can be safely represented in a signed char value. In fact, any
negative number can be safely represented in a signed char value, and
wouldn't conflict with the 0-127 positive values.
[...]
If one is only expecting input in the form of a 7-bit character set
(i.e., US-ASCII), is it safe to use signed char? I'm asking because I
use C to program little, resource poor, 8bit microcontrollers, and
often have to implement 7-bit ASCII based communications. Using
unsigned char instead of int could improve timing/ROM/RAM. But is
there something I'm missing (lurking bug)?
Help will be greatly appreciated. Thanks

If EOF is -1 (as it usually is), it can indeed be
represented as a `char', signed or unsigned -- in the
latter case, it will end up with the value UCHAR_MAX.

The problem isn't (usually) with the representation
of EOF in a `char', but with distinguishing between
`(char)EOF' and a legitimate input character. If you
get 0xFF as an input character (on a system with 8-bit
characters and two's complement negative numbers), that
value will be indistinguishable from -1 -- so how will
you know whether you got an exceptional condition or an
actual input datum? (One way would be to use both feof()
and ferror() on the stream after detecting the ambiguous
value; if both are false, you actually received 0xFF as
an input character. But that's pretty clumsy -- and you
said you were interested in speed, yes?)
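
Roughly what that clumsy feof()/ferror() check might look like -- a
sketch only, assuming 8-bit chars and EOF == -1, so an input byte of
0xFF collides with (char)EOF:

#include <stdio.h>

int main(void)
{
    char c;

    for (;;) {
        c = (char)getchar();
        if (c == (char)EOF) {
            /* Ambiguous: a real 0xFF byte looks the same as EOF here,
             * so ask the stream which one it actually was.            */
            if (feof(stdin) || ferror(stdin))
                break;             /* genuine end of input (or error)  */
            /* otherwise it really was a 0xFF data byte; fall through  */
        }
        putchar(c);
    }
    return 0;
}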

If you really, truly know for certain that all the
legitimate input characters are 7-bit ASCII, and you
really, truly know for certain that EOF is -1, and you're
willing to go cheerfully to Hell if you're wrong, then
yes: you can go ahead and use a `char' to receive the
result of getchar(). All the 7-bit ASCII characters will
turn up with values that are distinct from `(char)-1', and
your code will work.

But it seems to me that you're risking your soul for
two stale Hostess Twinkies and half a can of Dr. Pepper
that's been out in the sun too long. What you get is worth
something, to be sure, but not very much. You're worried
about the speed penalty imposed by using one `int' variable
instead of one `char' variable -- but hold on there, bucko;
there's I/O going on behind the scenes, is there not? Even
my creaky old (but much beloved) 8-bit Kaypro 4-84 running at
4MHz could outpace all the I/O devices it was attached to.
That was twenty years ago, and the speed gap between I/O and
CPU has not exactly narrowed with the passage of time: disks
and such are perhaps 40 or 50 times the speed of those oldies,
but CPUs are something like 1000 times faster than they were.

The game is playable, Luke, but it's probably not worth
the candle.
 

Keith Thompson

Luke Wu said:
Whenever one runs across programs that use the return value from
getchar() to read input, it's almost always accepted into an
int-defined variable.

I've read explanations on this and have always used this form.
Recently, someone asked me why an int was needed, since -1 (usually
EOF) can be safely represented in a signed char value. In fact, any
negative number can be safely represented in a signed char value, and
wouldn't conflict with the 0-127 positive values.


I ran the following program under two different environments and it worked
fine:

int main()
{
    signed char c;

    while ((c = getchar()) != EOF)
        putchar(c);

    return 0;
}

If one is only expecting input in the form of a 7-bit character set
(i.e., US-ASCII), is it safe to use signed char? I'm asking because I
use C to program little, resource poor, 8bit microcontrollers, and
often have to implement 7-bit ASCII based communications. Using
unsigned char instead of int could improve timing/ROM/RAM. But is
there something I'm missing (lurking bug)?
Help will be greatly appreciated. Thanks

You mean using signed char, not using unsigned char, right?

You're writing code that could break (terminate early) if you happen
to read a character that's outside the range of 0..CHAR_MAX. And if
EOF is defined as a value less than CHAR_MIN (unlikely, but legal),
your loop will never terminate.
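
For comparison, the conventional idiom keeps getchar()'s result in an
int and only narrows it to a character after the EOF test:

#include <stdio.h>

int main(void)
{
    int c;    /* int, so every byte value and EOF stay distinct */

    while ((c = getchar()) != EOF)
        putchar(c);

    return 0;
}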

I'm skeptical that using char rather than int will actually improve
anything. You're talking about a single byte difference in the size
of the variable (assuming 16-bit int), and alignment constraints might
mean that the byte you save is wasted anyway. And since the getchar()
function returns an int, assigning it to a character variable may
impose an additional cost to do the conversion; assigning the result
to an int could well be cheaper.

Measure the actual performance of your program. If using a char
rather than an int actually saves you enough time or space, *and* if
you're willing to accept the drawbacks I mentioned (and any others I
haven't thought of), you might consider doing this; otherwise, I
wouldn't. (Personally I probably wouldn't take the time to look into
this in the first place, but that's up to you.) And if you decide to
use a character, add a comment explaining exactly what you're doing
and why.
 

James McIninch

<posted & mailed>

Luke said:
> Whenever one runs across programs that use the return value from
> getchar() to read input, it's almost always accepted into an
> int-defined variable.

True enough. getchar() returns an int.

> I've read explanations on this and have always used this form.
> Recently, someone asked me why an int was needed, since -1 (usually
> EOF) can be safely represented in a signed char value. In fact, any
> negative number can be safely represented in a signed char value, and
> wouldn't conflict with the 0-127 positive values.

That's incorrect logic. The point is that getchar() needs to be able to
return any value representable as an unsigned char -- with 8-bit chars,
0 through 255 decimal. In addition to that, it must be able to return at
least one additional value, EOF, that lies outside that range.
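
A quick way to see the counting argument on your own implementation
(the exact values printed are implementation-defined):

#include <stdio.h>
#include <limits.h>

int main(void)
{
    /* getchar() must be able to return every value 0..UCHAR_MAX
     * (256 values with 8-bit chars) plus the distinct value EOF:
     * 257 possibilities, which cannot all fit into one 8-bit char. */
    printf("UCHAR_MAX = %d, EOF = %d, (char)EOF = %d\n",
           UCHAR_MAX, EOF, (char)EOF);
    return 0;
}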
 

CBFalconer

Luke said:
.... snip ...

If one is only expecting input in the form of a 7-bit character set
(i.e., US-ASCII), is it safe to use signed char? I'm asking because I
use C to program little, resource poor, 8bit microcontrollers, and
often have to implement 7-bit ASCII based communications. Using
unsigned char instead of int could improve timing/ROM/RAM. But is
there something I'm missing (lurking bug)?

The point is to differentiate between EOF and characters. Once you
do anything of the sort, your code is no longer portable. For
example, EOF might be defined as -32768. However, in your embedded
application you probably have control over the actual character set
anyhow, and over how the system finds and signals EOF. The result is
not going to be standards-compliant, so you should isolate the
connecting functions in a separate file and comment them
accordingly.
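
For instance, the isolated file might expose something like the
following; every name here is made up for illustration, and the real
calls depend on your hardware and library:

/* comm_getc.c -- hypothetical isolation layer for the embedded target.
 * Status is returned separately from the data byte, so no in-band EOF
 * value is needed and an unsigned char is enough for a 7-bit payload. */

#include "uart.h"                    /* hypothetical platform header     */

int comm_getc(unsigned char *out)    /* 1 = got a byte, 0 = end of input */
{
    if (uart_end_of_input())         /* hypothetical platform call       */
        return 0;
    *out = uart_read_byte() & 0x7F;  /* hypothetical call; 7-bit ASCII   */
    return 1;
}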
 

Chris Croughton

Luke said:
If one is only expecting input in the form of a 7-bit character set
(i.e., US-ASCII), is it safe to use signed char? I'm asking because I
use C to program little, resource poor, 8bit microcontrollers, and
often have to implement 7-bit ASCII based communications. Using
unsigned char instead of int could improve timing/ROM/RAM. But is
there something I'm missing (lurking bug)?

Well, lurking non-portability at least (where char is unsigned or EOF is
less than CHAR_MIN), even if you are absolutely certain that your input
is only 7-bit. But why on earth are you using getchar() on an 8-bit
microcontroller where you are short of resources anyway? Why pull in a
whole stdio library to do specialised communications? That environment
is an example of where a 'freestanding' implementation is generally used,
with most of the standard functions not present.
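
To make the unsigned-char case concrete, a tiny demonstration (assuming
EOF is -1 and UCHAR_MAX <= INT_MAX) of why the usual != EOF test can
never succeed once the value has been squeezed into an unsigned char:

#include <stdio.h>

int main(void)
{
    unsigned char ch = (unsigned char)EOF;  /* wraps to UCHAR_MAX      */

    /* ch promotes back to a non-negative int, so it never compares
     * equal to EOF; a loop guarded by (ch = getchar()) != EOF would
     * therefore never terminate.                                      */
    if (ch != EOF)
        puts("(unsigned char)EOF no longer compares equal to EOF");
    return 0;
}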

Of course, if you have a special library for that microcontroller which
happens to call its input function getchar(), it can do anything it
wants, and you are not bothered by portability. In that case you are
free to do whatever you want, because it has nothing to do with the
standard C library (although I'd advise renaming the input function if it
doesn't conform; otherwise later programmers are likely to get confused).

(Yes, I used to work on 8 bit micros with a full C library, but in those
days that was all that was available to a home user on a limited
budget...)

Chris C
 

Lawrence Kirby

Eric Sosman said:
> Luke said:
>> Whenever one runs across programs that use the return value from
>> getchar() to read input, it's almost always accepted into an
>> int-defined variable.
>>
>> I've read explanations on this and have always used this form.
>> Recently, someone asked me why an int was needed, since -1 (usually
>> EOF) can be safely represented in a signed char value. In fact, any
>> negative number can be safely represented in a signed char value, and
>> wouldn't conflict with the 0-127 positive values.
>> [...]
>> If one is only expecting input in the form of a 7-bit character set
>> (i.e., US-ASCII), is it safe to use signed char? I'm asking because I
>> use C to program little, resource poor, 8bit microcontrollers, and
>> often have to implement 7-bit ASCII based communications. Using
>> unsigned char instead of int could improve timing/ROM/RAM. But is
>> there something I'm missing (lurking bug)?
>> Help will be greatly appreciated. Thanks

> If EOF is -1 (as it usually is), it can indeed be
> represented as a `char', signed or unsigned -- in the
> latter case, it will end up with the value UCHAR_MAX.

-1 is not representable as an unsigned char; converting it yields
UCHAR_MAX, which is a different value. Typically getc() can return both
EOF and UCHAR_MAX, and they mean different things.

> The problem isn't (usually) with the representation
> of EOF in a `char',

But there are problems, such as (char)EOF != EOF if char is unsigned and
UCHAR_MAX <= INT_MAX. So code such as while ((ch = getc(fp)) != EOF)
will not work properly; the loop here will never terminate.

> but with distinguishing between
> `(char)EOF' and a legitimate input character. If you
> get 0xFF as an input character (on a system with 8-bit
> characters and two's complement negative numbers), that
> value will be indistinguishable from -1 -- so how will
> you know whether you got an exceptional condition or an
> actual input datum? (One way would be to use both feof()
> and ferror() on the stream after detecting the ambiguous
> value; if both are false, you actually received 0xFF as
> an input character. But that's pretty clumsy -- and you
> said you were interested in speed, yes?)

Clumsy, yes, but not too bad performance-wise unless you have a file full
of 255-valued characters.

> If you really, truly know for certain that all the
> legitimate input characters are 7-bit ASCII, and you
> really, truly know for certain that EOF is -1, and you're
> willing to go cheerfully to Hell if you're wrong, then
> yes: you can go ahead and use a `char' to receive the
> result of getchar(). All the 7-bit ASCII characters will
> turn up with values that are distinct from `(char)-1', and
> your code will work.

As long as char is signed, or you work around the problems where it is
unsigned.

> But it seems to me that you're risking your soul for
> two stale Hostess Twinkies and half a can of Dr. Pepper
> that's been out in the sun too long. What you get is worth
> something, to be sure, but not very much. You're worried
> about the speed penalty imposed by using one `int' variable
> instead of one `char' variable -- but hold on there, bucko;

If you're calling a library function that already returns int, you might
as well test the return value for EOF as an int; the overhead will be
minimal. If you're using I/O with non-standard functions, or writing
those I/O functions in a freestanding environment, you do what is most
appropriate according to what you know about the problem.

> there's I/O going on behind the scenes, is there not? Even
> my creaky old (but much beloved) 8-bit Kaypro 4-84 running at
> 4MHz could outpace all the I/O devices it was attached to.
> That was twenty years ago, and the speed gap between I/O and
> CPU has not exactly narrowed with the passage of time: disks
> and such are perhaps 40 or 50 times the speed of those oldies,
> but CPUs are something like 1000 times faster than they were.

Still, if you're working with an 8-bit processor (they are used these
days where price is more important than performance), it is quite
possible that it can't keep up with a "modern" comms device without some
effort. The margins can be squeezed quite tightly in the name of saving
costs.

Lawrence
 
