Reading from files and range of char and friends

Spiros Bousbouras

If you are reading from a file by successively calling fgetc() is there
any point in storing what you read in anything other than unsigned
char ? If you try to store it in char or signed char then it's possible
that what you read may fall outside the range of the type in which case
you get implementation defined behavior according to 6.3.1.3 p. 3. So
then why doesn't fgets() get unsigned char* as first argument ? It
would make the life of the user simpler and possibly also the life of
the implementor.
 
Angel

If you are reading from a file by successively calling fgetc() is there
any point in storing what you read in anything other than unsigned
char ?

Yes, when you read EOF which is not an unsigned char.

"fgetc() reads the next character from stream and returns
it as an unsigned char cast to an int, or EOF on end of file or
error."
(From the Linux man pages.)
 
Spiros Bousbouras

Yes, when you read EOF which is not an unsigned char.

In my mind I was making a distinction between storing and temporarily
assigning but I guess it wasn't clear. What I had in mind was something
like:

unsigned char arr[some_size] ;
int a ;

while ( (a = fgetc(f)) != EOF) arr[position++] = a ;

Would there be any reason for arr to be something other than
unsigned char ?
 
Angel

Yes, when you read EOF which is not an unsigned char.

In my mind I was making a distinction between storing and temporarily
assigning but I guess it wasn't clear. What I had in mind was something
like:

unsigned char arr[some_size] ;
int a ;

while ( (a = fgetc(f)) != EOF) arr[position++] = a ;

Would there be any reason for arr to be something other than
unsigned char ?

No, but you should use a cast there or your compiler might balk because
unsigned char is likely to have fewer bits than int.

fgetc() returns an int because EOF has to have a value that cannot
normally be read from a file. Once you've determined that the read value
is not EOF, it's safe to store it as an unsigned char.

And in C there is no difference between "storing" and "temporarily
assigning". Every assignment lasts until overwritten.
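To make that concrete, a minimal sketch of the loop with the explicit
cast; the bounds check and the function wrapper are additions not in
the original fragment:

#include <stdio.h>

/* Read up to max bytes from f into arr; returns the number read. */
size_t read_bytes(FILE *f, unsigned char *arr, size_t max)
{
    size_t position = 0;
    int a;

    while (position < max && (a = fgetc(f)) != EOF)
        arr[position++] = (unsigned char)a;  /* a is 0..UCHAR_MAX here */

    return position;
}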
 
Paul N

Yes, when you read EOF which is not an unsigned char.

In my mind I was making a distinction between storing and temporarily
assigning but I guess it wasn't clear. What I had in mind was something
like:

unsigned char arr[some_size] ;
int a ;

while ( (a = fgetc(f)) != EOF) arr[position++] = a ;

Would there be any reason for arr to be something other than
unsigned char ?

char is normally used for storing characters, and I think that is what
it was designed for. So it seems a bit odd not to use it. If you're
going to use the str* functions to manipulate what you've read in,
then storing it as char seems sensible, and not doing so is likely to
require some nasty casts.

In my view anyway...
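For illustration, a minimal sketch of that approach: a plain char
buffer manipulated directly with the str* functions, no casts needed
(the buffer size is arbitrary):

#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[256];

    if (fgets(line, sizeof line, stdin) != NULL) {
        line[strcspn(line, "\n")] = '\0';  /* strip the newline, if any */
        printf("read %zu characters\n", strlen(line));
    }
    return 0;
}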
 
Spiros Bousbouras

If you are reading from a file by successively calling fgetc() is there
any point in storing what you read in anything other than unsigned
char ?

Yes, when you read EOF which is not an unsigned char.

In my mind I was making a distinction between storing and temporarily
assigning but I guess it wasn't clear. What I had in mind was something
like:

unsigned char arr[some_size] ;
int a ;

while ( (a = fgetc(f)) != EOF) arr[position++] = a ;

Would there be any reason for arr to be something other than
unsigned char ?

char is normally used for storing characters, and I think that is what
it was designed for. So it seems a bit odd not to use it.

But if arr[] is char how do you avoid the implementation defined
behavior when doing arr[position++] = a ?
 
Angel

In my mind I was making a distinction between storing and temporarily
assigning but I guess it wasn't clear. What I had in mind was something
like:

unsigned char arr[some_size] ;
int a ;

while ( (a = fgetc(f)) != EOF) arr[position++] = a ;

Would there be any reason for arr to be something other than
unsigned char ?

char is normally used for storing characters, and I think that is what
it was designed for. So it seems a bit odd not to use it.

But if arr[] is char how do you avoid the implementation defined
behavior when doing arr[position++] = a ?

Depends on what exactly you are reading. If it's a normal text file
encoded in ASCII, converting the values read by fgetc() should be safe
because ASCII values are only 7 bits and will fit into a char.

If it's a binary file though, you'll have to use unsigned char, and
you should consider using fread instead.
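A minimal sketch of the fread() route for binary input (the file name
and buffer size are made up):

#include <stdio.h>

int main(void)
{
    unsigned char buf[4096];
    FILE *f = fopen("input.bin", "rb");  /* binary mode */
    size_t n;

    if (f == NULL)
        return 1;

    while ((n = fread(buf, 1, sizeof buf, f)) > 0) {
        /* process the n bytes in buf[0..n-1] */
    }

    fclose(f);
    return 0;
}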
 
Spiros Bousbouras

In my mind I was making a distinction between storing and temporarily
assigning but I guess it wasn't clear. What I had in mind was something
like:

unsigned char arr[some_size] ;
int a ;

while ( (a = fgetc(f)) != EOF) arr[position++] = a ;

Would there be any reason for arr to be something other than
unsigned char ?

char is normally used for storing characters, and I think that is what
it was designed for. So it seems a bit odd not to use it.

But if arr[] is char how do you avoid the implementation defined
behavior when doing arr[position++] = a ?

Depends on what exactly you are reading. If it's a normal text file
encoded in ASCII, converting the values read by fgetc() should be safe
because ASCII values are only 7 bits and will fit into a char.

If it's a binary file though, you'll have to use unsigned char, and
you should consider using fread instead.

And what if it's a non-ASCII text file ? It could be ISO-8859-1 or
UTF-8. An extra complication is that you may have to read some of the
file in order to determine what kind of information it contains.
 
Angel

In my mind I was making a distinction between storing and temporarily
assigning but I guess it wasn't clear. What I had in mind was something
like:

unsigned char arr[some_size] ;
int a ;

while ( (a = fgetc(f)) != EOF) arr[position++] = a ;

Would there be any reason for arr to be something other than
unsigned char ?

char is normally used for storing characters, and I think that is what
it was designed for. So it seems a bit odd not to use it.

But if arr[] is char how do you avoid the implementation defined
behavior when doing arr[position++] = a ?

Depends on what exactly you are reading. If it's a normal text file
encoded in ASCII, converting the values read by fgetc() should be safe
because ASCII values are only 7 bits and will fit into a char.

If it's a binary file though, you'll have to use unsigned char, and
you should consider using fread instead.

And what if it's a non-ASCII text file ? It could be ISO-8859-1 or
UTF-8. An extra complication is that you may have to read some of the
file in order to determine what kind of information it contains.

fgetc() is guaranteed to return either an unsigned char or EOF, so that
always works. Interpreting the read data is up to your program and will
depend on what exactly you are trying to accomplish.

UTF-8, as the name implies, is 8 bits wide and will fit in an unsigned
char (it will fit in a signed char too, but values >127 will be
converted to negative values), as does ISO-8859-1. For character
encodings with more bits, there is fgetwc().
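A minimal sketch of the fgetwc() route, assuming the file's encoding
matches the locale's multibyte encoding (the file name is made up):

#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    FILE *f;
    wint_t wc;

    setlocale(LC_ALL, "");  /* use the environment's encoding */
    f = fopen("input.txt", "r");
    if (f == NULL)
        return 1;

    while ((wc = fgetwc(f)) != WEOF)  /* WEOF plays the role of EOF */
        wprintf(L"U+%04lX\n", (unsigned long)wc);

    fclose(f);
    return 0;
}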
 
Keith Thompson

Spiros Bousbouras said:

If you are reading from a file by successively calling fgetc() is there
any point in storing what you read in anything other than unsigned
char ?

Yes, when you read EOF which is not an unsigned char.

In my mind I was making a distinction between storing and temporarily
assigning but I guess it wasn't clear. What I had in mind was something
like:

unsigned char arr[some_size] ;
int a ;

while ( (a = fgetc(f)) != EOF) arr[position++] = a ;

Would there be any reason for arr to be something other than
unsigned char ?

char is normally used for storing characters, and I think that is what
it was designed for. So it seems a bit odd not to use it.

But if arr[] is char how do you avoid the implementation defined
behavior when doing arr[position++] = a ?

Typically by ignoring the issue. (Well, this doesn't avoid
the implementation defined behavior; it just assumes it's
ok.) On any system where this is a sensible thing to do, the
implementation-defined behavior is almost certain to be what you
want. Assigning a value exceeding CHAR_MAX to a char (assuming
plain char is signed) *could* give you a strange result, or even
raise an implementation-defined signal, but any implementation that
chose to do such a thing would break a lot of existing code.

C uses plain char (which may be signed) for strings, but it reads
characters from files as unsigned char values. IMHO this is a flaw
in the language. A byte read from a file with a representation
of 10101001 (0xa9) is far more likely to mean 169 than -87 (it's
a copyright symbol in Latin-1, 'z' in EBCDIC).

One solution might be to require plain char to be unsigned, but that
causes inefficient code for some operations -- which was more of an
issue in the PDP-11 days than it is now, but it's probably still
significant.

Another might be to have fgetc() return an int representing either
a *plain* char value or EOF, but it's too late to change that.

I'm usually a strong advocate for writing code as portably as possible,
but in this case I suspect that working around the unsigned char vs.
plain char mismatch would be more effort than it's worth.
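The 0xa9 example in runnable form (the conversion to signed char is
implementation-defined; -87 is the usual two's complement outcome):

#include <stdio.h>

int main(void)
{
    unsigned char u = 0xA9;
    signed char s = (signed char)0xA9;  /* implementation-defined value */

    printf("as unsigned char: %d\n", u);  /* 169 */
    printf("as signed char:   %d\n", s);  /* typically -87 */
    return 0;
}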
 
Eric Sosman

If you are reading from a file by successively calling fgetc() is there
any point in storing what you read in anything other than unsigned
char ?

Sure. To see one reason in action, try

unsigned char uchar_password[SIZE];
...
if (strcmp(uchar_password, "SuperSecret") == 0) ...

If you try to store it in char or signed char then it's possible
that what you read may fall outside the range of the type in which case
you get implementation defined behavior according to 6.3.1.3 p. 3.

Yes. This is, IMHO, a weakness in the library design, a weakness
inherited from the pre-Standard days that also gave us gets(). The
practical consequence is that the implementation must define the
behavior "usefully" in order to make the library work as desired.
(The situation is particularly bad for systems with signed-magnitude
or ones' complement notations, where the sign of zero is obliterated
on conversion to unsigned char and thus cannot be recovered again
after getc().)

then why doesn't fgets() get unsigned char* as first argument ?

Hysterical raisins, I'd guess.

In-band signaling works well in some situations -- NULL for a
failed malloc() or strchr() or getenv(), for example -- but C has
used it in situations where the benefits are not so clear. getc()
is one of those, strtoxxx() is another, and no doubt there are other
situations where the "error return" can be confused with a perfectly
valid value. Even a failed bsearch() could usefully return something
more helpful than NULL, were there an independent channel to indicate
"I didn't find it."
 
Francois Grieu

Yes, when you read EOF which is not an unsigned char.

In my mind I was making a distinction between storing and temporarily
assigning but I guess it wasn't clear. What I had in mind was something
like:

unsigned char arr[some_size] ;
int a ;

while ( (a = fgetc(f)) != EOF) arr[position++] = a ;

Assuming position is initially 0 and a == EOF is not needed, try

position = fread(arr, 1, some_size, f);

This will not cause UB if the input is too big, and it has
a fair chance of being slightly faster.

Would there be any reason for arr to be something other than
unsigned char ?

Usually no (possible exception: dead slow type conversion).
Whenever fgetc(f) does not return EOF (being passed a valid f),
it returns an unsigned char cast to an int, and casting that
int back to unsigned char causes no data loss.

Francois Grieu
 
Spiros Bousbouras

If you are reading from a file by successively calling fgetc() is there
any point in storing what you read in anything other than unsigned
char ?

Sure. To see one reason in action, try

unsigned char uchar_password[SIZE];
...
if (strcmp(uchar_password, "SuperSecret") == 0) ...

Just to be clear , the only thing that can go wrong with this example
is that strcmp() may try to convert the elements of uchar_password to
char thereby causing the implementation defined behavior. The same
issue could arise with any other str* function. Or is there something
specific about your example that I'm missing ?

Yes. This is, IMHO, a weakness in the library design, a weakness
inherited from the pre-Standard days that also gave us gets(). The
practical consequence is that the implementation must define the
behavior "usefully" in order to make the library work as desired.
(The situation is particularly bad for systems with signed-magnitude
or ones' complement notations, where the sign of zero is obliterated
on conversion to unsigned char and thus cannot be recovered again
after getc().)

If getc() read int's from files instead of unsigned char's would it be
realistically possible that reading from a file would return a negative
zero ? That would be one strange file.

Hysterical raisins, I'd guess.

For those who didn't get it , that's historical reasons.

In-band signaling works well in some situations -- NULL for a
failed malloc() or strchr() or getenv(), for example -- but C has
used it in situations where the benefits are not so clear. getc()
is one of those, strtoxxx() is another, and no doubt there are other
situations where the "error return" can be confused with a perfectly
valid value.

I don't see how this can happen with getc(). The only improvement I
can think of is that you could have two different return values to
denote exceptional situations instead of just EOF , one value would
denote end of file and the other error. But the current interface of
getc() could accommodate this just fine , you would only need to make
the 2 exceptional values negative.

Even a failed bsearch() could usefully return something
more helpful than NULL, were there an independent channel to indicate
"I didn't find it."

--
If strings doesn't work, then there's the "Read Microsoft" tool, rm,
which gives you the useful content of Word files that strings can't
extract and helpfully moves the hideous fonts, ugly typography, macro
viruses, and general bloat that make up the rest of this class of Word
files into the bit bucket for you.
Dave Vandervies
 
Keith Thompson

Spiros Bousbouras said:
If you are reading from a file by successively calling fgetc() is there
any point in storing what you read in anything other than unsigned
char ?

Sure. To see one reason in action, try

unsigned char uchar_password[SIZE];
...
if (strcmp(uchar_password, "SuperSecret") == 0) ...

Just to be clear , the only thing that can go wrong with this example
is that strcmp() may try to convert the elements of uchar_password to
char thereby causing the implementation defined behavior. The same
issue could arise with any other str* function. Or is there something
specific about your example that I'm missing ?

The call to strcmp() violates a constraint. strcmp() expects const
char* (a non-const char* is also ok), but uchar_password, after
the implicit conversion, is of type unsigned char*. Types char*
and unsigned char* are not compatible, and there is no implicit
conversion from one to the other.

If you use an explicit cast, it will *probably* work as expected,
but without the cast the compiler is permitted to reject it.
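A sketch of the call with the explicit cast (SIZE and the buffer
contents are made up; the cast cures the constraint violation, nothing
more):

#include <stdio.h>
#include <string.h>

#define SIZE 32

int main(void)
{
    unsigned char uchar_password[SIZE] = "SuperSecret";

    if (strcmp((char *)uchar_password, "SuperSecret") == 0)
        puts("match");
    return 0;
}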

[...]
If getc() read int's from files instead of unsigned char's would it be
realistically possible that reading from a file would return a negative
zero ? That would be one strange file.

What would be so strange about it? If a file contains a sequence of
ints, stored as binary, and the implementation has a distinct
representation for negative zero, then the file could certainly contain
negative zeros.

[...]
 
Spiros Bousbouras

Spiros Bousbouras said:
unsigned char arr[some_size] ;
int a ;

while ( (a = fgetc(f)) != EOF) arr[position++] = a ;

Would there be any reason for arr to be something other than
unsigned char ?

char is normally used for storing characters, and I think that is what
it was designed for. So it seems a bit odd not to use it.

But if arr[] is char how do you avoid the implementation defined
behavior when doing arr[position++] = a ?

Typically by ignoring the issue. (Well, this doesn't avoid
the implementation defined behavior; it just assumes it's
ok.) On any system where this is a sensible thing to do, the
implementation-defined behavior is almost certain to be what you
want.

Is there a system which has stdio.h but reading from a file and storing
what you read in an array is not a sensible thing to do ?

[...]
C uses plain char (which may be signed) for strings, but it reads
characters from files as unsigned char values. IMHO this is a flaw
in the language. A byte read from a file with a representation
of 10101001 (0xa9) is far more likely to mean 169 than -87 (it's
a copyright symbol in Latin-1, 'z' in EBCDIC).

Which makes me wonder if there are any character encodings in use where
some characters get encoded by negative numbers.

One solution might be to require plain char to be unsigned, but that
causes inefficient code for some operations -- which was more of an
issue in the PDP-11 days than it is now, but it's probably still
significant.

Another might be to have fgetc() return an int representing either
a *plain* char value or EOF, but it's too late to change that.

The standard could say that if an implementation offers stdio.h then
the following function

int foo(unsigned char a) {
    char b = a ;
    unsigned char c = b ;
    return a == c ;
}

always returns 1. This I think would be sufficient to be able to assign
the return value of fgetc() to char (after checking for EOF) without
worries. But does it leave any existing implementations out ? And while
I'm at it , how do existing implementations handle conversion to a
signed integer type if the value doesn't fit ? Does anyone have any
unusual examples ?

Another approach would be to have a macro __WBUC2CA (well behaved
unsigned char to char assignment) which will have the value 1 or 0 and
if it has the value 1 then foo() above will be guaranteed to return 1.
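An exhaustive runtime check of that round-trip property might look
like this sketch (check_roundtrip is a made-up name, not something the
standard provides):

#include <limits.h>

/* Returns 1 if every unsigned char value survives the trip through
   char and back, 0 otherwise. */
int check_roundtrip(void)
{
    unsigned int v;

    for (v = 0; v <= UCHAR_MAX; v++) {
        char b = (char)(unsigned char)v;
        if ((unsigned char)b != v)
            return 0;
    }
    return 1;
}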
 
Tim Rentsch

Eric Sosman said:
If you are reading from a file by successively calling fgetc() is there
any point in storing what you read in anything other than unsigned
char ?

Sure. To see one reason in action, try

unsigned char uchar_password[SIZE];
...
if (strcmp(uchar_password, "SuperSecret") == 0) ...

If you try to store it in char or signed char then it's possible
that what you read may fall outside the range of the type in which case
you get implementation defined behavior according to 6.3.1.3 p. 3.

Yes. This is, IMHO, a weakness in the library design, a weakness
inherited from the pre-Standard days that also gave us gets(). The
practical consequence is that the implementation must define the
behavior "usefully" in order to make the library work as desired.
(The situation is particularly bad for systems with signed-magnitude
or ones' complement notations, where the sign of zero is obliterated
on conversion to unsigned char and thus cannot be recovered again
after getc().) [snip subsequent paragraphs]

Do you mean to say that if a file has a byte with a bit
pattern corresponding to a 'char' negative-zero, and
that byte is read (in binary mode) with getc(), the
result of getc() will be zero? If that's what you're
saying I believe that is wrong.
 
Tim Rentsch

Spiros Bousbouras said:
If getc() read int's from files instead of unsigned char's would it be
realistically possible that reading from a file would return a negative
zero ?

A call to getc() cannot return negative zero. The reason is,
getc() is defined in terms of fgetc(), which returns an
'unsigned char' converted to an 'int', and such conversions
cannot produce negative zeros.
 
Tim Rentsch

Keith Thompson said:
What would be so strange about it? If a file contains a sequence of
ints, stored as binary, and the implementation has a distinct
representation for negative zero, then the file could certainly contain
negative zeros.

I think the question he was asking is something different, which
is, "can the int values produced by getc() ever be (int) negative
zeros?", to which the answer is they cannot.
 
Tim Rentsch

Spiros Bousbouras said:

If you are reading from a file by successively calling fgetc() is there
any point in storing what you read in anything other than unsigned
char ?

Yes, when you read EOF which is not an unsigned char.

In my mind I was making a distinction between storing and temporarily
assigning but I guess it wasn't clear. What I had in mind was something
like:

unsigned char arr[some_size] ;
int a ;

while ( (a = fgetc(f)) != EOF) arr[position++] = a ;

Would there be any reason for arr to be something other than
unsigned char ?

char is normally used for storing characters, and I think that is what
it was designed for. So it seems a bit odd not to use it.

But if arr[] is char how do you avoid the implementation defined
behavior when doing arr[position++] = a ?

Assuming: the bits are in the same places for the implementation that
wrote the file and the implementation reading the file; and CHAR_BIT
is also the same; and UCHAR_MAX < INT_MAX; then you could do this:

arr[position++] = a <= CHAR_MAX ? a : a - (UCHAR_MAX+1);

which works for all values that the target machine supports.
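Wrapped in a loop, the idea might look like this sketch (the bounds
check is an addition; the assumptions are the ones listed above):

#include <limits.h>
#include <stdio.h>

/* Read up to max bytes of f into a plain char array, mapping values
   above CHAR_MAX to the corresponding negative values. */
size_t read_as_chars(FILE *f, char *arr, size_t max)
{
    size_t position = 0;
    int a;

    while (position < max && (a = fgetc(f)) != EOF)
        arr[position++] = a <= CHAR_MAX ? a : a - (UCHAR_MAX + 1);

    return position;
}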
 
Tim Rentsch

Angel said:
[snip]

UTF-8, as the name implies, is 8 bits wide and will fit in an unsigned
char (it will fit in a signed char too,

It will on most implementations but the Standard does not
require that.

but values >127 will be converted to negative values),

Again true on most implementations but not Standard-guaranteed.
 
