integer overflow in scanf functions


vid512

hi.

I wanted to know why the scanf functions don't check for overflow
when reading a number. For example, scanf("%d") on a 32-bit machine
treats "1" and "4294967297" as the same.

I tracked the code to where the conversion itself happens. The code in
the scanf functions just ignores the return value from the conversion
procedures.

More info in case of glibc posted here:
http://board.flatassembler.net/topic.php?t=6359

AFAIK the behavior on overflow is not defined, so glibc could treat
this as an error and set errno=ERANGE.
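
For anyone wondering why "1" and "4294967297" collide: 4294967297 is 2^32 + 1, so a conversion that keeps only the low 32 bits is left with 1. A minimal sketch of that arithmetic follows. The helper name low32 is made up for illustration, and this only shows the wraparound, not what any particular scanf is guaranteed to do.

```c
#include <stdint.h>

/* Hypothetical helper: keep only the low 32 bits of a wider value.
   Conversion to an unsigned type is well defined (reduction mod 2^32). */
uint32_t low32(unsigned long long v)
{
    return (uint32_t)v;
}
```

With this, low32(4294967297ULL) and low32(1ULL) compare equal, which matches the observed result.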
 

Walter Roberson

I wanted to know why the scanf functions don't check for overflow
when reading a number. For example, scanf("%d") on a 32-bit machine
treats "1" and "4294967297" as the same.

Because that's how it is spec'd.

"An input item is defined as the longest matching sequence of
characters, unless that exceeds a specified field width, in
which case it is the initial subsequence of that length in
the sequence." [...]

"Except in the case of a % specifier, the input item (or, in the
case of a %n directive, the count of input characters) is
converted to a type appropriate for the conversion specifier. [...]
Unless assignment suppression was indicated by a *, the result
of the conversion is placed in the object pointed to by the first
argument following the format argument that has not
already received a conversion result. If this object does not
have an appropriate type, or if the result of the conversion cannot
be represented in the space provided, the behaviour is undefined."


So there you have it: if you didn't put in a field width, then
the %d is *required* to pull in all the decimal digits there, and
if that's too big for an int, then the result is officially undefined.
This is how fscanf (and hence scanf) are -required- to work according
to the standard.
 

Random832

2006-12-15 said:
Because that's how it is spec'd.

"An input item is defined as the longest matching sequence of
characters,

And in what way is "4294967297" a matching sequence of characters, if
there is no such integer value?
unless that exceeds a specified field width, in
which case it is the initial subsequence of that length in
the sequence." [...]

"Except in the case of a % specifier, the input item (or, in the
case of a %n directive, the count of input characters) is
converted to a type appropriate for the conversion specifier. [...]
Unless assignment suppression was indicated by a *, the result
of the conversion is placed in the object pointed to by the first
argument following the format argument that has not
already received a conversion result. If this object does not
have an appropriate type, or if the result of the conversion cannot
be represented in the space provided, the behaviour is undefined."

It's undefined. Which means there _are_ no requirements. An
implementation is free to treat it as 1, or as 429496729 with 7 still on
the stream, or as such with 7 _not_ still on the stream, or as
4294967295 (saturation), etc, etc

Anyway, I found a possible situation in which my scanf is
non-conformant:

Numerical strings are truncated to 512 characters; for example, %f
and %d are implicitly %512f and %512d.

So, if I send %f

1.0000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000
e1

it converts to 1 instead of 10. Does the standard allow this?
 

jacob navia

Walter Roberson wrote:
I wanted to know why the scanf functions don't check for overflow
when reading a number. For example, scanf("%d") on a 32-bit machine
treats "1" and "4294967297" as the same.


Because that's how it is spec'd.

"An input item is defined as the longest matching sequence of
characters, unless that exceeds a specified field width, in
which case it is the initial subsequence of that length in
the sequence." [...]

"Except in the case of a % specifier, the input item (or, in the
case of a %n directive, the count of input characters) is
converted to a type appropriate for the conversion specifier. [...]
Unless assignment suppression was indicated by a *, the result
of the conversion is placed in the object pointed to by the first
argument following the format argument that has not
already received a conversion result. If this object does not
have an appropriate type, or if the result of the conversion cannot
be represented in the space provided, the behaviour is undefined."


So there you have it: if you didn't put in a field width, then
the %d is *required* to pull in all the decimal digits there, and
if that's too big for an int, then the result is officially undefined.
This is how fscanf (and hence scanf) are -required- to work according
to the standard.

In general functions like scanf are unusable. They are so
problematic, that it is a wonder when they work at all.

Use strtol, or a similar function that will give reasonable
error returns...
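
To make the strtol suggestion concrete, here is a minimal sketch. The names parse_int, PARSE_OK, PARSE_RANGE and PARSE_BAD are invented for illustration, and it assumes the whole string must be one decimal number.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

enum { PARSE_OK, PARSE_RANGE, PARSE_BAD };   /* hypothetical result codes */

/* Hypothetical helper: parse a decimal int from s.
   Returns PARSE_OK and stores the value on success, PARSE_RANGE if the
   value does not fit in int, PARSE_BAD if there are no digits or if
   characters are left over after the number. */
int parse_int(const char *s, int *out)
{
    char *end;
    long v;

    errno = 0;
    v = strtol(s, &end, 10);
    if (end == s)
        return PARSE_BAD;           /* no conversion was performed */
    if (errno == ERANGE || v < INT_MIN || v > INT_MAX)
        return PARSE_RANGE;         /* too big for long, or for int */
    if (*end != '\0')
        return PARSE_BAD;           /* unconsumed characters remain */
    *out = (int)v;
    return PARSE_OK;
}
```

On a machine with 32-bit int, parse_int("4294967297", &v) reports PARSE_RANGE instead of quietly storing 1.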
 

vid512

So, we agree, it's undefined.

Wouldn't it be better to report this overflow as an error? 10 digits
would be read off the file/stream/whatever, and the function would
return as if the number format was invalid, with errno=ERANGE.

I don't think the current behavior is what people expect. And the scanf
functions are doing a lot of "smart" stuff already, just because people
expect such behavior.
 

Walter Roberson

"Except in the case of a % specifier, the input item (or, in the
case of a %n directive, the count of input characters) is
converted to a type appropriate for the conversion specifier. [...]
Unless assignment suppression was indicated by a *, the result
of the conversion is placed in the object pointed to by the first
argument following the format argument that has not
already received a conversion result. If this object does not
have an appropriate type, or if the result of the conversion cannot
be represented in the space provided, the behaviour is undefined."
It's undefined. Which means there _are_ no requirements. An
implementation is free to treat it as 1, or as 429496729 with 7 still on
the stream, or as such with 7 _not_ still on the stream, or as
4294967295 (saturation), etc, etc

No, consumption of the maximum characters is -required-. It cannot
leave the other characters in the stream. The undefined part comes
in the valuation and storage of the overly-long result, not in
how many characters are consumed from input.
 

CBFalconer

jacob said:
.... snip ...

In general functions like scanf are unusable. They are so
problematic, that it is a wonder when they work at all.

Use strtol, or a similar function that will give reasonable
error returns...

No, that requires assigning a buffer of sufficient size, which is
unknown a-priori. Instead take a look at:

<http://cbfalconer.home.att.net/download/txtio.zip>

(which has been revised, but not posted) for a method of reading
values from a text stream without any buffer assignment needed. In
particular see txtinput.c.
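
A rough sketch of the buffer-free idea, for readers who cannot get at the zip. This is not the txtio code; read_uint and its return codes are invented for illustration, and it handles only unsigned decimal input. The value is accumulated one character at a time, so the length of the digit string never matters.

```c
#include <ctype.h>
#include <limits.h>
#include <stdio.h>

/* Hypothetical helper: read an unsigned decimal number from fp without
   any line buffer. Every digit of the field is consumed, even after
   overflow has been detected.
   Returns 1 on success, 0 if no digit was found, -1 on overflow. */
int read_uint(FILE *fp, unsigned *out)
{
    unsigned v = 0;
    int ch, ndigits = 0, overflow = 0;

    while ((ch = getc(fp)) != EOF && isspace(ch))
        continue;                       /* skip leading white space */
    while (ch != EOF && isdigit(ch)) {
        unsigned d = (unsigned)(ch - '0');
        if (v > (UINT_MAX - d) / 10)
            overflow = 1;               /* v * 10 + d would not fit */
        else
            v = v * 10 + d;
        ndigits++;
        ch = getc(fp);
    }
    if (ch != EOF)
        ungetc(ch, fp);                 /* one char of pushback is guaranteed */
    if (ndigits == 0)
        return 0;
    if (overflow)
        return -1;
    *out = v;
    return 1;
}
```

Only a single character of pushback is needed, which is all the standard guarantees for ungetc.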
 

Eric Sosman

Walter said:
No, consumption of the maximum characters is -required-. It cannot
leave the other characters in the stream. The undefined part comes
in the valuation and storage of the overly-long result, not in
how many characters are consumed from input.

Once undefined behavior strikes, the program has no way
to tell how many characters were or were not consumed. All
requirements lose their force in the face of U.B.
 

Random832

2006-12-15 said:
"Except in the case of a % specifier, the input item (or, in the
case of a %n directive, the count of input characters) is
converted to a type appropriate for the conversion specifier. [...]
Unless assignment suppression was indicated by a *, the result
of the conversion is placed in the object pointed to by the first
argument following the format argument that has not
already received a conversion result. If this object does not
have an appropriate type, or if the result of the conversion cannot
be represented in the space provided, the behaviour is undefined."
It's undefined. Which means there _are_ no requirements. An
implementation is free to treat it as 1, or as 429496729 with 7 still on
the stream, or as such with 7 _not_ still on the stream, or as
4294967295 (saturation), etc, etc

No, consumption of the maximum characters is -required-. It cannot
leave the other characters in the stream. The undefined part comes
in the valuation and storage of the overly-long result, not in
how many characters are consumed from input.

No, I don't think you get it.

In an undefined situation, the standard forbids nothing.

Meaning the implementation gets to do whatever the f*** it wants to,
regarding anything, once anything has happened that has been undefined.
 

Peter Nilsson

Eric said:
Once undefined behavior strikes, the program has no way
to tell how many characters were or were not consumed.
All requirements lose their force in the face of U.B.

True, but suppose an implementation defines the usual non-trapping
two's-complement overflow or strtoxxx-style behaviour for the %d fscanf
case; then the behaviour is no longer undefined and the normal rules
apply.

Of course, few implementations go so far as to actually define (i.e.
guarantee) such behaviour, let alone document it.
 

Walter Roberson

2006-12-15 said:
No, I don't think you get it.
In an undefined situation, the standard forbids nothing.
Meaning the implementation gets to do whatever the f*** it wants to,
regarding anything, once anything has happened that has been undefined.

The C90 standard defines a three-part operation, first reading
the characters, then converting the type of the value, and then
attempting to store the received value. The first two parts
do not allow for undefined behaviour: only the storage aspect does.

Therefore, in a conforming C90 implementation, the complete sequence
of decimal digits is certain to be read. Stopping reading the stream
at the maximum usable int length (for %d) is not one of the options.
The "undefined behaviour" might then go through the trouble of
"putting back" the extra characters somehow, but read them first it
must.

Ah, there's a simple way to tell: use assignment suppression. Then no
actual storage attempt takes place, so whether the receiving variable
is the right size or type is not at question, and undefined behaviour
cannot take place. If you then have another format element to read a
value, or use %n to find the number of characters read, you can
determine where the %d scan left off. C90 tells you where you
should be (i.e., after the sequence of decimal characters); if
your system does leave you in the middle then your system is wrong.
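
The check described above can be sketched like this. The helper name digits_consumed is made up for illustration; the point is only that %*d stores nothing, and %n then reports how far the scan got.

```c
#include <stdio.h>

/* Hypothetical helper: report how many characters a suppressed %d
   consumes from s, or -1 if the directive did not match at all.
   Nothing is stored for %*d, so no object has to hold the value. */
int digits_consumed(const char *s)
{
    int n = -1;

    /* %n is only reached if %*d matched; otherwise n stays -1. */
    (void)sscanf(s, "%*d%n", &n);
    return n;
}
```

For the input "4294967297abc", an implementation that consumes the whole digit sequence gives 10 here, leaving "abc" for the next directive.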
 

Chris Torek

Anyway, I found a possible situation in which my scanf is
non-conformant:

Numerical strings are truncated to 512 characters; for example, %f
and %d are implicitly %512f and %512d.

So, if I send %f

1.0000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000
e1

it converts to 1 instead of 10. Does the standard allow this?

Yes:

Environmental limits

[#7] An implementation shall support text files with lines
containing at least 254 characters, including the
terminating new-line character. The value of the macro
BUFSIZ shall be at least 256.

(under "7.13.2 Streams" in the draft .txt file I keep handy).

Most stdio implementations will have *some* convenient limit, as
they will read numerical input into a buffer and then use strtol(),
strtoll(), strtod(), etc., to perform the actual conversions. That
limit must be at least 254, but need not be as high as BUFSIZ (that
is, just because BUFSIZ is, say, 8192, does not mean that scanf()
must be able to eat 8192-digit numbers).
 

Random832

2006-12-16 said:
The C90 standard defines a three-part operation, first reading
the characters, then converting the type of the value, and then
attempting to store the received value. The first two parts
do not allow for undefined behaviour: only the storage aspect does.

And once the storage aspect _does_ have undefined behavior, it can
then go backwards in time and change how the other two aspects operated
in the first place.

In an undefined situation, the C standard forbids nothing.
Therefore, in a conforming C90 implementation, the complete sequence
of decimal digits is certain to be read. Stopping reading the stream
at the maximum usable int length (for %d) is not one of the options.
The "undefined behaviour" might then go through the trouble of
"putting back" the extra characters somehow, but read them first it
must.

It's undefined, there's no rule against time paradoxes.
Ah, there's a simple way to tell: use assignment suppression. Then no
actual storage attempt takes place, so whether the receiving variable
is the right size or type is not at question, and undefined behaviour
cannot take place.

But since the behavior is undefined when assignment suppression is not
used, it's free to act differently than if it is used.
 

Random832

2006-12-16 said:
Anyway, I found a possible situation in which my scanf is
non-conformant:

Numerical strings are truncated to 512 characters; for example, %f
and %d are implicitly %512f and %512d.

So, if I send %f

1.0000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000
e1

it converts to 1 instead of 10. Does the standard allow this?

Yes:

Environmental limits

[#7] An implementation shall support text files with lines
containing at least 254 characters, including the
terminating new-line character. The value of the macro
BUFSIZ shall be at least 256.

And what about sscanf?

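#include <stdio.h>
#include <string.h>

int main(void) {
    char x[515];    /* "1." + 510 zeros + "e1" + terminating 0 */
    double n;
    memset(x + 2, '0', 510);
    x[0] = '1'; x[1] = '.'; x[512] = 'e'; x[513] = '1'; x[514] = 0;
    sscanf(x, "%lf", &n);
    printf("%f\n", n);    /* print the converted value, not the buffer */
    return 0;
}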

prints 1 or 10?
 

Chris Torek

Random832 said:
Does the standard allow [scanf to place limits on the size of
numbers converted with %d, %f, etc]

[snippage]

And what about sscanf?

As far as I can tell, the same rules apply.

Since there is no documentation requirement and no fixed upper
bound (just that "254" I quoted as a lower bound), each scanf
(either each call, or each member of the family, or both) could
use a different limit, too, as long as it is at least 254 each
time.

Practically speaking, I would expect either all the functions
(scanf, fscanf, and sscanf) would have the same limit because they
use the same internal engine; or the engine might "see" that sscanf
is working off a string in memory, hence there is no need to make
a copy of digit-sequences for strto*(), hence sort of "accidentally"
avoid upper limits there. (However, the ruling that conversion
of, e.g., "1.23e-x" must fail, instead of converting "1.23" and
leaving the e-x for the next directive, would make this harder than
one might think at first. If the implementor just used the endptr
parameter from strtod(), sscanf against "1.23e-x" with "%f%s" would
convert two items successfully, instead of failing as required.)
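
A small sketch of the strtod side of that parenthetical. The helper name after_number is invented for illustration. It only shows where strtod stops on "1.23e-x"; it makes no claim about how any given scanf is built.

```c
#include <stdlib.h>

/* Hypothetical helper: convert the longest valid prefix of s with
   strtod and return a pointer to the first unconsumed character. */
const char *after_number(const char *s, double *out)
{
    char *end;

    *out = strtod(s, &end);
    return end;
}
```

Given "1.23e-x", strtod converts "1.23" and the returned pointer is at the 'e', because an exponent part needs at least one digit. A scanf engine that trusted that pointer would carry on with "e-x" as the next field.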
 

CBFalconer

Chris said:
.... snip ...

Practically speaking, I would expect either all the functions
(scanf, fscanf, and sscanf) would have the same limit because they
use the same internal engine; or the engine might "see" that sscanf
is working off a string in memory, hence there is no need to make
a copy of digit-sequences for strto*(), hence sort of "accidentally"
avoid upper limits there. (However, the ruling that conversion
of, e.g., "1.23e-x" must fail, instead of converting "1.23" and
leaving the e-x for the next directive, would make this harder than
one might think at first. If the implementor just used the endptr
parameter from strtod(), sscanf against "1.23e-x" with "%f%s" would
convert two items successfully, instead of failing as required.)

There is no necessity to have ANY string length limit affect these
textstream-to-number conversions. I have written code that avoids
the problem entirely. However the error condition for "1.2e-x"
sequences remains. This can obviously be handled easily when the
input is a string, and is otherwise limited by the guaranteed
lookback (ungetc) level.

I disagree that such an input must fail. The interpretation as a
number, followed by a string, seems perfectly reasonable to me.
The cure here is that the application must check the termination
char for the numeric field.

In addition, there should be no problem at the system level in
providing multi-level ungetc ability, provided that the system
never has to back up across line ends. Since a '\n' will always
terminate any numeric input field, this is no hardship. A short
time ago I wrote a small test program to detect this capability,
and found that DJGPP has it. I published the little test here at
the time. So this reduces to a quality of implementation issue.

In practice this all means that the scanf series of functions
should not be used to input numerics without limiting the call to a
single field.

Here is my test program for ungetc levels (tungetc.c):

#include <stdio.h>
#include <stdlib.h>
#define MAXLN 10

int main(void) {
    char line[MAXLN + 1];
    int ix, ch;

    puts("Test ability to ungetc for multiple chars in one line");
    fputs("Enter no more than 10 chars:", stdout); fflush(stdout);
    ix = 0;
    while ((EOF != (ch = getchar())) && ('\n' != ch)) {
        if (MAXLN <= ix) break;
        line[ix++] = ch;
    }
    line[ix] = '\0';
    if ('\n' != ungetc('\n', stdin)) {
        puts("Can't unget a '\\n'");
        return(EXIT_FAILURE);
    }
    puts(line);
    puts("Trying to push back the whole line");
    while (ix > 0) {
        ch = ungetc(line[--ix], stdin);
        if (ch == line[ix]) putchar(ch);
        else {
            putchar(line[ix]);
            puts(" failed to push back");
            return(EXIT_FAILURE);
        }
    }
    puts("\nTrying to reread the whole line");
    while ((EOF != (ch = getchar())) && ('\n' != ch)) {
        if (ix++ == MAXLN) break;
        putchar(ch);
    }
    return 0;
} /* main */
 

Random832

2006-12-18 said:
Does the standard allow [scanf to place limits on the size of
numbers converted with %d, %f, etc]

[snippage]

And what about sscanf?

As far as I can tell, the same rules apply.

That rule does not allow a limit for any scanf function. It allows
limits on other things (line length, BUFSIZ), which in turn lets an
implementation be written on which no such test case can be constructed
for scanf or fscanf. That is not the same thing.
Since there is no documentation requirement and no fixed upper
bound (just that "254" I quoted as a lower bound), each scanf
(either each call, or each member of the family, or both) could
use a different limit, too, as long as it is at least 254 each
time.

The section you quoted has absolutely nothing to do with any *scanf
function, and even less to do with sscanf.
 

Random832

2006-12-18 said:
There is no necessity to have ANY string length limit affect these
textstream-to-number conversions.

He was apparently saying, though, that it is _permitted_ for an
implementation to limit it to 512 characters, and quoted an unrelated
section of the standard that makes it difficult [but clearly not
impossible, as shown by my post] to construct a test case.

If I pass
1.00000000000000000000000000000000000000000000000000000000000000\
0000000000000000000000000000000000000000000000000000000000000000\
0000000000000000000000000000000000000000000000000000000000000000\
0000000000000000000000000000000000000000000000000000000000000000\
0000000000000000000000000000000000000000000000000000000000000000\
0000000000000000000000000000000000000000000000000000000000000000\
0000000000000000000000000000000000000000000000000000000000000000\
0000000000000000000000000000000000000000000000000000000000000000e1 to
scanf, I expect it to come back with ten, not one, as the result value.

No-one has provided a convincing argument that an implementation which
stores 1. in the pointed-to argument is legal.
 
