How to force fscanf to find only data on a single input line?


Richard Heathfield

CBFalconer said:
Richard said:
CBFalconer said:
Well, I certainly never saw [Pat Foley's ggets objection],

Oh, I see. There must be two CBFalconers then, since CBFalconer
did in fact post a prompt reply to Pat Foley.

Well, maybe I should modify my answer to 'I don't remember'. This
also indicates how seriously I took any such objection at the time.

If you had taken it more seriously, ggets would be a better function.
That's ridiculous. Similarly, you can say anything that uses
malloc to collect and store information is dangerous.

Well, I didn't say it was dangerous. Nor do I agree that my claim is
ridiculous. This is what people want to be able to do:

1) initialise
2) main loop
2a) gather input
2b) process input
3) possibly do post-processing on intermediate results
4) produce output
5) clean up
6) quit

Okay, that doesn't quite cover all eventualities, but it gives a general
model for batch code. (The problem still remains for interactive code,
but let's keep our example simple.) The problem with ggets is that it
mandates an additional step within the main loop - effectively moving
part of the cleanup into the loop itself. Requiring people to do that
is a weakness. When they forget - and they will - the result can hardly
be called a leak, because it's more like a firehose.
After all, it is just one more choice. You can use gets, ggets,
fgets, getline (I think that is your routine's name),

You think wrong.
getc, fscanf,
etc. as you wish. Scratch gets from that list.

And scratch ggets too, until it does what it ought.
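To make the firehose concrete, here is how the loop will tend to get
written - a sketch, assuming the prototype int ggets(char **ln) with 0
returned on success, as described later in this thread:

#include <stdio.h>
#include <stdlib.h>

int ggets(char **ln);   /* CBFalconer's function; not standard C */

int main(void)
{
    char *line;

    while (0 == ggets(&line))
    {
        printf("%s\n", line);   /* process the line */
        /* free(line) is required here; leave it out - as newcomers
           will - and every line read from the stream is leaked */
    }
    return 0;
}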
 

CBFalconer

Richard said:
CBFalconer said:
.... snip ...


You think wrong.


And scratch ggets too, until it does what it ought.

To be consistent you have to add 'scratch malloc' to your list.
BTW, if you add Navia's favorite malloc clean-up to your system, it
will apply to ggets also. This illustrates that the 'fault' is not
within ggets, but within malloc.
 

Richard

Keith Thompson said:
CBFalconer said:
Richard Heathfield wrote: [...]
That's fine - but it makes your function less useful than it
could be. For example, it oughtn't to be used in environments
that are open to accidental or malicious data abuse, or in low
memory situations (because of its leak-encouraging design).

That's ridiculous. Similarly, you can say anything that uses
malloc to collect and store information is dangerous. Systems have
better methods of limiting overuse, such as memory maxima. Nor
should any recursive code be let out into the wild, since overuse
can crash. Ptui.

A program can use malloc reasonably safely as long as the program can
control how much memory is allocated. Similarly for recursion, if the
program can control the depth of recursion.

gets() is dangerous because its misbehavior (buffer overflow) can be
triggered by factors that the program cannot control, namely the
contents of stdin.

ggets() is less dangerous, but nevertheless its misbehavior
(attempting to allocate more memory than it should) can likewise be
triggered by the contents of stdin. Once my program calls ggets(), it
has *no control* over how much memory may be allocated.

If you consider that to be an acceptable price to pay for the relative
simplicity of ggets(), that's your call, but it's something that
anyone thinking about using ggets() should consider.

[...]

See previous post on the matter. Without the necessary "limit"
parameters, ggets is positively dangerous.
 

Malcolm McLean

CBFalconer said:
That's ridiculous. Similarly, you can say anything that uses
malloc to collect and store information is dangerous. Systems have
better methods of limiting overuse, such as memory maxima. Nor
should any recursive code be let out into the wild, since overuse
can crash. Ptui.
What you should do is extend the buffer by size/10 + 128 on every call, or
every call after the first one or two.
Growing by 10% each time means that the total number of calls to realloc()
before you run out of memory will be rather small.
Richard Heathfield will probably insist on shrinking the request on failure.

Obviously you should make size a size_t, even though I personally hate that
type, and check for nextsize > prevsize, for termination. Again, you could
shrink so that an array of pretty much exactly the size_t maximum can be
allocated, but it becomes increasingly futile.
Then run it on /dev/zero and see if it comes back with "out of memory"
reasonably sharpish.
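A sketch of that policy (the helper's name and shape are mine, not taken
from ggets):

#include <stdlib.h>

/* Extend *size by size/10 + 128. Returns the grown buffer, or NULL
   on size_t overflow or allocation failure. */
static char *grow(char *buf, size_t *size)
{
    size_t nextsize = *size + *size / 10 + 128;
    char *tmp;

    if (nextsize <= *size)      /* nextsize wrapped: terminate */
        return NULL;
    tmp = realloc(buf, nextsize);
    if (tmp != NULL)
        *size = nextsize;
    return tmp;
}

On failure the original buffer is untouched, so the caller can still free
it (or, as suggested above, retry with a smaller request).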
 

Richard Heathfield

CBFalconer said:
To be consistent you have to add 'scratch malloc' to your list.

If you think so, you have missed my point completely.

Here is how most people use fgets:

code to set up buffer

while(fgets(buffer, sizeof buffer, stream) != NULL)
{
process the line
}

cleanup


And that's how they will tend to use ggets too, whereas what ggets needs
(normally) is:

setup buffer pointer

while whatever the ggets syntax is
{
process the line
free the buffer
}

In my opinion, that's bad design. It should be possible to write:

setup buffer pointer

while whatever the ggets syntax is
{
process the line
}

cleanup

without going to extravagant lengths. I can see that the ggets function
isn't quite so broken if you want the line data to persist beyond a
single loop iteration - but very often you don't.
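Filled in as a complete (if trivial) program, the fgets pattern above
looks like this - with the usual fgets caveat that a line longer than the
buffer arrives in pieces:

#include <stdio.h>

int main(void)
{
    char buffer[256];                /* code to set up buffer */

    while (fgets(buffer, sizeof buffer, stdin) != NULL)
    {
        fputs(buffer, stdout);       /* process the line */
    }
                                     /* cleanup: nothing to free */
    return 0;
}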
 

Bart van Ingen Schenau

I already looked at the [] notation as a possible solution for this
but couldn't figure out how to force it into shape. For instance:

rc = fscanf(fp,"%d[ \t]%d[ \t]%s[\n]",&int1,&int2,string);

Try this instead:
rc = fscanf(fp,"%d%*[ \t]%d%*[ \t]%[^\n]%*[\n]",&int1,&int2,string);

This should only have problems with input lines like
+ 100 \n-300 400 name2\n

because the %d format specifier eats leading whitespace
unconditionally.

Anyway, I guess the answer to my question is that there is no simple way
to make fscanf() treat an EOL as an input terminator. It seems slightly
bizarre to me that fscanf() has no concept of "end of input", other than
EOF!

Regards,

David Mathog

The fact is that it is non-trivial to use (f)scanf in situations where
the input stream may contain errors.

Bart v Ingen Schenau
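A common workaround, consistent with the conclusion above, is to read each
whole line with fgets and then parse it with sscanf, which cannot scan
past the end of the string it is handed. A sketch (the buffer sizes are
illustrative):

#include <stdio.h>

int main(void)
{
    char line[256];
    char string[256];
    int int1, int2;

    while (fgets(line, sizeof line, stdin) != NULL)
    {
        if (sscanf(line, "%d %d %255s", &int1, &int2, string) == 3)
            printf("%d %d %s\n", int1, int2, string);
        else
            fprintf(stderr, "malformed line: %s", line);
    }
    return 0;
}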
 

CBFalconer

Richard said:
.... snip ...

Here is how most people use fgets:

code to set up buffer
while(fgets(buffer, sizeof buffer, stream) != NULL) {
process the line
}
cleanup

And that's how they will tend to use ggets too, whereas what ggets
needs (normally) is:

setup buffer pointer /* which is "char *ptr;" */
while whatever the ggets syntax is {
process the line
free the buffer /* which may be "free(buf);" */
}

In my opinion, that's bad design. It should be possible to write:

setup buffer pointer /* which is more complex */
while whatever the ggets syntax is {
process the line /* which has to handle parts or sizes */
}
cleanup /* not actually needed */

without going to extravagant lengths. I can see that the ggets
function isn't quite so broken if you want the line data to
persist beyond a single loop iteration - but very often you don't.

Note that the ggets syntax is "while (0 == ggets(&ptr)) {". If you
save the error return you can discriminate between memory
exhaustion and i/o errors (including EOF). If you save the
returned pointer (rather than freeing it immediately) you can tuck
the lines away for future use. Remember the design objective was
the simplicity of gets without the penalties.

The ggets source code (and usage examples) is available at:

<http://cbfalconer.home.att.net/download/>
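Spelled out, the usage being described is something like this sketch (the
meaning of the non-zero return values is defined by the ggets source at
the URL above):

#include <stdio.h>
#include <stdlib.h>

int ggets(char **ln);   /* from the library above; not standard C */

int main(void)
{
    char *line;
    int rc;

    while (0 == (rc = ggets(&line)))
    {
        puts(line);     /* process the line */
        free(line);     /* required on every iteration */
    }
    /* rc now discriminates i/o errors (including EOF) from memory
       exhaustion; consult the ggets source for the actual codes. */
    return 0;
}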
 

Richard Heathfield

CBFalconer said:

Yeah, but the cost of not needing the cleanup is that you have to do the
cleanup inside the loop, which means that either you forget (which,
let's face it, is what most newbies will do) or you are doomed to
free/alloc/free/alloc/free/alloc/free/ad loopeam.
Note that the ggets syntax is "while (0 == ggets(&ptr)) {".

Great, so you're most of the way there. All you need now is a size_t *
to save you from having to reallocate a fresh buffer each time, and a
size_t to indicate the maximum allowable allocation. So near!
If you
save the error return you can discriminate between memory
exhaustion and i/o errors (including EOF). If you save the
returned pointer (rather than freeing it immediately) you can tuck
the lines away for future use. Remember the design objective was
the simplicity of gets without the penalties.

You haven't got the simplicity of gets - you lost that when you took
char ** instead of char * - and you still have the downsides of
unnecessarily complex memory management and exposure to a
denial-of-memory attack. Worst of both worlds.
The ggets source code (and usage examples) is available at:

Yes, I know. What I don't know is why you think it's worth promulgating.
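For what it's worth, the interface being asked for - a buffer and capacity
that survive across calls, plus a hard ceiling on allocation - might look
like this. A sketch only, with invented names; nobody's published
function:

#include <stdio.h>
#include <stdlib.h>

/* Returns 0 when a line was read; nonzero on EOF, allocation failure,
   or a line that would exceed maxlen bytes (including the '\0').
   *buf and *cap persist across calls; the caller frees *buf once. */
int bounded_getline(char **buf, size_t *cap, size_t maxlen, FILE *fp)
{
    size_t len = 0;
    int ch = EOF;

    for (;;)
    {
        if (len + 1 >= *cap)    /* need room for this char plus '\0' */
        {
            size_t ncap = *cap ? *cap * 2 : 128;
            char *tmp;

            if (ncap > maxlen)
                ncap = maxlen;
            if (len + 1 >= ncap)
                return 1;       /* the caller's ceiling is reached */
            tmp = realloc(*buf, ncap);
            if (tmp == NULL)
                return 1;
            *buf = tmp;
            *cap = ncap;
        }
        ch = getc(fp);
        if (ch == EOF || ch == '\n')
            break;
        (*buf)[len++] = (char)ch;
    }
    if (len == 0 && ch == EOF)
        return 1;
    (*buf)[len] = '\0';
    return 0;
}

A caller writes char *buf = NULL; size_t cap = 0; loops while
bounded_getline(&buf, &cap, 4096, stdin) == 0, and calls free(buf) once
after the loop: the buffer is reused across iterations, and the ceiling
caps what hostile input can make the program allocate.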
 

Anand Hariharan

This is what people want to be able to do:

1) initialise
2) main loop
2a) gather input
2b) process input
3) possibly do post-processing on intermediate results
4) produce output
5) clean up
6) quit

Okay, that doesn't quite cover all eventualities, but it gives a general
model for batch code. (The problem still remains for interactive code,
but let's keep our example simple.)

In terms of a routine that reads a line of arbitrary length, I have
had this wish-list for some time:

It should be possible for this routine to gracefully handle text files
created on a platform whose EOL convention is different from the one
on the target binary's platform.

E.g., if I use Cygwin or MSYS to create a text file (by say
redirecting some command's output), it is a text file on a Windows
file system, but with UNIX EOL conventions. Even if this file is only
a few hundred lines long, trying to read such a file line-by-line
causes most routines (including the fgets approach where the buffer's
memory is doubled each time its capacity is reached) to read the
entire file as a single line. This usually ends up causing the
system to become unstable due to memory exhaustion.

- Anand

PS: I wish to stay clear of "virtues and follies of ggets' design"
debate.
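The behaviour wished for above can be approximated by scanning character
by character, treating "\n", "\r\n", or a lone "\r" as end-of-line. A
sketch (open the stream in binary mode so the C library's own newline
translation doesn't interfere):

#include <stdio.h>

/* Reads one line, storing at most size-1 characters (over-long lines
   are truncated) and stripping the EOL. Returns -1 at end of file
   with nothing read, else the number of characters stored. */
long read_line_anyeol(char *buf, size_t size, FILE *fp)
{
    size_t len = 0;
    int ch;

    if (size == 0)
        return -1;

    while ((ch = getc(fp)) != EOF)
    {
        if (ch == '\n')
            break;
        if (ch == '\r')
        {
            int next = getc(fp);    /* swallow the \n of a \r\n */
            if (next != '\n' && next != EOF)
                ungetc(next, fp);
            break;
        }
        if (len + 1 < size)
            buf[len++] = (char)ch;
    }
    if (ch == EOF && len == 0)
        return -1;
    buf[len] = '\0';
    return (long)len;
}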
 

Keith Thompson

CBFalconer said:
To be consistent you have to add 'scratch malloc' to your list.
[...]

Not at all. A program that calls malloc can have complete control
over the maximum amount of memory that can be allocated. Even if the
amount of memory required is determined externally, the program itself
can always do a sanity check and reject a huge allocation. With
ggets, the amount of memory it attempts to allocate is entirely
controlled by the contents of stdin, something over which the program
has no control.

Do you not see the difference?

Maybe in some environments this is ok; maybe there's no harm in
allocating just as much memory as the system will allow me. But I'd
like to make that decision myself.
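The sanity check Keith describes is a one-liner around malloc; the
ceiling is the program's own choice, not something the input gets to
decide:

#include <stdlib.h>

#define MAX_LINE_ALLOC ((size_t)1 << 20)   /* 1 MiB: our choice */

void *checked_malloc(size_t n)
{
    if (n > MAX_LINE_ALLOC)    /* reject externally-driven requests */
        return NULL;
    return malloc(n);
}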
 

CBFalconer

Richard said:
.... snip ...

Great, so you're most of the way there. All you need now is a size_t *
to save you from having to reallocate a fresh buffer each time, and a
size_t to indicate the maximum allowable allocation. So near!

Well, nobody is budging, so I think I'll drop this thread. My (and
other) opinions are buried in there somewhere for anyone to review.
 

Francine.Neary

Well, nobody is budging, so I think I'll drop this thread. My (and
other) opinions are buried in there somewhere for anyone to review.

What doesn't seem to be somewhere for anyone to review is any attempt
by you to actually engage with the criticisms people have made of
ggets...
 

CBFalconer

What doesn't seem to be somewhere for anyone to review is any attempt
by you to actually engage with the criticisms people have made of
ggets...

I answered all I saw, with reasons for the existing prototype.
 

Malcolm McLean

CBFalconer said:
I answered all I saw, with reasons for the existing prototype.
My fix doesn't involve any changes to the prototype.
Whilst it won't exactly return quickly after being called on /dev/zero, and
there's no particular reason it should, it might not tie up resources for
too long either.

Problem, if not solved, at least substantially alleviated.
 

CBFalconer

.... snip discussion about ggets() ...
My fix doesn't involve any changes to the prototype. Whilst it
won't exactly return quickly after being called on /dev/zero, and
there's no particular reason it should, it might not tie up
resources for too long either.

Problem, if not solved, at least substantially alleviated.

I see no 'fix'. What are you talking about?
 

Keith Thompson

Malcolm McLean said:
My fix doesn't involve any changes to the prototype.
Whilst it won't exactly return quickly after being called on
/dev/zero, and there's no particular reason it should, it might not
tie up resources for too long either.

Problem if not solved, at least substantially alleviated.

With your change, ggets would perform fewer allocations while
attempting to read an infinitely long input line, but it would still
attempt to allocate an arbitrarily large amount of memory (until an
allocation fails). In an environment where that's a problem, it will
misbehave more quickly, but it will still misbehave.
 

Malcolm McLean

Keith Thompson said:
With your change, ggets would perform fewer allocations while
attempting to read an infinitely long input line, but it would still
attempt to allocate an arbitrarily large amount of memory (until an
allocation fails). In an environment where that's a problem, it will
misbehave more quickly, but it will still misbehave.
As it stands ggets() calls realloc repeatedly with increments of 128 bytes.
realloc() generally performs internal copying - I read somewhere that it is
rare for actual implementations to extend the block, though I suppose with
an increment as small as 128 bytes that might happen more often than not.

Anyway, because of the copying, we have an O(N^2) algorithm, where N is the
total amount of available memory / 128, which could well be over a million.
So it's not surprising that the whole thing crawls.

Replace the 128-byte increment with an increment of 10%, and the buffer
grows exponentially. This is the standard schoolboy question, if you
invested a penny at 10% interest, applied annually, what would it be worth
in 100 years' time? The answer is rather a large sum.

So within a hundred or so allocations the system will run out of memory.
We've now got an O(N logN) algorithm. It is still gobbling lots of memory,
but it releases it in relatively short time. That's much more acceptable.
 

Richard Heathfield

Malcolm McLean said:

This is the standard schoolboy question,
if you invested a penny at 10% interest, applied annually, what would
it be worth in 100 years' time? The answer is rather a large sum.

£137.80 isn't really all that large. In any case, after a hundred years,
you're likely to have lost the account book.
So within a hundred or so allocations the system will run out of
memory. We've now got an O(N logN) algorithm. It is still gobbling
lots of memory, but it releases it in relatively short time. That's
much more acceptable.

Not as acceptable as placing an upper limit on tolerable consumption.
 

Richard Tobin

Malcolm McLean said:
As it stands ggets() calls realloc repeatedly with increments of 128 bytes.
realloc() generally performs internal copying - I read somewhere that it is
rare for actual implementations to extend the block, though I suppose with
an increment as small as 128 bytes that might happen more often than not.
Anyway, because of the copying, we have an O(N^2) algorithm,

Probably not. realloc() will only copy when it has to reallocate, and
it almost certainly has an algorithm that doesn't do constant
increments.

Try timing the following program with arguments suitable to the amount
of real memory you have. It doesn't show any sign of super-linearity
on my system.

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int m = atoi(argv[1]);
    int i;
    void *buf = 0;

    fprintf(stderr, "please wait\n");

    for (i = 1; i <= m; i++)
        if (!(buf = realloc(buf, i)))
        {
            fprintf(stderr, "realloc(%d) failed\n", i);
            return 1;
        }

    return 0;
}

By the way, what's the right way to convert a decimal string to a
size_t?
Replace the 128-byte increment with an increment of 10%, and the buffer
grows exponentially. This is the standard schoolboy question, if you
invested a penny at 10% interest, applied annually, what would it be worth
in 100 years' time? The answer is rather a large sum.

So within a hundred or so allocations the system will run out of memory.
We've now got an O(N logN) algorithm. It is still gobbling lots of memory,
but it releases it in relatively short time. That's much more acceptable.

Actually no (though my first thought was the same as yours). It's
O(N). It does O(log N) copies, but they aren't of an average amount
proportional to N. Consider the easy case of doubling the allocation
each time: the copy sizes are

1 2 4 8 ... 2^k

where 2^k < N <= 2^(k+1), and the sum will be 2^(k+1) - 1, which is
O(N). For your 10% increase it's a geometric progression with ratio
1.1, so the total bytes copied will be (1.1^(k+1) -1) / (1.1 - 1), or
roughly 10N.

-- Richard
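That question goes unanswered in the thread. With C99, one reasonable
answer is strtoull plus explicit checks, since there is no strto* variant
for size_t itself (the function name here is mine):

#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Convert a decimal string to size_t. Returns 0 on success, -1 on
   empty or malformed input and on out-of-range values. */
int str_to_size(const char *s, size_t *out)
{
    char *end;
    unsigned long long v;

    if (*s == '-')      /* strtoull would quietly negate these */
        return -1;
    errno = 0;
    v = strtoull(s, &end, 10);
    if (end == s || *end != '\0' || errno == ERANGE || v > SIZE_MAX)
        return -1;
    *out = (size_t)v;
    return 0;
}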
 

Malcolm McLean

Richard Heathfield said:
Malcolm McLean said:

Not as acceptable as placing an upper limit on tolerable consumption.
What matters is the time the memory is held, as well as the amount. If a
process hogs all available memory, but releases it after a few cycles,
you've got to be pretty unlucky for anything else to notice. If it holds it
for a few seconds you may cause a few glitches, or make badly-written
programs exit. If it takes it for a few minutes you've got a very impatient
user; for a few hours, and effectively you've lost the system.

Of course if a ggets() caller processes the input line, he is using all the
memory in the machine to process some data. You can't do anything about
that, nor should you; that's what it's there for.
 
