Determine the size of malloc

James Harris

....
1) a fresh buffer is allocated each time. There is no way to supply an
existing buffer, and consequently the program must either keep track of
all these buffers, keep freeing them as it goes, or leak them. My concern
is that the last is likely to be the most common. A reusable buffer would
make memory management simpler (at the expense of a more complicated
function interface).

2) there is no protection against denial of memory attacks. Given an
arbitrarily long line, the function will continue to try to allocate
memory until... it fails! A malicious user who is trying to soak up as
much memory as possible on the host system could exploit this in an
attempt to deny memory to other processes.

Chuck's argument is that fixing these would make the function interface
more complicated - i.e. the function would be harder to call.

This is certainly true, but not compelling, because making these changes
would make ggets easier to *use* - there is more to usage than the call
itself.

Your first point makes sense: under your suggestion the code which
allocated the buffer would also be responsible for freeing it, which is
the natural arrangement, especially as there's no garbage collector. It
would also be faster.
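
To make the first point concrete, the kind of interface I have in mind
would be something like the following. This is only an untested sketch,
and the name fgetline is just made up for the illustration: the caller
owns the buffer and its size, and the function grows the buffer with
realloc only when it has to.

#include <stdio.h>
#include <stdlib.h>

/* The caller owns *pbuf and *psize; the buffer is grown only when it
   has to be.  Returns the number of characters placed in the buffer
   (including the newline, if there was one), or -1 at end of file or
   on an allocation failure. */
long fgetline(char **pbuf, size_t *psize, FILE *fp)
{
    size_t len = 0;
    int ch;

    while ((ch = getc(fp)) != EOF) {
        if (len + 2 > *psize) {              /* room for ch and a '\0'? */
            size_t newsize = *psize ? *psize * 2 : 128;
            char *tmp = realloc(*pbuf, newsize);
            if (tmp == NULL)
                return -1;
            *pbuf = tmp;
            *psize = newsize;
        }
        (*pbuf)[len++] = (char)ch;
        if (ch == '\n')
            break;
    }
    if (len == 0)
        return -1;
    (*pbuf)[len] = '\0';
    return (long)len;
}

Calling code would then be along the lines of

    char *line = NULL;
    size_t size = 0;

    while (fgetline(&line, &size, stdin) >= 0) {
        /* work with line; the same buffer is reused on every call */
    }
    free(line);          /* exactly one free, when we are finished */

IIRC this is much the same shape as the getline() that glibc provides.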

I'm less sold on the second point, though. Say that, in order to
protect against denial of memory attacks, the user were to set a limit
on how much to read: how would that limit be chosen? We may be running
on a 4Gbyte machine or a 64k machine. In either case not all of the
address space would be available to the process. So how would the user
portably choose a limit for what to read?

If we do get the user to choose a limit, how does the application logic
- i.e. the code using the function - handle hitting that limit? Surely
the idea of reading in a long section of data is to allow the code to
work on whole lines or whole whatevers in one go. If we still have to
write code to handle input data split over call boundaries, why not use
a standard library function in the first place?

As a slight aside, specifying how much memory to keep /free/ may be
more useful here rather than how much we can use.

There are other issues with the ggets routine, IMHO,

3. Fixed-size increments - this does not scale well, and it makes it
more likely that any eventual memory allocation failure will leave only
a tiny bit of space free.

4. The routine returns only on a newline. In the discussions this
seems to have been taken for a gimme - and seems to be common to other
routines I've found of the same type - but the OP did not specify he
was trying to read lines. He may have another terminator in mind.

5. These routines may be handling the wrong problem. The underlying
issue is not reading from a file but buffer management. It may be
better to write code to handle buffers than to write for the specific
(albeit most prevalent) case of reading a line. To illustrate, if we
knew the buffer was large enough we could read the input with
something like (and I admit my C may have syntax or other errors -
corrections welcome - but it should hopefully make the point
nonetheless)

int ch;   /* int rather than char, so that EOF can be detected */

while ((ch = getc(infile)) != EOF) {
    buffer[offset++] = ch;
    if (ch == endchar) break;
}

This loop is simple and would not require to be packaged in an
imported routine. It may be good enough as it stands to point the OP
in the right direction: i.e. do it yourself using getc rather than
fgets. On top of this loop we could add something like

if (need_space_for(&buffer, &bufsize, offset) != 0) {
    /* handle out-of-space error */
}

immediately prior to putting the character in the buffer. I'm thinking
of the "need_space_for" routine as

1) returning immediately if the buffer is large enough,
2) if not large enough reallocating the buffer,
3) if unable to reallocate the buffer returning non-zero.

For performance it would be good if step 1 could be implemented as a
macro/inline and the rest as a called function. Is that possible in C?
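
Perhaps something like the following split would do. This is an
untested sketch; grow_buffer is a name I've just made up, the third
argument is taken to be the highest index about to be used, and I'm
assuming buffer is a char * and bufsize a size_t.

#include <stdlib.h>

/* The called part: grow the buffer so that index 'needed' is usable.
   Returns 0 on success, non-zero if reallocation fails (in which case
   the old buffer is left intact). */
int grow_buffer(char **pbuf, size_t *psize, size_t needed)
{
    size_t newsize = *psize ? *psize : 64;
    char *tmp;

    while (newsize <= needed)
        newsize *= 2;          /* grow geometrically, not in fixed steps */
    tmp = realloc(*pbuf, newsize);
    if (tmp == NULL)
        return 1;
    *pbuf = tmp;
    *psize = newsize;
    return 0;
}

/* The inline part: a cheap test that only calls grow_buffer when the
   buffer really is too small.  (Note that 'needed' is evaluated twice.) */
#define need_space_for(pbuf, psize, needed) \
    ((needed) < *(psize) ? 0 : grow_buffer((pbuf), (psize), (needed)))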

This should be quite flexible. If the programmer wishes to avoid the
(albeit small) cost of the macro part on each call he could read the
input in nested loops, in chunks of, say, 32 bytes, checking at the top
of each outer loop that there is enough space for another whole chunk
rather than checking on each call, calling the macro as

if (need_space_for(&buffer, &bufsize, offset + CHUNK_SIZE) != 0) ...

where CHUNK_SIZE = 32.

The intention is that the same macro + routine combination could be
used for handling any flexibly-sized buffer in any circumstance, not
just for reading lines.
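
Spelled out, the chunked variant might look roughly like this - again
an untested sketch using the same invented need_space_for name, and
assuming buffer, bufsize, infile and endchar are declared as before:

#define CHUNK_SIZE 32

    size_t offset = 0;
    int ch = 0, i;

    while (ch != EOF && ch != endchar) {
        /* reserve room for a whole chunk once, outside the inner loop */
        if (need_space_for(&buffer, &bufsize, offset + CHUNK_SIZE) != 0) {
            /* handle out-of-space error */
            break;
        }
        /* inner loop: at most CHUNK_SIZE characters, no capacity checks */
        for (i = 0; i < CHUNK_SIZE; i++) {
            ch = getc(infile);
            if (ch == EOF)
                break;
            buffer[offset++] = ch;
            if (ch == endchar)
                break;
        }
    }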

The above is just a suggestion. Can anyone see if this would work or
not? As ever when throwing ideas out onto Usenet it's interesting to
see what comes back - good and bad!
 
Flash Gordon

Malcolm McLean wrote, On 24/05/08 17:18:
Best means "best available" not "best conceivable". The subject has been
rehashed many times on clc, and no-one has come up with a definitive
solution to the problem of reading an unbounded line from stdin.

Name one implementation with infinite address space, since without that
it cannot be unbounded. On many systems you will hit some other limit
before you run out of address space.
Maybe that says something about C, maybe about us, or maybe you just
can't have "unbounded" and "sensible" at the same time. Whilst people
have certainly made pertinent comments about ggets(), nothing has been
proposed which is unambiguously better.

C makes the problems more visible. However, even in languages where
you can just tell it to read a line into a string and it handles the
memory allocation for you (e.g. the BASIC implementations I've used),
you can still run out of memory, and what happens then? How do you
handle it?
 
Yevgen Muntyan

Malcolm said:
Best means "best available" not "best conceivable". The subject has been
rehashed many times on clc, and no-one has come up with a definitive
solution to the problem of reading an unbounded line from stdin.
Maybe that says something about C, maybe about us, or maybe you just
can't have "unbounded" and "sensible" at the same time.

No, the problem is that you can't have anything sensible
in comp.lang.c about "unbounded". E.g. some other poster
thinks you can't even use the word "unbounded" if you are
talking about real computers.
Whilst people
have certainly made pertinent comments about ggets(), nothing has been
proposed which is unambiguously better.

Right, see the snipped item 623.

Yevgen
 
Flash Gordon

Malcolm McLean wrote, On 24/05/08 21:37:
C runs out of memory. Whilst address space cannot be infinite it can,
even on a non-perverse system, be much larger than the number of atoms
in the universe. Imagine a system that encodes some sort of
error-checking into pointers.
Running out of memory is one of the conditions ggets() has to handle,
and of course it is an obstacle to a nice clean interface.

Which was precisely my point.
My BASIC implementation simply exits with the message "line too long" if
the input line goes over a limit - I think 1024 characters though I
can't remember offhand. The programmer needs to know that excessively
long lines are not supported. If he needs to deal with them for some
reason he must find another language.

I.e. the problem does not go away, it just gets removed from the
programmer, which was again precisely my point.
MiniBasic also has a centralised out-of-memory handler. There are
numerous places where the program could run out of memory, and in all
cases the result is simply termination with a message. Again, this
really just brushes the problem away rather than solving it, though 90%
of C programs probably have to terminate on out of memory as well.

As a bare minimum properly written SW first makes an attempt at
reporting why it is terminating. Often it should attempt to do a lot
more even if it is going to terminate (e.g. if it is a database ensure
that all files are in a sensible state), or if it is a JVM it should try
and create an out of memory exception in the Java code it is running and
see what that wants to do about it.
 
Bartc

Malcolm McLean said:
C runs out of memory. Whilst address space cannot be infinite it can, even
on a non-perverse system, be much larger than the number of atoms in the
universe. Imagine a system that encodes some sort of error-checking into
pointers.
Running out of memory is one of the conditions ggets() has to handle, and
of course it is an obstacle to a nice clean interface.

Input text files are either line-oriented or they're not.

For line-oriented input, impose a reasonable line length limit (1K, 8K etc),
and use ggets or whatever line-input function is desired. Any longer line
should raise an error.

For other types of files, use a different approach (character-by-character
for example). C source files are an example of non-line-oriented input
because everything /could/ legally be on one long line.

It's silly to expect line-input functions to attempt to cope with lines that
could conceivably be as long as the size of the drive (and longer?).
 
Bartc

pete said:
I can't think of how a file can be more line oriented
than just by being a text file.

(A) Text files with no line breaks
(B) Text files with line breaks that are not significant
(C) Text files with significant line breaks

By line-oriented I mean type (C), such as Windows INI or BAT files (and
loads more I can't think of at the minute), which would likely break if two
lines were joined into one. With such files there is little point in having
individual lines of unlimited length.

A C source file would be type (B), despite your comment about the
line-length limit; a newline in a C file is usually just white-space and
with no other significance. Python source code on the other hand I think
/would/ be line-oriented.
I don't know what you think you mean by that.
C source files are text files.
Text files are made of lines.

Yes, but there could be zero lines; or just one, regardless of the size of
the file.
There's an environmental limit of
4095 characters in a logical source line.

That's news to me; I always thought that logically C source files could be
written on just one long line.

It sounds like the C standard is doing what I proposed above, setting
certain limits on what is allowable to make life easier for
implementers, while still allowing them to set their own higher limits
(the 4095 is, I think, the minimum that must work).

In fact 4K characters is a good limit to use for line-reading functions for
everyday use. If it's not enough then you probably need specialised
functions; or rethink your file format.
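
By way of illustration, a fixed-limit reader need only be a few lines.
This is just a rough, barely tested sketch:

#include <stdio.h>
#include <string.h>

#define MAXLINE 4096

/* Read one line into a fixed buffer.  Returns 0 on success, -1 at end
   of file (or on a read error), and 1 if the line did not fit - in
   which case the rest of the line is still sitting in the stream. */
int read_line(FILE *fp, char *buf, size_t size)
{
    size_t len;

    if (fgets(buf, (int)size, fp) == NULL)
        return -1;
    len = strlen(buf);
    if (len > 0 && buf[len - 1] == '\n') {
        buf[len - 1] = '\0';             /* strip the newline */
        return 0;
    }
    if (feof(fp))
        return 0;                        /* last line had no newline */
    return 1;                            /* over-long line */
}

and typical use would be

    char line[MAXLINE];
    int r;

    while ((r = read_line(fp, line, sizeof line)) == 0) {
        /* process line */
    }
    if (r == 1) {
        /* reject the file: line too long */
    }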
 
Keith Thompson

pete said:
Bartc wrote: [...]
For line-oriented input, impose a reasonable line length limit (1K, 8K etc),
and use ggets or whatever line-input function is desired. Any longer line
should raise an error.
For other types of files, use a different approach (character-by-character
for example). C source files are an example of non-line-oriented input
because everything /could/ legally be on one long line.

I don't know what you think you mean by that.
C source files are text files.
Text files are made of lines.
There's an environmental limit of
4095 characters in a logical source line.

No, that's the minimum limit. An implementation is allowed to accept
lines of any arbitrary length. (In fact, I suspect most compilers
probably don't set an explicit maximum length for lines in C source
files, other than what's imposed by memory limitations.)
 
Keith Thompson

Bartc said:
Input text files are either line-oriented or they're not.

For line-oriented input, impose a reasonable line length limit (1K, 8K etc),
and use ggets or whatever line-input function is desired. Any longer line
should raise an error.

For other types of files, use a different approach (character-by-character
for example). C source files are an example of non-line-oriented input
because everything /could/ legally be on one long line.

It's silly to expect line-input functions to attempt to cope with lines that
could conceivably be as long as the size of the drive (and longer?).

You certainly *can* impose a limit if you want to. But there's no
fundamental reason to mandate that text files cannot have lines of,
say, 1 million characters or more, any more than there's any
fundamental reason to limit the size of an entire file.

It's true that allowing arbitrarily long lines makes for tricky memory
management. (Nobody promised that this stuff would be easy.)

<OT>In Perl, the fundamental input operation reads an arbitrarily long
line and stores it in a dynamically allocated string. You *can*
emulate something like C's fgets(), but in my own programming I don't
bother. Sometimes, when it's convenient, I even slurp the entire
contents of a file into memory. It hasn't been a problem for me so
far.</OT>
 
Keith Thompson

Bartc said:
That's news to me; I always thought that logically C source files could be
written on just one long line.
[...]

Preprocessor directives have to be on single lines, and "//" comments
are terminated by the end of a line.

An implementation may impose a limit on line lengths, but it's not
required to do so.
 
Flash Gordon

Malcolm McLean wrote, On 25/05/08 07:50:
That's a case for xmalloc() or similar.

No, it's a case for handling memory allocation failure.
The MiniBasic program terminates, the function does not. It returns a
failure flag to caller, and the reason can be extracted from the error
stream.

Which is an example of where the program does not terminate, so it is
another example of what you suggest might be the 10% of C programs that
do not terminate on memory allocation failure (assuming MiniBasic is
written in C).
 
Bartc

Keith said:
You certainly *can* impose a limit if you want to. But there's no
fundamental reason to mandate that text files cannot have lines of,
say, 1 million characters or more, any more than there's any
fundamental reason to limit the size of an entire file.

It's true that allowing arbitrarily long lines makes for tricky memory
management. (Nobody promised that this stuff would be easy.)

By imposing a limit, I can use, say, a 4K static buffer to efficiently read
in successive lines of my line-oriented text files (init files, data files,
and so on, all typical simple stuff).

Allowing any line length requires more complex management and a risk of
bringing down the program, and possibly the rest of the system, for minimum
benefits, for what would anyway be an input error for the sort of files I'm
talking about.

It's not a question of being easy, just sensible.
 
Keith Thompson

Text files are line-oriented by definition. There's no inherent limit
on the length of a line.
By imposing a limit, I can use, say, a 4K static buffer to efficiently read
in successive lines of my line-oriented text files (init files, data files,
and so on, all typical simple stuff).

Allowing any line length requires more complex management and a risk of
bringing down the program, and possibly the rest of the system, for minimum
benefits, for what would anyway be an input error for the sort of files I'm
talking about.

Which makes your job easier at the expense of not being able to handle
very long lines.
It's not a question of being easy, just sensible.

It might well be sensible for the sort of files *you're* talking
about. But if I want to process text files created by some other
program, and I have no control over what that program does, I just
might need to be able to handle arbitrarily long lines.

I've seen computer-generated HTML and XML files (plain text) with
lines that are hundreds or thousands of characters long. Or I might
have a text file that represents a large array of floating-point
numbers, with one row of the array on each line. In both cases, the
line boundaries are significant.

If I'm writing a program that reads such files, and if I impose a
limit on the length of a line that it can read, then that's an
unacceptable limitation in my program, not a problem with the input
files.
It's not a question of being easy, just sensible.

It's a question of assuming that text files are intended only for
human-readable text. Sometimes you can make that assumption, but
often you can't.

And I think you actually made that point yourself when you referred to
files as being "line-oriented". But it's possible to have a
"line-oriented" text file with arbitrarily long lines. Sometimes the
best approach is to read in chunks that might be smaller than a line
(a simple fgets() call will do this). But sometimes you might need to
have an entire line in memory at once. To do that, you need dynamic
allocation, which means you need to handle possible allocation
failures. But if you're going to be storing data derived from the
entire file's contents in memory, storing one arbitrarily long line
isn't going to be much of an additional burden.
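
For instance, a program that only needs per-line results can stay
inside a fixed-size fgets() buffer and never hold a whole line at once.
A rough sketch (it deliberately ignores embedded null characters):

#include <stdio.h>
#include <string.h>

int main(void)
{
    char chunk[4096];
    size_t cur = 0, longest = 0;

    while (fgets(chunk, sizeof chunk, stdin) != NULL) {
        size_t n = strlen(chunk);
        int end_of_line = (n > 0 && chunk[n - 1] == '\n');

        cur += n - (end_of_line ? 1 : 0);
        if (end_of_line) {
            if (cur > longest)
                longest = cur;
            cur = 0;
        }
    }
    if (cur > longest)                 /* final line may lack a newline */
        longest = cur;
    printf("longest line: %lu characters\n", (unsigned long)longest);
    return 0;
}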
 
Bartc

Keith Thompson said:
Which makes your job easier at the expense of not being able to handle
very long lines.


It might well be sensible for the sort of files *you're* talking
about. But if I want to process text files created by some other
program, and I have no control over what that program does, I just
might need to be able to handle arbitrarily long lines.

I've seen computer-generated HTML and XML files (plain text) with
lines that are hundreds or thousands of characters long.
Or I might
have a text file that represents a large array of floating-point
numbers, with one row of the array on each line. In both cases, the
line boundaries are significant.

At this point I thought I should write a getline()-type function just to see
what is involved. And 90 minutes or so later, I have a function that:

* Uses an internally allocated buffer for line data

* Reuses the same buffer without reallocation if possible

* Where necessary resizes the buffer (in one go, not gradually) to cope with
longer lines

* Uses a pre-agreed upper size of the buffer, which can be set at runtime

* Uses binary mode only (I hate the system messing with my files),
recognising 4 combinations of cr and lf (a little unportable I know) to
separate lines

* Returns an error on too-long lines when they exceed any upper limit

* And, the buffer size limit can be turned off by setting it to zero, to
read text files with arbitrarily long lines

So it's not such a big deal as I thought: where the input is expected to be
well-formed, I use a hard limit. And in other cases I can turn off the limit
but have to be prepared for scary things happening when it is fed files with
huge line lengths.

However in your examples of HTML and XML (which I don't believe are
line-oriented, they're controlled by opening/closing tags), I wouldn't use a
line-reading approach at all.
 
Ben Bacarisse

Bartc said:
At this point I thought I should write a getline()-type function just to see
what is involved. And 90 minutes or so later, I have a function that:

* Uses an internally allocated buffer for line data

* Reuses the same buffer without reallocation if possible

* Where necessary resizes the buffer (in one go, not gradually) to cope with
longer lines

* Uses a pre-agreed upper size of the buffer, which can be set at runtime

* Uses binary mode only (I hate the system messing with my files),
recognising 4 combinations of cr and lf (a little unportable I know) to
separate lines

This is a slightly odd one. My version of the same has a FILE *
parameter and just uses that. The text/binary distinction can be very
useful and seems to me to be something that should not be part of the
specification of a getline function.
* Returns an error on too-long lines when they exceed any upper limit

* And, the buffer size limit can be turned off by setting it to zero, to
read text files with arbitrarily long lines

Mine has a configurable delimiter string because I also used to use it
for "blocks" of text where the ending is "\n\n" (as seen by fgetc so
\n\n would mean two consecutive line terminators on text stream).
So it's not such a big deal as I thought:

No, the programming is relatively simple. The "big deal" is agreeing on
the prototype, exact error behaviour, the return type and so on. If
everyone had the same ideas about these matters we could have a c.l.c
recommended version. I'd post mine, but for the fear of having it
publicly filleted here! I suspect everyone who's programmed in C for
more than a few months has a similar function ready for private use.
 
Bartc

Ben said:
This is a slightly odd one. My version of the same has a FILE *
parameter and just uses that.

I thought they all did?
The text/binary distinction can be very
useful and seems to me to be something that should not be part of the
specification of a getline function.

My function has no control of this, but if given a text mode file (and I
believe stdin is in text mode), then on my machine newlines are single lf
characters. If I drop the rare lf-cr newline combination, then it will cope
comfortably with both binary and text mode.

BTW is there a way of telling if stdin is 'connected' to a keyboard, or is
an actual file? My code sometimes has to re-read a long line and this
doesn't work well with keyboard entry.
Mine has a configurable delimiter string because I also used to use it
for "blocks" of text where the ending is "\n\n" (as seen by fgetc so
\n\n would mean two consecutive line terminators on text stream).

So you would end up with even longer lines than normal?
No, the programming is relatively simple. The "big deal" is agreeing on
the prototype, exact error behaviour, the return type and so on. If
everyone had the same ideas about these matters we could have a c.l.c
recommended version. I'd post mine, but for the fear of having it
publicly filleted here! I suspect everyone who's programmed in C for
more than a few months has a similar function ready for private use.

Problem is they can't then easily post code that makes use of it.
 
Bartc

Bartc said:
BTW is there a way of telling if stdin is 'connected' to a keyboard,
or is an actual file? My code sometimes has to re-read a long line
and this doesn't work well with keyboard entry.

No need: getting/setting the current file offset (ftell/fseek) give error
values for the keyboard. Finally a use for checking error returns!
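
Something along these lines, i.e. a quick and untested sketch (strictly
it only tells you whether the stream is seekable, which is close enough
for my purposes):

#include <stdio.h>

/* A stream you cannot seek on (keyboard, pipe) reports an error from
   ftell(); an ordinary file gives back a valid offset. */
int is_seekable(FILE *fp)
{
    return ftell(fp) != -1L;
}

So when is_seekable(stdin) is false I simply skip the re-read strategy.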
 
Ben Bacarisse

Bartc said:
Ben Bacarisse wrote:

My function has no control of this, but if given a text mode file (and I
believe stdin is in text mode), then on my machine newlines are single lf
characters. If I drop the rare lf-cr newline combination, then it will cope
comfortably with both binary and text mode.

Ah, I see (I think). I am not a fan of that method, but then it is
your function not mine.

So you would end up with even longer lines than normal?

The function reads input until the delimiter has been placed in the
buffer (or something goes wrong). Some text files are organised in
blocks with a blank line between the parts you need to read, and
having a configurable delimiter helps with that without costing very
much in the code.
 
James Harris

I think that is an important insight.
Really we need something like

struct buffer
{
    void *buff;
    size_t itemsize;
    size_t capacity;
    size_t length;
};

We then use the structure liberally whenever we have an array whose
length cannot be determined at the time of its creation. However, that
means it needs to be in a standard library. Otherwise it is just one
more layer of nuisance to everybody.

Thanks for the positive feedback. Do you think it's possible to
jointly come up with a solution? It is a bit beyond a normal FAQ entry
but is similar to a FAQ issue in that the same problem is likely to
come up from time to time.

Does it really need to be in the standard library? AFAICT it's not
easy to add libraries to C (e.g. only <...> and "..." include options)
but how about these options:

1) .c and .h to put into the same directory as the caller,
2) code to copy verbatim into the current source file.

Is there a better option? Either way I suspect only a few lines of
code would be needed.
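
For example, option 1 might amount to little more than the following.
It is only a rough, untested sketch built around the struct above, and
the names (buffer_init, buffer_reserve, buffer_free) are just ones I've
made up:

/* buffer.h - sketch only */
#include <stddef.h>

struct buffer {
    void  *buff;
    size_t itemsize;   /* size of one element */
    size_t capacity;   /* elements allocated */
    size_t length;     /* elements in use */
};

void buffer_init(struct buffer *b, size_t itemsize);
int  buffer_reserve(struct buffer *b, size_t items); /* non-zero = out of memory */
void buffer_free(struct buffer *b);

/* buffer.c - sketch only */
#include <stdlib.h>
#include "buffer.h"

void buffer_init(struct buffer *b, size_t itemsize)
{
    b->buff = NULL;
    b->itemsize = itemsize;
    b->capacity = 0;
    b->length = 0;
}

int buffer_reserve(struct buffer *b, size_t items)
{
    size_t newcap;
    void *tmp;

    if (items <= b->capacity)
        return 0;                        /* already big enough */
    newcap = b->capacity ? b->capacity : 16;
    while (newcap < items)
        newcap *= 2;                     /* geometric rather than fixed growth */
    tmp = realloc(b->buff, newcap * b->itemsize);
    if (tmp == NULL)
        return 1;                        /* old contents still intact */
    b->buff = tmp;
    b->capacity = newcap;
    return 0;
}

void buffer_free(struct buffer *b)
{
    free(b->buff);
    b->buff = NULL;
    b->capacity = b->length = 0;
}

Reading a line would then just be the getc loop from earlier plus a
buffer_reserve call, and the same three functions would serve for any
other growable array.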
 
James Harris

...
jacob showed that lcc -ansic is conforming in this regard, and only
lcc without -ansic provides (f)ggets in <stdio.h>. The complaint about
a missing prototype is not an error (and is a useful warning), and the
redefinition error is only given when -ansic is not included in the
compiler options.

Is it not valid to want to write (and be error checked) as ANSI-
compatible C but also use (f)ggets? If so is there a way to combine
the two?
 
Harald van Dijk

Is it not valid to want to write (and be error checked) as ANSI-
compatible C but also use (f)ggets? If so is there a way to combine the
two?

Since ANSI C doesn't define (f)ggets, if you use it, and you don't define
it yourself, your C code isn't ANSI-compatible. There's no way to use non-
standard library functions and keep your code standard. It doesn't make
sense, and I hope it isn't what you meant. If you meant something else,
would you please clarify? I would take a guess, but I can think of at
least two different possibilities with different answers.
 
