Reading a string of unknown size

J

james of tucson

Sundar said:
I am unable to understand the intricacies of the above statement.

Then understand a couple of simpler statements:

1. Do not cast the return value of malloc().

2. If you use malloc() and/or free(), be sure to #include stdlib.h


No intricacies needed :)
 
S

Sundar

james said:
Then understand a couple of simpler statements:

1. Do not cast the return value of malloc().

2. If you use malloc() and/or free(), be sure to #include stdlib.h


No intricacies needed :)

I mean that the definition of malloc says that it returns a void
pointer to the allocated space. Why should u not typecast ur return
void pointer into ur required char or int pointer? And i think that you
dont understand the term "intricacies" else i would not have had to
post my query again.
 
R

Richard Heathfield

Sundar said:

[...]the definition of malloc says that it returns a void
pointer to the allocated space. Why should u not typecast ur return
void pointer into ur required char or int pointer?

(a) Why would you need to cast? All code should either Do Something Good or
Stop Something Bad Happening. An automatic conversion is supplied, and
that's guaranteed by the Standard, so a cast adds neither value nor
protection.
(b) Without the cast, if you omit <stdlib.h> the compiler is required to
issue a diagnostic message for the mismatch between int and pointer, but
the cast removes this requirement (without fixing the bug that the
diagnostic message is reporting to you), thus giving you a silent bug
instead of a noisy bug.

For a fuller answer, see http://www.cpax.org.uk/prg/writings/casting.php
 
K

Keith Thompson

Santosh Nayak said:
I suggest that one or both of the [Ss]antosh's currently posting here
consider using their full name to avoid confusion. (Neither of them,
of course, is obligated to follow my advice.)

Sounds Logical !

Thanks. If I have this straight, you've been signing yourself as
"Santosh", and you started posting here recently; the other Santosh
has been signing himself as "santosh" and has been posting here for
some time. Is that right?
 
S

santosh

Spiros said:
Is Santosh a common name ?

Depends. It's more common in N.India than the south. But it isn't a
particularly common name.
Or was Santosh
trying to impersonate santosh ? Personally
I didn't realize that Santosh was different than
santosh until santosh's first post and in fact I
was surprised at the way he was conducting himeslf.
(Before I realized they were different people that is.)

For a split second before I noticed the capitalisation, I thought I was
looking at one of my previous posts. :)
 
S

Santosh Nayak

If I have this straight, you've been signing yourself as
"Santosh", and you started posting here recently; the other Santosh
has been signing himself as "santosh" and has been posting here for
some time. Is that right?

Yes, that is right. Thanks for the suggestion.
 
S

Santosh Nayak

Is Santosh a common name ?
Depends. It's more common in N.India than the south. But it isn't a
particularly common name.

Santosh is also a common name in South Indian (where i belong).
 
S

santosh

Santosh said:
Santosh is also a common name in South Indian (where i belong).

If you say so, though that's contrary to my experience. Anyway let's
stop this discussion, it's getting way off-topic.
 
W

websnarf

santosh said:
First include necessary headers: stdio.h, stdlib.h


Better yet, replace above with int main(void)


Don't cast return value of malloc() in C.

This is not a bug. Note that without some sort of cast, there is no
type checking, which is the biggest risk when dealing with void *
pointers.
[...] It can hide the non-inclusion of it's prototype,

On *SOME* older generation compilers. No modern compiler fails to give
a warning about this regardless of the cast.
[...] (by way of failure to include stdlib.h), and, on
some implementations, can result in nasty crashes during runtime.

Since sizeof(char) is by definition 1, you can omit that and instead do
'2 * sizeof *str'. This has the advantage of becoming automatically
updated when you later on happen to change the type of *str.

If you want automatic type safety you should do this:

#define safeMallocStr(p,n,type) do { (p) = (type *) malloc
((n)*sizeof (type)); } while (0);

and you get type checking, and correct semantics. So if you change the
type of the variable you are using, your compiler will issue warnings
if you mismatch here. Any variation you do in which you omit the cast
outside of malloc will fail to catch this "change the definition of the
pointer" scenario.
i++ ;
str = (char*) realloc(str, (2*sizeof(char)) + i ) ;
[...]

Anyway, your allocation strategy is very inefficient. Your calling
realloc() once every iteration of the loop. This could result in
fragmentation of the C library's memory pool. Why not allocate in terms
of fixed sized

Because your memory pool will *still* fragment. It is also O(n^2)
rather than O(n) because of the implicit copying performed in
realloc().
[...] or dynamically growing blocks, say 128 bytes or so to
start with?

You have to do thing in exponential steps (doublings is the most
obvious was to do this) otherwise, weak malloc strategies will
inevitably be explosed. Its also going to be very slow for large
inputs (it really *IS* O(n^2)). So are you going to trade a buffer
overflow in just to be handed back a denial of service?
 
W

websnarf

Sundar said:
Can u please further explain the following statement....

"Don't cast return value of malloc() in C. It can hide the
non-inclusion
of it's prototype, (by way of failure to include stdlib.h), and, on
some implementations, can result in nasty crashes during runtime."

On really old compilers, if you forget to explicitely declare the
prototype of a function before you use it, the compiler can assume that
it has a prototype like:

extern int malloc ( /* this is not void; its empty. */ );

But what it is supposed to be is:

void * malloc (size_t);

The compiler may emit different callsite code based on these. This is
usually an issue if void * and int are of different sizes, or target
different machine registers or whatever. The reasoning behind leaving
off the cast, is that even old compilers will crap out if you try to
assign an int to a pointer. I.e., they are using a completely
different error to assist you on what is really another kind of error
altogether.

This "automatic prototype" thing has been deprecated in the latest ANSI
C standards and there is basically no moden compiler in existence which
will not issue at least a warning as soon as it detects this scenario.
I personally always compile with "warning equals errors", and warnings
set to max or near max (of course the vendors own header files rarely
compile at the highest warning level -- *sigh*). So its not a real
issue for me, and it shouldn't be an issue for anyone who pays
attention to compiler warnings.

The cast allows for much more serious type checking. If you are using
the pattern:

<var> = (<type> *) malloc (<count> * sizeof (<type>));

and find a way of enforcing that the exact type is repeated precisely
(i.e., use a macro) then this usually works out better. If you change
<type> or the type of <var> the compiler will issue a warning unless
they are synchronized -- if you leave off this cast, you get no
assistance from the compiler, and you just have to get it correct.

So the whole idea of leaving off the cast is to help you because you
might make the mistake of omitting a header (a non-issue on modern
compilers which will give you this warning without this mechanism) but
it sacrifices more detailed type checking like making sure pointers
have the right base type which, for some reason, is not considered (by
some) as relevant a kind of mistake.

The whole "omit the cast" thing is a meme that exists amongst certain
people who have a very skewed idea of what it means to be C programmer.
If they are honest, they think that *development* of your code (not a
port-facto port) may, at any time, be forced onto some old C compiler
where their imagined scenario is an actual issue. More likely they
want to create an intention gratuitous incompatibility with C++ (where
the cast is mandatory.)
 
C

CBFalconer

Keith said:
Santosh Nayak said:
I suggest that one or both of the [Ss]antosh's currently posting
here consider using their full name to avoid confusion. (Neither
of them, of course, is obligated to follow my advice.)

Sounds Logical !

Thanks. If I have this straight, you've been signing yourself as
"Santosh", and you started posting here recently; the other Santosh
has been signing himself as "santosh" and has been posting here for
some time. Is that right?

I think it's the other way around. The 'good' santosh has been
using lower case s.
 
C

CBFalconer

Sundar said:
.... snip ...

Can u please further explain the following statement....

"Don't cast return value of malloc() in C. It can hide the
non-inclusion of it's prototype, (by way of failure to include
stdlib.h), and, on some implementations, can result in nasty
crashes during runtime."

I am unable to understand the intricacies of the above statement.

It's in the FAQ. Don't use silly abbreviations such as 'u'.
 
R

Richard Heathfield

(e-mail address removed) said:
This is not a bug.

It merely hides one.
Note that without some sort of cast, there is no
type checking, which is the biggest risk when dealing with void *
pointers.

No, there is an even bigger risk - cargo cult programming, which is what
most people are doing when they cast malloc.
[...] It can hide the non-inclusion of it's prototype,

On *SOME* older generation compilers. No modern compiler fails to give
a warning about this regardless of the cast.

So if anyone comes up with a counter-example, you can simply claim that it's
not a "modern" compiler. ("True Scotsman" argument.) Furthermore, do not
forget that some organisations are remarkably conservative, and will not
change software that they know to work - especially if that software is
mission-critical, as compilers easily can be.
[...] (by way of failure to include stdlib.h), and, on
some implementations, can result in nasty crashes during runtime.

Since sizeof(char) is by definition 1, you can omit that and instead do
'2 * sizeof *str'. This has the advantage of becoming automatically
updated when you later on happen to change the type of *str.

If you want automatic type safety you should do this:

#define safeMallocStr(p,n,type) do { (p) = (type *) malloc
((n)*sizeof (type)); } while (0);

That doesn't look very type-safe to me.

void *p;
safeMallocStr(p, n, void); /* requires a diagnostic */

void *q;
safeMallocStr(q, n, char);
int *r = q; /* so much for type safety */
and you get type checking, and correct semantics. So if you change the
type of the variable you are using, your compiler will issue warnings
if you mismatch here.

Why not just remove the mismatch risk completely?
Any variation you do in which you omit the cast
outside of malloc will fail to catch this "change the definition of the
pointer" scenario.

Wrong.

T *p;

p = malloc(n * sizeof *p);

Now change p's type to U *. The malloc is still correct, and does not need
an extra, potentially error-prone, edit to a spurious macro call.
 
C

CBFalconer

.... snip ...

The cast allows for much more serious type checking. If you are
using the pattern:

<var> = (<type> *) malloc (<count> * sizeof (<type>));

and find a way of enforcing that the exact type is repeated
precisely (i.e., use a macro) then this usually works out better.
If you change <type> or the type of <var> the compiler will issue
a warning unless they are synchronized -- if you leave off this
cast, you get no assistance from the compiler, and you just have
to get it correct.

If you use the recommended:

<var> = malloc(<count> * sizeof *<var>);

you need no casts, and the exact type is enforced without any
concealment behind obfuscating macros or whatever.
 
S

santosh

santosh said:
Santosh said:
I have to read characters from stdin and save them in a string. The
problem is that I don't know how much characters will be read.
i++ ;
str = (char*) realloc(str, (2*sizeof(char)) + i ) ;
[...]

Anyway, your allocation strategy is very inefficient. Your calling
realloc() once every iteration of the loop. This could result in
fragmentation of the C library's memory pool. Why not allocate in terms
of fixed sized

Because your memory pool will *still* fragment. It is also O(n^2)
rather than O(n) because of the implicit copying performed in
realloc().

The implicit copying will be performed in both cases, if realloc() runs
out of contiguous space. In addition calling realloc() for every byte
read adds significant avoidable overhead. Note the test below:

/* t1.c - realloc() once every byte */
#include <stdio.h>
#include <stdlib.h>

int main(void) {
unsigned char *buf = NULL, *stash = NULL;
size_t read = 0, bsize = 0;
int ch;

while((ch = fgetc(stdin)) != EOF) {
if(read == bsize) {
if((buf = realloc(buf, ++bsize)) == NULL) {
if(stash) free(stash);
return EXIT_FAILURE;
}
else stash = buf;
}
buf[read++] = ch;
}
free(stash);
return EXIT_SUCCESS;
}
/* end t1.c */

/* t2.c - buffer doubles upon each reallocation */
#include <stdio.h>
#include <stdlib.h>

int main(void) {
unsigned char *buf = NULL, *stash = NULL;
size_t read = 1, bsize = 1;
int ch;

while((ch = fgetc(stdin)) != EOF) {
if(read == bsize) {
if((buf = realloc(buf, bsize <<= 1)) == NULL) {
free(stash);
return EXIT_FAILURE;
}
else stash = buf;
}
buf[read - 1] = ch;
read++;
}
free(stash);
return EXIT_SUCCESS;
}
/* end t2.c */

$ gcc -Wall -Wextra -ansi -pedantic -o t1 t1.c
$ gcc -Wall -Wextra -ansi -pedantic -o t2 t2.c
$ du -b t1
7315 t1
$ du -b t2
7299 t2
$ du -sh /cdrom/boot/isolinux/linux
1.7M /cdrom/boot/isolinux/linux
$ time ./t1 < /cdrom/boot/isolinux/linux
real 0m1.483s
user 0m0.600s
sys 0m0.880s
$time ./t2 < /cdrom/boot/isolinux/linux
real 0m0.131s
user 0m0.104s
sys 0m0.028s
$

Well, apparently, the exponential allocation scheme is significantly
faster than the linear one.
[...] or dynamically growing blocks, say 128 bytes or so to
start with?

You have to do thing in exponential steps (doublings is the most
obvious was to do this) otherwise, weak malloc strategies will
inevitably be explosed. Its also going to be very slow for large
inputs (it really *IS* O(n^2)).

I'm sorry but I don't quite understand how it's going to any _slower_
than calling realloc() every one byte. I guess for large files, (or any
file for that matter), the disk overhead is going to swamp that of
realloc(), whether or not it's called every byte, since they won't be
buffered by the operating system.
So are you going to trade a buffer
overflow in just to be handed back a denial of service?

You can possibly recover from a denial of service situation but a
buffer overflow could mean a hard crash.
 
W

websnarf

santosh said:
santosh said:
Santosh wrote:
I have to read characters from stdin and save them in a string. The
problem is that I don't know how much characters will be read.
i++ ;
str = (char*) realloc(str, (2*sizeof(char)) + i ) ;

[...]
Anyway, your allocation strategy is very inefficient. Your calling
realloc() once every iteration of the loop. This could result in
fragmentation of the C library's memory pool. Why not allocate in terms
of fixed sized

Because your memory pool will *still* fragment. It is also O(n^2)
rather than O(n) because of the implicit copying performed in
realloc().

The implicit copying will be performed in both cases, if realloc() runs
out of contiguous space. In addition calling realloc() for every byte
read adds significant avoidable overhead. Note the test below:

/* t1.c - realloc() once every byte */
#include <stdio.h>
#include <stdlib.h>

int main(void) {
unsigned char *buf = NULL, *stash = NULL;
size_t read = 0, bsize = 0;
int ch;

while((ch = fgetc(stdin)) != EOF) {
if(read == bsize) {
if((buf = realloc(buf, ++bsize)) == NULL) {
if(stash) free(stash);
return EXIT_FAILURE;
}
else stash = buf;
}
buf[read++] = ch;
}
free(stash);
return EXIT_SUCCESS;
}
/* end t1.c */

/* t2.c - buffer doubles upon each reallocation */
#include <stdio.h>
#include <stdlib.h>

int main(void) {
unsigned char *buf = NULL, *stash = NULL;
size_t read = 1, bsize = 1;
int ch;

while((ch = fgetc(stdin)) != EOF) {
if(read == bsize) {
if((buf = realloc(buf, bsize <<= 1)) == NULL) {
free(stash);
return EXIT_FAILURE;
}
else stash = buf;
}
buf[read - 1] = ch;
read++;
}
free(stash);
return EXIT_SUCCESS;
}
/* end t2.c */

$ gcc -Wall -Wextra -ansi -pedantic -o t1 t1.c
$ gcc -Wall -Wextra -ansi -pedantic -o t2 t2.c
$ du -b t1
7315 t1
$ du -b t2
7299 t2
$ du -sh /cdrom/boot/isolinux/linux
1.7M /cdrom/boot/isolinux/linux
$ time ./t1 < /cdrom/boot/isolinux/linux
real 0m1.483s
user 0m0.600s
sys 0m0.880s
$time ./t2 < /cdrom/boot/isolinux/linux
real 0m0.131s
user 0m0.104s
sys 0m0.028s
$

Well, apparently, the exponential allocation scheme is significantly
faster than the linear one.

There is a reason for this. Let's do the math. The exponential
realloc strategy means you will call realloc() log_2(n) times, each
with a cost of 1, 2, 4, ..., O(N) in the event of a required memcpy().
Sum that up and you will see that its O(N). The constant factor is
basically 2 writes per byte.

If you do it linearly, k bytes at a time, then you will do O(N/k)
reallocs, with an average potential memcpy penalty of N/2. So that's a
cost of O(N*N/(2*k)) which is O(N^2).

Changing from "every byte" to some linear block of length k, is
basically trying to decrease the constant factor outside of the O(N^2).
Of course that will work for small N, but that's pointless and
unnecessary, and ineffective for large N.

This analysis is critically important, since it affects what you say
next:
[...] or dynamically growing blocks, say 128 bytes or so to
start with?

You have to do thing in exponential steps (doublings is the most
obvious was to do this) otherwise, weak malloc strategies will
inevitably be explosed. Its also going to be very slow for large
inputs (it really *IS* O(n^2)).

I'm sorry but I don't quite understand how it's going to any _slower_
than calling realloc() every one byte.

Who said this?
[...] I guess for large files, (or any
file for that matter), the disk overhead is going to swamp that of
realloc(), whether or not it's called every byte, since they won't be
buffered by the operating system.

If you are swapping to disk, then the inherent memcpy() performance hit
is still there and is significantly worsened. In fact I would suggest
that any scenario that leads to disk swapping in the linear growth
method would necessarily turn into a kind of DOS (whether intentional
or not.) While a robust enough system using the exponential strategy
is likely to continue to function even if it does swap.
You can possibly recover from a denial of service situation but a
buffer overflow could mean a hard crash.

So you are saying its ok to suffer DOSes when there is no good reason
to do so?
 
W

websnarf

CBFalconer said:
If you use the recommended:

<var> = malloc(<count> * sizeof *<var>);

you need no casts, and the exact type is enforced without any
concealment behind obfuscating macros or whatever.

Well, this still has the potential for cut and paste errors unless you
macrofy the whole line. If you don't macrofy, then you risk error no
matter what.

So let us take a more serious approach compare macros which prevent any
mismatch errors:

#define scaredOfCPlusPlus(var,count) var = malloc(count*sizeof *var)
#define newThing(type,count) (type *) malloc (count * sizeof (type))

So you can say var = newThing(char *, 512), and if the type is wrong,
the compiler tells you. Furthermore you can pass newThing(,) as a
parameter to a function. The scaredOfCPlusPlus(,) macro works fine,
but doesn't look familliar, and can't be passed as a parameter to a
function.

And, of course, the real difference is that the first compiles straight
in C++, and the second is just an error. I have found that in general
the C++ optimizers and warnings are better for the C++ mode of my
compilers than the C mode.
 
C

Chris Torek

And, of course, the real difference is that the first compiles straight
in C++, and the second is just an error. I have found that in general
the C++ optimizers and warnings are better for the C++ mode of my
compilers than the C mode.

Fortran compilers are usually even better than that. Obviously
you should write your C code so that it compiles with Fortran compilers.
 
E

Eric Sosman

Well, this still has the potential for cut and paste errors unless you
macrofy the whole line. If you don't macrofy, then you risk error no
matter what.

(Macros cure errors? News to me ...)

The principal advantage of the recommended form is that a
visual inspection of the line *in isolation* tells you whether
it's correct or incorrect. If the l.h.s. agrees with the sizeof
operand, the allocation is correct (unless <var> is a float or
an int or some other non-pointer thing, in which case the
compiler will squawk anyhow).

You do not need to go hunting for the declaration of <var>,
nor do you need to rummage around for the definition of some
macro and try to figure out its expansion. You can verify the
correctness of the code with a "local" inspection, without
digging through a pile of headers, hoping you've found the
right version and haven't been fooled by a morass of conditional
compilation.
So let us take a more serious approach compare macros which prevent any
mismatch errors:

#define scaredOfCPlusPlus(var,count) var = malloc(count*sizeof *var)
#define newThing(type,count) (type *) malloc (count * sizeof (type))

So you can say var = newThing(char *, 512), and if the type is wrong,
the compiler tells you. Furthermore you can pass newThing(,) as a
parameter to a function. The scaredOfCPlusPlus(,) macro works fine,
but doesn't look familliar, and can't be passed as a parameter to a
function.

Why not? I don't see any advantage in writing such a macro,
but if you chose to do so the expression it generated would be
perfectly good as an argument to function.
And, of course, the real difference is that the first compiles straight
in C++, and the second is just an error. I have found that in general
the C++ optimizers and warnings are better for the C++ mode of my
compilers than the C mode.

(Have you interchanged "first" and "second" here?)

It doesn't seem to me that one's C style should be twisted
in deference to the practices of C++, or of Java, or of COBOL.
Nor vice versa, of course. Speak English or speak Spanish,
but don't lapse into Spanglish.
 
K

Keith Thompson

Santosh Nayak said:
Yes, that is right. Thanks for the suggestion.

You're welcome.

And please don't snip attribution lines. I wrote the text above
starting with "If I have this straight". There should have been a
line at the start of the quoted text identifying the author, something
automatically provides such a line; please don't delete it.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,770
Messages
2,569,583
Members
45,073
Latest member
DarinCeden

Latest Threads

Top