Pointer and string literal question


Tagore

hi,

#include <stdio.h>

int main(void)
{
    char *s = "LET";
    char *t = "LET";

    if (s == t)
        printf("same");
    else
        printf("different");
    return 0;
}

With the code above, the output is "same", but I expected it to be
"different". I thought that s and t would point to string literals
stored at different addresses.
Can anyone please help me understand this output?

Regards,
 

Keith Thompson

bartc said:
Because the literals are identical, perhaps only a single copy is used.

Right. Compilers are explicitly permitted, but not required, to do
this. C99 6.4.5p6:

It is unspecified whether these arrays are distinct provided their
elements have the appropriate values. If the program attempts to
modify such an array, the behavior is undefined.

Your program shouldn't assume either that they're the same, or that
they aren't.

For example, for this program:

#include <stdio.h>

int main(void)
{
    char *s0 = "abcde";
    char *s1 = "abcde";
    char *s2 = "Xabcde";

    if (s0 == s1) {
        puts("s0 == s1");
    }
    else {
        puts("s0 != s1");
    }
    if (s0 == s2+1) {
        puts("s0 == s2+1");
    }
    else {
        puts("s0 != s2+1");
    }
    return 0;
}

all 4 possible results are valid. (The compiler I'm using prints
s0 == s1, s0 != s2+1
without optimization,
s0 == s1, s0 == s2+1
with optimization.)
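
If what you actually care about is whether two strings have the same contents, compare the characters rather than the addresses. A minimal sketch using strcmp from <string.h>; the helper name same_text is made up for illustration:

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: true when the two strings hold the same characters,
   regardless of where the compiler chose to store them. */
bool same_text(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}
```

With this, same_text(s0, s1) is true whether or not the compiler merged the two literals, so the program no longer depends on an unspecified choice.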
 

Jens Thoms Toerring

Tagore said:
#include <stdio.h>
int main(void){
char *s="LET";
char *t="LET";
if(s==t)
printf("same");
else
printf("different");
return 0;
}
In above code, output is "same".
but I expected output to be "different". I think that s and t points
to string literals present at different addresses.

Why do you think so? It's correct that both 's' and 't' point to
string literals, but since the strings they point to are identical,
one of the simplest (memory-related) optimizations for the compiler
is to make them point to the same location. Actually, that's
the very reason why you aren't allowed to change string literals:
if you did, for example,

s[1] = 'x'; /* not allowed by the C standard! */

then this would also change the content of what 't' is pointing to.
The people writing the C standard had two alternatives: allow changes
to string literals, in which case 's' couldn't point to the same
place as 't', making a certain kind of optimization impossible, or
allow optimizations like the one you are seeing here and thus forbid
changing string literals. They went with the second one, which to me
seems to be in the spirit of C, i.e. go for compact, fast compiled
programs that use as few resources as possible.

But if you don't like it your compiler may have a flag to make
it less standard-compliant and force it to produce code where
's' is pointing to a different location than 't' (and where you
thus may change string literals).
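
If you need strings that are both distinct and modifiable, no special flag is required: initialise char arrays from the literals instead of pointing at them. A small sketch (the function name arrays_are_independent is just for illustration):

```c
#include <string.h>

/* Hypothetical demo: returns 1 if changing one array leaves the other intact. */
int arrays_are_independent(void)
{
    char s[] = "LET";   /* array initialised from the literal: a private, writable copy */
    char t[] = "LET";   /* a second, separate array object */

    s[1] = 'x';         /* well defined: this modifies the array s, not a string literal */

    return strcmp(s, "LxT") == 0 && strcmp(t, "LET") == 0;
}
```

Each array is its own object, so writing to s cannot affect t.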
Regards, Jens
 

Kaz Kylheku

Why do you think so? It's correct that both 's' and 't' point to
string literals - but since the strings they point to are identical
it's one of the most simple (memory-related) optimizations for the
compiler to make them point to the same location. Actually, that's
the very reason why you aren't allowed to change string literals -
i.e. if you would do e.g.

It's not the only reason.

Literals are effectively pieces of the program text made available to itself as
data, so that modifying a literal de facto constitutes self-modifying code.
Self-modifying code can't be placed into read-only storage, such as a ROM, or
write-protected virtual pages.
s[1] = 'x'; /* not allowed by the C standard! */

This undefinedness also means that once you perform s[1] = 'x', a subsequent
statement of the form

if (s[1] == 'x') ...

could go either way (if it ever gets to execute at all). It's not just about
other copies of the literal being affected by the change.

The translated program is also simply not required to be aware of
self-modifications like this.

Not only can another instance of the literal share the same space as s, but the
expression s[1] can be optimized to a constant which does not respond to
changes to s.
 

Keith Thompson

Kaz Kylheku said:
Not only can another instance of the literal share the same space as
s, but the expression s[1] can be optimized to a constant which does
not respond to changes to s.

Agreed.

In addition, it's also likely (but not required) that attempting:

s[1] = 'x';

will cause your program to crash. (In fact, this is the *best*
outcome, since it shows you where the error is.)
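
If you would rather have the compiler catch the mistake than rely on a crash at run time, one common habit is to point at literals only through const char *. A minimal sketch; the names greeting and second_letter are invented for this example:

```c
/* 'greeting' is a hypothetical name. Because the pointee is const-qualified,
   a line such as   greeting[1] = 'x';   must be diagnosed by the compiler. */
const char *greeting = "LET";

/* Reading through the pointer is always fine. */
char second_letter(void)
{
    return greeting[1];
}
```

This doesn't change how the literal is stored; it only moves the error from run time to compile time.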
 

Eric Sosman

hi,

#include<stdio.h>
int main(void){
char *s="LET";
char *t="LET";
if(s==t)
printf("same");
else
printf("different");
return 0;
}

In above code, output is "same".
but I expected output to be "different". I think that s and t points
to string literals present at different addresses.
Can any one please help me in understanding its output.

As others have explained, the compiler might choose
to create only one nameless "LET" string, and aim both
pointers at that single instance.

A compiler might go further still:

    char *s = "LET";
    char *t = "NUMBER ONE WITH A BULLET";
    if (s == t)
        ... obviously false ...
    if (s == t+21)
        ... ??? ...
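
Whatever the pointer comparison yields, the characters starting at t+21 are always "LET", and that can be tested portably. A sketch, with ends_with being a made-up helper name:

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: does 'tail' match the end of 'whole'?
   It compares characters, so it does not depend on whether the
   compiler overlapped the two literals in memory. */
bool ends_with(const char *whole, const char *tail)
{
    size_t n = strlen(whole);
    size_t m = strlen(tail);

    return m <= n && strcmp(whole + (n - m), tail) == 0;
}
```

Here ends_with(t, s) is true on every implementation, while s == t+21 may be true on one compiler and false on another.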
 

Peter Nilsson

Eric Sosman said:
     As others have explained, the compiler might choose
to create only one nameless "LET" string, and aim both
pointers at that single instance.

     A compiler might go further still:

        char *s = "LET";
        char *t = "NUMBER ONE WITH A BULLET";
        if (s == t)
            ... obviously false ...
        if (s == t+21)
            ... ??? ...

Are you sure that's allowed in C89/90? I thought the
string literals had to be 'identical' before they could
share the same address.
 

Keith Thompson

Peter Nilsson said:
Are you sure that's allowed in C89/90? I thought the
string literals had to be 'identical' before they could
share the same address.

The wording did change between C90 and C99.

C90 6.1.4:

Identical string literals of either form need not be distinct. If
the program attempts to modify a string literal of either form,
the behavior is undefined.

where "either form" refers to character string literals and wide
string literals.

C99 6.4.5p6:

It is unspecified whether these arrays are distinct provided their
elements have the appropriate values. If the program attempts to
modify such an array, the behavior is undefined.

But the C90 standard didn't say that string literals that aren't
identical *can't* overlap (and I can't think of any good reason to
assume that they can't). I think C99 mostly just improved the
wording.
 

Eric Sosman

Are you sure that's allowed in C89/90? I thought the
string literals had to be 'identical' before they could
share the same address.

The word "identical" doesn't seem to appear in any part
of the C99 Standard that's relevant. But perhaps I've missed
something; can you cite which "identical" you're thinking of?

In C99, 6.4.5p6 says "It is unspecified whether these arrays
are distinct provided their elements have the appropriate values."
The word "appropriate" does not seem to me to imply "identical."

C89/ANSI 3.1.4 says "Identical string literals of either form
need not be distinct," but doesn't seem to say anything at all
about non-identical literals. (It doesn't even say that "X"
and "FOOBAR" are distinct.)

I don't have a copy of C90 to consult, but others have said
it's the same as C89 except for section and paragraph numbers.
 

Nick

Keith Thompson said:
The wording did change between C90 and C99.

C90 6.1.4:

Identical string literals of either form need not be distinct. If
the program attempts to modify a string literal of either form,
the behavior is undefined.

where "either form" refers to character string literals and wide
string literals.

C99 6.4.5p6:

It is unspecified whether these arrays are distinct provided their
elements have the appropriate values. If the program attempts to
modify such an array, the behavior is undefined.

But the C90 standard didn't say that string literals that aren't
identical *can't* overlap (and I can't think of any good reason to
assume that they can't). I think C99 mostly just improved the
wording.

There is, presumably, nothing to stop the compiler pointing s at four
bytes of machine code that happen to make up part of the body of your
program and which constitute the codes for L, E and T followed by a 0 byte.
If they should so happen to appear, of course.
 

James Dow Allen

Literals are effectively pieces of the program text made available to itself as
data, so that modifying a literal de facto constitutes self-modifying code.
Self-modifying code can't be placed into read-only storage, such as a ROM, or
write-protected virtual pages.

A related reason for "read-only when possible" concerns text-sharing.

One might have dozens of copies of the same program (e.g. interpreter)
running on one machine; the interpreter's data might include hundreds
of messages; there's a very big savings if the messages can be moved
to a read-only, sharable memory section. (There used to be a complicated
pre-processor that accomplished this, also looking for string matches;
it became obsolete when compilers started treating string literals as
read-only by default.)

James Dow Allen
 

Nick Keighley

Why do you think so? It's correct that both 's' and 't' point to
string literals - but since the strings they point to are identical
it's one of the most simple (memory-related) optimizations for the
compiler to make them point to the same location.

But if you don't like it your compiler may have a flag to make
it less standard-compliant and force it to produce code where
's' is pointing to a different location than 't' (and where you
thus may change string literals).

Why is this non-compliant?
 

Richard Bos

Keith Thompson said:
In addition, it's also likely (but not required) that attempting:

s[1] = 'x';

will cause your program to crash. (In fact, this is the *best*
outcome, since it shows you where the error is.)

It's even possible that a later

if (ch == 'L')

is compiled to compare to the first character of your string literal,
instead of to a literal 'L', on systems where this is faster. It is even
allowed that, if you do try to change the string, that comparison fails
when ch is 'L', at a point which _appears_ to have nothing whatsoever to
do with the original string literal.
I have never seen an implementation which goes that far in its
optimisations (in fact, I've never seen one where it would make sense),
but I would not be very surprised to find one. It would certainly be
perfectly legal.

Richard
 

Jens Thoms Toerring

Why is this non-compliant?

Sorry, that was badly expressed. What I meant was that there might
be a flag that gets the compiler to emit a working program (in the
sense of "as maybe expected by the user") for non-compliant code
(i.e. code that changes string literals, which otherwise
results in undefined behaviour). But on thinking about it a bit
more, even that doesn't guarantee that 's' and 't' will point to
different locations. What one would need for that is a flag that
suppresses the kind of optimization that merges identical (parts of)
string literals.
Regards, Jens
 

Michael Tsang

hi,

#include <stdio.h>
int main(void){
char *s="LET";
char *t="LET";
if(s==t)
printf("same");
else
printf("different");
return 0;
}

As string literals are really "const" char *, they are read-only and the
compiler is free to place them at the same or at different addresses.
 

Ben Bacarisse

Michael Tsang said:
Tagore wrote:

As string literals are really "const" char *, they are read-only and the
compiler is free to place them at the same or at different addresses.

It's worth pointing out (as I think you know from the quotes you used
round "const") that string literals are not actually const objects in
C. They are not modifiable (in that the effect of doing so is
undefined) but if they were really const, you'd get a compiler
diagnostic from the initialisations in the program above.

Also (and this is very much a small point) a literal like "same" is
really of type char[5] since sizeof will report the array object's
size not the size of a char *.
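
The sizeof point is easy to check for yourself. A small sketch (the function names literal_size and pointer_size are made up for illustration):

```c
#include <stddef.h>

/* Size of the array object behind the literal "same":
   four letters plus the terminating '\0'. */
size_t literal_size(void)
{
    return sizeof "same";
}

/* Once the literal has been converted to a pointer,
   sizeof reports the size of the pointer instead. */
size_t pointer_size(void)
{
    const char *p = "same";
    return sizeof p;
}
```

literal_size() returns 5 everywhere, while pointer_size() returns sizeof(char *), whatever that is on the implementation at hand.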
 
