Writing a C Compiler: lvalues

André Wagner

Hello,

I'm writing a C compiler. It's almost done, except that it is not
handling lvalues correctly.

Let me show an example. The code "x = 5" (let's say 'x' was declared
before) yields this in pseudo-assembly:

mov $b, $fp+8 ; $fp+8 is x's address, so I'm storing x's address in $b
mov $a, 5
mov [$b], $a  ; store what's in $a at the address pointed to by $b

Since 'x' is an lvalue in this case, I don't need its value, just the
address of the variable.

Now, if I want to access 'x' in the middle of a non-lvalue expression,
I would do:

mov $a, $fp+8
mov $a, [$a]

Notice how I get the variable's address, and from it, the value.

What I'm trying to say is: the compiler yields different assembly code
for when 'x' is an lvalue and when 'x' is not an lvalue.

This gets more confusing when I have expressions such as 'x++'. This
one is simple, since 'x' is obviously an lvalue in this case. In the
compiler, I can parse 'x' and see that the lookahead points to
'++', so it's an lvalue.

But what about '(x)++'? In this case, the compiler evaluates the
subexpression '(x)', and this expression yields the value of 'x', not
the address. Now I have a '++' ahead, so how can I know the address of
'x' when all I have is a value?

All the documentation I found about lvalues was too vague, and
aimed at the programmer rather than the compiler writer. Are there
any specific rules for determining whether the result of an expression
is an lvalue?

Thanks in advance,

André
[I believe the usual approach is to translate the expression into an
AST before doing much else, which has the useful effect of making the
parentheses go away. As you've found, in C you have to treat (x) and x
the same. It's not Fortran. -John]
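
As a rough illustration of the AST approach, here is a minimal sketch in
C. The node layout, the helper names (emit, make_paren, gen_addr,
gen_value) and the $c scratch register are assumptions made for this
example; they do not come from the original post. The point it tries to
show is that '(x)' produces no node of its own, and that the '++' node
itself asks for its operand's address instead of relying on lookahead.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical AST for a tiny expression subset. */
enum kind { N_VAR, N_CONST, N_POSTINC };

struct node {
    enum kind kind;
    int offset;          /* N_VAR: frame offset of the variable */
    int value;           /* N_CONST: literal value */
    struct node *child;  /* N_POSTINC: the operand */
};

/* Append formatted text to a NUL-terminated buffer of capacity cap. */
void emit(char *out, size_t cap, const char *fmt, ...)
{
    size_t used;
    va_list ap;

    assert(out != NULL && fmt != NULL);
    used = strlen(out);
    if (used >= cap)
        return;
    va_start(ap, fmt);
    vsnprintf(out + used, cap - used, fmt, ap);
    va_end(ap);
}

/* Parentheses add no node: the parser simply hands back the inner tree. */
struct node *make_paren(struct node *inner)
{
    return inner;
}

/* Emit code leaving the ADDRESS of e in $b; return 0 if e has none. */
int gen_addr(const struct node *e, char *out, size_t cap)
{
    if (e->kind != N_VAR)
        return 0;
    emit(out, cap, "mov $b, $fp+%d\n", e->offset);
    return 1;
}

/* Emit code leaving the VALUE of e in $a; return 0 on a semantic error. */
int gen_value(const struct node *e, char *out, size_t cap)
{
    switch (e->kind) {
    case N_CONST:
        emit(out, cap, "mov $a, %d\n", e->value);
        return 1;
    case N_VAR:
        if (!gen_addr(e, out, cap))
            return 0;
        emit(out, cap, "mov $a, [$b]\n");
        return 1;
    case N_POSTINC:
        /* The operator, not the lookahead, asks for the operand's address. */
        if (!gen_addr(e->child, out, cap))
            return 0;
        emit(out, cap, "mov $a, [$b]\n");  /* result: the old value */
        emit(out, cap, "mov $c, $a\n");
        emit(out, cap, "add $c, 1\n");
        emit(out, cap, "mov [$b], $c\n");  /* store the new value */
        return 1;
    }
    return 0;
}
```

With this shape, building the tree for 'x++' and for '(x)++' gives the
same nodes and therefore the same output from gen_value, while '5++'
is rejected because gen_addr refuses a constant.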
 
Ben Bacarisse

André Wagner said:
I'm writing a C compiler. It's almost over, except that is not
handling lvalues correctly.
What I'm trying to say is: the compiler yields different assembly code
for when 'x' is a lvalue and when 'x' is not a lvalue.

Yes, that's normal -- at least at the level of the abstract machine,
which seems to be roughly what your pseudo-assembler is.
This gets more confusing when I have expressions such as 'x++'. This
is simple, since 'x' is obviously a lvalue in this case. In the case
of the compiler, I can parse 'x' and see that the lookahead points to
'++', so it's a lvalue.

But what about '(x)++'? In this case, the compiler evaluates the
subexpression '(x)', and this expression results the value of 'x', not
the address. Now I have a '++' ahead, so how can I know the address of
'x' since all that I have is a value?

All documentation that I found about lvalues were too vague, and
directed to the programmer, and not to the compiler writer. Are there
any specific rules for determining if the result of a expression is a
lvalue?

The C standard (draft PDF available here[1]) tells you which expression
forms denote lvalues and which don't. As you traverse the parse tree,
the "lead operator" of the tree will tell you whether you need l- or
r-value evaluation. The result will be rather naive code, but it is a
start.

[1] http://www.open-std.org/JTC1/SC22/WG14/www/docs/n1256.pdf
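
One way to picture the "lead operator decides" idea is a small lookup
that, given an operator and an operand position, says which kind of
evaluation the operand needs. This is only a sketch: the enum names and
the function operand_mode are invented for illustration, and the
operator list is a small subset of C.

```c
/* Hypothetical operator tags for a handful of C operators. */
enum op {
    OP_ASSIGN, OP_ADD_ASSIGN,          /* =  +=          */
    OP_PREINC, OP_PREDEC,              /* ++x  --x       */
    OP_POSTINC, OP_POSTDEC,            /* x++  x--       */
    OP_ADDR,                           /* &x             */
    OP_DEREF,                          /* *p             */
    OP_ADD, OP_MUL, OP_NEG             /* a+b  a*b  -a   */
};

enum mode { MODE_RVALUE, MODE_LVALUE };

/*
 * Which evaluation does operand number `index` (0 = first/left) of `op`
 * need?  MODE_LVALUE means "produce the object's location",
 * MODE_RVALUE means "produce the stored value".
 */
enum mode operand_mode(enum op op, int index)
{
    switch (op) {
    case OP_ASSIGN:
    case OP_ADD_ASSIGN:
        /* Only the left operand is a store target. */
        return index == 0 ? MODE_LVALUE : MODE_RVALUE;
    case OP_PREINC:
    case OP_PREDEC:
    case OP_POSTINC:
    case OP_POSTDEC:
    case OP_ADDR:
        return MODE_LVALUE;
    case OP_DEREF:
        /* The operand is an ordinary pointer value; it is the RESULT
           of '*' that designates an object. */
        return MODE_RVALUE;
    default:
        return MODE_RVALUE;
    }
}
```

A tree walker could call operand_mode before descending into each child
and pick the address-producing or value-producing routine accordingly.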
<snip>
 
Tom St Denis

What I'm trying to say is: the compiler yields different assembly code
for when 'x' is a lvalue and when 'x' is not a lvalue.

This gets more confusing when I have expressions such as 'x++'. This
is simple, since 'x' is obviously a lvalue in this case. In the case
of the compiler, I can parse 'x' and see that the lookahead points to
'++', so it's a lvalue.

But what about '(x)++'? In this case, the compiler evaluates the
subexpression '(x)', and this expression results the value of 'x', not
the address. Now I have a '++' ahead, so how can I know the address of
'x' since all that I have is a value?

++ requires an object that an address can be taken of attached to
either the right or left which forms part of a larger expression.

so it's really

(object)++

could be, for instance

(*(ptr + a))++

For all it matters.

I guess it depends on how you wrote your parser, but basically when
you encounter ++ it must either be before or after an expression whose
address is computable.
All documentation that I found about lvalues were too vague, and
directed to the programmer, and not to the compiler writer. Are there
any specific rules for determining if the result of a expression is a
lvalue?

Read the BNF grammar for C. The full BNF form is in appendix A32 of K&R
C 2nd edition. Page 238 describes how to look at both post and prefix
expressions.

BTW I don't claim to be a compiler theory expert so that's about all
the help you're gonna get from me :)

Tom
 
Keith Thompson

André Wagner said:
...
What I'm trying to say is: the compiler yields different assembly code
for when 'x' is a lvalue and when 'x' is not a lvalue.

Of course.
This gets more confusing when I have expressions such as 'x++'. This
is simple, since 'x' is obviously a lvalue in this case. In the case
of the compiler, I can parse 'x' and see that the lookahead points to
'++', so it's a lvalue.

But what about '(x)++'? In this case, the compiler evaluates the
subexpression '(x)', and this expression results the value of 'x', not
the address. Now I have a '++' ahead, so how can I know the address of
'x' since all that I have is a value?

In C, a parenthesized lvalue is an lvalue.
All documentation that I found about lvalues were too vague, and
directed to the programmer, and not to the compiler writer. Are there
any specific rules for determining if the result of a expression is a
lvalue?

The definitive document is the C standard. You can get a copy of the
1999 ISO C standard by sending money to your national standard body;
see, for example, webstore.ansi.org. Or you can get a free copy of
the latest post-C99 draft, incorporating the C99 standard plus the
three Technical Corrigenda, at
<http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf>. The
Technical Corrigenda themselves are available at no charge.

I wouldn't even consider trying to implement a compiler without
having a copy of the standard for the language.

Some relevant passages:

C99 6.3.2.1p2:

Except when it is the operand of the sizeof operator, the unary &
operator, the ++ operator, the -- operator, or the left operand
of the . operator or an assignment operator, an lvalue that
does not have array type is converted to the value stored in
the designated object (and is no longer an lvalue).

C99 6.5.1p5:

A parenthesized expression is a primary expression. Its
type and value are identical to those of the unparenthesized
expression. It is an lvalue, a function designator, or a void
expression if the unparenthesized expression is, respectively,
an lvalue, a function designator, or a void expression.

Note that the definition of "lvalue" in C99 6.3.2.1p1 is flawed, or
at least incomplete. An lvalue is not merely "an expression with
an object type or an incomplete type other than void"; it's such
an expression that designates, or that could designate, an object.
For example, int is an object type, and 42 is an expression of
type int, but 42 is not an lvalue. On the other hand, if ptr is a
pointer-to-int, *ptr is an lvalue, even if ptr==NULL (but attempting
to use it invokes undefined behavior).
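
For a compiler writer, the passages above can be turned into a
syntactic check on the expression tree. The following is a simplified
sketch, not a complete implementation of the standard: the node kinds
and the function is_lvalue are invented for illustration, E_IDENT is
assumed to name an object (not a function or an enumeration constant),
and E_DEREF is assumed not to yield a function designator.

```c
/* Hypothetical expression kinds; only a subset of C is covered. */
enum kind {
    E_IDENT,    /* x      (assumed to name an object)            */
    E_CONST,    /* 42                                            */
    E_STRING,   /* "abc"                                         */
    E_PAREN,    /* (e)                                           */
    E_DEREF,    /* *e                                            */
    E_INDEX,    /* e1[e2]                                        */
    E_MEMBER,   /* e.m                                           */
    E_ARROW,    /* e->m                                          */
    E_ADD,      /* e1 + e2                                       */
    E_CALL,     /* f(e)                                          */
    E_POSTINC,  /* e++                                           */
    E_ASSIGN    /* e1 = e2                                       */
};

struct expr {
    enum kind kind;
    const struct expr *lhs;  /* first (or only) operand, if any */
    const struct expr *rhs;  /* second operand, if any          */
};

/* Return 1 if e is an lvalue under the simplified rules above. */
int is_lvalue(const struct expr *e)
{
    switch (e->kind) {
    case E_IDENT:
    case E_STRING:
    case E_DEREF:   /* an lvalue even if the pointer turns out to be null */
    case E_INDEX:
    case E_ARROW:
        return 1;
    case E_PAREN:   /* 6.5.1p5: same lvalue-ness as the inner expression */
    case E_MEMBER:  /* e.m is an lvalue exactly when e is */
        return is_lvalue(e->lhs);
    default:        /* constants, a+b, calls, e++, assignments, ... */
        return 0;
    }
}
```

Under these rules '(x)' and '*f(a)' are lvalues, while '42', 'x + 5'
and 'x++' are not, so '(x)++' is accepted and '42++' can be diagnosed.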
 
Eric Sosman

++ requires an object that an address can be taken of attached to
either the right or left which forms part of a larger expression.

Yes to "object," no to "address can be taken." Examples:

register int obj1 = 42;
struct { int obj2 : 7; } s = { 42 };
++obj1; // okay
s.obj2++; // okay
&obj1; // constraint violation
&s.obj2; // constraint violation
 
Stargazer

Hello,

I'm writing a C compiler. It's almost over, except that is not
handling lvalues correctly.

It's not "almost over" then :)
Let me show a example. The code "x = 5" (let's say 'x' was declared
before) yields this in pseudo-assembly:

mov $b, $fp+8 ; $fp+8 is x's address, so I'm storing x's address in $b
mov $a, 5
mov [$b], $a  ; store what's in $a at the address pointed to by $b

Since 'x' is a lvalue in this case, I don't need its value, just the
address of the variable.

Now, if I want to access 'x' in the middle of a non-lvalue expressing,
I would do:

mov $a, $fp+8
mov $a, [$a]

It looks like real x86 assembly, and it looks like you're jumping into
assembly generation too early.
Notice how I get the varible addres, and from it, the value.

What I'm trying to say is: the compiler yields different assembly code
for when 'x' is a lvalue and when 'x' is not a lvalue.

This gets more confusing when I have expressions such as 'x++'. This
is simple, since 'x' is obviously a lvalue in this case. In the case
of the compiler, I can parse 'x' and see that the lookahead points to
'++', so it's a lvalue.

No, you can't assume that the programmer always writes correct code. A
programmer may make a mistake, as in Eric's example, or may write junk
such as

if (heaven)
666--;

and the compiler must be able to determine that an assignment to a
non-lvalue takes place.
But what about '(x)++'? In this case, the compiler evaluates the
subexpression '(x)', and this expression results the value of 'x', not
the address. Now I have a '++' ahead, so how can I know the address of
'x' since all that I have is a value?

When I attempted to write a C compiler (I wrote the parser by hand), I
defined a "simpler C" pseudo-code, a subset of C which allowed only
assignments of the form "__temp_NN = &var;", "__temp_NN = *__temp_MM;",
"*__temp_NN = __temp_MM;", "__temp_NN = ~__temp_MM;" (instead of "~"
there could be "!" or "-") and "__temp_NN = var1 + var2;" (instead of
"+" there could be any arithmetic or logical binary operator). Also
allowed were conditional branches of the form "if (__temp_NN != 0) goto
xxx;" and unconditional branches ("goto xxx;"). The "__temp_NN" were
temporary variables of a type suitable for machine registers; if the
compiler ran out of registers, they were added as additional local
variables.

Then "x" and "address of x" would be evaluated separately, something
like "__temp_1 = x;", then at the next sequence point: "__temp_2 = &x;
*__temp_2 = __temp_1". If "x" is not an lvalue, the compiler will fail
while generating "__temp_2 = &x" and show a diagnostic.

Pseudo-code is a good thing: it allows easy debugging of the parser
and also easy processing by the optimizer. Pseudo-code should be defined
so that it meets the C standard's requirements (if the standard is a
guide for programmers, for the compiler writer it is an SRS) and so
that it includes only operations supported by any sensible CPU
architecture.

Note that while you don't need to care about anything that is
"undefined behavior" (the generated code need not be meaningful), you
must add special processing rules for the standard's constraints.
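
To make the "simpler C" idea a bit more concrete, here is a small
sketch of lowering 'x++' into statements of that shape. Everything in
it is an assumption made for illustration: the struct lowering buffer,
the functions append and lower_postinc, the caller-supplied
operand_is_lvalue flag, and the use of the constant 1 as the second
operand of "+" (the description above only mentions variables there).

```c
#include <stdio.h>
#include <string.h>

struct lowering {
    int next_temp;      /* number of the last temporary handed out */
    char text[512];     /* accumulated "simpler C" statements */
};

/* Append one already-formatted statement to the lowering buffer. */
void append(struct lowering *lo, const char *line)
{
    size_t used = strlen(lo->text);
    snprintf(lo->text + used, sizeof lo->text - used, "%s", line);
}

/*
 * Lower `operand++`.  Returns the number of the temporary that holds
 * the result (the old value), or -1 if the operand is not an lvalue,
 * in which case nothing is emitted and a diagnostic is printed.
 */
int lower_postinc(struct lowering *lo, const char *operand,
                  int operand_is_lvalue)
{
    char line[128];
    int addr, old, inc;

    if (!operand_is_lvalue) {
        fprintf(stderr, "error: operand of ++ is not an lvalue: %s\n",
                operand);
        return -1;
    }
    addr = ++lo->next_temp;
    old = ++lo->next_temp;
    inc = ++lo->next_temp;

    snprintf(line, sizeof line, "__temp_%d = &%s;\n", addr, operand);
    append(lo, line);
    snprintf(line, sizeof line, "__temp_%d = *__temp_%d;\n", old, addr);
    append(lo, line);
    snprintf(line, sizeof line, "__temp_%d = __temp_%d + 1;\n", inc, old);
    append(lo, line);
    snprintf(line, sizeof line, "*__temp_%d = __temp_%d;\n", addr, inc);
    append(lo, line);
    return old;
}
```

Calling lower_postinc on a zero-initialized struct lowering with the
operand "x" appends four statements (take the address, load the old
value, add one, store back), whereas an operand such as "666" flagged as
a non-lvalue produces only the diagnostic.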
 
Marc van Lieshout

What I'm trying to say is: the compiler yields different assembly code
for when 'x' is a lvalue and when 'x' is not a lvalue.

This gets more confusing when I have expressions such as 'x++'. This
is simple, since 'x' is obviously a lvalue in this case. In the case
of the compiler, I can parse 'x' and see that the lookahead points to
'++', so it's a lvalue.

But what about '(x)++'? In this case, the compiler evaluates the
subexpression '(x)', and this expression results the value of 'x', not
the address. Now I have a '++' ahead, so how can I know the address of
'x' since all that I have is a value?
All documentation that I found about lvalues were too vague, and
directed to the programmer, and not to the compiler writer. Are there
any specific rules for determining if the result of a expression is a
lvalue?

An lvalue is an expression that evaluates to an address, so it *can* be
used on the left hand side of an assignment. But this is not necessarily
the case. In an expression like (x + 5) x *is* an lvalue, but it isn't
used as such, so it should be compiled as an ordinary rvalue. So in the
expression x = foo, x should be compiled as an lvalue (an address to
which a value is assigned), and in the expression foo = x, x should be
compiled to an rvalue (code that results in the value of x).

As far as I can tell, you're trying syntax-directed translation on a
C-like language. That can be done but, in a grammar like C, you have to
postpone compilation of an identifier until you know how it's used.

If you want to see an example of using immediate (syntax-directed)
compilation of a C-like language, look at the source code of David Betz's
BOB compiler. It compiles to bytecodes, which are interpreted by a
virtual machine.

The original Dr. Dobb's article:

http://www.drdobbs.com/184409401

The latest sources via:

http://www.xlisp.org/
 
Eric Sosman

An lvalue is an expression that evaluates to an address, so it *can* be
used on the left hand side of an assignment.

That won't quite do. Here are two counter-examples, one an
expression that evaluates to an address but is not an lvalue:

malloc(42)

... and one an lvalue that cannot possibly involve an address:

register int x;
x = 42;

An lvalue (we're talking C here, right?) "is an expression with an
object type or an incomplete type other than void" (6.3.2.1p1).
 
Keith Thompson

Marc van Lieshout said:
An lvalue is an expression that evaluates to an address, so it *can* be
used on the left hand side of an assignment.

Note that this is cross-posted to comp.lang.c and comp.compilers.
I'm posting this from comp.lang.c, and I'm using the C standard's
definitions of terms. C's definition of "lvalue" isn't necessarily
consistent with wider usage. For details, see
<http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf>,
section 6.3.2.1.

No, an lvalue is (roughly) an expression that *designates an object*.
(I say "roughly" because ``*ptr'' is an lvalue even if ptr==NULL.)

An lvalue can designate an object that has no address, such as
a bit field or a register variable. Conversely, an expression
that evaluates to an address, such as ``&obj'' is not necessarily
an lvalue.

The distinction, in C, between computing the address of an object
and "designating" an object is subtle but important.

It's very likely that the code generated for evaluating an lvalue will
compute the address of the designated object, which is relevant if
you're writing a compiler, but as far as C is concerned that's an
implementation detail that's not covered by the standard.
But this is not necessarily
the case. In an expression like (x + 5) x *is* an lvalue, but it isn't
used as such, so it should be compiled as an ordinary rvalue.

C99 6.3.2.1p2:

Except when it is the operand of [list of operators deleted],
an lvalue that does not have array type is converted to the value
stored in the designated object (and is no longer an lvalue).

So in (x + 5), x *isn't* an lvalue, even though it started out as one.

[...]
 
Keith Thompson

Eric Sosman said:
That won't quite do. Here are two counter-examples, one an
expression that evaluates to an address but is not an lvalue:

malloc(42)

... and one an lvalue that cannot possibly involve an address:

register int x;
x = 42;

An lvalue (we're talking C here, right?) "is an expression with an
object type or an incomplete type other than void" (6.3.2.1p1).

Which is a horribly incomplete definition. 42 is an lvalue by
this definition (since int is an object type), but it's clearly
not intended to be an lvalue.

The essence of lvalue-ness is that an lvalue designates an object.
The trick is defining the term so that *ptr remains an lvalue even
if ptr==NULL.
 
bart.c

Marc said:
An lvalue is an expression that evaluates to an address, so it *can*
be used on the left hand side of an assignment. But this is not
necessarily the case. In an expression like (x + 5) x *is* an lvalue,
but it isn't used as such, so it should be compiled as an ordinary
rvalue. So in the expression x = foo, x should be compiled as an
lvalue (an address to which a value is assigned), and in the
expression foo = x, x should be compiled to an rvalue (code that
results in the value of x).

Actually there is little difference between lvalues and rvalues:

A variable x used as an lvalue involves taking the address of x and
automatically dereferencing to store a value into it.

A variable x used as an rvalue involves taking the address of x and
automatically dereferencing to load a value from it.

The compiler can treat them the same, other than checking that it qualifies
as an lvalue.

Some terms and expressions (such as the result of a+b) generally don't
involve addresses so can only be rvalues.

There might be some confusion when such an operation actually yields an
address value; in this case, this is still an rvalue, unless accompanied by
explicit dereferencing (depending on language):

*f(a) = x; /* '*f(a)' is the lvalue */

In fact it is probably the act of dereferencing (implicitly or explicitly)
that tells the compiler to check for lvalue-ness.

And since this is cross-posted to comp.lang.c, there are always the usual
exceptions mentioned whenever the subject comes up: register variables,
bitfields and so on, which in C cannot have their address taken, yet are
still lvalues!

This is not a big deal: it is possible to have an abstract concept of the
address of a register or bitfield, since to qualify as an lvalue no actual
address need be taken; only the potential for one is important (as the
address, or taking the address, is accompanied by the dereference, so these
cancel out).
 
lawrence.jones

Keith Thompson said:
So in (x + 5), x *isn't* an lvalue, even though it started out as one.

I disagree -- x *is* an lvalue, but it's converted to the value stored in
the object when the containing expression is evaluated.
 
Keith Thompson

I disagree -- x *is* an lvalue, but it's converted to the value stored in
the object when the containing expression is evaluated.

A subtle distinction at best. As I wrote upthread, the standard says:

C99 6.3.2.1p2:

Except when it is the operand of [list of operators deleted],
an lvalue that does not have array type is converted to the value
stored in the designated object (and is no longer an lvalue).

So in the above context, ``x'' *was* an lvalue, but "is no longer"
an lvalue.

The wording is a bit odd. When exactly does it stop being an lvalue?
Surely conversions happen during execution, conceptually at least,
(right?) but by that time lvalue-ness is no longer meaningful.

(Note to comp.compilers readers: this is *very* C-specific.)
 
Hans-Peter Diettrich

Keith said:
I disagree -- x *is* an lvalue, but it's converted to the value stored in
the object when the containing expression is evaluated.

A subtle distinction at best. As I wrote upthread, the standard says:

C99 6.3.2.1p2:

Except when it is the operand of [list of operators deleted],
an lvalue that does not have array type is converted to the value
stored in the designated object (and is no longer an lvalue).

So in the above context, ``x'' *was* an lvalue, but "is no longer"
an lvalue.

IMO the question should be rephrased as:

What can be *used* as an lvalue?

in order to find out what value items can be elevated to lvalue items,
when required by the grammar.

As the name says, only an lvalue can occur at the left hand side of an
assignment. This IMO doesn't imply that it has an address (e.g. register
variables) or is mutable (by ++), so that other names should be used for
items that have further attributes, like "is-mutable" or "has-address"
for items usable with the address-of or auto-increment operator.

While a register-based local variable can be used as an lvalue inside
the procedure, it cannot be passed to a subroutine as a pointer. A
compiler can flag such a usage as an error, or it can (silently) remove
the "register" attribute from the variable declaration, in order to make
it eligible as a general lvalue.

Thus a compiler can have multiple rules, for what is acceptable as an
lvalue in various use-cases. A single simple "requires an lvalue" error
message IMO is inappropriate.
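
A sketch of what such use-specific rules might look like, with one
diagnostic per failed requirement. The struct operand attributes, the
enum use values and the function check_use are all hypothetical names
chosen for this example, and the rules are deliberately simplified
(for instance, unary & on a function designator is not modelled).

```c
#include <stddef.h>

/* Hypothetical attributes the compiler has collected for an operand. */
struct operand {
    int is_lvalue;    /* designates an object at all */
    int is_const;     /* const-qualified, so not modifiable */
    int is_register;  /* declared with the register storage class */
    int is_bitfield;  /* names a bit-field member */
};

enum use { USE_ASSIGN_TARGET, USE_INCREMENT, USE_ADDRESS_OF };

/*
 * Return NULL if operand `o` is acceptable for use `u`, otherwise a
 * message naming the specific requirement that failed.
 */
const char *check_use(const struct operand *o, enum use u)
{
    if (!o->is_lvalue)
        return "operand is not an lvalue";

    switch (u) {
    case USE_ASSIGN_TARGET:
    case USE_INCREMENT:
        if (o->is_const)
            return "operand is not modifiable (const-qualified)";
        break;
    case USE_ADDRESS_OF:
        if (o->is_register)
            return "cannot take the address of a register variable";
        if (o->is_bitfield)
            return "cannot take the address of a bit-field";
        break;
    }
    return NULL;
}
```

With this split, a register variable or a bit-field passes the
USE_INCREMENT check but gets its own message under USE_ADDRESS_OF,
while a constant such as 42 fails every use with the plain
"not an lvalue" message.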

DoDi
 
s_dubrovich

Hello,

I'm writing a C compiler. It's almost over, except that is not
handling lvalues correctly. ...
What I'm trying to say is: the compiler yields different assembly code
for when 'x' is a lvalue and when 'x' is not a lvalue.

This gets more confusing when I have expressions such as 'x++'. This
is simple, since 'x' is obviously a lvalue in this case. In the case
of the compiler, I can parse 'x' and see that the lookahead points to
'++', so it's a lvalue.
Yes, the post increment operator takes an lvalue as an operand.
K&R C Appendix A 7.2 Unary Operators, unary-expression: ..., lvalue ++, ...
But what about '(x)++'? In this case, the compiler evaluates the
subexpression '(x)', and this expression results the value of 'x', not
the address. Now I have a '++' ahead, so how can I know the address of
'x' since all that I have is a value?

'x' is an expression, and '(x)' is an expression, but in
implementation, you need to delay their treatment until you know what
to do with them, the post increment operator says how to treat them,
each as a 'lvalue'.

By virtue of '(let's say 'x' was declared before)', presumably your
compiler holds 'x' in its symbol table, along with its attributes. So
at the point of parsing the operator, you now know how to treat the
lexeme 'x': you take its lvalue, instead of its rvalue, and apply the
operator, post increment, to it (in a syntactical sense).

I had similar problems keeping this straight, so I turned my
attention to BCPL and learned that it has operators for lvalue .LVAL. and
rvalue .RVAL. which can be applied to a suitable identifier, hence the
following.

My notes:

Historically, lvalue and rvalue take their cue from their position in
an assignment statement: lvalue := rvalue; such that (in C syntax)
a = b; where the value held in the location identified by variable 'b'
is copied to the location identified by variable 'a'. Thus a variable
can be said to possess three attributes: a lexeme 'b', a location, and
a value. In contrast, a constant has only two attributes: a lexeme '3'
and a value, but no location. Thus an assignment to a constant is an
error state. Because of this, I think of an lvalue in terms of being
the 'location value' of a variable, and an rvalue in terms of being
the 'referenced value' held in a variable's lvalue.

Two points on the above:
1) I don't mean to say there are only three attributes; obviously
there are more: scope, type, etc.

2) Using the terminology 'location value' instead of 'address' makes
sense of things like 'register' and user-defined types which may not
have an ordinal address for a pointer.
All documentation that I found about lvalues were too vague, and
directed to the programmer, and not to the compiler writer. Are there
any specific rules for determining if the result of a expression is a
lvalue?

Yes, follow the operator. Again, see K&R, The C Programming
Language, Appendix A, C Reference Manual. Supplement with the current
ISO standards.

hth,

Steve
[I believe the usual approach is to translate the expression into an
AST before doing much else, which has the useful effect of making the
parentheses go away. As you've found, in C you have to treat (x) and x
the same. It's not Fortran. -John]
 
lawrence.jones

Keith Thompson said:
A subtle distinction at best. As I wrote upthread, the standard says:

C99 6.3.2.1p2:

Except when it is the operand of [list of operators deleted],
an lvalue that does not have array type is converted to the value
stored in the designated object (and is no longer an lvalue).

So in the above context, ``x'' *was* an lvalue, but "is no longer"
an lvalue.

By that token, it's no longer ``x'' either, it's just the value stored
in x. The wording is a bit odd because lvalue-ness sits firmly on the
fence between syntax and semantics and is thus awkward to talk about
from either side because it doesn't quite fit (which is also why the
definition has been so darned hard to get right).
 
