
Keith Thompson

Michael Press said:
I disagree. The token "--" can only be picked out by
defining it to be a token a priori; and we do that only
because we know in a later phase it will be found to have meaning.

That's true only in the sense that the designers of the language chose
to make "--" a token because it has certain meanings in certain contexts
in later translation phases.

Several tokens consist of multiple punctuation characters.
See C99 6.4.6 for a complete list. Each of these is recognized
quite early in the translation process. For example, "--" and "-"
are recognized as tokens in the same early phase (phase 3); they're
recognized as pre- or post-decrement, or as unary or binary minus,
in a later phase (phase 7).

If you don't have a copy of the C standard, download
<http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf>.
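For example, this complete program (an illustration, not from the standard)
shows which split the tokenizer chose: "x---y" scans in phase 3 as
x "--" "-" y, and phase 7 then parses that as (x--) - y.

    #include <stdio.h>

    int main(void)
    {
        int x = 5, y = 2;
        int r = x---y;            /* tokens: x -- - y; parsed as (x--) - y */
        printf("%d %d\n", r, x);  /* prints "3 4" */
        return 0;
    }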

[...]
 
Ben Bacarisse

Michael Press said:
Ben Bacarisse said:
Michael Press said:
Thank you all for explaining this to me. I heard people
speak of a tokenizer and a semantic parser so I thought
that identifying tokens and finding meaning are two
entirely separate processes in the formal model used to
generate a machine executable. So by my, incorrect,
picture we first identify tokens. ---x produces the
list ("-", "-", "-", "x", ";"). Then the grammar finds
meaning.

That description is not wrong. The only part that is wrong is the
actual list of tokens. The process *is* a two-phase one[1]: first find
the tokens and then use the grammar to find the structure (I'd reserve
the word "meaning" for something else, but that really is an unimportant
detail).

It's not clear from your example if you thought that a token was just a
synonym for a character (it would have been clear if you'd used a
multi-character variable rather than 'x')

I know that abc is a token.
---abc -> ("-", "-", "-", "abc").

but one way or another all you
got wrong was the details of the rule used for finding the tokens.

Yes. The reason I got it wrong is that I considered
going from the string "--" to the operator pre-decrement
to be a two step affair.

It is. Seriously, bear with me... If I write (as the only text in a C
program)

-- -- --

this is tokenised into three '--' tokens but it has no meaning. It is
not even clear if I have a pre- or a post-decrement operator or some
mixture of them. Since the source does not parse, in effect I have
neither -- it could be argued that there are no operators there at all.

When a token like '*' is recognised, it can't be decided what it is.
There are three possible meanings that come to mind (multiplication,
pointer declarator syntax and pointer de-reference), but there are more,
at least in C99. The token is an abstract thing with no meaning until
the parser decides what it is. '--' has only two meanings that are
closely related, so it's easy to think of '--' as always being decrement,
but in fact it is just '--' until the parser can give it a meaning.

Because making "--" a token is looking ahead to the phase
where meaning for the C programming language is found.

I can see how you think that, but there is no meaning. There is a
fixed set of strings starting with '-' that can be tokens ('-', '--',
'->' and '-=') and the tokeniser chooses the longest one that it finds.
No need to resort to what any of these mean or indeed whether any make
sense -- whichever is the longest is the one that will be chosen.

I disagree. The token "--" can only be picked out by
defining it to be a token a priori; and we do that only
because we know in a later phase it will be found to have meaning.

Yes, all the tokens *are* defined up front and quite independently of
whether they will or will not have any meaning later in the parse. One
example might be '*' that I talked about above. It is a token with many
possible meanings but it can be picked out as a token without any
reference to any of them. Perhaps more revealing is that the input

a\b

has three tokens (technically pre-processor tokens, but let's not get
into that difference!) namely 'a', '\' and 'b' despite the fact that the
middle one has no meaning at all. (Notation is a problem now -- I'm
using ''s to delimit a token with no reference to C's character
constants.)

[Aside: I should not have talked about syntax because that can be
confusing in this context. C has two levels of grammar and they are
collected together in Appendix A, which is "the syntax". The first
part is called the "lexical grammar" and it describes the approximate
set of rules for recognising tokens. It's not quite all there because
of some details to do with header file names and phases 1 and 2. The
second part (the "phrase structure grammar" as it is called) is what
most people refer to as the syntax of C. The upshot is that for this
purpose, I prefer the term you use that I criticised earlier: "the
meaning"!]

<snip>
 
James Kuyper

Thank you all for explaining this to me. I heard people
speak of a tokenizer and a semantic parser so I thought
that identifying tokens and finding meaning are two
entirely separate processes in the formal model used to
generate a machine executable. ...

They are. The process of generating the list of tokens for grammatical
analysis takes place during translation phases 3-7. It's only after that
list is finalized that "The resulting tokens are syntactically and
semantically analyzed and translated as a translation unit.".

The standard doesn't actually require that an implementation perform
tokenization and semantic parsing in identifiably distinct parts of the
compiler; but it does require that the implementation behave as if each
of the eight translation phases occurs in the order specified.
... So by my, incorrect,
picture we first identify tokens. ---x produces the
list ("-", "-", "-", "x", ";").

According to C rules, that's the wrong list of tokens. The correct list
is {"--", "-", "x", ";"}.
... Then the grammar finds
meaning. Since that list could be parsed to a
meaningful C construct, it would be. To my way of
thinking the token identifying phase and the phase that
finds meaning are confused when the token identifier
decides that the meaning of ---x is (--)-x.

The token identifier doesn't decide that meaning. All it decides is how
to identify tokens. In C, "--" is a single token; so is "-". This leads
to an ambiguity, and there are a lot of similar ambiguities: virtually
every multi-character token in C is made up of characters that could
also appear as tokens in their own right. C resolves these ambiguities
through the use of the maximal munch rule. That rule only pays attention
to the length of the tokens; their meanings are completely irrelevant to
the rule. In fact, as a result of the maximal munch rule, some cases
which have a possible meaning become meaningless (and therefore a syntax
error). ---x is one of them.
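
For instance (an illustrative fragment of mine): given int x, the
statement ---x; is rejected because it scans as -- - x, while three
explicit unary minuses are fine:

    int main(void)
    {
        int x = 1;
        /* ---x;         scans as -- - x, i.e. --(-x); rejected because
                         -x is not an lvalue and cannot be decremented */
        int y = - - -x;  /* three unary minuses: y == -1 */
        return y + 1;    /* returns 0 */
    }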
 
Michael Press

Ben Bacarisse said:
Michael Press said:
Ben Bacarisse said:
<snip>

Yes. The reason I got it wrong is that I considered
going from the string "--" to the operator pre-decrement
to be a two step affair.

It is. Seriously, bear with me... If I write (as the only text in a C
program)

-- -- --

this is tokenised into three '--' tokens but it has no meaning. It is
not even clear if I have a pre- or a post-decrement operator or some
mixture of them. Since the source does not parse, in effect I have
neither -- it could be argued that there are no operators there at all.

I agree that the token identifier does not find meaning.
So in --abc it identifies ("--", "abc", ";").
When presented with ---abc it should generate ("---", "abc", ";").

<snip>

Some of what you describe accords entirely with my
understanding of grammars and parsing. First identify
tokens, then attempt to find meaning in a language
using a grammar that embodies the language. When
the token identifier takes ---abc; and generates
("--", "-", "abc", ";") it crosses the line into
assigning meaning.
 
robertwessel2

<snip>

I agree that the token identifier does not find meaning.
So in --abc it identifies ("--", "abc", ";").
When presented with ---abc it should generate ("---", "abc", ";").


But "--" (and "-") are tokens in C, while "---" is not. Specifically
they're "punctuator" tokens. There's an explicit list in the standard
(see the snippet below from C99). Other tokens are keywords (another
explicit list - "for", "if", etc.), constants (integers, floats,
characters, enums, and variants of those - for example, hex and octal
integers), identifiers (user defined names of various sorts), string
literals, header names, preprocessing numbers and comments. The
complete list is in 6.4 of the C99 standard, and (IIRC), 3.1 of the
C89 standard (or that might just be the draft - I don't have the C89
hardcopy in front of me).

From the C99 standard:
6.4.6 Punctuators
Syntax
punctuator: one of
[ ] ( ) { } . ->
++ -- & * + - ~ !
/ % << >> < > <= >= == != ^ | && ||
? : ; ...
= *= /= %= += -= <<= >>= &= ^= |=
, # ##
<: :> <% %> %: %:%:
 
Ben Bacarisse

On Mar 23, 11:02 pm, Michael Press <[email protected]> wrote:

But "--" (and "-") are tokens in C, while "---" is not. Specifically
they're "punctuator" tokens. There's an explicit list in the standard
(see the snippet below from C99). Other tokens are keywords (another
explicit list - "for", "if", etc.), constants (integers, floats,
characters, enums, and variants of those - for example, hex and octal
integers), identifiers (user defined names of various sorts), string
literals, header names, preprocessing numbers and comments.

In any other context this would be a pointless nit-pick, but here I
think it may matter. Enum constants are not pre-processor tokens and
these are what are being discussed, I think (i.e. this whole discussion
is really about translation phase 3). If they were pp-tokens, it would
indeed suggest the sort of meaning that Michael Press is concerned
about. The tokeniser is already recognising identifiers, so being able
to find enum constants (which are just identifiers that have been
declared in a particular way) would mean that the tokeniser does, in
fact, "know too much".

In translation phase 7, pp-tokens are converted into tokens and that is
also when parsing happens, so the extra knowledge needed to classify
identifiers is available.
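
A small illustration of that (mine; the names "colour" and "red" are just
for the example): at tokenisation time 'red' is scanned as a plain
identifier, and only the later phases discover it names an enumeration
constant.

    enum colour { red = 1 };

    int main(void)
    {
        int n = red;   /* 'red' is lexically just an identifier; only in
                          phase 7 parsing is it classed as an enum constant */
        return n - 1;  /* returns 0 */
    }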

<snip>
 
James Kuyper

On 03/24/2011 12:02 AM, Michael Press wrote:
....
So in --abc it identifies ("--", "abc", ";").
When presented with ---abc it should generate ("---", "abc", ";")

But "---" is not a valid C token, so if it the tokenizer did generate
that tokenization, the parser would have to reject the code as a syntax
error. That string can be broken up into a sequence of valid tokens as
either {"--", "-"}. or {"-", "--"}, or {"-", "-", "-"}. The C standard
mandates the first option.

....
Some of what you describe accords entirely with my
understanding of grammars and parsing. First identify
tokens, then attempt to find meaning in a language
using a grammar that embodies the language. When
the token identifier takes ---abc; and generates
("--", "-", "abc", ";") it crosses the line into
assigning meaning.

The rule that mandates interpreting "---" as {"--", "-"} says nothing
about the meaning of those tokens; it's based purely upon their length.
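
The textbook demonstration of that length-only behaviour is "x+++++y"
(my example, not from this thread): maximal munch scans it as
x ++ ++ + y, which does not parse, even though x ++ + ++ y would have
been valid.

    int main(void)
    {
        int x = 1, y = 2, z;
        /* z = x+++++y;   scans as x ++ ++ + y, i.e. ((x++)++) + y,
                          rejected: x++ is not an lvalue */
        z = x++ + ++y;    /* spelled with spaces it is fine: z == 1 + 3 */
        return z - 4;     /* returns 0 */
    }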
 
robertwessel2

In any other context this would be a pointless nit-pick, but here I
think it may matter.  Enum constants are not pre-processor tokens and
these are what are being discussed, I think (i.e. this whole discussion
is really about translation phase 3).  If they were pp-tokens, it would
indeed suggest the sort of meaning that Michael Press is concerned
about.  The tokeniser is already recognising identifiers, so being able
to find enum constants (which are just identifiers that have been
declared in a particular way) would mean that the tokeniser does, in
fact, "know too much".


I was puzzling over that myself, after I wrote it. I pulled up a copy
of the standard to make sure I got the complete list, so that's where
it's from. But what puzzles me is why enum constants are listed as
separate lexical elements in both c89 and c99, while being defined
there as being (exactly) identifiers. Yes, enumeration constants are
special in that they can be used in constant expressions, but that
doesn't actually make them distinct *lexical* elements. IOW, I'm
suggesting that the (very small) contents of 6.4.4.3 (c99 section
numbers) would better be moved into 6.7.2.2.
 
James Kuyper

'Doctor, doctor, it hurts when I do this!'

'Then stop doing that.'


Yeah, Ritchie got some ideas out of Algol 68, and then messed them up because he
didn't fully understand them. This is one of them, for which Algol 68 had a
solution involving TADs, TAMs, NOMADs, and MONADs. But this is one mistake that
it's too late to fix, so just put in the extra space character and forget about it.

Could you translate that comment for the benefit of those of us who have
no idea what you're talking about? I did a web search for Algol, TAD,
TAM, NOMAD, and MONAD. What I got convinced me that the quickest way to
figure out what you meant was to ask you.
 
Ben Bacarisse

China Blue Meanies said:
'Doctor, doctor, it hurts when I do this!'

'Then stop doing that.'


Yeah, Ritchie got some ideas out of Algol 68, and then messed them up
because he didn't fully understand them.

What evidence do you have that Ritchie did not fully understand some of
the ideas borrowed from Algol 68? Bits of C you don't like might well
be that way for reasons quite unrelated to a lack of understanding.
This is one of them, for
which Algol 68 had a solution involving TADs, TAMs, NOMADs, and
MONADs. But this is one mistake that it's too late to fix, so just put
in the extra space character and forget about it.

What is the "this" that he got from Algol 68 with out properly
understanding it? Algol 68's method of finding operators is so
radically different from C's that I can't see what you think was
borrowed from it by Ritchie.
 
Ben Bacarisse

China Blue Meanies said:
Do you still use =+ =- or =* ?

I'd describe that as non-responsive. The =+ error was introduced by
Thompson into B. What evidence do you have that its initial inclusion
in C was due to Ritchie not understanding something about Algol 68?
 
Ben Bacarisse

Kenneth Brody said:
<snip>

At least this way there's no arguing that the compiler picked the
"wrong" interpretation[1]. I don't know anything about Algol, but how
would it resolve something like "x---y", which has two "valid"
interpretations if not for the "maximal munch" rule?

The scheme referred to is only for Algol 68, by the way. In Algol 68
"x---y" must be parsed as "x - (-(-y))" and this remains true even if
there are arbitrary user-defined operators in scope. This is because
the *second* symbol is a monad: a symbol that can be, on its own, a
monadic (i.e. unary) operator. By preventing any operator from having a
monad as its second symbol, the parser can always tell (given only a
little context) whether a monadic or dyadic (binary) operator is meant.

The other operator symbols are called nomads and the rules for permitted
operators are based on these symbol sets (I think they were given in
another post).

It's a clever scheme, but it is not as easy to grasp as C's rule.
 
Michael Press

James Kuyper said:
On 03/24/2011 12:02 AM, Michael Press wrote:
...

But "---" is not a valid C token, so if it the tokenizer did generate
that tokenization, the parser would have to reject the code as a syntax
error. That string can be broken up into a sequence of valid tokens as
either {"--", "-"}. or {"-", "--"}, or {"-", "-", "-"}. The C standard
mandates the first option.

...

The rule that mandates interpreting "---" as {"--", "-"} says nothing
about the meaning of those tokens; it's based purely upon their length.

They would not be picked out specially if it were not
known that they would have meaning later.

So what is a token? Somebody in this thread said they
have no meaning. If that were so, any digraph of
non-alphanumeric characters might be a token. Since
there is a special list of tokens then we are already
assigning meaning, or have in mind a number of meanings
such as with "*".

Not that it has any weight, my picture was that meaning
was found through a grammar that finds its way through
token lists, and any string of characters jammed
together without white space that is not an identifier
is passed on as is. In principle the token identifier could be
easily modified to identify tokens for Perl, bash, Java, Fortran, ...
 
Keith Thompson

Michael Press said:
They would not be picked out specially if it were not
known that they would have meaning later.

You're describing the motivation for the way the language defines
certain sequences as tokens. In other words, this knowledge
you're talking about existed in Dennis Ritchie's head when he
was inventing C. There's no such knowledge in the software that
converts a source file into tokens.

In principle, someone could write a tokenizer (to be made a part of
a conforming C compiler) knowing that "-" and "--" are tokens, but
"---" is not, without having the slightest clue of what they mean.
In practice, someone writing a C tokenizer almost certainly does
know what "--" means, but there's no particular use to be made of
that knowledge in that phase of the compiler. (Of course it's vital
in later phases.) Tokenization is a purely mechanical process,
with a fairly simple set of rules.

[...]
 
Martin Ambuhl

So what is a token? Somebody in this thread said they
have no meaning. If that were so, any digraph of
non-alphanumeric characters might be a token. Since
there is a special list of tokens then we are already
assigning meaning, or have in mind a number of meanings
such as with "*".

The above is just nonsense. The breaking up of the input stream into
elements in a language has nothing to do with meaning. One could
generate legal sequences for a program all day long without the
slightest inkling of the "meaning" of any of them; one can break that
string into tokens of the language all day long without the slightest
inkling that they _have_ any "meaning", never mind what it is.

Please learn the difference between syntax and semantics.
Not that it has any weight, my picture was that meaning
was found through a grammar that finds its way through
token lists, and any string of characters jammed
together without white space that is not an identifier
is passed on as is. In principle the token identifier could be
easily modified to identify tokens for Perl, bash, Java, Fortran, ...

"Meaning" in your world seems a vacuous concept. The idea that if you
can determine the words and even their functional relationship in the
Chomsky sentence then you know its "meaning" is grotesquely absurd. And
what is this exception for "white space"? Many implementations of many
languages, including, for example, Algol and Fortran, allow white space
within tokens and has different kinds of rules for determining the
tokens in those language implementations.
 
James Kuyper

They would not be picked out specially if it were not
known that they would have meaning later.

Actually, they are. In fact, in precisely those cases where the maximal
munch rule is needed to resolve ambiguities, it often resolves them in
favor of a meaningless sequence of tokens. Because of the maximal munch
rule, those cases constitute syntax errors; without that rule, they
would be ambiguous.
So what is a token? Somebody in this thread said they
have no meaning. If that were so, any digraph of
non-alphanumeric characters might be a token. Since
there is a special list of tokens then we are already
assigning meaning, or have in mind a number of meanings
such as with "*".

See section 6.4p1: a token is any one of the following:
keyword
identifier
constant
string-literal
punctuator

You'll need to read the rest of 6.4 to find out what each of those token
types is.

Note: C uses an 8-phase translation process. The tokens I've
described above are recognized as such only during translation phase 7.
A different set of tokens, called preprocessing tokens, is identified
during translation phase 3. A preprocessing token is one of the following:

header-name
identifier
pp-number
character-constant
string-literal
punctuator
each non-white-space character that cannot be one of the above

Between translation phase 3 and translation phase 7, a lot of things
happen to rearrange, delete, add, and in some cases merge the
preprocessing tokens. Only in phase 7 do the surviving preprocessing
tokens get converted into ordinary tokens.

One important thing to note about this two-stage process is that many
things which meet the specification for a pp-number do not meet the
specifications for any type of numeric constant. I don't remember the
details, but I do remember people being surprised by some of the
consequences of that fact.
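
One classic case (an illustration supplied here; it may not be the one
being half-remembered): in "0x1E+2" the E+ extends the pp-number, so the
whole thing is a single preprocessing token that converts to no valid
constant, whereas "0x1E + 2" is fine.

    int main(void)
    {
        /* int n = 0x1E+2;   one pp-number "0x1E+2"; phase 7 cannot
                             convert it to any valid token: an error */
        int n = 0x1E + 2;    /* two constants and an operator: n == 32 */
        return n - 32;       /* returns 0 */
    }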
Not that it has any weight, my picture was that meaning
was found through a grammar that finds its way through
token lists, and any string of characters jammed
together without white space that is not an identifier
is passed on as is.

That's the part that you got wrong. Most of C's tokens are
multi-character, not just the identifiers. Many of the punctuators are
multi-character, as are the overwhelming majority of the constants and
pp-numbers. All of header-names, keywords, character constants, and
string literals are multi-character.
... In principle the token identifier could be
easily modified to identify tokens for Perl, bash, Java, Fortran, ...

I have no idea what the rules are for those other languages, so I'm not
sure whether that would be easy. If it is easy, I suspect that it's so
only because of the power of the available tools for automatically
generating tokenizers. Take a look at lex (or flex) for an example.
 
Ben Bacarisse

Michael Press said:
James Kuyper <[email protected]> wrote:

They would not be picked out specially if it were not
known that they would have meaning later.

So what is a token? Somebody in this thread said they
have no meaning.

That was me (for one).
If that were so, any digraph of
non-alphanumeric characters might be a token.

That does not follow.
Since
there is a special list of tokens then we are already
assigning meaning, or have in mind a number of meanings
such as with "*".

What, then, are the assigned meanings of the tokens produced by the
following input

\''

The rules of the language are clear about how this input must be scanned
into tokens but it illustrates that "meaning" is not a key part of the
process.
 
Hans Vlems

<snip>

It's a clever scheme, but it is not as easy to grasp as C's rule.

Yes, Algol 68 was a *very* clever language. And that is partly
responsible for its success, or rather lack of it...
Hans
 
