Why don't identifiers begin with digits?

Steven T. Hatton

This is just idle curiosity. I was playing with this code from the Lex &
Yacc book [http://www.oreilly.com/catalog/lex/], and discovered that it
does strange things with strings beginning with numbers.

$ cat example.l
%{
/*
* this sample demonstrates (very) simple recognition:
* a verb/not a verb.
*/

%}
%%

[\t ]+ /* ignore white space */ ;

is |
am |
are |
were |
was |
be |
being |
been |
do |
does |
did |
will |
would |
should |
can |
could |
has |
have |
had |
go { printf("%s: is a verb\n", yytext); }

[a-zA-Z]+ { printf("%s: is not a verb\n", yytext); }

.|\n { ECHO; /* normal default anyway */ }
%%

int main(void)
{
    yylex();
    return 0;
}
###############################
$ ls
example.l
$ flex example.l
$ ls
example.l lex.yy.c
$ gcc -o example lex.yy.c -lfl
$ ls
example example.l lex.yy.c
$ ./example
test
test: is not a verb

is
is: is a verb

is123
is: is a verb
123
123is
123is: is a verb
^D

C is actually older than Lex, but I suspect the techniques used to scan
early C code were similar to what was incorporated into Lex. Anybody know
about this?
 
Victor Bazarov

Steven said:
This is just idle curiosity. I was playing with this code from the
Lex & Yacc book [http://www.oreilly.com/catalog/lex/], and
discovered that it does strange things with strings beginning with
numbers. [...]

C is actually older than Lex, but I suspect the techniques used to
scan early C code were similar to what was incorporated into Lex.
Anybody know about this?

C _language_ is off-topic. Please visit comp.lang.c for that.
C _Library_ is on topic since it's part of C++ Library. Just so
there is no misunderstanding.

Identifiers don't begin with digits because there would be no way
to tell an identifier from a number literal if only digits are
used, I guess. But I only guess. You might also want to consider
a newsgroup for compiler design for further inquiry.

V
 
Steven T. Hatton

Victor said:
C _language_ is off-topic. Please visit comp.lang.c for that.
C _Library_ is on topic since it's part of C++ Library. Just so
there is no misunderstanding.

Then why does C++ have the rule regarding not beginning an identifier with a
digit? I was intending to imply that this was due to the fact that C
already had that rule.

I find it interesting that C, Lex, YACC, and C++ were all products of the
same shop, if I understand correctly. Stroustrup actually says he gave up
on using YACC to produce a formal definition of C++. But in that he seems
to blame C.

Victor said:
Identifiers don't begin with digits because there would be no way
to tell an identifier from a number literal if only digits are
used, I guess. But I only guess. You might also want to consider
a newsgroup for compiler design for further inquiry.

At this point it's not really very significant to me. I was just a bit
curious about the possibility. Such historical tidbits can, however, shed
new light on how a language works, and can also expose potential pitfalls.
 
Jack Klein

Steven said:
Then why does C++ have the rule regarding not beginning an identifier with a
digit? I was intending to imply that this was due to the fact that C
already had that rule.

Because it is the one and only way to avoid imposing more complex
requirements on the remaining characters of the identifier.

Leave aside the prohibitions on symbols reserved for the
implementation due to underscores, which is an issue for the linker
and not the parser. The regular expression for a valid C or C++
identifier is:

[_A-Za-z][_A-Za-z0-9]*
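
As a rough illustration, that expression can be checked by hand in a few lines of C. This is only a sketch: `is_c_identifier` is a made-up helper name, it assumes the default "C" locale so that `isalpha()` and `isalnum()` accept only ASCII letters and digits, and it ignores keywords, reserved names and universal character names.

```c
#include <ctype.h>
#include <stdbool.h>

/* Hypothetical helper: returns true if s matches [_A-Za-z][_A-Za-z0-9]*.
 * Assumes the default "C" locale; keywords are not rejected. */
static bool is_c_identifier(const char *s)
{
    /* First character: a letter or underscore, never a digit.
     * An empty string fails here because '\0' is neither. */
    if (!(isalpha((unsigned char)*s) || *s == '_'))
        return false;

    /* Remaining characters: letters, digits or underscores. */
    for (++s; *s != '\0'; ++s) {
        if (!(isalnum((unsigned char)*s) || *s == '_'))
            return false;
    }
    return true;
}
```

Note that the very first character is enough to reject a candidate such as "1foo"; nothing after it needs to be examined.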

If you allow the first character to be a digit, you have the
apparently simpler:

[_A-Za-z0-9]+

...but that expression accepts, for example, '123' as an identifier as
opposed to an integer literal, but '123T' is truly an identifier. So
now you need a rule that states that if the identifier begins with a
digit, it must also include at least one non-digit character.

OK, what happens to '0x7fff'? Oops, that's an identifier, not a
literal with a value of 32767.

So you could make the rule that if the first character is '0', then
either the second character can't be 'x' or 'X', or the remaining
characters must contain at least one character that is not a hex
digit, that is, one matching [^0-9a-fA-F].
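
To make the special cases concrete, here is a sketch in C of a classifier for such a relaxed grammar. Everything in it is hypothetical: the names `token_kind` and `classify_relaxed` are invented for the example, the relaxed rule is the imaginary one described above and not anything C or C++ specifies, and only decimal and 0x/0X hex integer literals are modelled (no suffixes, octal checks or floating point).

```c
#include <ctype.h>
#include <string.h>

enum token_kind { TOK_INVALID, TOK_NUMBER, TOK_IDENTIFIER };

/* Classify a whole word under a HYPOTHETICAL rule that lets identifiers
 * start with a digit. Assumes the default "C" locale. */
static enum token_kind classify_relaxed(const char *s)
{
    size_t n = strlen(s);
    size_t i;

    if (n == 0)
        return TOK_INVALID;

    /* Every character must come from [_A-Za-z0-9]. */
    for (i = 0; i < n; ++i) {
        if (!(isalnum((unsigned char)s[i]) || s[i] == '_'))
            return TOK_INVALID;
    }

    /* Special case 1: all digits is a decimal literal. */
    if (strspn(s, "0123456789") == n)
        return TOK_NUMBER;

    /* Special case 2: 0x or 0X followed only by hex digits is a literal. */
    if (n > 2 && s[0] == '0' && (s[1] == 'x' || s[1] == 'X') &&
        strspn(s + 2, "0123456789abcdefABCDEF") == n - 2)
        return TOK_NUMBER;

    /* Anything else that survived the character check is an identifier. */
    return TOK_IDENTIFIER;
}
```

Under these assumptions "123" and "0x7fff" classify as numbers while "123T" and "0x7ffg" classify as identifiers, and the function has to look at the entire word before it can decide.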

So start rewriting the C or C++ grammar for an identifier that handles
all of these cases. Then start writing the text that explains the
limitations at several levels, from books for beginners to the
normative text of the standard itself.

Then start rewriting the preprocessor so it works according to the
rules, with all the special cases. In the first place, you have to
forget about the tradition that you can completely pp-tokenize
C source with one character's worth of ungetc() in a single pass.
Either you have to retain a lot more text to back up through, or you
have to supply some sort of state machine that processes the value as
a number and a symbol simultaneously until the disambiguating
character is encountered.
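
For comparison, this is roughly what the existing rule buys you: the kind of token is settled by its first character, and the scanner never has to back up past the one character that ended the token. The sketch below is an assumption-laden illustration, not how any particular compiler works: `struct token`, `pp_kind` and `next_token` are invented names, input comes from a string rather than a stream, and the number branch is only loosely modelled on the standard's pp-number (no '.', exponent signs, strings or comments).

```c
#include <ctype.h>
#include <stddef.h>

enum pp_kind { PP_END, PP_NUMBER, PP_IDENTIFIER, PP_OTHER };

struct token {
    enum pp_kind kind;
    const char *start;  /* first character of the token */
    size_t len;         /* number of characters in the token */
};

/* Scan one token at *cursor and advance *cursor past it.
 * The decision is made from the first character alone. */
static struct token next_token(const char **cursor)
{
    const char *s = *cursor;
    struct token t;

    while (*s == ' ' || *s == '\t' || *s == '\n')
        ++s;                                /* skip white space */
    t.start = s;

    if (*s == '\0') {
        t.kind = PP_END;
    } else if (isdigit((unsigned char)*s)) {
        t.kind = PP_NUMBER;                 /* digit first: a number */
        while (isalnum((unsigned char)*s) || *s == '_')
            ++s;
    } else if (isalpha((unsigned char)*s) || *s == '_') {
        t.kind = PP_IDENTIFIER;             /* letter or '_' first */
        while (isalnum((unsigned char)*s) || *s == '_')
            ++s;
    } else {
        t.kind = PP_OTHER;                  /* single punctuation character */
        ++s;
    }

    t.len = (size_t)(s - t.start);
    *cursor = s;
    return t;
}
```

With this scheme "0x7fff" and "123is" both come out as number tokens; whether "123is" is a valid number is a question for a later stage, which can then reject it with a diagnostic.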

And of course, write the diagnostics to issue to the programmer when
he accidentally slips up and delivers a numeric literal where an
identifier is required.

Consider that when the C grammar evolved, it was quite common for
assemblers to have the same restriction, and systems programmers in
those days used assemblers a lot more than they have since C, and to
some extent C++, became universal.

Now that you've made a time and money estimate of the effort to make it
work, factor in the additional opportunities for programmer error and
put together a business case. Can you make a convincing argument that
the cost of requiring these changes to every C and C++ compiler in the
world (or even just the C++ compilers) is more than compensated by the
added benefit of allowing programmers to create identifiers starting
with a digit? What's the dollar value of this benefit?
 
Julián Albo

Jack said:
[...] put together a business case. Can you make a convincing argument that
the cost of requiring these changes to every C and C++ compiler in the
world (or even just the C++ compilers) is more than compensated by the
added benefit of allowing programmers to create identifiers starting
with a digit? What's the dollar value of this benefit?

The Obfuscated C contests will probably be more fun.
 
Steven T. Hatton

Jack said:
[...]

Now that you've made a time and money estimate of the effort to make it
work, factor in the additional opportunities for programmer error and
put together a business case. Can you make a convincing argument that
the cost of requiring these changes to every C and C++ compiler in the
world (or even just the C++ compilers) is more than compensated by the
added benefit of allowing programmers to create identifiers starting
with a digit? What's the dollar value of this benefit?

Hmmmm.... I guess that's kind of what I said in my first post to this
thread. Just not in so many words, and I was specifically talking about
lex (well, flex, to be exact), not the general idea of regular expressions.
Thanks for the partial confirmation.
 
