number formats

James Brown · Nov 13, 2006

All,

this is a bit of an odd question but please bear with me:

Suppose I have the following (bad) C expression:

unsigned int x = 0xABCDEFg;

Note the illegal 'g' at the end of the hex-literal. My question is, what
would the expected behavior of an ANSI-C compiler be in this case? I would
expect it either to say something along the lines of "illegal suffix on
number 0xABCDEF" or "unexpected identifier 'g' "

Is there an expected, 'correct' way for the compiler to deal with this
scenario? In other words, if I was writing a simple C-parser (which I am),
what would be the proper way to deal with this?

thanks,
James

Richard Heathfield · Nov 13, 2006

James Brown said:

Is there an expected, 'correct' way for the compiler to deal with

[unsigned int x = 0xABCDEFg; ]

It's a syntax error. The implementation must emit at least one diagnostic
message for any translation unit containing any syntax errors or constraint
violations.

In other words, if I was writing a simple C-parser (which I am),
what would be the proper way to deal with this?

Emit a diagnostic message.

--
Richard Heathfield
"Usenet is a strange place" - dmr 29/7/1999
http://www.cpax.org.uk
email: normal service will be restored as soon as possible. Please do not
adjust your email clients.

Eric Sosman · Nov 13, 2006

James Brown wrote on 11/13/06 17:49:

All,

this is a bit of an odd question but please bear with me:

Suppose I have the following (bad) C expression:

unsigned int x = 0xABCDEFg;

Note the illegal 'g' at the end of the hex-literal. My question is, what
would the expected behavior of an ANSI-C compiler be in this case? I would
expect it either to say something along the lines of "illegal suffix on
number 0xABCDEF" or "unexpected identifier 'g' "

Is there an expected, 'correct' way for the compiler to deal with this
scenario? In other words, if I was writing a simple C-parser (which I am),
what would be the proper way to deal with this?

It depends on the purposes of your parser. If you were
writing a full-blown C compiler, it would parse 0xABCDEFg as
a preprocessing token and later on would issue a diagnostic
when unable to convert that preprocessing token to a token.
(Note that the grammar for pp-numbers matches all manner of
nonsense: 1.TWO.3, for example. Such things are invalidated
on semantic rather than syntactic grounds.)
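For a parser author, the pp-number rule can be sketched as a small scanner. This is only an illustration, assuming the C99 grammar for pp-numbers and ignoring universal character names; the function name `pp_number_len` is made up for this example and is not from any real compiler.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: returns the length of the pp-number that starts
   at s, or 0 if no pp-number starts there. Sketch of the C99 grammar,
   ignoring universal character names. */
size_t pp_number_len(const char *s)
{
    size_t i;

    /* A pp-number begins with a digit, or with '.' followed by a digit. */
    if (isdigit((unsigned char)s[0])) {
        i = 1;
    } else if (s[0] == '.' && isdigit((unsigned char)s[1])) {
        i = 2;
    } else {
        return 0;
    }

    for (;;) {
        char c = s[i];
        char prev = s[i - 1];

        if (isalnum((unsigned char)c) || c == '_' || c == '.') {
            /* Digits, letters, underscore and '.' all extend the token. */
            i++;
        } else if ((c == '+' || c == '-') &&
                   (prev == 'e' || prev == 'E' ||
                    prev == 'p' || prev == 'P')) {
            /* A sign extends the token only right after e, E, p or P. */
            i++;
        } else {
            break;
        }
    }
    return i;
}
```

With this sketch, `pp_number_len("0xABCDEFg;")` consumes all nine characters up to the semicolon, and `1.TWO.3` is likewise swallowed as one preprocessing token. Deciding that neither is a valid constant is left to a later step.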

Ben Pfaff · Nov 13, 2006

Eric Sosman said:
(Note that the grammar for pp-numbers matches all manner of
nonsense: 1.TWO.3, for example. Such things are invalidated
on semantic rather than syntactic grounds.)

I would think that this would qualify as a lexical error. It
occurs early in translation phase 7, when pp-tokens are converted
to tokens. Syntactic and semantic analysis happens after that,
although still in the same phase.

David Wade · Nov 13, 2006

James Brown said:
All,

this is a bit of an odd question but please bear with me:

Suppose I have the following (bad) C expression:

unsigned int x = 0xABCDEFg;

Note the illegal 'g' at the end of the hex-literal. My question is, what
would the expected behavior of an ANSI-C compiler be in this case? I would
expect it either to say something along the lines of "illegal suffix on
number 0xABCDEF" or "unexpected identifier 'g' "

How can you be sure that it's the "g" that's wrong? There could be a missing
"+" between the "0" and the "x", someone could have used a lower case "x"
instead of a "*", there could be an operator missing between the "F" and the
"g". All you can say is that it's a syntax error.

James Brown · Nov 13, 2006

Eric Sosman said:
James Brown wrote on 11/13/06 17:49:

It depends on the purposes of your parser. If you were
writing a full-blown C compiler, it would parse 0xABCDEFg as
a preprocessing token and later on would issue a diagnostic
when unable to convert that preprocessing token to a token.
(Note that the grammar for pp-numbers matches all manner of
nonsense: 1.TWO.3, for example. Such things are invalidated
on semantic rather than syntactic grounds.)

OK, thanks. So I think what you (and Richard) are saying is that as long as
an appropriate error is issued, it doesn't really matter. If I class it as a
'bad number' syntax error then this is fine, and likewise reporting that an
'identifier follows a number-literal' is also suitable. And the reason is
that it depends entirely on the stage at which I find/classify the error in
my compiler? I guess what I was trying to get at was, what is the most
appropriate message to give the user:

1# treat '0xABCDEFg' as a single unit (malformed integer constant),
2# treat '0xABCDEFg' in the same way as I would treat: '0xABCDEF'
<whitespace> 'g', because my lexer knows to stop processing hex-digits when
it finds the first non-hex-digit (the 'g') and returns two tokens representing
the hex-part and the 'g'.

I'll go with option#1, seems more natural to me at least.

James

Random832 · Nov 14, 2006

James Brown said:
ok thanks, so I think what you (and Richard) are saying is that as long as
an appropriate error is issued, it doesn't really matter. If I class it as a
'bad number' syntax error then this is fine, and likewise reporting that an
'identifier follows a number-literal' is also suitable. And the reason is,
it totally depends on what stage I find/classify the error in my compiler? I
guess what I was trying to get at was, what is the most appropriate message
to give the user:

1# treat '0xABCDEFg' as a single unit (malformed integer constant),
2# treat '0xABCDEFg' in the same way as I would treat: '0xABCDEF'
<whitespace> 'g', because my lexer knows to stop processing hex-digits when
it finds the first non-hex-digit (the 'g') and returns two tokens representing
the hex-part and the 'g'.

I'll go with option#1, seems more natural to me at least.

Also, if you use option #2, you might forget to handle 0xE+1 as
a malformed number constant, since it _would_ be valid if you split it
up (which you're not allowed to do).
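The check that rejects 0xE+1 can be sketched as follows, assuming the lexer has already delivered the whole spelling as a single pp-number (the sign is absorbed because it follows an `E`). The names `is_integer_constant` and `valid_int_suffix` are invented for this illustration. The sketch follows the C99 integer-constant syntax, handles integers only, and does not check whether the value fits in any type.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: true if s is empty or a valid C99 integer suffix
   (u, l, ll in either order; "ll" must not mix upper and lower case). */
static bool valid_int_suffix(const char *s)
{
    static const char *const ok[] = {
        "", "u", "U", "l", "L", "ll", "LL",
        "ul", "uL", "Ul", "UL", "lu", "lU", "Lu", "LU",
        "ull", "uLL", "Ull", "ULL", "llu", "llU", "LLu", "LLU"
    };
    for (size_t i = 0; i < sizeof ok / sizeof ok[0]; i++)
        if (strcmp(s, ok[i]) == 0)
            return true;
    return false;
}

/* Hypothetical helper: true if the NUL-terminated pp-number spelling
   is a well-formed integer constant. Floating constants are not handled. */
bool is_integer_constant(const char *tok)
{
    const char *p = tok;

    if (p[0] == '0' && (p[1] == 'x' || p[1] == 'X')) {
        p += 2;
        if (!isxdigit((unsigned char)*p))
            return false;              /* "0x" with no digits */
        while (isxdigit((unsigned char)*p))
            p++;
    } else if (p[0] == '0') {
        p++;                           /* octal: only digits 0-7 may follow */
        while (*p >= '0' && *p <= '7')
            p++;
    } else if (isdigit((unsigned char)*p)) {
        while (isdigit((unsigned char)*p))
            p++;
    } else {
        return false;
    }

    /* Whatever is left over must be a valid suffix, or nothing at all. */
    return valid_int_suffix(p);
}
```

Under these assumptions both `0xE+1` and `0xABCDEFg` are rejected because the leftover text (`+1` and `g`) is not a valid suffix, while `0xABCDEFu` is accepted.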

Eric Sosman · Nov 14, 2006

James said:
[...]
1# treat '0xABCDEFg' as a single unit (malformed integer constant),
2# treat '0xABCDEFg' in the same way as I would treat: '0xABCDEF'
<whitespace> 'g', because my lexer knows to stop processing hex-digits when
it finds the first non-hex-digit (the 'g') and returns two tokens representing
the hex-part and the 'g'.

I'll go with option#1, seems more natural to me at least.

Your instincts are good. Consider

#define g + 42
unsigned int x = 0xABCDEFg;

... as it would be treated under the two options. #2 would
separate the `g' and lead to `= 0xABCDEF + 42', while #1 would
group the `g' with the rest and eventually toss an error for
an ill-formed constant. That's what a C compiler does, so
that's what your parser should imitate.
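To make the difference concrete, here is a sketch of the split that option #2 would perform. The function `naive_hex_len` is a made-up name for the kind of lexer routine James describes; it is shown as the behavior to avoid, not as a recommendation.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical option #2 lexer step: returns the length of "0x" plus the
   run of hex digits that follows, or 0 if s does not start a hex literal.
   It stops at the first non-hex-digit, which is what lets a trailing 'g'
   escape as a separate identifier. */
size_t naive_hex_len(const char *s)
{
    size_t i;

    if (s[0] != '0' || (s[1] != 'x' && s[1] != 'X') ||
        !isxdigit((unsigned char)s[2]))
        return 0;

    i = 2;
    while (isxdigit((unsigned char)s[i]))
        i++;
    return i;
}
```

Here `naive_hex_len("0xABCDEFg;")` returns 8, so the `g` is left over as its own identifier and the `#define g + 42` above would be applied to it. A lexer that takes the whole pp-number first never produces that separate `g`.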
