Problem with gcc

Nick Keighley · Nov 24, 2009

[...] we are stuck with the baggage of 7 bit ASCII.

7-bit ASCII? 7-BIT ASCII? I used to *dream* of being stuck with 7-bit
ASCII!

if a character takes up more than 5-bits you're just burning
bandwidth.

Baudot knew what he was at!

Lew Pitcher · Nov 24, 2009

Eric Sosman wrote:

The letter 'é' is 130. Why I should have it as -126 ???

No, Jacob, the letter 'é' is not 130.

The numeric value assigned to a letter (technically, the codepoint that a
glyph occupies) varies from characterset to characterset. The letter
(glyph) 'é' is not guaranteed to have the numeric value (occupy the
codepoint) 130 in /every/ characterset; in some charactersets, that glyph
does not exist at all, while in others, it will occupy a different codepoint.

For instance,
the character doesn't exist at all in ASCII
in the CP-HU characterset, that character is 0x82
in the UTF16 characterset, that character is 0x00E9
in the CSA_T500-1983 characterset, that character is a two-character
sequence 0xC2 0x65
in the EBCDIC-CP-FR characterset, that character is 0xC0

Of course, if you qualified your statement to include the characterset,
you'd be closer. But then again, you still have to take into account the
way your environment handles non-core characters; are they signed or
unsigned, 8bits, 16bits, or variable?

[snip]

Lew Pitcher · Nov 24, 2009

No, Jacob, the letter 'é' is not 130.

The numeric value assigned to a letter (technically, the codepoint that a
glyph occupies) varies from characterset to characterset. The letter
(glyph) 'é' is not guaranteed to have the numeric value (occupy the
codepoint) 130 in /every/ characterset; in some charactersets, that glyph
does not exist at all, while in others, it will occupy a different
codepoint.

For instance,
the character doesn't exist at all in ASCII
in the CP-HU characterset, that character is 0x82
in the UTF16 characterset, that character is 0x00E9
in the CSA_T500-1983 characterset, that character is a two-character
sequence 0xC2 0x65
in the EBCDIC-CP-FR characterset, that character is 0xC0

Some statistics, courtesy of an analysis of the ISO/IEC Internationalization
Working Group's catalog of charactersets:

Of the 601 different charactersets in their catalog:
331 charactersets do not have a codepoint for the e-acute glyph
73 charactersets encode the e-acute glyph as codepoint 0xE9
56 charactersets encode the e-acute glyph as codepoint 0x51
41 charactersets encode the e-acute glyph as codepoint 0x82
22 charactersets encode the e-acute glyph as codepoint 0x7B
19 charactersets encode the e-acute glyph as escape sequence 0xC2 0x65
12 charactersets encode the e-acute glyph as codepoint 0xC0
7 charactersets encode the e-acute glyph as codepoint 0x79
6 charactersets encode the e-acute glyph as codepoint 0x65
6 charactersets encode the e-acute glyph as codepoint 0x5A
5 charactersets encode the e-acute glyph as codepoint 0x8E
4 charactersets encode the e-acute glyph as codepoint 0xDB
4 charactersets encode the e-acute glyph as codepoint 0xD0
4 charactersets encode the e-acute glyph as codepoint 0xC5
4 charactersets encode the e-acute glyph as codepoint 0x5D
3 charactersets encode the e-acute glyph as codepoint 0xE5
3 charactersets encode the e-acute glyph as codepoint 0xDD
1 characterset encodes the e-acute glyph as codepoint 0x60

Note that these values are expressed in a signless hexadecimal format.
Codepoints (when used in characterset documentation) are always positive
values, but /platforms/ often do not support positive values for the entire
range of codepoints that a characterset may contain.

We are discussing the difference between /encoding/ (which, for codepoints,
is always signless) and /interpretation/ (which may be signed, as the
interpreter wishes).

HTH

jacob navia · Nov 24, 2009

Lew Pitcher wrote:

Note that these values are expressed in a signless hexadecimal format.

That's exactly what I was saying.

Thanks

Nick · Nov 24, 2009

Nick Keighley said:
[...] we are stuck with the baggage of 7 bit ASCII.

7-bit ASCII? 7-BIT ASCII? I used to *dream* of being stuck with 7-bit
ASCII!

if a character takes up more than 5-bits you're just burning
bandwidth.

Baudot knew what he was at!

Hmm. I once had great fun dealing with a 3 of 7 code.

David Thompson · Dec 9, 2009

is that even possible? (ie. allowed by the standard)

I believe not, although it depends on <horror> common sense </> .

6.2.5p26 requires "A pointer to void shall have the same
representation and alignment requirements as a
pointer to a character type." with a (nonnormative) footnote that this
is "meant to imply interchangeability as
arguments to functions, return values from functions, and members of
unions."

6.5.2.2p6 requires for calls using nonprototype designators that
"pointers to qualified or unqualified versions of a character type or
void" are interchangeable as arguments, and 7.15.1.1p2 requires it for
a variadic argument using va_arg (except it doesn't mention
qualification; it probably should).

The question is, does 'a character type' in these places mean
'any/each CT' or 'some CT'? I believe 'any' is the intended and only
sensible interpretation. If the equivalence and interchangeability
applies only to one of the three character types, with no way to
determine which, a portable program (or programmer) cannot use
this capability, so there's no point in the standard providing it.
But if it applies to all three character types, this is both
consistent with widespread practice and useful.

And if so, then pointers to each CT 'equal' (HTSRAAR as) pointer to
void, so transitively they 'equal' (HTSRAAR as) each other.

Frank · Dec 10, 2009

I believe not, although it depends on <horror> common sense </> .

6.2.5p26 requires "A pointer to void shall have the same
representation and alignment requirements as a
pointer to a character type." with a (nonnormative) footnote that this
is "meant to imply interchangeability as
arguments to functions, return values from functions, and members of
unions."

6.5.2.2p6 requires for calls using nonprototype designators that
"pointers to qualified or unqualified versions of a character type or
void" are interchangeable as arguments, and 7.15.1.1p2 requires it for
a variadic argument using va_arg (except it doesn't mention
qualification; it probably should).

The question is, does 'a character type' in these places mean
'any/each CT' or 'some CT'? I believe 'any' is the intended and only
sensible interpretation. If the equivalence and interchangeability
applies only to one of the three character types, with no way to
determine which, a portable program (or programmer) cannot use
this capability, so there's no point in the standard providing it.
But if it applies to all three character types, this is both
consistent with widespread practice and useful.

And if so, then pointers to each CT 'equal' (HTSRAAR as) pointer to
void, so transitively they 'equal' (HTSRAAR as) each other.

I beg your pardon?

Tim Rentsch · Jan 12, 2010

David Thompson said:
I believe not, although it depends on <horror> common sense </> .

6.2.5p26 requires "A pointer to void shall have the same
representation and alignment requirements as a
pointer to a character type." with a (nonnormative) footnote that this
is "meant to imply interchangeability as
arguments to functions, return values from functions, and members of
unions."

6.5.2.2p6 requires for calls using nonprototype designators that
"pointers to qualified or unqualified versions of a character type or
void" are interchangeable as arguments, and 7.15.1.1p2 requires it for
a variadic argument using va_arg (except it doesn't mention
qualification; it probably should).

The question is, does 'a character type' in these places mean
'any/each CT' or 'some CT'? I believe 'any' is the intended and only
sensible interpretation. If the equivalence and interchangeability
applies only to one of the three character types, with no way to
determine which, a portable program (or programmer) cannot use
this capability, so there's no point in the standard providing it.
But if it applies to all three character types, this is both
consistent with widespread practice and useful.

In fact the Standard defines the italicized phrase "character types"
in 6.2.5p15 as all three character types. I believe the Standard
means 'character type' in 6.2.5p26 to invoke this defined term,
that usage being consistent with how other defined terms are
used with minor variations.

David Thompson · Jan 22, 2010

In fact the Standard defines the italicized phrase "character types"
in 6.2.5p15 as all three character types. I believe the Standard
means 'character type' in 6.2.5p26 to invoke this defined term,
that usage being consistent with how other defined terms are
used with minor variations.

Certainly 'character type' is defined in 6.2.5p15. My unserious
concern was about the quantifier 'a'. As I said I believe 'any/each'
makes sense here. (On rereading I realize even 'any' could be
misunderstood; make it 'each/every', versus 'one/at-least-one/some'.)

But consider also: 6.7.2.2p4 says "Each enumerated type
shall be compatible with char, a signed integer type, or an
unsigned integer type ... capable of representing the values ..."
and 6.2.5p4,6 defines 'signed integer type' and 'unsigned integer
type' to include the five standard widths of each, possibly plus
implementation-defined ones.

Clearly this should not require that plain char, unsigned short, and
signed long long all be compatible -- that would be stupid. So this
must mean enum-foo is compatible with ONE of the specified types but
not (necessarily) others. By (IMO unjustified) analogy this _could_
mean that void * has same representation as char * but not unsigned
char * , and thus unsigned char * can be different from char * in a
way that would cause problems for M Navia -- and a huge number of
other C programmers. Which is why it isn't done, and I'm confident it
won't be done and wasn't intended.

Tim Rentsch · Jan 22, 2010

David Thompson said:
Certainly 'character type' is defined in 6.2.5p15. My unserious
concern was about the quantifier 'a'. As I said I believe 'any/each'
makes sense here. (On rereading I realize even 'any' could be
misunderstood; make it 'each/every', versus 'one/at-least-one/some'.)

Right, I got that. What I was trying to say, but obviously
didn't say explicitly enough, is that when the Standard uses
defined terms in this way usually it means the statement to apply
to all of them, not to just one (and not differently for the
different cases, unless some word like "each" is used). More
specifically, when I say "usually", what I mean is I'm not aware
of any exceptions.

But consider also: 6.7.2.2p4 says "Each enumerated type
shall be compatible with char, a signed integer type, or an
unsigned integer type ... capable of representing the values ..."
and 6.2.5p4,6 defines 'signed integer type' and 'unsigned integer
type' to include the five standard widths of each, possibly plus
implementation-defined ones.

Yes, there's that "each" word -- the statement applies individually
to the separate cases; in other words it's true for all of them,
but which possibility applies varies from enumerated type to
enumerated type.

Clearly this should not require that plain char, unsigned short, and
signed long long all be compatible -- that would be stupid. So this
must mean enum-foo is compatible with ONE of the specified types but
not (necessarily) others. By (IMO unjustified) analogy this _could_
mean that void * has same representation as char * but not unsigned
char * , and thus unsigned char * can be different from char * in a
way that would cause problems for M Navia -- and a huge number of
other C programmers. Which is why it isn't done, and I'm confident it
won't be done and wasn't intended.

And the distinction between these two cases is evidenced by the
presence of "each/or" in one of them, and not in the other. In
other words I agree with you, and I think the text in the
Standard does support this conclusion (although, it would have
been better if the interpretation desired had been expressed more
clearly, so we wouldn't have to be having this discussion).
