Problem with gcc

Keith Thompson · Nov 18, 2009

bartc said:
Dik T. Winter said:

int data[256]={0};

data['ú'] += 1;

I would not expect a line like that in code, more something like:
data[c] += 1;
where c is the return value of getchar().

Yet another source of confusion: getchar returns codes 128 to 255 as
positive values, but put that value into a char type, and it becomes
negative: char c; data[getchar()] works, but data[c=getchar()] doesn't.

getchar() returns an int; you should never store the result of
getchar() directly in an object of type char.
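A minimal sketch of that advice, assuming a hypothetical helper named count_bytes (not from the thread): the result of getc() is kept in an int, compared against EOF, and only then used as an array index.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>

/* Hypothetical helper: tally every byte read from `in`.
 * `counts` must have at least UCHAR_MAX + 1 elements. */
void count_bytes(FILE *in, unsigned long counts[])
{
    int c;  /* int, not char: holds every unsigned char value plus EOF */

    assert(in != NULL && counts != NULL);
    while ((c = getc(in)) != EOF) {
        /* getc() reports each byte as an unsigned char converted to int,
         * so c is in 0..UCHAR_MAX here and is a valid index. */
        counts[c] += 1;
    }
}
```

Calling count_bytes(stdin, counts) with a zero-initialized array of UCHAR_MAX + 1 elements gives the same effect as a getchar() loop. Storing the result in a char first would lose the distinction between EOF and a real byte, and could make the index negative where plain char is signed.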

And why can't someone write: char text[100]; data[text] ?

Um, because char might be a signed type, and text might be
negative. Yeah, I know that's not really what you were asking.

Making arbitrary rules about what can or can't be coded is not really
helpful; why not just admit that negative characters are a bad idea as
Eric Sosman did in this thread?

The rules aren't arbitrary. They follow from the fact that plain
char may be either signed or unsigned. I certainly admit that that
causes problems, and all else being equal I would prefer plain char
to be unsigned.

Historically, I believe that making plain char signed made for more
efficient code on the PDP-11. This was before the types signed char
and unsigned char had been introduced. Since there probably were
other systems on which making plain char unsigned made for faster
code, the choice was left up to the implementation. On ASCII-based
systems at the time, character values outside the range 0..127 were
rare (Accented letters? On a computer? You're lucky to get lower
case!), so it wasn't much of an issue. EBCDIC-based systems made
plain char unsigned anyway.

If the standard were changed to require plain char to be unsigned,
it would not break any existing portable code. It might break some
existing non-portable code that assumes plain char is signed --
and that might be a reasonable assumption for code that's intended
only for a single target system (though I'd still prefer to use
signed char explicitly).

Any implementation could avoid the problems of negative characters
by making plain char unsigned. But most compilers I've used still
make plain char signed by default (though some have an option to
change it). I have to assume there's some valid reason for that,
though perhaps it's just inertia.
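As a small illustration of how code can find out which choice an implementation made, here is a sketch using <limits.h>. The function name plain_char_is_signed is hypothetical. GCC, for instance, documents -fsigned-char and -funsigned-char as options of the kind mentioned above.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical helper: returns nonzero if plain char is signed here. */
int plain_char_is_signed(void)
{
    /* The standard requires CHAR_MIN to be either 0 (unsigned plain char)
     * or equal to SCHAR_MIN (signed plain char). */
    assert(CHAR_MIN == 0 || CHAR_MIN == SCHAR_MIN);
    return CHAR_MIN < 0;
}
```

Code that needs one behavior or the other can avoid the question entirely by spelling out signed char or unsigned char.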

Eric Sosman · Nov 18, 2009

bartc said:
[...] why not just admit that negative characters are a bad idea as
Eric Sosman did in this thread?

You're reading more into my words than I wrote in them.
I agreed that it would be nice if `char' were unsigned, but
did not say it was a "bad idea" to leave their signedness up
to the implementation's discretion.

Sometimes the journey to a desired destination requires
carrying inconvenient baggage. In the days when C was young,
memory was short and CPU cycles were long and both were
expensive, and the overhead of forcing an unsigned `char' on
a machine that "naturally" treated bytes as signed would have
been substantial. It's conceivable that C would simply have
died out if it had acquired a reputation as "slow and bloated."

A sufficiently smart optimizer might distinguish the cases
where sign bits needed suppression from those where they could
be permitted to survive, but remember: The machine that ran the
optimizer was also small and slow, and couldn't sustain a whole
lot of fancy data structures to support such decisions.

Alan Curry · Nov 19, 2009

And why can't someone write: char text[100]; data[text] ?

Because the size of data[] would have to be 1<<CHAR_BIT and that could get
unreasonably large. Tables indexed by character need to be sparse.

Since the subject line still says "Problem with gcc", I'll also point out
that gcc would warn you about the above code.

warning: array subscript has type 'char'

Let's pick on some other compiler that doesn't helpfully point out the
trouble with plain char as an array index.
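For reference, GCC controls that warning with -Wchar-subscripts, which -Wall enables. A minimal sketch of the usual fix, assuming a hypothetical helper count_chars and a table with one slot per unsigned char value:

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper: tally the characters of a string.
 * `counts` must have at least UCHAR_MAX + 1 elements. */
void count_chars(const char *text, unsigned long counts[])
{
    assert(text != NULL && counts != NULL);
    for (size_t i = 0; text[i] != '\0'; i++) {
        /* counts[text[i]] could use a negative index where plain char
         * is signed; converting to unsigned char first keeps the index
         * in 0..UCHAR_MAX on every implementation. */
        counts[(unsigned char)text[i]] += 1;
    }
}
```

The cast costs nothing where plain char is already unsigned and removes the out-of-bounds access where it is signed.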

Keith Thompson · Nov 19, 2009

And why can't someone write: char text[100]; data[text] ?

Because the size of data[] would have to be 1<<CHAR_BIT and that could get
unreasonably large. Tables indexed by character need to be sparse.

That's not much of a concern if you're willing to limit yourself to
systems with CHAR_BIT==8. For greater portability than that, of
course, you're right.

Note that the implementations of <ctype.h> that I've seen use arrays
of length 1<<CHAR_BIT or thereabouts. Of course such implementations
are targeted to specific platforms, so portability isn't an issue.

[...]

Phil Carmody · Nov 19, 2009

Richard Heathfield said:
They are - in C/370.

Apple's GCC on POWER ditto.

Phil

Phil Carmody · Nov 19, 2009

bartc said:
'\x82' is a convenient way of embedding a 0x82 code in the middle of a
string literal. Why would anyone expect it to have a value other than
hex 82 when assigned to a single char or an int?

I personally don't expect a single char to be able to hold the
value hex 82.

Phil Carmody · Nov 19, 2009

Eric Sosman said:
bartc said:

Dik T. Winter said:

The letter 'é' is 130. Why I should have it as -126 ???

Looking at the wrong way again.

Unless you can tell us the reason for widening e-grave, c-hacek or
e-acute and so on, this makes no sense.

int data[256]={0};

data['ú'] += 1;

int data[1+UCHAR_MAX] = { 0 };
data['ú' - CHAR_MIN] += 1;

Or you could use the `int *datap = data - CHAR_MIN;' trick
if desired.
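A sketch of the CHAR_MIN offset spelled out as functions, assuming int is wider than char and using the hypothetical names count_offset and count_of. Slot 0 of the table corresponds to CHAR_MIN, so a plain char can be used directly whether it is signed or not.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper: tally the characters of a string.
 * `counts` must have at least UCHAR_MAX + 1 elements;
 * element 0 holds the count for the value CHAR_MIN. */
void count_offset(const char *text, unsigned long counts[])
{
    assert(text != NULL && counts != NULL);
    for (size_t i = 0; text[i] != '\0'; i++) {
        /* text[i] - CHAR_MIN lies in 0..(CHAR_MAX - CHAR_MIN). */
        counts[text[i] - CHAR_MIN] += 1;
    }
}

/* Hypothetical helper: look up the tally for one plain char value. */
unsigned long count_of(const unsigned long counts[], char c)
{
    return counts[c - CHAR_MIN];
}
```

Unlike the is* functions of <ctype.h>, these take a char rather than an int holding an unsigned char value or EOF, so the table layout depends on the signedness of plain char.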

That's not a particularly powerful technique, one can't even
implement the is{ctype} family or equivalents using just that.

Phil

Lew Pitcher · Nov 19, 2009

I personally don't expect a single char to be able to hold the
value hex 82.

I guess that you've never worked on an IBM mainframe, then.

IIRC, IBM's C compiler for S/390 (and follow-ons) uses the EBCDIC-INT (or
CP038) character set mapping, which places "Latin Small Letter B" ('b')
at 0x82. So, if you want
char this_is_a_b = 'b';
on the mainframe, a "single char" /must/ "be able to hold the value hex 82".

HTH

Eric Sosman · Nov 19, 2009

Richard said:
I do. It's required by the Standard. The bit pattern 10000010 takes
only 8 bits, and C chars are required to be at least 8 bits wide, so
a char *must* be able to hold hex 82.

No: CHAR_MAX can be as small as 127, and 0x82 == 130 is
larger than that.
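That range argument can be checked against <limits.h>. A minimal sketch, with the hypothetical names fits_in_plain_char and value_of_x82:

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical helper: nonzero if `value` is representable in plain char. */
int fits_in_plain_char(long value)
{
    /* The standard guarantees only CHAR_MAX >= 127, so 0x82 (130)
     * fits only where plain char is unsigned or wider than 8 bits. */
    assert(CHAR_MAX >= 127);
    return value >= CHAR_MIN && value <= CHAR_MAX;
}

/* The character constant '\x82' has type int. Its value is what a char
 * holding that code yields when converted to int, which is negative on
 * a typical implementation with a signed 8-bit plain char. */
int value_of_x82(void)
{
    return '\x82';
}
```

On an implementation with a signed 8-bit char, fits_in_plain_char(0x82) is false even though the bit pattern 10000010 fits in eight bits; converting value_of_x82() to unsigned char recovers 0x82.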

Nick · Nov 19, 2009

Lew Pitcher said:
I guess that you've never worked on an IBM mainframe, then.

IIRC, IBM's C compiler for S/390 (and follow-ons) uses the EBCDIC-INT (or
CP038) character set mapping, which places "Latin Small Letter B" ('b')
at 0x82. So, if you want
char this_is_a_b = 'b';
on the mainframe, a "single char" /must/ "be able to hold the value hex 82".

Is that the one that came with a shelf of manuals. You'd compile the
code and it would say
C0091928737 An illegal construct was encountered

You'd go to the shelves, find the manual that covered codes C0091928693
to C0091929003 and page through it to find the right page.

There it would be:
C0091928737 An illegal construct was encountered

This message means that the construct encountered was illegal.

?

Lew Pitcher · Nov 19, 2009

And you remembered that? Sure.

Well, my memory /is/ a bit foggy about /which/ EBCDIC-variant characterset
the C compiler expected. I only worked with it a little bit, although I was
forced to (by other circumstances) become somewhat knowledgeable about the
differences in the various EBCDIC variants.

As for /which/ character 0x82 maps to, I had some help. I doublechecked my
memory against http://anubis.dkuug.dk/i18n/charmaps/EBCDIC-INT

Of course, since I retired, things /could/ have changed in the mainframe
world, and they /could/ have changed IBM C to accept ASCII ;-)

Nick · Nov 20, 2009

Richard Heathfield said:
That's the one - except that, IIRC, a lot of those messages began with
I rather than C. Fortunately, it usually wasn't too hard to figure
out what was going wrong, *despite* the "helpful" manual.

You may be right. It was nearly 20 years ago.

Phil Carmody · Nov 21, 2009

Richard Heathfield said:
I do. It's required by the Standard. The bit pattern 10000010 takes
only 8 bits, and C chars are required to be at least 8 bits wide, so
a char *must* be able to hold hex 82.

It's rare to see you pen such nonsense. Please desist.

Phil

Phil Carmody · Nov 21, 2009

Lew Pitcher said:
I guess that you've never worked on an IBM mainframe, then.

IIRC, IBM's C compiler for S/390 (and follow-ons) uses the EBCDIC-INT (or
CP038) character set mapping, which places "Latin Small Letter B" ('b')
at 0x82. So, if you want
char this_is_a_b = 'b';
on the mainframe, a "single char" /must/ "be able to hold the value hex 82".

Oh, for pity's sake, has everyone been hit with a stupid stick today?

Just because there exist systems where it's possible doesn't mean
anyone has any reason to expect it to be so. Expecting things that
are not mandated by the standard (for example, the standard does
not mandate that you use IBM mainframes) is not often advisable.

Phil

Keith Thompson · Nov 21, 2009

Phil Carmody said:
It's rare to see you pen such nonsense. Please desist.

Richard wrote the following in a followup two days ago:

| Oops, there was me, thinking I was thinking straight, and it turned
| out I wasn't. I was, of course, thinking in bits rather than in
| signs.

People do make mistakes. I see no point in jumping on them after
they've been acknowledged and corrected.

Phil Carmody · Nov 22, 2009

Keith Thompson said:
Richard wrote the following in a followup two days ago:

| Oops, there was me, thinking I was thinking straight, and it turned
| out I wasn't. I was, of course, thinking in bits rather than in
| signs.

People do make mistakes. I see no point in jumping on them after
they've been acknowledged and corrected.

Yes, mum.

Phil

Phil Carmody · Nov 22, 2009

Richard Heathfield said:
I love you too, Phil. You got a slow feed or something?

Nah, just less time for Usenet currently. Bloody scrum meetings
every morning mean I can't spend the morning lazily reading
usenet any more. And so if I go out in the evenings, I may miss
usenet for many days.

The way I read news using gnus doesn't help. I'll pull in all
the new messages for all groups, read each group in turn,
then repeat. This means that there can be posts sitting on the
server for days before I even pull them in. I really ought to
refresh the message list each time I sit down at this computer
to reduce the clump size of the updates.

Phil

Nick Keighley · Nov 24, 2009

I standardized into unsigned char throughout the container library. There
is NO system that I know of that would have a different pointer size
or characteristics for signed or unsigned chars!

Is that even possible (i.e., allowed by the standard)?

Nick Keighley · Nov 24, 2009

Oh, for pity's sake, has everyone been hit with a stupid stick today?

Just because there exist systems where it's possible doesn't mean
anyone has any reason to expect it to be so.

what? I'd say exactly the opposite. If there exist systems where X is
possible then someone will have reason to expect it to be so. Did you
mean "If there exist systems where X is possible then not everyone
will have reason to expect it to be so"?

I've never encountered EBCDIC but I have seen some pretty weird
character sets.

Expecting things that
are not mandated by the standard (for example, the standard does
not mandate that you use IBM mainframes) is not often advisable.

but *some* people have to worry about such things. And even those that
don't should still apply the "why make it unportable if you don't have
to" rule. If your code can be character-set agnostic (admittedly this
is getting harder with Unicode floating about) why not do so?

Nick Keighley · Nov 24, 2009

The letter 'é' is 130. Why I should have it as -126 ???

in which character set? I looked it up and found it was some weird
thing called BPH. Wikipedia shows that part of Latin-1 to be
unallocated. e-acute appears to be 233.
