Problem with gcc

Eric Sosman · Nov 14, 2009

jacob said:
Alan Curry wrote:

I assume characters are codes from one to 255. This is a bad assumption
maybe, in some kind of weird logic when you assign a sign to a character
code.

It's certainly a bad assumption on machines where `char'
runs from -128 to 127 ...

There is a well established confusion in C between characters (that are
encoded as integers) and integer VALUES.

A character -- loosely, a glyph like 'A' -- is not something
computers nowadays can represent directly in their memories.
Unable to store an actual 'A', they instead store a number like
65 or 193, and say "When thought of as a character, the value
refers to the 65th/193rd entry in a list of glyphs." The members
of that list and the order in which they appear are a matter of
convention, nothing more.

It's not really different from the convention that "zero is
false, anything else is true." Some other languages use other
conventions, like "even values are false, odds are true." Neither
scheme is inherently more "right" or "wrong" than the other; it's
just a matter of convention, of a correspondence between the
notions one wants to represent and the numbers that are all the
computer can store internally.

What I'm getting at is that there is (or need be) no confusion
between storing a character and storing a number: The computer always
does the latter and never does the former. When we talk about
"storing a character," it's just a convenient verbal shorthand for
"storing the number that represents a character." And the data type
C uses for this purpose is `char'. Some awkwardnesses stem from this
choice, mostly having to do with the library, and getting the library
to work nicely sometimes involves converting the numbers to and
from other types -- see getchar() or isalpha(), for instance. But
when you want to store character codes, use `char'. Use `unsigned
char' or `signed char' when you want to store small numbers that
are *not* to be thought of as characters.

I prefer not to use any sign in the characters, and treat 152 as character
code 152 and not as -104. Stupid me, I know.

152 is not a character; it is a number. In one popular
encoding scheme it corresponds to the character 'q', by virtue
of one of those conventional correspondences. If you want a 'q',
use a `char' and store 'q' in it. If you want the number 152
in a small space, use an `unsigned char' -- but don't think of
it as a character, because it isn't one.

Besides, when I convert it into a bigger type, I would like to get
152, and not 4294967192.

Much depends on the type to which you are converting, and
on why you are performing the conversion.

Since size_t is unsigned, converting to unsigned is a fairly common
operation.

It sounds very much as if you are dealing with "raw" numbers,
not with numbers that correspond to characters. If so, it's
quite strange that you are using strcmp() on assemblages of these
numbers, because strcmp() isn't well-suited to the task.

Writing software is difficult enough without having to bother with the
sign of characters or the sex of angels, or the number of demons you can
fit in a pin's head.

A little thought about the artificiality of number-to-glyph
correspondences will remove much of the difficulty.

Flash Gordon · Nov 14, 2009

bartc said:
Yes, there should have been signed and unsigned byte. And a separate char
type equivalent to (or a synonym for) unsigned byte.

I disagree. Ideally char should be a separate type which is *nothing* to
do with integer types. So to assign a char to an integer type you have
to cast it to that type (just as with pointers).

It really is exasperating when most people in this group insist that signed
character codes are perfectly normal and sensible!

Insisting that they are perfectly normal is *not* the same as saying
that it is sensible.

Apparently chars are signed because on the PDP11 or some such machine,
sign-extending byte values was faster than zero-extending them. A bit
shortsighted. (If it had been the other way around, they would of course
have been singing the praises of unsigned char codes; except they would
have been justified this time..)

Ah, but the people you are complaining about would probably accept that
char being unsigned is *also* perfectly normal.

As I understand it, you can easily choose to use unsigned char type for
such codes. The problem being when passing these to library functions
where char is signed and this triggers a warning?

More to the point, why does he actually care whether a given character
value happens to be positive or negative? The only time it matters that
I can see is when using certain specific functions in the C library, and
unfortunately then you need a cast.

Of course, with gcc you can (on many architectures) select whether char
is signed or unsigned; it is, of course, still a distinct type.

Why doesn't widening a signed value into an unsigned one itself trigger a
warning?

Why should it? In any case, as others mentioned, a cast will fix this.
Although I have to wonder why the char is being assigned to a larger
unsigned integer type in the first place, it seems an odd thing to do to me.

bartc · Nov 14, 2009

Eric Sosman said:
jacob navia wrote:

A little thought about the artificiality of number-to-glyph
correspondences will remove much of the difficulty.

Making char types always positive would remove all the difficulties.

And there are difficulties because this issue keeps coming up.

Ben Bacarisse · Nov 14, 2009

John Kelly said:
And if testing in a loop, you may want to cast separately from the test.
Like in this trim function:

static void
trim (char **ts)
{
    unsigned char *exam;
    unsigned char *keep;

    exam = (unsigned char *) *ts;
    while (*exam && isspace (*exam)) {

You can remove the *exam test.

        ++exam;
    }
    *ts = (char *) exam;
    if (!*exam) {
        return;
    }
    keep = exam;
    while (*++exam) {
        if (!isspace (*exam)) {
            keep = exam;
        }
    }
    if (*++keep) {
        *keep = '\0';
    }

And here you could replace the whole 'if' with 'keep[1] = 0;'.
Neither of them is wrong, of course, but every test makes the reader
wonder why it is there.

John Kelly · Nov 14, 2009

You can remove the *exam test.

But then you're testing whether '\0' is a space or not. Perhaps it
improves performance, but is it good programming?

        ++exam;
    }
    *ts = (char *) exam;
    if (!*exam) {
        return;
    }
    keep = exam;
    while (*++exam) {
        if (!isspace (*exam)) {
            keep = exam;
        }
    }
    if (*++keep) {
        *keep = '\0';
    }

And here you could replace the whole 'if' with 'keep[1] = 0;'.
Neither of them is wrong, of course, but every test makes the reader
wonder why it is there.

But then you replace '\0' with '\0'. Which is worse, one extra test, or
a redundant action?

Eric Sosman · Nov 14, 2009

John said:
But then you're testing whether '\0' is a space or not. Perhaps it
improves performance, but is it good programming?

The test yields "false," so what's wrong with it?
Or, to turn it around, what would your response be to

while (*exam && *exam != '#' && *exam != 'X' && isspace(*exam))

?

John Kelly · Nov 14, 2009

The test yields "false," so what's wrong with it?

'\0' is not part of the string, it's a pseudo length specifier, and
conceptually, should not be treated as part of the string. You can get
away with it in this case, but it's a bad programming habit to rely on
environmental assumptions.

With real length specifiers, you wouldn't test one position beyond the
end of the string, so why do it with NUL terminated strings? It's just
a stupid C trick for some dubious performance gain. For my use of that
code, the performance gain doesn't amount to a drop in a bucket.

I would rather think portably, as in from one language to another. I
may use tricks when performance really matters, but then I would include
some remark about my choice and why.

Seebs · Nov 14, 2009

'\0' is not part of the string, it's a pseudo length specifier, and
conceptually, should not be treated as part of the string. You can get
away with it in this case, but it's a bad programming habit to rely on
environmental assumptions.

The nul terminator is part of the string in C. It's not an environmental
assumption, it's a definition.

I would rather think portably, as in from one language to another. I
may use tricks when performance really matters, but then I would include
some remark about my choice and why.

You can't meaningfully "think portably" about C strings, because they're
not really analogous to things in other languages.

-s

jacob navia · Nov 14, 2009

Eric Sosman wrote:

152 is not a character; it is a number. In one popular
encoding scheme it corresponds to the character 'q', by virtue
of one of those conventional correspondences. If you want a 'q',
use a `char' and store 'q' in it. If you want the number 152
in a small space, use an `unsigned char' -- but don't think of
it as a character, because it isn't one.

The letter 'é' is 130. Why should I have it as -126 ???
The problem is that you ignore foreign languages and all their special
characters like é or è or à or £ or...

Much depends on the type to which you are converting, and
on why you are performing the conversion.

Most of the conversions are indirect, or because some operation with
characters is done by promoting, etc etc.

It sounds very much as if you are dealing with "raw" numbers,
not with numbers that correspond to characters. If so, it's
quite strange that you are using strcmp() on assemblages of these
numbers, because strcmp() isn't well-suited to the task.

Sure, if we accept that 'é' is not a character THEN obviously
strcmp is not well suited to the task.

What function should I use then?

A little thought about the artificiality of number-to-glyph
correspondences will remove much of the difficulty.

No. A little thought will make you use unsigned chars everywhere.
UNLESS you want signed small integers!

lawrence.jones · Nov 14, 2009

Ben Bacarisse said:
The most annoying is using the character class tests isxxxx.
Technically, a cast is needed to be portable:

char *cp = ...;
...
if (isdigit((unsigned char)*cp)) ...

Which has the potential to misbehave on ones' complement machines if
*cp is -0 (you might get 0 rather than UCHAR_MAX), so it's better to
cast the pointer:

if (isdigit(*(unsigned char *)cp)) ...

Eric Sosman · Nov 14, 2009

John said:
'\0' is not part of the string, it's a pseudo length specifier, and
conceptually, should not be treated as part of the string. You can get
away with it in this case, but it's a bad programming habit to rely on
environmental assumptions.

The '\0' *is* a part of the string. 7.1.1p1:

"A /string/ is a contiguous sequence of characters
terminated by and including the first null character. [...]"

The "environmental assumption" is thus on the same level as the
assumption that stdout designates a FILE*.

With real length specifiers, you wouldn't test one position beyond the
end of the string, so why do it with NUL terminated strings? It's just
a stupid C trick for some dubious performance gain. For my use of that
code, the performance gain doesn't amount to a drop in a bucket.

Okay: If your complaint is "C strings shouldn't be That Way,"
fine. We had a huge and unenlightening wrangle over this issue
just a month or so ago. But if the presence of the '\0' bothers
you, it's hard to see how `while (*exam && ...)' assuages your
worries, given its explicit '\0' test.

I would rather think portably, as in from one language to another. I
may use tricks when performance really matters, but then I would include
some remark about my choice and why.

Ah, but how do you set off the remarks, without the non-portable
assumption that comments are surrounded by /*...*/ or by //...'\n'?
At some point you simply *must* assume that the language you use is
as described by the relevant documentation, or you cannot use the
language.

John Kelly · Nov 14, 2009

'\0' is not part of the string, it's a pseudo length specifier, and
conceptually, should not be treated as part of the string. You can get
away with it in this case, but it's a bad programming habit to rely on
environmental assumptions.

The '\0' *is* a part of the string. 7.1.1p1:

"A /string/ is a contiguous sequence of characters
terminated by and including the first null character. [...]"

So they say. But conceptually, I think otherwise.

If your complaint is "C strings shouldn't be That Way,"

No, when I use C, I work around its limitations.

fine. We had a huge and unenlightening wrangle over this issue
just a month or so ago. But if the presence of the '\0' bothers
you, it's hard to see how `while (*exam && ...)' assuages your
worries, given its explicit '\0' test.

The string is data and the '\0' is metadata. The standard say it's all
data, but that's what someone else said. I think the '\0' is metadata,
serving as a pseudo length specifier.

I'm not worried the standard will change and break my program. I could
remove the *exam test to reduce the loop test to a single condition, if
performance really mattered. But it doesn't in this case. And leaving
it in reminds me how to think. I don't want to forget how to think.

Ah, but how do you set off the remarks, without the non-portable
assumption that comments are surrounded by /*...*/ or by //...'\n'?
At some point you simply *must* assume that the language you use is
as described by the relevant documentation, or you cannot use the
language.

I think I can use C effectively without being a slave to the standard.

Seebs · Nov 14, 2009

The '\0' *is* a part of the string. 7.1.1p1:
"A /string/ is a contiguous sequence of characters
terminated by and including the first null character. [...]"

So they say. But conceptually, I think otherwise.

This explains a fair bit.

I think you're mistaken, though. Conceptually, the object includes all its
storage. The terminating null byte is part of the storage of the object;
that's why you have to allocate space for it when allocating a string, for
instance.

No, when I use C, I work around its limitations.

You might find it more rewarding to adapt to the model of a language
when using it.

The string is data and the '\0' is metadata. The standard say it's all
data, but that's what someone else said.

But since the someone else defines the language, they win.

I think the '\0' is metadata, serving as a pseudo length specifier.

It is, yes. Sometimes metadata is mixed in with data for various reasons.

Tables may contain sentinel values. Those values are metadata, but it
doesn't make them not part of the table.

I'm not worried the standard will change and break my program. I could
remove the *exam test to reduce the loop test to a single condition, if
performance really mattered. But it doesn't in this case. And leaving
it in reminds me how to think. I don't want to forget how to think.

I think you would do better to adopt idioms which remind you to think
like C, not idioms which remind you to think that you're programming
something else but using C to express it for unknown reasons.

I think I can use C effectively without being a slave to the standard.

Oh, certainly. But you can't use C effectively without making good use
of the standard. Going beyond what the standard allows can make sense
in some contexts. Pretending it doesn't offer the guarantees that it does,
however, is crippling.

-s

Nick · Nov 14, 2009

jacob navia said:
Eric Sosman wrote:

The letter 'é' is 130. Why should I have it as -126 ???
The problem is that you ignore foreign languages and all their special
characters like é or è or à or £ or...

No it's not. It's 195 169.

The problem is that you assume everything is the same.

So in a bigger type it should be 43459.

Flash Gordon · Nov 14, 2009

jacob said:
Eric Sosman wrote:

The letter 'é' is 130. Why should I have it as -126 ???
The problem is that you ignore foreign languages and all their special
characters like é or è or à or £ or...

Why should you care if they are negative? They are not 0 and they
represent the appropriate character.

Most of the conversions are indirect, or because some operation with
characters is done by promoting, etc etc.

I still can't see why you hit this. I can't think of any cases where
I've needed to compare a character specifically with an unsigned number.

I can't think of any time I've needed to compare a character to a size_t.

Sure, if we accept that 'é' is not a character THEN obviously
strcmp is not well suited to the task.

What function should I use then?

If it's a string then strcmp. This is not a problem because strcmp will
handle this case perfectly.

No. A little thought will make you use unsigned chars everywhere.
UNLESS you want signed small integers!

Whilst I agree that the definition of a char is less helpful than it
could be, I don't think using unsigned char throughout solves the problem.

Eric Sosman · Nov 14, 2009

jacob said:
Eric Sosman wrote:

[...]

The letter 'é' is 130. Why should I have it as -126 ???

The numeric value corresponding to 'é' is 130 in some
encodings, -126 in others, and for all I know 250 in still
others. Why should you care what the number is, as long
as you get an 'é' when you want one?

[...]

It sounds very much as if you are dealing with "raw" numbers,
not with numbers that correspond to characters. If so, it's
quite strange that you are using strcmp() on assemblages of these
numbers, because strcmp() isn't well-suited to the task.

I wrote this because you kept on about numbers, numbers,
numbers, and not characters. But perhaps I misguessed, and
you've confused numbers and characters. You're now talking
about the character 'é', which you insist "is" the number 130.
That's a needless confusion, and seems to be the source of
your grief.

Would you say that the ancient physician Galen was born
around AD 'é', or that Edison's first successful light bulb
test took place 'é' years ago, or that Smarty Jones won the
'é'th running of the Kentucky Derby? If not, why do you say
that 'é' "is" 130?

Sure, if we accept that 'é' is not a character THEN obviously
strcmp is not well suited to the task.

What function should I use then?

If you want to store the code for the character 'é', store it
in a char. If that char is part of a string, you can use strcmp()
on it.

If you want to store the number 130 in a small space, store
it in an unsigned char. Don't use strcmp() on it.

Eric Sosman · Nov 14, 2009

John said:
'\0' is not part of the string, it's a pseudo length specifier, and
conceptually, should not be treated as part of the string. You can get
away with it in this case, but it's a bad programming habit to rely on
environmental assumptions.

The '\0' *is* a part of the string. 7.1.1p1:

"A /string/ is a contiguous sequence of characters
terminated by and including the first null character. [...]"

So they say. But conceptually, I think otherwise.

Okay, so the C you're talking about is not the C that
"they" talk about, where "they" are several international
and national standards organizations.

No, when I use C, I work around its limitations.

Which C do you mean here? Kelly C, or internationally
agreed-upon C?

The string is data and the '\0' is metadata. The standard say it's all
data, but that's what someone else said. I think the '\0' is metadata,
serving as a pseudo length specifier.

If "the standard say" [sic] isn't good enough for you, what
is there to discuss?

I think I can use C effectively without being a slave to the standard.

Knowledge of is not enslavement to; it's Haddocks' Eyes.

Keith Thompson · Nov 14, 2009

jacob navia said:
The letter 'é' is 130. Why should I have it as -126 ???

Because plain char is signed on the implementation you're using.
(Just curious: What is it on lcc-win?)

[...]

No. A little thought will make you use unsigned chars everywhere.
UNLESS you want signed small integers!

Except that the standard library functions that deal with character
strings use plain char, not unsigned char -- though the plain chars
are often *interpreted* as unsigned chars.

For example, consider strcmp(). Its declaration is:

int strcmp(const char *s1, const char *s2);

and C99 7.21.4 says:

The sign of a nonzero value returned by the comparison functions
memcmp, strcmp, and strncmp is determined by the sign of the
difference between the values of the first pair of characters
(both interpreted as unsigned char) that differ in the objects
being compared.

Note that it's not *converted* to unsigned char, it's *interpreted*
as unsigned char. That might have some odd effects on
sign-and-magnitude systems.

Also, an octal or hexadecimal escape sequence in a non-wide character
constant or string literal must be within the range of unsigned char.

In effect, the language and library use type char (which may be
signed or unsigned) to hold values in the range 0..UCHAR_MAX,
typically 0..255. This can theoretically cause problems in some
cases, but in practice everything works out. You've seen a case
where the inconsistency produces a compiler warning, but the code
works as expected anyway (and you can inhibit the warning if you
choose).

There's an implicit assumption that an array of plain char can
safely be interpreted as an array of unsigned char, and vice versa.
I'm not convinced that this assumption is entirely justified by the
normative wording of the standard; on the other hand, it's likely
that the assumption is valid on all existing systems.

Personally, I think the language (including the library) would be
cleaner if plain char were required to be unsigned. But there are
historical reasons for leaving it up to the implementation. (I think
making plain char signed made for significantly more efficient code
on the PDP-11; it's likely the same issue occurred on other systems.)

Phil Carmody · Nov 14, 2009

Ian Collins said:
How do you cope with string literals?

Isn't there a way to persuade GCC to use the unsigned version of
chars? I know that on at least one vendor's gcc for POWER, it defaults
to unsigned, and there's a switch to make it use signed instead.

Answering my own question with a grep:
-fsigned-char -funsigned-bitfields -funsigned-char

Whether char being unsigned is enough to make them silently
equivalent to unsigned char in all contexts to gcc, I don't know.

Phil

Phil Carmody · Nov 14, 2009

Nick said:
That first "function" there is wrong, and really the paragraph would be
better written as:

But C doesn't have the concept of "collection of positive integers from
zero to 2^CHAR_BIT terminated by a zero", so won't have functions to
operate on them either.

It might do (where the range is taken as half-open). The implementation
is free to have CHAR_MIN=0 and CHAR_MAX=UCHAR_MAX.

Phil
