A question about pointer of array.

Phil Carmody

Keith Thompson said:
Plain char is appropriate for textual data.

The text I deal with from day to day is all ISO-8859-1.
Should I consider such text as being 'textual data'?

Would you be happy with valid characters having the
value -1, for example?

Phil
 
vippstar

The text I deal with from day to day is all ISO-8859-1.
Should I consider such text as being 'textual data'?

He means all the characters in the execution character set, which are
all guaranteed to have a value in the range [1, CHAR_MAX].
Would you be happy with valid characters having the
value -1, for example?

There are no valid or invalid characters; however, the value read by
getchar(), for example, can't be negative. (There's been a discussion
recently about this, and I believe the consensus is that "this could
break the implementation".)
 
James Kuyper

The text I deal with from day to day is all ISO-8859-1.
Should I consider such text as being 'textual data'?

He means all the characters in the execution character set, which are
all guaranteed to have a value in the range [1, CHAR_MAX].

That limitation applies only to the BASIC execution character set
(6.2.5p3). The extended execution character set is not covered by that
requirement.
There are no valid or invalid characters; however, the value read by
getchar(), for example, can't be negative.

7.19.2p2 says that, under certain strict conditions, if you write data
to a file and read it back again, the result must compare equal to the
original. There's no way that can be possible, given the way that the
standard has defined fputc() and fgetc(), unless
uc == (unsigned char)(int)uc
for all possible values uc of unsigned char.

An implementation is allowed to choose UCHAR_MAX > INT_MAX (this causes
problems for users of character-oriented I/O functions, but none that
can't be dealt with by use of feof() and ferror() - it violates no
requirements imposed by the standard). In this case, the conversion of
unsigned char values > INT_MAX to int is implementation-defined.
However, the requirements of 7.19.2p2 impose significant constraints on
that conversion. Specifically, (int)uc must produce a different value
for every unsigned char value "uc". Since all unsigned char values from
0 to INT_MAX must be converted to the same values in 'int', any unsigned
char value greater than INT_MAX must be converted to a negative value.
 
Keith Thompson

Phil Carmody said:
The text I deal with from day to day is all ISO-8859-1.
Should I consider such text as being 'textual data'?

If you want to write code that assumes ISO-8859-1, and that isn't
portable to systems that break that assumption, that's up to you.
Would you be happy with valid characters having the
value -1, for example?

No, I wouldn't be happy with that; why do you ask?
 
Richard Tobin

Would you be happy with valid characters having the
value -1, for example?
No, I wouldn't be happy with that; why do you ask?

What's the problem with them having the value -1 when stored in a
plain char on an implementation where char is signed? A character
which had a negative value when converted from an unsigned char to an
int would be a problem because of EOF, but that only arises when
sizeof(int) == 1.

-- Richard
 
Harald van Dijk

If you want to write code that assumes ISO-8859-1, and that isn't
portable to systems that break that assumption, that's up to you.

I read his message as treating the ISO-8859-1 file as essentially a byte
stream instead of text (and opened in binary mode). The program would
work regardless of the native character set, because it treats the input
as bytes, and doesn't assume that e.g. 0x41 == 'A'; it uses 0x41 instead
of 'A' where required.

Regardless of whether I misread anything, would you consider plain char
more or less appropriate than unsigned char for such a program?
 
jameskuyper

Harald said:
I read his message as treating the ISO-8859-1 file as essentially a byte
stream instead of text (and opened in binary mode). The program would
work regardless of the native character set, because it treats the input
as bytes, and doesn't assume that e.g. 0x41 == 'A'; it uses 0x41 instead
of 'A' where required.

Regardless of whether I misread anything, would you consider plain char
more or less appropriate than unsigned char for such a program?

Plain char is more appropriate than unsigned char whenever you're
making use, directly (or indirectly through third-party libraries) of
any of the many routines in the C standard library, most of them text-
oriented, that take a char* argument rather than an unsigned char*
argument, or if you're using the data in conjunction with character
constants or string literals (both built around type char). If your code
makes little or no use of such functions or literals in connection
with the data, then it isn't really treating it as textual data.
 
CBFalconer

Phil said:
The text I deal with from day to day is all ISO-8859-1.
Should I consider such text as being 'textual data'?

Would you be happy with valid characters having the
value -1, for example?

That depends. If char is equivalent to signed char on that
installation, yes. If equivalent to unsigned char, no.
 
Keith Thompson

What's the problem with them having the value -1 when stored in a
plain char on an implementation where char is signed? A character
which had a negative value when converted from an unsigned char to an
int would be a problem because of EOF, but that only arises when
sizeof(int) == 1.

The question was whether I'd be happy with it. I'm not. I make no
claim that my unhappiness has a rational basis. In other words, I was
being snarky.
 
Keith Thompson

Harald van Dijk said:
I read his message as treating the ISO-8859-1 file as essentially a byte
stream instead of text (and opened in binary mode). The program would
work regardless of the native character set, because it treats the input
as bytes, and doesn't assume that e.g. 0x41 == 'A'; it uses 0x41 instead
of 'A' where required.

Regardless of whether I misread anything, would you consider plain char
more or less appropriate than unsigned char for such a program?

The fact that it's ISO-8859-1 strongly implies that it's text, which
means char is more appropriate than unsigned char. If he uses
unsigned char, then he can't conveniently use any of the standard
library functions that deal with strings, among other things.
 
Phil Carmody

Keith Thompson said:
If you want to write code that assumes ISO-8859-1, and that isn't
portable to systems that break that assumption, that's up to you.

I thought that what I wrote looked like a question designed
to elicit either a "yes" or a "no" as an answer.

Obviously my questions are non-portable.
No, I wouldn't be happy with that; why do you ask?

Because lower case y with a diaeresis would be -1 as a plain
char on systems where a plain char is implemented as an 8 bit
two's complement signed integer. (What happens in one's
complement systems is left as a simple thought-experiment.)

Putting 2 and 2 together, you appear to not be happy dealing
with ISO-8859-1 data using C's 'char' type. Combining that
with the may-exist-or-not 2 from above, you appear to think
that ISO-8859-1 data isn't textual data.

Which does invite the question "how are you defining 'textual'?",
but wouldn't really cause any waves. I am found mem*-ing such data
far more often than I am str*-ing it, certainly.

Phil
 
Keith Thompson

Phil Carmody said:
I thought that what I wrote looked like a question designed
to elicit either a "yes" or a "no" as an answer.

Obviously my questions are non-portable.

Sorry, I thought the question was meant to be rhetorical.

Yes, in my opinion you should consider such text as being textual
data.
Because lower case y with a diaeresis would be -1 as a plain
char on systems where a plain char is implemented as an 8 bit
two's complement signed integer. (What happens in one's
complement systems is left as a simple thought-experiment.)

Putting 2 and 2 together, you appear to not be happy dealing
with ISO-8859-1 data using C's 'char' type. Combining that
with the may-exist-or-not 2 from above, you appear to think
that ISO-8859-1 data isn't textual data.

Don't read too much into my statement of unhappiness. All I meant
(apart from being a bit sarcastic) was that I'm a bit uncomfortable
with the fact that some valid character values are negative in some
implementations. I'm aware of the historical reasons for this (early
C implementations used 7-bit ASCII stored in 8-bit bytes, so the issue
didn't really arise, and signed char made for better code on the
PDP-11 and probably other CPUs).

But we're stuck with it, and there's not much to be done about it
(except perhaps for implementations making plain char unsigned
whenever possible).

8-bit ISO-8859-1 characters are still textual data.
Which does invite the question "how are you defining 'textual'?",
but wouldn't really cause any waves. I am found mem*-ing such data
far more often than I am str*-ing it, certainly.

The language is fairly strongly biased in favor of using plain char
for textual data. A string literal defines an array of plain char.
fgets() and fputs() operate on arrays of plain char. And so forth.
If you're treating textual data as arrays of unsigned char, you may be
doing things that happen to work on your implementation but are not
strictly portable. I can't be more specific without seeing some of
your code.
 
Phil Carmody

Keith Thompson said:
Sorry, I thought the question was meant to be rhetorical.

NP. ASCII's a flawed medium.
Yes, in my opinion you should consider such text as being textual
data.

Interesting. I'd like to, I really would.
Don't read too much into my statement of unhappiness. All I meant
(apart from being a bit sarcastic) was that I'm a bit uncomfortable
with the fact that some valid character values are negative in some
implementations.

100% agreement. Except my discomfort is probably higher than yours.
I'm aware of the historical reasons for this (early
C implementations used 7-bit ASCII stored in 8-bit bytes, so the issue
didn't really arise, and signed char made for better code on the
PDP-11 and probably other CPUs).

In 7-bit-text land, leaving the compiler to do whatever is optimal
(a term left deliberately undefined) on the underlying architecture
makes perfect sense. It wasn't a design with a particularly long
future to it - but while retrospect is 20/20, foresight rarely is.
But we're stuck with it, and there's not much to be done about it
(except perhaps for implementations making plain char unsigned
whenever possible).

8-bit ISO-8859-1 characters are still textual data.


The language is fairly strongly biased in favor of using plain char
for textual data. A string literal defines an array of plain char.

At least that one already has an exception which indicates some
flexibility to 'do what I want', rather than pedantically sticking to
only one possibility. Whilst most of the time the string literal
"abc" defines an array of 4 plain chars, if it's in the context
of an initialiser for an array of 3 chars, it no longer has the
4th character, which is clear DWIW behaviour.
fgets() and fputs() operate on arrays of plain char. And so forth.
If you're treating textual data as arrays of unsigned char, you may be
doing things that happen to work on your implementation but are not
strictly portable. I can't be more specific without seeing some of
your code.

I'm sure Nokia has 4 years of my check-ins from the late 1990s and
early 2000s on backup tape somewhere. ;-) Grep for swear words.
Much of the time, I have no problem just abstracting away the
'textual' nature of the data, and don't mind viewing stuff as just
'octets'. As long as 'strlen' works when passed an unsigned char *,
I normally have nothing more 'textual' to do with the data. Data's
just data.

Phil
 
Tim Rentsch

James Kuyper said:
Plain char is appropriate for textual data.
The text I deal with from day to day is all ISO-8859-1.
Should I consider such text as being 'textual data'?

He means all the characters in the execution character set, which are
all guaranteed to have a value in the range [1, CHAR_MAX].

That limitation applies only to the BASIC execution character set
(6.2.5p3). The extended execution character set is not covered by that
requirement.
There are no valid or invalid characters; however, the value read by
getchar(), for example, can't be negative.

7.19.2p2 says that, under certain strict conditions, if you write data
to a file and read it back again, the result must compare equal to the
original. There's no way that can be possible, given the way that the
standard has defined fputc() and fgetc(), unless
uc == (unsigned char)(int)uc
for all possible values uc of unsigned char.

An implementation is allowed to choose UCHAR_MAX > INT_MAX (this causes
problems for users of character-oriented I/O functions, but none that
can't be dealt with by use of feof() and ferror() - it violates no
requirements imposed by the standard). In this case, the conversion of
unsigned char values > INT_MAX to int is implementation-defined.
However, the requirements of 7.19.2p2 impose significant constraints on
that conversion. Specifically, (int)uc must produce a different value
for every unsigned char value "uc". Since all unsigned char values from
0 to INT_MAX must be converted to the same values in 'int', any unsigned
char value greater than INT_MAX must be converted to a negative value.

I expect you'll have a ready answer for this, but I'd like to ask
anyway. The functions you're talking about work on characters;
admittedly, character values converted to (unsigned char), but still
characters. Consider these cases:

(1) Some values of (unsigned char) don't correspond to
characters;

(2) The width of (char) equals the width of (unsigned char),
but (char) has a trap representation (such as negative zero);
or

(3) The width of (char) is less than the width of (unsigned char).

I've read 7.19.2p2, and also 7.19.7.1 and 7.19.7.3 (for fgetc() and
fputc()). It's clear that any /character/ value must match as
you say, but it isn't obvious that every (unsigned char) value
is also a character value. Do you have a citation for this
(and perhaps reasoning to go with it, if the cited section
is ambiguous)?
 
