A question about pointer of array.

Phil Carmody

Keith Thompson said:
Plain char is appropriate for textual data.

The text I deal with from day to day is all ISO-8859-1.
Should I consider such text as being 'textual data'?

Would you be happy with valid characters having the
value -1, for example?

Phil
 
vippstar

The text I deal with from day to day is all ISO-8859-1.
Should I consider such text as being 'textual data'?

He means all the characters in the execution character set, which are
all guaranteed to have a value in the range [1, CHAR_MAX].
Would you be happy with valid characters having the
value -1, for example?

There are no valid or invalid characters; however, the value read by
getchar(), for example, can't be negative. (There's been a discussion
recently about this, and I believe the consensus is that "this could
break the implementation".)
 
James Kuyper

The text I deal with from day to day is all ISO-8859-1.
Should I consider such text as being 'textual data'?

He means all the characters in the execution character set, which are
all guaranteed to have a value in the range [1, CHAR_MAX].

That limitation applies only to the BASIC execution character set
(6.2.5p3). The extended execution character set is not covered by that
requirement.
There are no valid or invalid characters; however, the value read by
getchar(), for example, can't be negative.

7.19.2p2 says that, under certain strict conditions, if you write data
to a file and read it back again, the result must compare equal to the
original. There's no way that can be possible, given the way that the
standard has defined fputc() and fgetc(), unless
uc == (unsigned char)(int)uc
for all possible values uc of unsigned char.

An implementation is allowed to choose UCHAR_MAX > INT_MAX (this causes
problems for users of character-oriented I/O functions, but none that
can't be dealt with by use of feof() and ferror() - it violates no
requirements imposed by the standard). In this case, the conversion of
unsigned char values > INT_MAX to int is implementation-defined.
However, the requirements of 7.19.2p2 impose significant constraints on
that conversion. Specifically, (int)uc must produce a different value
for every unsigned char value "uc". Since all unsigned char values from
0 to INT_MAX must be converted to the same values in 'int', any unsigned
char value greater than INT_MAX must be converted to a negative value.
 
Keith Thompson

Phil Carmody said:
The text I deal with from day to day is all ISO-8859-1.
Should I consider such text as being 'textual data'?

If you want to write code that assumes ISO-8859-1, and that isn't
portable to systems that break that assumption, that's up to you.
Would you be happy with valid characters having the
value -1, for example?

No, I wouldn't be happy with that; why do you ask?
 
Richard Tobin

Would you be happy with valid characters having the
value -1, for example?
No, I wouldn't be happy with that; why do you ask?

What's the problem with them having the value -1 when stored in a
plain char on an implementation where char is signed? A character
which had a negative value when converted from an unsigned char to an
int would be a problem because of EOF, but that only arises when
sizeof(int) == 1.

-- Richard
 
Harald van Dijk

If you want to write code that assumes ISO-8859-1, and that isn't
portable to systems that break that assumption, that's up to you.

I read his message as treating the ISO-8859-1 file as essentially a byte
stream instead of text (and opened in binary mode). The program would
work regardless of the native character set, because it treats the input
as bytes, and doesn't assume that e.g. 0x41 == 'A'; it uses 0x41 instead
of 'A' where required.

Regardless of whether I misread anything, would you consider plain char
more or less appropriate than unsigned char for such a program?
 
jameskuyper

Harald said:
I read his message as treating the ISO-8859-1 file as essentially a byte
stream instead of text (and opened in binary mode). The program would
work regardless of the native character set, because it treats the input
as bytes, and doesn't assume that e.g. 0x41 == 'A'; it uses 0x41 instead
of 'A' where required.

Regardless of whether I misread anything, would you consider plain char
more or less appropriate than unsigned char for such a program?

Plain char is more appropriate than unsigned char whenever you're
making use, directly (or indirectly through third-party libraries) of
any of the many routines in the C standard library, most of them text-
oriented, that take a char* argument rather than an unsigned char*
argument, or if you're using the data in conjunction with character
constants or string literals (both built around type char). If your code
makes little or no use of such functions or literals in connection
with the data, then it isn't really treating it as textual data.
 
CBFalconer

Phil said:
The text I deal with from day to day is all ISO-8859-1.
Should I consider such text as being 'textual data'?

Would you be happy with valid characters having the
value -1, for example?

That depends. If char is equivalent to signed char on that
installation, yes. If equivalent to unsigned char, no.
 
Keith Thompson

What's the problem with them having the value -1 when stored in a
plain char on an implementation where char is signed? A character
which had a negative value when converted from an unsigned char to an
int would be a problem because of EOF, but that only arises when
sizeof(int) == 1.

The question was whether I'd be happy with it. I'm not. I make no
claim that my unhappiness has a rational basis. In other words, I was
being snarky.
 
Keith Thompson

Harald van Dijk said:
I read his message as treating the ISO-8859-1 file as essentially a byte
stream instead of text (and opened in binary mode). The program would
work regardless of the native character set, because it treats the input
as bytes, and doesn't assume that e.g. 0x41 == 'A'; it uses 0x41 instead
of 'A' where required.

Regardless of whether I misread anything, would you consider plain char
more or less appropriate than unsigned char for such a program?

The fact that it's ISO-8859-1 strongly implies that it's text, which
means char is more appropriate than unsigned char. If he uses
unsigned char, then he can't conveniently use any of the standard
library functions that deal with strings, among other things.
 
Phil Carmody

Keith Thompson said:
If you want to write code that assumes ISO-8859-1, and that isn't
portable to systems that break that assumption, that's up to you.

I thought that what I wrote looked like a question designed
to elicit either a "yes" or a "no" as an answer.

Obviously my questions are non-portable.
No, I wouldn't be happy with that; why do you ask?

Because lower case y with a diaeresis would be -1 as a plain
char on systems where a plain char is implemented as an 8 bit
two's complement signed integer. (What happens in one's
complement systems is left as a simple thought-experiment.)

Putting 2 and 2 together, you appear to not be happy dealing
with ISO-8859-1 data using C's 'char' type. Combining that
with the may-exist-or-not 2 from above, you appear to think
that ISO-8859-1 data isn't textual data.

Which does invite the question "how are you defining 'textual'?",
but wouldn't really cause any waves. I am found mem*-ing such data
far more often than I am str*-ing it, certainly.

Phil
 
Keith Thompson

Phil Carmody said:
I thought that what I wrote looked like a question designed
to elicit either a "yes" or a "no" as an answer.

Obviously my questions are non-portable.

Sorry, I thought the question was meant to be rhetorical.

Yes, in my opinion you should consider such text as being textual
data.
Because lower case y with a diaeresis would be -1 as a plain
char on systems where a plain char is implemented as an 8 bit
two's complement signed integer. (What happens in one's
complement systems is left as a simple thought-experiment.)

Putting 2 and 2 together, you appear to not be happy dealing
with ISO-8859-1 data using C's 'char' type. Combining that
with the may-exist-or-not 2 from above, you appear to think
that ISO-8859-1 data isn't textual data.

Don't read too much into my statement of unhappiness. All I meant
(apart from being a bit sarcastic) was that I'm a bit uncomfortable
with the fact that some valid character values are negative in some
implementations. I'm aware of the historical reasons for this (early
C implementations used 7-bit ASCII stored in 8-bit bytes, so the issue
didn't really arise, and signed char made for better code on the
PDP-11 and probably other CPUs).

But we're stuck with it, and there's not much to be done about it
(except perhaps for implementations making plain char unsigned
whenever possible).

8-bit ISO-8859-1 characters are still textual data.
Which does invite the question "how are you defining 'textual'?",
but wouldn't really cause any waves. I am found mem*-ing such data
far more often than I am str*-ing it, certainly.

The language is fairly strongly biased in favor of using plain char
for textual data. A string literal defines an array of plain char.
fgets() and fputs() operate on arrays of plain char. And so forth.
If you're treating textual data as arrays of unsigned char, you may be
doing things that happen to work on your implementation but are not
strictly portable. I can't be more specific without seeing some of
your code.
 
Phil Carmody

Keith Thompson said:
Sorry, I thought the question was meant to be rhetorical.

NP. ASCII's a flawed medium.
Yes, in my opinion you should consider such text as being textual
data.

Interesting. I'd like to, I really would.
Don't read too much into my statement of unhappiness. All I meant
(apart from being a bit sarcastic) was that I'm a bit uncomfortable
with the fact that some valid character values are negative in some
implementations.

100% agreement. Except my discomfort is probably higher than yours.
I'm aware of the historical reasons for this (early
C implementations used 7-bit ASCII stored in 8-bit bytes, so the issue
didn't really arise, and signed char made for better code on the
PDP-11 and probably other CPUs).

In 7-bit-text land, leaving the compiler to do whatever is optimal
(a term left deliberately undefined) on the underlying architecture
makes perfect sense. It wasn't a design with a particularly long
future to it - but while retrospect is 20/20, foresight rarely is.
But we're stuck with it, and there's not much to be done about it
(except perhaps for implementations making plain char unsigned
whenever possible).

8-bit ISO-8859-1 characters are still textual data.


The language is fairly strongly biased in favor of using plain char
for textual data. A string literal defines an array of plain char.

At least that one already has an exception which indicates some
flexibility to 'do what I want', rather than pedantically sticking to
only one possibility. Whilst most of the time the string literal
"abc" defines an array of 4 plain chars, if it's in the context
of an initialiser for an array of 3 chars, it no longer has the
4th character, which is clear DWIW behaviour.
fgets() and fputs() operate on arrays of plain char. And so forth.
If you're treating textual data as arrays of unsigned char, you may be
doing things that happen to work on your implementation but are not
strictly portable. I can't be more specific without seeing some of
your code.

I'm sure Nokia has 4 years of my check-ins from the late 1990s and
early 2000s on backup tape somewhere. ;-) Grep for swear words.
Much of the time, I have no problem just abstracting away the
'textual' nature of the data, and don't mind viewing stuff as just
'octets'. As long as 'strlen' works when passed an unsigned char *,
I normally have nothing more 'textual' to do with the data. Data's
just data.

Phil
 
Tim Rentsch

James Kuyper said:
Plain char is appropriate for textual data.
The text I deal with from day to day is all ISO-8859-1.
Should I consider such text as being 'textual data'?

He means all the characters in the execution character set, which are
all guaranteed to have a value in the range [1, CHAR_MAX].

That limitation applies only to the BASIC execution character set
(6.2.5p3). The extended execution character set is not covered by that
requirement.
There are no valid or invalid characters; however, the value read by
getchar(), for example, can't be negative.

7.19.2p2 says that, under certain strict conditions, if you write data
to a file and read it back again, the result must compare equal to the
original. There's no way that can be possible, given the way that the
standard has defined fputc() and fgetc(), unless
uc == (unsigned char)(int)uc
for all possible values uc of unsigned char.

An implementation is allowed to choose UCHAR_MAX > INT_MAX (this causes
problems for users of character-oriented I/O functions, but none that
can't be dealt with by use of feof() and ferror() - it violates no
requirements imposed by the standard). In this case, the conversion of
unsigned char values > INT_MAX to int is implementation-defined.
However, the requirements of 7.19.2p2 impose significant constraints on
that conversion. Specifically, (int)uc must produce a different value
for every unsigned char value "uc". Since all unsigned char values from
0 to INT_MAX must be converted to the same values in 'int', any unsigned
char value greater than INT_MAX must be converted to a negative value.

I expect you'll have a ready answer for this, but I'd like to ask
anyway. The functions you're talking about work on characters;
admittedly, character values converted to (unsigned char), but still
characters. Consider these cases:

(1) Some values of (unsigned char) don't correspond to
characters;

(2) The width of (char) equals the width of (unsigned char),
but (char) has a trap representation (such as negative zero);
or

(3) The width of (char) is less than the width of (unsigned char).

I've read 7.19.2p2, and also 7.19.7.1 and 7.19.7.3 (for fgetc() and
fputc()). It's clear that any /character/ value must match as
you say, but it isn't obvious that every (unsigned char) value
is also a character value. Do you have a citation for this
(and perhaps reasoning to go with it, if the cited section
is ambiguous)?
 
