char data to unsigned char

Jonathan Lee

OK stupid question time.

What's the proper way to process char data as unsigned char? A lot of
the things I write are only defined on "raw" data. Like, huffman
decoding, or block ciphers. But a lot of this data comes from files or
strings which are char sources. The way I've been handling this until
now is to simply reinterpret_cast<> a pointer to a char buffer into an
unsigned char pointer. Like

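// "real" function, declared first so the convenience overload can see it
void encrypt(const unsigned char* data, std::size_t n);

// convenience
void encrypt(const char* data, std::size_t n) {
    encrypt(reinterpret_cast<const unsigned char*>(data), n);
}

// "real" function
void encrypt(const unsigned char* data, std::size_t n) {
..
}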

I don't see any guarantee in the standard that this will work, and
it's bugging me.

So... really what should I be doing?

--Jonathan
 
Alf P. Steinbach

OK stupid question time.

What's the proper way to process char data as unsigned char? A lot of
the things I write are only defined on "raw" data. Like, huffman
decoding, or block ciphers. But a lot of this data comes from files or
strings which are char sources. The way I've been handling this until
now is to simply reinterpret_cast<> a pointer to a char buffer into an
unsigned char pointer. Like

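// "real" function, declared first so the convenience overload can see it
void encrypt(const unsigned char* data, std::size_t n);

// convenience
void encrypt(const char* data, std::size_t n) {
    encrypt(reinterpret_cast<const unsigned char*>(data), n);
}

// "real" function
void encrypt(const unsigned char* data, std::size_t n) {
..
}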

I don't see any guarantee in the standard that this will work, and
it's bugging me.

Don't let it. There's no formal guarantee about assigning back to char, but (1)
you're probably not doing any assigning back to plain char, and (2) that lack of
formal guarantee is just in support of sign-and-magnitude char's on the ENIAC
(some member of the C++ committee fancies the ENIAC). Nobody uses the ENIAC any
more, and besides, there's no C++ compiler for that machine.

So... really what should I be doing?

Exactly what you're doing. :)

Well, except that I'd prefer to use a signed size spec instead of unsigned, but
hey, that's not something that I can say you "should" be doing, just that
avoiding mixing signed and unsigned in expressions can save a lot of work.


Cheers & hth.,

- Alf
 
James Kanze

On 02.05.2010 03:43, * Jonathan Lee:

[...]
Don't let it. There's no formal guarantee about assigning back
to char, but (1) you're probably not doing any assigning back
to plain char, and (2) that lack of formal guarantee is just
in support of sign-and-magnitude char's on the ENIAC (some
member of the C++ committee fancies the ENIAC). Nobody uses
the ENIAC any more, and besides, there's no C++ compiler for
that machine.

First, I don't know whether the ENIAC used signed magnitude, but
there are machines being sold today (so presumably also used)
which use signed magnitude or one's complement. The one with
one's complement definitely has a C++ compiler. Not everything
is a PC. (Of course, absolute portability isn't a requirement
for all applications, and if you're using, say, WinMain instead
of main, you can pretty much ignore the existence of "unusual"
architectures.)

Which doesn't mean your basic premise is wrong. The C++
standard is a little vague about this, but I'd argue that the
intent is that copying any POD type as a char (e.g. through a
char*) is value preserving, regardless of the original type.
(IIRC, this is not the case for C, which only guarantees value
preservation in the case of unsigned char. But I could be wrong
about this.) And of course, this is trivial to guarantee on any
machine which meets the guarantees for unsigned char---just make
plain char unsigned. (This is what Univac does for both the
2200 and the MCP architectures---one's complement and signed
magnitude, respectively.)
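
To make the byte-view idea concrete, here is a minimal sketch of the pattern under discussion. The `checksum` functions are hypothetical stand-ins for a "real" raw-data routine such as a cipher or decoder, and the example assumes 8-bit bytes. It only illustrates that reading a char buffer through a `const unsigned char*` yields values in 0..UCHAR_MAX; it is not a statement of what the standard formally guarantees.

```cpp
#include <cstddef>
#include <string>

// Hypothetical "real" routine: works on raw bytes. It just sums them so
// that the effect of the unsigned view is observable.
unsigned long checksum(const unsigned char* data, std::size_t n) {
    unsigned long sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        sum += data[i];  // each byte contributes 0..UCHAR_MAX, never a negative value
    }
    return sum;
}

// Convenience overload for char sources (files, std::string).
// The buffer is reinterpreted in place; no bytes are copied or converted.
unsigned long checksum(const char* data, std::size_t n) {
    return checksum(reinterpret_cast<const unsigned char*>(data), n);
}
```

Called as `checksum(s.data(), s.size())` on a `std::string s`, the char overload forwards to the byte-oriented one, so the caller never has to care whether plain char is signed.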

Independently of the standard... One of the most common idioms
in C is something like:

char* p;    // assumed to point into a large enough buffer
int c = getchar();
while (c != EOF && c != '\n') {
    *p++ = c;
    c = getchar();
}

This only works if you are able to assign an int with the values
0...UCHAR_MAX to a char without loss of information. Something
which is explicitly *not* guaranteed by the standard. But
something which is so ubiquitous that no one would dare violate
it. (And as I said above, it is guaranteed that you can
implement it, relatively cheaply, in fact, by making plain char
unsigned.)
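
The assumption behind that idiom can be checked directly. The sketch below uses a hypothetical helper, `char_round_trips`, and assumes int is wider than char. It tests whether every value in 0..UCHAR_MAX survives being stored in a plain char and read back as an unsigned char. On the usual 2's complement implementations it should report success, but that is an observation about those implementations rather than a guarantee taken from the standard text discussed here.

```cpp
#include <climits>

// Returns true if every value in 0..UCHAR_MAX survives the trip
// int -> char -> unsigned char on this implementation.
bool char_round_trips() {
    for (int i = 0; i <= UCHAR_MAX; ++i) {
        char c = static_cast<char>(i);             // what "*p++ = c" does implicitly
        if (static_cast<unsigned char>(c) != i) {  // read the stored value back as a byte
            return false;
        }
    }
    return true;
}
```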
 
Jonathan Lee

(And as I said above, it is guaranteed that you can
implement it, relatively cheaply, in fact, by making plain char
unsigned.)

James, are you suggesting that I enforce char is unsigned? Like

#if (CHAR_MIN != 0)
#error "Set your compiler options so that char is unsigned"
#endif

--Jonathan
 
James Kanze

James, are you suggesting that I enforce char is unsigned? Like
#if (CHAR_MIN != 0)
#error "Set your compiler options so that char is unsigned"
#endif

Certainly not.

Logically, if, as I believe, char is intended to contain
character data, it was a mistake to allow it to be signed---I
don't know of a single character encoding which has negative
values. (And to be fair: it was a mistake, in 1990, when the C
standard was adopted. In the original implementations of C, on
a PDP-11, there were valid reasons for allowing signed.)

But it's far too late to change that now. All we can do is
count on the fact that no compiler would dare break the classic
idiom I mentioned. Which means that whatever the other
characteristics of the compiler, assigning an unsigned char to a
char will not lose information, and that information will be
used in an expected fashion in the implementation of iostream.
For a 2's complement machine which does the conversions simply
by copying bits (by far the most frequent case), this will work
automatically. For other machines or implementations, the
simplest solution is to make plain char unsigned (and this is
what all of the implementations I know of for such machines do).

As for an option which changes the representation of plain
char---that's probably the worst solution possible, since it
means that (formally) char isn't the same thing as char in two
different modules. (Practically, again, on a 2's complement
machine which does conversions by just copying bits, you can get away
with it, since it doesn't matter.)
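
Rather than forcing plain char to be unsigned for the whole build, code that needs byte values can convert at the point of use. The following is a minimal sketch assuming a hypothetical `byte_histogram` helper, the kind of frequency table a Huffman coder might build. It works the same way whether plain char is signed or unsigned.

```cpp
#include <array>
#include <climits>
#include <cstddef>
#include <string>

// Hypothetical helper: counts how often each byte value occurs in s.
std::array<std::size_t, UCHAR_MAX + 1> byte_histogram(const std::string& s) {
    std::array<std::size_t, UCHAR_MAX + 1> counts{};  // all counters start at zero
    for (char c : s) {
        // Convert through unsigned char so that a negative char value
        // can never be used as a negative array index.
        ++counts[static_cast<unsigned char>(c)];
    }
    return counts;
}
```

The conversion is local to the one expression that needs it, so no compiler option or `#error` check on `CHAR_MIN` is involved.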
 
Jonathan Lee

Certainly not.

Alright, cool. I couldn't tell if you meant compiler writers
made plain char unsigned, or if you meant for me to do it. I
thought the latter was a bit strange...

--Jonathan
 
