Questions on conversions between char* to unsigned char* and vice versa

Navaneeth

I have a few questions on conversions between "char*" and "unsigned char*". I am assuming casting "unsigned char*" to "char*" is safe because "char" can hold all the values that an "unsigned char" can hold.

But conversion of "char*" to "unsigned char*" won't be safe as "char" can hold more values. Is this understanding correct? In what cases will "char*" have negative values?

I have never seen negative values in a "char*" string. So is it safe to convert from "char*" to "unsigned char*"?

By conversion, I mean using a cast: char* c = (char*) string; where string is an "unsigned char*".

Why I am using unsigned char
 
Angus

Navaneeth said:
I have a few questions on conversions between "char*" and "unsigned char*". I am assuming casting "unsigned char*" to "char*" is safe because "char" can hold all the values that an "unsigned char" can hold.

But conversion of "char*" to "unsigned char*" won't be safe as "char" can hold more values. Is this understanding correct? In what cases will "char*" have negative values?

I have never seen negative values in a "char*" string. So is it safe to convert from "char*" to "unsigned char*"?

By conversion, I mean using a cast: char* c = (char*) string; where string is an "unsigned char*".

Why I am using unsigned char
------

If anyone is wondering why I use unsigned char: I use it for some UTF-8 processing on the string. I need it to skip the multi-byte sequences correctly.

Any help would be great!

In ASCII (and maybe also EBCDIC, not sure) all the printing
characters are represented as positive numbers, i.e. only the lower 7
bits are used, so converting printable characters either way should
make no difference.

That also assumes your target machine uses two's complement.

If you are using extended characters then you may have problems.
 
Ben Bacarisse

Navaneeth said:
I have a few questions on conversions between "char*" and "unsigned
char*". I am assuming casting "unsigned char*" to "char*" is safe
because "char" can hold all the values that an "unsigned char" can
hold.

But conversion of "char*" to "unsigned char*" won't be safe as "char"
can hold more values. Is this understanding correct? In what cases
will "char*" have negative values?

There's been some confusion in the answers you've had. For one thing,
they reinforce your idea that the conversion of a char * to an unsigned
char * might be related to the range of values the char and unsigned
char can represent. This is not the case.

You can convert from a char * to an unsigned char * because the language
standard permits this.

Once you have done so, the characters pointed to are not converted when
you access them. Conversion has a special meaning in C, and it does not
apply here. Having done:

unsigned char *up = (unsigned char *)cp;

*up (or up[0]) does not convert anything. It simply reinterprets the
first byte of whatever cp pointed to as an unsigned char -- i.e. as a
number from 0 to UCHAR_MAX (almost always 255).

I have never seen negative values in a "char*" string. So is it safe
to convert from "char*" to "unsigned char*"?

Yes, and it is safe regardless of whether there are negative char values.

You may view *any* object at all (and a string of chars is no different in
this respect) by converting a pointer to it to an unsigned char * and
examining the bytes of the object through that converted pointer.

By conversion, I mean using a cast: char* c = (char*) string; where
string is an "unsigned char*".

This is also safe, but much less useful. char is an odd type -- it may
be signed or it may be unsigned, so it is less useful than unsigned char
for examining objects. However, it is safe to do this pointer conversion
and you'll do it often if you are working with unsigned char * and you
have to call library functions that expect a char * parameter.

Why I am using unsigned char
------

If anyone is wondering why I use unsigned char: I use it for some
UTF-8 processing on the string. I need it to skip the multi-byte
sequences correctly.

That's a perfectly valid reason to use unsigned char. You can do all
this using char * rather than unsigned char *, but I think the code is
clearer if you use unsigned char.
 
Keith Thompson

Navaneeth said:
I have a few questions on conversions between "char*" and "unsigned
char*". I am assuming casting "unsigned char*" to "char*" is safe
because "char" can hold all the values that an "unsigned char" can
hold.

char cannot necessarily hold all the values that an unsigned char can hold.

(Plain) char may be either signed or unsigned, depending on the
implementation. If it's unsigned, it has exactly the same range
as unsigned char. But if it's signed, it can hold negative values.
Very commonly, the range of char is -128 .. +127, and the range of
unsigned char is 0 .. 255.

ASCII only specifies character values from 0 to 127, but there are
a number of extended-ASCII character sets (Latin-1, for example)
that specify character values from 0 to 255. This makes dealing
with Latin-1 characters as (signed) char slightly awkward.

(EBCDIC is an 8-bit encoding; systems that use EBCDIC (almost?) always
make plain char unsigned.)
 
Seebs

I have a few questions on conversions between "char*" and "unsigned
char*". I am assuming casting "unsigned char*" to "char*" is safe
because "char" can hold all the values that an "unsigned char" can
hold.

This is true if, and only if, you are on a system where "char" and "unsigned
char" have the exact same range of values. Otherwise, there will be values
that you can store in "unsigned char" that can't be stored in "char".

But conversion of "char*" to "unsigned char*" won't be safe as
"char" can hold more values.

No, it can't. At least, so far as I recall, it's absolutely necessary
that "unsigned char" have at least as many possible values as "char".

Is this understanding correct? In what cases will "char*" have
negative values?

Negative values are not coherent for pointers. You probably meant "char".
The answer is, if you're on an implementation where "char" is a signed type,
then sometimes it could have negative values.

I have never seen negative values in a "char*" string. So is it
safe to convert from "char*" to "unsigned char*"?

Maybe.

By conversion, I mean using a cast: char* c = (char*) string;
where string is an "unsigned char*".

Maybe.

You haven't explained what you mean by "safe", though. If you convert any
numeric value whatsoever to "unsigned char", it is guaranteed "safe" in that
it cannot cause a processor trap, or result in a value that is not valid
for "unsigned char". It may, however, not be the value you expected to get.
For instance, on most modern CPUs, if you convert any of 256, 512, or 1024 to
unsigned char, you will quite safely and reliably get the value 0. But it
won't crash.

If anyone is wondering why I use unsigned char: I use it for some
UTF-8 processing on the string. I need it to skip the multi-byte
sequences correctly.

So you probably do. But before you go reinventing the wheel, why not check
to see what your implementation has for existing UTF-8 support?

If you're at a level of experience where you're not quite sure about how
char and unsigned char interact, I would suggest that you are probably not
ready to reliably and consistently implement UTF-8. If you're doing it just
to learn, hey, sounds like a fun project, good luck with that. If you're
doing it because you want to get something done, though, consider using the
existing code that already does it correctly.

-s
 
Barry Schwarz

In ASCII (and maybe also EBCDIC, not sure) all the printing
characters are represented as positive numbers, i.e. only the lower 7
bits are used, so converting printable characters either way should
make no difference.

In EBCDIC, upper case letters range between 0xC1 and 0xE9 (and they
are not contiguous). Digits range from 0xF1 to 0xF9. Definitely not
the lower 7 bits. On EBCDIC systems, char defaults to unsigned char
to avoid negative values for normal characters.
 
Ben Bacarisse

Seebs said:
On 2010-12-31, Navaneeth wrote:

Maybe.

You haven't explained what you mean by "safe", though. If you convert any
numeric value whatsoever to "unsigned char", it is guaranteed "safe" in that
it cannot cause a processor trap, or result in a value that is not valid
for "unsigned char". It may, however, not be the value you expected to get.
For instance, on most modern CPUs, if you convert any of 256, 512, or 1024 to
unsigned char, you will quite safely and reliably get the value 0. But it
won't crash.

Did you miss the * in the question? I am not sure why you are talking
about converting numbers to unsigned char. That is not what is being
asked about.

<snip>
 
Seebs

Did you miss the * in the question?

Yes.

I am not sure why you are talking
about converting numbers to unsigned char. That is not what is being
asked about.

Probably because elsewhere there was a * that looked spurious, so I started
translating everything to questions about conversions between values -- in
particular, because of the assertion that char could hold more values than
unsigned char. At least, I think that was how it happened; my brain is a
mysterious place.

-s
 
Keith Thompson

Ben Pfaff said:
It's not just a default. Having plain char be signed would be
nonconforming in an EBCDIC environment.

Unless CHAR_BIT > 8, but I presume that all existing EBCDIC-based
systems have CHAR_BIT==8. (If EBCDIC had caught on more widely
than it did, there could easily have been, for example, EBCDIC-based
DSPs with CHAR_BIT==32.)
 
