Questions on conversions between char* to unsigned char* and vice versa

Discussion in 'C Programming' started by Navaneeth, Dec 31, 2010.

  1. Navaneeth

    Navaneeth Guest

    I have a few questions on conversions between "char*" and "unsigned char*". I am assuming that casting "unsigned char*" to "char*" is safe because "char" can hold all the values that an "unsigned char" can hold.

    But conversion of "char*" to "unsigned char*" won't be safe, as "char" can hold more values. Is this understanding correct? In what cases will a "char*" string have negative values?

    I have never seen negative values in a "char*" string. So is it safe to convert from "char*" to "unsigned char*"?

    By conversion, I mean using casting - char* c = (char*) string; where string is a "unsigned char*".

    Why I am using unsigned char
    ------

    If anyone is wondering why I use unsigned char: I use it for some UTF-8 processing on the string. I need it to skip the multi-byte sequences correctly.

    Any help would be great!
     
    Navaneeth, Dec 31, 2010
    #1

  2. Navaneeth

    Angus Guest

    On Dec 31, 11:19 am, Navaneeth <> wrote:
    > I have few questions on conversions between "char*" to "unsigned char*" and vice versa. I am assuming casting "unsigned char*" to "char*" is safe because "char" can hold all the values that an "unsigned char" can hold.
    >
    > But conversion of "char*" to "unsigned char*" won't be safe as "char" can hold more values. Is this understanding correct? On what cases "char*" will have negative values?
    >
    > I have never seen negative values on a "char*" string. So is that safe to do conversion from "char*" to "unsigned char*"?
    >
    > By conversion, I mean using casting - char* c = (char*) string; where string is a "unsigned char*".
    >
    > Why I am using unsigned char
    > ------
    >
    > If any one wondering, why I use unsigned char - I use it for doing some UTF8 processing on the string. I need to use that to skip the multi-byte sequences correctly.
    >
    > Any help would be great!


    In ASCII (and maybe also EBCDIC, not sure) all the printing
    characters are represented as positive numbers - i.e. only the lower 7
    bits are used, so converting printable characters either way should
    make no difference.

    That also assumes your target machine uses two's complement.

    If you are using extended characters then possibly you may have
    problems.
     
    Angus, Dec 31, 2010
    #2

  3. Navaneeth <> writes:

    > I have few questions on conversions between "char*" to "unsigned
    > char*" and vice versa. I am assuming casting "unsigned char*" to
    > "char*" is safe because "char" can hold all the values that an
    > "unsigned char" can hold.
    >
    > But conversion of "char*" to "unsigned char*" won't be safe as "char"
    > can hold more values. Is this understanding correct? On what cases
    > "char*" will have negative values?


    There's been some confusion in the answers you've had. For one thing,
    they reinforce your idea that the conversion of a char * to an unsigned
    char * might be related to the range of values the char and unsigned
    char can represent. This is not the case.

    You can convert from a char * to an unsigned char * because the language
    standard permits this.

    Once you have done so, the characters pointed to are not converted when
    you access them. Conversion has a special meaning in C, and it does not
    apply here. Having done:

    unsigned char *up = (unsigned char *)cp;

    *up (or up[0]) does not convert anything. It simply reinterprets the
    first byte of whatever cp pointed to as an unsigned char -- i.e. as a
    number from 0 to UCHAR_MAX (almost always 255).
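
    A minimal sketch of that reinterpretation, assuming nothing beyond standard C. The helper name first_byte is made up for illustration; it is not from the thread.

```c
#include <limits.h>

/* Hypothetical helper: read the first byte of a char string through an
   unsigned char pointer. The stored bits are not changed; they are only
   read as a value in the range 0..UCHAR_MAX. */
unsigned first_byte(const char *cp)
{
    const unsigned char *up = (const unsigned char *)cp;
    return up[0];
}
```

    With a byte such as 0xE9 in the string, cp[0] may be negative where plain char is signed, while first_byte(cp) still yields 0xE9.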

    > I have never seen negative values on a "char*" string. So is that safe
    > to do conversion from "char*" to "unsigned char*"?


    Yes, and it is safe regardless of whether there are negative char values.

    You may view *any* object at all (and a string of chars is no different in
    this respect) by converting a pointer to it to an unsigned char * and
    examining the bytes of the object by using that converted pointer.
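
    As an illustration of examining an object's bytes this way, here is a small sketch. The function name copy_bytes is an assumption made for the example, and the caller is assumed to supply an output buffer of at least n bytes.

```c
#include <stddef.h>

/* Hypothetical helper: copy the n bytes that make up an object into out,
   reading them through an unsigned char pointer. */
void copy_bytes(const void *obj, size_t n, unsigned char *out)
{
    const unsigned char *p = (const unsigned char *)obj;
    for (size_t i = 0; i < n; i++)
        out[i] = p[i];
}
```

    It could be called as copy_bytes(&x, sizeof x, buf) for an object x of any type, with buf declared as unsigned char buf[sizeof x].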

    > By conversion, I mean using casting - char* c = (char*) string; where
    > string is a "unsigned char*".


    This is also safe, but much less useful. char is an odd type -- it may
    be signed or it may be unsigned, so it is less useful than unsigned char
    for examining objects. However, it is safe to do this pointer conversion
    and you'll do it often if you are working with unsigned char * and you
    have to call library functions that expect a char * parameter.
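
    For example, passing an unsigned char buffer to strlen needs exactly this cast. The wrapper name ustrlen is invented for the sketch.

```c
#include <string.h>

/* Hypothetical wrapper: length of a NUL-terminated buffer that the caller
   holds as unsigned char. strlen expects const char *, so the pointer is
   cast; the bytes themselves are untouched. */
size_t ustrlen(const unsigned char *s)
{
    return strlen((const char *)s);
}
```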

    > Why I am using unsigned char
    > ------
    >
    > If any one wondering, why I use unsigned char - I use it for doing
    > some UTF8 processing on the string. I need to use that to skip the
    > multi-byte sequences correctly.


    That's a perfectly valid reason to use unsigned char. You can do all
    this using char * rather than unsigned char *, but I think the code is
    clearer if you use unsigned char.
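
    A rough sketch of the kind of skipping described, not a full UTF-8 decoder. The names utf8_seq_len and utf8_count are assumptions for this example. The code only inspects lead bytes and does not reject overlong or otherwise invalid sequences. With unsigned char the comparison against 0x80 works directly; with a signed char, bytes 0x80 and above would compare as negative.

```c
#include <stddef.h>

/* Number of bytes in the UTF-8 sequence announced by lead byte b, or 0 if
   b is a continuation byte or cannot start a sequence. No validation of
   the following bytes is attempted. */
size_t utf8_seq_len(unsigned char b)
{
    if (b < 0x80) return 1;            /* 0xxxxxxx: single byte */
    if ((b & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((b & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((b & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;
}

/* Count the sequences in a NUL-terminated string by skipping from lead
   byte to lead byte. */
size_t utf8_count(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t n = 0;
    while (*p) {
        size_t len = utf8_seq_len(*p);
        if (len == 0)
            len = 1;                   /* stray byte: step over it */
        /* Stop at the terminator so a truncated sequence cannot run past it. */
        while (len > 0 && *p) {
            p++;
            len--;
        }
        n++;
    }
    return n;
}
```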

    --
    Ben.
     
    Ben Bacarisse, Dec 31, 2010
    #3
  4. Navaneeth <> writes:
    > I have few questions on conversions between "char*" to "unsigned
    > char*" and vice versa. I am assuming casting "unsigned char*" to
    > "char*" is safe because "char" can hold all the values that an
    > "unsigned char" can hold.


    char cannot necessarily hold all the values that an unsigned char can hold.

    (Plain) char may be either signed or unsigned, depending on the
    implementation. If it's unsigned, it has exactly the same range
    as unsigned char. But if it's signed, it can hold negative values.
    Very commonly, the range of char is -128 .. +127, and the range of
    unsigned char is 0 .. 255.
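
    The actual ranges on a given implementation can be read from <limits.h>. A small sketch follows; the function names are made up for illustration.

```c
#include <limits.h>
#include <stdio.h>

/* Returns 1 if plain char is a signed type on this implementation. */
int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}

/* Print both ranges so they can be compared on a particular platform. */
void print_char_ranges(void)
{
    printf("char:          %d .. %d\n", (int)CHAR_MIN, (int)CHAR_MAX);
    printf("unsigned char: 0 .. %u\n", (unsigned)UCHAR_MAX);
}
```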

    ASCII only specifies character values from 0 to 127, but there are
    a number of extended-ASCII character sets (Latin-1, for example)
    that specify character values from 0 to 255. This makes dealing
    with Latin-1 characters as (signed) char slightly awkward.

    (EBCDIC is an 8-bit encoding; systems that use EBCDIC (almost?) always
    make plain char unsigned.)

    --
    Keith Thompson (The_Other_Keith) <http://www.ghoti.net/~kst>
    Nokia
    "We must do something. This is something. Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"
     
    Keith Thompson, Dec 31, 2010
    #4
  5. Navaneeth

    Seebs Guest

    Re: Questions on conversions between char* to unsigned char* and vice versa

    On 2010-12-31, Navaneeth <> wrote:
    > I have few questions on conversions between "char*" to "unsigned
    >char*" and vice versa. I am assuming casting "unsigned char*" to
    >"char*" is safe because "char" can hold all the values that an
    >"unsigned char" can hold.


    This is true if, and only if, you are on a system where "char" and "unsigned
    char" have the exact same range of values. Otherwise, there will be values
    that you can store in "unsigned char" that can't be stored in "char".

    > But conversion of "char*" to "unsigned char*" won't be safe as
    >"char" can hold more values.


    No, it can't. At least, so far as I recall, it's absolutely necessary
    that "unsigned char" have at least as many possible values as "char".

    >Is this understanding correct? On what
    >cases "char*" will have negative values?


    Negative values are not coherent for pointers. You probably meant "char".
    The answer is, if you're on an implementation where "char" is a signed type,
    then sometimes it could have negative values.

    > I have never seen negative values on a "char*" string. So is that
    > safe to do conversion from "char*" to "unsigned char*"?


    Maybe.

    > By conversion, I mean using casting - char* c = (char*) string;
    > where string is a "unsigned char*".


    Maybe.

    You haven't explained what you mean by "safe", though. If you convert any
    numeric value whatsoever to "unsigned char", it is guaranteed "safe" in that
    it cannot cause a processor trap, or result in a value that is not valid
    for "unsigned char". It may, however, not be the value you expected to get.
    For instance, on most modern CPUs, if you convert any of 256, 512, or 1024 to
    unsigned char, you will quite safely and reliably get the value 0. But it
    won't crash.
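
    A short sketch of that value conversion. The helper name to_uchar is an assumption for the example. In C, converting an integer to an unsigned type reduces it modulo one more than the type's maximum, so the results below hold wherever UCHAR_MAX is 255.

```c
#include <limits.h>

/* Hypothetical helper: convert an integer value to unsigned char. The
   result is v reduced modulo UCHAR_MAX + 1, so it is always in range. */
unsigned char to_uchar(long v)
{
    return (unsigned char)v;
}
```

    With UCHAR_MAX equal to 255, to_uchar(256), to_uchar(512) and to_uchar(1024) all give 0, and to_uchar(-1) gives 255.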

    > If any one wondering, why I use unsigned char - I use it for doing
    > some UTF8 processing on the string. I need to use that to skip the
    > multi-byte sequences correctly.


    So you probably do. But before you go reinventing the wheel, why not check
    to see what your implementation has for existing UTF-8 support?

    If you're at a level of experience where you're not quite sure about how
    char and unsigned char interact, I would suggest that you are probably not
    ready to reliably and consistently implement UTF-8. If you're doing it just
    to learn, hey, sounds like a fun project, good luck with that. If you're
    doing it because you want to get something done, though, consider using the
    existing code that already does it correctly.

    -s
    --
    Copyright 2010, all wrongs reversed. Peter Seebach /
    http://www.seebs.net/log/ <-- lawsuits, religion, and funny pictures
    http://en.wikipedia.org/wiki/Fair_Game_(Scientology) <-- get educated!
    I am not speaking for my employer, although they do rent some of my opinions.
     
    Seebs, Jan 1, 2011
    #5
  6. On Fri, 31 Dec 2010 04:49:03 -0800 (PST), Angus
    <> wrote:

    >In ASCII (and maybe also EBCDIC, not sure) all the printing
    >characters are represented as positive numbers - ie only lower 7
    >bits are used so converting printable characters either way should
    >make no difference.


    In EBCDIC, upper case letters range between 0xC1 and 0xE9 (and they
    are not contiguous). Digits range from 0xF1 to 0xF9. Definitely not
    the lower 7 bits. On EBCDIC systems, char defaults to unsigned char
    to avoid negative values for normal characters.

    --
    Remove del for email
     
    Barry Schwarz, Jan 1, 2011
    #6
  7. Seebs <> writes:

    > On 2010-12-31, Navaneeth <> wrote:

    <snip>
    >> I have never seen negative values on a "char*" string. So is that
    >> safe to do conversion from "char*" to "unsigned char*"?

    >
    > Maybe.
    >
    >> By conversion, I mean using casting - char* c = (char*) string;
    >> where string is a "unsigned char*".

    >
    > Maybe.
    >
    > You haven't explained what you mean by "safe", though. If you convert any
    > numeric value whatsoever to "unsigned char", it is guaranteed "safe" in that
    > it cannot cause a processor trap, or result in a value that is not valid
    > for "unsigned char". It may, however, not be the value you expected to get.
    > For instance, on most modern CPUs, if you convert any of 256, 512, or 1024 to
    > unsigned char, you will quite safely and reliably get the value 0. But it
    > won't crash.


    Did you miss the * in the question? I am not sure why you are talking
    about converting numbers to unsigned char. That is not what is being
    asked about.

    <snip>
    --
    Ben.
     
    Ben Bacarisse, Jan 1, 2011
    #7
  8. Navaneeth

    Seebs Guest

    Re: Questions on conversions between char* to unsigned char* and vice versa

    On 2011-01-01, Ben Bacarisse <> wrote:
    > Did you miss the * in the question?


    Yes.

    > I am not sure why you are talking
    > about converting numbers to unsigned char. That is not what is being
    > asked about.


    Probably because elsewhere there was a * that looked spurious, so I started
    translating everything to questions about conversions between values -- in
    particular, because of the assertion that char could hold more values than
    unsigned char. At least, I think that was how it happened; my brain is a
    mysterious place.

    -s
    --
    Copyright 2010, all wrongs reversed. Peter Seebach /
    http://www.seebs.net/log/ <-- lawsuits, religion, and funny pictures
    http://en.wikipedia.org/wiki/Fair_Game_(Scientology) <-- get educated!
    I am not speaking for my employer, although they do rent some of my opinions.
     
    Seebs, Jan 1, 2011
    #8
  9. Ben Pfaff <> writes:
    > Barry Schwarz <> writes:
    >> In EBCDIC, upper case letters range between 0xC1 and 0xE9 (and they
    >> are not contiguous). Digits range from 0xF1 to 0xF9. Definitely not
    >> the lower 7 bits. On EBCDIC systems, char defaults to unsigned char
    >> to avoid negative values for normal characters.

    >
    > It's not just a default. Having plain char be signed would be
    > nonconforming in an EBCDIC environment.


    Unless CHAR_BIT > 8, but I presume that all existing EBCDIC-based
    systems have CHAR_BIT==8. (If EBCDIC had caught on more widely
    than it did, there could easily have been, for example, EBCDIC-based
    DSPs with CHAR_BIT==32.)

    --
    Keith Thompson (The_Other_Keith) <http://www.ghoti.net/~kst>
    Nokia
    "We must do something. This is something. Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"
     
    Keith Thompson, Jan 1, 2011
    #9
