I want unsigned char * string literals

Discussion in 'C Programming' started by Michael B Allen, Jul 22, 2007.

  1. Hello,

    Early on I decided that all text (what most people call "strings" [1])
    in my code would be unsigned char *. The reasoning is that the elements
    of these arrays are decidedly not signed. In fact, they may not even
    represent complete characters. At this point I think of text as simple
    binary blobs. What charset, character encoding and termination they use
    should not be exposed in the interface used to operate on them.

    But now I have a dilemma. C string literals are signed char *. With GCC
    4 warning about every sign mismatch, my code is spewing warnings all
    over the place and I'm trying to figure out what to do about it.

    My current thought is to define a Windows style _T macro:

    #define _T(s) ((unsigned char *)s)

    Use "text" functions like:

    int
    text_copy(const unsigned char *src, unsigned char *dst, int n)
    {
    while (n-- && *src) {
    *dst++ = *src++;
    ...

    And abolish the use of traditional string functions (at least for "text").

    The code might then look like the following:

    unsigned char buf[255];
    text_copy(_T("hello, world"), buf, sizeof(buf));

    What do you think?

    If I do the above I have a lot of work to do so if someone has a better
    idea I'd really like to hear about it.

    Mike

    PS: If you have an opinion that is unfavorable (but professional) let's
    hear it.

    [1] I use the term "text" to mean stuff that may actually be displayed
    to a user (possibly in a foreign country). I use the term "string"
    to represent traditional 8 bit zero terminated char * arrays.
     
    Michael B Allen, Jul 22, 2007
    #1

  2. Michael B Allen

    Eric Sosman Guest

    Michael B Allen wrote:
    > Hello,
    >
    > Early on I decided that all text (what most people call "strings" [1])
    > in my code would be unsigned char *. The reasoning is that the elements
    > of these arrays are decidedly not signed. In fact, they may not even
    > represent complete characters. At this point I think of text as simple
    > binary blobs. What charset, character encoding and termination they use
    > should not be exposed in the interface used to operate on them.
    >
    > But now I have a dilemma. C string literals are signed char *.


    Well, no. String literals (in typical contexts) generate
    anonymous arrays of char -- just plain char, not signed char
    or unsigned char. Plain char is signed on some systems and
    unsigned on others, but it is a type of its own nevertheless.

    (People seem to have a hard time with the notion that char
    behaves like one of signed char or unsigned char, but is a
    type distinct from both. The same people seem to have no
    trouble with the fact that int is a type distinct from both
    short and long, even though on most systems it behaves exactly
    like one or the other. Go figure.)

    > With GCC
    > 4 warning about every sign mismatch, my code is spewing warnings all
    > over the place and I'm trying to figure out what to do about it.


    "Don't Do That." The compiler is telling you that the
    square peg is a poor fit for the round hole, no matter how
    hard you push on it.

    > My current thought is to define a Windows style _T macro:
    >
    > #define _T(s) ((unsigned char *)s)


    ... invading the namespace reserved to the implementation,
    thus making the code non-portable to any implementation that
    decides to use _T as one of its own identifiers. If you really
    want to pursue this folly, change the macro name. And put
    parens around the use of the argument, too.

    > Use "text" functions like:
    >
    > int
    > text_copy(const unsigned char *src, unsigned char *dst, int n)
    > {
    > while (n-- && *src) {
    > *dst++ = *src++;
    > ...
    >
    > And abolish the use of traditional string functions (at least for "text").


    You'll also need to find substitutes for the *printf family,
    for getenv, for the strto* family, for asctime and ctime, for
    most of the locale mechanism, for ...

    > The code might then look like the following:
    >
    > unsigned char buf[255];
    > text_copy(_T("hello, world"), buf, sizeof(buf));
    >
    > What do you think?


    I think you want some other programming language, possibly
    Java. If you try to do this in C, you will waste an inordinate
    amount of time and effort struggling against the language and
    (especially) against the library.

    --
    Eric Sosman
    lid
     
    Eric Sosman, Jul 22, 2007
    #2

  3. "Michael B Allen" <> wrote in message
    news:...
    > Hello,
    >
    > Early on I decided that all text (what most people call "strings" [1])
    > in my code would be unsigned char *. The reasoning is that the
    > elements of these arrays are decidedly not signed. In fact, they may not
    > even represent complete characters. At this point I think of text as
    > simple binary blobs. What charset, character encoding and termination
    > they use should not be exposed in the interface used to operate on
    > them.
    >

    char * for a list of human readable characters.
    unsigned char * for a list of arbitrary bytes - almost always octets.
    signed char * - very rare. Sometimes you might need a tiny integer. I will
    resist mentioning my campaign for 64 bit ints.

    unsigned char really ought to be "byte". Unfortunately a bad decision was
    taken to treat characters and bytes the same way, and now we are stuck with
    sizeof(char) == 1 byte.

    If you start using unsigned char* for strings then, as you have found, you
    will merrily break all the calls to string library functions. This can be
    patched up by a cast, but the real answer is not to do that in the first
    place.
    Very rarely are you interested in the actual encoding of a character. A few
    exceptions arise when you want to code lookup tables for speed, or write
    low-level routines to convert from decimal to machine letter, or put text
    into binary files in an agreed coding, but they are very few.

    --
    Free games and programming goodies.
    http://www.personal.leeds.ac.uk/~bgy1mm
     
    Malcolm McLean, Jul 22, 2007
    #3
  4. On Sun, 22 Jul 2007 15:02:31 -0400
    Eric Sosman <> wrote:

    > > With GCC
    > > 4 warning about every sign mismatch, my code is spewing warnings all
    > > over the place and I'm trying to figure out what to do about it.

    >
    > "Don't Do That." The compiler is telling you that the
    > square peg is a poor fit for the round hole, no matter how
    > hard you push on it.


    Hi Eric,

    Trying to put a square peg in a round hole does not fairly characterize
    casting char * to unsigned char *.

    > > My current thought is to define a Windows style _T macro:
    > >
    > > #define _T(s) ((unsigned char *)s)

    >
    > ... invading the namespace reserved to the implementation,
    > thus making the code non-portable to any implementation that
    > decides to use _T as one of its own identifiers. If you really
    > want to pursue this folly, change the macro name. And put
    > parens around the use of the argument, too.


    I didn't invade the namespace, MS did. Which is to say that symbol is
    unlikely to be used for anything other than what MS (and I) are using it
    for.

    But I don't see why I can't use a different symbol and retain
    compatibility with the Windows platform. I will do that.

    > > Use "text" functions like:
    > >
    > > int
    > > text_copy(const unsigned char *src, unsigned char *dst, int n)
    > > {
    > > while (n-- && *src) {
    > > *dst++ = *src++;
    > > ...
    > >
    > > And abolish the use of traditional string functions (at least for "text").


    > You'll also need to find substitutes for the *printf family,
    > for getenv, for the strto* family, for asctime and ctime, for
    > most of the locale mechanism, for ...


    That's not a big deal. I suspect that in the end I would only end up
    wrapping very few functions. I don't really use any of the above directly
    as it is.

    Note that if you need a truly internationalized solution (everyone should)
    you can't use a lot of the traditional C string functions anyway. Strncpy
    and ctype stuff is useless. Consider that web servers almost invariably
    run in the C locale so anything that depends on the locale mechanism is
    of limited use.

    > > The code might then look like the following:
    > >
    > > unsigned char buf[255];
    > > text_copy(_T("hello, world"), buf, sizeof(buf));
    > >
    > > What do you think?

    >
    > I think you want some other programming language, possibly
    > Java. If you try to do this in C, you will waste an inordinate
    > amount of time and effort struggling against the language and
    > (especially) against the library.


    I would love to use Java the language. Unfortunately its libraries,
    host OS integration, multi-threading and networking capabilities and
    just about everything else are not suitable for my purposes. C++ seems
    over-designed to me but I've never really tried to use it. The
    C language itself is ideal for me. I don't think deficiencies in text
    processing should deter me from using it.

    So I take it you just use char * for text?

    It doesn't bother you that char * isn't the appropriate type for what
    is effectively a binary blob, especially when most of the str* functions
    don't handle internationalized text anyway?

    Mike
     
    Michael B Allen, Jul 22, 2007
    #4
  5. Michael B Allen <> writes:
    > Early on I decided that all text (what most people call "strings" [1])
    > in my code would be unsigned char *. The reasoning is that the elements
    > of these arrays are decidedly not signed. In fact, they may not even
    > represent complete characters. At this point I think of text as simple
    > binary blobs. What charset, character encoding and termination they use
    > should not be exposed in the interface used to operate on them.
    >
    > But now I have a dilemma. C string literals are signed char *. With GCC
    > 4 warning about every sign mismatch, my code is spewing warnings all
    > over the place and I'm trying to figure out what to do about it.

    [...]

    No, C string literals have type 'array[N] of char'; in most, but not
    all, contexts, this is implicitly converted to 'char *'. (Consider
    'sizeof "hello, world"'.)

    My main point isn't that they're arrays rather than pointers, but that
    they're arrays of (plain) char, not of signed char. Plain char is
    equivalent to *either* signed char or unsigned char, but is still a
    distinct type from either of them. It appears that plain char is
    signed in your implementation.

    I know this doesn't answer your actual question; hopefully someone
    else can help with that.

    --
    Keith Thompson (The_Other_Keith) <http://www.ghoti.net/~kst>
    San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
    "We must do something. This is something. Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"
     
    Keith Thompson, Jul 22, 2007
    #5
  6. Michael B Allen

    pete Guest

    Michael B Allen wrote:
    >
    > Hello,
    >
    > Early on I decided that all text (what most people call "strings" [1])
    > in my code would be unsigned char *.
    > The reasoning is that the elements
    > of these arrays are decidedly not signed. In fact, they may not even
    > represent complete characters. At this point I think of text as simple
    > binary blobs. What charset,
    > character encoding and termination they use
    > should not be exposed in the interface used to operate on them.
    >
    > But now I have a dilemma. C string literals are signed char *.


    They are arrays of plain char,
    which may be either a signed or unsigned type.

    > With GCC
    > 4 warning about every sign mismatch, my code is spewing warnings all
    > over the place and I'm trying to figure out what to do about it.
    >
    > My current thought is to define a Windows style _T macro:
    >
    > #define _T(s) ((unsigned char *)s)
    >
    > Use "text" functions like:
    >
    > int
    > text_copy(const unsigned char *src, unsigned char *dst, int n)
    > {
    > while (n-- && *src) {
    > *dst++ = *src++;
    > ...
    >
    > And abolish the use of traditional string functions
    > (at least for "text").
    >
    > The code might then look like the following:
    >
    > unsigned char buf[255];
    > text_copy(_T("hello, world"), buf, sizeof(buf));
    >
    > What do you think?
    >
    > If I do the above I have a lot of work to do
    > so if someone has a better idea
    > I'd really like to hear about it.
    >
    > Mike
    >
    > PS: If you have an opinion that is unfavorable
    > (but professional) let's hear it.


    The solution is obvious: use arrays of char to contain strings.

    Using arrays of unsigned char to hold strings
    creates a problem for you, but solves nothing.

    If I have a problem
    that is caused by using arrays of char to hold strings,
    I'm unaware of what the problem is.

    --
    pete
     
    pete, Jul 22, 2007
    #6
  7. On Sun, 22 Jul 2007 22:02:42 GMT
    pete <> wrote:

    > > The code might then look like the following:
    > >
    > > unsigned char buf[255];
    > > text_copy(_T("hello, world"), buf, sizeof(buf));
    > >
    > > What do you think?
    > >
    > > If I do the above I have a lot of work to do
    > > so if someone has a better idea
    > > I'd really like to hear about it.
    > >
    > > Mike
    > >
    > > PS: If you have an opinion that is unfavorable
    > > (but professional) let's hear it.

    >
    > The solution is obvious: use arrays of char to contain strings.
    >
    > Using arrays of unsigned char to hold strings
    > creates a problem for you, but solves nothing.
    >
    > If I have a problem
    > that is caused by using arrays of char to hold strings,
    > I'm unaware of what the problem is.


    Hi pete,

    I accept that there's no technical problem with using char. But I just
    can't get over the fact that char isn't the right type for text.

    If you read data from binary file would you read it into a char buffer
    or unsigned char buffer?

    Type char is not the correct type for text. It is merely adequate for
    a traditional C 7 bit encoded "string". But char is not the right type
    for binary blobs of "text" used in internationalized programs.

    The only problem with using unsigned char is string literals and that
    seems like a weak reason to make all downstream functions use char.

    Also, technically speaking, if I used char, all internationalized string
    functions would eventually have to cast char to unsigned char so that they
    could decode, encode and interpret whole characters.

    If compilers allowed the user to specify what the type for string literals
    was, that would basically solve this "problem".

    Mike
     
    Michael B Allen, Jul 23, 2007
    #7
  8. Michael B Allen

    Eric Sosman Guest

    Michael B Allen wrote:
    > On Sun, 22 Jul 2007 15:02:31 -0400
    > Eric Sosman <> wrote:
    >
    >>> With GCC
    >>> 4 warning about every sign mismatch, my code is spewing warnings all
    >>> over the place and I'm trying to figure out what to do about it.

    >> "Don't Do That." The compiler is telling you that the
    >> square peg is a poor fit for the round hole, no matter how
    >> hard you push on it.

    >
    > Hi Eric,
    >
    > Trying to put a square peg in a round hole does not fairly characterize
    > casting char * to unsigned char *.


    Sorry: my mistake. I ought to have said round peg and
    square hole. My apologies.

    >>> My current thought is to define a Windows style _T macro:
    >>>
    >>> #define _T(s) ((unsigned char *)s)

    >> ... invading the namespace reserved to the implementation,
    >> thus making the code non-portable to any implementation that
    >> decides to use _T as one of its own identifiers. If you really
    >> want to pursue this folly, change the macro name. And put
    >> parens around the use of the argument, too.

    >
    > I didn't invade the namespace, MS did. Which is to say that symbol is
    > unlikely to be use for anything other than what MS (and I) are using it
    > for.
    >
    > But I don't see why I can't use a different symbol and retain
    > compatibility with the Windows platform. I will do that.


    Sorry again; I have no idea what you're talking about.
    Whatever it is doesn't seem to be C, in which identifiers
    beginning with _ and a capital letter belong to the implementation
    and not to the programmer.

    >>> Use "text" functions like:
    >>>
    >>> int
    >>> text_copy(const unsigned char *src, unsigned char *dst, int n)
    >>> {
    >>> while (n-- && *src) {
    >>> *dst++ = *src++;
    >>> ...
    >>>
    >>> And abolish the use of traditional string functions (at least for "text").

    >
    >> You'll also need to find substitutes for the *printf family,
    >> for getenv, for the strto* family, for asctime and ctime, for
    >> most of the locale mechanism, for ...

    >
    > That's not a big deal. I suspect that in the end I would only end up
    > wrapping very few functions. I don't really use any of the above directly
    > as it is.


    Not even printf? Are you writing for freestanding environments
    where most of the Standard library is absent?

    > Note that if you need a truly internationalized solution (everyone should)
    > you can't use a lot of the traditional C string functions anyway. Strncpy
    > and ctype stuff is useless.


    I'll agree with you about strncpy, but not about <ctype.h>.

    > Consider that web servers almost invariably
    > run in the C locale so anything that depends on the locale mechanism is
    > of limited use.


    Well, that's really not a C problem, or at least not a "C-
    only" problem. C's locale support is, admittedly, an afterthought
    if not actually a wart, and doesn't generalize to multi-threaded
    environments. But then, C itself has no notion of multiple threads,
    so what can you expect?

    >>> The code might then look like the following:
    >>>
    >>> unsigned char buf[255];
    >>> text_copy(_T("hello, world"), buf, sizeof(buf));
    >>>
    >>> What do you think?

    >> I think you want some other programming language, possibly
    >> Java. If you try to do this in C, you will waste an inordinate
    >> amount of time and effort struggling against the language and
    >> (especially) against the library.

    >
    > I would love to use Java the language. Unfortunately its libraries,
    > host OS integration, multi-threading and networking capabilities and
    > just about everything else are not suitable for my purposes. C++ seems
    > over-designed to me but I've never really tried to use it. The
    > C language itself is ideal for me. I don't think deficiencies in text
    > processing should deter me from using it.


    Then go ahead; nobody's stopping you. But if you've made up
    your mind to use C, then use C and not some Frankenstein's monster
    made of parts from one language and parts from the other. If text
    processing is important to you and C's text processing isn't rich
    enough for your needs, then either seek another language or add
    your own text-processing libraries to C. But don't try to retrofit
    C's admittedly primitive text-processing to suit your more advanced
    goals; all you're doing is putting lipstick on a pig.

    > So I take it you just use char * for text?


    That I do.

    > It doesn't bother you that char * isn't the appropriate type for what
    > is effectively a binary blob especially when most of the str* functions
    > don't handle internationalized text anyway?


    You haven't explained just why you find char* inadequate,
    and the only virtue of unsigned char* you've mentioned is that it's
    unsigned. I don't see how that helps with internationalization.

    Are you looking for wchar_t, by any chance?

    --
    Eric Sosman
    lid
     
    Eric Sosman, Jul 23, 2007
    #8
  9. Michael B Allen <> writes:
    [...]
    > I accept that there's no technical problem with using char. But I just
    > can't get over the fact that char isn't the right type for text.


    But that's exactly what it's *supposed* to be. If you're saying it
    doesn't meet that requirement, I don't disagree. Personally, I think
    it would make more sense in most environments for plain char to be
    unsigned.

    > If you read data from binary file would you read it into a char buffer
    > or unsigned char buffer?


    Probably an unsigned char buffer, but a binary file could be anything.
    If it contained 8-bit signed data, I'd use signed char.

    > Type char is not the correct type for text. It is merely adequate for
    > a traditional C 7 bit encoded "string". But char is not the right type
    > for binary blobs of "text" used in internationalized programs.
    >
    > The only problem with using unsigned char is string literals and that
    > seems like a weak reason to make all downstream functions use char.
    >
    > Also, technically speaking, if I used char all internationalized string
    > functions eventually have to cast char to unsigned char so that it could
    > decode and encode and interpret whole characters.
    >
    > If compilers allowed the user to specify what the type for string literals
    > was, that would basically solve this "problem".


    Not really; the standard functions that take strings would still
    require pointers to plain char.

    As I said, IMHO making plain char unsigned is the best solution in
    most environments. I don't know why that hasn't caught on. Perhaps
    there's too much badly written code that assumes plain char is signed.

    --
    Keith Thompson (The_Other_Keith) <http://www.ghoti.net/~kst>
    San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
    "We must do something. This is something. Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"
     
    Keith Thompson, Jul 23, 2007
    #9
  10. Michael B Allen

    pete Guest

    Michael B Allen wrote:
    >
    > Hello,
    >
    > Early on I decided that all text (what most people call "strings" [1])
    > in my code would be unsigned char *.
    > The reasoning is that the elements
    > of these arrays are decidedly not signed. In fact, they may not even
    > represent complete characters. At this point I think of text as simple
    > binary blobs. What charset,
    > character encoding and termination they use
    > should not be exposed in the interface used to operate on them.
    >
    > But now I have a dilemma. C string literals are signed char *.
    > With GCC
    > 4 warning about every sign mismatch, my code is spewing warnings all
    > over the place and I'm trying to figure out what to do about it.
    >
    > My current thought is to define a Windows style _T macro:
    >
    > #define _T(s) ((unsigned char *)s)
    >
    > Use "text" functions like:
    >
    > int
    > text_copy(const unsigned char *src, unsigned char *dst, int n)
    > {
    > while (n-- && *src) {
    > *dst++ = *src++;
    > ...
    >
    > And abolish the use of traditional string functions
    > (at least for "text").
    >
    > The code might then look like the following:
    >
    > unsigned char buf[255];
    > text_copy(_T("hello, world"), buf, sizeof(buf));
    >
    > What do you think?
    >
    > If I do the above I have a lot of work to do
    > so if someone has a better
    > idea I'd really like to hear about it.
    >
    > Mike
    >
    > PS: If you have an opinion that is unfavorable
    > (but professional) let's hear it.
    >
    > [1] I use the term "text" to mean stuff that may actually be displayed
    > to a user (possibly in a foreign country). I use the term "string"
    > to represent traditional 8 bit zero terminated char * arrays.


    I think it might be simpler to retain the char interface,
    and then cast inside your functions:

    int
    text_copy(const char *src, char *dst, int n)
    {
        unsigned char *s1 = (unsigned char *)dst;
        const unsigned char *s2 = (const unsigned char *)src;
        int copied = 0;

        while (n != 0 && *s2 != '\0') {
            *s1++ = *s2++;
            --n;
            ++copied;
        }
        while (n-- != 0) {
            *s1++ = '\0';
        }
        return copied;
    }

    --
    pete
     
    pete, Jul 23, 2007
    #10
  11. Michael B Allen

    Eric Sosman Guest

    Keith Thompson wrote:
    > Michael B Allen <> writes:
    > [...]
    >> I accept that there's no technical problem with using char. But I just
    >> can't get over the fact that char isn't the right type for text.

    >
    > But that's exactly what it's *supposed* to be. If you're saying it
    > doesn't meet that requirement, I don't disagree. Personally, I think
    > it would make more sense in most environments for plain char to be
    > unsigned.
    >
    >> If you read data from binary file would you read it into a char buffer
    >> or unsigned char buffer?

    >
    > Probably an unsigned char buffer, but a binary file could be anything.
    > If it contained 8-bit signed data, I'd use signed char.
    >
    >> Type char is not the correct type for text. It is merely adequate for
    >> a traditional C 7 bit encoded "string". But char is not the right type
    >> for binary blobs of "text" used in internationalized programs.
    >>
    >> The only problem with using unsigned char is string literals and that
    >> seems like a weak reason to make all downstream functions use char.
    >>
    >> Also, technically speaking, if I used char all internationalized string
    >> functions eventually have to cast char to unsigned char so that it could
    >> decode and encode and interpret whole characters.
    >>
    >> If compilers allowed the user to specify what the type for string literals
    >> was, that would basically solve this "problem".

    >
    > Not really; the standard functions that take strings would still
    > require pointers to plain char.
    >
    > As I said, IMHO making plain char unsigned is the best solution in
    > most environments. I don't know why that hasn't caught on. Perhaps
    > there's to much badly writen code that assumes plain char is signed.


    The historical background for C's ambiguity is fairly
    clear: The "load byte" instruction sign-extended on some
    machines and zero-extended on others (and on some, simply
    left the high-order bits of the destination register alone).
    Had C mandated either sign-extension or zero-extension, it
    would have added extra instructions to every single character
    fetch on the un-favored architectures.

    Nowadays it is a good trade to hide such minor matters
    behind a veneer of "programmer friendliness," but the economics
    (i.e., the relative cost of computer time and programmer time)
    were different when C was devised. It would, I think, be an act
    of supreme arrogance and stupidity to maintain that today's
    economic balance is the end state, subject to no further change.

    --
    Eric Sosman
    lid
     
    Eric Sosman, Jul 23, 2007
    #11
  12. Eric Sosman <> writes:
    > Keith Thompson wrote:

    [...]
    >> As I said, IMHO making plain char unsigned is the best solution in
    >> most environments. I don't know why that hasn't caught on. Perhaps
    >> there's too much badly written code that assumes plain char is signed.

    >
    > The historical background for C's ambiguity is fairly
    > clear: The "load byte" instruction sign-extended on some
    > machines and zero-extended on others (and on some, simply
    > left the high-order bits of the destination register alone).
    > Had C mandated either sign-extension or zero-extension, it
    > would have added extra instructions to every single character
    > fetch on the un-favored architectures.
    >
    > Nowadays it is a good trade to hide such minor matters
    > behind a veneer of "programmer friendliness," but the economics
    > (i.e., the relative cost of computer time and programmer time)
    > were different when C was devised. It would, I think, be an act
    > of supreme arrogance and stupidity to maintain that today's
    > economic balance is the end state, subject to no further change.


    I'm not (necessarily) suggesting that the standard should require
    plain char to be unsigned. What I'm suggesting is that most current
    implementations should probably choose to make plain char unsigned.
    Many of them make it signed, perhaps for backward compatibility, but
    IMHO it's a poor tradeoff.

    --
    Keith Thompson (The_Other_Keith) <http://www.ghoti.net/~kst>
    San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
    "We must do something. This is something. Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"
     
    Keith Thompson, Jul 23, 2007
    #12
  13. On Mon, 23 Jul 2007 01:31:22 GMT
    pete <> wrote:

    > Michael B Allen wrote:
    > >
    > > Hello,
    > >
    > > Early on I decided that all text (what most people call "strings" [1])
    > > in my code would be unsigned char *.
    > > The reasoning is that the elements
    > > of these arrays are decidedly not signed. In fact, they may not even
    > > represent complete characters. At this point I think of text as simple
    > > binary blobs. What charset,
    > > character encoding and termination they use
    > > should not be exposed in the interface used to operate on them.
    > >
    > > But now I have a dilemma. C string literals are signed char *.
    > > With GCC
    > > 4 warning about every sign mismatch, my code is spewing warnings all
    > > over the place and I'm trying to figure out what to do about it.
    > >
    > > My current thought is to define a Windows style _T macro:
    > >
    > > #define _T(s) ((unsigned char *)s)
    > >
    > > Use "text" functions like:
    > >
    > > int
    > > text_copy(const unsigned char *src, unsigned char *dst, int n)
    > > {
    > > while (n-- && *src) {
    > > *dst++ = *src++;
    > > ...
    > >
    > > And abolish the use of traditional string functions
    > > (at least for "text").
    > >
    > > The code might then look like the following:
    > >
    > > unsigned char buf[255];
    > > text_copy(_T("hello, world"), buf, sizeof(buf));
    > >
    > > What do you think?
    > >
    > > If I do the above I have a lot of work to do
    > > so if someone has a better
    > > idea I'd really like to hear about it.
    > >
    > > Mike
    > >
    > > PS: If you have an opinion that is unfavorable
    > > (but professional) let's hear it.
    > >
    > > [1] I use the term "text" to mean stuff that may actually be displayed
    > > to a user (possibly in a foreign country). I use the term "string"
    > > to represent traditional 8 bit zero terminated char * arrays.

    >
    > I think it might be simpler to retain the char interface,
    > and then cast inside your functions:
    >
    > int
    > text_copy(const char *src, char *dst, int n)
    > {
    > unsigned char *s1 = ( unsigned char *)dst;
    > const unsigned char *s2 = (const unsigned char *)src;


    Hi pete,

    Ok, I'm giving in. I asked, I got an answer and you guys are right.

    Even though char is wrong, it's just another little legacy wart with
    no serious technical impact other than the fact that to inspect bytes
    within the text one should cast to unsigned char first. So if casting
    has to occur, doing it in the base functions is a lot more elegant than
    casting every string literal throughout the entire codebase.
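    Fleshed out, pete's approach might look like the following (a sketch; the NUL-termination and return-value conventions are my own additions, not anything specified in the thread):

```c
#include <stddef.h>

/* Copy at most n-1 bytes of src into dst, always NUL-terminating
   when n > 0. Returns the number of bytes copied, excluding the
   terminator. The unsigned char view is confined to the function
   body, so call sites need no casts and no _T() macro. */
size_t text_copy(const char *src, char *dst, size_t n)
{
    const unsigned char *s = (const unsigned char *)src;
    unsigned char *d = (unsigned char *)dst;
    size_t copied = 0;

    if (n == 0)
        return 0;
    while (copied + 1 < n && *s != '\0') {
        *d++ = *s++;
        copied++;
    }
    *d = '\0';
    return copied;
}
```

    Call sites then take plain string literals directly: `text_copy("hello, world", buf, sizeof buf);`.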

    But in the hope that someday compilers will provide an option for char to
    be unsigned, I have started to replace all instances of the char type
    with my own typedef, so that when that day comes I can tweak one line of
    code and have what I want.
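    A minimal sketch of that single switch point (the alias name `tchar` is hypothetical, mine rather than anything from the thread or a standard):

```c
/* Every piece of text in the codebase uses tchar instead of char.
   Changing this one line to `typedef unsigned char tchar;` would
   flip the signedness of text across the whole project. */
typedef char tchar;

/* Hypothetical declaration written against the alias. */
int text_length(const tchar *s);
```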

    Actually I see GCC has a -funsigned-char option that seems to be what
    I want but it didn't seem to have any effect on the warnings.

    Mike
     
    Michael B Allen, Jul 23, 2007
    #13
  14. Michael B Allen

    Ian Collins Guest

    Michael B Allen wrote:
    >
    > Actually I see GCC has a -funsigned-char option that seems to be what
    > I want but it didn't seem to have any effect on the warnings.
    >

    Could it be that it simply makes char unsigned?

    --
    Ian Collins.
     
    Ian Collins, Jul 23, 2007
    #14
  15. Michael B Allen

    Alan Curry Guest

    In article <>,
    Michael B Allen <> wrote:
    >
    >Actually I see GCC has a -funsigned-char option that seems to be what
    >I want but it didn't seem to have any effect on the warnings.


    -funsigned-char affects the compiler's behavior, possibly causing your
    program to behave differently, but it doesn't make your code correct. Correct
    code works when compiled with either -fsigned-char or -funsigned-char.
    The warning is designed to help you make your code correct, by alerting you
    when you've done something which might not work the same if you changed from
    -funsigned-char to -fsigned-char (or from gcc to some other compiler that
    doesn't let you choose).

    If you got different warnings depending on your -f[un]signed-char option,
    you'd have to compile your code twice to see all the possible warnings. That
    wouldn't be friendly.

    --
    Alan Curry
     
    Alan Curry, Jul 23, 2007
    #15
  16. Michael B Allen

    Eric Sosman Guest

    Michael B Allen wrote:
    > [...]
    >
    > Even though char is wrong, it's just another little legacy wart with
    > no serious technical impact other than the fact that to inspect bytes
    > within the text one should cast to unsigned char first. [...]


    It is unnecessary to cast anything in order to "inspect"
    a character in a string. *cptr == 'A' and *cptr == 'ß' work
    just fine (on systems that have a ß character), and there's
    no need to cast either *cptr or the constant.

    Perhaps you're unhappy about the casting that *is* needed
    for the <ctype.h> functions, and I share your unhappiness.
    But that's not really a consequence of the sign ambiguity of
    char; rather, it follows from the functions' having a domain
    consisting of all char values *plus* EOF. Were it not for the
    need to handle EOF -- a largely useless addition, IMHO -- there
    would be no need to cast when using <ctype.h>.
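    In practice the cast Eric describes looks like this (an illustrative helper, not code from the thread): the argument to a <ctype.h> function must be representable as an unsigned char or be EOF, so a possibly-negative plain char is converted first.

```c
#include <ctype.h>
#include <stddef.h>

/* Count the alphabetic bytes in a NUL-terminated string. The cast
   to unsigned char keeps the isalpha() call defined even where
   plain char is signed and a byte's value is negative. */
size_t count_alpha(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++)
        if (isalpha((unsigned char)*s))
            n++;
    return n;
}
```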

    However, that's far from the worst infelicity in the C
    library. The original Standard tried (mostly) to codify
    C-as-it-was, not to replace it with C-remade-in-trendy-mode.
    The <ctype.h> functions -- and their treatment of EOF -- were
    already well-established before the first Standard was written,
    and the writers had little choice but to accept them.

    --
    Eric Sosman
     
    Eric Sosman, Jul 23, 2007
    #16
  17. On Mon, 23 Jul 2007 09:02:04 -0400
    Eric Sosman <> wrote:

    > Michael B Allen wrote:
    > > [...]
    > >
    > > Even though char is wrong, it's just another little legacy wart with
    > > no serious technical impact other than the fact that to inspect bytes
    > > within the text one should cast to unsigned char first. [...]

    >
    > It is unnecessary to cast anything in order to "inspect"
    > a character in a string. *cptr == 'A' and *cptr == 'ß' work
    > just fine (on systems that have a ß character), and there's
    > no need to cast either *cptr or the constant.


    Hi Eric,

    The above code will not work with non-latin1 character encodings (most
    importantly UTF-8). That will severely limit its portability from an i18n
    perspective (e.g. no CJK). And even domestically you're going to run into
    trouble soon. Standards related to Kerberos, LDAP, GSSAPI and many more
    are basically saying they don't care about codepages anymore. Everything
    is going to be UTF-8 (except on Windows, which will of course continue
    to use wchar_t).

    > Perhaps you're unhappy about the casting that *is* needed
    > for the <ctype.h> functions, and I share your unhappiness.
    > But that's not really a consequence of the sign ambiguity of
    > char; rather, it follows from the functions' having a domain
    > consisting of all char values *plus* EOF. Were it not for the
    > need to handle EOF -- a largely useless addition, IMHO -- there
    > would be no need to cast when using <ctype.h>.


    Forget casting, the ctype functions don't even work at all if the high
    bit is on. Ctype only works with ASCII.

    > However, that's far from the worst infelicity in the C
    > library. The original Standard tried (mostly) to codify
    > C-as-it-was, not to replace it with C-remade-in-trendy-mode.
    > The <ctype.h> functions -- and their treatment of EOF -- were
    > already well-established before the first Standard was written,
    > and the writers had little choice but to accept them.


    Ok. A little history is nice. But I really think these discussions
    should be punctuated with saying that the C standard library is basically
    useless at this point.

    ctype - useless for i18n
    errno - a classic non-standard standard
    locale - no context object so it can't be safely used in libraries
    setjmp - not portable
    signal - no comment necessary
    stdio - no context object to keep state separate (e.g. can't mix wide
    and non-wide I/O)
    stdlib - malloc has no context object
    string - useless for i18n

    If we're ever going to create a new "standard" library for C the first
    step is to admit that the one we have now is useless for anything but
    hello world programs.

    Mike
     
    Michael B Allen, Jul 23, 2007
    #17
  18. Michael B Allen

    Eric Sosman Guest

    Michael B Allen wrote On 07/23/07 12:53,:
    > On Mon, 23 Jul 2007 09:02:04 -0400
    > Eric Sosman <> wrote:
    >> [...]
    >> Perhaps you're unhappy about the casting that *is* needed
    >>for the <ctype.h> functions, and I share your unhappiness.
    >>But that's not really a consequence of the sign ambiguity of
    >>char; rather, it follows from the functions' having a domain
    >>consisting of all char values *plus* EOF. Were it not for the
    >>need to handle EOF -- a largely useless addition, IMHO -- there
    >>would be no need to cast when using <ctype.h>.

    >
    >
    > Forget casting, the ctype functions don't even work at all if the high
    > bit is on. Ctype only works with ASCII.


    First, C does not assume ASCII character encodings,
    and runs happily on systems that do not use ASCII. The
    only constraints on the encoding are (1) that the available
    characters include a specified set of "basic" characters,
    (2) that the codes for the basic characters be non-negative,
    and (3) that the codes for the characters '0' through '9'
    be consecutive and ascending. Any encoding that meets
    these requirements -- ASCII or not -- is acceptable for C.
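    Constraint (3) is what makes the classic digit-parsing idiom portable to any conforming encoding, ASCII or EBCDIC alike; a small sketch:

```c
/* Parse a non-negative decimal prefix of s. Relies only on the
   guarantee that '0' through '9' are consecutive and ascending,
   which holds in every conforming C implementation. */
long parse_decimal(const char *s)
{
    long value = 0;
    while (*s >= '0' && *s <= '9')
        value = value * 10 + (*s++ - '0');
    return value;
}
```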

    Second, the <ctype.h> functions are required to accept
    arguments whose values cover the entire range of unsigned
    char (plus EOF). Half those values have the high bit set,
    and the <ctype.h> functions cannot ignore that half.

    > Ok. A little history is nice. But I really think these discussions
    > should be punctuated with saying that the C standard library is basically
    > useless at this point.


    If you think so, then why use C? You're planning on
    throwing away the entire library and changing the handling
    of text in fundamental ways (ways that go far beyond your
    initial "I want unsigned text" plea). The result would be
    a programming language in which existing C programs would
    not run and perhaps would not compile; why are you so set
    on calling this new and different language "C?" Call it
    "D" or "Sanskrit" or "Baloney" if you like, but it ain't C.

    --
     
    Eric Sosman, Jul 23, 2007
    #18
  19. Michael B Allen said:

    <snip>

    > Forget casting, the ctype functions don't even work at all if the high
    > bit is on. Ctype only works with ASCII.


    Strange, that - I've used it with EBCDIC, with the high bit set, and it
    worked just fine. I wonder what I'm doing wrong.

    > If we're ever going to create a new "standard" library for C the first
    > step is to admit that the one we have now is useless for anything but
    > hello world programs.


    The standard C library could be a lot, lot better, it's true, but it's
    surprising just how much can be done with it if you try.

    --
    Richard Heathfield <http://www.cpax.org.uk>
    Email: -www. +rjh@
    Google users: <http://www.cpax.org.uk/prg/writings/googly.php>
    "Usenet is a strange place" - dmr 29 July 1999
     
    Richard Heathfield, Jul 23, 2007
    #19
  20. On Mon, 23 Jul 2007 13:31:24 -0400
    Eric Sosman <> wrote:

    > Michael B Allen wrote On 07/23/07 12:53,:
    > > On Mon, 23 Jul 2007 09:02:04 -0400
    > > Eric Sosman <> wrote:
    > >> [...]
    > >> Perhaps you're unhappy about the casting that *is* needed
    > >>for the <ctype.h> functions, and I share your unhappiness.
    > >>But that's not really a consequence of the sign ambiguity of
    > >>char; rather, it follows from the functions' having a domain
    > >>consisting of all char values *plus* EOF. Were it not for the
    > >>need to handle EOF -- a largely useless addition, IMHO -- there
    > >>would be no need to cast when using <ctype.h>.

    > >
    > >
    > > Forget casting, the ctype functions don't even work at all if the high
    > > bit is on. Ctype only works with ASCII.

    >
    > First, C does not assume ASCII character encodings,
    > and runs happily on systems that do not use ASCII. The
    > only constraints on the encoding are (1) that the available
    > characters include a specified set of "basic" characters,
    > (2) that the codes for the basic characters be non-negative,
    > and (3) that the codes for the characters '0' through '9'
    > be consecutive and ascending. Any encoding that meets
    > these requirements -- ASCII or not -- is acceptable for C.


    True. I forgot about EBCDIC and such (thanks Richard).

    But that is just a pedantic distraction from the real point, which is that
    your code will not work with non-latin1 encodings, and that is going to
    seriously impact its portability.

    > Second, the <ctype.h> functions are required to accept
    > arguments whose values cover the entire range of unsigned
    > char (plus EOF). Half those values have the high bit set,
    > and the <ctype.h> functions cannot ignore that half.


    #include <stdio.h>
    #include <ctype.h>

    #define CH 0xdf

    int
    main(void)
    {
        printf("%c %d %x\n", CH, CH, CH);

        printf("isalnum=%d\n", isalnum(CH));
        printf("isalpha=%d\n", isalpha(CH));
        printf("iscntrl=%d\n", iscntrl(CH));
        printf("isdigit=%d\n", isdigit(CH));
        printf("isgraph=%d\n", isgraph(CH));
        printf("islower=%d\n", islower(CH));
        printf("isupper=%d\n", isupper(CH));
        printf("isprint=%d\n", isprint(CH));
        printf("ispunct=%d\n", ispunct(CH));
        printf("isspace=%d\n", isspace(CH));

        return 0;
    }

    $ LANG=en_US.ISO-8859-1 ./t
    ß 223 df
    isalnum=0
    isalpha=0
    iscntrl=0
    isdigit=0
    isgraph=0
    islower=0
    isupper=0
    isprint=0
    ispunct=0
    isspace=0

    Again, even if these functions did work they *still* wouldn't handle
    non-latin1 encodings (e.g. UTF-8).
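    To make the UTF-8 point concrete: 'ß' (U+00DF) is encoded as the two bytes 0xC3 0x9F, so no byte-at-a-time classifier can see it as a single letter. A sketch of the byte-level structure (an illustrative helper, not a proposed API):

```c
/* Classify a single UTF-8 code unit:
   0 = ASCII byte, 1 = lead byte of a multi-byte sequence,
   2 = continuation byte. A classifier like <ctype.h>'s, working
   one byte at a time, cannot recognize a multi-byte character. */
int utf8_byte_kind(unsigned char b)
{
    if (b < 0x80)
        return 0;             /* 0xxxxxxx: ASCII */
    if ((b & 0xC0) == 0x80)
        return 2;             /* 10xxxxxx: continuation */
    return 1;                 /* 11xxxxxx: lead */
}
```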

    > > Ok. A little history is nice. But I really think these discussions
    > > should be punctuated with saying that the C standard library is basically
    > > useless at this point.

    >
    > If you think so, then why use C? You're planning on
    > throwing away the entire library and changing the handling
    > of text in fundamental ways (ways that go far beyond your
    > initial "I want unsigned text" plea). The result would be
    > a programming language in which existing C programs would
    > not run and perhaps would not compile; why are you so set
    > on calling this new and different language "C?" Call it
    > "D" or "Sanskrit" or "Baloney" if you like, but it ain't C.


    I think that you should consider the possibility that programming
    requirements are changing and that discussing the history of C will have
    no impact on that. Anyone who could move to Java or .NET already has. The
    rest of us are doing systems programming that needs to be C (like me).

    If standards mandate UTF-8, your techniques will have to change or you're
    going to be doing a lot of painful character encoding conversions at
    interface boundaries.

    Mike
     
    Michael B Allen, Jul 23, 2007
    #20
