size_t, ssize_t and ptrdiff_t

Discussion in 'C Programming' started by James Harris, Oct 12, 2013.

  1. James Harris

    James Harris Guest

    This post is really about how size_t, ssize_t and ptrdiff_t are intended to
    be used but first, have I got the following right about the basics?

    * size_t: able to hold the size of the largest object, always unsigned,
    returned by sizeof.

    * ssize_t: possibly not a C standard type, perhaps part of Posix, able to
    hold the size of most objects or -1 but usually able to also hold a wider
    range of negative numbers than just -1

    * ptrdiff_t: signed difference between two pointers to parts of the same
    object
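
    For concreteness, here's a minimal sketch of where each one arises in
    practice. The read() call, and printing an ssize_t with %zd, are
    POSIX/Linux assumptions rather than ISO C:

    #include <stdio.h>
    #include <stddef.h>    /* size_t, ptrdiff_t */
    #include <unistd.h>    /* ssize_t, read() - POSIX, not ISO C */

    int main(void)
    {
        char buf[100];
        size_t n = sizeof buf;              /* sizeof yields a size_t */
        ptrdiff_t d = &buf[75] - &buf[25];  /* pointers into the same object */
        ssize_t got = read(0, buf, n);      /* -1 on error, else bytes read */
        printf("%zu %td %zd\n", n, d, got);
        return 0;
    }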

    It's not so much their mechanics but the intents and limits puzzle me a bit
    so....

    Are all three needed? With a clean slate, would different definitions be
    better? Does C have it right to just have the two and was Posix right to add
    a signed version of size_t? Would any implementation of ssize_t ever be
    different from that of ptrdiff_t?

    It seems there is or was the potential for code pointers and data pointers
    to be different sizes, e.g. as in the old segmentation models where one
    could be 16 bits and the other could be larger. If so, should there be
    pointer difference and size variants for code and data or should the old
    models simply never have existed? (Rhetorical!) With x86-64 should C have
    different sizes of code and data pointers? (I sure hope not.)

    If an implementation allowed a single object to be bigger than half the
    address space could operations on it break code using ssize_t and ptrdiff_t,
    when the result is out of range of the signed type?

    These are the only types I am aware of which are designed specifically to
    represent quantities of bytes of memory. Does C have any others that I have
    missed?

    James
     
    James Harris, Oct 12, 2013
    #1

  2. Yes, and also the type you pass to malloc(), so it must be able to hold
    the size of the largest possible dynamically allocated buffer.
    size_t is defined to be an unsigned type. So ssize_t is needed to patch
    things up. It's intended as a signed replacement for size_t. But it's
    usually the same width. So it's possible to generate overflow errors by
    subtracting two size_ts. Probably not a real problem.
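
    A minimal sketch of that hazard, assuming the usual 64-bit size_t:

    #include <stdio.h>
    #include <stddef.h>

    int main(void)
    {
        size_t a = 3, b = 5;
        printf("%zu\n", a - b);  /* wraps: prints 18446744073709551614
                                    if size_t is 64 bits */
        if (a - b < 0)           /* never true: unsigned can't be negative */
            puts("unreachable");
        return 0;
    }
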
    Not really. My view is that int should be the natural integer size for the
    machine, therefore the default choice for an index variable, also the type
    you pass to malloc. If you somehow manage to declare an object bigger than
    an int, either you're using a weird and wonderful architecture, or it's
    a massive thing that dominates your program's memory management strategy.
    So you have to handle it specially.
    So size_t, ssize_t, ptrdiff_t and int should be the same thing, ideally.
    Unfortunately you can construct odd cases where this might not work, and
    you have the problem that on 64 bit architectures, 32 bits are often
    significantly faster than 64. Since you rarely need more than 2 billion
    things (where did that data come from if you've got 2 billion records?),
    int is defined as 32 bits. Then the whole language collapses in a mass
    of fixes and weird identifiers and special cases.
     
    Malcolm McLean, Oct 12, 2013
    #2

  3. Ike Naar

    Ike Naar Guest

    Is amd64 a "weird and wonderful" architecture?
    It has 32-bit int while size_t and ssize_t are 64-bit.
     
    Ike Naar, Oct 12, 2013
    #3
  4. Yes, mostly. It's possible, in principle, for an implementation to
    permit objects bigger than size_t bytes. malloc() is inherently limited
    to SIZE_MAX bytes, but calloc() isn't, and object declarations aren't.
    You just couldn't apply sizeof to such an object and get a meaningful
    result. But in practice, implementations don't support objects bigger
    than SIZE_MAX bytes, and we can assume that as an upper bound for this
    discussion.
    You can drop "possibly" and "perhaps". ssize_t is not mentioned by ISO
    C, but is defined by POSIX.

    POSIX requires ssize_t to represent -1, but the C rules for signed
    integer types require the range to be symmetric except possibly for one
    extra negative value, implying that ssize_t can hold negative values at
    least down to -SSIZE_MAX.

    Typically ssize_t is the signed type corresponding to the unsigned type
    size_t, but as far as I can tell POSIX doesn't require that.

    Note that printf's "%zu" format takes an argument of type size_t, and
    "%zd" takes an argument of the signed type corresponding to size_t.
    Neither ISO C nor POSIX has a name that's guaranteed to refer to this
    type. (The Linux printf(3) man page says that "%zd" is for ssize_t;
    apparently that's a guarantee made by Linux but not by POSIX or ISO C.)
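
    In code, with the second printf leaning on the typical (but not
    guaranteed) situation where ssize_t is that corresponding signed type:

    #include <stdio.h>
    #include <sys/types.h>  /* ssize_t - POSIX */

    int main(void)
    {
        size_t n = sizeof(long);
        ssize_t e = -1;      /* e.g. an error return from read() */
        printf("%zu\n", n);  /* %zu: size_t */
        printf("%zd\n", e);  /* %zd: the signed counterpart of size_t;
                                right here only if ssize_t is that type */
        return 0;
    }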
    Right. Note that it's not guaranteed that pointer subtraction cannot
    overflow, but implementations typically make ptrdiff_t big enough to
    prevent that.
    An implementation might make ssize_t a signed type of the same size as
    size_t, and ptrdiff_t a *wider* signed type to guarantee that pointer
    subtraction never overflows.
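
    You can see which choice your own implementation made by comparing the
    C99 limit macros:

    #include <stdio.h>
    #include <stddef.h>
    #include <stdint.h>  /* SIZE_MAX, PTRDIFF_MAX */

    int main(void)
    {
        /* On a typical I32LP64 system both are 64-bit values, but an
           implementation could legally make ptrdiff_t wider than size_t. */
        printf("SIZE_MAX    = %zu\n", (size_t)SIZE_MAX);
        printf("PTRDIFF_MAX = %td\n", (ptrdiff_t)PTRDIFF_MAX);
        return 0;
    }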
    By "code pointers", I presume you mean function pointers. In ISO C,
    there's no requirement, or even implication, that function pointers and
    object pointers are the same size. C does not support arithmetic on
    function pointers or sizeof on functions, so function pointers are
    irrelevant to the characteristics of size_t, ssize_t, and ptrdiff_t.

    The definition of the POSIX dlsym() function implies that a function
    pointer (at least one defined in a shared library) can be stored in a
    void* without loss of information. But I don't think POSIX says
    anything about arithmetic on function pointers.
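
    The usual dlsym() idiom looks like this; the library and symbol names
    are just examples, and on Linux you'd link with -ldl:

    #include <stdio.h>
    #include <dlfcn.h>  /* POSIX: dlopen, dlsym, dlclose */

    int main(void)
    {
        void *h = dlopen("libm.so.6", RTLD_LAZY);  /* platform-specific name */
        if (h == NULL)
            return 1;
        /* dlsym returns void*, which POSIX implies can carry a function's
           address; the cast recovers the function pointer. */
        double (*cosine)(double) = (double (*)(double))dlsym(h, "cos");
        if (cosine != NULL)
            printf("%f\n", cosine(0.0));  /* prints 1.000000 */
        dlclose(h);
        return 0;
    }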
    There's no requirement for ssize_t or ptrdiff_t to be the same size as
    size_t, and I think there have been implementations where ptrdiff_t is
    wider than size_t.

    On the other hand, there's no requirement for an implementation to make
    ptrdiff_t wide enough to prevent overflow on pointer subtraction.
    Not that I can think of.
     
    Keith Thompson, Oct 12, 2013
    #4
  5. But how many objects of greater than 2GB does the average program need?
    Remember we're talking about flat objects here, not a tree with lots of
    embedded images, for example.
    Such objects might become common as memory sizes increase and people start
    using computers for presently unheard-of applications. But somehow I don't
    think people will be using a language which insists that index variables
    be something called a size_t.
     
    Malcolm McLean, Oct 12, 2013
    #5
  6. BartC

    BartC Guest

    In a new language, you don't really want untidy features such as these. I
    think even in C itself, they were bolted on decades later. The problems they
    are trying to solve can be appreciated, but how do other languages deal with
    them?
    In general, if a 32-bit (char) pointer can cover a 0 to 4 billion range,
    then the difference between two pointers is going to need a range of +/- 4
    billion. However a pointer might have that range, yet a single object might
    be limited to 2 billion in size. They are solving different problems.

    An actual language however could simply not allow one pointer to be
    subtracted from another (solving that problem!). I think even C only allows
    this between two pointers within the same object; so if objects have a
    2-billion limit, then that also solves the problem in this instance.
    There would be something wrong if an object was bigger than could be
    represented by ssize_t.

    To simplify the problems a little, in most cases the choices for all these
    types are going to be either signed or unsigned, and either 32 or 64 bits!
    Four options. Signed 64-bits covers all the possibilities, if you want to
    keep things simple.
     
    BartC, Oct 13, 2013
    #6
  7. James Harris

    James Harris Guest

    You mean that an int should be the size of an address? That makes sense. At
    least, the integer type which is most used should be the one that is the
    same size as an address. Then the programmer by default gets a "safe"
    integer and yet is not precluded from choosing a smaller one if desired.

    Two issues intersect in an unfortunate way: 1. implementations don't
    always work that way, and 2. programs often index arrays with ints. AIUI
    one should probably make loop index variables size_t but they are often
    made int because int is more familiar and normally works.

    As well as being wide enough for any index, size_t is also unsigned so
    will not run into any wraparound issues.
    That's true for ILP32 (at least if the int is unsigned), but because
    x86-64 often uses I32LP64, won't this become more of an issue as programs
    get to deal with increased object sizes?
    I think that's what got me in to this issue. As computers provide larger and
    larger memories it seems quite possible that programs will have to cope with
    increasingly large objects. Such objects don't have to be created in memory.
    They could simply be mapped into memory or at least mapped into the address
    space so may end up being very large. If an int is four bytes wide then
    arrays of chars, shorts and ints will wrap at 2 GB, 4 GB, 8 GB or 16 GB
    depending on the element type and whether the index is defined as signed or
    not. So a large object may seem to work but then fail mysteriously as it
    gets larger.
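
    A sketch of the failure mode, with a hypothetical fill routine; it only
    misbehaves once len actually exceeds what an int can index:

    #include <stddef.h>

    /* Assume big points at a mapped object of len bytes. */
    void fill(char *big, size_t len)
    {
        /* With a 4-byte int this is undefined behaviour once idx is
           incremented past INT_MAX, i.e. past the 2 GB mark. */
        for (int idx = 0; (size_t)idx < len; idx++)
            big[idx] = 0;

        /* A size_t index is wide enough for any object size. */
        for (size_t i = 0; i < len; i++)
            big[i] = 0;
    }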

    The actual numbers don't matter much. The main point seems to be that a
    piece of code could work perfectly when tested on smaller datasets even if
    some of those datasets are large but then fail (and, most importantly, fail
    silently) on larger ones. Silent failures are the worst as they may go
    completely undetected.

    James
     
    James Harris, Oct 14, 2013
    #7
  8. So the index variable needs to be size_t. So we use size_t for variables
    holding sizes of memory, counts of things in memory, index variables, and
    of course intermediate values used for calculating indices. That's probably
    a majority of integers in the typical program. But only a few of those
    are to hold sizes of things in bytes. So size_t is our default integer type,
    and it doesn't usually hold a size. So why not call it something else, e.g.
    "int"?
    Then the signed/unsigned problem is a serious one. If you make variables
    unsigned, then intermediate calculations which can go negative may fail
    or give confusing results. If you make them signed, you lose a bit, so
    occasionally ptr1 - ptr2 might overflow, unless you increase the width,
    realistically by a factor of two, just to handle the corner case of massive
    objects covering half the address space.
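
    The classic place that bites is the downward loop:

    #include <stddef.h>

    void wipe(char *buf, size_t n)
    {
        /* BROKEN: i >= 0 is always true for an unsigned type, so this
           never terminates - i wraps from 0 to SIZE_MAX instead:
           for (size_t i = n - 1; i >= 0; i--) buf[i] = 0;           */

        for (size_t i = n; i-- > 0; )  /* one common correct rewrite */
            buf[i] = 0;
    }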
     
    Malcolm McLean, Oct 14, 2013
    #8
  9. James Harris

    James Harris Guest

    Unless you are proposing that I32LP64 systems be outlawed (are you?) I would
    say that "int" is no good for two reasons: first, int is already used and
    second, int is signed, as you mention below.

    It does seem a good idea to me to have signed and unsigned integers of the
    same width as an address, preferably with names that are cross-platform.
    Yes, making a double-width signed integer (which would be able to hold all
    possible single-width signed and unsigned values) might seem simple but
    could be wasteful and slow. On a 64-bit machine that would require a 128-bit
    signed integer. Useful but possibly overkill?

    AIUI C tends to "promote" signed integers to unsigned ones which can be
    unfortunate when both are used in an expression but, that aside, would it be
    sufficient to have address-width signed and unsigned integers and,
    otherwise, leave the programmer responsible for dealing with wrapping?

    A name for such integers would need to be convenient but is otherwise
    unimportant. As mentioned, int is reserved and is also signed. How about

    sigint /* signed address-width integer */
    unsint /* unsigned address-width integer */

    or

    si /* signed address-width integer */
    ui /* unsigned address-width integer */

    I hate proposing specific names as the names themselves are less important
    than the concept so these are just for the purposes of illustration and
    chosen so they don't conflict with any reserved names. The point is to query
    whether those two address-width integers would be a good idea that can be
    used regardless of the ILP model the implementation is using. It would
    result in things like

    for (ui i = 0; i < object_size; i++)

    James
     
    James Harris, Oct 14, 2013
    #9
  10. BartC

    BartC Guest

    But addresses now are odd sizes. A desktop PC with byte-addressed memory
    might easily have more than 32 bits of addressing, whether virtual or
    physical.

    The choice of int width however will usually be 32 or 64.

    And what about file-sizes; what type of int do you use to ensure a file of
    any size can be represented? Files can be much larger than memory. Do you
    create, C-style, a FILEsize_t type?
    Why 128-bits? 64-bits signed can represent any possible address (I'm sure,
    until very recently, it could individually address all the RAM in every
    computer in the world), any difference between two addresses, and can index
    an array of any length, and of any element size.

    If you mean being able to deal with overflows of arithmetic ops on arbitrary
    64-bit values, then that's a different matter (and switching to 128-bits
    doesn't solve the problem, it just moves it along: if you allow A*B to be
    calculable using 128 bits, then the user will just do A*B*C!)
     
    BartC, Oct 14, 2013
    #10
  11. How about

    intptr_t
    uintptr_t

    which have been defined in <stdint.h> since C99?

    Those types aren't necessarily the same size as pointers (note that
    different pointer types may have different sizes). The requirement is
    that converting a void* to either intptr_t or uintptr_t and back to
    void* again yields the original value.
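
    The guarantee in code form; what's promised is the round trip, not any
    particular intermediate value (and both types are optional in C99):

    #include <stdio.h>
    #include <stdint.h>  /* intptr_t, uintptr_t - optional in C99 */

    int main(void)
    {
        int x = 42;
        void *p = &x;
        uintptr_t bits = (uintptr_t)p;  /* implementation-defined value */
        void *q = (void *)bits;         /* guaranteed to compare equal to p */
        printf("%s\n", p == q ? "round trip ok" : "can't happen");
        return 0;
    }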
     
    Keith Thompson, Oct 14, 2013
    #11
  12. A large majority of programs now written will never need to address
    any object over 2GB. For an array of double, a 32 bit signed int
    can address 16GB.

    If you are working with square matrices, a 32 bit int is big
    enough until you run out of a 64 bit address space.

    For current programs running on current machines, 32 bit code
    running on a 64 bit OS is probably the best choice. Even though
    a single program doesn't need more than a 32 bit int can index,
    people do run multiple programs.
    There are probably some programs written today that, in their
    lifetime, will need to index arrays with index values larger
    than int. But not so many of them.
    (snip)

    -- glen
     
    glen herrmannsfeldt, Oct 14, 2013
    #12
  13. James Harris

    James Harris Guest

    AIUI any data pointer can be converted to a void * and back again so is the
    combined implication that intptr_t and uintptr_t can hold the bits of a
    pointer to any data type?

    I'm sure the answer is there but at the moment I'm confused as to why these
    as well as size_t and ptrdiff_t have been defined. Maybe some architectures
    would resolve these four to more than two different types of integer...?

    James
     
    James Harris, Oct 14, 2013
    #13
  14. James Harris

    James Harris Guest

    Are you sure that applies to virtual addresses? I thought that one would
    need to use segmentation and would need to avoid paging to get wider logical
    addresses.
    Perhaps off topic but to me the width of a file pointer is independent of the
    size of memory. The required pointer size is a property of the file
    capacity.
    Good point, I think. ;-)
    No, I wasn't proposing that.

    James
     
    James Harris, Oct 14, 2013
    #14
  15. Assuming a 64-bit OS, applications can generally be either 32-bit or
    64-bit, which refers to the address space available to them and thus
    their pointer size.

    On most such systems, most apps are still compiled for 32-bit mode;
    64-bit mode is only used if it's expected that the app will need the
    larger address space. However, on x86, 64-bit mode also means one gets
    extra registers, a faster calling convention, etc., so it's used even
    for apps that don't need the larger address space.

    Prior to the existence of 64-bit apps (and OSes), all that we had was
    32-bit apps, so paging was required if you needed more than 2-4GB of
    data. Today, nearly all such apps have been recompiled for 64-bit.

    S
     
    Stephen Sprunk, Oct 14, 2013
    #15
  16. Joe Pfeiffer

    Joe Pfeiffer Guest

    How does paging get you access to more than 2-4GB of data? Paging is
    transparent to the program, and does not extend the program's logical
    address space. Getting a greater than 32 bit logical address space with
    a 32 bit pointer would require something like segmentation or overlays.
     
    Joe Pfeiffer, Oct 14, 2013
    #16
  17. (snip)
    Yes.

    Even a small program often needs to be able to process large files.

    Now, there were some problems in unix that might not have been
    necessary.

    If a program doesn't do any fseek()/ftell() then it should be
    able to process files of unlimited size. It turns out that,
    at least in many unix systems, that isn't true.

    (There were times when

    cat file1 | program | cat > file2

    worked but

    program < file1 > file2

    didn't. Hopefully all fixed by now.)

    Also, a program should be independent of the size of the
    disk the files are in.

    I have seen programs that refuse to install on disk partitions
    with more than 2G (and less than 4G) available. (In the days when
    larger disks weren't quite as common as today.) They used signed 32
    bit integers to compute the available space, and didn't notice
    the overflow.
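
    A sketch of how that sort of bug probably looked, with invented numbers:

    #include <stdio.h>

    int main(void)
    {
        /* Hypothetical: 3 GB free, counted in 1 KB blocks. */
        int free_blocks = 3 * 1024 * 1024;     /* 3M blocks */
        int block_size  = 1024;
        int avail = free_blocks * block_size;  /* 3G doesn't fit in a 32-bit
                                                  int: signed overflow (UB),
                                                  typically wraps negative */
        printf("%d bytes available\n", avail); /* installer sees "no room" */
        return 0;
    }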

    -- glen
     
    glen herrmannsfeldt, Oct 14, 2013
    #17
  18. (snip)
    But also IA32 systems with 36 bit physical address space were still
    limited by a 32 bit MMU. A little bit different design would have
    allowed many more years before we needed 64 bits.

    Note that often the OS needs 64 bit addressing, even when individual
    programs don't.

    -- glen
     
    glen herrmannsfeldt, Oct 14, 2013
    #18
  19. AFAICT, he was referring to app-visible paging. For instance, Windows
    Server allowed apps to have a "window" within their 32-bit address space
    that was variably mapped within a much larger virtual address space. It
    was up to application programmers to move that "window" around to access
    the various bits of data they needed.

    With a 64-bit address space, of course, that became unnecessary and
    quickly fell out of favor; now the OS transparently maps your data into
    memory whenever you access it, via a completely unrelated scheme also
    called "paging".

    A few generations earlier, DOS had a similar "Expanded Memory" (EMS)
    scheme that did basically the same thing to exceed real mode's 20-bit
    address space. Similarly, EMS quickly fell out of favor when a 32-bit
    address space, called "Extended Memory" (XMS), came into use.

    Overlays were prior to _that_ and more focused on dealing with the
    limited _physical_ RAM than the limited address space.

    S
     
    Stephen Sprunk, Oct 14, 2013
    #19
  20. BartC

    BartC Guest

    From an AMD manual:

    "In 64-bit mode, programs generate virtual (linear) addresses that can be up
    to 64 bits in size. ... physical addresses that can be up to 52 bits in
    size"

    Anyway I thought one of the points of using 64-bits was to get past the
    2GB/4GB barrier? If that's not important, that could be reflected in the
    build model where standard ints and pointers can be 32-bits (but still
    leaving the problem of needing an unsigned type to make full use of 4GB).
    It's the same (language) issue of having a suitable type to denote the size
    of some data, or for an offset or index within the data. Perhaps what I'm
    saying is, the language doesn't care how applications cope with files, why
    should it do so with arrays and strings? (By throwing in a type such as
    'size_t'.)
     
    BartC, Oct 14, 2013
    #20
