size_t, ssize_t and ptrdiff_t

Discussion in 'C Programming' started by James Harris, Oct 12, 2013.

  1. James Harris

    James Harris Guest

    ....
    Yes. Again, the offset of a part of the disk requires an integer which is
    sized suitably for the disk and not for a particular program.
    That's exactly the kind of issue I was talking about. The software company
    could have tested that their app installs on many different operating
    systems but, because of lax management of integer sizes, missed that it
    would miscalculate in the limited circumstances you mention, and then such
    a problem only gets noticed by customers. Other issues could be much more
    serious, producing incorrect results that go unnoticed by customers for
    years. In the UK recently there was a report of a high street retailer
    that, because of a faulty piece of software, had been underpaying some of
    its staff for some time. It doesn't look good when such errors are
    eventually found out.

    James
     
    James Harris, Oct 14, 2013
    #21

  2. James Harris

    James Harris Guest

    Sorry, I misunderstood. I thought when you mentioned odd sizes you were
    thinking about 36-bit in 32-bit mode or similar, which can only be physical.

    One of the good parts of the 64-bit design is that all 64 bits of the linear
    addresses are used. Programs cannot squirrel away extra meaning in otherwise
    unused bits of addresses. Hence the canonical format and the perception, at
    least, that x86-64 addresses are signed... ;-)
    I'm not sure. I was really just thinking that if a program uses N-bit
    addresses it should possibly also have N-bit signed and unsigned integer
    types, so as to make it easy to work with addresses and any other integer
    which accesses memory, including array indices. For example, a program
    running in an environment which has 16-bit addresses should have some
    data type that results in 16-bit signed and unsigned integers. In fact,
    that should be the default size for integers, if there is such a thing as
    a default, or the easiest to specify if not. That doesn't prevent the
    programmer from choosing other sizes of integers but it makes the safer
    choice the easiest one to take.

    The discussion has pointed out that the situation is a little more complex.
    Some environments have multiple sizes of pointer. For those, ISTM
    appropriate to have corresponding sizes of signed and unsigned integer.

    FWIW, as well as the old x86-16 segmented modes I wonder if similar
    non-simple pointers may one day be needed for NUMA architectures. From what
    I can find, at the moment they are limited to using a field within a wide
    address to identify the node that the RAM sits in, but it is probably a
    good idea to keep in mind that pointers may one day need to be segmented
    again.

    Also, some pointers may profitably be replaced by (object, offset) pairs. A
    bit off topic here. I just mention it for completeness.

    James
     
    James Harris, Oct 14, 2013
    #22

  3. Yes. The standard doesn't directly say that you can convert an int* to
    uintptr_t and back again without loss of information, but it would take
    a perverse implementation for it to fail.
    size_t and ptrdiff_t apply only to single objects. size_t is the type
    of the result of sizeof (and a parameter and/or result type for a number
    of standard library functions). ptrdiff_t is the result of pointer
    subtraction, which is defined only within a single object (or just past
    the end of it).

    The intptr_t types, on the other hand, have to hold the converted
    value of any valid void* pointer, which can point to any byte of *any*
    object in the currently executing program.

    For many systems the distinction doesn't matter; you'll have, say, a
    32-bit address space, and size_t, void*, et al will all be the same
    size. But it's entirely possible to have a 64-bit address space while
    limiting the size of any single object to 32 bits (or 32 and 16).
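
    To make the distinction concrete, here is a minimal sketch, assuming
    <stdint.h> provides the optional intptr_t/uintptr_t types (they are
    optional in the standard, though widely available in practice):

        #include <stdint.h>
        #include <stddef.h>
        #include <stdio.h>

        int main(void)
        {
            int arr[100];

            /* size_t: the type of sizeof, big enough for any single object. */
            size_t nbytes = sizeof arr;

            /* ptrdiff_t: pointer subtraction within one object
             * (one past the end is allowed). */
            ptrdiff_t span = &arr[100] - &arr[0];

            /* uintptr_t: round trip of an object pointer through an integer.
             * The standard states the guarantee for void*, so go via void*. */
            uintptr_t bits = (uintptr_t)(void *)&arr[42];
            int *back = (int *)(void *)bits;

            printf("%zu bytes, %td elements, round trip %s\n",
                   nbytes, span, back == &arr[42] ? "ok" : "lost");
            return 0;
        }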
     
    Keith Thompson, Oct 14, 2013
    #23
  4. Data has got to represent something.
    But lots of devices are now spewing out huge amounts of data. For instance
    an image for human viewing isn't really going to go above about 4096 x 4096,
    because there's a limit to the number of pixels a human can distinguish in
    his visual field. But a lot of microscopic slides aren't intended for direct
    human viewing, and the images can be extremely large.
     
    Malcolm McLean, Oct 14, 2013
    #24
  5. Joe Pfeiffer

    Joe Pfeiffer Guest

    Ah, OK -- I'd argue that what he's describing has more in common with
    overlays than with OS-provided paging: the programmer is using a
    single area of the program's logical address space to view different
    parts of data or code (though overlays required the program to
    physically move the data while this "paging" scheme could be built on
    top of OS-provided paging easily). I'd disagree that overlays focussed
    on the limited physical memory rather than address space; in fact, the
    first time I encountered it was on a CDC 6400 in which the logical
    address space was of variable size (and the more you wanted the more it
    cost) enforced by a limit register, and the physical address space was
    much larger than the logical space.
     
    Joe Pfeiffer, Oct 14, 2013
    #25
  6. James Harris

    James Harris Guest

    As this is going off the topic of C, I have copied to and set followups to
    comp.lang.misc.

    For context, discussion is about

    * converting between integers and pointers
    * combining integers with pointers in arithmetic
    * what sizes of integers to use
    * what signedness those integers should have

    C types discussed: size_t, ssize_t, ptrdiff_t and, latterly, intptr_t and
    uintptr_t.
    IME languages sometimes take an overly simplistic approach to pointers. Most
    I have seen disallow any access to pointers except for assignment and
    comparison. That may be a good approach - that's a separate discussion - but
    this thread was about interworking between pointers and integers, assuming a
    language makes that possible. What integer types should be available? My
    opening suggestion was that signed and unsigned integers of the same size as
    addresses should be the defaults. Then those integers, N, could be combined
    with pointers, P, with operations such as the following where -> indicates
    the mapping to a result.

    P -> N
    N -> P
    P1 - P2 -> N
    P1 + N -> P2
    P[N] -> element

    Using address-sized integers for all memory accesses, including indexing,
    would allow array indices to be large enough for even the largest possible
    array.
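
    In C terms, a rough sketch of those operations might look like the
    following, using uintptr_t as the address-sized unsigned integer and
    ptrdiff_t/size_t where the standard already provides them (which type N
    should really be is the point under discussion, not something C settles):

        #include <stdint.h>
        #include <stddef.h>

        void sketch(void)
        {
            int a[10];
            int *p1 = &a[2], *p2 = &a[7];

            uintptr_t n1 = (uintptr_t)(void *)p1;   /* P -> N */
            int *q = (int *)(void *)n1;             /* N -> P */

            ptrdiff_t d = p2 - p1;                  /* P1 - P2 -> N */
            int *p3 = p1 + d;                       /* P1 + N -> P2 */

            size_t i = 5;
            int elem = p1[i];                       /* P[N] -> element */

            (void)q; (void)p3; (void)elem;          /* suppress unused warnings */
        }
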
    It might be good to allow arbitrary pointers to be subtracted, especially for
    systems programming.
    If running under a 32-bit address space I would dislike the idea of being
    restricted to 31 bits for a single object. I know that objects are seldom
    that large and OSes often take a lot of address space for themselves, but
    I cannot see a good reason why an object larger than 2 GB should not be
    possible. Also, it might be that a program wants to calculate the distance
    between the base of the stack (traditionally in high memory) and the code
    (traditionally down low). That could easily be more than 2 GB. So allowing
    for full 32-bit representations seems a good idea. However, perhaps it
    should be the programmer's responsibility to use suitable signedness.
    Simple sounds good ... as long as simple isn't a synonym for
    over-simplified!

    James
     
    James Harris, Oct 14, 2013
    #26
  7. On x86-64, yes. On other architectures, perhaps not. Hopefully
    everyone has learned from past mistakes in that area, but history shows
    that humans aren't particularly good at that.
    I'm not sure that should be visible to applications since the physical
    location may change over time as the data is paged in and out, the
    thread migrates from one core to another, etc. Some (read-only) pages
    may even be duplicated across multiple nodes for performance reasons.

    My understanding is that NUMA systems allocate a new page, or page an
    old one in, on the "current" node, assuming memory is available there,
    but they don't migrate a writable page that is on the "wrong" node.
    Indeed, some existing systems (e.g. AS/400) do that. However, the
    industry seems to be consistently moving from segmentation, which makes
    fine-grained access control easier, to flat memory spaces, which are
    apparently easier to implement C on.

    Somewhat related: fat pointers for bounds checking.
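
    As a hedged sketch of what such a fat pointer might look like (the struct
    and function here are illustrative, not any particular implementation's
    layout or API):

        #include <stddef.h>

        struct fat_ptr {
            char  *base;   /* start of the underlying object */
            size_t len;    /* size of the object in bytes */
            size_t off;    /* current offset within the object */
        };

        /* Read one byte at index i, with a bounds check on every access. */
        int fat_read(struct fat_ptr p, size_t i, char *out)
        {
            if (p.off > p.len || i >= p.len - p.off)
                return -1;                 /* out of bounds */
            *out = p.base[p.off + i];
            return 0;
        }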

    S
     
    Stephen Sprunk, Oct 14, 2013
    #27
  8. I noticed this happened a lot back before >2GB drives were common;
    attempting to install old software would often fail for "insufficient"
    disk space, probably due to overflow in the comparison logic, even when
    the GUI showed there was 100+ times as much as needed available.

    Yes, this indicates insufficient testing, but when such programs came
    out, there may not have been any such disks available to test with! And
    typical corporate policy only allows replacing equipment every three
    years or so for accounting reasons, so it persisted even after such
    drives first became common.

    I haven't seen many such problems since that era, though.

    Some OSes "solved" this by having two sets of API calls, one that
    returned 32-bit values (with saturation) and another that returned
    64-bit values. The problem is that the values were unsigned, so if the
    caller stuffed them in a signed type, the 32-bit API would still
    commonly lead to failures with >2GB drives/files. Oops.
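
    A minimal sketch of that failure mode, using a hypothetical 32-bit API
    that reports free space as an unsigned byte count (the function name here
    is made up for illustration):

        #include <stdint.h>
        #include <stdio.h>

        /* Hypothetical 32-bit API: free space in bytes, unsigned, saturating. */
        static uint32_t get_free_space_32(void) { return 3000000000u; } /* ~2.8 GB */

        int main(void)
        {
            /* Bug: stuffing the unsigned result into a signed 32-bit type.
             * The conversion is implementation-defined and typically yields
             * a negative number here. */
            int32_t free_bytes = (int32_t)get_free_space_32();

            if (free_bytes < 50 * 1024 * 1024)
                puts("insufficient disk space");  /* wrongly rejects ~2.8 GB free */
            else
                puts("enough space");
            return 0;
        }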

    S
     
    Stephen Sprunk, Oct 14, 2013
    #28
  9. Not all images are intended to be viewed in their entirety, nor could
    they be due to the limitations of current displays. But it's easier to
    have one image (at ridiculous resolution) and let the display code deal
    with pan/zoom than to deal with the complexities of tiling--to a point.

    I've not yet seen a case where individual dimensions exceed the range of
    a 32-bit integer, but the total number of pixels often does. Even
    consumer cameras (and phones!) are now in the tens of millions of
    pixels, which is getting dangerously close to that limit.

    S
     
    Stephen Sprunk, Oct 14, 2013
    #29
  10. I'm posting this just to comp.lang.c because I have some C-specific
    things to say.

    C doesn't *forbid* subtraction of arbitrary pointers, it merely says
    that such a subtraction has undefined behavior unless both pointers
    point to elements of the same array object or just past the end of it
    (where a single object can be treated as 1-element array).

    If arbitrary pointer subtraction makes sense on a particular system,
    then a compiler for that system will probably support it with the
    semantics you expect. Or you can convert both operands to intptr_t,
    do a well-defined integer subtraction, and convert the result back to a
    pointer -- though the semantics may differ from those of pointer
    subtraction.
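
    A hedged sketch of that workaround; whether the resulting number means
    anything depends entirely on the implementation's pointer representation
    (on a flat, byte-addressed machine it is typically the byte distance):

        #include <stdint.h>

        /* Subtract two possibly unrelated pointers via integer types.
         * uintptr_t is used so the subtraction itself cannot overflow;
         * the final conversion to a signed value is implementation-defined. */
        intptr_t rough_distance(const void *a, const void *b)
        {
            return (intptr_t)((uintptr_t)a - (uintptr_t)b);
        }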

    The reason C doesn't define the result of arbitrary pointer subtraction
    is that there's no consistent definition across all possible systems
    that C can support. On a system where a pointer consists of, say, a
    segment descriptor plus a byte offset, subtraction of pointers to
    distinct objects may not even be possible.

    But if you want to write non-portable code that happens to work on the
    system(s) you're interested in, C can be a good language for that, even
    if the behavior is defined by your compiler rather than by the language
    standard.
     
    Keith Thompson, Oct 14, 2013
    #30
  11. (snip, I wrote)
    Yes, so indexing needs to be more than 16 bits.

    But 32 bit indexing will get you up to 2147483647 x 2147483647,
    which is more than extremely large. Assuming we are discussing
    visible light images, the wavelength is greater than 400nm.
    I could multiply 400nm by 2147483647, but I think I will leave
    it at that.

    So, even in the case of extremely large images, 32 bit indexing
    is enough. (If one wants to copy the whole image in a 1D array,
    then, yes, 32 bits might not be enough.)

    -- glen
     
    glen herrmannsfeldt, Oct 14, 2013
    #31
  12. Reminds me of stories about doing doubly linked lists storing in
    each list element the XOR of the pointers to the two neighboring
    elements. If you know where you came from, you can find the next
    list element in either direction. Seems to me that you can also do
    it with the difference between the two pointers, though you need
    to know which direction you are going.
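
    A minimal sketch of the XOR trick in C, storing the link as a uintptr_t
    (and assuming, as discussed above, that pointers survive the round trip
    through an integer type on the implementation in question):

        #include <stdint.h>

        struct node {
            int       value;
            uintptr_t link;   /* XOR of the addresses of the two neighbours */
        };

        /* Step from 'cur' to its other neighbour, given the node we came from. */
        struct node *step(const struct node *prev, const struct node *cur)
        {
            return (struct node *)(cur->link ^ (uintptr_t)prev);
        }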


    JVM doesn't support any way of reversibly looking at the bits
    of an object reference. If a class doesn't have a toString(),
    then many give a hex representation of the reference (pointer)
    value, but there is no way to reverse that.

    Other machines from the past used similarly opaque addressing.
    Even on such a system, (A-B)+B could be A, and A-(A-B) could be B.
    Also, (A^B)^B could be A, and (A^B)^A could be B. As long as you
    can see the bits, that should be true. It is systems like JVM that
    disallow it.
    But do add some comments explaining what it requires.


    -- glen
     
    glen herrmannsfeldt, Oct 14, 2013
    #32
  13. In case anyone else was as curious as I, Google says:
    2 147 483 647 * 400 nanometers =
    858.993459 meters

    So, yeah, it's unlikely anyone will exceed 2147483647x2147483647, at
    least in an image intended to be viewed in its entirety; throw in pan
    and zoom in the display, though, and it's theoretically possible.

    S
     
    Stephen Sprunk, Oct 14, 2013
    #33
  14. (snip, I wrote)
    TeX does all its typesetting calculations in 32 bits with 16 bits
    after the binary point, in units of printer's points (1/72.27 inch).

    The unit sp (scaled point) is smaller than the wavelength of visible
    light. The maximum isn't quite as big as the figure above, so someone
    might exceed it for a billboard. But you can always apply a magnification
    factor, and probably should for a billboard.
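
    Working the numbers as a small C sketch (the constants are TeX's:
    65536 sp per point, 72.27 points per inch, and TeX's maximum dimension
    of 2^30 - 1 sp):

        #include <stdio.h>

        int main(void)
        {
            /* 1 sp = 1/65536 pt, 1 pt = 1/72.27 inch, 1 inch = 0.0254 m */
            double sp_in_metres  = (1.0 / 65536.0) / 72.27 * 0.0254;
            double max_in_metres = ((double)((1L << 30) - 1) / 65536.0)
                                   / 72.27 * 0.0254;

            printf("1 sp = %.3g m\n", sp_in_metres);   /* ~5.4e-9 m, below visible light */
            printf("max  = %.3g m\n", max_in_metres);  /* roughly 5.8 m */
            return 0;
        }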

    -- glen
     
    glen herrmannsfeldt, Oct 15, 2013
    #34
  15. Rosario1903

    Rosario1903 Guest

    4 GB is enough to contain all the code one could imagine, without data,
    so the problem is only for data...

    but storing a whole program in memory as a 64-bit program instead of a
    32-bit program doubles its size in memory, and one has to deal with
    unfriendly 64-bit numbers...

    for me even pointers could be 64-bit, 32-bit, 16-bit or 8-bit,
    like integers, because they are [unsigned?] integers
     
    Rosario1903, Oct 15, 2013
    #35
  16. That's a bit of an issue.
    But if you've got 64 bits of address space, you've almost certainly got lots
    of memory. It's likely that one or two structures will dominate your
    memory take, and there's no point at all optimising the remaining 99%.
    Those might have integer members you want to represent specially, but we're
    only talking about a few identifiers in the whole program.
     
    Malcolm McLean, Oct 15, 2013
    #36
  17. James Kuyper

    James Kuyper Guest

    On 10/14/2013 01:43 PM, glen herrmannsfeldt wrote:
    ....
    How, precisely, did it go wrong? What had to be fixed?
     
    James Kuyper, Oct 15, 2013
    #37
  18. I believe the program died when it got to 2 GB on either the output
    or input file. I don't remember what the errno was.

    As well as I remember, even with redirected I/O, a program is
    allowed to use fseek() and ftell(), and, with a 32-bit int, would
    seek to or read the wrong value. To protect against corruption
    (such as an fseek() to the wrong place) the system kills the program.

    If I remember right, that was Solaris about 1998. Programs like cat
    used fseek64() and ftell64(), and were linked with a special option,
    such that they were allowed to read/write big files.

    -- glen
     
    glen herrmannsfeldt, Oct 15, 2013
    #38
  19. Or the program cratered on its own when it got unexpected results, e.g.
    a negative file position from ftell(), which seems likely.

    When you redirect with < or >, the OS connects stdin or stdout to the
    named file rather than the console; it's still a _file_. Using "cat"
    meant that stdin and stdout were connected to a _pipe_ instead, which
    gives fseek() and ftell() well-defined behavior that apparently didn't
    crash the program.
    AFAIK, there was no need for programs to be "linked with a special
    option" to get access to fseek64()/ftell64(); those should have been
    included in the normal 32-bit libc as soon as the OS itself supported
    large files. Likewise, the 64-bit libc should have supported large
    files from the start, via both interfaces.
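
    As a hedged aside, the portable approach that later became usual is the
    POSIX fseeko()/ftello() pair with off_t; on 32-bit builds, defining
    _FILE_OFFSET_BITS=64 before the system headers typically makes off_t
    64 bits wide. A minimal sketch:

        #define _FILE_OFFSET_BITS 64  /* ask for a 64-bit off_t on 32-bit platforms */
        #include <stdio.h>
        #include <sys/types.h>

        long long current_position(FILE *fp)
        {
            off_t pos = ftello(fp);   /* like ftell(), but returns off_t */
            if (pos == (off_t)-1) {
                perror("ftello");     /* e.g. EOVERFLOW if off_t were too narrow */
                return -1;
            }
            return (long long)pos;
        }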

    There are a few possibilities I can see:

    1. cat used fseek64() and ftell64(), which use a "long long" offset
    rather than the "long" offset used by fseek() and ftell().

    2. cat used fseek() and ftell(), but it had a 64-bit "long" since it was
    compiled in 64-bit mode. (Solaris is I32LP64.)

    3. cat didn't use fseek() or ftell() at all.

    S
     
    Stephen Sprunk, Oct 15, 2013
    #39
  20. James Kuyper

    James Kuyper Guest

    I wasn't really looking for the symptoms, but the cause, and more
    precisely, how the cause of those symptoms was fixed.
    Yes, but I don't understand why that made a difference - I would have
    thought that any fseek() or ftell() occurring in "program" above that
    would cause problems when executing

    program < file1 > file2

    would cause the exact same problem when doing

    cat file1 | program | cat > file2

    How was re-direction of program output in unix handled such that the way
    "cat" is written determines whether or not an fseek() in "program" will
    fail? I would not have expected the way "cat" was written to matter, so
    long as it actually does what "cat" is supposed to do.
    Why would "cat" ever need to use fseek64() or ftell64()? As far as I can
    see, it never needs to keep more than one character of input in memory
    at a time, and never has any need to skip forward or backward through
    either the input or output files.
     
    James Kuyper, Oct 15, 2013
    #40