size_t, ssize_t and ptrdiff_t


James Harris

....
Also, a program should be independent of the size of the
disk the files are in.

Yes. Again, the offset of a part of the disk requires an integer which is
sized suitably for the disk and not for a particular program.
I have seen programs that refuse to install on disk partitions
with more than 2G (and less than 4G) available. (In the days when
larger disks weren't quite as common as today.) They used signed 32
bit integers to compute the available space, and didn't notice
the overflow.

That's exactly the kind of issue I was talking about. The software company
could have tested that their app installs on many different operating
systems but, because of lax management of integer sizes, missed that it
would miscalculate in such limited circumstances as you mention and then
such a problem only gets noticed by customers. Some issues could be much
more important and produce incorrect results and not be noticed by customers
for years. In the UK recently there was a report of a high street retailer
that, because of a faulty piece of software, had been underpaying some of
its staff for some time. It doesn't look good when such errors are
eventually found out.

James
 

James Harris

BartC said:
From an AMD manual:

"In 64-bit mode, programs generate virtual (linear) addresses that can be
up to 64 bits in size. ... physical addresses that can be up to 52 bits in
size"

Sorry, I misunderstood. I thought when you mentioned odd sizes you were
thinking about 36-bit in 32-bit mode or similar, which can only be physical.

One of the good parts of the 64-bit design is that all 64 bits of the linear
addresses are used. Programs cannot squirrel away extra meaning to otherwise
unused bits of addresses. Hence the canonical format and the perception, at
least, that x86-64 addresses are signed... ;-)
Anyway I thought one of the points of using 64-bits was to get past the
2GB/4GB barrier? If that's not important, that could be reflected in the
build model where standard ints and pointers can be 32-bits (but still
leaving the problem of needing an unsigned type to make full use of 4GB).



It's the same (language) issue of having a suitable type to denote the
size of some data, or for an offset or index within the data. Perhaps what
I'm saying is, the language doesn't care how applications cope with files,
why should it do so with arrays and strings? (Be throwing in a type such a
'size_t'.)

I'm not sure. I was really just thinking that if a program uses N-bit
addresses it should possibly also have N-bit signed and unsigned integer
types so as to make it easy to work with addresses and any other integer
which accesses memory including array indices. For example, a program
running under an environment which has 16-bit addresses should have some
data type that results in 16-bit signed and unsigned integers. In fact, that
should be the default size for integers, if there is such a thing as a
default, or the easiest to specify if not. That doesn't prevent the
programmer choosing other sizes of integers but makes it easiest to take the
safer action.

The discussion has pointed out that the situation is a little more complex.
Some environments have multiple sizes of pointer. For those, ISTM
appropriate to have corresponding sizes of signed and unsigned integer.

FWIW, as well as the old x86-16 segmented modes I wonder if similar
non-simple pointers may one day be needed for NUMA architectures. From what
I can find, at the moment they are limited to using a field within a wide
address to identify the node that the RAM sits in but it is probably a good
idea to keep in mind the idea that pointers may one day need to be segmented
again.

Also, some pointers may profitably be replaced by (object, offset) pairs. A
bit off topic here. I just mention it for completeness.

James
 

Keith Thompson

James Harris said:
AIUI any data pointer can be converted to a void * and back again so is the
combined implication that intptr_t and uintptr_t can hold the bits of a
pointer to any data type?

Yes. The standard doesn't directly say that you can convert an int* to
uintptr_t and back again without loss of information, but it would take
a perverse implementation for it to fail.
I'm sure the answer is there but at the moment I'm confused as to why these
as well as size_t and ptrdiff_t have been defined. Maybe some architectures
would resolve these four to more than two different types of integer...?

size_t and ptrdiff_t apply only to single objects. size_t is the type
of the result of sizeof (and a parameter and/or result type for a number
of standard library functions). ptrdiff_t is the result of pointer
subtraction, which is defined only within a single object (or just past
the end of it).

The intptr_t types, on the other hand, have to hold the converted
value of any valid void* pointer, which can point to any byte of *any*
object in the currently executing program.

For many systems the distinction doesn't matter; you'll have, say, a
32-bit address space, and size_t, void*, et al will all be the same
size. But it's entirely possible to have a 64-bit address space while
limiting the size of any single object to 32 bits (or 32 and 16).
 

Malcolm McLean

There are probably some programs written today that, in their
lifetime, will need to index arrays with index values larger
than int. But not so many of them.
Data has got to represent something.
But lots of devices are now spewing out huge amounts of data. For instance
an image for human viewing isn't really going to go above about 4096 x 4096,
because there's a limit to the number of pixels a human can distinguish in
his visual field. But a lot of microscopic slides aren't intended for direct
human viewing, the images can be extremely large.
 

Joe Pfeiffer

Stephen Sprunk said:
AFAICT, he was referring to app-visible paging. For instance, Windows
Server allowed apps to have a "window" within their 32-bit address space
that was variably mapped within a much larger virtual address space. It
was up to application programmers to move that "window" around to access
the various bits of data they needed.

With a 64-bit address space, of course, that became unnecessary and
quickly fell out of favor; now the OS transparently maps your data into
memory whenever you access it, via a completely unrelated scheme also
called "paging".

A few generations earlier, DOS had a similar "Expanded Memory" (EMS)
scheme that did basically the same thing to exceed real mode's 20-bit
address space. Similarly, EMS quickly fell out of favor when a 32-bit
address space, called "Extended Memory" (XMS), came into use.

Overlays were prior to _that_ and more focused on dealing with the
limited _physical_ RAM than the limited address space.

Ah, OK -- I'd argue that what he's describing has more in common with
overlays than with OS-provided paging: the programmer is using a
single area of the program's logical address space to view different
parts of data or code (though overlays required the program to
physically move the data while this "paging" scheme could be built on
top of OS-provided paging easily). I'd disagree that overlays focussed
on the limited physical memory rather than address space; in fact, the
first time I encountered it was on a CDC 6400 in which the logical
address space was of variable size (and the more you wanted the more it
cost) enforced by a limit register, and the physical address space was
much larger than the logical space.
 

James Harris

BartC said:

As this is going off the topic of C have copied to and set followups to
comp.lang.misc.

For context, discussion is about

* converting between integers and pointers
* combining integers with pointers in arithmetic
* what sizes of integers to use
* what signedness those integers should have

C types discussed: size_t, ssize_t, ptrdiff_t and, latterly, intptr_t and
uintptr_t.
In a new language, you don't really want untidy features such as these. I
think even in C itself, they were bolted on decades later. The problems
they are trying to solve can be appreciated, but how do other languages
deal with them?

IME languages sometimes take an overly simplistic approach to pointers. Most
I have seen disallow any access to pointers except for assignment and
comparison. That may be a good approach - that's a separate discussion - but
this thread was about interworking between pointers and integers, assuming a
language makes that possible. What integer types should be available? My
opening suggestion was that signed and unsigned integers of the same size as
addresses should be the defaults. Then those integers, N, could be combined
with pointers, P, with operations such as the following where -> indicates
the mapping to a result.

P -> N
N -> P
P1 - P2 -> N
P1 + N -> P2
P[N] -> element

Using address-sized integers for all memory-accesses including indexing
would allow array indices to be large enough for even the largest possible
array.
In general, if a 32-bit (char) pointer can cover a 0 to 4 billion range,
then the difference between two pointers is going to need a range of +/- 4
billion. However a pointer might have that range, yet a single object
might
be limited to 2 billion in size. They are solving different problems.

An actual language however could simply not allow one pointer to be
subtracted from another (solving that problem!). I think even C only
allows
this between two pointers within the same object; so if objects have a
2-billion limit, then that also solves the problem in this instance.

It might be good to allow arbitrary pointers to be subtracted especially for
systems programming.
There would be something wrong if an object was bigger than could be
represented by ssize_t.

If running under a 32-bit address space I would dislike the idea of being
restricted to 31 bits for a single object. I know that objects are seldom
that large and OSes often take a lot of address space for themselves but I
cannot see a good reason why an object larger than 2Gby should not be
possible. Also, it might be that a program wants to calculate the distance
between the base of the stack (traditionally in high memory) and the code
(traditionally down low). That could easily be more than 2Gby in size. So
allowing for 32-bit representations seems a good idea. However, perhaps it
should be the programmer's responsibility to use suitable signedness.
To simplify the problems a little, in most cases the choices for all these
types are going to be either signed or unsigned, and either 32 or 64 bits!
Four options. Signed 64-bits covers all the possibilities, if you want to
keep things simple.

Simple sounds good ... as long as simple isn't a synonym for
over-simplified!

James
 

Stephen Sprunk

Sorry, I misunderstood. I thought when you mentioned odd sizes you
were thinking about 36-bit in 32-bit mode or similar, which can only
be physical.

One of the good parts of the 64-bit design is that all 64 bits of the
linear addresses are used. Programs cannot squirrel away extra
meaning to otherwise unused bits of addresses. Hence the canonical
format and the perception, at least, that x86-64 addresses are
signed... ;-)

On x86-64, yes. On other architectures, perhaps not. Hopefully
everyone has learned from past mistakes in that area, but history shows
that humans aren't particularly good at that.
FWIW, as well as the old x86-16 segmented modes I wonder if similar
non-simple pointers may one day be needed for NUMA architectures.
From what I can find, at the moment they are limited to using a field
within a wide address to identify the node that the RAM sits in but
it is probably a good idea to keep in mind the idea that pointers may
one day need to be segmented again.

I'm not sure that should be visible to applications since the physical
location may change over time as the data is paged in and out, the
thread migrates from one core to another, etc. Some (read-only) pages
may even be duplicated across multiple nodes for performance reasons.

My understanding is that NUMA systems allocate a new page, or page an
old one in, on the "current" node, assuming memory is available there,
but they don't migrate a writable page that is on the "wrong" node.
Also, some pointers may profitably be replaced by (object, offset)
pairs. A bit off topic here. I just mention it for completeness.

Indeed, some existing systems (e.g. AS/400) do that. However, the
industry seems to be consistently moving from segmentation, which makes
fine-grained access control easier, to flat memory spaces, which are
apparently easier to implement C on.

Somewhat related: fat pointers for bounds checking.

S
 

Stephen Sprunk

That's exactly the kind of issue I was talking about. The software
company could have tested that their app installs on many different
operating systems but, because of lax management of integer sizes,
missed that it would miscalculate in such limited circumstances as
you mention and then such a problem only gets noticed by customers.

I noticed this happened a lot back before >2GB drives were common;
attempting to install old software would often fail for "insufficient"
disk space, probably due to overflow in the comparison logic, even when
the GUI showed there was 100+ times as much as needed available.

Yes, this indicates insufficient testing, but when such programs came
out, there may not have been any such disks available to test with! And
typical corporate policy only allows replacing equipment every three
years or so for accounting reasons, so it persisted even after such
drives first became common.

I haven't seen many such problems since that era, though.

Some OSes "solved" this by having two sets of API calls, one that
returned 32-bit values (with saturation) and another that returned
64-bit values. The problem is that the values were unsigned, so if the
caller stuffed them in a signed type, the 32-bit API would still
commonly lead to failures with >2GB drives/files. Oops.

S
 

Stephen Sprunk

Data has got to represent something. But lots of devices are now
spewing out huge amounts of data. For instance an image for human
viewing isn't really going to go above about 4096 x 4096, because
there's a limit to the number of pixels a human can distinguish in
his visual field. But a lot of microscopic slides aren't intended for
direct human viewing, the images can be extremely large.

Not all images are intended to be viewed in their entirety, nor could
they be due to the limitations of current displays. But it's easier to
have one image (at ridiculous resolution) and let the display code deal
with pan/zoom than to deal with the complexities of tiling--to a point.

I've not yet seen a case where individual dimensions exceed the range of
a 32-bit integer, but the total number of pixels often does. Even
consumer cameras (and phones!) are now in the tens of millions of
pixels, which is getting dangerously close to that limit.

S
 

Keith Thompson

I'm posting this just to comp.lang.c because I have some C-specific
things to say.

James Harris said:
It might be good to allow arbitrary pointers to be subtracted especially for
systems programming.

C doesn't *forbid* subtraction of arbitrary pointers, it merely says
that such a subtraction has undefined behavior unless both pointers
point to elements of the same array object or just past the end of it
(where a single object can be treated as 1-element array).

If arbitrary pointer subtraction makes sense on a particular system,
then a compiler for that system will probably support it with the
semantics you expect. Or you can convert both operands to intptr_t,
do a well-defined integer subtraction, and convert the result back to a
pointer -- though the semantics may differ from those of pointer
subtraction.

The reason C doesn't define the result of arbitrary pointer subtraction
is that there's no consistent definition across all possible systems
that C can support. On a system where a pointer consists of, say, a
segment descriptor plus a byte offset, subtraction of pointers to
distinct objects may not even be possible.

But if you want to write non-portable code that happens to work on the
system(s) you're interested in, C can be a good language for that, even
if the behavior is defined by your compiler rather than by the language
standard.
 

glen herrmannsfeldt

(snip, I wrote)
Data has got to represent something.
But lots of devices are now spewing out huge amounts of data.
For instance an image for human viewing isn't really going to
go above about 4096 x 4096, because there's a limit to the
number of pixels a human can distinguish in his visual field.
But a lot of microscopic slides aren't intended for direct
human viewing, the images can be extremely large.

Yes, so indexing needs to be more than 16 bits.

But 32 bit indexing will get you up to 2147483647 x 2147483647,
which is more than extremely large. Assuming we are discussing
visible light images, the wavelength is greater than 400nm.
I could multiply 400nm by 2147483647, but I think I will leave
it at that.

So, even in the case of extremely large images, 32 bit indexing
is enough. (If one wants to copy the whole image in a 1D array,
then, yes, 32 bits might not be enough.)

-- glen
 

glen herrmannsfeldt

Keith Thompson said:
I'm posting this just to comp.lang.c because I have
some C-specific things to say.
C doesn't *forbid* subtraction of arbitrary pointers, it merely says
that such a subtraction has undefined behavior unless both pointers
point to elements of the same array object or just past the end of it
(where a single object can be treated as 1-element array).

Reminds me of stories about doing doubly linked lists storing in
each list element the XOR of the pointers to the two neighboring
elements. If you know where you came from, you can find the next
list element in either direction. Seems to me that you can also do
it with the difference between the two pointers, though you need
to know which direction you are going.
If arbitrary pointer subtraction makes sense on a particular system,
then a compiler for that system will probably support it with the
semantics you expect. Or you can convert both operands to intptr_t,
do a well-defined integer subtraction, and convert the result back to a
pointer -- though the semantics may differ from those of pointer
subtraction.


The JVM doesn't support any way of reversibly looking at the bits
of an object reference. If a class doesn't override toString(),
many implementations give a hex representation of the reference
(pointer) value, but there is no way to reverse that.

Other machines from the past used similarly opaque addressing.
The reason C doesn't define the result of arbitrary pointer subtraction
is that there's no consistent definition across all possible systems
that C can support. On a system where a pointer consists of, say, a
segment descriptor plus a byte offset, subtraction of pointers to
distinct objects may not even be possible.

Even on such a system, (A-B)+B could be A, and A-(A-B) could be B.
Also, (A^B)^B could be A, and (A^B)^A could be B. As long as you
can see the bits, that should be true. It is systems like JVM that
disallow it.
But if you want to write non-portable code that happens to work on the
system(s) you're interested in, C can be a good language for that, even
if the behavior is defined by your compiler rather than by the language
standard.

But do add some comments explaining what it requires.


-- glen
 

Stephen Sprunk

But 32 bit indexing will get you up to 2147483647 x 2147483647,
which is more than extremely large. Assuming we are discussing
visible light images, the wavelength is greater than 400nm.
I could multiply 400nm by 2147483647, but I think I will leave
it at that.

In case anyone else was as curious as I, Google says:
2 147 483 647 * 400 nanometers =
858.993459 meters

So, yeah, it's unlikely anyone will exceed 2147483647x2147483647, at
least in an image intended to be viewed in its entirety; throw in pan
and zoom in the display, though, and it's theoretically possible.

S
 

glen herrmannsfeldt

(snip, I wrote)
In case anyone else was as curious as I, Google says:
2 147 483 647 * 400 nanometers =
858.993459 meters
So, yeah, it's unlikely anyone will exceed 2147483647x2147483647, at
least in an image intended to be viewed in its entirety; throw in pan
and zoom in the display, though, and it's theoretically possible.

TeX does all its typesetting calculations in 32 bits with 16 bits
after the binary point, in units of printers points. (1/72.27 inch).

The unit sp (scaled point) is smaller than visible light.
The maximum isn't quite as big as the figure above, so someone might
exceed it for a billboard. But you can always apply a magnification
factor, and probably should for a billboard.

-- glen
 

Rosario1903

It seems there is or was the potential for code pointers and data pointers
to be different sizes, e.g. as in the old segmentation models where one
could be 16 bits and the other could be larger. If so, should there be
pointer difference and size variants for code and data or should the old
models simply never have existed? (Rhetorical!) With x86-64 should C have
different sizes of code and data pointers? (I sure hope not.)

If an implementation allowed a single object to be bigger than half the
address space could operations on it break code using ssize_t and ptrdiff_t,
when the result is out of range of the signed type?

These are the only types I am aware of which are designed specifically to
represent quantities of bytes of memory. Does C have any others that I have
missed?

James

4GB is enough to contain all the code one could imagine, without the data,
so the problem is only for data...

but storing a whole program in memory as a 64-bit program instead of a
32-bit program doubles its size in memory, and one has to deal with
unfriendly 64-bit numbers...

for me, pointers could even be 64-bit, 32-bit, 16-bit or 8-bit, just like
integers, because they are [unsigned?] integers
 

Malcolm McLean

On Sat, 12 Oct 2013 11:26:04 +0100, "James Harris" wrote:

but store in mem all program as 64 bit program instead of 32 bit
program double its size in memory and one has to deal with unfriendly
64 bit numbers...
That's a bit of an issue.
But if you've got 64 bits of address space, you've almost certainly got lots
of memory. It's likely that one or two structures will dominate your
memory take, and there's no point at all optimising the remaining 99%.
Those might have integer members you want to represent specially, but we're
only talking about a few identifiers in the whole program.
 

James Kuyper

On 10/14/2013 01:43 PM, glen herrmannsfeldt wrote:
....
Now, there were some problems in unix that might not have been
necessary.

If a program doesn't do any fseek()/ftell() then it should be
able to process files of unlimited size. It turns out that,
at least in many unix systems, that isn't true.

(There were times when

cat file1 | program | cat > file2

worked but

program < file1 > file2

didn't. Hopefully all fixed by now.)

How, precisely, did it go wrong? What had to be fixed?
 

glen herrmannsfeldt

James Kuyper said:
On 10/14/2013 01:43 PM, glen herrmannsfeldt wrote:
...
How, precisely, did it go wrong? What had to be fixed?

I believe the program died when it got to 2GB on either the output
or input file. I don't remember what the ERRNO was.

As well as I remember, even with redirected I/O, a program is
allowed to use fseek() and ftell(), and, with 32 bit int, would
seek to or read the wrong value. To protect against corruption
(such as fseek() to the wrong place) the system kills the program.

If I remember right, that was Solaris about 1998. Programs like cat
used fseek64() and ftell64(), and were linked with a special option,
such that they were allowed to read/write big files.

-- glen
 

Stephen Sprunk

I believe the program died when it got to 2GB on either the output
or input file. I don't remember what the ERRNO was.

As well as I remember, even with redirected I/O, a program is
allowed to use fseek() and ftell(), and, with 32 bit int, would
seek to or read the wrong value. To protect against corruption
(such as fseek() to the wrong place) the system kills the program.

Or the program cratered on its own when it got unexpected results, e.g.
a negative file position from ftell(), which seems likely.

When you redirect with < or >, the OS connects stdin or stdout to the
named file rather than the console; it's still a _file_. Using "cat"
meant that stdin and stdout were connected to a _pipe_ instead, which
gives fseek() and ftell() well-defined behavior that apparently didn't
crash the program.
If I remember right, that was Solaris about 1998. Programs like cat
used fseek64() and ftell64(), and were linked with a special option,
such that they were allowed to read/write big files.

AFAIK, there was no need for programs to be "linked with a special
option" to get access to fseek64()/ftell64(); those should have been
included in the normal 32-bit libc as soon as the OS itself supported
large files. Likewise, the 64-bit libc should have supported large
files from the start, via both interfaces.

There are a few possibilities I can see:

1. cat used fseek64() and ftell64(), which use a "long long" offset
rather than the "long" offset used by fseek() and ftell().

2. cat used fseek() and ftell(), but it had a 64-bit "long" since it was
compiled in 64-bit mode. (Solaris is I32LP64.)

3. cat didn't use fseek() or ftell() at all.

S
 

James Kuyper

I believe the program died when it got to 2GB on either the output
or input file. I don't remember what the ERRNO was.

I wasn't really looking for the symptoms, but the cause, and more
precisely, how the cause of those symptoms was fixed.
As well as I remember, even with redirected I/O, a program is
allowed to use fseek() and ftell(), and, with 32 bit int, would
seek to or read the wrong value. To protect against corruption
(such as fseek() to the wrong place) the system kills the program.

Yes, but I don't understand why that made a difference - I would have
thought that any fseek() or ftell() occurring in "program" above that
would cause problems when executing

program < file1 > file2

would cause the exact same problem when doing

cat file1 | program | cat > file2

How was re-direction of program output in unix handled such that the way
"cat" is written determines whether or not an fseek() in "program" will
fail? I would not have expected the way "cat" was written to matter, so
long as it actually does what "cat" is supposed to do.
If I remember right, that was Solaris about 1998. Programs like cat
used fseek64() and ftell64(), and were linked with a special option,
such that they were allowed to read/write big files.

Why would "cat" ever need to use fseek64() or ftell64()? As far as I can
see, it never needs to keep more than one character of input in memory
at a time, and never has any need to skip forward or backward through
either the input or output files.
 
