Sizes of pointers

  • Thread starter: James Harris (es)

Stephen Sprunk

if I have one generic operation *: A×A→A and A⊂N [A can be
unsigned 32-bit integers, for example A = 0..0xFFFFFFFF, or 64-bit
integers, etc.]

if the math operation * is not the same across machines [the * has to
associate the same numbers], and the number of elements the object A
contains is not the same across machines, then the operation * cannot
be portable across machines that want to use it
....
The only operations that C defines for pointers are:

pointer ← pointer + integer

and

integer ← pointer - pointer

For those without working Unicode support in their newsreader, like
Rosario1903 apparently, those "?"s were left arrows.
integer... I have seen somewhere pointer-address / unsigned too, and
pointer-address * unsigned too

C does not define multiplication or division on pointers, so any attempt
to do so is inherently non-portable.

Granted, those operations do have a fairly obvious meaning on systems
with a flat linear address space, but not all C implementations run on
such systems. The world is a lot bigger than x86 and RISC, and one of
C's strengths is its ability to run on a wide variety of systems--and
that is a direct result of leaving certain details of the language
undefined (or vice versa, actually).


Keith Thompson

Rosario1903 said:
malloc() as found in the K&R II book would be as above, right?
it returns memory for each object, right?


if malloc is as above, for what it's worth, I disagree

An *implementation* of malloc has to work with the underlying machine
and operating system, and typically is not written in portable code.
A malloc implementation for a system with a typical monolithic
address space, where pointers can be treated as indices into a large
1-dimensional array, is going to work with that model. A malloc
implementation for a different kind of system might, for example,
invoke a system call that allocates a descriptor of some sort.

A lot of the code that implements the C library has to be written
non-portably, and has to be rewritten for different systems.

Higher-level code that *uses* the C library, for the most part,
doesn't and shouldn't care whether the address space is linear
or not.
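To illustrate the "pointers as indices into a large 1-dimensional array"
model, here is a hypothetical toy allocator (the names toy_alloc and
HEAP_SIZE are invented for this sketch, and the 16-byte rounding is an
assumption, not something derived from any real implementation). It is
not a real malloc: no free(), no metadata, fixed-size heap:

```c
#include <stddef.h>

enum { HEAP_SIZE = 4096 };

static unsigned char heap[HEAP_SIZE];
static size_t next_free; /* index into the flat heap array */

void *toy_alloc(size_t size)
{
    /* round the request up to 16 bytes, assuming (perhaps wrongly on
       some targets) that 16 satisfies the strictest alignment */
    size = (size + 15) & ~(size_t)15;
    if (size > HEAP_SIZE - next_free)
        return NULL;
    void *p = &heap[next_free];
    next_free += size;
    return p;
}
```

On a machine with a descriptor-based or segmented memory model, this
kind of index arithmetic is exactly what would have to be replaced.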
 

Keith Thompson

Rosario1903 said:
i remember something as

(pointerADDRESS % x) != 0
where x = 2, 4, 8, to check the alignment of pointerADDRESS

Applying the "%" operator to a pointer value is a constraint violation,
requiring a diagnostic.

You can do something similar by converting the pointer:

if ((uintptr_t)pointer % 4 != 0) {
    /* pointer is not 4-byte aligned */
}

But I've worked on systems where that wouldn't work as intended, because
pointers were not represented as simple integers.
 

Stephen Sprunk

Well, maybe. It's not that the second pointer isn't there on many
other implementations, it's just that it's the called routine that
usually loads it.

Are you sure about that? I thought that was usually handled by
relocation or PC-relative addressing. PIC adds a layer of indirection
with a GOT/PLT, but AFAIK that has the same value for all functions and
is itself relocated at link time anyway.


Stephen Sprunk

malloc() as found in the K&R II book would be as above, right? it
returns memory for each object, right?

malloc() does not assume a flat linear address space, nor does it
require relative comparison or subtraction of pointers to different
objects to be meaningful.

malloc() works just fine on systems that have segmented addressing; if
it didn't, they couldn't have a C implementation.


Stephen Sprunk

if I have one generic operation *: A×A→A and A⊂N [A can be unsigned
32-bit integers, for example A = 0..0xFFFFFFFF, or 64-bit integers,
etc.]

if the math operation * is not the same across machines [the * has to
associate the same numbers], and the number of elements the object A
contains is not the same across machines, then the operation * cannot
be portable across machines that want to use it

The only operations that C defines for pointers are:

pointer ← pointer + integer

and

integer ← pointer - pointer

I think you meant to restrict your statement more closely than you
actually did. It's perfectly true that there are a lot of operators that
you can apply to integer types that you cannot apply to pointers,
including all of the ones relevant to his claim: ^, ~, binary*,
pointer+pointer, /, %, >>, <<, binary &, and binary |.

However, all of the following operations are defined for pointers:
(), unary *, unary &, [], function call, ->, ++, --, !, sizeof,
_Alignof(), cast, <, >, <=, >=, ==, !=, &&, ||, =, ?:, +=, -=,
and the comma operator.

IMHO, the context clearly implied that my statement was constrained to
arithmetic operations.

There are a few arithmetic operators I skipped over, but nearly(?) all
can be defined in terms of the ones I did list. The rest of your list
are non-arithmetic.


Tim Rentsch

Stephen Sprunk said:
Sorry, I must have missed that discussion. It's my favorite
x86-64 quirk, so I rarely pass on an opportunity to mention it.

One _can_ describe x86-64 pointers in unsigned terms, and some
sources go to great lengths to do so;

Other sources describe x86-64 pointers in unsigned terms and do
so very simply and understandably.
it's just that an explanation of certain kernel details is both
shorter and easier to understand when presented in signed terms.

This may say something about (some) kernel code, but not about x86-64.
The hardware itself is intrinsically neither signed nor unsigned.
If anything the hardware views addresses as unsigned, because of
how address translation works - page table lookup treats all bits
that it uses as value bits, and none as sign bits.
 

Stephen Sprunk

Other sources describe x86-64 pointers in unsigned terms and do so
very simply and understandably.

Having to explain why an unsigned value must be sign-extended isn't
nearly as simple or understandable as saying it's a signed value.

If pointers were unsigned, they'd be zero-extended.

Some sources go to great lengths to call it something other than sign
extension and then define that made-up term in unsigned terminology. A
rose by any other name is still a rose, though, and such explanations
are neither as simple nor as understandable as the signed one.
This may say something about (some) kernel code, but not about x86-64.

It applies to both Windows and Linux, and experts from both groups
collaborated with AMD on such details when the architecture was being
defined. The ISA was literally designed to run those OSes, so how those
OSes use the ISA is informative of the intent.

True, someone could design an OS that didn't define the negative half as
being kernel space, in which case sign extension would no longer be so
important, but sign extension would still happen because it's a part of
the instruction set definition. Change a fundamental detail like that
and the result cannot be called AMD64/x86-64 anymore.
The hardware itself is intrinsically neither signed nor unsigned. If
anything the hardware views addresses as unsigned, because of how
address translation works - page table lookup treats all bits that it
uses as value bits, and none as sign bits.

If I extract a subset of bits from an object and look at the result as
an unsigned value, that says nothing about whether the original object
was signed or unsigned.

OTOH, dictating that an object must be sign-extended when converted says
a lot about whether those objects are signed.


James Kuyper

Having to explain why an unsigned value must be sign-extended isn't
nearly as simple or understandable as saying it's a signed value.

Under what circumstances, when using x86-64 pointers, do they need to be
sign-extended? I know nothing about the x86-64 architecture, so please
word your explanation accordingly.
 

Tim Rentsch

Stephen Sprunk said:
If you need a 64-bit code pointer _and_ a 64-bit data pointer to
call a function, then one can reasonably (for the sake of brevity)
call that a 128-bit function pointer.

I don't agree. The term 'function pointer' should mean exactly
the kinds of values used in a C implementation and held in objects
having a pointer-to-function type. If a pointer-to-function
object has 64 bits, then there are 64-bit function pointers; if a
pointer-to-function object has 128 bits, then there are 128-bit
function pointers. Any ancillary or auxiliary data structure used
in calling a function, but not part of a pointer-to-function
object representation, should not be called a function pointer,
for brevity or any other reason.
 

Stephen Sprunk

Under what circumstances, when using x86-64 pointers, do they need to
be sign-extended? I know nothing about the x86-64 architecture, so
please word your explanation accordingly.

As you may expect from the name, x86-64 has 64-bit registers. Current
implementations only use the low 48 bits of a memory address, though,
and if they are not properly sign-extended to full register width, the
processor must fault.

So, valid addresses are in two ranges, from 0x00000000 00000000 to
0x00007fff ffffffff ("user space") and from 0xffff8000 00000000 to
0xffffffff ffffffff ("kernel space"). IMHO, it is more elegant to
describe this as a single range from -128TB to +128TB, with kernel space
being "negative".

Of particular importance is how x86's 4GB address space is mapped into
x86-64's 256TB address space. User space is easy; it's 0GB to +2GB
regardless of whether you view pointers as signed or unsigned, so the
sign/zero-extension issue doesn't arise there.

Kernel space is another matter. In the signed view, x86's kernel space
is -2GB to 0GB in both 32-bit mode _and_ 64-bit mode. Also, kernels
themselves easily fit into that 2GB, so they can get away with using
(negative) 32-bit pointers that are valid in both modes! That is truly
an elegant solution.

OTOH, if pointers were unsigned and therefore zero-extended rather than
sign-extended, then x86's kernel space would end up inside x86-64's
(larger) user space when the kernel switched to 64-bit mode! That can
be worked around, but the result is clearly inelegant.


Stephen Sprunk

I don't agree. The term 'function pointer' should mean exactly the
kinds of values used in a C implementation and held in objects having
a pointer-to-function type. If a pointer-to-function object has 64
bits, then there are 64-bit function pointers; if a
pointer-to-function object has 128 bits, then there are 128-bit
function pointers. Any ancillary or auxiliary data structure used in
calling a function, but not part of a pointer-to-function object
representation, should not be called a function pointer, for brevity
or any other reason.

You seem to be agreeing with me that if the function's 64-bit code
address and 64-bit data address were combined into a single 128-bit
value, as was reportedly done in the first Itanic implementations, that
could be reasonably called a 128-bit function pointer.

Yes, Itanic today has a layer of indirection so that function pointers
can be 64-bit and therefore converted to/from void*, but that was my
point: the standard allows function pointers to be incompatible with
object pointers, but in practice that doesn't work so well. Even the
Itanic team, which seemed to make a point of combining every bad idea
they had ever heard of into one product, couldn't stomach it.


Stephen Sprunk

Well, there was one other problem and that is that they never
implemented a segment descriptor cache. Every segment selector load
requires loading a 64 bit descriptor. A cache would have sped up
processing in many cases.

Indeed, but by the time x86 chips had even regular data caches,
segmentation was clearly dead. And no cache could make up for not
having enough segment values available in the first place, which is why
implementations couldn't use a segment per object a la AS/400.
With a good descriptor cache, 80386 protected mode, with 4GB
segments, might have delayed the need for 64 bit addressing. (That,
and some way around the 32 bit MMU.)

The 286 offered 16-bit segment offsets within a 24-bit address space, so
using multiple segments had a clear benefit to justify the extra complexity.

The 386 offered 32-bit segment offsets within a 32-bit address space,
plus a more granular paging system, so there was no longer any benefit
to justify the complexity of multiple segments. If they had used a
48-bit base in the descriptor, we'd still be using IA-32 today. Even a
40-bit base would have bought x86 another decade of life.

Later, PAE increased the physical address space from 32 to 36 bits, but
again they missed the obvious opportunity to enlarge the base, so each
process was still limited to 4GB of virtual address space.


Philip Lantz

James said:
Under what circumstances, when using x86-64 pointers, do they need to be
sign-extended? I know nothing about the x86-64 architecture, so please
word your explanation accordingly.

X86-64 linear (virtual) addresses are 48 bits in the range
0xffff8000,00000000 - 0x00007fff,ffffffff. My code lives in the address
range 0xffffffff,c0000000 - 0xffffffff,c007ffff. I can use instructions
with 32-bit base addresses because the addresses are sign extended
before they are used. (That's the way the instructions are defined.) I
can also load a pointer into a register using an instruction with a 32-
bit operand, because it is sign extended.

If addresses were not sign extended, I would have to load every address
into a register using an instruction with a 64-bit operand. I wouldn't
be able to use direct addressing much, because there are very few
instructions with a 64-bit base address encoded in the instruction.

I put my code there because I don't want it to overlap the physical
addresses of any physical memory. The range of physical memory is 0 -
0xf,ffffffff (higher on some systems). (I could explain why I don't want
that, but I think it would be tangential to my point.) In the case of
normal operating-system kernel code, the requirement is that the kernel
addresses don't overlap the user-mode addresses, which leads to a
similar situation.

If addresses were not treated as signed, they would be in the range 0 -
0x0000ffff,ffffffff. In order to avoid overlapping physical memory, I
could put my code at something like 0x0000ff00,00000000 or possibly
0x0000ffff,c0000000. Then I would have to use 64-bit addresses instead
of sign-extended 32-bit addresses. If that were the case, the
instruction set would probably be defined with many more instructions
with 64-bit base addresses (or possibly 48-bit addresses), making the
instruction set more complicated and programs larger.

I hope this answers your question; I'm not sure it describes when
addresses "need to be sign extended", but it does explain why the
currently defined behavior, which is consistent with signed addresses,
is convenient for my code.
 

Tim Rentsch

Stephen Sprunk said:
VAX put user programs at the bottom of the 32 bit address space,
with the system space at the top. VAX was around a long time
before people could afford 4GB. I don't know that it was
described as signed, but the effect was the same.

Currently no-one can fill a 64 bit address space, so there are
tricks to avoid the overhead of working with it.

To avoid huge page tables, VAX and S/370 use a two-level virtual
address system. (VAX has pagable page tables, S/370 has segments
and pages.) Continuing that, z/Architecture has five levels of
tables. It should take five table references to resolve a
virtual address (that isn't in the TLB) but there is a way to
reduce that for current sized systems.

x86 is similar, [snip stuff about page tables]

Most importantly, though, all pointers must be sign-extended,
rather than zero-extended, when stored in a 64-bit register. You
could view the result as unsigned, but that is counter-intuitive
and results in an address space with an enormous hole in the
middle. OTOH, if you view them as signed, the result is a single
block of memory centered on zero, with user space as positive and
kernel space as negative. Sign extension also has important
implications for code that must work in both x86 and x86-64
modes, e.g. an OS kernel--not coincidentally the only code that
should be working with negative pointers anyway. [snip unrelated]

IMO it is more natural to think of kernel memory and user memory
as occupying separate address spaces rather than being part of
one combined positive/negative blob; having a hole between them
helps rather than hurts. If you want to think kernel memory as
"negative" and user memory as "positive", and contiguous so a
pointer being decremented in user space falls into kernel space,
you are certainly welcome to do that, but don't insist that
others have to share your perceptual bias.
 

Stephen Sprunk

Stephen Sprunk said:
Most importantly, though, [x86-64] pointers must be sign-extended,
rather than zero-extended, when stored in a 64-bit register. You
could view the result as unsigned, but that is counter-intuitive
and results in an address space with an enormous hole in the
middle. OTOH, if you view them as signed, the result is a single
block of memory centered on zero, with user space as positive and
kernel space as negative. ... [snip unrelated]

IMO it is more natural to think of kernel memory and user memory
as occupying separate address spaces rather than being part of
one combined positive/negative blob; having a hole between them
helps rather than hurts. If you want to think kernel memory as
"negative" and user memory as "positive", and contiguous so a
pointer being decremented in user space falls into kernel space,
you are certainly welcome to do that, but don't insist that
others have to share your perceptual bias.

It's not _my_ perceptual bias; it's how the architects thought of it,
and their view yet survives in various docs, e.g. from GCC's manual:

-mcmodel=kernel
Generate code for the kernel code model. The kernel runs in the
negative 2 GB of the address space. This model has to be used for Linux
kernel code.

It's also evidenced by the use of sign extension, which only makes sense
for signed values, rather than zero extension, which would logically
have been used for unsigned values.

Finally, the valid values of an x86-64 memory address are:

kernel: 0xffff8000,00000000 to 0xffffffff,ffffffff [-128TB to 0)
user: 0x00000000,00000000 to 0x00007fff,ffffffff [0 to +128TB)

which leads to an obvious reimagining of x86 addresses:

kernel: 0x80000000 to 0xffffffff [-2GB to 0)
user: 0x00000000 to 0x7fffffff [0 to +2GB)

which, when sign-extended, evinces an elegance that simply cannot be
explained away as mere perceptual bias:

kernel: 0xffffffff,80000000 to 0xffffffff,ffffffff [-2GB to 0)
user: 0x00000000,00000000 to 0x00000000,7fffffff [0 to +2GB)

The later invention of "canonical form", exactly equivalent to sign
extension but studiously avoiding use of the heretical term "sign",
smacks of Church officials who insisted on the other planets having
wobbly orbits around Earth in stark defiance of the laws of physics
because their dogma was incompatible with the far more elegant (and
correct) view that the planets actually orbit the Sun.


Malcolm McLean

On 04-Aug-13 05:58, Tim Rentsch wrote:

The later invention of "canonical form", exactly equivalent to sign
extension but studiously avoiding use of the heretical term "sign",
smacks of Church officials who insisted on the other planets having
wobbly orbits around Earth in stark defiance of the laws of physics
because their dogma was incompatible with the far more elegant (and
correct) view that the planets actually orbit the Sun.
And of course neither was right. There's no fixed co-ordinate system
through which the planets move. So Galileo's view that the planets orbit
the Sun, and Pope Urban's view that they describe a rather more complicated
path around the Earth, are both equally correct.
 

Rosario1903

C does not define multiplication or division on pointers, so any attempt
to do so is inherently non-portable.

Granted, those operations do have a fairly obvious meaning on systems
with a flat linear address space, but not all C implementations run on
such systems. The world is a lot bigger than x86 and RISC, and one of
C's strengths is its ability to run on a wide variety of systems--and
that is a direct result of leaving certain details of the language
undefined (or vice versa, actually).


so C cannot be a portable language...
or only portable with your meaning of portable
 

Rosario1903

Stephen Sprunk <[email protected]> writes:

It's not _my_ perceptual bias; it's how the architects thought of it,
and their view yet survives in various docs, e.g. from GCC's manual:

-mcmodel=kernel
Generate code for the kernel code model. The kernel runs in the
negative 2 GB of the address space. This model has to be used for Linux
kernel code.

It's also evidenced by the use of sign extension, which only makes sense
for signed values, rather than zero extension, which would logically
have been used for unsigned values.

Finally, the valid values of an x86-64 memory address are:

kernel: 0xffff8000,00000000 to 0xffffffff,ffffffff [-128TB to 0)
user: 0x00000000,00000000 to 0x00007fff,ffffffff [0 to +128TB)

which leads to an obvious reimagining of x86 addresses:

kernel: 0x80000000 to 0xffffffff [-2GB to 0)
user: 0x00000000 to 0x7fffffff [0 to +2GB)

where is the problem?
user memory and code in: (0x00000000, 0x7fffffff]
kernel code in: [0x80000000, 0xffffffff]
data reserved in: {0}

all that is unsigned...
I could say the kernel code is pointed to by addresses that have the
top bit 1 [the 1 bit one can find in 0x80000000]

if P = 0x7 as a 16-bit value, why would you want to do a sign
extension on it for 32 bits?

P is in the user space, so why would you want its sign extension
0xffffffff? [that would be in kernel space?]

the only meaningful extension is zero extension: P = 0x7

for me, given your way of laying out code:
if a = 1 [as an address] is in the user code... then a-1 would be
undefined, and the same for a-2, a-3, a-4, because they are in the
kernel space

if a = 0xFFFF then a-b would be defined if and only if
b < a (unsigned)
if and only if
a-b > 0

so in user mode there cannot be negative addresses, or addresses >=
0x80000000, which would be the same thing

so the user space has no use for signed addresses, or addresses >=
0x80000000 if one sees them as unsigned
which, when sign-extended, evinces an elegance that simply cannot be
explained away as mere perceptual bias:

kernel: 0xffffffff,80000000 to 0xffffffff,ffffffff [-2GB to 0)
user: 0x00000000,00000000 to 0x00000000,7fffffff [0 to +2GB)

The later invention of "canonical form", exactly equivalent to sign
extension but studiously avoiding use of the heretical term "sign",

in what way is sign extension more useful
than zero extension?

yes, yes, the data in kernel space has to have the sign extension,
perhaps because there are too many ffff at the start
 

Bart van Ingen Schenau

Bart van Ingen Schenau said:
Can you give an example how the DS9000 could make a conversion between
pointers to structs fail, given that the pointers involved are required
to have the same representation and alignment (and that the intent of
that requirement is to allow for interchangeability)?

The pointers have the same alignment, but what they point to need not
have the same alignment. A simple example:

struct smaller { int x; };    // assume size == alignment == 4
struct larger { double z; };  // assume size == alignment == 8
union {
    struct smaller foo[2];
    struct larger bas;
} it;

(struct larger *) &it.foo[1]; // incorrect alignment

The last line exhibits undefined behavior, explicitly called out as such
under 6.3.2.3 p7. I expect most implementations won't misbehave in such
cases, but the Standard clearly allows them to.

You are right. That runs afoul of 6.3.2.3/7. But this variation:

struct smaller* p = &it.foo[1];
(struct larger**) &p;

must work due to 6.2.5/26.
A DS9000 could simply choose to check the resulting alignment directly,
perhaps using a HCFBA instruction (that's the "halt and catch fire on
bad alignment" opcode).

Naturally, in the same way that the DS9000 compiler tries its hardest to
put p in a location that is only suitable for a struct smaller*.

Bart v Ingen Schenau
 
