Sizes of pointers

  • Thread starter: James Harris (es)

Eric Sosman

[...]
ISTM that 2's complement integers and 8-bit bytes have pretty much won the
argument. C does a great job of adapting to both worlds.

Let's just say that two's complement and 8-bit bytes are
the "party in power" these days. Will this alliance always
rule? Ascendancy does not imply permanence.

The computing industry is young, and like many human young
it is fashion-driven. Do not assume that today's most popular
fad will not become a forgotten footnote tomorrow.

(Hint: How many qubits make a qubyte?)
 

Stephen Sprunk

Copying data between kernel and user space is indeed a cause of
slowness. In at least the traditional OS models that still happens
more than it should. I don't know how good modern OSes have become at
avoiding that copying, but the nature of many system calls means there
will still be some.

There has been effort put into reducing the number of copies needed, but
in many cases it is simply impossible to not copy at all.
Changing address spaces doesn't prevent there being some parts of
memory mapped in common, so an OS could use common mappings in the
transfer and thus avoid double copying via a bounce buffer.

Since the data could be (nearly) anywhere in user space, unless all of
user space is available to kernel code, you're probably going to need
bounce buffers. Even if you don't happen to need them in all cases,
it's probably faster to use them anyway than spend the extra cycles
figuring out if you need them or not.
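
To make the double copy concrete, here is a sketch in C; every name in
it is invented for illustration and none is a real kernel API:

#include <stddef.h>

#define BOUNCE_SIZE 4096
static char bounce_buf[BOUNCE_SIZE]; /* the one region mapped in both spaces */

/* Hypothetical primitives -- stand-ins, not real kernel interfaces. */
extern size_t device_read(int fd, void *dst, size_t len);
extern void bounce_copy_to_user(char *user_dst, const void *src, size_t len);

/* Data moves twice: device -> bounce buffer, then bounce buffer -> user. */
static size_t read_via_bounce(int fd, char *user_buf, size_t len)
{
    size_t done = 0;
    while (done < len) {
        size_t chunk = (len - done < BOUNCE_SIZE) ? len - done : BOUNCE_SIZE;
        chunk = device_read(fd, bounce_buf, chunk);              /* copy 1 */
        if (chunk == 0)
            break;
        bounce_copy_to_user(user_buf + done, bounce_buf, chunk); /* copy 2 */
        done += chunk;
    }
    return done;
}
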
Changing page tables usually requires flushing the old TLB, which is a
cache, but only a cache of page table entries.

I thought if you flushed the page table entries, you had to flush all
the affected data cache lines as well. Some CPUs might simply flush
everything rather than spend the silicon needed to figure out which
lines _don't_ need to be flushed.
Yes, it can hurt performance, and it was an issue on x86_32 too.
However, to reduce the impact some pages can be marked as Global.
Their entries are not flushed from the TLB with the rest. These are
usually pages containing kernel data and code which are to appear in
all address spaces.

.... but the proposal above was to have no memory that appeared in both,
presumably aside from the necessary bounce buffers.
As an aside, and I'm not sure which CPUs implement this, instead of
marking pages as global some processors allow TLB entries to be
tagged with the address space id. Then no flushing is required when
address spaces are changed. The CPU just ignores any entries which
don't have the current address space id (and which are not global)
and refills itself as needed. It's a better scheme because it doesn't
force flushing of entries unnecessarily.

I'm not aware of any x86(-64) chips that do that, but perhaps some have
recently added that.
The x86_32 application address space is strictly 4GB in size.
Addresses may go through segment mapping where you can have 4GB for
each of six segments but they all end up as 32 bits (which then go
through paging). So 4GB+4GB would have to multiplex onto the same
4GB address space (which would then multiplex onto memory). I don't
know what RedHat did but they could have reserved a tiny bit of the
applications' 4GB to help the transition and switched memory mappings
whenever going from user to kernel space.

That's exactly what they did; the (very small) data area mapped into
both user and kernel address spaces is the bounce buffer referred to
above. Kernel code could not "see" user space, just a bounce buffer.
The performance hit was enormous.
Such a scheme wouldn't have been used for all systems because it
would have been slower than normal, not just because of the transition
but also because, where the kernel needed access to user space, it
would have to carry out a mapping of its own.

Exactly my point.
OSes often reserve some addresses so that they run in the same
address space as the apps. That makes communication much easier and a
little bit faster but IMHO they often reserve far more than they need
to, being built in the day when no apps would require 4GB.

The normal split is 2GB+2GB. You can configure Windows to do 3GB+1GB,
but many drivers crash in such a configuration and the OS can easily run
out of kernel space to hold page tables. That's why it's not the default.
As an aside, the old x86_16 could genuinely split all of an app's
addressable memory into 64k data and 64k code because it had a 20-bit
address space.

You mean real mode.

The same was also possible in 286 protected mode (16-bit segments in a
24-bit address space), but not 386 protected mode (32-bit segments in a
32-bit address space).

S
 

Eric Sosman

Note that Java requires specific sizes for its integer types,
and specific IEEE standard sizes for its floating point types.

Right. And almost from the get-go, Java had to relax its
rules for floating-point to make x86 implementations practical.
(See the "strictfp" keyword.)

As for self-obsolescence -- Well, data sets get bigger all
the time, and "Big Data" is today's trendy term. Yet Java
*cannot* have an array with more than 2^31 elements, not even
on a 64-bit system with forty-two terabytes of RAM.

I'm not saying that Java's choices were uniformly better
or worse than C's; Java certainly gains in definiteness, even
if it loses in portability. The point is that there's a trade-
off: more exactness with more rigidity, more flexibility with
more vagueness -- it's almost Heisenbergian.

The extreme exactness Rosario favors would, I believe, have
made C so rigid that there would have been little interest in
the language. C's success is due in large measure to the ease
with which the language could be implemented on a variety of
machines; had that *not* been easy, who'd have wanted C? It
would have been a high-level assembler for the PDP-11, nothing
more.
 

James Kuyper

It depends on the limits defined for C's original floating point.

He's talking about the hypothetical case where C's floating point
semantics had been specified so precisely that only PDP-11 floating
point qualified. This is not just about limits, but also about
specification of the handling of special cases. Take a look at Annex F
for an example of the kinds of special cases whose handling could be
specified. I know nothing about how PDP11 floating point differs from
IEEE, but I presume Eric would not have mentioned it unless there were
some such differences. Imagine the case where Annex F matched PDP11, in
a way that made it incompatible with IEEE, and imagine that conformance
to Annex F was mandatory, rather than depending upon whether the
implementation chooses to pre-#define a macro like __STDC_IEC_559__. How
popular would C be, with such a specification?
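
(For reference, the escape hatch C actually chose is a predefined
macro, so a program can at least ask whether Annex F is in force:)

#include <stdio.h>

int main(void)
{
#ifdef __STDC_IEC_559__
    puts("Annex F in force: IEC 60559 (IEEE 754) semantics promised");
#else
    puts("no Annex F conformance claimed; floating point details vary");
#endif
    return 0;
}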

....
Personally I'd like to see a software floating point library which focussed
on getting the best possible performance rather than compatibility with the
IEEE standards. I mean a portable library which could produce repeatable
results on any hardware and which used its specifications to avoid the parts
of FP processing which are normally expensive in software. Such a library
could be adapted to any width of floating point number.


"best possible performance" is inherently platform dependent, and
therefore necessarily in conflict with your goal of producing
"repeatable results on any hardware" (unless you mean "repeatable" only
on each platform separately, rather than across platforms - but
"repeatable on each platform separately" is trivial to achieve, and
hardly worth talking about).

....
Why SPARC? Or were you kidding?

Wikipedia <http://en.wikipedia.org/wiki/Write_once,_run_anywhere>
credits Sun, the maker of SPARCs, as the creator of the slogan "Write
once, run anywhere". I can't vouch for the accuracy of his assertion,
but if it's otherwise true, SPARC is a plausible platform for it to be
true about.
 

Eric Sosman

Eric Sosman said:
[...]
A more recent language made a pretty serious attempt to
produce exactly the same result on all machines; "write once,
run anywhere" was its slogan. Its designers had to abandon
that goal almost immediately, or else "anywhere" would have
been limited to "any SPARC." Moral: What you seek is *much*
more difficult than you realize.

Why SPARC? Or were you kidding?

The first Java implementations on x86 architectures ran
into a problem: x86's floating-point implementation used extended
precision for intermediate results, only producing 32- or 64-bit
F-P results at the end of a series of calculations. Original
Java, though, demanded a 32- or 64-bit rounded result for every
intermediate step (which is what SPARC's implementation did).
The upshot was that to achieve the results mandated by original
Java, a Java-on-x86 implementation would have had to insert an
extra rounding step after every F-P operation, with a significant
speed penalty (perhaps a pipeline stall).

Result: Java had to relax its rules to allow for x86's extra
precision, and hence had to give up on the ideal of "Same result
everywhere." The "strictfp" keyword was added to Java as a way
to say "Give me adherence to The Rules at whatever speed cost,"
but the default is "fast and loose" and "same result" takes a
back seat.
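
(C, for comparison, exposes rather than hides this: FLT_EVAL_METHOD in
<float.h> tells you whether intermediate results carry extra precision.
A minimal check:)

#include <float.h>
#include <stdio.h>

int main(void)
{
    /* 0: operands evaluated in their own type (e.g. SSE2 on x86-64);
       2: evaluated as long double (classic x87 extended precision);
      -1: indeterminable. */
    printf("FLT_EVAL_METHOD = %d\n", (int)FLT_EVAL_METHOD);
    return 0;
}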
 

Stephen Sprunk

Wikipedia <http://en.wikipedia.org/wiki/Write_once,_run_anywhere>
credits Sun, the maker of SPARCs, as the creator of the slogan
"Write once, run anywhere". I can't vouch for the accuracy of his
assertion, but if it's otherwise true, SPARC is a plausible platform
for it to be true about.

I'm pretty sure he was referring to Java, which originally mandated
SPARC behavior but soon relaxed that to allow x86 behavior as well.
Java may indeed be "write once, run anywhere", but you may not always
get the same results. Oops. Rosario1903 must be so disappointed.

S
 

James Harris

....
There has been effort put into reducing the number of copies needed, but
in many cases it is simply impossible to not copy at all.

Copying is unfortunately required by some system calls. Every time a system
call supplies a buffer for the OS to write into, the OS has to copy data. In
that case the design of the system call mandates at least one copy
operation.

....
I thought if you flushed the page table entries, you had to flush all
the affected data cache lines as well. Some CPUs might simply flush
everything rather than spend the silicon needed to figure out which
lines _don't_ need to be flushed.

I can answer best in connection with x86. On that, when the register which
is the root of the page table structures (CR3) is reloaded, only the entries
in the TLB are documented to be flushed. It could take ages to flush the
data caches to memory so leaving them alone is a good thing. I cannot think
of a reason to flush normal data at the same time as the TLB.

OS designers can choose whether x86_32 page table entries are cached in both
TLB and data cache or just in TLB by setting Page Cache Disable (PCD) bits
in, IIRC, the page directory. That made sense when the normal data cache was
very small as it helped avoid populating the data cache with entries which
were also cached in the TLB.

James
 

James Harris

Stephen Sprunk said:
I'll grant that 36-bit machines have fallen out of favor, and I'm not
sure how common ones complement and signed magnitude are these days, but
16-bit (and 8-bit) machines are still common, as are 24-bit DSPs.

The industry is less diverse than it used to be, but it's still far more
diverse than Rosario1903 believes it to be. All the world is _not_ an
x86.

Sure. I'm not arguing anyone else's point. Each person can have his own
discussion.

In any case, much of the world is Arm now!
Heck, even simple details like what happens when you shift a signed
value are not standardized. GCC on x86(-64) happens to insert extra
code to "correct" the native value to what they assume a programmer
expects, but there is no requirement to do so, and other compilers or
even GCC on other CPUs may not do so.

This is one of the things that it would be useful either to standardise or
to allow the programmer to specify. Another is division and remainder where
dividend and divisor have opposite signs.
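
For instance (C99 finally pinned down the division case; the shift is
still implementation-defined):

#include <stdio.h>

int main(void)
{
    int a = -7, b = 2;
    /* Since C99, integer division truncates toward zero: */
    printf("%d %d\n", a / b, a % b); /* -3 -1; C90 also allowed -4 and 1 */
    /* Right-shifting a negative value is implementation-defined: */
    printf("%d\n", -8 >> 1);         /* commonly -4, but not guaranteed */
    return 0;
}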

James
 

James Harris

....
"best possible performance" is inherently platform dependent, and
therefore necessarily in conflict with your goal of producing
"repeatable results on any hardware" (unless you mean "repeatable" only
on each platform separately, rather than across platforms - but
"repeatable on each platform separately" is trivial to achieve, and
hardly worth talking about).

Oh, no, I was thinking about representing floating point numbers using
integers - most likely the traditional exponent and mantissa - and
manipulating those using the machine's normal registers just like had to be
done before FP hardware emerged from the test tube. A given floating point
representation would look and behave in exactly the same way on all machines
so the results would be repeatable. Choosing representation and operations
for speed is what I had in mind.
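
As a sketch of the sort of thing I mean, with an invented layout chosen
for cheap integer handling rather than IEEE compatibility:

#include <stdint.h>

/* Invented format: value = mant * 2^exp.  A full 32-bit signed mantissa,
   no hidden bit, and no NaNs, infinities or subnormals to test for. */
struct soft_float {
    int32_t mant;
    int32_t exp;
};

/* Multiply: widen, keep the high half, adjust the exponent.  No field
   packing or unpacking and no special-case branches.  (A real library
   would renormalise; and note that >> on a negative value is itself
   implementation-defined, as discussed elsewhere in this thread.) */
static struct soft_float sf_mul(struct soft_float a, struct soft_float b)
{
    int64_t p = (int64_t)a.mant * b.mant;
    struct soft_float r;
    r.mant = (int32_t)(p >> 32);
    r.exp  = a.exp + b.exp + 32;
    return r;
}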

James
 

James Harris

Eric Sosman said:
Eric Sosman said:
[...]
A more recent language made a pretty serious attempt to
produce exactly the same result on all machines; "write once,
run anywhere" was its slogan. Its designers had to abandon
that goal almost immediately, or else "anywhere" would have
been limited to "any SPARC." Moral: What you seek is *much*
more difficult than you realize.

Why SPARC? Or were you kidding?

The first Java implementations on x86 architectures ran
into a problem: x86's floating-point implementation used extended
precision for intermediate results, only producing 32- or 64-bit
F-P results at the end of a series of calculations. Original
Java, though, demanded a 32- or 64-bit rounded result for every
intermediate step (which is what SPARC's implementation did).
The upshot was that to achieve the results mandated by original
Java, a Java-on-x86 implementation would have had to insert an
extra rounding step after every F-P operation, with a significant
speed penalty (perhaps a pipeline stall).

Don't remind me. As I remember it, Intel already had their 80-bit
representation and applied pressure to get it included in the 754 spec.
Sure, it helps precision but I wish they had either stuck to powers of two
or mandated similarly slightly-larger representations for all other sizes.
As it stands it is an anomaly ratified by the standards body to give one
manufacturer an edge over the others as your example illustrates. Its
inclusion in the standard may have been purely for partisan commercial
interests.

The Java intention of write once, run anywhere is a good goal as long as the
programmer gets involved and knows the limits of such a promise. For
example, even if the JVM executed instructions in an identical way a program
which needed a gigabyte of memory wouldn't "run anywhere" on a machine with
128M of RAM. And a program which needed a network connection wouldn't work
the same way on a stand-alone machine. And a program which needed a high
resolution timer wouldn't work the same way on a machine with a coarser
timer.

And a program which needed a certain response from floating point arithmetic
wouldn't run the same way on a machine with differently specified floating
point hardware ... unless the programmer was involved in the decision as
to what parameters the float operations needed and that programmer chose
well.
Result: Java had to relax its rules to allow for x86's extra
precision, and hence had to give up on the ideal of "Same result
everywhere." The "strictfp" keyword was added to Java as a way
to say "Give me adherence to The Rules at whatever speed cost,"
but the default is "fast and loose" and "same result" takes a
back seat.

I've checked the strictfp issue and can see why the keyword was added but it
looks like a kludge added in order to accommodate the eccentric FP width of
Intel FPUs. Once Intel had had their hardware described by the IEEE they
were in a position to exercise undue influence on the market.

From your comment it sounds like there is no fast way for the x87 and its
descendants to round the top of stack to 32 or 64 bits.
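
There is a way, but it goes through memory: storing to a 32- or 64-bit
location and reloading forces the rounding, which is roughly what
Java-on-x86 had to emit after every operation. In C it looks like this:

/* Force an x87 intermediate down to true double width via a store and
   reload; volatile stops the compiler optimising the round trip away.
   The memory traffic is the cost Eric mentions. */
static double round_to_double(double x)
{
    volatile double tmp = x;
    return tmp;
}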

James
 

James Kuyper

Oh, no, I was thinking about representing floating point numbers using
integers - most likely the traditional exponent and mantissa - and
manipulating those using the machine's normal registers just like had to be
done before FP hardware emerged from the test tube. A given floating point
representation would look and behave in exactly the same way on all machines
so the results would be repeatable. Choosing representation and operations
for speed is what I had in mind.

On most machines I'm familiar with, there is a very big difference
between "best possible performance" and "best possible performance
without using the FPU". The only exceptions I'm aware of are machines
without an FPU, which I'm sure are commonplace in some contexts, but
they would not be very useful for the kind of work I do.
 

Kenny McCormack

C is not limited to systems that Rosario1903 does not find ridiculous.

What a fine contribution to the general welfare...

Way to go, Kiki. Keep 'em comin'!
 

James Harris

James Kuyper said:
On most machines I'm familiar with, there is a very big difference
between "best possible performance" and "best possible performance
without using the FPU". The only exceptions I'm aware of are machines
without an FPU, which I'm sure are commonplace in some contexts, but
they would not be very useful for the kind of work I do.

I remember reading a floating point library (software floating point
implemented with the CPU's integer instructions) and seeing lots of very
nasty code with many tests and jumps. There was also necessarily ugly code
to extract fields and put them back together. Both of those things help
achieve compatibility but are slow.
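
For a flavour of the unpacking cost, here is just the field extraction
for an IEEE double (the bit layout is the standard one, not invented):

#include <stdint.h>
#include <string.h>

/* Split an IEEE-754 double into sign, biased exponent and fraction.
   Every software-FP operation pays for this unpacking and repacking. */
static void unpack(double d, int *sign, int *exp, uint64_t *frac)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);      /* well-defined, unlike a cast */
    *sign = (int)(bits >> 63);
    *exp  = (int)((bits >> 52) & 0x7FF);
    *frac = bits & 0xFFFFFFFFFFFFFull;   /* 52 bits; hidden bit separate */
}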

Based on that, a different representation and set of specifications could
make a big difference to software floating point performance. I wondered how
much of a difference could be made to performance and what compromises would
be needed to achieve it.

I recently took a course in high-performance computing. One point made there
was that because FPUs are now so fast, often the limiting factor is getting
datasets to and from memory. So there may be cases where software floating
point could be very fast. It would certainly be flexible, allowing certain
precisions and ranges as needed by the programmer.

James
 

Rosario1903

So why don't CPUs support, *at last*, the operations [+ - / * < > <= >= & |
^ not] on 8-bit unsigned, and on one other unsigned type among 16, 32, 64
and 128 bits, with its operations on unsigned??

What if the CPU doesn't _have_ 8, 16, 32, 64, and 128-bit integers?

They are not in fashion... and I think they compute in a more complex way;
more $ for writing programs for them.
What if the CPU doesn't represent pointers as plain integers in the
first place?

They are not in fashion... and I think they compute in a more complex way;
more $ for writing programs for them.
No, they don't "have to", and they didn't. And there are good reasons
that they didn't.

What is the reason?

How is it possible that, for a,b in A⊂N, they don't know what a*b is in A,
knowing a and b,

and in case of overflow don't all follow the same arbitrary result
[a*b=(a*b)%maxNumber]...

the same for all the other mathematical operations such as or, and, shr
etc. etc.

What is the reason? Is the reason that they don't know mathematics?
 

Malcolm McLean

On 08/06/2013 10:52 AM, James Harris wrote:

On most machines I'm familiar with, there is a very big difference
between "best possible performance" and "best possible performance
without using the FPU". The only exceptions I'm aware of are machines
without an FPU, which I'm sure are commonplace in some contexts, but
they would not be very useful for the kind of work I do.
It just depends on the program.

Sometimes getting a result in one second rather than one millisecond is
utterly fatal; other times it just wastes one second of an employee's time.
 

Stephen Sprunk

"Process-Context Identifiers". PCIDs are only meaningful in x86-64
mode, and not implemented on all processors, but these are basically
traditional ASIDs.

Ah; I was aware of the concept, but I hadn't heard anyone had actually
implemented them yet. Thanks!

S
 

Stephen Sprunk

Copying is unfortunately required by some system calls. Every time a
system call supplies a buffer for the OS to write into, the OS has to
copy data. In that case the design of the system call mandates at
least one copy operation.

If you need to move more data than will fit in the registers, e.g.
read() or write(), copying to/from a buffer seems mandatory, aside from
completely changing the I/O model a la mmap() or sendfile().
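
To illustrate the contrast with standard POSIX calls (error handling
omitted; "data.bin" is just a stand-in file name):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.bin", O_RDONLY);

    /* read(): the kernel must copy the file data into buf. */
    char buf[4096];
    ssize_t n = read(fd, buf, sizeof buf);
    printf("read %zd bytes (copied)\n", n);

    /* mmap(): the pages appear in the process without a copy -- but
       that is a change of I/O model, not a faster read(). */
    struct stat st;
    fstat(fd, &st);
    char *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    printf("first mapped byte: %d\n", p[0]);

    munmap(p, (size_t)st.st_size);
    close(fd);
    return 0;
}
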
I can answer best in connection with x86. On that, when the register
which is the root of the page table structures (CR3) is reloaded,
only the entries in the TLB are documented to be flushed. It could
take ages to flush the data caches to memory so leaving them alone is
a good thing. I cannot think of a reason to flush normal data at the
same time as the TLB.

If a cache is virtually tagged, rather than physically tagged, you would
need to flush it. As Robert Wessel explained, it may be that lower
level caches are virtual but higher ones are physical, so you may not
need to flush all levels to off-chip RAM. Still, even flushing L1 (and
maybe L2) would measurably degrade performance if you had to do it twice
for every syscall.

S
 

Stephen Sprunk

Sure. I'm not arguing anyone else's point. Each person can have his
own discussion.

In any case, much of the world is Arm now!

Very true, much to Intel's chagrin, but ARM pretty much looks the same
as x86 at this level. Power-of-2 data sizes, two's complement, linear
address space, etc.
This is one of the things that it would be useful either to
standardise or to allow the programmer to specify. Another is
division and remainder where dividend and divisor have opposite
signs.

We could probably come up with dozens of such oddities that _could_ be
programmer-adjustable but typically aren't. I can't think of anywhere
the standard requires that, though; at most, it gives implementations a
list of options to choose from. Is that enough of an improvement over
the status quo, where things are simply left undefined?

S
 

James Kuyper

I'm not sure how helpful that really is. Linearity within each
object is important, but as long as I can get a unique address for
each object (including each byte within each object), why should I
care how addresses of distinct objects relate to each other (apart
from "==" and "!=" working properly)?

You could sort an array of pointers, and use binary search to find a
pointer in that array.

For the question "does the pointer p point into the array q which has
a size of n bytes", it would be sufficient if the difference between
unrelated pointers yielded an unspecified (but not undefined)
result. Take the difference n, check that it is between 0 and the
size of the array, then compare p and &array[n].

If the difference between unrelated pointers is unspecified, how could
you be sure that it isn't a value between 0 and the size of the array?
 
