Which is the most beautiful and memorable hardware structure in a CPU?

MitchAlsup

I also talked to Mitch about it at around that time, although he was preoccupied with spreadsheets for the

Any chance you could complete this sentence?

Perhaps from {88100, 88110, 88120, crazy, insane, Asilomar
participants, Hot Chips participants, all of the preceding?}

Mitch
 
KJ

There's your problem right there, Andy.  Everyone else will say:

1. It's already been done (heard way too many times in this forum).

2. It was obvious (emphasis on the past tense).

People answering either (1) or (2) assume that everything that can be
thought of is already in textbooks.  That's how they got to where they
are.

Textbooks? Didn't you get the memo? Everything that can be thought
of is on Google...or should I say Topeka ;)

KJ
 
glen herrmannsfeldt

(snip)
All I know is that I proposed having a separate pipestage
to rename registers, using a RAM (SRAM) table indexed by
logical register number returning physical register number,
in 1986 or 1987 - in Wen-mei Hwu's microprocessor design
class - after he had taken us through Tomasulo and HPSm.
I.e. I proposed eliminating the CAMs, replacing them by a
RAM and an additional pipestage.
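
A minimal sketch of the RAM-indexed rename table described above, as a toy Python model rather than anyone's actual design. The class name `RenameTable`, the free-list policy, and the register counts are assumptions made only for illustration. The point it shows is that source lookups are plain indexed reads of a small table, with no associative (CAM) search.

```python
class RenameTable:
    """Toy RAM-style rename map: logical register number -> physical register number."""

    def __init__(self, num_logical, num_physical):
        if num_physical < num_logical:
            raise ValueError("need at least one physical register per logical register")
        # Assumption: logical register i initially maps to physical register i.
        self.table = list(range(num_logical))
        # The remaining physical registers are free for allocation.
        self.free = list(range(num_logical, num_physical))

    def rename(self, dest, srcs):
        """Rename one instruction; returns (phys_dest, phys_srcs, old_phys_dest)."""
        # Sources are read by direct index: a RAM read, not an associative search.
        phys_srcs = [self.table[s] for s in srcs]
        if not self.free:
            raise RuntimeError("no free physical register; a real pipeline would stall")
        old = self.table[dest]
        new = self.free.pop(0)
        # The destination gets a fresh physical register; later readers see the new mapping.
        self.table[dest] = new
        return new, phys_srcs, old

    def release(self, phys):
        """Return a physical register to the free list (e.g. when the old mapping retires)."""
        self.free.append(phys)


rt = RenameTable(num_logical=4, num_physical=8)
first = rt.rename(dest=1, srcs=[2, 3])   # (4, [2, 3], 1)
second = rt.rename(dest=1, srcs=[1, 2])  # (5, [4, 2], 4): the source r1 now reads p4
```

The second instruction reads the renamed copy of r1 simply by indexing the table, which is the extra pipestage traded against the CAM.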

With the 360/91 system, though, values can easily have more than
one destination. I suppose that could be done other ways,
too, but it is especially convenient that way.
The idea seemed new to everyone who encountered it. It was
not universally accepted as good. Indeed, I remember arguing
with Tom Olson of AMD (if memory serves), who said that
spending an extra pipestage was not a good idea.
Many people say that the CDB was an important invention.
I think it was a bad idea - long wires, CAMs.

If the wires are too long, then add more pipeline stages along
the way. With 750ns 16-way interleaved core, though, the 91
wasn't going to get much faster than 60ns.
Conceptually it is elegant, but implementation wise it is a bad idea.
The important thing is taking that conceptually elegant
CAM-ful idea, and implementing it in an efficient non-CAM manner.
The modern style of register renaming accomplishes this -
certainly for the registers, but also, depending on the
system, for the reservation stations (if those are still
being used).
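
To make the CAM versus non-CAM contrast concrete, here is a rough Python sketch under stated assumptions. `cdb_broadcast` models a common-data-bus style broadcast in which every operand slot in every reservation station compares its tag against the broadcast tag. `indexed_writeback` and `operands_ready` model one possible indexed alternative keyed by physical register number. All names and data layouts are hypothetical; this is not a description of the 360/91 or of any specific modern core.

```python
def cdb_broadcast(stations, tag, value):
    """CAM-style wakeup: every operand slot compares its tag with the broadcast tag.

    `stations` is a list of stations, each a list of operand dicts with
    'tag' (producer tag, or None if the value is present) and 'value'.
    Returns the number of comparisons performed.
    """
    compares = 0
    for station in stations:
        for operand in station:
            compares += 1
            if operand["tag"] == tag:
                # Capture the value; one broadcast can feed several destinations.
                operand["tag"] = None
                operand["value"] = value
    return compares


def indexed_writeback(regfile, ready, phys, value):
    """RAM-style writeback: one indexed write to the register file and ready bits."""
    regfile[phys] = value
    ready[phys] = True


def operands_ready(ready, phys_srcs):
    """An instruction checks readiness by indexing the ready bits with its source numbers."""
    return all(ready[p] for p in phys_srcs)


stations = [
    [{"tag": 7, "value": None}, {"tag": None, "value": 1.0}],
    [{"tag": 7, "value": None}, {"tag": 9, "value": None}],
]
n = cdb_broadcast(stations, tag=7, value=2.5)  # compares against all 4 operand slots

regfile, ready = [0.0] * 8, [False] * 8
indexed_writeback(regfile, ready, phys=5, value=2.5)
```

In the broadcast version the comparison count grows with the number of waiting operand slots, which is the cost being objected to. The indexed version does a fixed amount of work per result.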

Logic was much more expensive then than now, so the
tradeoffs are likely different. If you used RAM tables
with more than one entry for each source, you could do
multiple destinations easily.
I'd love to see a reference for this.

There is an issue of the IBM Journal of Research and
Development pretty much devoted to the 91. I believe
it is in there. The 91 is pretty much a favorite for
books on pipelined processor design, mostly referencing
that journal issue.
I believe that a UWisc patent on this was one of the things
that resulted in a big payment from Intel to UWisc.
Myself, I thought it was obvious.

-- glen
 
Weng Tianxiang

All I know is that I proposed having a separate pipestage to rename registers, using a RAM (SRAM) table indexed by
logical register number returning physical register number, in 1986 or 1987 - in Wen-mei Hwu's microprocessor design
class - after he had taken us through Tomasulo and HPSm.

I.e. I proposed eliminating the CAMs, replacing them by a RAM and an additional pipestage.

The idea seemed new to everyone who encountered it. It was not universally accepted as good.  Indeed, I remember arguing
with Tom Olson of AMD (if memory serves), who said that spending an extra pipestage was not a good idea.

I also talked to Mitch about it at around that time, although he was preoccupied with spreadsheets for the


Many people say that the CDB was an important invention.  I think it was a bad idea - long wires, CAMs.

Conceptually it is elegant, but implementation wise it is a bad idea.

The important thing is taking that conceptually elegant CAM-ful idea, and implementing it in an efficient non-CAM manner.

The modern style of register renaming accomplishes this - certainly for the registers, but also, depending on the
system, for the reservation stations (if those are still being used).


I'd love to see a reference for this.

I believe that a UWisc patent on this was one of the things that resulted in a big payment from Intel to UWisc.

Myself, I thought it was obvious.

Hi Andy,
Your view is insightful.

Can you tell me the UWisc patent number or its title?

I have a design that is expected to work in a core of a modern
multiprocessor running at more than 3 GHz,
and the output drives one target.

The design can have two implementations:
1. One source always drives the one target, and it uses a lot of power;
2. 16 sources can selectively use a common output bus to drive the
target with much less power.

The output must be finished within 1 clock cycle.

Which implementation is wiser in the real world?

In other words, is having 16 sources selectively drive a common output bus
with one target
a wise implementation at more than 3 GHz?

Thank you.

Weng




 
MitchAlsup

Got distracted, forgot to finish.  Wasn't exactly sure I remembered what you were working on.

Remember the first time I met you, Mitch, and Willie Anderson? What were you working on?  Memory bandwidth spreadsheets
for the 88110? SIMD vectors?  I remember we talked about DRAM bank structure, and you made your usual "If DRAMs were
designed the way I want them to be designed..." speech.  I remember that you were interested in Linpack, while I was
interested in OOO and GCC.

Willie was on 88110
Sounds like I was already on 88120
As to DRAM see USPTO 5367494

It was not so much that I was concentrating on Linpack. We (Shebanow
and I) were trying to build a machine that could perform as if it were
a vector machine on vectorizable codes (without vector instructions,
i.e. native 88100 instructions at 6 per cycle) and also perform well
on GCC-like spaghetti codes. Linpack (Matrix 300) was simply the
vector code example.

Mitch
 
glen herrmannsfeldt

(snip)
It was not so much that I was concentrating on Linpack. We (Shebanow
and I) were trying to build a machine that could perform as if it were
a vector machine on vectorizable codes (without vector instructions,
i.e. native 88100 instructions at 6 per cycle) and also perform well
on GCC-like spaghetti codes. Linpack (Matrix 300) was simply the
vector code example.

The 360/91 was also designed to perform well on non-vectorized code.
Well, on the code generated for other 360's. Among others is
loop mode where for a small enough loop it stops fetching
instructions from memory (they are in a special cache).
The goal was one instruction per cycle. (With 750ns core it
wasn't likely to do more than that.)

The 360/91 even had to handle self-modifying code, including
instructions that might have already been fetched. The IBM
Fortran library for OS/360 did use some self-modifying code.
(No recursion in Fortran 66 so it wasn't so hard to do.)

-- glen
 
Tim McCaffrey

The 360/91 was also designed to perform well on non-vectorized code.
Well, on the code generated for other 360's. Among others is
loop mode where for a small enough loop it stops fetching
instructions from memory (they are in a special cache).
The goal was one instruction per cycle. (With 750ns core it
wasn't likely to do more than that.)

Must have got that idea from the CDC 6600.
The 360/91 even had to handle self-modifying code, including
instructions that might have already been fetched. The IBM
Fortran library for OS/360 did use some self-modifying code.
(No recursion in Fortran 66 so it wasn't so hard to do.)

SMC was not allowed in the CDC instruction stack (i.e. non-coherent cache).

- Tim
 
