maybe PICTURE THIS
Mr. Moore's type of 25 X model, HOWEVER,
1) expanded to a sixteen by sixteen array for A and B buses, (5x5 -->
1a) both A and B are MULTIPLEXED into TWO buses, each with an ID
multiplier of sixteen for inter and intra processor reg maps,
( SIXTEEN is for a MAXIMUM bandwidth! )
1b) both A and B have peek ahead, two register element stacks
1c) !!! TIMES FOUR FOR FAULT TOLERANT ( SUPER COOLED?) VERSION OPTION
2) special C bus for local parallel memory ( Direct RamBus DRAM ?)
3) extra X and Y stacks ( along with the T/S/parameter stack )
4) that's it! this is my whole base model list!
5) iterate testing and recurse testing of my sixteen bit VLIW decode.
Mr Moore's X18 homepage ( obsolete? )
Mr Moore's 25 x ( obsolete? )
<head><title>Chuck Moore's X18 Forth Microcomputer Core</title>
<meta name=description content="A high-performance, low-power
microcomputer core. Available as a GDS II file. On-chip memory and
stacks. Forth instruction set.">
<meta name=keywords content="microprocessor, stacks, push-down stacks,
mips, power, low power, instructions, instruction set, DRAM, ROM,
watchdog, watchdog timer">
Updated 2001 June
<h1>X18 Microcomputer core</h1>
High performance, low power Forth engine. Optimized for compute-bound
portable applications. 18 bit address/data matches cache SRAM.
<li>2400 Mips, sustained
<li>Asynchronous (no external clock)
<li>2 16-deep push-down stacks
<li>27 0-operand instructions
<li>128 words ROM, 384 DRAM
<li>20 mW @ 1.8 V
<li>.2 sq mm</ul>
The X18 is an evolution of the F21 and i21 microprocessors. With .18um
transistors, it has 5x their speed and 1/5 their power. It has their
16-deep Return and Data stacks and 27 0-operand instructions, packed 3
per word. A 100ms watchdog timer assures continued operation. Boots
from on-chip ROM.
<p>Redesigned with new layout and simulation tools to be robust and to
minimize power. The computer can be throttled by a factor of 1024 to
provide 2.4 Mips using 20 uW. It may be stopped altogether, but will
have to reboot.
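<p>The throttling figures can be checked with a line or two of arithmetic. The sketch below assumes that speed and power both scale as 1/factor, which the page implies but does not state; the function and variable names are illustrative only.

```python
# Figures quoted in the text above.
FULL_SPEED_MIPS = 2400
FULL_POWER_MW = 20
THROTTLE_FACTOR = 1024


def throttled(mips, power_mw, factor):
    """Return (Mips, microwatts), ASSUMING both scale linearly as 1/factor."""
    return mips / factor, power_mw * 1000 / factor


slow_mips, slow_uw = throttled(FULL_SPEED_MIPS, FULL_POWER_MW, THROTTLE_FACTOR)
print(f"{slow_mips:.2f} Mips at {slow_uw:.1f} uW")  # prints "2.34 Mips at 19.5 uW"
```

Under that assumption the result rounds to the quoted 2.4 Mips and 20 uW.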
<p>Multiply (125 Mops) and divide (40 Mops) have been improved.
Internal memory is fast enough (1 ns) to sustain 2400 Mips. Data
access, especially to external SRAM, will slow this. Code is loaded
into on-chip DRAM for execution.
<h1> CPU </h1>
Forth code is highly factored into many small subroutines. An optimized
processor requires an efficient call/return mechanism. This is best
achieved with 2 push-down stacks. Each is implemented as a register
feeding a 16x18-bit RAM with 8-transistor bit cells. The current entry
is indicated by a 16-bit bidirectional, circular shift register.
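<p>A rough software model of such a stack may help. It is only a sketch: the class name, the direction the pointer moves on a push, and the write-then-load ordering are assumptions, not details taken from the X18 design. What it does follow from the text is a top register feeding a 16-word RAM, a one-hot 16-bit pointer that rotates both ways, and silent wraparound.

```python
MASK18 = (1 << 18) - 1   # 18-bit data path
DEPTH = 16               # words of stack RAM


class CircularStack:
    """Top-of-stack register backed by a 16-word RAM with a one-hot pointer."""

    def __init__(self):
        self.top = 0
        self.ram = [0] * DEPTH
        self.pointer = 1  # one-hot 16-bit circular shift register

    def _index(self):
        # Position of the single set bit selects the current RAM word.
        return self.pointer.bit_length() - 1

    def push(self, value):
        # Rotate left, spill the old top into RAM, then load the new top.
        self.pointer = ((self.pointer << 1) | (self.pointer >> (DEPTH - 1))) & 0xFFFF
        self.ram[self._index()] = self.top
        self.top = value & MASK18

    def pop(self):
        # Return the top, refill it from RAM, then rotate right.
        value = self.top
        self.top = self.ram[self._index()]
        self.pointer = ((self.pointer >> 1) | (self.pointer << (DEPTH - 1))) & 0xFFFF
        return value


stack = CircularStack()
for item in (1, 2, 3):
    stack.push(item)
print(stack.pop(), stack.pop(), stack.pop())  # prints "3 2 1"
```

Because the pointer is circular, pushing past the available depth in this model overwrites the oldest entry instead of raising an error.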
<p>One stack is used to store subroutine return addresses. All
processors have such a stack. The other is used to pass parameters to
and from subroutines. Other processors use registers or stack frames
for this purpose. However, all languages use an implicit stack to
evaluate expressions. Forth makes it explicit.
<p> As if emphasizing their importance, the stacks require 2/3 of the
CPU silicon area. It is difficult to achieve their 1-cycle access.
<p> The merits of stack vs. register designs have been argued for
decades. A comprehensive book, <em>Stack
Computers,</em> by Phil Koopman has been published online. To quote
Sec 6.2: "0-operand stack addressing ... makes stack machines superior
to conventional machines in the areas of program size, processor
complexity, system complexity, processor performance, and consistency
of program execution."
<p> The Forth ALU operates on the top 1 or 2 items of the parameter
stack, leaving the result there. This permits 0-operand instructions.
Eliminating register addresses permits shorter instructions, in this
case 5-bit. Several instructions are required to rearrange the stack.
And it's convenient to move things to the return stack.
<p> An address register is useful to reduce stack manipulation. It also
supports incrementing to address successive words in memory. Similar
use of the top of the return stack provides 2 addresses for
<p> A demultiplexor allows the packing of up to 3 instructions per
word. This increases the density of compiled code and reduces the
interference between instruction and data memory access. It keeps the
CPU busy while the next instruction is being fetched, providing a
sustained execution speed of 2400 Mips.
<p> This is implemented by a 3-bit shift register. The current bit
enables its slot into the instruction latch. A ready pulse from the
memory manager latches the high-order 5 bits (slot 0). The pulse is
delayed by a string of 14 inverters so that it repeats 2 ns later,
latching the next slot. Slot 2 stops the process, as does a jump or
fetch/store, until the next ready pulse.
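<p>The slot layout can be sketched in software. The text only confirms that slot 0 is the high-order 5 bits of the 18-bit word; placing slots 1 and 2 in the next two 5-bit fields and leaving the low 3 bits unused is an assumption, and the function names are made up for illustration.

```python
WORD_BITS = 18
SLOT_BITS = 5
SLOTS_PER_WORD = 3
SLOT_MASK = (1 << SLOT_BITS) - 1


def slot_shift(slot):
    # Slot 0 sits at the top of the word; later slots follow below it (assumed).
    return WORD_BITS - SLOT_BITS * (slot + 1)


def pack_slots(opcodes):
    """Pack up to three 5-bit opcodes into one 18-bit word, slot 0 first."""
    word = 0
    for slot, opcode in enumerate(opcodes[:SLOTS_PER_WORD]):
        word |= (opcode & SLOT_MASK) << slot_shift(slot)
    return word


def unpack_slots(word):
    """Return the three 5-bit opcodes in execution order."""
    return [(word >> slot_shift(slot)) & SLOT_MASK for slot in range(SLOTS_PER_WORD)]


# Opcodes 0x11 (2*), 0x17 (+) and 0x1f (drop) as listed in the instruction table.
word = pack_slots([0x11, 0x17, 0x1F])
print([hex(op) for op in unpack_slots(word)])  # prints "['0x11', '0x17', '0x1f']"
```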
<p> There are 27 simple instructions, exactly suited to Forth. This
allows 1-1 compilation of Forth source to machine code. On other
processors, each Forth primitive requires several instructions. The
situation is reversed for other languages: several Forth instructions
may be required for their primitives.
<tr><td>T<td>Top of stack
<tr><td>S<td>2nd number on stack
<tr><td>R<td>Top of Return stack
<p>Remember that fetch pushes the stack, store and binary operations
pop it.<table border>
<tr><td>0<td>word ;<td>Jump to subroutine; tail recursion
<tr><td>1<td>if<td>Jump to 'then' if T0-T17 are zero
<tr><td>3<td>-if<td>Jump to 'then' if T17 is one
<tr><td>8<td>@r<td>Fetch from address in R
<tr><td>9<td>@+<td>Fetch from address in A; increment A
<tr><td>b<td>@<td>Fetch from address in A
<tr><td>c<td>!r<td>Store into address in R
<tr><td>d<td>!+<td>Store into address in A; increment A
<tr><td>f<td>!<td>Store into address in A
<tr><td>11<td>2*<td>Shift T left 1 bit
<tr><td>12<td>2/<td>Shift T right 1 bit; preserve T17
<tr><td>13<td>+*<td>Add S to T if T0=1 (multiply step)
<tr><td>14<td>or<td>Exclusive-or S to T
<tr><td>15<td>and<td>And S to T
<tr><td>17<td>+<td>Add S to T
<tr><td>1c<td>push<td>Store into R
<tr><td>1d<td>a!<td>Store into A
<tr><td>1f<td>drop<td>Store T nowhere
<p> Another advantage of the 5-bit instruction is ease of decoding. A
tree of NAND and NOR gates leads from the instruction bus to the enable
for each register. This is facilitated by the limit of 10 lines to be
routed: each bit and its complement.
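<p>To make a few of the table entries concrete, here is a small software sketch of the arithmetic opcodes listed above (2*, 2/, or, and, +, drop) acting on 18-bit values. It follows only the one-line descriptions in the table. Representing the stack as a Python list and wrapping sums to 18 bits are assumptions, and jumps, fetch/store and the multiply step are deliberately left out.

```python
MASK18 = (1 << 18) - 1
T17 = 1 << 17  # high-order bit of an 18-bit word


def step(opcode, stack):
    """Apply one opcode to a list whose last element is T and next-to-last is S."""
    if opcode == 0x11:                      # 2*  shift T left 1 bit
        stack[-1] = (stack[-1] << 1) & MASK18
    elif opcode == 0x12:                    # 2/  shift T right 1 bit; preserve T17
        t = stack[-1]
        stack[-1] = (t >> 1) | (t & T17)
    elif opcode in (0x14, 0x15, 0x17):      # binary operations pop the stack
        t = stack.pop()
        s = stack.pop()
        if opcode == 0x14:                  # or   exclusive-or S to T
            result = s ^ t
        elif opcode == 0x15:                # and  and S to T
            result = s & t
        else:                               # +    add S to T (18-bit wrap assumed)
            result = (s + t) & MASK18
        stack.append(result)
    elif opcode == 0x1F:                    # drop store T nowhere
        stack.pop()
    else:
        raise ValueError(f"opcode {opcode:#x} is not modeled in this sketch")
    return stack


def run(opcodes, stack):
    for opcode in opcodes:
        step(opcode, stack)
    return stack


print(run([0x17, 0x11], [3, 5]))  # prints "[16]"
```

Here `+` followed by `2*` adds 3 and 5, then doubles the result, with no operand fields in either instruction.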
<head><title>Chuck Moore's 25x Forth Multicomputer Chip</title>
<meta name=description content="A parallel computer with 25 computers
on a chip. An on-chip network goes off-chip to array even more
<meta name=keywords content="microcomputer, microprocessor, parallel,
network, array, memory, coprocessors">
Updated 2001 June
An array of 25 microcomputers on a 7 sq mm die.
<li>.2 sq mm asynchronous microcomputer core
<li>5 x 5 array of cores: 60,000 Mips
<li>5 horizontal, 5 vertical parallel interconnect buses: 180 GHz
<li>Specialized computers to interface off-chip.
<li>Max power 500 mW @ 1.8 V, with 25 computers running
<li>100mAh battery life is 1 year, with 1 computer running throttled
<li>64-pin SOIC: mirrored pin-out to 4ns cache SRAM
<li>Array chips on 2-sided PCB</ul>
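<p>The battery-life figure can be cross-checked against the 20 uW quoted for a throttled X18. This is a back-of-envelope sketch that assumes a constant 1.8 V supply, a 365-day year, and a battery that delivers its full rated capacity; the variable names are illustrative.

```python
CAPACITY_MAH = 100
SUPPLY_V = 1.8
HOURS_PER_YEAR = 365 * 24  # 8760

# Average current that drains the battery in exactly one year, in microamps.
avg_current_ua = CAPACITY_MAH * 1000 / HOURS_PER_YEAR
# Corresponding average power in microwatts.
avg_power_uw = avg_current_ua * SUPPLY_V

print(f"{avg_current_ua:.1f} uA, {avg_power_uw:.1f} uW")  # prints "11.4 uA, 20.5 uW"
```

An average draw of roughly 20 uW is consistent with one computer running at the fully throttled rate.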
Availability of the tiny (.2 sq mm), asynchronous <a href=X18.html>X18
microcomputer core</a> naturally suggested arraying it on a chip. Its
extremely low power (20 mW) made that feasible. A 5x5 array was chosen
to fit on a 7 sq mm die, the smallest available prototype, though
larger arrays are possible. 25 computers running at 2400 Mips is a
total of 60,000 Mips. An unlimited supply.
<p>Communication among the computers is provided by a network with 5
horizontal and 5 vertical buses. Each computer has 2 bus registers to
access a horizontal and a vertical bus. Each bus is 18 bits wide and
can run at 1 GHz. All 10 buses can be active at once connecting a
20-computer subset. So total bandwidth is 180 GHz.
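<p>The headline totals follow from simple multiplication. The sketch below reads the "180 GHz" figure as buses x bus width x clock rate, i.e. 180 Gbit/s of aggregate transfer, and takes the 20-computer subset to mean one pair of computers per active bus; both readings are assumptions.

```python
CORES = 5 * 5
MIPS_PER_CORE = 2400
BUSES = 5 + 5            # 5 horizontal + 5 vertical
BUS_WIDTH_BITS = 18
BUS_CLOCK_GHZ = 1

total_mips = CORES * MIPS_PER_CORE
# Aggregate bit rate with every bus active at once (assumed interpretation).
total_gbit_per_s = BUSES * BUS_WIDTH_BITS * BUS_CLOCK_GHZ
# One sender and one receiver per active bus (assumed).
computers_connected = BUSES * 2

print(total_mips, total_gbit_per_s, computers_connected)  # prints "60000 180 20"
```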
<p>Each computer can be customized. Registers are added to the 16
processors at the edge of the array and connected to package pins. Each
computer is responsible for a particular interface. Protocols are
implemented with software.<ul>
<li>4 serial controllers
After booting from ROM, the computers await code downloaded from one of
Chosen to be the mirror image of an 18-bit cache memory chip. This is
the fastest memory available, with 4 ns access. Its package is a
100-pin SOIC. The 18-bit Multicomputer thus has 256K words of external
memory in 1 chip.
<p>Putting the Multicomputer chip on the top of a 2-sided PCB and the
SRAM chip on the bottom gives a very small footprint. A decoupling
capacitor is the only other component needed. An array of such pairs is
a multicomputer board. Connecting Multicomputer to SRAM is trivial,
with mm traces. Routing for power and a serial network is also easy.
Computers load code from the network.
<p>A parallel computer with 60Gips nodes! Power is determined by the
The chip is awaiting funding. If interested, contact <a
href=mailto:[email protected]>[email protected]</a>
<p>A 7 sq mm die, packaged, will cost about $1 in quantity 1,000,000.
Cost per Mip is 0.
<p>25 prototypes can be obtained from <a
href=http://www.mosis.com>MOSIS</a> for $14,000 with 16 week
turn-around. The TSMC .18um process has monthly submissions.
Maybe an important note.
ONLY the diagonal needs the X,Y and *SPECIAL* C register ( each
is unique for parallel ram access, a 4 x 4 x MemWidth multiplex for
maybe four Direct RAM Bus DRAMS)
the other (200+) nodes are used for programmable multiplexing
stack_machine_id[A/B-select, [ A[0..15]] or B[0==self,1..15]]
I am IBM.