How best do I implement routing boxes in RTL?

news reader

In my design I have 256 3-bit registers, and in every clock cycle I need to
read or write 16 of them (data_o0, data_o1, ..., data_o15).
The read/write addresses are not totally random.

For example, if I arrange the registers as a 16x16 matrix, data_o0 only
accesses the zeroth row or column. data_o1 may access 20 of the
registers, but not all 256; data_o2 may access 30 of them, and so on.

If I code it so that every output can read from all 256 registers, the final
logic will be overkill and highly redundant.



If I use case statements to list each of the scenarios, the RTL code may end
up at around 500 kilobytes.
Will Design Compiler synthesize a 500 KB design efficiently? Will NCVerilog
compile and simulate it efficiently?

Are there any neater techniques to attack this problem?
 

Utku Özcan

Hi "news reader", my humble pearls in between...

news reader said:
> In the design I have 256 3-bit registers, every time I need to read or
> write 16 of them (data_o0, 1, ...15).
> The read/write address is not totally random.

It seems that you have an algorithm that accesses the values in a
deterministic pattern, and you therefore think you can implement it
with logic only.

I assume you are modeling an algorithm for a special matrix operation.

> For example, assuming that I arrange the register into a 16X16 matrix,
> data_o0 accesses among the zeros row or column. data_o1 may access from 20 of the
> registers, but not 256, data_o2 may access from 30 of the variables, etc.

These numbers do not tell us much. Which elements does each data_ox
(x = 1, 2, ...) access, and in what pattern?

> If I code such that every output reads from the 256 registers, the final
> logic will be overkill and highly redundant.

You think that the access pattern can be handled with pure logic, so you
have tried (or want) to write your logic to cover every case.

> If I use case statements to list each of the scenarios, the RTL code may end
> up 500 kilobyte.

This is reasonable then.

> Will design compiler synthesize a 500KB design efficiently?

What does "efficiently" mean for you? Speed or minimum logic?
If minimum logic, then please share with us the algorithm you are
trying to implement.

> Will NCVerilog compile and simulate it efficiently?

NCVerilog does not care about the logic implementation. It simulates the
behaviour of the system, no matter how the objects are linked.

> Are there any neater techniques to attack this problem?

Since you have not given much detail, I think you can implement this
with a RAM.
Why don't you use a RAM? You can then define the RAM addresses to
model your matrix, and generate addresses for the matrix positions that
your algorithm needs.

Utku.
 
news reader

Utku Özcan said:
> Hi "news reader", my humble pearls in between...
>
> It seems that you have an algorithm that handles a deterministic
> distribution of the values to be accessed. Therefore you think you can
> implement it with logic only.
>
> I assume you are modeling an algorithm for a special matrix operation.

It's not a matrix, but the memory access is intensive and each read/write
must complete in a single clock cycle, so registers are used instead of memory.

> The values do not give us much info. data_ox (x = 1, 2, ...) is
> accessing which elements and in which distribution?

In each clock cycle, 16 addresses are generated and 16 data values are
read or written. However, each of the 16 data values is only ever read from
or written to n of the 256 addresses (0<n<255).

> You think that the distribution of elements can be accessed with pure
> logic.
> Therefore you tried to model your logic to cover every case, or you
> want to do it so.
>
> This is reasonable then.

I use 32 case statements, and each case statement has fewer than 256
choices. Some have only 20 or 30 choices.

> What does "efficiently" mean for you? Speed or minimum logic?
> If minimum logic, then please share with us the algorithm you are
> trying to implement.
>
> NCVerilog does not care about logic implementation. It defines the
> behaviour of the system, no matter how the objects are linked.


For example, for a read operation:
--------------------- implementation A------------------
input [7:0] addr_i0, addr_i1, ...addr_i15;
output [2:0] dat_o0, dat_o1, ...dat_o15;

reg [2:0] mymemory[0:255]; // Main memory

dat_o0 <= mymemory[addr_i0];
dat_o1 <= mymemory[addr_i1];
.....
dat_o15 <= mymemory[addr_i15];
--------------------- End A------------------

--------------------- implementation B------------------

case (addr_i0) // I can calculate these options through simulations.
8'd0 : dat_o0 <= mymemory[0 ];
8'd5 : dat_o0 <= mymemory[5 ];
8'd54 : dat_o0 <= mymemory[54 ];
8'd122: dat_o0 <= mymemory[122];
8'd125: dat_o0 <= mymemory[125];
....
8'd166: dat_o0 <= mymemory[166];
8'd233: dat_o0 <= mymemory[233];
default: dat_o0 <= mymemory[0 ];
endcase



case (addr_i1)
8'd0 : dat_o1 <= mymemory[0 ];
8'd7 : dat_o1 <= mymemory[7 ];
8'd9 : dat_o1 <= mymemory[9 ];
8'd13 : dat_o1 <= mymemory[13 ];
8'd25 : dat_o1 <= mymemory[25 ];
8'd57 : dat_o1 <= mymemory[57 ];
8'd124: dat_o1 <= mymemory[124];
....
8'd133: dat_o1 <= mymemory[133];
8'd155: dat_o1 <= mymemory[155];
8'd227: dat_o1 <= mymemory[227];
default: dat_o1 <= mymemory[0 ];
endcase

....
case (addr_i15)
....
--------------------- End B------------------

In terms of hardware, is it certain that implementation B saves area
compared to A? Will such large chunks of RTL code cause DC or NCVerilog
to choke?


> Since you have not given much data, I think you can implement this
> stuff with a RAM.
> Why don't you use a RAM? Then you can define the RAM addresses to
> model your matrix. You will generate addresses to define the positions
> for your matrix which mimics your algorithm.

I used registers instead of RAM because of the memory throughput required.
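
One possible way to keep implementation B compact, instead of writing out every case statement by hand, is to describe each read port as a small parameterized module whose reachable addresses are given by a 256-bit mask. This is only a sketch under assumptions: the module name restricted_read_port, the parameter ALLOWED and the flattened bus mem_flat are all hypothetical, and Verilog-2001 is assumed (the array is flattened because Verilog-2001 cannot pass a memory array through a port). Since ALLOWED is a constant, a synthesis tool would normally be expected to drop the branches whose mask bit is 0, leaving a mux only over the reachable registers, but that should be checked in the area report rather than taken for granted.

```verilog
// Hypothetical helper: one read port that can only reach the entries
// whose bit is set in ALLOWED. All names here are illustrative.
module restricted_read_port #(
    parameter [255:0] ALLOWED = {256{1'b1}}  // bit k set => address k is reachable
) (
    input  wire [767:0] mem_flat,  // 256 x 3-bit registers, entry k at bits [3*k +: 3]
    input  wire [7:0]   addr,
    output reg  [2:0]   dat
);
    integer k;

    always @* begin
        // Default: entry 0, mirroring the default branch of implementation B.
        dat = mem_flat[2:0];
        // The loop is unrolled at elaboration; iterations with ALLOWED[k] == 0
        // can never assign dat, so they contribute no logic.
        for (k = 0; k < 256; k = k + 1)
            if (ALLOWED[k] && addr == k)
                dat = mem_flat[3*k +: 3];
    end
endmodule
```

Each of the 16 outputs would then be one instance with its own mask, for example a mask with only bits 0, 5, 54, ... set for dat_o0, so the source stays a few dozen lines plus 16 constants rather than hundreds of kilobytes of case items.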
 
jtw

I have had similar requirements (updating state variables, or some such)
where I used dual-port RAM: I use one port for the read, and the other
(delayed by a clock) for the modify-write.

The pipeline needs to be managed properly, but it can save tremendously on
registers, assuming that only one index needs to be updated at a time. If
all entries need concurrent access, well, a memory won't cut it. For my
applications, typically TDM processing of multiple channels, it works
well.

JTW
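
The read-then-modify-write arrangement described above might look roughly like the following. It is a minimal sketch, not the actual design: the module name rmw_dpram, the valid/index interface and the "+1" update are placeholders invented for illustration, and the memory is written behaviourally so that a tool may or may not map it onto a real dual-port RAM.

```verilog
// Sketch of a two-stage read / modify-write pipeline on a dual-port memory.
// The "+ 3'd1" update is a placeholder for the real modify step.
module rmw_dpram (
    input  wire       clk,
    input  wire       valid,    // an index is presented in this cycle
    input  wire [7:0] index,
    output reg  [2:0] rdata_q   // value read for the index of the previous cycle
);
    reg [2:0] mem [0:255];

    reg       valid_q;
    reg [7:0] index_q;

    // Value being written back in this cycle for the previous index.
    wire [2:0] wdata = rdata_q + 3'd1;

    // Back-to-back access to the same index: the write for that index has
    // not landed yet, so forward the value in flight instead of reading.
    wire hit = valid_q && (index_q == index);

    // Stage 1: read port.
    always @(posedge clk) begin
        valid_q <= valid;
        index_q <= index;
        rdata_q <= hit ? wdata : mem[index];
    end

    // Stage 2: write port, one clock after the read.
    always @(posedge clk)
        if (valid_q)
            mem[index_q] <= wdata;
endmodule
```

The forwarding term is one example of what "managing the pipeline" can involve; without it, two consecutive updates to the same index would both start from the stale value.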
