How best do I implement routing boxes in RTL?

Discussion in 'VHDL' started by news reader, Mar 8, 2007.

  1. news reader

    news reader Guest

    In the design I have 256 3-bit registers, every time I need to read or
    write 16 of them (data_o0, 1, ...15).
    The read/write address is not totally random.

    For example, assuming that I arrange the registers into a 16x16 matrix,
    data_o0 accesses only the zeroth row or column. data_o1 may access 20 of
    the registers rather than all 256, data_o2 may access 30 of them, etc.

    If I code such that every output reads from the 256 registers, the final
    logic will be overkill and highly redundant.



    If I use case statements to list each of the scenarios, the RTL code may
    end up at 500 kilobytes.
    Will Design Compiler synthesize a 500KB design efficiently? Will NCVerilog
    compile and simulate it efficiently?

    Are there any neater techniques to attack this problem?
    news reader, Mar 8, 2007
    #1

  2. Hi "news reader", my humble pearls in between..

    news reader wrote:

    > In the design I have 256 3-bit registers, every time I need to read or
    > write 16 of them (data_o0, 1, ...15).
    > The read/write address is not totally random.


    It seems that you have an algorithm that handles a deterministic
    distribution of the values to be accessed. Therefore you think you can
    implement it with logic only.

    I assume you are modeling an algorithm for a special matrix operation.

    > For example, assuming that I arrange the register into a 16X16 matrix,
    > data_o0 accesses among the zeros row or column. data_o1 may access from 20 of the
    > registers, but not 256, data_o2 may access from 30 of the variables, etc.


    Those numbers do not tell us much. Which elements does each data_ox
    (x = 0, 1, ...) access, and with what distribution?

    > If I code such that every output reads from the 256 registers, the final
    > logic will be overkill and highly redundant.


    You believe that the access pattern can be implemented in pure logic,
    so you have modelled (or want to model) the logic to cover every case.

    > If I use case statements to list each of the senarios, the RTL code may end
    > up 500 kilobyte.


    This is reasonable then.

    > Will design compiler synthesize a 500KB design efficiently?


    What does "efficient" mean to you: speed or minimum logic?
    If minimum logic, then please share with us the algorithm you are
    trying to implement.

    > Will NCVerilog compile and simulate it efficiently?


    NCVerilog does not care about the logic implementation. It simulates the
    behaviour of the system, no matter how the objects are linked.

    > Are there any neater techniques to attack this problem?


    Since you have not given much detail, I think you can implement this
    with a RAM.
    Why don't you use one? You can then define the RAM addresses to
    model your matrix, and generate the addresses for the matrix positions
    that your algorithm needs.
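
As a small illustration of the addressing idea above, here is a Python sketch that maps matrix positions to RAM addresses. It assumes a row-major layout (address = row * 16 + col, which in hardware is just the concatenation {row, col} of two 4-bit fields). The names to_addr and to_row_col are made up for this example, and reading "zeros row or column" as "row 0 or column 0" is an interpretation that the thread does not confirm.

```python
ROWS, COLS = 16, 16  # 16x16 arrangement of the 256 registers


def to_addr(row, col):
    """Row-major address of matrix element (row, col)."""
    if not (0 <= row < ROWS and 0 <= col < COLS):
        raise ValueError("row/col out of range")
    return row * COLS + col


def to_row_col(addr):
    """Inverse mapping: address back to (row, col)."""
    if not 0 <= addr < ROWS * COLS:
        raise ValueError("address out of range")
    return divmod(addr, COLS)


# Addresses in row 0 or column 0 (one possible reading of the data_o0 case).
row0_or_col0 = sorted(
    {to_addr(0, c) for c in range(COLS)} | {to_addr(r, 0) for r in range(ROWS)}
)
```

Under that reading, data_o0 would only ever need the addresses in row0_or_col0 rather than the full 0..255 range.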

    Utku.
    Utku Özcan, Mar 8, 2007
    #2

  3. news reader

    news reader Guest

    "Utku Özcan" <> wrote in message
    news:...
    >
    > Hi "news reader", my humble perls in between..
    >
    > news reader schrieb:
    >
    >> In the design I have 256 3-bit registers, every time I need to read or
    >> write 16 of them (data_o0, 1, ...15).
    >> The read/write address is not totally random.

    >
    > It seems that you have an algorithm that handles a deterministic
    > distribution of the values to be accessed. Therefore you think you can
    > implement it with logic only.
    >
    > I assume you are modeling an algorithm for a special matrix operation.
    >


    It's not a matrix, but the memory access is intensive: every read/write
    must complete in a single clock cycle, so registers are used instead of
    memory.


    >> For example, assuming that I arrange the register into a 16X16 matrix,
    >> data_o0 accesses among the zeros row or column. data_o1 may access from
    >> 20 of the
    >> registers, but not 256, data_o2 may access from 30 of the variables,
    >> etc.

    >
    > The values do not give us much info. data_ox (x = 1, 2, ...) is
    > accessing which elements and in which distribution?
    >


    In each clock cycle, 16 addresses are generated and 16 data values are
    read or written. However, each of the 16 ports can only reach n of the
    256 addresses (0<n<255).


    >> If I code such that every output reads from the 256 registers, the final
    >> logic will be overkill and highly redundant.

    >
    > You think that the distribution of elements can be accessed with pure
    > logic.
    > Therefore you tried to model your logic to cover every case, or you
    > want to do it so.
    >
    >> If I use case statements to list each of the senarios, the RTL code may
    >> end
    >> up 500 kilobyte.

    >
    > This is reasonable then.
    >



    With the case-statement approach I use 32 case statements, each with
    fewer than 256 choices. Some have only 20 or 30 choices, etc.


    >> Will design compiler synthesize a 500KB design efficiently?

    >
    > What means "efficience" for you? Speed or minimum logic?
    > If minimum logic, then please share with us the algorithm you are
    > trying to implement.
    >
    >> Will NCVerilog compile and simulate it efficiently?

    >
    > NCVerilog does not care about logic implementation. It defines the
    > behaviour of the system, no matter how the objects are linked.
    >



    For example in read operation,
    --------------------- implementation A------------------
    input [7:0] addr_i0, addr_i1, ...addr_i15;
    output [2:0] dat_o0, dat_o1, ...dat_o15;

    reg [2:0] mymemory[0:255]; // Main memory

    dat_o0 <= mymemory[addr_i0];
    dat_o1 <= mymemory[addr_i1];
    .....
    dat_o15 <= mymemory[addr_i15];
    --------------------- End A------------------

    --------------------- implementation B------------------

    case (addr_i0) // I can calculate these options through simulations.
    8'd0 : dat_o0 <= mymemory[0 ];
    8'd5 : dat_o0 <= mymemory[5 ];
    8'd54 : dat_o0 <= mymemory[54 ];
    8'd122: dat_o0 <= mymemory[122];
    8'd125: dat_o0 <= mymemory[125];
    ....
    8'd166: dat_o0 <= mymemory[166];
    8'd233: dat_o0 <= mymemory[233];
    default: dat_o0 <= mymemory[0 ];
    endcase



    case (addr_i1)
    8'd0 : dat_o1 <= mymemory[0 ];
    8'd7 : dat_o1 <= mymemory[7 ];
    8'd9 : dat_o1 <= mymemory[9 ];
    8'd13 : dat_o1 <= mymemory[13 ];
    8'd25 : dat_o1 <= mymemory[25 ];
    8'd57 : dat_o1 <= mymemory[57 ];
    8'd124: dat_o1 <= mymemory[124];
    ....
    8'd133: dat_o1 <= mymemory[133];
    8'd155: dat_o1 <= mymemory[155];
    8'd277: dat_o1 <= mymemory[277]; // 277 does not fit in 8 bits and is outside mymemory[0:255]; probably a typo
    default: dat_o1 <= mymemory[0 ];
    endcase

    ....
    case (addr_i15)
    ....
    --------------------- End B------------------
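
If the reachable addresses for each port are already known from simulation, the case statements in implementation B need not be typed by hand. One possible approach is to keep the lists in a table and let a script print the Verilog. The Python sketch below is only an illustration: REACHABLE, emit_case and emit_all are hypothetical names, and the address lists contain only the in-range entries visible above (the elided ones are not filled in).

```python
# Hypothetical table: for each read port, the register indices it may reach.
# Only the entries shown in the post are listed; the real lists are longer.
REACHABLE = {
    0: [0, 5, 54, 122, 125, 166, 233],
    1: [0, 7, 9, 13, 25, 57, 124, 133, 155],
}


def emit_case(port, addrs, mem="mymemory", width=8):
    """Return the Verilog case statement for one read port as a string."""
    lines = [f"case (addr_i{port})"]
    for a in sorted(set(addrs)):
        # Reject indices that cannot be expressed in the address width.
        if not 0 <= a < 2 ** width:
            raise ValueError(f"address {a} does not fit in {width} bits")
        lines.append(f"  {width}'d{a}: dat_o{port} <= {mem}[{a}];")
    # Unlisted addresses fall back to entry 0, as in the hand-written code.
    lines.append(f"  default: dat_o{port} <= {mem}[0];")
    lines.append("endcase")
    return "\n".join(lines)


def emit_all(table):
    """Concatenate the case statements for every port in the table."""
    return "\n\n".join(emit_case(p, table[p]) for p in sorted(table))


if __name__ == "__main__":
    print(emit_all(REACHABLE))
```

With this arrangement the file kept under version control is the small table, and the large RTL file becomes a generated artifact. The range check also catches out-of-range indices before they reach the synthesis tool.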

    In terms of hardware implementation, is it certain that implementation B
    saves hardware compared to A? Will such large chunks of RTL code cause
    DC or NCVerilog to choke?
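
A rough way to compare A and B before synthesis is to count 2:1 multiplexer equivalents: selecting one of n words of w bits takes about (n - 1) * w of them. The Python sketch below applies that first-order estimate. It ignores the address-decode logic that B's sparse case items need and anything the synthesis tool optimises away, and the per-port fan-ins are placeholders (only the figures 20 and 30 come from the thread), so treat the result as an order-of-magnitude guide only.

```python
def mux2_equivalents(n_sources, width=3):
    """2:1 mux cells needed to select one of n_sources words of `width` bits."""
    if n_sources < 1:
        raise ValueError("a port needs at least one source")
    return (n_sources - 1) * width


PORTS = 16

# Implementation A: every port can read any of the 256 registers.
cost_a = PORTS * mux2_equivalents(256)

# Implementation B: hypothetical fan-ins. 20 and 30 are from the thread,
# the remaining 14 ports are assumed to reach 64 registers each.
fan_in_b = [20, 30] + [64] * 14
cost_b = sum(mux2_equivalents(n) for n in fan_in_b)

if __name__ == "__main__":
    print("A:", cost_a, "B:", cost_b)
```

Under these assumed fan-ins B comes out several times smaller than A, but the real ratio depends entirely on the actual reachable sets.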



    >> Are there any neater techniques to attack this problem?

    >
    > Since you have not given much data, I think you can implement this
    > stuff with a RAM.
    > Why don't you use a RAM? Then you can define the RAM addresses to
    > model your matrix. You will generate addresses to define the positions
    > for your matrix which mimics your algorithm.
    >


    I used registers instead of RAM due to the memory throughput.



    > Utku.
    >
    news reader, Mar 9, 2007
    #3
  4. jtw

    jtw Guest

    I have had similar requirements (updating state variables, or some such)
    where I used dual-port RAM; I use one port for the read, and the other
    (delayed a clock) for the modify-write.

    The pipeline needs to be managed properly, but it can save tremendously on
    registers, assuming that only one index needs to be updated at a time. If
    all entries need concurrent access, well, a memory won't cut it. For my
    application(s), typically TDM processing of multiple channels, it works
    well.
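
To make the read/modify-write timing concrete, here is a small cycle-level Python model of the scheme described above: one port reads in the current cycle and the other port writes the updated value one clock later. The class name RmwPipeline and the bypass for back-to-back accesses to the same index are assumptions about what "managed properly" might involve; the post does not spell that out.

```python
class RmwPipeline:
    """Behavioural model: read port now, write port one clock later."""

    def __init__(self, depth=256):
        self.ram = [0] * depth
        self.pending = None  # (addr, value) waiting for the write port

    def step(self, addr, update):
        """One clock: read `addr`, retire last cycle's write, queue a new one."""
        raw = self.ram[addr]  # read port sees the pre-write contents
        if self.pending is not None:
            p_addr, p_val = self.pending
            if p_addr == addr:
                raw = p_val  # bypass: forward the value still in flight
            self.ram[p_addr] = p_val  # write port, delayed by one clock
        self.pending = (addr, update(raw))
        return raw

    def flush(self):
        """Retire the last queued write once the input stream has ended."""
        if self.pending is not None:
            p_addr, p_val = self.pending
            self.ram[p_addr] = p_val
            self.pending = None


if __name__ == "__main__":
    pipe = RmwPipeline()
    for a in [3, 3, 7, 3]:
        pipe.step(a, lambda v: (v + 1) & 7)  # 3-bit increment with wrap
    pipe.flush()
    print(pipe.ram[3], pipe.ram[7])
```

Without the bypass, the second access to index 3 would read the stale value and one increment would be lost. That hazard is the sort of thing the pipeline management has to cover.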

    JTW
    jtw, Mar 11, 2007
    #4
