How to design this datapath unit for DSP using VHDL/Verilog?

walala

Dear all,

I want to design an arithmetic datapath unit for digital signal processing
using VHDL and/or Verilog.

The input is 5 elements (either sequential or parallel), each 8 bits wide.
Each of these 5 inputs must be multiplied by a predefined constant
matrix (10x10, floating point scaled and rounded to integer). The output is
a 10x10 matrix that sums the five resulting matrices, with each element
12 bits wide. So for each element of the matrix, I can have a MAC unit. The
internal computation will be 16 bits.

Hence, for each set of 5 inputs x1, x2, x3, x4, x5, the output matrix is

Y = x1*C1 + x2*C2 + x3*C3 + x4*C4 + x5*C5, where Y, C1, C2, C3, C4, C5 are matrices.
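
A small software reference model of this equation can serve as a golden model for a later HDL testbench. The following is only a sketch under stated assumptions: the inputs are taken to be unsigned 8-bit values, `make_constants` is a hypothetical stand-in for the real C1..C5 (which the post does not give), and the reduction from the 16-bit internal result to the 12-bit output is left out because the post does not specify it.

```python
import random

N = 10  # matrix dimension from the post
K = 5   # number of inputs per output matrix


def make_constants(seed=0):
    """Hypothetical stand-in for C1..C5: small signed integer matrices."""
    rng = random.Random(seed)
    return [[[rng.randint(-8, 7) for _ in range(N)] for _ in range(N)]
            for _ in range(K)]


def reference_output(x, c):
    """Y[i][j] = sum over k of x[k] * C_k[i][j], using unbounded Python ints."""
    assert len(x) == K
    return [[sum(x[k] * c[k][i][j] for k in range(K)) for j in range(N)]
            for i in range(N)]


def fits_signed(value, bits):
    """True if value is representable as a two's-complement number of `bits` bits."""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))


if __name__ == "__main__":
    c = make_constants()
    y = reference_output([255] * K, c)  # largest assumed unsigned 8-bit inputs
    print(all(fits_signed(v, 16) for row in y for v in row))
```

Checking `fits_signed(v, 16)` against the real constants is one way to confirm that a 16-bit internal width is enough before committing to it in hardware.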

If I put a MAC on each element, I will have a purely parallel
architecture, but I would need 100 16-bit MAC units, which would consume
too many resources.

I am considering a parallel-serial architecture that outputs one row at a
time, which would be 10x12 bits, so the output would be produced
row by row.

I also need to streamline the datapath operation. Since there will be a
non-stop stream of 5-element inputs, the output will also stream non-stop.
So after one row has been output, that row can be reused for
computation/storage of the results for the next 5 input elements.
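
The row-by-row idea can be tried out in software before any HDL is written. The sketch below is a behavioural model, not a hardware description: it assumes ten accumulators (one per column) that each perform one multiply-accumulate per clock cycle, so a row takes 5 cycles and a full matrix takes 50. The function name `row_serial_mac` and the nested-list layout `c[k][i][j]` are illustrative choices, not something from the post.

```python
N, K = 10, 5  # matrix dimension and number of inputs, as in the post


def row_serial_mac(x, c):
    """Model ten MACs that finish one output row every K clock cycles.

    x is a list of K input values; c[k][i][j] is element (i, j) of constant
    matrix C_(k+1). Returns (rows, cycles).
    """
    rows = []
    cycles = 0
    for i in range(N):          # one pass per output row
        acc = [0] * N           # ten accumulators, cleared at the start of a row
        for k in range(K):      # one input element is consumed per clock cycle
            for j in range(N):  # in hardware these ten MACs run in parallel
                acc[j] += x[k] * c[k][i][j]
            cycles += 1
        rows.append(acc)        # row i is complete and can be streamed out
    return rows, cycles
```

With this partition the same ten MACs are reused for every row, which is the resource saving over 100 parallel units, at the cost of 50 cycles per matrix instead of 5.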

I am OK so far, but thinking further leaves me confused. How do I do the
sequential timing control (that is, decide what to do in which cycle)? Do
I need pipelining? How do I design the architecture? I know pipelining
theoretically from a one-semester course, but now that I have to
implement it, I am totally lost...
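
One common way to approach "what happens in which cycle" is to write the schedule down as a table driven by a single counter. The sketch below assumes a row-serial design with 5 cycles per row and 10 rows per matrix; the signal names (`row`, `input_sel`, `acc_clear`, `out_valid`) are hypothetical and only illustrate the idea. In hardware this would roughly correspond to a modulo-50 counter with a little decode logic.

```python
def control_schedule(n_rows=10, n_inputs=5):
    """Yield one dict of control signals per clock cycle for one output matrix."""
    for cycle in range(n_rows * n_inputs):
        row, k = divmod(cycle, n_inputs)
        yield {
            "cycle": cycle,
            "row": row,                      # which row of the constants to read
            "input_sel": k,                  # which of x1..x5 / C1..C5 is in use
            "acc_clear": k == 0,             # load the product instead of accumulating
            "out_valid": k == n_inputs - 1,  # the row result is complete after this cycle
        }


if __name__ == "__main__":
    for signals in control_schedule():
        print(signals)
```

If the MAC itself is pipelined with some latency, `out_valid` would simply be delayed by that many cycles; the schedule itself does not change.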

Finally, how to program this? Is there any examples for this?

Please help me!

Thanks a lot,

-Walala
 
David Jones

What is your throughput requirement and what technology are you using?

That will determine the amount of parallelism that you need.

If the requirement is low enough, then only one MAC unit will be required.

Next, you must define the timing of the inputs. If they are serial, then
it's easy: stuff the data into the MAC unit. Being pipelined (right?),
the MAC unit will output the answer N clocks later.

If you have more parallelism in your input data than you want in your
MAC units, then you will need to buffer the data. This circuit will be
easy to design once you define the timing requirements.
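
As a rough illustration of the buffering mentioned above, the sketch below models a holding register that latches five inputs at once and then replays them one per cycle to a single MAC. It is a behavioural model only; the class name `InputBuffer` and its methods are hypothetical, and the replay order (x1 to x5, repeating) is an assumption.

```python
class InputBuffer:
    """Holds one set of five 8-bit inputs and replays them one per cycle."""

    def __init__(self, depth=5):
        self.depth = depth
        self.regs = [0] * depth
        self.index = 0

    def load(self, values):
        """Latch a new set of inputs (all five arrive in parallel)."""
        assert len(values) == self.depth
        assert all(0 <= v <= 255 for v in values)  # assumed unsigned 8-bit
        self.regs = list(values)
        self.index = 0

    def next(self):
        """Return the input for this cycle and advance, wrapping after x5."""
        value = self.regs[self.index]
        self.index = (self.index + 1) % self.depth
        return value
```

The wrap-around matters because every output element needs all five inputs, so the same set is replayed many times before the next `load`.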
 
walala

Hi David,

Thanks for your answer!

The required output throughput is 33-50 MHz, i.e., it should output 33
million to 50 million 12-bit elements per second,

and each set of 5 inputs corresponds to 10x10 = 100 such 12-bit output
elements.

The technology I am going to use is 0.25 µm.

I think the inputs are naturally serial, but again, I am not sure how to
partition the internal MACs between parallel and serial operation, or how
to pace the outputs...

It seems the inputs arrive faster than the outputs can be produced, so
maybe I should make the inputs wait after they are fed into the unit?
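
The rates in this post can be turned into a rough MAC budget. The sketch below assumes each MAC unit completes one multiply-accumulate per clock cycle, and the 50 MHz clock used in the loop is purely an illustrative assumption (the thread does not state a clock frequency). The function name `mac_budget` is likewise hypothetical.

```python
def mac_budget(out_rate, clock_hz, inputs_per_set=5, outputs_per_set=100):
    """Return (MAC units needed, input elements per second), using integer math."""
    macs_per_second = out_rate * inputs_per_set   # 5 multiply-accumulates per output
    mac_units = -(-macs_per_second // clock_hz)   # ceiling division
    input_rate = out_rate * inputs_per_set // outputs_per_set
    return mac_units, input_rate


if __name__ == "__main__":
    for rate in (33_000_000, 50_000_000):
        # The 50 MHz clock is an assumption made only for this example.
        print(rate, mac_budget(rate, clock_hz=50_000_000))
```

Because each set of 5 inputs produces 100 outputs, a new input set is needed only once per 100 output elements at the stated output rate, so holding the inputs in a register while the outputs are computed is consistent with these numbers.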

Can you give some further advice on how to design this architecture and
how to handle the timing? I find it really difficult. Could you also point
me to some resources?

Thanks very much,

-Walala

 
walala

Can we assume the inputs are all present at once (parallel)? Since there are
only 5 inputs (5x8 = 40 bits), is that a reasonable assumption?

