Synthesis of Concurrent Statements for FIR Filter

heilig.brian

Dear List,

I am trying to implement a 16-tap FIR Low-Pass Filter and have written
the convolution in VHDL (of which I am a beginner). The input sequence
'x' is a 1-bit sequence of 1's and 0's. This is to be converted to 1's
and -1's and convolved with the impulse sequence 'h'. My goal is for
the convolution portion of the filter to be completely asynchronous
and parallel. That is with each clock cycle 16 bits of the input
sequence are convolved with the impulse response providing a single
12-bit output. Each element of the impulse response 'h' is a 10-bit
signed integer. The input sequence 'x' is a known sequence and I am
sure the output sequence 'y' will always fit into 12 bits.

Here is the code:
-- 16-tap FIR Low-Pass Filter Convolution Function
--
--
-- When convolved with the code it will produce a maximum value that
-- will fit into 12-bits

library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.numeric_std.all;

entity fir_lpf_conv is
        port (
                x: in std_logic_vector(15 downto 0);
                y: out std_logic_vector(11 downto 0)
        );
end fir_lpf_conv;

architecture fir_lpf_conv_arch of fir_lpf_conv is
        type coef_type is array(0 to 15) of integer range -511 to 511;
        constant h: coef_type :=
                (4,-2,-28,-53,-17,128,345,511,511,345,128,-17,-53,-28,-2,4);
        signal mult: coef_type;
        signal sum: integer range -2047 to 2047;
begin
        blabla: for i in x'range generate
                mult(i) <= h(i) when x(i)='1' else -h(i);
        end generate;
        sum <= mult(0) + mult(1) + mult(2) + mult(3) + mult(4) + mult(5)
             + mult(6) + mult(7) + mult(8) + mult(9) + mult(10) + mult(11)
             + mult(12) + mult(13) + mult(14) + mult(15);
        y <= std_logic_vector(to_signed(sum,12));
end fir_lpf_conv_arch;

I haven't simulated it yet, but I have a sneaky feeling it will not do
what I expect. Even if it does do what I want it to then I'd like to
understand why.

The code should multiply each h element by the corresponding x element
(with zeros converted to -1s) in parallel, AND THEN sum the result
into sum AND THEN put the 'sum' result into 'y'. My use of AND THEN in
that statement makes me think I need sequential code, that is the
multiply should be done in parallel, and the sum should be done in
parallel, but the sum should use the results of the multiply. However
when I look at sequential code it is always clock or event driven and
I don't think that's what I need. All this should be done in less than
1/2 clock cycle.

I could see the compiler synthesizing the above code in two different
ways:

1. Multiply in parallel AND THEN add the results in parallel. (this
would be good)
2. Multiply in parallel and add in parallel. The parallel sum will use
the previous values stored in 'mult', and possibly some updated values
in 'mult' depending on the exact timing. (this would be bad)

So my question is: if the code is correct, then what is the rule for
synthesis? How does the compiler know that I want 'AND THEN' behavior?
If the code is incorrect, what do I write to get 'AND THEN' behavior
that is not clock driven?

I also have a couple less important questions:
Is there a better way to write my sum using a for loop? I couldn't get
it to compile.
I really don't need the intermediate signal 'sum'. I'd like to just
sum into 'y' but I get a type error because the synthesizer doesn't
know if the stuff on the right is signed or unsigned.

Thank You!
Brian
 
Tricky

What you have written contains 0 multipliers, 15 x 2-1 muxes, no
registers and a very long adder chain. It is very very unlikely that
this will work. You will HAVE to break up the adder chain and pipeline
it - 16 adds just isn't going to work without pipelining. You normally
only want to add 2-3 numbers in a single clock cycle.

You also say "x" is a 1 bit sequence? Is it coming in serially, or is
it really coming in as a bus like you've written? As it stands, it
expects all the X bits to be there at the same time.

I suggest you read up on digital design. This code is in no way
synthesisable. Here is a hint (I'm going to assume that X is a
synchronous input and not asynchronous like you said).

It should give you a latency of 4 clock cycles:

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity fir_lpf_conv is
        port (
                clk : in std_logic;

                x: in std_logic_vector(15 downto 0);
                y: out std_logic_vector(11 downto 0)
        );
end fir_lpf_conv;

architecture fir_lpf_conv_arch of fir_lpf_conv is
        type coef_type is array(0 to 15) of integer range -511 to 511;
        constant h: coef_type :=
                (4,-2,-28,-53,-17,128,345,511,511,345,128,-17,-53,-28,-2,4);
        signal mult: coef_type;

        subtype sum_range_t is integer range -2047 to 2047;

        signal sum: sum_range_t;

        signal sum01 : sum_range_t;
        signal sum23 : sum_range_t;
        -- .....etc
begin
        blabla: for i in x'range generate
                mult(i) <= h(i) when x(i)='1' else -h(i);
        end generate;

        sum_proc : process(clk)
          variable sum_total : integer;
        begin
          if rising_edge(clk) then

            -- stage 1: pairs of mux outputs
            sum01   <= mult(0) + mult(1);
            sum23   <= mult(2) + mult(3);
            -- ........etc

            -- stage 2: pairs of stage 1 results
            sum0123 <= sum01 + sum23;
            -- .......etc

            -- last stage: final add, registered into y
            sum_total := sum0to7 + sum8to15;
            y         <= std_logic_vector( to_signed( sum_total, 12) );

          end if;
        end process;

end fir_lpf_conv_arch;


Also - delete std_logic_arith from the code. It clashes with
numeric_std. Always use the numeric_std package (which you have).
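
For anyone who wants the hint above without the "etc" lines, here is one way it could be filled in. Treat it as a sketch under assumptions rather than the exact code that was compiled: the entity name fir_lpf_conv_pipe, the per-stage array types and the signal names s1/s2/s3 are all invented here, and the stage ranges are widened by one bit per level instead of reusing a single subtype.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity fir_lpf_conv_pipe is  -- hypothetical name for this sketch
  port (
    clk : in  std_logic;
    x   : in  std_logic_vector(15 downto 0);
    y   : out std_logic_vector(11 downto 0)
  );
end fir_lpf_conv_pipe;

architecture rtl of fir_lpf_conv_pipe is
  type coef_type is array (0 to 15) of integer range -511 to 511;
  constant h : coef_type :=
    (4,-2,-28,-53,-17,128,345,511,511,345,128,-17,-53,-28,-2,4);

  -- One array per pipeline stage (assumed names); each level gets
  -- one more bit of range than the level feeding it.
  type lvl1_t is array (0 to 7) of integer range -1023 to 1023;
  type lvl2_t is array (0 to 3) of integer range -2047 to 2047;
  type lvl3_t is array (0 to 1) of integer range -4095 to 4095;

  signal mult : coef_type;
  signal s1   : lvl1_t := (others => 0);
  signal s2   : lvl2_t := (others => 0);
  signal s3   : lvl3_t := (others => 0);
begin
  -- Same sign-select mux as in the original post
  sign_sel : for i in x'range generate
    mult(i) <= h(i) when x(i) = '1' else -h(i);
  end generate;

  sum_proc : process (clk)
  begin
    if rising_edge(clk) then
      -- Stage 1: 8 adders
      for i in s1'range loop
        s1(i) <= mult(2*i) + mult(2*i + 1);
      end loop;
      -- Stage 2: 4 adders
      for i in s2'range loop
        s2(i) <= s1(2*i) + s1(2*i + 1);
      end loop;
      -- Stage 3: 2 adders
      for i in s3'range loop
        s3(i) <= s2(2*i) + s2(2*i + 1);
      end loop;
      -- Stage 4: final adder, registered at the output.
      -- to_signed keeps only the low 12 bits if the sum is larger.
      y <= std_logic_vector(to_signed(s3(0) + s3(1), 12));
    end if;
  end process;
end rtl;
```

Every signal assigned inside the clocked process becomes a register, so a given x shows up at y four rising edges later, and a new result still comes out on every clock.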
 
Joined Jan 29, 2009 | Messages 152 | Reaction score 0

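VHDL (simulation) works with "delta" delays:
- each mult(i) is set.
- after a "delta" delay, all mult(i) are summed.
- after a "delta" delay, y is assigned.
The synthesis tool will take care of keeping this "delta" semantics
intact: concurrent assignments like these become plain combinational
logic, so sum always follows the current mult values and there are no
"previous values" stored anywhere for the adder to pick up.

So I think this will do what you want.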
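
If you want to watch the delta cycles happen, a tiny self-checking testbench along these lines should do. Everything in it is made up for illustration (the entity name delta_demo_tb and the signals a, b and c); b and c simply play the roles of mult and sum.

```vhdl
entity delta_demo_tb is  -- hypothetical testbench name
end delta_demo_tb;

architecture sim of delta_demo_tb is
  signal a : integer := 0;
  signal b : integer := 0;  -- follows a, one delta later
  signal c : integer := 0;  -- follows b, one delta after that
begin
  -- Two chained concurrent assignments, like mult -> sum in the thread
  b <= a + 1;
  c <= b * 2;

  stimulus : process
  begin
    a <= 5;
    wait for 0 ns;  -- resume one delta cycle later
    -- Here a is already 5, but b and c have not caught up yet.
    wait for 1 ns;  -- let all remaining delta cycles run
    -- The whole chain has settled without any simulated time passing
    -- between the individual updates.
    assert b = 6 and c = 12
      report "combinational chain did not settle" severity failure;
    report "settled: b = " & integer'image(b) &
           ", c = " & integer'image(c);
    wait;
  end process;
end sim;
```

The updates to b and c each take one delta cycle, but all of them happen at the same simulation time, which is why the ordering you describe as "AND THEN" comes for free.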

Though you could also write this as a process:

Code:
architecture fir_lpf_conv_arch of fir_lpf_conv is

begin

process(x)
  type coef_type is array(0 to 15) of integer range -511 to 511;
  constant h: coef_type := (4,-2,-28,-53,-17,128,345,511,511,345,128,-17,-53,-28,-2,4);
  variable mult: coef_type;
  variable sum: integer range -2047 to 2047;
begin

  for i in x'range loop
    -- if/else instead of "when ... else", which is only allowed
    -- inside a process from VHDL-2008 onwards
    if x(i)='1' then
      mult(i) := h(i);
    else
      mult(i) := -h(i);
    end if;
  end loop;
  sum := mult(0) + mult(1) + mult(2) + mult(3) + mult(4) + mult(5)
       + mult(6) + mult(7) + mult(8) + mult(9) + mult(10) + mult(11)
       + mult(12) + mult(13) + mult(14) + mult(15);
  y <= std_logic_vector(to_signed(sum,12));
end process;

end fir_lpf_conv_arch;
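
On the two smaller questions (a for loop for the sum, and getting rid of the intermediate signal), one possible shape is an accumulator variable inside the process. This is only a sketch: the entity name fir_lpf_conv_loop and the variable acc are my own, and the accumulator range is deliberately generous.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity fir_lpf_conv_loop is  -- hypothetical name for this sketch
  port (
    x : in  std_logic_vector(15 downto 0);
    y : out std_logic_vector(11 downto 0)
  );
end fir_lpf_conv_loop;

architecture rtl of fir_lpf_conv_loop is
  type coef_type is array (0 to 15) of integer range -511 to 511;
  constant h : coef_type :=
    (4,-2,-28,-53,-17,128,345,511,511,345,128,-17,-53,-28,-2,4);
begin
  process (x)
    -- Variables update immediately, so the loop can accumulate.
    variable acc : integer range -8191 to 8191;
  begin
    acc := 0;
    for i in x'range loop
      if x(i) = '1' then
        acc := acc + h(i);
      else
        acc := acc - h(i);
      end if;
    end loop;
    -- to_signed states the signedness explicitly, which avoids the
    -- type error when assigning straight to the std_logic_vector port.
    y <= std_logic_vector(to_signed(acc, 12));
  end process;
end rtl;
```

The loop is unrolled during synthesis, so this is still purely combinational. Note that it describes the additions one after another, just like the long "+" expression does.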

The best way to calculate sum would be something like,
Code:
sum(0) <= mult(0) + mult(1);
sum(1) <= mult(2) + mult(3);
sum(2) <= mult(4) + mult(5);
sum(3) <= mult(6) + mult(7);
sum(4) <= mult(8) + mult(9);
sum(5) <= mult(10) + mult(11);
sum(6) <= mult(12) + mult(13);
sum(7) <= mult(14) + mult(15);

sum2(0) <= sum(0) + sum(1);
sum2(1) <= sum(2) + sum(3);
sum2(2) <= sum(4) + sum(5);
sum2(3) <= sum(6) + sum(7);

sum3(0) <= sum2(0) + sum2(1);
sum3(1) <= sum2(2) + sum2(3);

sum4 <= sum3(0) + sum3(1);
Written like this, additions are performed in parallel, which will scale better.

This isn't too hard to rewrite into a few loops to make this dynamic.
Might even write as a nested loop, using a double indexed array sum(i,j) (or sum(i)(j), depending on how you declare it)

I wrote it like that to make the pattern clear for looping. It isn't actually needed to write like that to get parallelism. You can simply do:
Code:
sum <= (((mult(0) + mult(1)) + (mult(2) + mult(3))) + 
((mult(4) + mult(5)) + (mult(6) + mult(7)))) + 
(((mult(8) + mult(9)) + (mult(10) + mult(11))) +
((mult(12) + mult(13)) + (mult(14) + mult(15))));
Though it already starts to look a bit like lisp like that ;-)
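
For the loop version mentioned above, here is one possible sketch using a single array laid out like a binary heap instead of a double-indexed one. The entity name fir_lpf_conv_tree, the node array and the heap-style indexing are my own choices for illustration, not something from the earlier posts.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity fir_lpf_conv_tree is  -- hypothetical name for this sketch
  port (
    x : in  std_logic_vector(15 downto 0);
    y : out std_logic_vector(11 downto 0)
  );
end fir_lpf_conv_tree;

architecture rtl of fir_lpf_conv_tree is
  type coef_type is array (0 to 15) of integer range -511 to 511;
  constant h : coef_type :=
    (4,-2,-28,-53,-17,128,345,511,511,345,128,-17,-53,-28,-2,4);

  -- Nodes 15..30 are the 16 leaves, nodes 0..14 are the 15 adders,
  -- and node 0 is the root. Initialised to 0 so the first evaluation
  -- in simulation stays inside the declared range.
  type node_t is array (0 to 30) of integer range -8191 to 8191;
  signal node : node_t := (others => 0);
begin
  -- Leaves: the sign-select muxes from the original code
  leaves : for i in 0 to 15 generate
    node(15 + i) <= h(i) when x(i) = '1' else -h(i);
  end generate;

  -- Each internal node adds its two children
  adders : for k in 0 to 14 generate
    node(k) <= node(2*k + 1) + node(2*k + 2);
  end generate;

  y <= std_logic_vector(to_signed(node(0), 12));
end rtl;
```

This gives the same pairing as the hand-written version (mult(0) with mult(1), and so on) with four adder levels between any input and the output. Using one range for every node keeps the code short at the cost of declaring the lower levels wider than they need to be.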
 
heilig.brian

> What you have written contains 0 multipliers,

The following line...

mult(i) <= h(i) when x(i)='1' else -h(i);

...is a 1 bit multiplier where a 1 means 'multiply by 1' and a 0 means
'multiply by -1'. When x(i)='1' then mult(i) <= h(i) * 1, else mult(i)
<= h(i) * -1.

> 15 x 2-1 muxes, no
> registers and a very long adder chain. It is very very unlikely that
> this will work. You will HAVE to break up the adder chain and pipeline
> it - 16 adds just isn't going to work without pipelining. You normally
> only want to add 2-3 numbers in a single clock cycle.

Because of the propagation delay? The Quartus II software I'm using
has a parallel_add megafunction (if you're not familiar with Quartus
II a megafunction is like a parameterized logical element) that can
add up to 128 32-bit integers in parallel! Well, at least that's what
it says.

> You also say "x" is a 1 bit sequence? Is it coming in serially, or is
> it really coming in as a bus like you've written? As it stands, it
> expects all the X bits to be there at the same time.

It is a 1-bit sequence that is initially serial but through a series
of external d flip flops I am converting it to 16 bits in parallel.
However each of these bits represents one element of the x sequence.
It is not converted to a 16 bit word.

> I suggest you read up on digital design. This code is in no way
> synthesisable.

Ouch. Well you caught me. I bought "Circuit Design with VHDL" a few
days ago and it is on its way. I thought, "How hard can this be?"

> Here is a hint (I'm going to assume that X is a
> synchronous input and not asynchronous like you said):

X is a synchronous input. The problem is I could draw a working logic
diagram that would perform the 16 1-bit multiplies in parallel and
then sum all the results in parallel. In fact I started off this way
but then figured it's a good time to learn VHDL. So if I know that it
can be represented as a bunch of logic gates then the problem is to
write VHDL code that will synthesize those gates for me.

> It should give you a latency of 4 clock cycles:
>
> [code snipped]
>
> Also - delete std_logic_arith from the code. It clashes with
> numeric_std. Always use the numeric_std package (which you have).

Ok. Thanks for the help.
 
Tricky

> The following line...
>
> mult(i) <= h(i) when x(i)='1' else -h(i);
>
> ...is a 1 bit multiplier where a 1 means 'multiply by 1' and a 0 means
> 'multiply by -1'. When x(i)='1' then mult(i) <= h(i) * 1, else mult(i)
> <= h(i) * -1.

That's probably the way you intend it, but in reality you've just
written a mux with 2 constant inputs that are selected via the
appropriate bit on X. Looking at the RTL viewer, the constants you
have chosen make it even less complicated, making each input
just a function of X. You could completely change the constants, and
you will never get a hardware multiply, you will always get a mux.
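
Both views can be checked against each other in simulation. The testbench below is only an illustration (the name sign_mux_tb and the variables are invented): for every tap and both input bit values it compares the mux form with an explicit multiply by +1 or -1.

```vhdl
entity sign_mux_tb is  -- hypothetical testbench name
end sign_mux_tb;

architecture sim of sign_mux_tb is
  type coef_type is array (0 to 15) of integer range -511 to 511;
  constant h : coef_type :=
    (4,-2,-28,-53,-17,128,345,511,511,345,128,-17,-53,-28,-2,4);
begin
  check : process
    variable sel_result : integer;
    variable mul_result : integer;
    variable factor     : integer;
  begin
    for i in h'range loop
      for bit_val in 0 to 1 loop
        -- Mux form, as written in the thread: pick h(i) or -h(i)
        if bit_val = 1 then
          sel_result := h(i);
        else
          sel_result := -h(i);
        end if;
        -- Arithmetic form: map 0/1 to -1/+1, then multiply
        factor     := 2*bit_val - 1;
        mul_result := h(i) * factor;
        assert sel_result = mul_result
          report "mismatch at tap " & integer'image(i)
          severity failure;
      end loop;
    end loop;
    report "mux form and multiply form agree for all taps";
    wait;
  end process;
end sim;
```

So the two descriptions compute the same numbers; the point above is about what hardware they turn into, not about the arithmetic.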

> Because of the propagation delay? The Quartus II software I'm using
> has a parallel_add megafunction (if you're not familiar with Quartus
> II a megafunction is like a parameterized logical element) that can
> add up to 128 32-bit integers in parallel! Well, at least that's what
> it says.

What's wrong with a propagation delay? FPGAs are great for massive
parallel processing, but there is normally a latency involved.
Pipelining still means you can get 1 result/clock cycle, but you have
to wait n clock cycles of latency before the first result arrives. n
is ALWAYS fixed, so you know when the output is valid, and from then
on every clock cycle yields a valid result. I fear if latency is your
biggest worry, you're coming at FPGA design from the wrong angle.
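
Because n is fixed, a common way to know when the output is valid is to send a "valid" flag through a shift register of the same length as the pipeline. The following is a minimal sketch of that idea; the entity valid_delay, its generic LATENCY and its port names are hypothetical and not taken from any library.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity valid_delay is  -- hypothetical helper
  generic (
    LATENCY : integer range 2 to 64 := 4  -- pipeline depth in clocks
  );
  port (
    clk       : in  std_logic;
    valid_in  : in  std_logic;  -- high when the pipeline input is valid
    valid_out : out std_logic   -- high when the pipeline output is valid
  );
end valid_delay;

architecture rtl of valid_delay is
  signal pipe : std_logic_vector(LATENCY - 1 downto 0) := (others => '0');
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- Shift the flag along by one position per clock
      pipe <= pipe(LATENCY - 2 downto 0) & valid_in;
    end if;
  end process;

  -- The flag emerges LATENCY rising edges after it went in
  valid_out <= pipe(LATENCY - 1);
end rtl;
```

With LATENCY set to the depth of the adder pipeline, valid_out lines up with the data output, and after the initial delay it can be high on every clock.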

Yes, Altera do provide a parallel_add megafunction, but it looks
horrible to use (the data input is based on their own 2D array of
std_logic for a start, not the best way to encourage use!). But I
could do a parallel add without their megafunction, and add 256 x
64-bit numbers in parallel if I want, just using the "+" sign. Doesn't
mean it'll make good hardware/firmware though. You'll also notice that
there is a "Pipeline" parameter on the parallel_add megafunction.

OK, I've compiled some stuff, and here are the results.

As a quick reference, I ran each version through TimeQuest on a
Stratix II (putting registers in at the mux stage and the output, so
TimeQuest could actually work):

- Your initial massive add: FMax = 94 MHz
- The massive add with a parallel_add component, 0 latency: FMax =
  200 MHz
- parallel_add, pipeline length of 4: FMax = 320 MHz
- Pipelining it the way I did in my previous post: FMax = 360 MHz

Remember this has been done on a large device with no additional
logic, so FMax reports may be artificially high. But I know which
method I'd rather use!

To get hold of the parallel add, you have to actually instantiate it.
Converting the data input into the right format is a bit of an arse:


signal data : altera_mf_logic_2D(15 downto 0, 9 downto 0);
begin

-- copy each mult(i), as a 10-bit signed value, into row i of the 2D array
i_gen : for i in data'range(1) generate
        j_gen : for j in data'range(2) generate
                -- index the to_signed result directly; a type
                -- conversion cannot be used as the prefix of an index
                data(i, j) <= to_signed(mult(i), 10)(j);
        end generate j_gen;
end generate i_gen;

par_add : parallel_add
        generic map (
                width => 10,
                size => 16,
                widthr => 12,
                pipeline => 0,
                representation => "SIGNED"
        )
        port map (
                data => data,
                result => result
        );

> It is a 1-bit sequence that is initially serial but through a series
> of external d flip flops I am converting it to 16 bits in parallel.
> However each of these bits represents one element of the x sequence.
> It is not converted to a 16 bit word.

Well, you have x coming in as a 16 bit bus. And you have 16
"multiplies" in parallel.
Another question - how fast is the serial bus? 16x the main clock
speed? If it isn't, how do you know when any of the X bits are valid?

> X is a synchronous input. The problem is I could draw a working logic
> diagram that would perform the 16 1-bit multiplies in parallel and
> then sum all the results in parallel. In fact I started off this way
> but then figured it's a good time to learn VHDL. So if I know that it
> can be represented as a bunch of logic gates then the problem is to
> write VHDL code that will synthesize those gates for me.

But I'd recommend you do it that way, especially as a VHDL beginner.
VHDL is a description language, not a programming language. It is
meant for describing digital hardware. You can write whatever you want
in VHDL (to a point), and it may simulate how you intend, giving the
results you wanted in the way you specified, but that doesn't mean it's
any good as a hardware description.
 
heilig.brian

> That's probably the way you intend it, but in reality you've just
> written a mux with 2 constant inputs that are selected via the
> appropriate bit on X. Looking at the RTL viewer, the constants you
> have chosen make it even less complicated, making each input
> just a function of X. You could completely change the constants, and
> you will never get a hardware multiply, you will always get a mux.

I think this is good. It is equivalent to a multiply by 1 or -1,
right?

> What's wrong with a propagation delay? FPGAs are great for massive
> parallel processing, but there is normally a latency involved.
> Pipelining still means you can get 1 result/clock cycle, but you have
> to wait n clock cycles of latency before the first result arrives. n
> is ALWAYS fixed, so you know when the output is valid, and from then
> on every clock cycle yields a valid result. I fear if latency is your
> biggest worry, you're coming at FPGA design from the wrong angle.

You are right. Latency is hardly a concern. I guess my line of
thinking was that I could imagine the logic diagram, now if I could
just write the VHDL to make that logic diagram a reality.

But I wasn't asking about throughput delay, which I think is (or I'll
define as) the time through the entire device. Rather I was asking
about the delay between when the x elements are available on the
rising edge of the clock, to when the next y outputs are available to
be sampled. If this time is greater than half a clock cycle then I
will get garbage out. I think you summarized this in your discussion
below determining FMax.

> OK, I've compiled some stuff, and here are the results:

Again, thank you for your help.

> As a quick reference, I ran your initial massive add through
> TimeQuest, on a Stratix II (putting registers in at the mux stage and
> the output, so TimeQuest could actually work) - FMax = 94 MHz
> Doing the massive add with a parallel add component, 0 latency FMax =
> 200 MHz
> parallel adder, pipeline length of 4, FMax = 320 MHz
> Pipelining it the way I did in previous post : FMax = 360 MHz.

I see. My sample clock is 20 MHz so that's ok. But I see your point
and will add pipelining.

> Well, you have x coming in as a 16 bit bus. And you have 16
> "multiplies" in parallel.
> Another question - how fast is the serial bus? 16x the main clock
> speed? If it isn't, how do you know when any of the X bits are valid?

The serial bus is 20 MHz as is the sample clock. Every time a new x
bit is shifted in I process the entire 16-bit sequence again. So bits
0-14 in the last interval become bits 1-15 in this one.

> But I'd recommend you do it that way, especially as a VHDL beginner.
> VHDL is a description language, not a programming language. It is
> meant for describing digital hardware. You can write whatever you want
> in VHDL (to a point), and it may simulate how you intend, giving the
> results you wanted in the way you specified, but that doesn't mean it's
> any good as a hardware description.

You caught me again. I am a programmer with some hardware experience.
This small exercise is only the beginning, I'll soon need to know VHDL
well. So I guess I'll start reading!

Thanks,
Brian
 
heilig.brian

This is the code I finally settled on:

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity code_filter is
        port (
                x0: in std_logic;
                y: out std_logic_vector(11 downto 0);
                clk: in std_logic
        );
end code_filter;

architecture code_filter_arch of code_filter is
        type coef_type is array(0 to 15) of integer range -511 to 511;
        constant h: coef_type :=
                (4,-2,-28,-53,-17,128,345,511,511,345,128,-17,-53,-28,-2,4);
        signal mult: coef_type;
        signal x: std_logic_vector(15 downto 0);
        signal sum0_1, sum2_3, sum4_5, sum6_7: integer range -1023 to 1023;
        signal sum8_9, sum10_11, sum12_13, sum14_15: integer range -1023 to 1023;
        signal sum0_3, sum4_7, sum8_11, sum12_15: integer range -2047 to 2047;
        signal sum0_7, sum8_15: integer range -4095 to 4095;
        signal sum_total: integer range -8191 to 8191;
begin
        -- 16-bit shift register: the newest input bit enters at x(0)
        process (clk)
        begin
                if rising_edge(clk) then
                        x(x'high downto 1) <= x((x'high-1) downto 0);
                        x(0) <= x0;
                end if;
        end process;

        one_bit_multiply: for i in x'range generate
                mult(i) <= h(i) when x(i)='1' else -h(i);
        end generate;

        -- adder tree, level 1
        sum0_1 <= mult(0) + mult(1);
        sum2_3 <= mult(2) + mult(3);
        sum4_5 <= mult(4) + mult(5);
        sum6_7 <= mult(6) + mult(7);
        sum8_9 <= mult(8) + mult(9);
        sum10_11 <= mult(10) + mult(11);
        sum12_13 <= mult(12) + mult(13);
        sum14_15 <= mult(14) + mult(15);
        -- level 2
        sum0_3 <= sum0_1 + sum2_3;
        sum4_7 <= sum4_5 + sum6_7;
        sum8_11 <= sum8_9 + sum10_11;
        sum12_15 <= sum12_13 + sum14_15;
        -- level 3
        sum0_7 <= sum0_3 + sum4_7;
        sum8_15 <= sum8_11 + sum12_15;
        -- level 4
        sum_total <= sum0_7 + sum8_15;
        y <= std_logic_vector(to_signed(sum_total,12));
end code_filter_arch;

The entire filter is now contained in this code, including the shift
registers (which used to be in another file). It has been simulated
and it works great. The major difference between this version and what
I had before is the processing of the add. The lesson I learned here
is that synthesis produces a result that closely matches the VHDL as
written, unlike a C compiler, which will perform aggressive
optimizations. My previous version resulted in 15 adders in one long
chain (exactly as the code was written) whereas the current version
resulted in 15 adders in a hierarchical structure (again exactly as it
is written). This cut the longest path from 15 adders in series to
log2(16) = 4, and the propagation delay through the adders with it.
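
A side note on the 12-bit output, for anyone reusing this with a different input: the largest magnitude the sum can reach is the sum of the absolute coefficient values. The small testbench below (coef_bound_tb and worst_case are names invented for this sketch) works that out from h. It reports 2176, which is above the 12-bit signed limit of 2047, so the 12-bit output relies on x being a known sequence that never drives the sum that far.

```vhdl
entity coef_bound_tb is  -- hypothetical testbench name
end coef_bound_tb;

architecture sim of coef_bound_tb is
  type coef_type is array (0 to 15) of integer range -511 to 511;
  constant h : coef_type :=
    (4,-2,-28,-53,-17,128,345,511,511,345,128,-17,-53,-28,-2,4);

  -- Largest possible |y|: every tap contributes |h(i)| with the
  -- same sign.
  function worst_case (c : coef_type) return natural is
    variable total : natural := 0;
  begin
    for i in c'range loop
      total := total + abs(c(i));
    end loop;
    return total;
  end function;

  constant PEAK : natural := worst_case(h);
begin
  process
  begin
    report "largest possible |y| = " & integer'image(PEAK);
    -- With these coefficients this warning fires (PEAK is 2176)
    assert PEAK <= 2047
      report "an arbitrary x could exceed the 12-bit signed range"
      severity warning;
    wait;
  end process;
end sim;
```

For an arbitrary x, a 13-bit output would be needed to cover the full range.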

Anyway it's good to know my initial design actually did work, even
though it wasn't optimal. After your first scathing response I felt
like I should turn in my degree and restart a career in some liberal
arts field. But your reply is greatly appreciated as I understand what
is going on much better.

Brian
 
Joined Dec 9, 2008 | Messages 88 | Reaction score 0

Brian, thanks for closing out with your working solution! It is nice to see the design process from beginning to end.

John
 
