combinatorial feedback loop

K

Ken Cecka

I'm playing around with different ways to count the number of '1' bits in a vector. I'll only be dealing with 6 bit vectors, so I could just do this with a lookup table, but I started with the most readable implementation to see how it would synthesize (copied below).

The Xilinx tools are synthesizing this into a single counter with a feedback loop, and then telling me me that the design can run at 700MHz. I'm a little skeptical - anyone know if it's reasonable to expect timing analysis to unroll a feedback loop like this?

Ken

ENTITY test IS
GENERIC
(
bits : INTEGER := 6
);
PORT
(
reset : IN STD_LOGIC;
clk : IN STD_LOGIC;
input : IN STD_LOGIC_VECTOR((bits - 1) DOWNTO 0);
ones : OUT STD_LOGIC_VECTOR(log2(bits) DOWNTO 0)
);
END test;

ARCHITECTURE model OF test IS

SIGNAL input_l : STD_LOGIC_VECTOR(input'RANGE);
SIGNAL count : STD_LOGIC_VECTOR(ones'RANGE);

BEGIN

-- latch inputs and outputs for timing analysis
PROCESS (reset, clk)
BEGIN
IF (reset = '1') THEN
input_l <= (OTHERS => '0');
ones <= (OTHERS => '0');
ELSIF (clk'EVENT) AND (clk = '1') THEN
input_l <= input;
ones <= count;
END IF;
END PROCESS;

-- count ones
PROCESS (input_l, count)
BEGIN
FOR i IN input_l'RANGE LOOP
IF (input_l(i) = '1') THEN
count <= count + 1;
END IF;
END LOOP;
END PROCESS;

END;
 
K

Ken Cecka

Jonathan said:
I'm playing around with different ways to count the number of '1'
bits in a vector. I'll only be dealing with 6 bit vectors, so
I could just do this with a lookup table, but I started with
the most readable implementation to see how it would
synthesize (copied below).

The Xilinx tools are synthesizing this into a single
counter with a feedback loop, and then telling me
that the design can run at 700MHz. I'm a little
skeptical - anyone know if it's reasonable to expect
timing analysis to unroll a feedback loop like this?

Ken [...]
-- count ones
PROCESS (input_l, count)
BEGIN
FOR i IN input_l'RANGE LOOP
IF (input_l(i) = '1') THEN
count <= count + 1;
END IF;
END LOOP;
END PROCESS;

OUCH! You really, really don't want to do this!

I suggest that your s[ck]epticism is entirely
justified - there's no way that makes any sense.
Indeed, a counter with combinational feedback
makes no sense as a design - it isn't coherent.

Of course you can re-code the loop with a variable:

PROCESS (input_l)
variable count: unsigned...
BEGIN
FOR i IN input_l'RANGE LOOP
IF (input_l(i) = '1') THEN
count := count + 1;
END IF;
END LOOP;
count_signal <= count;
END PROCESS;

and now synthesis will properly unroll the for-loop
(creating a chain of adders) and there is no
combinational feedback. I'm guessing you know
a load of other ways of doing it, too.

I was expecting it to unroll into a chain of adders in the first place and was surprised when it used feedback, but didn't stop think about the underlying adder implementation and whether this could even work. Thanks for the reality check!

Ken
 
A

Andy

Of course you can re-code the loop with a variable:

  PROCESS (input_l)
     variable count: unsigned...
  BEGIN
    FOR i IN input_l'RANGE LOOP
      IF (input_l(i) = '1') THEN
        count := count + 1;
      END IF;
    END LOOP;
    count_signal <= count;
  END PROCESS;

Don't forget to initialize count to zero before the loop...

Andy
 
K

kennheinrich

I'm playing around with different ways to count the number of '1' bits in a vector.  I'll only be dealing with 6 bit vectors, so I could just do this with a lookup table, but I started with the most readable implementation to see how it would synthesize (copied below).

The Xilinx tools are synthesizing this into a single counter with a feedback loop, and then telling me me that the design can run at 700MHz.  I'm  a little skeptical - anyone know if it's reasonable to expect timing analysis to unroll a feedback loop like this?

Ken

ENTITY test IS
  GENERIC
  (
    bits : INTEGER := 6
  );
  PORT
  (
    reset : IN  STD_LOGIC;
    clk   : IN  STD_LOGIC;
    input : IN  STD_LOGIC_VECTOR((bits - 1) DOWNTO 0);
    ones  : OUT STD_LOGIC_VECTOR(log2(bits) DOWNTO 0)
  );
END test;

ARCHITECTURE model OF test IS

  SIGNAL input_l : STD_LOGIC_VECTOR(input'RANGE);
  SIGNAL count : STD_LOGIC_VECTOR(ones'RANGE);

BEGIN

  -- latch inputs and outputs for timing analysis
  PROCESS (reset, clk)
  BEGIN
    IF (reset = '1') THEN
      input_l <= (OTHERS => '0');
      ones <= (OTHERS => '0');
    ELSIF (clk'EVENT) AND (clk = '1') THEN
      input_l <= input;
      ones <= count;
    END IF;
  END PROCESS;

  -- count ones
  PROCESS (input_l, count)
  BEGIN
    FOR i IN input_l'RANGE LOOP
      IF (input_l(i) = '1') THEN
        count <= count + 1;
      END IF;
    END LOOP;
  END PROCESS;

END;

There are also lots of nifty ways to count ones that the software guys
like to use. I haven't though through them enough to say whether they
merely represent a way to reclaim the inherent inefficiency of
launching a big ole 32-bit ALU op (the minimum unit of work in the
software pipeline) for each measly one-bit, or whether you could
exploit some of the insights to make faster gate-level hardware. If
you want insane speed, I'm not sure how you could beat a (possibly
pipelined) log adder tree from an algorithmic viewpoint.

Random google hit from "count the number of ones":

http://gurmeetsingh.wordpress.com/2008/08/05/fast-bit-counting-routines/


There is also some interesting work on fast decoding of VLC codes
(MPEG video for example) using combinatorial decoding of parts of the
input word to switch muxes to re-align the remainder of the input word
to a successive decode stage. If you had any indication of the
statistics (and were willing to play the statistics game, which is not
always an option), you might be able to exploit the same idea, say for
fast lookahead carry generation of halves or quarters of your input
width.

I realize this is a slight tangent to your original question,
though... sorry :)

- Kenn
 
K

Ken Cecka

There are also lots of nifty ways to count ones that the software guys
like to use. I haven't though through them enough to say whether they
merely represent a way to reclaim the inherent inefficiency of
launching a big ole 32-bit ALU op (the minimum unit of work in the
software pipeline) for each measly one-bit, or whether you could
exploit some of the insights to make faster gate-level hardware. If
you want insane speed, I'm not sure how you could beat a (possibly
pipelined) log adder tree from an algorithmic viewpoint.

Random google hit from "count the number of ones":

http://gurmeetsingh.wordpress.com/2008/08/05/fast-bit-counting-routines/


There is also some interesting work on fast decoding of VLC codes
(MPEG video for example) using combinatorial decoding of parts of the
input word to switch muxes to re-align the remainder of the input word
to a successive decode stage. If you had any indication of the
statistics (and were willing to play the statistics game, which is not
always an option), you might be able to exploit the same idea, say for
fast lookahead carry generation of halves or quarters of your input
width.

I realize this is a slight tangent to your original question,
though... sorry :)

- Kenn

I spent a little time looking over some of the software tricks earlier today. They seem to be mostly geared around minimizing serial instructions.

The specific case I'm addressing is counting ones in a relatively small vector where I want to do it in a single clock cycle. Interestingly, I tried coding it with a chain of adders and as a lookup table, and the Xilinx compiler faithfully implemented this in the RTL schematic (adder chain and a ROM respectively), but they both reduced to the exact same logic in the technology schematic.

Ken
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,755
Messages
2,569,536
Members
45,020
Latest member
GenesisGai

Latest Threads

Top