cosine calcs

Discussion in 'VHDL' started by Niv, Aug 11, 2006.

1. NivGuest

I may need to perform the following function:
Calculate the full included angle between Z axis given X axis angle & Y
axis angle
(I think this is cos(full) = cos(x).cos(Y) [confirm?]

I'm given X & Y (radians or degrees not yet known, but that's just
scaling)
I then have to test to see if the full angle is greater than a fixed
value, lets say 40 deg as an example.
So cos(40) is a constant which I can compare with cos(x).cos(Y).

How do I calc the cos of X & Y? , before I then multiply.
I can't use a lookup table as the X & Y are wide, which would build
a prohibitively large LUT. However, I have plenty of time to do some
sort
of iterative algorithm (200 us plus), so could calc cos(X), store, then
cos(Y), mult
and finally compare to my constant.

TIA, Niv.

Niv, Aug 11, 2006

2. Jonathan BromleyGuest

On 11 Aug 2006 04:04:53 -0700, "Niv" <> wrote:

Kevin,

>How do I calc the cos of X & Y? , before I then multiply.
>I can't use a lookup table as the X & Y are wide, which would build
>a prohibitively large LUT. However, I have plenty of time to do some
>sort
>of iterative algorithm (200 us plus), so could calc cos(X), store, then
>cos(Y), mult

Standard answer: see Ray Andraka's website www.andraka.com
for lots of great information on CORDIC algorithms. Sounds like
it's the perfect match for what you need. Given you have so
much time, a bit-serial CORDIC could be very compact indeed.
--
Jonathan Bromley, Consultant

DOULOS - Developing Design Know-how
VHDL * Verilog * SystemC * e * Perl * Tcl/Tk * Project Services

Doulos Ltd., 22 Market Place, Ringwood, BH24 1AW, UK

http://www.MYCOMPANY.com

The contents of this message may contain personal views which
are not the views of Doulos Ltd., unless specifically stated.

Jonathan Bromley, Aug 11, 2006

3. NivGuest

Jonathan Bromley wrote:

> On 11 Aug 2006 04:04:53 -0700, "Niv" <> wrote:
>
> Kevin,
>
> >How do I calc the cos of X & Y? , before I then multiply.
> >I can't use a lookup table as the X & Y are wide, which would build
> >a prohibitively large LUT. However, I have plenty of time to do some
> >sort
> >of iterative algorithm (200 us plus), so could calc cos(X), store, then
> >cos(Y), mult

>
> Standard answer: see Ray Andraka's website www.andraka.com
> for lots of great information on CORDIC algorithms. Sounds like
> it's the perfect match for what you need. Given you have so
> much time, a bit-serial CORDIC could be very compact indeed.
> --
> Jonathan Bromley, Consultant

OK, looked at Ray website about Cordics, and now my head hurts! (No
maths since Uni, back in the (late) 70's. Well, that's my excuse.
I've copied a 16 bit wide 15 deep pipeline cordic from some free
website & got it
simulating (seems to jump about near 0 though).

But, how to change it so it accepts a wider input value (think that's
easy), but takes more time to calculate because it does many more
iterations. Goal is for a small area implementation.

Niv.

Niv, Aug 11, 2006
4. Jonathan BromleyGuest

On 11 Aug 2006 08:11:39 -0700, Niv <> wrote:

>OK, looked at Ray website about Cordics, and now my head hurts! (No
>maths since Uni, back in the (late) 70's. Well, that's my excuse.

You and me both!

>I've copied a 16 bit wide 15 deep pipeline cordic from some free
>website & got it
>simulating (seems to jump about near 0 though).

It shouldn't. I've a Verilog implementation somewhere that I'll
dig up and send to you - it seems to be pretty well behaved.
The only major issue is that you mustn't give it input angles
outside a certain range - it's close to +/- 100 degrees, can't
remember the exact number. Some implementations cheat
and restrict the range even further, to about half that, to save
one processing step - this is still useful, because you still
have a full quadrant of angle range.

>But, how to change it so it accepts a wider input value (think that's
>easy), but takes more time to calculate because it does many more
>iterations.

I think all you need to do is make the arctan(2^(-N)) table longer
and wider. Once you've hit 16 bits, the "fudge factor" that you
get (product of the cosines of all the values in that table) is so
close to constant with increasing N that you stop worrying
about it. And of course all the datapaths get wider everywhere.

> Goal is for a small area implementation.

I found that part of Ray Andraka's description somewhat heavy
going, but the basic idea is easy - just doing the sum/difference
calculations one bit at a time.

I'll dig out what I have over the weekend. No promises about
quality of documentation though - I did it as an experiment,
not intended for public consumption. And I have a nasty feeling
it might be written in Verilog :-(
--
Jonathan Bromley, Consultant

DOULOS - Developing Design Know-how
VHDL * Verilog * SystemC * e * Perl * Tcl/Tk * Project Services

Doulos Ltd., 22 Market Place, Ringwood, BH24 1AW, UK

http://www.MYCOMPANY.com

The contents of this message may contain personal views which
are not the views of Doulos Ltd., unless specifically stated.

Jonathan Bromley, Aug 11, 2006
5. Andy RayGuest

Jonathan Bromley wrote:
> On 11 Aug 2006 08:11:39 -0700, Niv <> wrote:
>
>
>> OK, looked at Ray website about Cordics, and now my head hurts! (No
>> maths since Uni, back in the (late) 70's. Well, that's my excuse.

>
> You and me both!
>

I've been working on a cordic circuit generator built using HDCaml.
It's pretty much working now so I have posted it at:

http://www.evilkid.pwp.blueyonder.co.uk/

Probably a few bugs to work through (and requires some decent
documentation) but for simple stuff (like sin/cos, mag/phase etc) it
seems to be working ok.

>> I've copied a 16 bit wide 15 deep pipeline cordic from some free
>> website & got it
>> simulating (seems to jump about near 0 though).

>
> It shouldn't. I've a Verilog implementation somewhere that I'll
> dig up and send to you - it seems to be pretty well behaved.
> The only major issue is that you mustn't give it input angles
> outside a certain range - it's close to +/- 100 degrees, can't
> remember the exact number.

1.74 radians or 99.6 degrees so you were pretty close...

Some implementations cheat
> and restrict the range even further, to about half that, to save
> one processing step - this is still useful, because you still
> have a full quadrant of angle range.
>
>> But, how to change it so it accepts a wider input value (think that's
>> easy), but takes more time to calculate because it does many more
>> iterations.

>
> I think all you need to do is make the arctan(2^(-N)) table longer
> and wider. Once you've hit 16 bits, the "fudge factor" that you
> get (product of the cosines of all the values in that table) is so
> close to constant with increasing N that you stop worrying
> about it. And of course all the datapaths get wider everywhere.
>

If it's the CORDIC from opencores then that implements a fully unrolled
algorithm so each extra stage costs some adders, muxes and registers.
It's quite an expensive architecture but very fast.

>> Goal is for a small area implementation.

>
> I found that part of Ray Andraka's description somewhat heavy
> going, but the basic idea is easy - just doing the sum/difference
> calculations one bit at a time.
>
> I'll dig out what I have over the weekend. No promises about
> quality of documentation though - I did it as an experiment,
> not intended for public consumption. And I have a nasty feeling
> it might be written in Verilog :-(

It doesn't offer a bit-serial implementation (yet) but it can generate
iterative or pipelined architectures of any precision (in vhdl and
verilog). The iterative version is reasonably efficient. Perhaps if
you have the chance you would compare the implementation to your hand
crafted one.

To generate a sincos core use something like:

cordic_gen -b 24 -f 4 -i 20 -a sincos -m iter -vhdl

For vhdl simulation and synthesis you also need to compile the package
in hdcaml.vhd.

Since I cant seem to get the ocaml native code compilers to work on my
system the program requires an installation of ocaml to work.

I hope someone finds it useful.

Cheers,

Andy.

Andy Ray, Aug 12, 2006
6. Jonathan BromleyGuest

On 11 Aug 2006 08:11:39 -0700,
Niv <> wrote:

>I've copied a 16 bit wide 15 deep pipeline cordic from some free
>website & got it
>simulating (seems to jump about near 0 though).
>
>But, how to change it so it accepts a wider input value (think that's
>easy), but takes more time to calculate because it does many more
>iterations. Goal is for a small area implementation.

Can you quantify "small"? I resuscitated my iterative CORDIC (one
clock cycle per bit of data, not pipelined); for 16 data bits and 3
guard bits it needs 62 FFs and about 230 LUTs in a Spartan-3.
I haven't tried really hard to optimise it for area; for example,
the arctangent lookup table is not yet in a ROM, but is in general
logic. I get something like 100MHz clock rate without trying
too hard. There are plenty of opportunities for making the
logic a little smaller and faster, but not much chance of
reducing the FF count. The size scales linearly with bit width
of the data (or very nearly so - the ROM grows faster than
linearly, but it's not a big deal).

Bit-serial implementation would be smaller but needs the same
number of FFs; however, by careful shoehorning you could get
some of the storage into SRL16s and save quite a lot of space.
That sort of jumping-on-the-suitcase is not really my strength,
though!

I'll try to put the example on our website before long, but
meanwhile let me know if you want a copy.
--
Jonathan Bromley, Consultant

DOULOS - Developing Design Know-how
VHDL * Verilog * SystemC * e * Perl * Tcl/Tk * Project Services

Doulos Ltd., 22 Market Place, Ringwood, BH24 1AW, UK

http://www.MYCOMPANY.com

The contents of this message may contain personal views which
are not the views of Doulos Ltd., unless specifically stated.

Jonathan Bromley, Aug 15, 2006
7. Jan DecaluweGuest

Jonathan Bromley wrote:
> On 11 Aug 2006 08:11:39 -0700,
> Niv <> wrote:
>
>
>>I've copied a 16 bit wide 15 deep pipeline cordic from some free
>>website & got it
>>simulating (seems to jump about near 0 though).
>>
>>But, how to change it so it accepts a wider input value (think that's
>>easy), but takes more time to calculate because it does many more
>>iterations. Goal is for a small area implementation.

>
>
> Can you quantify "small"? I resuscitated my iterative CORDIC (one
> clock cycle per bit of data, not pipelined); for 16 data bits and 3
> guard bits it needs 62 FFs and about 230 LUTs in a Spartan-3.
> I haven't tried really hard to optimise it for area; for example,
> the arctangent lookup table is not yet in a ROM, but is in general
> logic. I get something like 100MHz clock rate without trying
> too hard.

Readers may be interested in the following page which describes
an apparently similar design (in MyHDL/Python):

http://myhdl.jandecaluwe.com/doku.php/cookbook:sinecomp:synthesis

The page includes testbench, code, Verilog output and FPGA synthesis
report. MyHDL currently converts only to Verilog, but VHDL output is
in the works.

Regards,
Jan

--
Jan Decaluwe - Resources bvba - http://www.jandecaluwe.com
Losbergenlaan 16, B-3010 Leuven, Belgium
From Python to silicon:
http://myhdl.jandecaluwe.com

Jan Decaluwe, Aug 15, 2006
8. Jan DecaluweGuest

Jan Decaluwe, Aug 15, 2006
9. NivGuest

Jonathan Bromley wrote:
> On 11 Aug 2006 08:11:39 -0700,
> Niv <> wrote:
>
> >I've copied a 16 bit wide 15 deep pipeline cordic from some free
> >website & got it
> >simulating (seems to jump about near 0 though).
> >
> >But, how to change it so it accepts a wider input value (think that's
> >easy), but takes more time to calculate because it does many more
> >iterations. Goal is for a small area implementation.

>
> Can you quantify "small"? I resuscitated my iterative CORDIC (one
> clock cycle per bit of data, not pipelined); for 16 data bits and 3
> guard bits it needs 62 FFs and about 230 LUTs in a Spartan-3.
> I haven't tried really hard to optimise it for area; for example,
> the arctangent lookup table is not yet in a ROM, but is in general
> logic. I get something like 100MHz clock rate without trying
> too hard. There are plenty of opportunities for making the
> logic a little smaller and faster, but not much chance of
> reducing the FF count. The size scales linearly with bit width
> of the data (or very nearly so - the ROM grows faster than
> linearly, but it's not a big deal).
>
> Bit-serial implementation would be smaller but needs the same
> number of FFs; however, by careful shoehorning you could get
> some of the storage into SRL16s and save quite a lot of space.
> That sort of jumping-on-the-suitcase is not really my strength,
> though!
>
> I'll try to put the example on our website before long, but
> meanwhile let me know if you want a copy.
> --
> Jonathan Bromley, Consultant
>
> DOULOS - Developing Design Know-how
> VHDL * Verilog * SystemC * e * Perl * Tcl/Tk * Project Services
>
> Doulos Ltd., 22 Market Place, Ringwood, BH24 1AW, UK
>
> http://www.MYCOMPANY.com
>
> The contents of this message may contain personal views which
> are not the views of Doulos Ltd., unless specifically stated.

The size you describe sounds plenty small enough, target device is
probably an Actel APA1000, but a lot of other stuff is going in there
as well. 100 MHz is plenty fast enough as well, our master clock is
only 33MHz, possible a bit more, but I have 1/2 ms to do all my calcs
(to be confirmed though).

I'm starting to get my head round the cordic, so I'll persevere a bit
more so I understand it and then code it up myself (VHDL), with loads
of generics to paramaterise it, if I can.
But looking at other designs may help my understanding.
It seems a word serial is only about twice the size of a bit serial, so
may look at both types of implementation. A fully parallel design
would be too resource hungry though, I think.

Regards, Kev P. (AKA Niv).

Niv, Aug 16, 2006
10. Jonathan BromleyGuest

On 16 Aug 2006 00:13:00 -0700, Niv
<> wrote:

[me]
>> Can you quantify "small"? I resuscitated my iterative CORDIC (one
>> clock cycle per bit of data, not pipelined); for 16 data bits and 3
>> guard bits it needs 62 FFs and about 230 LUTs in a Spartan-3.

whoops, typo, that should have been "about 330 LUTs". Sorry.
This count scales pretty much linearly with the number of bits
in the data path, and is only very slightly affected by the number
of iterations you perform.

[Kevin]
>I'm starting to get my head round the cordic, so I'll persevere a bit
>more so I understand it and then code it up myself (VHDL), with loads
>of generics to paramaterise it, if I can.

I parameterised for...
* bit width of the coordinate and angle I/Os
* number of guard LSBs added to the internal datapaths
* number of iterations of the algorithm

Does it need any more? I'm not sure.

>But looking at other designs may help my understanding.

I've emailed it to you separately. The testbench is ropey, but
it shows the thing working as a cos/sin calculator.

>It seems a word serial is only about twice the size of a bit serial

Indeed. Unless you do very, very clever stuff with the bit shifters,
you need the same number of FFs for both. But the LUT count should
be significantly lower than for bit-serial. So maybe it could be got
down to half the size of word-serial, or thereabouts.

> A fully parallel design
>would be too resource hungry though, I think.

Yes, for any respectable accuracy. The fully-parallel design
is simpler because you don't need a barrel shifter, but bigger
because you replicate the adder hardware for each iteration.
I reckon the barel shifter is about as big as 3 adders, so you
can guess that a 16+3-bit, 19-iteration version would come
out around 5X bigger than the word-serial form. And given
you've got so much time to do the calculation, it would be
a completely pointless waste of space.

Thanks for provoking me into bringing that design back to
life - it gave me a chance to tidy it up, and it now exists in
both VHDL and Verilog flavours. --
Jonathan Bromley, Consultant

DOULOS - Developing Design Know-how
VHDL * Verilog * SystemC * e * Perl * Tcl/Tk * Project Services

Doulos Ltd., 22 Market Place, Ringwood, BH24 1AW, UK

http://www.MYCOMPANY.com

The contents of this message may contain personal views which
are not the views of Doulos Ltd., unless specifically stated.

Jonathan Bromley, Aug 16, 2006