Best CPU platform(s) for FPGA synthesis

Jon Beniston · Aug 2, 2007

Any specific box recommendations?

I'd recommend running them on MicroBlaze.. good opportunities for h/w
acceleration ;-)

Jon

jjohnson · Aug 2, 2007

Yo, Adrian!

and Paul and everyone else, that's some great info and
is very much appreciated.

Since quartus_fit is dominating my runtime (EP2S180 and HC230), and
quartus_fit gains the most from extra CPUs, it makes sense for me to
go at least to 4 CPUs (I currently only have dual-processor boxes,
thus the need to go shopping). Do you know if the HardCopyII fitter
also makes use of multiple processors?

When Quartus does spawn jobs off to up to 4 processors, can each one
of those spawned jobs use up to 4GB?

In the case of Quartus supporting a max of 4 processors, at the very
least an 8-processor box would allow me to run two copies of Quartus
at the same time (e.g., different designs, or different flavors of the
same design). 8 processors on 64-bit Linux w/ 16GB of RAM with 32-bit
Quartus would seem to be a well-balanced setup if most Quartus jobs
remain under 2GB, correct?
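
A rough way to sanity-check that sizing is to take the smaller of "how many jobs fit by CPU count" and "how many fit by RAM". The sketch below is only back-of-the-envelope arithmetic: the function and parameter names are made up for illustration and are not Quartus settings, and it assumes each job's footprint can be treated as a single per-job number (whether spawned fitter workers each get their own address space is the open question above).

```python
# Back-of-the-envelope capacity check; names are illustrative only.
def max_concurrent_jobs(total_cpus, total_ram_gb, cpus_per_job, ram_per_job_gb):
    """Return how many jobs fit at once, limited by CPUs or RAM."""
    by_cpu = total_cpus // cpus_per_job          # jobs the cores can host
    by_ram = int(total_ram_gb // ram_per_job_gb)  # jobs the memory can host
    return min(by_cpu, by_ram)

# 8 CPUs, 16GB RAM, 4 CPUs per job, jobs staying under 2GB each
print(max_concurrent_jobs(8, 16, 4, 2))
# Same box if each job grew to 4GB
print(max_concurrent_jobs(8, 16, 4, 4))
```

With these inputs the CPU count, not RAM, is the limit in both cases, so the box stays CPU-bound until jobs exceed 8GB each.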

Since memory access is such a big part of the overall runtime,
obviously the faster memory buses on newer machines will help. (Good
thing, because the clock speed difference alone from an Opteron 250 to
a newer Opteron 2218 isn't much of an increase: 2.4GHz to 2.8GHz).

Since the databases for big chips get so large (and memory accesses
apparently so random), does a larger data cache buy you much? The L1
I&D caches are relatively small on both AMD and Intel, although
Opteron is 2x (64K Instr, 64K Data) larger than Intel's.

For the L2 cache, Intel's is 2x larger than AMD's on a per-core basis.
Since Intel shares two caches between neighboring cores (as you say
1&2 or 3&4 can share quickly, but slow from 1/3 and 2/4), whereas
Opterons have a dedicated cache per core, would Opterons see a speedup
from less contention for the cache, or a slowdown from having to go
outside the local caches in order to share data? (I guess a function
of how often the quartus_fit algorithms need to share data, right?)

If I were trying to run two Quartus jobs simultaneously on one 8-CPU
machine (with NUM_PARALLEL_CPUS = 4 for each run), I would expect
competition for external memory to be huge, and thus statistically
some benefit to Intel's larger cache. And with more "stuff" cached,
the higher clock speeds on current Intel CPUs might give the
runtime advantage to Intel. On the other hand, AMD has the Direct
Connect Architecture and HyperTransport, so...

I know you vendor guys are reluctant to publish benchmark info, but
from the currently-available, mainstream, small-server perspective
with 8 processors, I'm kind of pushed toward the following CPU
choices:

4 dual-core Opteron 2218's (2.6 GHz, 90nm process, 2MB L2 cache as 1MB
dedicated per core )
4 dual-core Opteron 2220's (2.8 GHz, 90nm process, 2MB L2 cache as 1MB
dedicated per core )
4 dual-core Intel 5160's (3.0 GHz, 65nm process, 1333 MHz FSB, 4MB
shared L2 cache)
2 quad-core Intel X5355's (2.66 GHz, 65nm process, 1333 MHz FSB, 8MB
L2 cache, shared 4MB per core pair)

Of those, is there an obvious bang for the buck advantage (weighted
more toward bang than buck) for any one of those in particular?

-------
P.S. Those QX6850's are hard to come by; Dell's overclocked XPS720's
look sweet, but my company won't spring for overclocked boxes...

Thanks again, very very much!

Wei Wang · Aug 2, 2007

Did you change the setting "use up to x number of CPUs" (don't remember
the exact name) somewhere in the project settings?

is there such a setting for xilinx ise as well?

thx, -wei

Wei Wang · Aug 2, 2007

If cost is no object, then go with the Intel quad-core running at 3
GHz : QX6850. Each core has 2 MB of L2 cache (8MB total), which is,
according to several reports in this forum, the single most important
factor.

I would say go with 4GB of ram, although if you're using the biggest
chips, you might need more. Keep in mind that Windows 32-bit will only
be able to use 3GB max of this 4 GB, and each application will only be
able to access 2GB max. So you might consider Windows 64 bits or Linux
64 bits if necessary.

Patrick

Why only 3GB max of 4GB? thanks, -Wei

Wei Wang · Aug 2, 2007

is there such a setting for xilinx ise as well?

thx, -wei

Found similar memory recommendations for Xilinx's largest XC5VLX330
FPGA,
http://www.xilinx.com/ise/products/memory.htm#v5lx
only Linux-64 machines are supported, memory recommendation: typical
7.2GB and peak 10.6GB.

Guest · Aug 2, 2007

Wei Wang said:
Found similar memory recommendations for Xilinx's largest XC5VLX330
FPGA,
http://www.xilinx.com/ise/products/memory.htm#v5lx
only Linux-64 machines are supported, memory recommendation: typical
7.2GB and peak 10.6GB.

This web page needs to be updated: NT64 is also supported, but runtime
will be faster on Linux64, so that's what we recommend.

Steve

MM · Aug 2, 2007

Hi Steve,

Could you give us (Xilinx users) some more detailed recommendations on what
would be the best platform to run ISE/EDK tools when working on midsize to
big designs? Tell us what you are using @ Xilinx?

Thanks,
/Mikhail

Guest · Aug 2, 2007

I can give you some general recommendations. For the best place and
route runtimes, use a 64bit Linux system. If your design is small
enough to fit into 4G of memory (LX110 or smaller), and you are not
programming devices (the 32bit cable drivers don't work on a 64bit
system), you can use the 32bit executables to save memory.
Otherwise, go ahead and use the 64bit executables. They use more
memory and the runtime is similar.

As mentioned earlier, synthesis, map, place and route do not use
multithreading, so you will not get an advantage using multiple
processors for a single design. However, ProjNav is multithreaded so
if you are doing different tasks, other processors will be used. In
addition, upcoming software releases will use those processors.

Steve

Eric Smith · Aug 2, 2007

Steve said:
I can give you some general recommendations. For the best place and
route runtimes, use a 64bit Linux system. If your design is small
enough to fit into 4G of memory (LX110 or smaller), and you are not
programming devices (the 32bit cable drivers don't work on a 64bit
system), you can use the 32bit executables to save memory.
Otherwise, go ahead and use the 64bit executables. They use more
memory and the runtime is similar.

Note that it works just fine to install 32-bit ISE on a 64-bit Linux
system, and to install the 64-bit cable drivers.

In my experience, the open source user-space-only cable interface works
far better than the Xilinx-supplied cable drivers anyhow:

http://www.rmdir.de/~michael/xilinx/

Andreas Ehliar · Aug 3, 2007

Why only 3GB max of 4GB? thanks, -Wei

The short answer is that the upper 1GB is reserved for the kernel.
If you want a bit more detail you can look at for example the
following article:
http://kerneltrap.org/node/2450

/Andreas
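
The arithmetic behind those limits can be sketched in a few lines. This is only an illustration: the function name is made up, the 1GB kernel reservation is the figure from the answer above, and the 2GB reservation is used as the case matching the 2GB-per-application limit mentioned earlier in the thread for 32-bit Windows.

```python
GIB = 2**30  # bytes in one gibibyte

def user_space_gib(address_bits=32, kernel_reserved_gib=1):
    """Address space left for one process after the kernel's share."""
    total_gib = 2**address_bits // GIB  # 32 bits -> 4 GiB of addresses
    return total_gib - kernel_reserved_gib

print(user_space_gib())                       # 1GB reserved for the kernel
print(user_space_gib(kernel_reserved_gib=2))  # 2GB reserved for the kernel
```

A 32-bit pointer can name only 4GB of addresses in total, so whatever the kernel keeps for itself comes straight out of what an application can use.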

MM · Aug 3, 2007

I can give you some general recommendations. For the best place and
route runtimes, use a 64bit Linux system. If your design is small
enough to fit into 4G of memory (LX110 or smaller), and you are not
programming devices (the 32bit cable drivers don't work on a 64bit
system), you can use the 32bit executables to save memory.
Otherwise, go ahead and use the 64bit executables. They use more
memory and the runtime is similar.

Is there a 64-bit version of EDK ? If not, can I mix 64 bit ISE with 32 bit
EDK?

Thanks,
/Mikhail

Wei Wang · Aug 3, 2007

I can give you some general recommendations. For the best place and
route runtimes, use a 64bit Linux system. If your design is small
enough to fit into 4G of memory (LX110 or smaller), and you are not
programming devices (the 32bit cable drivers don't work on a 64bit
system), you can use the 32bit executables to save memory.
Otherwise, go ahead and use the 64bit executables. They use more
memory and the runtime is similar.

As mentioned earlier, synthesis, map, place and route do not use
multithreading, so you will not get an advantage using multiple
processors for a single design. However, ProjNav is multithreaded so
if you are doing different tasks, other processors will be used. In
addition, upcoming software releases will use those processors.

Steve

What I found was very interesting: it was taking me 12 hours to run
the MAP process before, but yesterday it took only ~3 hours to run
MAP, and PAR took only ~40 mins as well.

I was trying to figure out the reasons, then found in the *.map and
*.mrp files that there was always a MAP phase which took as long as
~10+ hours, and that phase was always very memory hungry. I was using
Linux64 with 2GB real memory and 4GB swap memory, and I just found that
the real 2GB memory was much smaller than the recommended peak memory
of 10.6GB. Yesterday, I ran ISE9.1i for XC5VLX330 on another
Linux64 machine with 11G real memory and 8G swap memory, and there
wasn't any MAP phase which took a ridiculous ~10+ hours.

Can Xilinx guys shed some more light on the runtime of MAP and PAR,
wrt different memory sizes and CPU cores?
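
For anyone wanting to check a machine before a long run, here is a minimal sketch of the comparison described above. The function name and verdict strings are invented for illustration, and the 10.6GB figure is the peak listed on the Xilinx memory page, used only as a stand-in: the real footprint depends on the design.

```python
def memory_verdict(peak_gb, ram_gb, swap_gb):
    """Classify an expected peak footprint against RAM and swap."""
    if peak_gb <= ram_gb:
        return "fits in RAM"
    if peak_gb <= ram_gb + swap_gb:
        return "spills to swap"  # run may finish, but likely very slowly
    return "exceeds RAM + swap"

# The two machines from this post, against the listed 10.6GB peak
print(memory_verdict(10.6, 2, 4))
print(memory_verdict(10.6, 11, 8))
```

Anything short of "fits in RAM" suggests the tool will spend much of its time paging, which would be consistent with the multi-hour MAP phase seen on the 2GB machine.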

Patrick Dubois · Aug 3, 2007

P.S. Those QX6850's are hard to come by; Dell's overclocked XPS720's
look sweet, but my company won't spring for overclocked boxes...

Polywell has some desktop computers with QX6850 available. Although
since you're looking at an 8-way workstation (!), QX6850 is probably
not an option. Polywell has AMD or Intel workstations with the CPUs
you're looking at as well.

For one socket, Intel clearly has the edge over AMD I think. For multi-
socket workstations/servers however, I'm not so sure. Benchmarks are
harder to find. I would suspect that the Hypertransport bus would help
AMD close the gap with Intel a little. Their integrated memory
controller probably helps as well in a multi-socket machine.

I searched for benchmarks for the newest 90-nm Opteron but couldn't
find any unfortunately...

Patrick

Guest · Aug 3, 2007

Can Xilinx guys shed some more light on the runtime of MAP and PAR,
wrt different memory sizes and CPU cores?

Even though our memory requirement table lists devices, memory is more
dependent on the design and the timing constraints. Since we can't predict
what is in your design, we just give you the typical and max numbers from
our collected test cases.

One example of constraints that reduce memory: instead of creating
a bunch of individual from-to timespecs, you can create timegroups with the
endpoints, then put one timespec on those timegroups.

Also, ISE 9.2i is getting an average of 27% improvement in memory
utilization.

I don't have any data regarding runtime of different CPU cores.

Steve
