Best CPU platform(s) for FPGA synthesis


jjohnson

OK, the questions apply primarily to FPGA synthesis (Altera Quartus
fitter for StratixII and HardCopyII), but I'm interested in feedback
regarding all EDA tools in general.


Context: I'm suffering some long Quartus runtimes on their biggest
StratixII and second-biggest HardCopyII device. Boss has given me
permission to order a new desktop/workstation/server. Immediate goal
is to speed up Quartus, but other long-term value considerations will
be taken into account.


True or false?
--------------------
Logic synthesis (analyze/elaborate/map) is mostly integer operations?
Place and Route (quartus_fit) is mostly double-precision floating-point?
Static Timing Analysis (TimeQuest) is mostly double-precision floating-
point?
RTL simulation is mostly integer operations?
SDF / gate-level simulation is mostly double-precision floating-point?


AMD or Intel?
-------------------
Between AMD & Intel's latest multicore CPUs,
- Which offers the best integer performance?
- Which offers the best floating-point performance?
Specific models within the AMD/Intel family?
Assume cost is no object, and each uses its highest-performing memory
interface, but disk access is (a necessary evil) over a networked drive.
(Small % of total runtime anyway.)


Multi-core, multi-processor, or both? 32-bit or 64-bit? Linux vs.
Windows? >2GB of RAM?
--------------------
Is Quartus (and the others) more efficient in any one particular
environment? I prefer Linux, but the OS is now secondary to pure
runtime performance (unless it is a major contributor). Can any of
them make use of more than 2GB of RAM? More than 4GB? Useful limit on
the number of processors/cores?


Any specific box recommendations?



Thanks a gig,

jj
 

sharp

True or false?
--------------------
Logic synthesis (analyze/elaborate/map) is mostly integer operations? Yes.

Place and Route (quartus_fit) is mostly double-precision floating-point?
I don't know why they would use floating point if they don't have to.
Static Timing Analysis (TimeQuest) is mostly double-precision floating-
point?
I seriously doubt it. I don't see a need for floating point there
when delays can use scaled integers.
RTL simulation is mostly integer operations? Yes.

SDF / gate-level simulation is mostly double-precision floating-point?
No, or at least not in any implementation I am familiar with. All the
delays are scaled up so that integers can be used for them.

In simulation (assuming something with state-of-the-art performance),
the CPU operations themselves are not very important anyway. It is
not compute-bound, it is memory-access-bound. What you need is big
caches and fast access to memory for when the cache isn't big enough.

Is Quartus (and the others) more efficient in any one particular
environment? I prefer Linux, but the OS is now secondary to pure
runtime performance (unless it is a major contributor). Can any of
them make use of more than 2GB or RAM? More than 4GB?

64-bit Linux can make use of more than 4GB of RAM. But don't use 64-
bit executables unless your design is too big for 32-bit tools,
because they will run slower on the same machine.
Useful limit on the number of processors/cores?

Most of these tools are not multi-threaded, so the only way you will
get a speedup is if you have multiple jobs at the same time. Event-
driven simulation in particular is not amenable to multi-threading,
despite much wishful thinking for the last few decades.
 

Nial Stewart

I think that memory performance is the limiting factor for
FPGA synthesis and P&R.

This machine had a single-core AMD64 processor, which I recently
replaced with a slightly faster dual-core processor.

I ran a fairly quick FPGA build through Quartus to get a time for a
before and after comparison before I did the swap.

The before and after times were exactly the same :-(

I think the amount and speed of memory is crucial; it's probably
worth paying as much attention to that as to the processor.


Nial.
 

Frank Buss

Nial said:
I ran a fairly quick FPGA build through Quartus to get a time for a
before and after comparison before I did the swap.

Did you change the setting "use up to x number of CPUs" (don't remember
the exact name) somewhere in the project settings?
 

Patrick Dubois

AMD or Intel?
-------------------
Between AMD & Intel's latest multicore CPUs,
- Which offers the best integer performance?
- Which offers the best floating-point performance?
Specific models within the AMD/Intel family?

Assume cost is no object, and each uses its highest-performing memory
interface, but disk access is (necessary evil) over a networked drive.
(Small % of total runtime anyway.)

Multi-core, multi-processor, or both? 32-bit or 64-bit? Linux vs.
Windows? >2GB of RAM?

If cost is no object, then go with the Intel quad-core running at 3
GHz: the QX6850. It has 8 MB of L2 cache (4 MB shared per core pair),
which is, according to several reports in this forum, the single most
important factor.

I would say go with 4GB of ram, although if you're using the biggest
chips, you might need more. Keep in mind that Windows 32-bit will only
be able to use 3GB max of this 4 GB, and each application will only be
able to access 2GB max. So you might consider 64-bit Windows or
64-bit Linux if necessary.

Patrick
 

Kai Harrekilde-Petersen

Jon Beniston said:
Dynamic range?

Not a likely problem. Even a 32bit int would be big enough for holding
up to a ridiculous 4.3 seconds, assuming 1psec resolution.

As far as I know, everything in the simulate, synth, P&R, and STA
chain can be performed with adequate resolution using integers.

Crosstalk and inductive effects might require floating-point help, but
I wouldn't be surprised if even that could be approximated well with
fixed-point arithmetic.


Kai
 

Eric Smith

64-bit Linux can make use of more than 4GB of RAM. But don't use 64-
bit executables unless your design is too big for 32-bit tools,
because they will run slower on the same machine.

Although that might be true for some specific cases, in general on Linux
native 64-bit executables tend to run faster than 32-bit executables.
But I haven't benchmarked 32-bit vs. 64-bit FPGA tools.
 

jjohnson

Thanks everyone, this is real interesting, but please don't stop
posting if you have more insights to share!

FWIW, my runtimes in Quartus are dominated by P&R (quartus_fit); on
Linux, they run about 20% faster on my 2005-era 64-bit Opteron than on
my 2004-era 32-bit Xeon (both with a 32-bit build of Quartus). Another
test run of a double-precision DSP simulation (compiled C) ran
substantially slower on the Opteron, which I thought was supposed to
have better floating-point performance than Xeons of that era. Maybe
it was just a case of the gcc -O5 optimization switches being totally
tuned to Intel instead of AMD, or maybe my Quartus P&R step is
primarily dominated by integer calculations.

I originally suspected P&R might have a lot of floating-point
calculations (even prior to signal-integrity considerations) if they
were doing any kind of physical synthesis (e.g., delay calculation
based on distance and fanout); ditto for STA, because that's usually
an integral part of the P&R loops. I also suspected that if floating-
point operations (at least multiplies, add/subtract, and MACs) could
be done in a single cycle, there would be no advantage to using
integer arithmetic instead (especially if manual, or somewhat explicit
integer scaling is required).

On the other hand, in something like a router, you can represent
location info such as grid coordinates more exactly with integers than
with floating-point. As far as dynamic range is concerned, I seem to recall
that SystemC standardized on 64-bit time to run longer simulations,
but SystemC is a different animal in that regard anyway. Nonetheless,
I also seem to recall that its implementation of time was 64-bit
integers (scaled), because the average FPU operations are really only
linear over the 53-bit mantissa part. Assuming they want linear
representation of time ticks, I can see the appeal of using 64-bit
integers in simulation.

As far as event-driven simulations are concerned, I totally understand
how hard it is to make good use of multithreading or multiprocessing,
because everything is so tightly coupled in that evaluate/update/
reschedule loop. If you were working at a much higher level
(behavioral/transaction), where the number of low-level events is
lower and the computation behind "complex" events took up a much
larger portion of the evaluate/update/reschedule loop, then multicore/
multiprocessing solutions might be more effective for simulation.
(Agree/disagree?) It seems that as you get more coarse-grained with
the simulation, that even distributed processing (multiple machines on
a network) becomes more feasible. Obviously the scheduler has one
"core" and has to reside in one CPU/memory space, but if it has less
work to do, then it can handle less frequent communication with the
event-processing CPUs in another space.

Back to Quartus in particular and Windows in general... Quartus
supports the new "number_of_cpus" or some similar variable, but only
seems to use it in small sections of quartus_fit (I think Altera is
just making their baby steps in this area).

That appears to be related to the number of processors inside one box.
If a single CPU is just hyperthreaded, the processor takes care of
instruction distribution unrelated to a variable like number_of_cpus,
right? And if there are two single-core processors in a box, obviously
it will utilize "number_of_cpus=2" as expected. Does anyone know how
that works with multi-core CPUs? E.g., if I have two quad-core CPUs in
one box, will setting "number_of_cpus=7" make optimal use of 7 cores
while leaving me one free to work in a shell or window?

Does anyone know if Quartus makes better use of multiple processors in
a partitioned bottom-up flow compared to a single top-down compile
flow?

In 32-bit Windows, is that 3GB limit for everything running at one
time? i.e., is 4GB a waste on a Windows machine? Can it run multiple
2GB processes and go beyond 3 or 4GB? Or is 3GB an absolute O/S limit,
and 2GB an absolute process limit in Windows?

In 32-bit Linux, can it run 4GB per process and as many simultaneous
processes of that size as the virtual memory will support?

In going to 64-bit apps and O/S versions, should the tools run equally
fast as long as the processor is truly 64-bit?


Thanks again for all the insights and interesting discussion.


jj
 

Jon Beniston

Not a likely problem. Even a 32bit int would be big enough for holding
up to a ridiculous 4.3 seconds, assuming 1psec resolution.

I think you're a factor of 1000 out.

For an ASIC STA, gate delays must be specified at a much finer
resolution than 1ps.

Cheers,
Jon
 

Kai Harrekilde-Petersen

Thanks everyone, this is real interesting, but please don't stop
posting if you have more insights to share!
[snip]

In 32-bit Linux, can it run 4GB per process and as many simultaneous
processes of that size as the virtual memory will support?

As I recall, 32-bit Linux has a limit around 3.0-3.5GB per process.
On 64-bit Linux, I have used 8+GB for a single process doing
gate-level simulations.


Kai
 

Kai Harrekilde-Petersen

Jon Beniston said:
I think you're a factor of 1000 out.

Duh, brain fart indeed!
For an ASIC STA, gate delays must be specified at a much finer
resolution than 1ps.

I don't recall seeing sub-psec resolution in the 130nm libraries I
have seen, but that doesn't imply that it cannot be so.

But I stand by my argument: the actual resolution should not matter
much, as the total clock delays and cycle times should scale pretty
much as the library resolution. Otherwise, there wouldn't be a point
in choosing such a fast technology (who in their right mind would use
a 45nm process for implementing a 32kHz RTC, unless they had to?)


Kai
 

Paul Uiterlinden

In 32-bit Linux, can it run 4GB per process and as many simultaneous
processes of that size as the virtual memory will support?

Below is what I have read about it in "Self-Service Linux®"
http://www.phptr.com/content/images/013147751X/downloads/013147751X_book.pdf
I have no experience with it.

<quote>
3.2.2.1.6 The Kernel Segment

The only remaining segment in a process' address space to discuss is the
kernel segment. The kernel segment starts at 0xc0000000 and is
inaccessible by user processes. Every process contains this segment,
which makes transferring data between the kernel and the process'
virtual memory quick and easy. The details of this segment’s contents,
however, are beyond the scope of this book.

Note:

You may have realized that this segment accounts for one quarter of the
entire address space for a process. This is called 3/1 split address
space. Losing 1GB out of 4GB isn't a big deal for the average user, but
for high-end applications such as database managers or Web servers,
this can become an issue. The real solution is to move to a 64-bit
platform where the address space is not limited to 4GB, but due to the
large amount of existing 32-bit x86 hardware, it is advantageous to
address this issue. There is a patch known as the 4G/4G patch, which
can be found at ftp.kernel.org/pub/linux/kernel/people/akpm/patches/ or
http://people.redhat.com/mingo/4g-patches. This patch moves the 1GB
kernel segment out of each process’ address space, thus providing the
entire 4GB address space to applications.
<end quote>
 

comp.arch.fpga

Thanks everyone, this is real interesting, but please don't stop
posting if you have more insights to share!
I originally suspected P&R might have a lot of floating-point
calculations (even prior to signal-integrity considerations) if they
were doing any kind of physical synthesis (e.g., delay calculation
based on distance and fanout); ditto for STA, because that's usually
an integral part of the P&R loops. I also suspected that if floating-
point operations (at least multiplies, add/subtract, and MACs) could
be done in a single cycle, there would be no advantage to using
integer arithmetic instead (especially if manual, or somewhat explicit
integer scaling is required).

On the other hand, in something like a router, you can get more exact
location info wrt stuff like grid coordinates than you can with
floating-point. As far as dynamic range is concerned, I seem to recall
that SystemC standardized on 64-bit time to run longer simulations,
but SystemC is a different animal in that regard anyway. Nonetheless,
I also seem to recall that its implementation of time was 64-bit
integers (scaled), because the average FPU operations are really only
linear over the 53-bit mantissa part. Assuming they want linear
representation of time ticks, I can see the appeal of using 64-bit
integers in simulation.

Any operations on large netlists are completely memory and pointer
dominated. There are lots of random-access pointer indirections in
data sets that are much larger than the cache. The computations done
once you have the data do not matter at all. You need hundreds of CPU
cycles to access the delay parameters of two gates in a netlist;
summing them up can be done for free while the CPU waits on the next
load instruction.

On the other hand, if your dynamic range is needed for summing up
small values, floating point does not help at all: 1e12 + 1 = 1e12 in
32-bit floating point. For operations like that, a 32-bit integer
actually has about 7 bits more dynamic range.
As far as event-driven simulations are concerned, I totally understand
how hard it is to make good use of multithreading or multiprocessing,

Why? In a larger design there will always be many active processes at
each timestep. These can be distributed to individual processors. All
operations can be on shared memory, because each signal has only one
driver.

Kolja Sulimma
 

PeteS

Patrick said:
If cost is no object, then go with the Intel quad-core running at 3
GHz: the QX6850. It has 8 MB of L2 cache (4 MB shared per core pair),
which is, according to several reports in this forum, the single most
important factor.

I would say go with 4GB of ram, although if you're using the biggest
chips, you might need more. Keep in mind that Windows 32-bit will only
be able to use 3GB max of this 4 GB, and each application will only be
able to access 2GB max. So you might consider 64-bit Windows or
64-bit Linux if necessary.

Patrick

The last time I checked the speed of a full FPGA build, the cache did
indeed have the single largest effect, which is hardly surprising. A
cache access is typically one internal bus cycle (not a CPU cycle),
which is an order of magnitude faster than an external memory access.

Incidentally, properly optimised code that makes good use of the
I-cache will run much faster than heavily inlined code.

Cheers

PeteS
 

Ioiod

64-bit Linux can make use of more than 4GB of RAM. But don't use 64-
bit executables unless your design is too big for 32-bit tools,
because they will run slower on the same machine.

Interesting -- on an AMD Athlon X2/5200+ running RHEL Linux 4 update 4
x86_64, just about all Synopsys Design Compiler jobs run FASTER in
64-bit mode than 32-bit mode, between 5-10% faster. The penalty is a
slightly larger RAM footprint, just as you noted. The X2/5200+ is
spec'd the same as an Opteron 1218 (2.6GHz, 2x1MB L2 cache).

This trend was pretty much consistent across all our Linux EDA-tools.

On Solaris SPARC, 64-bit mode was definitely slower than 32-bit mode,
by about 10-20%. For the life of me, I can't understand why the AMD
would run 64-bit mode faster than its 32-bit mode -- but for every
other machine architecture, 64-bit mode is almost always slower.

I forgot to re-run my 32-bit vs 64-bit benchmark on Intel Core 2 Duo
machines. For 64-bit, the Intel E6850 (4MB L2 cache, 3.0GHz) ran
anywhere from 50-60% faster than the AMD X2/5200+. Don't worry, no
production machines were overclocked (for obvious official sign-off
reasons.) It was just an admin's corner-cubicle experiment.
Most of these tools are not multi-threaded, so the only way you will
get a speedup is if you have multiple jobs at the same time. Event-
driven simulation in particular is not amenable to multi-threading,
despite much wishful thinking for the last few decades.

When I ran two separate (unrelated) jobs simultaneously on the AMD and
Intel machines, the AMD machine handled dual-tasking much better. AMD
only dropped 5-7% for each job. The E6600 fared a lot worse -- anywhere
from 10-30% performance drop. (Though not as bad as the Pentium 3 and
Pentium 4 based Xeons.)

I'm wondering if the E6600's unified 4MB L2 cache thrashes badly in
dual-tasking. Or, the better way to look at it: in single-tasking, the
4MB L2 cache is 4X more than the AMD Opteron's 1MB cache per CPU core.
 

Ioiod

Eric Smith said:
Although that might be true for some specific cases, in general on Linux
native 64-bit executables tend to run faster than 32-bit executables.
But I haven't benchmarked 32-bit vs. 64-bit FPGA tools.

I think that should be qualified to say that x86_64 Linux binaries run
faster than the same programs compiled for 32-bit x86 Linux.

For other CPU architectures (MIPS, SPARC, PowerPC, etc.), the opposite
is generally true.
 

glen herrmannsfeldt

Ioiod wrote:

(snip)
On Solaris SPARC, 64-bit mode was definitely slower than 32-bit mode,
by about 10-20%. For the life of me, I can't understand why the AMD
would run 64-bit mode faster than its 32-bit mode -- but for every
other machine architecture, 64-bit mode is almost always slower.

It might be because more registers are available, and IA32
code is register starved.

-- glen
 

Paul Leventis

Hi JJ,

Here is a rather long but detailed reply to your questions courtesy of
Adrian, one of our parallel compile experts.

You were correct in guessing that quartus_fit included floating-point
operations, but as other writers here have responded, memory accesses
are easily as important in terms of runtime, if not more so. By
contrast, quartus_sta is dominated by integer operations and memory
accesses. Incidentally, this is why quartus_fit will produce a
different fit on different OSes while quartus_sta will not: integer
operations are exact across all platforms, but the compilers optimize
floating-point operations differently between Windows and Linux, which
results in a different fit.

Quartus II's new NUM_PARALLEL_PROCESSORS is required to enable any
kind of parallel compilation. We do not offer any support for
HyperThreaded processors and actually recommend our users disable it
in the BIOS, as it can decrease memory system performance even for a
normal, non-parallel compilation. By contrast, multi-core machines
yield good results. If you have an Intel Core 2 Duo, for example,
you'd set NUM_PARALLEL_PROCESSORS to 2. If you have two dual-core
Opterons, you'd set it to 4, and so on.
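For anyone looking for the concrete setting, it is a global assignment in the project's .qsf file (syntax as used in the Quartus II versions discussed here; verify the exact name against your release's documentation):

```tcl
# Project .qsf file: allow the fitter (and other parallel-enabled
# stages) to use up to two processor cores.
set_global_assignment -name NUM_PARALLEL_PROCESSORS 2
```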

Currently, some parts of quartus_fit, quartus_tan and quartus_sta can
take advantage of parallel compilation, though the best improvement is
usually in quartus_fit. Small designs and those with easy timing and
routability constraints will typically not see much improvement, but
larger and harder-to-fit circuits (the designs that need it the most!)
can see substantial reductions. While the speedups are currently
modest and nowhere near linear with the number of processors used,
they have improved with every release since Quartus 6.1 and we plan to
continue this in future releases.

We do not currently support additional parallel features during
incremental compilation; i.e., different partitions will not be mapped
and fit completely in parallel. The fitter will get as much benefit
from parallel compilation as it would without any partitions.

One gotcha with parallel compilation is related to my first point
about Quartus having lots of memory accesses. On some current systems,
the memory system can become a significant bottleneck. For example, an
Intel Core 2 Quad chip has two shared L2 caches, enabling very fast
communication within the core pairs (1,2) and (3,4), but relatively
slow communication across pairs, since those memory requests must all
go over the front-side bus. In this case, setting
NUM_PARALLEL_PROCESSORS to 4 may even give a worse result than setting
it to 2 by forcing half the communication to take place over this
slower FSB. Even with only two processors in use, the OS may sometimes
schedule the two processes on cores in different pairs unless you
specify otherwise. Solutions to this problem can be found at www.altera.com/support.
Not all platforms are affected; you'll have to try it and see.

At present, Quartus II supports a maximum of four processors (or
cores), so half of your dual quad-core configuration will go unused.
However, your intuition about leaving a processor free is correct; if
you have a four-core system and set NUM_PARALLEL_PROCESSORS to 3,
you will never see Quartus take more than 75% of your computer's CPU.

As for different OS's, the 32-bit Windows version of Quartus is a
little faster than the Linux version; the differences are largely due
to the quality and settings of the optimizing C compilers we use on
these two platforms, and varies somewhat between various Quartus
executables. 64-bit versions of Quartus are slightly slower than 32-
bit versions due to the increase in working set size (memory) from 64-
bit pointers; this in turn reduces cache hits and thus slows down the
program. This behaviour is true of most 64-bit applications.

Note: You can run 32-bit Quartus in 64-bit Windows/Linux with no such
performance penalty, and gain access to 4 GB of addressable memory.
This should meet user needs for all but the largest and most
complicated of Stratix III designs. See information on memory
requirements of Quartus at http://www.altera.com/products/software/products/quartus2/memory/qts-memory.html.
Also, I've posted on this topic previously (http://tinyurl.com/
36boga).

Regards,

Paul Leventis
Altera Corp.
 

Paul Leventis

Did you change the setting "use up to x number of CPUs" (don't remember
the exact name) somewhere in the project settings?

Yes, turning on multiple CPU support (NUM_PARALLEL_PROCESSORS setting)
will help :)

It will also depend on whether this is a slow or fast compile. A toy
design will see no speed-up, since the run time will be dominated by
aspects of the compiler that are normally a small portion of run time
-- reading databases from disk, setting up data structures, etc. It
is only the key time-consuming algorithms that have been parallelized
(and only some of them at that). Gains will be the largest on large
designs with complicated timing assignments.

Regards,

Paul Leventis
Altera Corp.
 
