iterative algorithms + tightly coupled CPU with cloud of logic in FPGA

Discussion in 'VHDL' started by wallge, Jan 4, 2007.

  1. wallge

    wallge Guest

    I was wondering if anyone had experience with using combinations of
    FPGA based CPUs and surrounding logic to perform iterative algorithms.
    For instance, if we want to implement different types of more complex
    computer vision algorithms in an embedded system, we may wish to use
    the parallelism of an FPGA to do multiple parts of a 2D convolution or
    matrix operation in parallel.
    While the FPGA may be able to handle the number crunching requirements
    of a given algorithm, it seems to me to be ill suited to handle the
    iterative (often non-systolic) nature of many advanced image processing
    algorithms. Often more complex computer vision algorithms seem to be
    too complex to be handled solely by FPGA based logic.

    I was thinking of the case where we have an FPGA connected directly to a
    video source, and data is flowing into the system at some fixed rate.
    We may wish to process this data at several scales, and iteratively
    search the low scales up to the higher ones until we have found
    features of interest in the video stream. Perhaps we wish to mark those
    features by altering pixels in their local neighborhood.

    We may need to iteratively process multiple scales of image data and
    buffer the original video frame in off-FPGA DRAM, since there will not
    be enough on-FPGA BRAM to store full images. Once we find the region of
    interest, we may then wish to retrieve the original to be marked and
    then sent off as output video. A good example of this process might be,
    say, face detection.

    It seems to me that the iterative nature of these kinds of algorithms
    needs to be handled by a combination of CPU and FPGA logic: the FPGA
    handles the number crunching and parallel data paths, and the CPU
    decides when to iterate, when to stop, or in general what to do next
    based on the results of the FPGA's number crunching. The CPU could be
    built from programmable logic, or placed off-FPGA.
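
    A minimal software sketch of that split, assuming a coarse-to-fine
    search over an image pyramid. The function names (downsample,
    fpga_find_peak, cpu_control_loop) and the brightest-pixel "detector"
    are hypothetical stand-ins, not part of any real toolchain: the first
    two model regular work that would live in FPGA logic, the last models
    the data-dependent decisions left to the CPU.

```python
# Software model only: "fpga_*"/downsample stand in for parallel logic,
# cpu_control_loop stands in for the decision-making processor.

def downsample(image):
    """Halve the resolution by averaging 2x2 blocks (regular, FPGA-friendly)."""
    h, w = len(image) // 2, len(image[0]) // 2
    return [[(image[2 * r][2 * c] + image[2 * r][2 * c + 1] +
              image[2 * r + 1][2 * c] + image[2 * r + 1][2 * c + 1]) // 4
             for c in range(w)] for r in range(h)]

def fpga_find_peak(image):
    """Return (value, row, col) of the brightest pixel (toy feature detector)."""
    return max((v, r, c) for r, row in enumerate(image) for c, v in enumerate(row))

def cpu_control_loop(frame, levels, threshold):
    """Search the coarsest scale first; stop at the first scale with a hit."""
    pyramid = [frame]
    for _ in range(levels - 1):
        pyramid.append(downsample(pyramid[-1]))
    for level in range(levels - 1, -1, -1):        # coarse to fine
        value, r, c = fpga_find_peak(pyramid[level])
        if value >= threshold:                     # data-dependent decision
            scale = 2 ** level
            return level, (r * scale, c * scale)   # full-resolution coordinates
    return None                                    # nothing found at any scale
```

    The point of the sketch is only that the loop bounds depend on the
    data: a large feature is found after one pass at the coarsest scale,
    while a small one forces the loop down to full resolution.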

    Does anyone have experience with this kind of thing, or know of
    somewhere I might be able to find more information about optimal ways
    of coupling heterogeneous processors?

    I am aware of Altera's C2H compiler, but have not used it, and don't
    know how optimally it combines FPGA/CPU resources.
    I might be in the market to hire a consultant, if one were
    knowledgeable in this area.
     
    wallge, Jan 4, 2007
    #1

  2. JJ

    JJ Guest

    Some ideas:
    If you have the dosh, you might consider using the Opteron server
    boards with the 2nd socket used for an FPGA plug-in module. There is one
    product for Virtex and another for Stratix; you will need to google for
    those. They were discussed in this group a year ago when they first
    came out, but I forget the vendor names. One issue here is that the
    Opterons communicate with the FPGAs through the HT (HyperTransport)
    bus, and the Opterons run at clock speeds of 2GHz and up while the FPGA
    may be grunting at 300MHz or less, though massively parallel. The
    Opteron had better be smart about partitioning the problem and not hand
    work to the FPGA at too fine a grain, otherwise the HT bus will be the
    bottleneck and either the CPU or the FPGA may sit idle.
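
    The granularity point can be put into a rough cost model, assuming
    each call to the FPGA pays one fixed bus round trip on top of the
    actual compute time. The function name and every number used with it
    are made-up illustrations, not measurements of any HT system.

```python
def offload_time(total_items, items_per_call, bus_latency_s, fpga_items_per_s):
    """Total time in seconds: one bus round trip per call plus FPGA compute time."""
    calls = -(-total_items // items_per_call)   # ceiling division
    return calls * bus_latency_s + total_items / fpga_items_per_s

# Illustrative assumptions: 1e6 items, 1 microsecond per round trip,
# FPGA throughput of 1e9 items per second.
fine = offload_time(1_000_000, 1, 1e-6, 1e9)        # one item per call
coarse = offload_time(1_000_000, 1000, 1e-6, 1e9)   # 1000 items per call
```

    Under these assumed numbers the fine-grained version spends almost
    all of its time on bus round trips, while the coarse-grained one is
    split evenly between bus and compute.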

    The other idea is to consider the soft core processor as a unit you can
    either customize at the instruction level by adding your own bit
    twiddly opcodes or add a coprocessor for more complex processing.
    Adding opcodes usually slows down the cpu since it has already been
    architected without your new opcodes in mind. The copro route should
    work better since this support is usually included in the architecture
    definition.

    If soft cores can perform most of their workload from BRAM with
    little need to go to external DRAM for code or data, then quite a few
    of these cores might be placed in the bigger FPGAs. You might then
    be able to mix and match hardware engines under software control of
    a local soft CPU that runs much closer to their clock speed. You could
    think of an FFT butterfly box as a specialized CPU engine that has its
    instruction set in wired logic; generalize this into a DSP engine and
    there are many options.

    Also consider using real TI/ADI DSP chips with an FPGA as a possible
    accelerator, and also look at nVidia GPUs as a PC accelerator. I haven't
    been there, but some folks claim some impressive speedups and you
    probably already have the hardware.


    John Jakson
    transputer guy
     
    JJ, Jan 4, 2007
    #2

  3. It sounds like you have a software solution that you want to implement
    in hardware. I don't think you should be too hung up on the iterative
    nature of your algorithm; instead, you need to rewrite your algorithm
    to target hardware and take advantage of what VHDL offers. You should
    be looking for things that can be done in parallel. Pipelining can
    reduce the need for memory.

    An example from your description: as areas of interest are
    identified, instead of marking them, pass them on to the next stage in
    the solution.
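
    As a rough software analogy, assuming pixels arrive one at a time,
    Python generators can model stages that hand results straight to the
    next stage without storing a marked copy of the frame. The stage
    names and the threshold detector below are hypothetical.

```python
def pixel_source(frame):
    """Emit (row, col, value) one pixel at a time, like a video stream."""
    for r, row in enumerate(frame):
        for c, v in enumerate(row):
            yield r, c, v

def detect(pixels, threshold):
    """Forward only the coordinates of pixels at or above the threshold."""
    for r, c, v in pixels:
        if v >= threshold:
            yield r, c

def bounding_box(hits):
    """Final stage: update a bounding box as each hit arrives."""
    box = None
    for r, c in hits:
        if box is None:
            box = [r, c, r, c]
        else:
            box = [min(box[0], r), min(box[1], c), max(box[2], r), max(box[3], c)]
    return tuple(box) if box else None
```

    Each stage holds only its own small state (here, four numbers for
    the box), which is the memory saving that pipelining is after.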

    I evaluated a couple of different C-to-HDL compilers. Most of the time
    they require you to rewrite code to work within their environment. To
    some extent it is like learning another language. Now don't get me
    wrong, there are still advantages to using these tools, but I found
    the VHDL that they produce was 'safe' rather than optimal. I decided
    to spend my time becoming a better VHDL programmer instead.

    My advice is to start by implementing your solution as a state machine.
    In your algorithm, break up compound statements into simple steps. Each
    step becomes a state. I developed a technique for implementing for-next
    loops that I could easily manage. What I found was that, while my
    solutions required a lot of cycles, I could achieve higher clock
    frequencies. After I got a working solution that matched my software
    implementation, I went back and identified things that I could
    implement as parallel units or pipeline.
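
    The post does not spell out the for-next technique, so the following
    is only one possible reading, modeled in Python rather than VHDL: a
    loop that sums an array is split into INIT, CHECK, BODY and INCR
    states, with one state transition per clock cycle. The state names
    and the function fsm_sum are assumptions made for illustration.

```python
def fsm_sum(data):
    """Sum a list using one simple step per 'clock cycle'; return (sum, cycles)."""
    state, i, acc, cycles = "INIT", 0, 0, 0
    while state != "DONE":
        cycles += 1                      # each pass models one clock cycle
        if state == "INIT":              # loop initialisation: i = 0, acc = 0
            i, acc = 0, 0
            state = "CHECK"
        elif state == "CHECK":           # loop condition: i < len(data)
            state = "BODY" if i < len(data) else "DONE"
        elif state == "BODY":            # loop body: a single simple operation
            acc += data[i]
            state = "INCR"
        elif state == "INCR":            # loop increment: i = i + 1
            i += 1
            state = "CHECK"
    return acc, cycles
```

    In this model a loop over n items takes 3n + 2 cycles, which matches
    the trade-off described above: many cycles, but each state does so
    little that the clock can run fast.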

    The other thing I wanted to do was reduce the number of multipliers I
    was using. So, I recoded them as shared hardware instead. The amount of
    combinational logic I was using went up but I was able to reduce the
    number of multipliers to 4 from 36.
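
    One common way to share multipliers is time multiplexing, and the
    sketch below assumes that is what was meant (the post does not say).
    It models a pool of multipliers that accepts one batch of operand
    pairs per cycle; the function name shared_multiply is hypothetical.

```python
def shared_multiply(pairs, num_multipliers):
    """Compute all products using a fixed pool; return (products, cycles)."""
    products, cycles = [], 0
    for start in range(0, len(pairs), num_multipliers):
        batch = pairs[start:start + num_multipliers]   # operands muxed in this cycle
        products.extend(a * b for a, b in batch)
        cycles += 1
    return products, cycles

pairs = [(i, i + 1) for i in range(36)]
shared, shared_cycles = shared_multiply(pairs, 4)      # 4 shared multipliers
direct, direct_cycles = shared_multiply(pairs, 36)     # one multiplier per product
```

    Both runs give the same products; the shared version trades more
    cycles, plus the operand-selection logic that a real design would
    need, for fewer multipliers.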
     
    Derek Simmons, Jan 5, 2007
    #3
