Real-time compression algorithms on FPGA

Discussion in 'VHDL' started by Melanie Nasic, Dec 20, 2005.

  1. Hello community,

    I am thinking about implementing a real-time compression scheme on an FPGA
    working at about 500 MHz. Facing the fact that there is no "universal
    compression" algorithm that can compress data regardless of its structure
    and statistics, I assume compressing grayscale image data. The image data is
    delivered line-wise, meaning that one horizontal line is processed, then
    the next one, and so on.
    Because of the high data rate I cannot spend much time on DFT or DCT or on
    data modelling. What I am looking for is a way to compress the pixel data in
    the spatial, not the spectral, domain because of latency aspects, processing
    complexity, etc. Because of the sequential line-by-line data transmission,
    block matching is also not possible in my opinion. The compression ratio is
    not so important; a factor of 2:1 would be sufficient. What really matters is
    the real-time capability. The algorithm should be pipelineable and fast. The
    memory requirements should not exceed 1 kb.
    What "standard" compression schemes would you recommend? Is there
    potential for a non-standard "own solution"?

    Thank you for your comments.

    Regards, Melanie
    Melanie Nasic, Dec 20, 2005

  2. Melanie Nasic

    DerekSimmons Guest

    I attended Altera's DSP showcase/seminar in Rochester; there was a
    large presence for interest in implementing MPEG-4 (Part 10 or 11, I
    can't remember) for HDTV applications. You didn't say if the advice
    you are searching for is for a commercial product, R&D science project
    or a student project, but even implementing something real-time for DVD
    quality (720x480) is worth considering.

    I think a while back somebody announced the availability of a JPEG
    library; I haven't tried it, but critiquing it could be a good place to
    start. You could incorporate parts of its implementation into yours, omit
    the parts you don't agree with, and foster new ideas not present in it.
    There is also a section on video compression systems that would be worth
    taking a look at.

    A slightly different approach is using a tool like Impulse C. It isn't
    really C, but it is close. It allows you to efficiently manage parallel
    systems of functions and integrate them into VHDL and Verilog designs.

    Maybe it is the bubble I have been living in, but your clock speed
    seems high; what device families are you considering? I have seen 80
    MHz designs outperform similar applications on a gigahertz PC. Don't
    let PC marketing skew your objectivity (or maybe it is their choice of
    operating systems).

    Could you tell us more about the purpose of your project and your end
    application?

    DerekSimmons, Dec 20, 2005

  3. I would think of something like:
    - run-length encoding
    - arithmetic coding
    - maybe a dictionary encoding like ZIP
    - if you know some statistics about the values you want to compress, you could
    also try whether Huffman coding is sufficient

    Stefan Rybacki, Dec 20, 2005
  4. You don't say anything about quality.

    Here's C code for a lossy compressor / decompressor which consistently
    achieves a 2:1 ratio for 8 bpp grayscale images:

    #include <stdint.h>
    #include <stdio.h>

    int compress(FILE *fin, FILE *fout)
    {
        uint8_t pin[2], pout;

        for (;;) {
            if (fread(&pin, sizeof pin, 1, fin) != 1)
                return (ferror(fin) ? -1 : 0);
            pout = (pin[0] + pin[1]) / 2;
            if (fwrite(&pout, sizeof pout, 1, fout) != 1)
                return -1;
        }
    }

    int decompress(FILE *fin, FILE *fout)
    {
        uint8_t pin, pout[2];

        for (;;) {
            if (fread(&pin, sizeof pin, 1, fin) != 1)
                return (ferror(fin) ? -1 : 0);
            pout[0] = pout[1] = pin;
            if (fwrite(&pout, sizeof pout, 1, fout) != 1)
                return -1;
        }
    }

    (note that the code assumes that the input stream contains an even
    number of bytes)

    Dag-Erling Smørgrav, Dec 20, 2005
  5. Melanie Nasic

    Dave Guest

    The answer, as always, is it all depends ...

    Lossless compression (something like run length encoding) might work for
    some kinds of image data (computer screens, rendered images), but will fail
    for others (natural images etc).

    Lossy compression will of course lose something from the image. The simplest
    form is probably to average two adjacent pixels, giving you 2 to 1. I
    suspect that anything more complex will exceed your space/speed/complexity
    budget.

    You need to spell out what type of images you are processing, what level of
    'loss' is acceptable, and why you need compression anyway (it will cause you
    much pain!)

    Dave, Dec 20, 2005
  6. Melanie Nasic

    Pete Fraser Guest

    Are you hoping for lossless, or is lossy OK?
    That's 1024 bits, or bytes?
    Is it enough for one line?
    You don't say what your resolution and frame rate are.
    If you don't have a line of storage available, that restricts you a lot.
    I don't understand this though. If you're using a 500MHz FPGA, it's
    presumably recent, and presumably has a decent amount of storage.

    How about a 1d predictor, non-linear quantizer and entropy coder?
    If you have more memory available, look at JPEG-LS.
    It can do lossless, or variable mild degrees of loss.
    Pete Fraser, Dec 20, 2005
  7. Hi Pete,

    I want the compression to be lossless and not based on perceptual
    irrelevancy reductions. By stating 1 kb I meant 1024 bits, and that's just
    about half the line data. Your recommendation of a "1d predictor, non-linear
    quantizer and entropy coder" sounds interesting. Could you please elaborate
    on that? How is it done? Where can I find some exemplary code? How can it
    be achieved in hardware (VHDL sources)?

    Thank you a lot.

    Bye, Mel.
    Melanie Nasic, Dec 20, 2005
  8. Melanie Nasic

    Pete Fraser Guest

    I think you first need to play around with software and a few sample images.
    The 1 d predictor means that you predict the next pixel in the sequence
    by examining pixels on the left. A simple example would be to encode the
    first pixel on the line, then use that as the prediction for the next pixel.
    In that way you send only the difference between the predicted value
    and what the pixel actually is. If you had enough memory to store a line you
    could use a 2 d predictor, where you predict from pixels to the left and
    pixels above.
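    Pete's 1-d prediction step can be sketched in a few lines of C. The
    function names and the int16_t residual type below are illustrative
    choices for this thread, not taken from any standard:

```c
#include <stdint.h>
#include <stddef.h>

/* 1-d prediction: the first pixel of a line is stored as-is; every
   following pixel is replaced by its difference from the pixel to its
   left.  An 8-bit difference fits in int16_t (-255..255). */
void predict_1d(const uint8_t *line, int16_t *residual, size_t n)
{
    if (n == 0)
        return;
    residual[0] = line[0];               /* seed: first pixel sent verbatim */
    for (size_t i = 1; i < n; i++)
        residual[i] = (int16_t)line[i] - (int16_t)line[i - 1];
}

/* Inverse: rebuild the line from the residuals (lossless round trip). */
void unpredict_1d(const int16_t *residual, uint8_t *line, size_t n)
{
    if (n == 0)
        return;
    line[0] = (uint8_t)residual[0];
    for (size_t i = 1; i < n; i++)
        line[i] = (uint8_t)(line[i - 1] + residual[i]);
}
```

    On smooth image data the residuals cluster near zero, which is what
    makes the entropy coder that follows effective.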

    Unfortunately, you can't use the non-linear quantizer as it's lossy.

    I find Khalid Sayood's book "Introduction to Data Compression" quite good.
    It comes with a link to a bunch of simple C code that has a variety of
    predictors and entropy coders. You could try it on some sample images,
    see how good compression you get, then go to hardware when you have
    something acceptable.
    Pete Fraser, Dec 20, 2005
  9. Melanie Nasic

    Brannon Guest

    JPEGs are lossy because of the quantization step. You can do it without
    the quantization step and still notice a significant compression. If
    you preload your quantization constants and Huffman codes into lookup
    tables, you can easily process one pixel per clock cycle in a 1500 gate
    FPGA. I wrote a fully pipelined version that queued up the first eight
    rows of an incoming image into block ram before starting on the DCTs.
    It worked great. Ideally you would do DCTs on blocks larger than 8x8,
    but the advantage to 8x8 is that you can easily do 64 8bit operations
    in parallel which is nice for the Z-ordering, etc. Bigger squares
    require bigger chips and external memory, and as soon as you have to go
    to external memory you lose your pipelining.

    You don't want to do a dictionary method for an image. In fact, I'm not
    sure you want to do a dictionary method in FPGA. That sounds scary to
    me. Use frequency domain (DCT or Wavelet) Z-ordering for photos and use
    raw RLE for screen shots and similar images.
    Brannon, Dec 20, 2005
  10. Melanie Nasic

    Pete Fraser Guest

    Is the OP not getting pixels in raster-scan order though?
    That, and a half-line memory limit means there's not enough storage
    for a 2-d DCT.
    Pete Fraser, Dec 20, 2005
  11. Melanie Nasic

    Nils Guest

    If you can store a buffer of at least one extra scanline, you could try the
    Paeth predictor + RLE. This will give reasonable prediction of the next
    pixel's grayscale value, and if the prediction is OK, the result will often
    contain a string of zeroes, and the RLE will do a good job.
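    For reference, the Paeth predictor Nils mentions is the one defined in
    the PNG specification; a minimal C version:

```c
#include <stdint.h>
#include <stdlib.h>

/* Paeth predictor (PNG filter type 4): given the left (a), above (b)
   and upper-left (c) neighbours, compute the estimate p = a + b - c
   and return whichever neighbour is closest to it.  Requires one extra
   scanline of buffer to supply b and c. */
uint8_t paeth(uint8_t a, uint8_t b, uint8_t c)
{
    int p  = (int)a + (int)b - (int)c;   /* initial gradient estimate */
    int pa = abs(p - (int)a);
    int pb = abs(p - (int)b);
    int pc = abs(p - (int)c);
    if (pa <= pb && pa <= pc)            /* ties break toward a, then b */
        return a;
    if (pb <= pc)
        return b;
    return c;
}
```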

    If you can "afford it" (in other words, the FPGA is fast enough), you could
    use arithmetic coding on the resulting predicted values, with a simple order
    0 model, instead of RLE.

    Paeth + RLE will do OK on computer generated images, but not on natural
    images. Paeth + AC will do OK on both.

    Both will fit in 1kb of code for sure.

    Nils, Dec 22, 2005
  12. Melanie Nasic

    John B Guest

    Have a look at Graphics File Formats by Kay & Levine (ISBN
    0-07-034025-0). It will give you some ideas.
    John B, Dec 22, 2005
  13. Melanie Nasic

    news Guest

    : I want the compression to be lossless and not based on perceptional
    : irrelevancy reductions.

    If it has to be lossless there's no way you can guarantee to
    get 2:1 compression (or indeed any compression at all!). You
    may do, with certain kinds of input, but it's all down to the
    statistics of the data. The smaller your storage the less
    you can benefit from statistical variation across the image,
    and 1 Kbyte is very small!

    Given that a lossless system is inevitably 'variable bit rate'
    (VBR) the concept of "real time capability" is somewhat vague;
    the latency is bound to be variable. In real-world applications
    the output bit-rate is often constrained so a guaranteed minimum
    degree of compression must be achieved; such systems cannot be
    (always) lossless.

    From my experience I would say you will need at least a 4-line
    buffer to get near to 2:1 compression on a wide range of input
    material. For a constant-bit-rate (CBR) system based on a 4x4
    integer transform see:

    This is designed for ease of hardware implementation rather than
    ultimate performance, and is necessarily lossy.

    To reply by email change 'news' to my forename.
    news, Dec 22, 2005
  14. Melanie Nasic

    Jerry Coffin Guest


    [ ... ]
    The output bit rate will vary, but can be bounded -- for an obvious
    example, consider using LZW compression with 8-bit inputs and 12-bit
    codes as the output. In the worst case, each 8-bit input produces a
    12-bit output, and you have to clear and rebuild the dictionary every
    4096 characters.

    Real-time constraints are a more or less separate issue though -- here,
    the output data stream isn't (usually) nearly as difficult to deal with
    as things like dictionary searching in the compression process. Like
    the compression rate, this will vary, but (again) it's usually pretty
    easy to set an upper bound. Using the same example, in LZW you build up
    a string out of characters, with one dictionary entry for each "next"
    character following a current character. There can be no more than 256
    next characters (worst case), so the worst case requirement is to
    search all 256 entries for follow-on characters in N microseconds (or
    nanoseconds, or whatever). You need more details to
    guarantee this though -- a trie-based dictionary has different
    characteristics than a hash-based dictionary (for an obvious example).
    In nearly every case it's still pretty easy to place an upper bound on
    the complexity and time involved though.

    OTOH, you can run into a little bit of a problem with some of the
    basics -- if you happen (for example) to be storing your dictionary in
    SRAM, the time per access is pretty easy to estimate. If you're storing
    it in something like SDRAM, the worst case can be a bit harder to
    figure out.
    Jerry Coffin, Dec 22, 2005
  15. Melanie Nasic

    Jerry Coffin Guest

    JPEG supports lossless encoding that can fit (at least roughly) within
    the constraints you've imposed. It uses linear prediction of the
    current pixel based on one or more previous pixels. The difference
    between the prediction and the actual value is what's then encoded. The
    difference is encoded in two parts: the number of bits needed for the
    difference and the difference itself. The number of bits is Huffman
    encoded, but the remainder is not.

    This has a number of advantages. First and foremost, it can be done
    based on only the current scan line or (depending on the predictor you
    choose) only one scan line plus one pixel. In the latter case, you need
    to (minutely) modify the model you've outlined though -- instead of
    reading, compressing, and discarding an entire scan line, then starting
    the next, you always retain one scan line worth of data. As you process
    pixel X of scan line Y, you're storing pixels 0 through X+1 of the
    current scan line plus pixels X-1 through N (=line width) of the
    previous scan line.
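    The two-part difference coding described above can be sketched as
    follows. The struct and function names are made up for illustration,
    but the split into a category (bit count, which the Huffman coder
    handles) plus raw magnitude bits, with the usual JPEG convention for
    negative values, follows the scheme Jerry describes:

```c
#include <stdint.h>
#include <stdlib.h>

/* A prediction difference d is coded as (category, bits): the category
   is the number of bits in |d|; the bits are d itself for d >= 0, or
   d - 1 taken in 'category' low bits for d < 0, making the sign
   implicit (e.g. -1 is coded as a single 0 bit in category 1). */
typedef struct {
    int category;   /* bit count; 0..8 for 8-bit pixel differences */
    int bits;       /* raw magnitude bits appended after the Huffman code */
} diff_code;

diff_code encode_diff(int d)
{
    diff_code c = {0, 0};
    int m = abs(d);
    while (m) {                 /* category = position of highest set bit */
        c.category++;
        m >>= 1;
    }
    c.bits = (d >= 0) ? d : d + (1 << c.category) - 1;
    return c;
}
```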

    Another nice point is that the math involved is always simple -- the
    most complex case is one addition, one subtraction and a one-bit right
    shift.

    Yes, almost certainly. Lossless JPEG is open to considerable
    improvement. Just for an obvious example, it's pretty easy to predict
    the current pixel based on five neighboring pixels instead of three. At
    least in theory, this should improve prediction accuracy by close to
    40% -- thus reducing the number of bits needed to encode the difference
    between the predicted and actual values. At a guess, you won't really
    see 40% improvement, but you'll still see a little improvement.

    JPEG-LS (a separate, newer standard than lossless JPEG) is certainly an
    improvement, but if memory serves, it requires storing roughly two full
    scan lines instead of roughly one scan line. OTOH, it would be pretty
    easy to steal some of the ideas in JPEG-LS without using the parts that
    require more storage -- some things like its handling of runs are
    mostly a matter of encoding that shouldn't really require much extra
    storage.
    The final question, however, is whether any of these is likely to give
    you 2:1 compression. That'll depend on your input data -- for typical
    photographs, I doubt that'll happen most of the time. For things like
    line art, faxes, etc., you can probably do quite a bit better than 2:1
    on a fairly regular basis. If you're willing to settle for nearly
    lossless compression, you can improve ratios a bit further.
    Jerry Coffin, Dec 22, 2005
  16. Melanie Nasic

    Jerry Coffin Guest

    Though it's only rarely used, there's a lossless version of JPEG
    encoding. It's almost completely different from normal JPEG encoding.
    This can be done within your constraints, but would be improved if you
    can relax them minutely. Instead of only ever using the current scan
    line, you can improve things if you're willing to place the limit at
    only ever storing one scan line. The difference is that when you're in
    the middle of a scan line (for example) you're storing the second half
    of the previous scan line, and the first half of the current scan line,
    rather than having half of the buffer sitting empty. If you're storing
    the data in normal RAM, this makes little real difference -- the data
    from the previous scan line will remain in memory until you overwrite
    it, so it's only really a question of whether you use it or ignore it.
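    The wrap-around use of a single line buffer can be shown with a short
    C sketch (the names and the fixed width are hypothetical):

```c
#include <stdint.h>

enum { WIDTH = 8 };  /* illustrative line width */

/* At column x, buf[x..WIDTH-1] still holds the previous scan line and
   buf[0..x-1] holds the current one.  Read the "above" neighbour out of
   slot x just before overwriting it with the new pixel, so one line of
   storage serves both lines with no half-empty buffer. */
uint8_t push_pixel(uint8_t buf[WIDTH], int x, uint8_t pixel)
{
    uint8_t above = buf[x];   /* previous scan line's pixel at column x */
    buf[x] = pixel;           /* slot now belongs to the current line */
    return above;
}
```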
    Yes. JPEG-LS (a separate, newer standard) is another
    lossless encoder. A full-blown JPEG-LS encoder needs to store roughly
    two full scan lines if memory serves, which is outside your
    constraints. Nonetheless, if you're not worried about following the
    standard, you could create more or less a hybrid between lossless JPEG
    and JPEG-LS that would incorporate some advantages of the latter
    without the increased storage requirements.

    I suspect you could improve the prediction a bit as well. In essence,
    you're creating a (rather crude) low-pass filter by averaging a number
    of pixels together. That's equivalent to an FIR with all the
    coefficients set to one. I haven't put it to the test, but I'd guess
    that by turning it into a full-blown FIR with carefully selected
    coefficients (and possibly using more of the data you have in the
    buffer anyway) you could probably improve the predictions. Better
    predictions mean smaller errors, and tighter compression.
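    As one possible reading of the FIR idea, here is a weighted predictor
    sketch. The coefficients below are arbitrary placeholders, not tuned
    values; they merely sum to a power of two so the division stays a
    single shift, in keeping with the cheap-hardware theme:

```c
#include <stdint.h>

/* Weighted FIR prediction over four already-buffered neighbours.
   Weights 3,3,1,1 sum to 8, so dividing is a 3-bit right shift with
   rounding.  These weights are illustrative only. */
uint8_t fir_predict(uint8_t left, uint8_t above,
                    uint8_t upleft, uint8_t upright)
{
    int p = 3 * left + 3 * above + upleft + upright;
    return (uint8_t)((p + 4) >> 3);   /* round, then divide by 8 */
}
```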
    Jerry Coffin, Dec 22, 2005
  17. Hi Jerry,

    thanks for your response(s). Sounds quite promising. Do you know something
    about hardware implementation of the compression schemes you propose? Are
    there already VHDL examples available or at least C reference models?

    Regards, Melanie
    Melanie Nasic, Dec 23, 2005
  18. Melanie Nasic

    Jerry Coffin Guest

    Sorry 'bout that -- Google claimed the first one hadn't posted, so I
    tried again...
    I don't believe I've seen any VHDL code for it. One place that has C
    code is:

    I suspect Google would turn up a few more implementations as well. If
    you like printed information, you might consider _Image and Video
    Compression Standards: Algorithms and Architectures; Second Edition_ by
    Bhaskaran and Konstantinides. Published by Kluwer Academic Publishers,
    ISBN 0-7923-9952-8. It doesn't go into tremendous detail, but it's one
    of the few (that I know of) that discusses lossless JPEG at all.
    Jerry Coffin, Dec 23, 2005
  19. : I am thinking about implementing a real-time compression scheme on an FPGA
    I'm currently working on something similar ...

    simple predictive schemes (like the 4th predictor from jpeg or MED (from
    jpeg-ls)) look promising ... they require storage of one line
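    The MED (median edge detector) predictor from JPEG-LS that Michael
    mentions fits in a few lines of C:

```c
#include <stdint.h>

/* MED predictor: with a = left, b = above, c = upper-left, pick
   min(a,b) or max(a,b) when c suggests an edge, else the planar
   estimate a + b - c.  Needs one line of storage for b and c. */
uint8_t med_predict(uint8_t a, uint8_t b, uint8_t c)
{
    uint8_t mn = a < b ? a : b;
    uint8_t mx = a < b ? b : a;
    if (c >= mx)
        return mn;
    if (c <= mn)
        return mx;
    return (uint8_t)(a + b - c);
}
```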

    entropy coding:
    Huffman and multilevel arithmetic coding require a lot of resources
    (that easily ends up above 1 kbit even for a table of probabilities);
    binary arithmetic coding would be able to code less than 1 bit/cycle
    (which ends up at >5 cycles/pixel, which is too slow in my case)

    I'm currently investigating different schemes of golomb-rice codes
    (static, adaptive or even context-adaptive like in jpeg-ls) ... so far
    they look promising ...
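    A static Golomb-Rice code of the kind Michael is investigating can be
    sketched as follows. This is a software model only; packing the bits
    into a real output stream is omitted, and the names are illustrative:

```c
#include <stdint.h>

/* Rice code with parameter k: a nonnegative value v is sent as
   q = v >> k ones, a terminating zero, then the k low bits of v.
   Total length is q + 1 + k bits. */
unsigned rice_code_length(uint32_t v, unsigned k)
{
    return (v >> k) + 1 + k;
}

/* Pack the code word, MSB-first, into the low bits of a uint32_t.
   Only valid while the total length fits in 32 bits. */
uint32_t rice_encode(uint32_t v, unsigned k, unsigned *len)
{
    uint32_t q = v >> k;
    *len = q + 1 + k;
    uint32_t prefix = ((1u << q) - 1) << (k + 1);  /* q ones, then a zero */
    return prefix | (v & ((1u << k) - 1));         /* append k remainder bits */
}
```

    Residuals first need a mapping to nonnegative values (e.g. the usual
    interleaving 0, -1, 1, -2, 2, ...), and picking k near log2 of the
    mean residual magnitude keeps code lengths short.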

    jpeg-ls: the concept looks nice and quite powerful for a pipelined
    encoder (like for fpgas) - unfortunately the decoder would require a
    large feedback-loop (and pipelining is almost impossible) ...
    jpeg-ls is only symmetric for software implementations :-(

    I'm still curious how you plan to achieve 500MHz even on a Virtex4
    (I would say something like 200MHz could be possible)

    Michael Schöberl, Dec 29, 2005
  20. Hi Mel,

    you can calculate the delta of two subsequent pixels and then Huffman
    code the result. This should achieve almost 2:1 if there are not many
    large brightness steps in the picture.

    Thomas Rudloff, Jan 1, 2006
