video buffering scheme, nonsequential access (no spatial locality)

Discussion in 'VHDL' started by wallge, Jan 24, 2007.

  1. wallge

    wallge Guest

    I am doing some embedded video processing, where I store an incoming
    frame of video, then based on some calculations in another part of the
    system, I warp that buffered frame of video. When the frame goes
    into the buffer (an off-FPGA SDRAM chip), it is simply written one
    pixel at a time in row-major order.

    The problem with this is that I will not be accessing it in this way. I
    may want to do some arbitrary image rotation. This means
    the first pixel I want to access is not the first one I put in the
    buffer; it might actually be the last one in the buffer. If I am doing
    full-page reads, or even burst reads, I will get a bunch of pixels that
    I will not need to determine the output pixel value. If I just do
    single reads, this wastes a bunch of clock cycles setting up the SDRAM,
    telling it which row to activate and which column to read from. After
    the read is done, you then have to issue the precharge command to close
    the row. This is highly inefficient: it takes 5, maybe 10 clock cycles
    just to retrieve one pixel value.

    Does anyone know a good way to organize a frame buffer to be more
    friendly (and more optimal) to nonsequential access (like the kind we
    might need if we wanted to warp the input image via some
    linear/nonlinear transformation)?
     
    wallge, Jan 24, 2007
    #1

  2. I have somewhat the same problem and I'm using ram that provides fast
    random access, i.e. ZBT ram. You can get ZBT ram that runs at 200 MHz,
    so that you can effectively process 100 Mpixels/s. ZBT ram is very
    small compared to SDRAM, but if you only need to store a few frames,
    that shouldn't be a problem.

    Adding ZBT might not be an option on your system however... Maybe
    someone can suggest a clever algorithm for your particular problem.


    Patrick Dubois

     
    Patrick Dubois, Jan 25, 2007
    #2

  3. "wallge" <> writes:

    > I am doing some embedded video processing, where I store an incoming
    > frame of video, then based on some calculations in another part of the
    > system, I warp that buffered frame of video. Now when the frame goes
    > into the buffer
    > (an off-FPGA SDRAM chip), it is simply written in one pixel at a time
    > in row major ordering.
    >
    > The problem with this is that I will not be accessing it in this way. I
    > may want to do some arbitrary image rotation. This means
    > the first pixel I want to access is not the first one I put in the
    > buffer, It might actually be the last one in the buffer. If I am doing
    > full page reads, or even burst reads, I will get a bunch of pixels that
    > I will not need to determine the output pixel value. If i just do
    > single reads, this waists a bunch of clock cycles setting up the SDRAM,
    > telling it which row to activate and which column to read from. After
    > the read is done, you then have to issue the precharge command to close
    > the row. There is a high degree of inefficiency to this. It takes 5,
    > maybe 10 clock cycles just to retrieve one
    > pixel value.
    >


    If you are doing truly arbitrary warping, then is it not right that
    you can never get an optimal organisation for all warps?

    > Does anyone know a good way to organize a frame buffer to be more
    > friendly (and more optimal) to nonsequential access (like the kind we
    > might need if we wanted to warp the input image via some
    > linear/nonlinear transformation)?
    >


    Could you do some kind of caching scheme where you read an entire DRAM
    row in at a time, and "hope it comes in handy" later?

    Failing that, can you use SSRAM for your frame buffer?

    Or, can you parallelise your task so that it operates on (eg) 4 wildly
    different areas of input data at a time, which means you can use the
    banking mechanism of the DRAMs to hide the latency?

    Those are my initial thoughts (whilst waiting for a very loooooong
    simulation to run :)

    Cheers,
    Martin

    --

    TRW Conekt - Consultancy in Engineering, Knowledge and Technology
    http://www.conekt.net/electronics.html
     
    Martin Thompson, Jan 25, 2007
    #3
  4. wallge wrote:

    > Does anyone know a good way to organize a frame buffer to be more
    > friendly (and more optimal) to nonsequential access


    Sounds like a RAM.
    If it didn't fit in fpga block ram
    I would use an external device.

    -- Mike Treseler
     
    Mike Treseler, Jan 25, 2007
    #4
  5. wallge

    wallge Guest

    I should have been more specific in my question.

    I have to use a small (64 Mbit) mobile sdram. I can't choose
    to use a different storage element in the system (other than *some*
    FPGA buffering, though not full frame).

    I have heard some discussion of the way in which graphics accelerator
    boards do memory transactions, storing pixels in blocks of neighboring
    pixels (instead of being organized row major). In other words, the
    spatial locality in the SDRAM buffer might look like:

    Image pixels:
    N2 N3 N4
    N1 P  N5
    N8 N7 N6

    Memory organization:
    ADDR    DATA
    0x0000  P
    0x0001  N1
    0x0002  N2
    0x0003  N3
    0x0004  N4
    0x0005  N5
    0x0006  N6
    0x0007  N7
    0x0008  N8


    Where P is the central pixel of interest, and the N's are its
    neighbors. We organize the pixels in the SDRAM buffer not by rows, but
    by regions of interest. This way, if we are doing some kind of image
    warp and we want to get more bang for the buck in terms of read
    latency, we are more likely to reuse pixels in the neighborhood of the
    currently accessed pixel than if they were arranged in row- or
    column-major order (consider the case where we wanted to rotate an
    image by 47.2 degrees from input to output).
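
One common way to get this kind of locality is to tile the frame, so that each small rectangular block of pixels occupies consecutive addresses and can be fetched with one short burst. The following is a minimal behavioural sketch in Python, not HDL, and not necessarily the scheme graphics boards actually use. The 4 x 4 tile size, the requirement that the frame width be a multiple of the tile width, and the function names are all assumptions made for illustration.

```python
def row_major_address(x, y, width):
    """Conventional layout: one image row after another."""
    return y * width + x


def tiled_address(x, y, width, tile_w=4, tile_h=4):
    """Layout in which each tile_w x tile_h tile is contiguous in memory.

    Hypothetical helper: tile size and names are illustrative only.
    """
    if width % tile_w:
        raise ValueError("width must be a multiple of tile_w")
    tiles_per_row = width // tile_w
    tile_index = (y // tile_h) * tiles_per_row + (x // tile_w)
    offset_in_tile = (y % tile_h) * tile_w + (x % tile_w)
    return tile_index * (tile_w * tile_h) + offset_in_tile


def address_span(addr_fn, x, y, width):
    """Distance between the lowest and highest address of the 2 x 2
    neighbourhood whose top-left pixel is (x, y)."""
    addrs = [addr_fn(x + dx, y + dy, width) for dy in (0, 1) for dx in (0, 1)]
    return max(addrs) - min(addrs)


if __name__ == "__main__":
    width = 640
    print("row-major span:", address_span(row_major_address, 1, 1, width))
    print("tiled span:    ", address_span(tiled_address, 1, 1, width))
```

In the row-major layout the four pixels of a 2 x 2 neighbourhood are a full line apart, so they usually sit in different SDRAM rows or need separate column accesses. In the tiled layout they fall inside one 16-word tile unless the neighbourhood straddles a tile edge, in which case a second tile still has to be fetched.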

    Has anyone seen something like this or know of any resources online
    with regard to memory buffer organization schemes for graphics or image
    processing?



     
    wallge, Jan 25, 2007
    #5
  6. Pete Fraser

    Pete Fraser Guest

    "wallge" <> wrote in message
    news:...

    >
    > Image pixels:
    > N2 N3 N4
    > N1 P N5
    > N8 N7 N6


    Have you thought about what order of filtering you'll
    need to use?
     
    Pete Fraser, Jan 25, 2007
    #6
  7. wallge

    wallge Guest

    I am not doing any image filtering.
    This is not a filtering operation.
    It is an interpolation operation
    typically bilinear or bicubic
    to do image transformations.

     
    wallge, Jan 25, 2007
    #7
  8. Pete Fraser

    Pete Fraser Guest

    "wallge" <> wrote in message
    news:...
    >I am not doing any image filtering.


    Yes you are.

    > This is not a filtering operation.


    Yes it is.

    > It is an interpolation operation
    > typically bilinear or bicubic
    > to do image transformations.


    And that's a filtering operation.
    So the maximum kernel size is 4 x 4, though
    you might use 2 x 2. The kernel size could have a substantial
    bearing on the traffic to/from on-chip RAM.

    I'm still not sure of your limitations on off-chip RAM.
    You have a buffer on the input or output (or both?)
    Do you have enough bandwidth to have an
    intermediate buffer for a two-pass operation?
     
    Pete Fraser, Jan 25, 2007
    #8
  9. wallge

    wallge Guest

    Can you write out the FIR filter coefficients for
    a bilinear interpolation "filter kernel"?
    How about a bicubic interpolation filter kernel?
    What are its filter coefficients?

    Arguing semantics was not the purpose of my post.

    I will probably wind up doing bilinear interpolation or
    "filtering", which means I need 4 pixels of the input frame to
    determine 1 pixel of the warped output frame.

    By the way, what is the frequency response of the bilinear
    interpolation "filter"?



     
    wallge, Jan 25, 2007
    #9
  10. Pete Fraser

    Pete Fraser Guest

    "wallge" <> wrote in message
    news:...
    > Can you write out the FIR filter coeffs for
    > a bilinear interpolation "filter kernel"?
    > How about a bicubic interpolator filter kernel
    > what are its filter coeffs?


    I'm happy to, but we're getting away from FPGA stuff,
    so let's do that off line. Let me know how many phases you
    need, and the coefficient format you'd like. I usually
    use a minor 4x4 variation on cubic, but it's all set up in
    Mathematica, so I could do cubic also.

    >
    > arguing semantics was not the purpose of my post.
    >
    > I will probably wind up doing bilinear interpolation or
    > "filtering". Which means I need 4 pixels of the input frame to
    > determine
    > 1 pixel of output warped frame.


    So you don't really need coefficient tables for this.
    You can just use the fractional phase directly.
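
As a rough illustration of using the fractional phase directly, here is a minimal Python sketch of bilinear interpolation over the four surrounding input pixels. The function name, argument order, and the use of floating point are assumptions for readability; an FPGA implementation would use fixed-point fractions instead.

```python
def bilinear(p00, p10, p01, p11, fx, fy):
    """Bilinear interpolation of four neighbouring pixels.

    p00 = top-left, p10 = top-right, p01 = bottom-left, p11 = bottom-right.
    fx, fy are the fractional parts (0.0 to 1.0) of the source coordinate,
    used directly as blend weights, so no coefficient table is needed.
    """
    top = p00 + fx * (p10 - p00)        # blend along x on the upper row
    bottom = p01 + fx * (p11 - p01)     # blend along x on the lower row
    return top + fy * (bottom - top)    # blend the two rows along y


if __name__ == "__main__":
    # Output sample exactly in the middle of the four input pixels.
    print(bilinear(10, 20, 30, 40, 0.5, 0.5))
```

Written this way the operation needs three multiplies per output pixel, and the weights come straight from the low-order bits of the warped source address.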

    >
    > By the way what is the Freq response of the bilinear interpolation
    > "filter"?


    It depends on the position of the output relative to the input pixel,
    but for a central output pixel the frequency response would be
    cosinusoidal.
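
To spell that out in one dimension, reading "central" as an output sample midway between two input samples (an interpretation, not something stated in the thread), and writing f for the fractional phase and omega for the normalised frequency in radians per sample:

```latex
% Two-tap linear interpolator with fractional phase f (0 <= f <= 1)
H_f(\omega) = (1 - f) + f\,e^{-j\omega},
\qquad
|H_f(\omega)|^2 = 1 - 2f(1-f)\,\bigl(1 - \cos\omega\bigr)

% Central case, f = 1/2
H_{1/2}(\omega) = \tfrac{1}{2}\bigl(1 + e^{-j\omega}\bigr)
               = e^{-j\omega/2}\cos\!\left(\tfrac{\omega}{2}\right),
\qquad
|H_{1/2}(\omega)| = \left|\cos\!\left(\tfrac{\omega}{2}\right)\right|
```

Bilinear interpolation is separable, so the two-dimensional magnitude response for the central case is the product of the horizontal and vertical cosine terms. At f = 0 or f = 1 the response is flat, which is why the softening of the image varies with the output position.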

    Getting back to FPGA stuff though, what are your off-chip
    RAM bandwidth limitations, and could you consider a two-pass approach?

    >> I'm still not sure of your limitations on off-chip RAM.
    >> You have a buffer on the input or output (or both?)
    >> Do you have enough bandwidth to have an
    >> intermediate buffer for a two-pass operation?

    >
     
    Pete Fraser, Jan 26, 2007
    #10
  11. wallge

    wallge Guest

    I am not sure what you mean by a two-pass approach.
    The max (theoretical) bandwidth I have available to/from the SDRAM
    is about
    16 bits * 100 MHz = 1.6 Gbit/sec

    This is not an achievable figure of course, even if I only did
    full-page reads and writes, since there is overhead associated with
    each. I also have to refresh periodically.

    My pixel bit width could be brought down to 8 bits. That way I could
    store 2 pixels per address if need be.
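
For a sense of scale, the figures quoted in this thread can be combined in a small Python sketch. It assumes the 100 MHz clock and 16-bit bus stated above, the estimate of 5 to 10 cycles per single read from the first post, and four input pixels per output pixel for bilinear interpolation with no reuse between neighbouring outputs. The function names are illustrative only.

```python
CLOCK_HZ = 100_000_000   # SDRAM clock quoted in the post
BUS_BITS = 16            # data bus width quoted in the post


def peak_bits_per_second(clock_hz=CLOCK_HZ, bus_bits=BUS_BITS):
    """Theoretical peak: one bus word every clock cycle."""
    return clock_hz * bus_bits


def pixels_per_second(cycles_per_access, pixels_per_access=1, clock_hz=CLOCK_HZ):
    """Pixel rate when every access costs cycles_per_access clock cycles."""
    return clock_hz * pixels_per_access / cycles_per_access


if __name__ == "__main__":
    print("peak:", peak_bits_per_second() / 1e9, "Gbit/s")
    for cycles in (5, 10):
        fetched = pixels_per_second(cycles)
        # Bilinear needs 4 input pixels per output pixel if nothing is reused.
        print(cycles, "cycles/read:", fetched / 1e6, "Mpixel/s fetched,",
              fetched / 4 / 1e6, "Mpixel/s output")
```

Under these assumptions, isolated single reads deliver only a tenth to a fifth of the peak word rate, and a bilinear warp with no on-chip reuse gets a quarter of that again. This is the gap that tiling, caching, and bank interleaving are meant to close.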



     
    wallge, Jan 26, 2007
    #11
  12. "wallge" <> writes:

    > I am not sure what you mean by two pass approach.
    > The max (theoretical) bandwidth I have available to/from the SDRAM
    > is about
    > 16 bits * 100 Mhz = 1.6 Gbit/sec
    >
    > This is not an achievable estimate of course, even if I only did full
    > page
    > reads and writes, since there is overhead associated with each. I also
    > have to refresh periodically.
    >


    Don't forget that for video apps, you often don't need to refresh, as
    you are reading and writing the SDRAM rows in a regular fashion which
    means you can guarantee that each gets touched often enough.

    Indeed for some video applications, like output framebuffers, all you
    need to do is ensure that you read the row out for display soon enough
    after the write, which is often easy to achieve.

    Cheers,
    Martin

    --

    TRW Conekt - Consultancy in Engineering, Knowledge and Technology
    http://www.conekt.net/electronics.html
     
    Martin Thompson, Jan 29, 2007
    #12
  13. wallge

    wallge Guest

    Gabor,

    Are you saying that I don't need to activate/precharge the bank
    when switching to another?
    I am kind of unclear on this. When do activate and precharge commands
    need to be issued? I thought when switching to a new row or bank you
    had to precharge (close) the previously active one, then activate the
    new row/bank before actually reading from or writing to it. Where am I
    going wrong here?

    Also to the notion that I don't need to refresh since I am doing video
    buffering: I am actually buffering multiple frames of video and then
    reading out several frames later. In other words, there may be a
    significant fraction of a second (say 1/8~1/4 sec) of delay between
    writing data into a particular page of memory and actually reading it
    back out. Is this too much time to expect my pixel data to still be
    valid without refreshing?



    On Jan 26, 6:03 pm, "Gabor" <> wrote:
    > On Jan 26, 3:15 pm, "wallge" <> wrote:
    >
    > > I am not sure what you mean by two pass approach.
    > > The max (theoretical) bandwidth I have available to/from the SDRAM
    > > is about
    > > 16 bits * 100 Mhz = 1.6 Gbit/sec

    >
    > > This is not an achievable estimate of course, even if I only did full
    > > page
    > > reads and writes, since there is overhead associated with each. I also
    > > have to refresh periodically.

    >
    > > My pixel bit width could be brought down to 8 bits. That way I could
    > > store 2
    > > pixels per address if need be.
    >
    > You may be missing an important feature of SDRAM. You don't need to
    > use full-page reads or writes to keep data streaming at 100% of the
    > available bandwidth (if you don't change direction) or very nearly 100%
    > (if you switch from read to write infrequently). This is due to the
    > ability to set up another block operation on one bank while another
    > bank is transferring data. When I use SDRAM for relatively random
    > operations like this I like to think of the minimum data unit as one
    > minimal burst (two words in a single-data-rate SDRAM) to each of the
    > four banks. Any number of these data units can be strung one after
    > another with no break in the data flow. Then if you wanted to
    > internally buffer a square section of the image in internal blockRAM,
    > the width of the minimum block (allowing 100% data rate) would only be
    > 16 8-bit pixels or 8 16-bit pixels in your case. If the area can
    > cover the required computational core (4 x 4?) for several pixels
    > at a time, you can reduce overall bandwidth. This was the point
    > of suggesting an internal cache memory.
    >
    > HTH,
    > Gabor
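
The block width in Gabor's reply follows from simple arithmetic, sketched here in Python. The defaults are the numbers from the thread (16-bit bus, burst of two words, four banks); the function name is made up for illustration.

```python
def min_streaming_unit_pixels(pixel_bits, bus_bits=16, burst_words=2, banks=4):
    """Pixels covered by one minimal burst to each bank in turn."""
    unit_bits = bus_bits * burst_words * banks   # 16 * 2 * 4 = 128 bits
    return unit_bits // pixel_bits


if __name__ == "__main__":
    print(min_streaming_unit_pixels(8), "pixels at 8 bits each")
    print(min_streaming_unit_pixels(16), "pixels at 16 bits each")
```

One such unit occupies eight data-bus cycles, so any access pattern built from whole units can in principle keep the bus busy without gaps.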
     
    wallge, Jan 29, 2007
    #13
  14. "wallge" <> writes:

    > Are you saying that I don't need to activate/precharge the bank when
    > switching to another?


    Not necessarily.

    > I am kind of unclear on this. When do activate and precharge
    > commands need to be issued? I thought when switching to a new row or
    > bank you had to precharge (close) the previously active one, then
    > activate the new row/bank before actually reading from or writing to
    > it. Where am I going wrong here?


    You have to precharge a bank only when you switch to another row
    within that bank.

    > Also to the notion that I don't need to refresh since I am doing
    > video buffering: I am actually buffering multiple frames of video
    > and then reading out several frames later. In other words, there may
    > be a significant fraction of a second (say 1/8~1/4 sec) of delay
    > between writing data into a particular page of memory and actually
    > reading it back out. Is this too much time to expect my pixel data
    > to still be valid without refreshing?


    That very much depends on the access patterns. The fact that you are
    going to implement a frame buffer alone doesn't automatically mean
    that you won't need a refresh. Double-, or triple-check your specs. If
    in doubt I'd definitely recommend putting in a refresh as low priority
    task.

    Regards,
    Marcus
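
A quick way to sanity-check the refresh question is to compare the longest time a row goes untouched against the device's refresh period. The Python sketch below assumes a 64 ms refresh period spread over 4096 rows per bank. Those are commonly quoted figures for SDRAM of this size, but they are assumptions here and must be replaced with the values from the actual datasheet. The function names are hypothetical.

```python
ASSUMED_REFRESH_PERIOD_MS = 64   # assumption: check the part's datasheet
ASSUMED_ROWS_PER_BANK = 4096     # assumption: check the part's datasheet


def needs_explicit_refresh(max_row_idle_ms, refresh_period_ms=ASSUMED_REFRESH_PERIOD_MS):
    """True if a row can sit idle for longer than the refresh period."""
    return max_row_idle_ms > refresh_period_ms


def cycles_between_refreshes(clock_hz=100_000_000,
                             refresh_period_ms=ASSUMED_REFRESH_PERIOD_MS,
                             rows=ASSUMED_ROWS_PER_BANK):
    """Clock cycles available between evenly spaced auto-refresh commands."""
    return clock_hz * refresh_period_ms // (1000 * rows)


if __name__ == "__main__":
    for idle_ms in (125, 250):   # the 1/8 s and 1/4 s delays from the question
        print(idle_ms, "ms idle -> refresh needed:", needs_explicit_refresh(idle_ms))
    print("cycles between refresh commands:", cycles_between_refreshes())
```

With these assumed figures, a delay of 1/8 to 1/4 second between writing and reading a row is well beyond the refresh period, so relying on the video access pattern alone would not be safe. One refresh command every 1500 or so cycles at 100 MHz costs very little bandwidth.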

    --
    note that "property" can also be used as syntaxtic sugar to reference
    a property, breaking the clean design of verilog; [...]

    -- Michael McNamara
    (http://www.veripool.com/verilog-mode_news.html)
     
    Marcus Harnisch, Jan 30, 2007
    #14
  15. wallge

    wallge Guest

    I just wanted to say thanks to everyone for responding
    with a lot of helpful answers and feedback in this post.
    Really great forum.

    On Jan 30, 10:32 am, "Gabor" <> wrote:
    > On Jan 29, 10:50 am, "wallge" <> wrote:
    >
    > > Gabor,

    >
    > > Are you saying that I don't need to activate/precharge the bank
    > > when switching to another?
    >
    > First of all, you don't "switch" banks. There are four banks that can
    > all potentially be active at a given time. Only the external
    > interface works on one bank at a time. That being said, realise that
    > the control interface (address, ras, cas, we) is somewhat independent
    > of the data interface (dq).
    >
    > > I am kind of unclear on this. When do activate and precharge commands
    > > need to be issued? I thought when switching to a new row or bank you
    > > had
    > > to precharge (close) the previously active one, then activate the new
    > > row/bank before
    > > actually reading from or writing to it. Where am I going wrong here?
    >
    > You need to precharge a bank before opening a new row in _THAT_
    > bank. Other banks may remain open while this happens. When
    > doing single burst accesses, I generally precharge using the
    > read or write command with auto-precharge (A10 high during CAS).
    >
    > > Also to the notion that I don't need to refresh since I am doing video
    > > buffering: I am actually buffering multiple frames of video and then
    > > reading
    > > out several frames later. In other words, there may be a significant
    > > fraction
    > > of a second (say 1/8~1/4 sec) of delay between writing data into a
    > > particular page of memory and actually reading it back out.
    >
    > What's a page? These RAMs have rows. Each row must be accessed
    > using row activate or else refreshed within the refresh period. If
    > you store data in successive rows / banks first, and then successive
    > columns (i.e. row/bank form LSB's of your address), you will usually
    > refresh the entire part without accessing a large portion of the
    > entire memory.
    >
    > Here's a typical sequence I use for writing streaming data into
    > an SDRAM:
    >
    > Cycle  Command  Bank  Addr  Data
    > (startup sequence has unused cycles, NOPs)
    >   1    ACT      0     row0  x
    >   2    NOP      x     x     x
    >   3    ACT      1     row0  x
    >   4    NOP      x     x     x
    >   5    ACT      2     row0  x
    > (full streaming starts here, burst size = 2)
    >   6    WRITEA   0     col0  data0
    >   7    ACT      3     row0  data0
    >   8    WRITEA   1     col0  data1
    >   9    ACT      0     row1  data1
    >  10    WRITEA   2     col0  data2
    >  11    ACT      1     row1  data2
    >  12    WRITEA   3     col0  data3
    >  13    ACT      2     row1  data3
    >  14    WRITEA   0     col0  data4
    >  15    ACT      3     row1  data4
    >  16    WRITEA   1     col0  data5
    > (the streaming sequence above can be repeated ad nauseam)
    > (end sequence has unused cycles, NOPs)
    >  17    NOP      x     x     data5
    >  18    WRITEA   2     col0  data6
    >  19    NOP      x     x     data6
    >  20    WRITEA   3     col0  data7
    >  21    NOP      x     x     data7
    >
    > WRITEA is write command with autoprecharge (A10 = 1)
    >
    > Reading is similar except there are pipeline delays on the data bus
    > due to CAS read access time.
    >
    > Regards,
    > Gabor
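
The pattern in Gabor's table can be reproduced with a small Python model, which makes it easier to see why the data bus never idles once streaming starts. This is only a sketch of the schedule as listed: the five-cycle gap between ACT and WRITEA is read off the table rather than taken from a datasheet, real timing parameters (tRCD, tRP, write recovery) still have to be checked for the actual part, and the function names are invented for illustration.

```python
def streaming_write_schedule(num_bursts, banks=4, burst_len=2,
                             act_to_write=5, first_act=1):
    """Return {cycle: (command, bank, address)} for interleaved writes.

    Burst k goes to bank k % banks, row k // banks, column 0.
    For ACT the address field is the row; for WRITEA it is the column.
    Cycles missing from the result are NOPs.
    """
    schedule = {}
    for k in range(num_bursts):
        bank, row = k % banks, k // banks
        act_cycle = first_act + burst_len * k
        write_cycle = act_cycle + act_to_write
        for cycle, cmd in ((act_cycle, ("ACT", bank, row)),
                           (write_cycle, ("WRITEA", bank, 0))):
            if cycle in schedule:
                raise ValueError(f"two commands collide in cycle {cycle}")
            schedule[cycle] = cmd
    return schedule


def data_bus_cycles(num_bursts, burst_len=2, act_to_write=5, first_act=1):
    """Cycles in which write data is on the bus, in order."""
    cycles = []
    for k in range(num_bursts):
        start = first_act + burst_len * k + act_to_write
        cycles.extend(range(start, start + burst_len))
    return cycles


if __name__ == "__main__":
    sched = streaming_write_schedule(8)
    for cycle in range(1, 22):
        print(cycle, sched.get(cycle, ("NOP", "x", "x")))
```

Because the ACT commands fall on odd cycles and the WRITEA commands on even cycles, the command bus is shared without conflicts, and each bank gets its next row opened while the other three are transferring data.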
     
    wallge, Jan 30, 2007
    #15