Pipelining question

D

Divyang M

Hi,
I am looking for some insight into how I can go about pipelining my
system.

The system is an image interpolator which contains a buffer (on-chip
dual-port RAM) and an interpolator block.
The buffer stores incoming data (real-time say at X MHz). The
interpolator requires 4 data elements from this buffer to produce 1
output.

To keep the systems real-time, I am running my system at X MHz and also
write to the buffer at this rate. But read the data out of the buffer
at 4X MHz so that at each clock cycle I have all the 4 data elements
that the intepolator unit needs. This has limited X to 50 MHz because
the internal block RAM max out at 200 MHz (well, according to Altera it
can go upto 287 MHz but I hit the wall at some point or another).

I was wondering if there is a way to pipeline the design so that I can
run the whole system at a single clock frequency but still not have a
huge backlog of data accumulation (since there is finite amount of
on-chip storage) or if there are examples of such in any books?

Thanks,
Divyang M
 
K

Kai Harrekilde-Petersen

Divyang M said:
Hi,
I am looking for some insight into how I can go about pipelining my
system.

The system is an image interpolator which contains a buffer (on-chip
dual-port RAM) and an interpolator block.
The buffer stores incoming data (real-time say at X MHz). The
interpolator requires 4 data elements from this buffer to produce 1
output.

To keep the systems real-time, I am running my system at X MHz and also
write to the buffer at this rate. But read the data out of the buffer
at 4X MHz so that at each clock cycle I have all the 4 data elements
that the intepolator unit needs. This has limited X to 50 MHz because
the internal block RAM max out at 200 MHz (well, according to Altera it
can go upto 287 MHz but I hit the wall at some point or another).

I was wondering if there is a way to pipeline the design so that I can
run the whole system at a single clock frequency but still not have a
huge backlog of data accumulation (since there is finite amount of
on-chip storage) or if there are examples of such in any books?

If you have the storage available, write to four RAMs in parallel and
read 1 element from each RAM per clock cycle.
Other combinations are also possible.

Regards,


Kai
 
D

Divyang M

Hello Kai,

I do not have that much storage space to store the data 4 times since
my data is a 240x320 gray level (8-bit/pixel) image.

The other option I was thinking of is to delay the first of the four
output by 4 cycles (registers), the second by 3 cycles, the third
output by 2 cycles, and the fourth output by 1 cycle. These four
points will then be aligned, but if I use this strategy, then I get a
valid output out of my system once every 4 cycles, so a throughput of
0.25 (if I'm using the definition of throughput correctly). But I would
ideally like the throughput to be 1.

Any others suggestions you have would be welcome.

Thanks,
Divyang M
 
I

info_

Are the four element consecutive ?
Then it's a simple pipelined FIR, and it runs at full speed,
producing data at X Ms/s with a X MHz clock.

This become a little bit more interesting when you come to 2D
correlation :) with 8 "adjacent" pixels spread over 3 lines.

Bert Cuzeau
 
D

Divyang M

Hi Bert,

The four elements are not consecutive. I am actually working with an
image. So the four elements are essentially 2 elements each in 2 rows,
something like A11, A12, A21, A22.

But these can be any 4 "adjacent" elements in the image, so they do not
go by a particular order either.

Thanks,
Divyang
 
B

Ben Twijnstra

Hi Divyang,

Is there some scrambled addressing way that you can write the data so that
the words you need to read are always consecutive? That way you could set
your write port width to 8, and your read port width to 32.

If this is possible, your write bandwidth would be limited to the Tpd of the
logic that sets up the write address, but at least you will be able to
clock both ends at write speed.

Best regards,


Ben
 
I

info_

I think you just need to pipeline almost one line (N-2 depth), and pipeline it on two
FF stages at the entrance and the exit. So at a single clock cycle, you would have
your four elements available for combining, every clock cyle.

The big pipeline fits nicely in a dual port memory used as circular buffer,
or simply a ready-made Fifo.

Assuming the pixels per line is n+1 :

FF -> FF -> [[[[[ Fifo(n-2) ]]]]] -> FF -> FF
D2n D2n-1 D2n-2 ... D21 D20 D1n D1n-1

so you have D1n, D1n-1, D2n, D2n-1 available at the same time.
Just be careful at the edges...

What can we not do with pipelining :) ?

It's late here, I hope I didn't goof.

Bert
 
D

Divyang M

Hi Bert,

That would work if I was doing a "forward mapping" (ie taking 4 know
input pixels to comupte an output pixel which can end up anywhere in
the image), but I am doing "inverse mapping" (ie I know the output
pixel I am computing, but the 4 input pixels can be located anywhere in
the input image).

Thanks for your time and help.

Divyang
 
D

Divyang M

Hi Ben,

That's what I am thinking now. There seems to be no straight-forward
solution to "inverse mapping" problems (due to the uncertainity of
which input pixels to access) even in any of the literatures. So I
might have to go this way.

Thanks,
Divyang M
 
B

Ben Twijnstra

Hi Divyang M,
The four elements are not consecutive. I am actually working with an
image. So the four elements are essentially 2 elements each in 2 rows,
something like A11, A12, A21, A22.

But these can be any 4 "adjacent" elements in the image, so they do not
go by a particular order either.

I once wrote a linear interpolation algorithm for a Bayer-matrix camera
interface for one of my customers. I can't publish the code due to legal
reasons, but the idea is as follows:

There is no frame buffer. You use an FSM to capture video into one of four
line buffers when writing. Thus, three buffers are available for reading,
and the fourth is being updated.

Thus, assuming that your destination pixel (xd,yd) is a function of one or
two (x+d,y-1), one or two (x+d,y) and one or two (x+d, y+1) pixels (where d
is some arbitrary X distance) you can read pixels at write speed from the
three line buffers.

In my case, the worst case computation was
p(x-1,y-1)*C1+p(x+1,y-1)*C2+p(x-1,y+1)*C3+p(x+1,y+1)*C4, which would
normally cause 4 fetches from a frame buffer, 2*2 fetches using the
abovementioned technique, or simply 2*1 fetch if you store the x-1 and x
fetches in 8-bit 'cache' registers.

In this case I could simply work from left to right, so I used 6*8 DFFs to
hold

(x-1, y-1), (x, y-1)
(x-1, y) , (x, y )
(x-1, y+1), (x, y+1)

and compute my formula by setting the address on the line buffers to x+1.
The data outputs would then yield

(x+1, y-1)
(x+1, y )
(x+1, y+1)

giving me all the necessary date to compute RGB values for every pixel in
the Bayer pattern, or basically any 3x3 Fourier kernel on grayscale data.

The whole idea yields a 3-scanline delay between input and output, but this
way you can run at high clock speeds and be creative about the
interpolation method.

Best regards,


Ben
 
D

Divyang M

Hi Ben,

Thanks for the insight. I will look carefully at your suggestion and
hopefully it will yield something for my problem as well :).

Thanks,
Divyang
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Similar Threads


Members online

No members online now.

Forum statistics

Threads
473,769
Messages
2,569,577
Members
45,052
Latest member
LucyCarper

Latest Threads

Top