efficient running median

Janto Dreijer · Oct 13, 2009

I'm looking for code that will calculate the running median of a
sequence, efficiently. (I'm trying to subtract the running median from
a signal to correct for gradual drift).

My naive attempt (taking the median of a sliding window) is
unfortunately too slow as my sliding windows are quite large (~1k) and
so are my sequences (~50k). On my PC it takes about 18 seconds per
sequence. 17 of those seconds is spent in sorting the sliding windows.

I've googled around and it looks like there are some recent journal
articles on it, but no code. Any suggestions?

Thanks
Janto

Peter Otten · Oct 13, 2009

Janto said:
I'm looking for code that will calculate the running median of a
sequence, efficiently. (I'm trying to subtract the running median from
a signal to correct for gradual drift).

My naive attempt (taking the median of a sliding window) is
unfortunately too slow as my sliding windows are quite large (~1k) and
so are my sequences (~50k). On my PC it takes about 18 seconds per
sequence. 17 of those seconds is spent in sorting the sliding windows.

I've googled around and it looks like there are some recent journal
articles on it, but no code. Any suggestions?

If you aren't using numpy, try that. Showing your code might also be a good
idea...

Peter

Paul Rubin · Oct 13, 2009

Janto Dreijer said:
I'm looking for code that will calculate the running median of a
sequence, efficiently. (I'm trying to subtract the running median from
a signal to correct for gradual drift).

Is that a known and valid technique of correcting for drift? Maybe
what you really want is a high-pass filter, to remove very low
frequency components (drift) from the signal. You can do that with
standard digital filters. Using something depending on ordering
within the samples (i.e. the median) seems weird to me, but I'm no
expert.

The obvious way to compute a running median involves a tree structure
so you can quickly insert and delete elements, and find the median.
That would be asymtotically O(n log n) but messy to implement.

Dave Angel · Oct 13, 2009

Janto said:
I'm looking for code that will calculate the running median of a
sequence, efficiently. (I'm trying to subtract the running median from
a signal to correct for gradual drift).

My naive attempt (taking the median of a sliding window) is
unfortunately too slow as my sliding windows are quite large (~1k) and
so are my sequences (~50k). On my PC it takes about 18 seconds per
sequence. 17 of those seconds is spent in sorting the sliding windows.

I've googled around and it looks like there are some recent journal
articles on it, but no code. Any suggestions?

Thanks
Janto

Since the previous window is sorted, it should just be a "case of delete
one, add one" in the sorted list. And if that isn't fast enough,
consider some form of tree.

DaveA

Ethan Furman · Oct 13, 2009

Janto said:
I'm looking for code that will calculate the running median of a
sequence, efficiently. (I'm trying to subtract the running median from
a signal to correct for gradual drift).

My naive attempt (taking the median of a sliding window) is
unfortunately too slow as my sliding windows are quite large (~1k) and
so are my sequences (~50k). On my PC it takes about 18 seconds per
sequence. 17 of those seconds is spent in sorting the sliding windows.

I've googled around and it looks like there are some recent journal
articles on it, but no code. Any suggestions?

Thanks
Janto

You might look at http://pypi.python.org/pypi/blist/0.9.4

~Ethan~

Dale Dalrymple · Oct 13, 2009

I'm looking for code that will calculate the running median of a
sequence, efficiently. (I'm trying to subtract the running median from
a signal to correct for gradual drift).
...

Any suggestions?

For a reference try:

Comparison of algorithms for standard median filtering
Juhola, M. Katajainen, J. Raita, T.
Signal Processing, IEEE Transactions on
Jan. 1991, Volume: 39 , Issue: 1, page(s): 204 - 208
Abstract

An algorithm I have used comes from:

On computation of the running median
Astola, J.T. Campbell, T.G.
Acoustics, Speech and Signal Processing, IEEE Transactions on
Apr 1989, Volume: 37, Issue: 4, page(s): 572-574

This is a dual heap approach. No code though. There are some obvious
(yeah, right) errors in their pseudo-code.

The point of the dual heap algorithm (loosely put) is to reduce the
computation to slide the window 1 element to be proportional to 2
bubble sorts of log window size instead of a window size sort.

Good luck.

Dale B. Dalrymple

sturlamolden · Oct 13, 2009

The obvious way to compute a running median involves a tree structure
so you can quickly insert and delete elements, and find the median.
That would be asymtotically O(n log n) but messy to implement.

QuickSelect will find the median in O(log n) time.

Paul Rubin · Oct 13, 2009

sturlamolden said:
QuickSelect will find the median in O(log n) time.

That makes no sense, you can't even examine the input in O(log n) time.

Anyway, the problem isn't to find the median of n elements. It's to
find n different medians of n different sets. You might be able to
get close to linear time though, depending on how the data acts.

Janto Dreijer · Oct 13, 2009

That makes no sense, you can't even examine the input in O(log n) time.

True. I think he means O(n).
I've tried wrapping the lesser-known nth_element function from the C++
STL (http://www.cppreference.com/wiki/stl/algorithm/nth_element).
Unfortunately it seems the conversion to std::vector<double> is quite
slow...I'll have a look again. Will probably have to rewrite my whole
median_filter function in C++ to avoid unnecessary conversions.

Anyway, the problem isn't to find the median of n elements. It's to
find n different medians of n different sets. You might be able to
get close to linear time though, depending on how the data acts.

Yes, that's what I've been hoping for. If I want it reeeeaaally fast
I'll have to figure out how to do that.

Thanks
Janto

Janto Dreijer · Oct 13, 2009

For a reference try:

Comparison of algorithms for standard median filtering
Juhola, M. Katajainen, J. Raita, T.
Signal Processing, IEEE Transactions on
Jan. 1991, Volume: 39 , Issue: 1, page(s): 204 - 208
Abstract

An algorithm I have used comes from:

On computation of the running median
Astola, J.T. Campbell, T.G.
Acoustics, Speech and Signal Processing, IEEE Transactions on
Apr 1989, Volume: 37, Issue: 4, page(s): 572-574

This is a dual heap approach. No code though. There are some obvious
(yeah, right) errors in their pseudo-code.

The point of the dual heap algorithm (loosely put) is to reduce the
computation to slide the window 1 element to be proportional to 2
bubble sorts of log window size instead of a window size sort.

Good luck.

Dale B. Dalrymple

Thanks. I'll have a look.

I found a PDF by Soumya D. Mohanty entitled "Efficient Algorithm for
computing a Running Median" (2003) by Googling. It has code snippets
at the end, but it's not going to be a simple cut-and-paste job. It
will take some work figuring out the missing parts.

Regards
Janto

Janto Dreijer · Oct 13, 2009

Is that a known and valid technique of correcting for drift? Maybe
what you really want is a high-pass filter, to remove very low
frequency components (drift) from the signal. You can do that with
standard digital filters. Using something depending on ordering
within the samples (i.e. the median) seems weird to me, but I'm no
expert.

Well, I don't have a lot of theoretical reasoning behind wanting to
use a median filter. I have some experience applying it to images with
very good results, so that was the first thing I tried. It felt right
at the time and the results looked good. Also, most of the elements in
the sequence are supposed to be near zero, so the median is supposed
to give the amount of drift.

That said, you're probably right. I should have first tried to apply a
HPF.

The obvious way to compute a running median involves a tree structure
so you can quickly insert and delete elements, and find the median.
That would be asymtotically O(n log n) but messy to implement.

Messy. Yes. I tried and failed

Thanks
Janto

Steven D'Aprano · Oct 13, 2009

That makes no sense, you can't even examine the input in O(log n) time.

If that were a relevant argument, no algorithms would ever be described
as running in O(log n).

Obviously to run in O(log n) you must have already built the tree. If you
include the time required to build the tree, then O(n) is the lower-bound
(because you have to see each element to put it in the tree) but that
applies to any data structure. The point is, once you have that tree, you
can perform repeated calculations in O(log n) time (or so it has been
claimed).

For example, if you want the m running medians of m data points, you
might do the following:

find the median of element 1
find the median of element 1 and 2
find the median of element 1 through 3
find the median of element 1 through 4
....
find the median of element 1 through m

Each median calculation may take O(log n), where n varies from 1 to m,
giving a total order of:

log 1 + log 2 + log 3 + ... + log m
=> O(log m!)

steps, which is much better than:

1 + 2 + 3 + ... + m = 1/2*m*(m+1)
=> O(m**2)

Anyway, the problem isn't to find the median of n elements. It's to
find n different medians of n different sets. You might be able to get
close to linear time though, depending on how the data acts.

But since the data sets aren't independent, a tree structure seems like
the way to go to me. Inserting and deleting elements should be fast, and
assuming the claim of calculating the median in O(log n) is correct,
that's quite fast too.

Janto Dreijer · Oct 13, 2009

You might look athttp://pypi.python.org/pypi/blist/0.9.4

~Ethan~

Very nice! I assume you mean I can use it to quickly insert items into
the sliding window?

Thanks
Janto

Paul Rubin · Oct 13, 2009

Janto Dreijer said:
Well, I don't have a lot of theoretical reasoning behind wanting to
use a median filter. I have some experience applying it to images with
very good results, so that was the first thing I tried. It felt right
at the time and the results looked good.

If this is image data, which typically would have fairly low
resolution per pixel (say 8-12 bits), then maybe you could just use
bins to count how many times each value occurs in a window. That
would let you find the median of a window fairly quickly, and then
update it with each new sample by remembering the number of samples
above and below the current median, etc.

Janto Dreijer · Oct 13, 2009

If this is image data, which typically would have fairly low
resolution per pixel (say 8-12 bits), then maybe you could just use
bins to count how many times each value occurs in a window. That
would let you find the median of a window fairly quickly, and then
update it with each new sample by remembering the number of samples
above and below the current median, etc.

Thanks, unfortunately it's not image data. It's electrocardiogram
sequences. So it's sequences of floats.

Janto

sturlamolden · Oct 14, 2009

Obviously to run in O(log n) you must have already built the tree.

You don't need a tree. Quickselect is a partial quicksort.

But my memory served me badly, quickselect is O(n).

Raymond Hettinger · Oct 14, 2009

QuickSelect will find the median in O(log n) time.

Right. See http://code.activestate.com/recipes/269554/ for a sample
implementation in Python.

Raymond

Janto Dreijer · Oct 14, 2009

If you aren't using numpy, try that. Showing your code might also be a good
idea...

Peter

I placed the test code and its output here:
http://bitbucket.org/janto/snippets/src/tip/running_median.py

I also have a version that uses numpy. On random data it seems to be
about twice as fast as the pure python one. It spends half the time
sorting and the other half converting the windows from python lists to
numpy arrays.
If the data is already sorted, the version using python's builtin sort
outperforms the numpy convert-and-sort by about 5 times. Strangely
satisfying

I'm hoping to find a trick to do better than scipy.signal.medfilt.
Looking at its code (PyArray_OrderFilterND in
http://svn.scipy.org/svn/scipy/trunk/scipy/signal/sigtoolsmodule.c)
there may be hope. It looks like they're sorting the entire window
instead of doing a QuickSelect (which I could do with STL's
nth_element).

Janto

Hendrik van Rooyen · Oct 14, 2009

I'm looking for code that will calculate the running median of a
sequence, efficiently. (I'm trying to subtract the running median from
a signal to correct for gradual drift).

My naive attempt (taking the median of a sliding window) is
unfortunately too slow as my sliding windows are quite large (~1k) and
so are my sequences (~50k). On my PC it takes about 18 seconds per
sequence. 17 of those seconds is spent in sorting the sliding windows.

I've googled around and it looks like there are some recent journal
articles on it, but no code. Any suggestions?

Can you not use the mean instead? It is easier to calculate, as you do not
have to work through all the values.

- Hendrik

Peter Otten · Oct 14, 2009

Janto said:
I placed the test code and its output here:
http://bitbucket.org/janto/snippets/src/tip/running_median.py

That gives me something to tinker

I also have a version that uses numpy. On random data it seems to be
about twice as fast as the pure python one. It spends half the time
sorting and the other half converting the windows from python lists to
numpy arrays.
If the data is already sorted, the version using python's builtin sort
outperforms the numpy convert-and-sort by about 5 times. Strangely
satisfying

I was thinking of using as many of numpy's bulk operations as possible:

def running_median_numpy(seq):
data = array(seq, dtype=float)
result = []
for i in xrange(1, window_size):
window = data[:i]
result.append(median(window))
for i in xrange(len(data)-window_size+1):
window = data[i:i+window_size]
result.append(median(window))
return result

But it didn't help as much as I had hoped.
The fastest I came up with tries hard to keep the data sorted:

def running_median_insort(seq):
seq = iter(seq)
d = deque()
s = []
result = []
for item in islice(seq, window_size):
d.append(item)
insort(s, item)
result.append(s[len(d)//2])
m = window_size // 2
for item in seq:
old = d.popleft()
d.append(item)
del s[bisect_left(s, old)]
insort(s, item)
result.append(s[m])
return result

Some numbers:

10.197 seconds for running_median_scipy_medfilt
25.043 seconds for running_median_python
13.040 seconds for running_median_python_msort
14.280 seconds for running_median_python_scipy_median
4.024 seconds for running_median_numpy
0.221 seconds for running_median_insort

What would be an acceptable performance, by the way?

Peter

PS: code not tested for correctness

The devolution of English language and slothful c.l.p behaviors exposed!	50	Jan 24, 2012
A humble start on a Boid System Vocabulary -	0	Mar 21, 2010
JNI EXCEPTION_ACCESS_VIOLATION	8	Aug 10, 2006
login failed for user	29	Feb 17, 2008
How bad is $'? (Was: "Get substring of line")	4	Jan 18, 2005
People Helping People!!!!!!!!!!!!	1	Jun 10, 2008
Faster Prime class then Ruby 1.9	0	Feb 3, 2006
comp.lang.c Answers to Frequently Asked Questions (FAQ List)	15	Apr 1, 2006

efficient running median

Janto Dreijer

Peter Otten

Paul Rubin

Dave Angel

Ethan Furman

Dale Dalrymple

sturlamolden

Paul Rubin

Janto Dreijer

Janto Dreijer

Janto Dreijer

Steven D'Aprano

Janto Dreijer

Paul Rubin

Janto Dreijer

sturlamolden

Raymond Hettinger

Janto Dreijer

Hendrik van Rooyen

Peter Otten

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads