Finding subsets for a robust regression

tkpmep · Sep 29, 2008

I have coded a robust (Theil-Sen) regression routine which takes as
inputs two lists of numbers, x and y, and returns a robust estimate of
the slope and intercept of the best robust straight line fit.

In a pre-processing phase, I create two new lists, x1 and y1; x1 has
only the unique values in x, and for each unique value in x1, y1 has
the median of all such values in x. My code follows, and it seems a
bit clumsy - is there a cleaner way to do it? By the way, I'd be more
than happy to share the code for the entire algorithm - just let me
know and I will post it here.

Thanks in advance

Thomas Philips

d = {} #identify unique instances of x and y
for xx,yy in zip(x,y):
if xx in d:
d[xx].append(yy)
else:
d[xx] = [yy]

x1 = [] #unique instances of x and y
y1 = [] #median(y) for each unique value of x
for xx,yy in d.iteritems():
x1.append(xx)
l = len(yy)
if l == 1:
y1.append(yy[0])
else:
yy.sort()
y1.append( (yy[l//2-1] + yy[l//2])/2.0 if l % 2 == 0 else
yy[l//2] )

bearophileHUGS · Sep 29, 2008

(e-mail address removed):

My code follows, and it seems a bit clumsy - is there a cleaner way to do it?

The code doesn't look bad. I can suggest few things:
- When you have "paragraphs" of code that do something definite then
the comment before them can be written without indentation, to denote
it regards the whole little block. Or you can even call a subfunction
(a function inside another function/method)
- Finding good names for variables is important.

So this code:

d = {} #identify unique instances of x and y
for xx,yy in zip(x,y):
if xx in d:
d[xx].append(yy)
else:
d[xx] = [yy]

May become (untested):

# identify unique instances of x and y
d = {}
for xx, yy in zip(x, y):
if xx in d:
d[xx].append(yy)
else:
d[xx] = [yy]

Or better:

# identify unique instances of seqx and seqx
d = defaultdict(list)
for x, y in izip(seqx, seqx):
d[x].append(y)

Or even something like:

def unique_x(seqx, seqx):
"""identify unique instances of seqx and seqx"""
result = defaultdict(list)
for x, y in izip(seqx, seqx):
result[x].append(y)
return result
...
d = unique_x(x, y)

The following line is too much heavy, it looks like a misuse of the
inline if syntax:

y1.append( (yy[l//2-1] + yy[l//2])/2.0 if l % 2 == 0 else
yy[l//2] )

It was not added to python for that long lines. This looks better to
me:

l = len(yy)
if l == 1:
y_median = yy[0])
else:
yy.sort()
if l & 1:
y_median = yy[l // 2]
else:
y_median = yy[l // 2 - 1] + yy[l // 2]) / 2.0
y1.append(y_median)

But even better is to use a stronger library code, for example a
median() able to run in O(n), and a more strong unique(), like:
http://code.activestate.com/recipes/438599/
http://code.activestate.com/recipes/466330/
But using a stronger unique() this time isn't useful, so I like this
compromise:

# identify unique instances of seqx and seqx
uniques_x = defaultdict(list)
for x, y in izip(seqx, seqx):
uniques_x[x].append(y)

result_x = []
result_y = []
for x, ys in uniques_x.iteritems():
result_x.append(x)
result_y.append(median(ys))

If you want to use the last bit of semantics of CPython you can even
write that code like this:

# identify unique instances of seqx and seqx
uniques_x = defaultdict(list)
for x, y in izip(seqx, seqx):
uniques_x[x].append(y)

result_x = uniques_x.keys()
result_y = map(median, uniques_x.itervalues())

I think it works because keys and values are given in the same order,
but in real code I tend to avoid using such subtle things. Because if
you translate that code to another language, or you use another Python
implementation it may not work anymore, and lot of code sooner or
later becomes translated...

Bye,
bearophile

Gerard flanagan · Sep 29, 2008

x1 = [] #unique instances of x and y
y1 = [] #median(y) for each unique value of x
for xx,yy in d.iteritems():
x1.append(xx)
l = len(yy)
if l == 1:
y1.append(yy[0])
else:
yy.sort()
y1.append( (yy[l//2-1] + yy[l//2])/2.0 if l % 2 == 0 else
yy[l//2] )
--

Not tested, but is this equivalent?

x1 = []
y1 = []
for xx, yy in d.iteritems():
L = len(yy) // 2
yy.sort()
y1.append(yy[L])
if not L & 1:
y1[-1] = (y1[-1] + yy[L-1]) / 2.0

It means that you have a pointless 'sort' when len(yy) == 1, but then
you save an 'if' per-iteration.

G.

Gerard flanagan · Sep 29, 2008

Gerard said:
x1 = [] #unique instances of x and y
y1 = [] #median(y) for each unique value of x
for xx,yy in d.iteritems():
x1.append(xx)
l = len(yy)
if l == 1:
y1.append(yy[0])
else:
yy.sort()
y1.append( (yy[l//2-1] + yy[l//2])/2.0 if l % 2 == 0 else
yy[l//2] )
--

Click to expand...

Not tested, but is this equivalent?

x1 = []
y1 = []
for xx, yy in d.iteritems():
L = len(yy) // 2
yy.sort()
y1.append(yy[L])
if not L & 1:
y1[-1] = (y1[-1] + yy[L-1]) / 2.0

It means that you have a pointless 'sort' when len(yy) == 1, but then
you save an 'if' per-iteration.

G.

--

Maybe that should be:

if L and not L & 1:
...

anyway...

Gabriel Genellina · Sep 30, 2008

En Mon said:
result_x = uniques_x.keys()
result_y = map(median, uniques_x.itervalues())

I think it works because keys and values are given in the same order,
but in real code I tend to avoid using such subtle things. Because if
you translate that code to another language, or you use another Python
implementation it may not work anymore, and lot of code sooner or
later becomes translated...

At least with respect to different Python implementations, it *is* clearly
stated in the documentation that "If items(), keys(), values(),
iteritems(), iterkeys(), and itervalues() are called with no intervening
modifications to the dictionary, the lists will directly correspond."

http://docs.python.org/lib/typesmapping.html

tkpmep · Sep 30, 2008

Thank you everyone, for your input. The help is much appreciated.

Thomas Philips

TypeError: int() argument must be a string, a bytes-like object or a number, not 'list'	3	Oct 2, 2022
Drawing missing in bitmap in a pure C win32 program	4	Jun 3, 2023
Help for ActionPerformance and how to use HashMap.	2	Feb 10, 2022
How can I simplify my code and how can i get the balls to spawn in different locations	0	Jan 3, 2020
robust optimisation	1	Aug 5, 2008
Regression test	1	Mar 24, 2009
Regression testing for pointers	86	Mar 9, 2012
I Need Help with making a function that draws in a canvas using location data.	1	Dec 17, 2021

Finding subsets for a robust regression

tkpmep

bearophileHUGS

Gerard flanagan

Gerard flanagan

Gabriel Genellina

tkpmep

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads