Finding subsets for a robust regression

T

tkpmep

I have coded a robust (Theil-Sen) regression routine which takes as
inputs two lists of numbers, x and y, and returns a robust estimate of
the slope and intercept of the best robust straight line fit.

In a pre-processing phase, I create two new lists, x1 and y1; x1 has
only the unique values in x, and for each unique value in x1, y1 has
the median of all such values in x. My code follows, and it seems a
bit clumsy - is there a cleaner way to do it? By the way, I'd be more
than happy to share the code for the entire algorithm - just let me
know and I will post it here.

Thanks in advance

Thomas Philips

d = {} #identify unique instances of x and y
for xx,yy in zip(x,y):
if xx in d:
d[xx].append(yy)
else:
d[xx] = [yy]

x1 = [] #unique instances of x and y
y1 = [] #median(y) for each unique value of x
for xx,yy in d.iteritems():
x1.append(xx)
l = len(yy)
if l == 1:
y1.append(yy[0])
else:
yy.sort()
y1.append( (yy[l//2-1] + yy[l//2])/2.0 if l % 2 == 0 else
yy[l//2] )
 
B

bearophileHUGS

(e-mail address removed):
My code follows, and it seems a bit clumsy - is there a cleaner way to do it?

The code doesn't look bad. I can suggest few things:
- When you have "paragraphs" of code that do something definite then
the comment before them can be written without indentation, to denote
it regards the whole little block. Or you can even call a subfunction
(a function inside another function/method)
- Finding good names for variables is important.

So this code:
d = {} #identify unique instances of x and y
for xx,yy in zip(x,y):
if xx in d:
d[xx].append(yy)
else:
d[xx] = [yy]

May become (untested):
# identify unique instances of x and y
d = {}
for xx, yy in zip(x, y):
if xx in d:
d[xx].append(yy)
else:
d[xx] = [yy]

Or better:
# identify unique instances of seqx and seqx
d = defaultdict(list)
for x, y in izip(seqx, seqx):
d[x].append(y)

Or even something like:
def unique_x(seqx, seqx):
"""identify unique instances of seqx and seqx"""
result = defaultdict(list)
for x, y in izip(seqx, seqx):
result[x].append(y)
return result
...
d = unique_x(x, y)

The following line is too much heavy, it looks like a misuse of the
inline if syntax:
y1.append( (yy[l//2-1] + yy[l//2])/2.0 if l % 2 == 0 else
yy[l//2] )

It was not added to python for that long lines. This looks better to
me:

l = len(yy)
if l == 1:
y_median = yy[0])
else:
yy.sort()
if l & 1:
y_median = yy[l // 2]
else:
y_median = yy[l // 2 - 1] + yy[l // 2]) / 2.0
y1.append(y_median)

But even better is to use a stronger library code, for example a
median() able to run in O(n), and a more strong unique(), like:
http://code.activestate.com/recipes/438599/
http://code.activestate.com/recipes/466330/
But using a stronger unique() this time isn't useful, so I like this
compromise:

# identify unique instances of seqx and seqx
uniques_x = defaultdict(list)
for x, y in izip(seqx, seqx):
uniques_x[x].append(y)

result_x = []
result_y = []
for x, ys in uniques_x.iteritems():
result_x.append(x)
result_y.append(median(ys))

If you want to use the last bit of semantics of CPython you can even
write that code like this:

# identify unique instances of seqx and seqx
uniques_x = defaultdict(list)
for x, y in izip(seqx, seqx):
uniques_x[x].append(y)

result_x = uniques_x.keys()
result_y = map(median, uniques_x.itervalues())

I think it works because keys and values are given in the same order,
but in real code I tend to avoid using such subtle things. Because if
you translate that code to another language, or you use another Python
implementation it may not work anymore, and lot of code sooner or
later becomes translated...

Bye,
bearophile
 
G

Gerard flanagan

x1 = [] #unique instances of x and y
y1 = [] #median(y) for each unique value of x
for xx,yy in d.iteritems():
x1.append(xx)
l = len(yy)
if l == 1:
y1.append(yy[0])
else:
yy.sort()
y1.append( (yy[l//2-1] + yy[l//2])/2.0 if l % 2 == 0 else
yy[l//2] )
--

Not tested, but is this equivalent?

x1 = []
y1 = []
for xx, yy in d.iteritems():
L = len(yy) // 2
yy.sort()
y1.append(yy[L])
if not L & 1:
y1[-1] = (y1[-1] + yy[L-1]) / 2.0

It means that you have a pointless 'sort' when len(yy) == 1, but then
you save an 'if' per-iteration.

G.
 
G

Gerard flanagan

Gerard said:
x1 = [] #unique instances of x and y
y1 = [] #median(y) for each unique value of x
for xx,yy in d.iteritems():
x1.append(xx)
l = len(yy)
if l == 1:
y1.append(yy[0])
else:
yy.sort()
y1.append( (yy[l//2-1] + yy[l//2])/2.0 if l % 2 == 0 else
yy[l//2] )
--

Not tested, but is this equivalent?

x1 = []
y1 = []
for xx, yy in d.iteritems():
L = len(yy) // 2
yy.sort()
y1.append(yy[L])
if not L & 1:
y1[-1] = (y1[-1] + yy[L-1]) / 2.0

It means that you have a pointless 'sort' when len(yy) == 1, but then
you save an 'if' per-iteration.

G.

--

Maybe that should be:

if L and not L & 1:
...

anyway...
 
G

Gabriel Genellina

result_x = uniques_x.keys()
result_y = map(median, uniques_x.itervalues())

I think it works because keys and values are given in the same order,
but in real code I tend to avoid using such subtle things. Because if
you translate that code to another language, or you use another Python
implementation it may not work anymore, and lot of code sooner or
later becomes translated...

At least with respect to different Python implementations, it *is* clearly
stated in the documentation that "If items(), keys(), values(),
iteritems(), iterkeys(), and itervalues() are called with no intervening
modifications to the dictionary, the lists will directly correspond."

http://docs.python.org/lib/typesmapping.html
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,755
Messages
2,569,536
Members
45,007
Latest member
obedient dusk

Latest Threads

Top