Sorting by item_in_another_list

C

Cameron Walsh

Hi,

I have two lists, A and B, such that B is a subset of A.

I wish to sort A such that the elements in B are at the beginning of A,
and keep the existing order otherwise, i.e. stable sort. The order of
elements in B will always be correct.

for example:

A = [0,1,2,3,4,5,6,7,8,9,10]
B = [2,3,7,8]

desired_result = [2,3,7,8,0,1,4,5,6,9,10]


At the moment I have defined a comparator function:

def sort_by_in_list(x,y):
ret = 0
if x in B:
ret -= 1
if y in B:
ret += 1
return ret

and am using:

A.sort(sort_by_in_list)

which does produce the desired results.

I do now have a few questions:

1.) Is this the most efficient method for up to around 500 elements?
If not, what would be better?
2.) This current version does not allow me to choose a different list
for B. Is there a bind_third function of some description that I could
use to define a new function with 3 parameters, feed it the third (the
list to sort by), and have the A.sort(sort_by_in_list) provide the other
2 variables?


Regards to all,

Cameron.
 
C

Cameron Walsh

Cameron said:
Hi,

I have two lists, A and B, such that B is a subset of A.

I wish to sort A such that the elements in B are at the beginning of A,
and keep the existing order otherwise, i.e. stable sort. The order of
elements in B will always be correct.

for example:

A = [0,1,2,3,4,5,6,7,8,9,10]
B = [2,3,7,8]

desired_result = [2,3,7,8,0,1,4,5,6,9,10]


At the moment I have defined a comparator function:

def sort_by_in_list(x,y):
ret = 0
if x in B:
ret -= 1
if y in B:
ret += 1
return ret

and am using:

A.sort(sort_by_in_list)

which does produce the desired results.

I do now have a few questions:

1.) Is this the most efficient method for up to around 500 elements? If
not, what would be better?
2.) This current version does not allow me to choose a different list
for B. Is there a bind_third function of some description that I could
use to define a new function with 3 parameters, feed it the third (the
list to sort by), and have the A.sort(sort_by_in_list) provide the other
2 variables?


Regards to all,

Cameron.


Well I found an answer to the second question with the following:
>>> A=[0,1,2,3,4,5,6,7,8,9,10]
>>> B=[2,3,7,8]
>>> def sort_by_in_list(in_list):
def ret_function(x,y):
ret = 0
if x in in_list:
ret -= 1
if y in in_list:
ret += 1
return ret
return ret_function
[2, 3, 7, 8, 0, 1, 4, 5, 6, 9, 10]


Hope it helps someone,

Cameron.
 
P

Paul McGuire

Cameron Walsh said:
Hi,

I have two lists, A and B, such that B is a subset of A.

I wish to sort A such that the elements in B are at the beginning of A,
and keep the existing order otherwise, i.e. stable sort. The order of
elements in B will always be correct.

for example:

A = [0,1,2,3,4,5,6,7,8,9,10]
B = [2,3,7,8]

desired_result = [2,3,7,8,0,1,4,5,6,9,10]


At the moment I have defined a comparator function:

def sort_by_in_list(x,y):
ret = 0
if x in B:
ret -= 1
if y in B:
ret += 1
return ret

and am using:

A.sort(sort_by_in_list)

which does produce the desired results.

I do now have a few questions:

1.) Is this the most efficient method for up to around 500 elements? If
not, what would be better?
2.) This current version does not allow me to choose a different list for
B. Is there a bind_third function of some description that I could use to
define a new function with 3 parameters, feed it the third (the list to
sort by), and have the A.sort(sort_by_in_list) provide the other 2
variables?

Think in Python. Define a function to take the list, and have that function
return the proper comparison function. This gives me the chance to also
convert the input list to a set, which will help in scaling up my list to
hundreds of elements. See below.

-- Paul


def sort_by_in_list(reflist):
reflist = set(reflist)
def sort_by_in_list_(x,y):
ret = 0
if x in reflist: ret -= 1
if y in reflist: ret += 1
return ret
return sort_by_in_list_

A = [0,1,2,3,4,5,6,7,8,9,10]
B = [2,3,7,8]
A.sort( sort_by_in_list(B) )
print A

Gives:
[2, 3, 7, 8, 0, 1, 4, 5, 6, 9, 10]
 
C

Cameron Walsh

Paul said:
Cameron Walsh said:
Hi,

I have two lists, A and B, such that B is a subset of A.

I wish to sort A such that the elements in B are at the beginning of A,
and keep the existing order otherwise, i.e. stable sort. The order of
elements in B will always be correct.

for example:

A = [0,1,2,3,4,5,6,7,8,9,10]
B = [2,3,7,8]

desired_result = [2,3,7,8,0,1,4,5,6,9,10]


At the moment I have defined a comparator function:

def sort_by_in_list(x,y):
ret = 0
if x in B:
ret -= 1
if y in B:
ret += 1
return ret

and am using:

A.sort(sort_by_in_list)

which does produce the desired results.

I do now have a few questions:

1.) Is this the most efficient method for up to around 500 elements? If
not, what would be better?
2.) This current version does not allow me to choose a different list for
B. Is there a bind_third function of some description that I could use to
define a new function with 3 parameters, feed it the third (the list to
sort by), and have the A.sort(sort_by_in_list) provide the other 2
variables?

Think in Python. Define a function to take the list, and have that function
return the proper comparison function. This gives me the chance to also
convert the input list to a set, which will help in scaling up my list to
hundreds of elements. See below.

-- Paul


def sort_by_in_list(reflist):
reflist = set(reflist)
def sort_by_in_list_(x,y):
ret = 0
if x in reflist: ret -= 1
if y in reflist: ret += 1
return ret
return sort_by_in_list_

A = [0,1,2,3,4,5,6,7,8,9,10]
B = [2,3,7,8]
A.sort( sort_by_in_list(B) )
print A

Gives:
[2, 3, 7, 8, 0, 1, 4, 5, 6, 9, 10]

Looks like our answers crossed-over. Must learn about sets in Python...

Thanks very much,

Cameron.
 
P

Paul Rubin

Cameron Walsh said:
for example:

A = [0,1,2,3,4,5,6,7,8,9,10]
B = [2,3,7,8]

desired_result = [2,3,7,8,0,1,4,5,6,9,10]

How about:

desired_result = B + sorted(x for x in A if x not in B)
 
P

Paul Rubin

ZeD said:
this. is. cool.

Actually it's better to use a set if B is large:

B2 = set(B)
desired_result = B + sorted(x for x in A if x not in B2)

That avoids a linear search through B for each element of A.
 
F

Fredrik Lundh

Paul said:
for example:

A = [0,1,2,3,4,5,6,7,8,9,10]
B = [2,3,7,8]

desired_result = [2,3,7,8,0,1,4,5,6,9,10]

How about:

desired_result = B + sorted(x for x in A if x not in B)

assuming that "keep the existing order" means what it says, you might as
well replace "sorted" with a list comprehension.

</F>
 
P

Paul Rubin

Fredrik Lundh said:
assuming that "keep the existing order" means what it says, you might
as well replace "sorted" with a list comprehension.

Hmm. I didn't read it that way, but yeah, if that's what the OP meant
then the sort could be skipped. I thought he just wanted a stable
sort, which the sorting primitives now promise.
 
J

J. Clifford Dyer

ZeD said:
Paul said:
A = [0,1,2,3,4,5,6,7,8,9,10]
B = [2,3,7,8]

desired_result = [2,3,7,8,0,1,4,5,6,9,10]
How about:

desired_result = B + sorted(x for x in A if x not in B)

this. is. cool.

Cool, yes, but I'm not entirely sure it does what the OP wanted. Partly
because I'm not entirely sure what the OP wanted. Counter example:

Given these variables:

A = [0,1,2,3,4,5,6,8,9,10] # Note 7 is missing
B = [2,3,7,8]

which of the following should the function yield?

desired_result = [2,3,7,8,0,1,4,5,6,9,10]
desired_result2 = [2,3,8,0,1,4,5,6,9,10]

The fact that we are ostensibly sorting A makes me thing it should be
the latter, but the example given was ambiguous. If we are in fact
looking for desired_result2, maybe we should use:

result = [ x for x in B if x in A ] + [ x for x in A if X not in B ]

or like the sibling post suggests: substitute set(A) and set(B) for the
"in" argument in each comprehension.

Cheers,
Cliff
 
P

Paul McGuire

J. Clifford Dyer said:
ZeD said:
Paul said:
A = [0,1,2,3,4,5,6,7,8,9,10]
B = [2,3,7,8]

desired_result = [2,3,7,8,0,1,4,5,6,9,10]
How about:

desired_result = B + sorted(x for x in A if x not in B)

this. is. cool.

Cool, yes, but I'm not entirely sure it does what the OP wanted. Partly
because I'm not entirely sure what the OP wanted. Counter example:

Given these variables:

A = [0,1,2,3,4,5,6,8,9,10] # Note 7 is missing
B = [2,3,7,8]

which of the following should the function yield?

From the original post:

"I have two lists, A and B, such that B is a subset of A."

So this is not a case that needs to be supported.

I envisioned something like the OP had a sequence of items to start with,
did a random sampling from the list, and wanted to move the sampled items to
the front of the list.

-- Paul
 
P

Paul McGuire

Paul McGuire said:
J. Clifford Dyer said:
ZeD said:
Paul Rubin wrote:

A = [0,1,2,3,4,5,6,7,8,9,10]
B = [2,3,7,8]

desired_result = [2,3,7,8,0,1,4,5,6,9,10]
How about:

desired_result = B + sorted(x for x in A if x not in B)

this. is. cool.

Cool, yes, but I'm not entirely sure it does what the OP wanted. Partly
because I'm not entirely sure what the OP wanted. Counter example:

Given these variables:

A = [0,1,2,3,4,5,6,8,9,10] # Note 7 is missing
B = [2,3,7,8]

which of the following should the function yield?

From the original post:

"I have two lists, A and B, such that B is a subset of A."

So this is not a case that needs to be supported.

I envisioned something like the OP had a sequence of items to start with,
did a random sampling from the list, and wanted to move the sampled items
to the front of the list.

-- Paul

Here is a little subclass of list that has some set-ish operators added.

-- Paul

# ListWithMath does set-like operations on lists, keeping list order
# stable where applicable
class ListWithMath(list):
def isSuperset(self,other):
return len(self ^ other) == len(other)

def __iadd__(self,other):
self.extend(other)
return self

def __isub__(self,other):
if self.isSuperset(other):
for i in other:
self.remove(i)
return self
else:
raise ValueError

def __ixor__(self,other):
for a in self[::-1]:
if not a in other:
self.remove(a)
return self

def __add__(self,other):
temp = ListWithMath(self)
temp += other
return temp

def __sub__(self,other):
temp = ListWithMath(self)
temp -= other
return temp

def __xor__(self,other):
temp = ListWithMath(self)
temp ^= other
return temp

def __radd__(self,other):
return ListWithMath(other) + self
def __rsub__(self,other):
return ListWithMath(other) - self
def __rxor__(self,other):
return ListWithMath(other) ^ self

A = ListWithMath( [0,1,2,3,4,5,6,7,8,9,10] )
B = ListWithMath( [2,3,7,10,8] )

print "%s+(%s-%s)" % (B,A,B)
C = B+(A-B) # union and exclusion
print C

D = ListWithMath( [1,2,3,6,8,11] )
print "%s ^ %s" % (B,D)
print B^D # intersection

print "%s ^= %s" % (B,D)
B ^= D # in-place intersection
print B

print C
C += []
print C
C -= []
print C
C ^= []
print C

Prints:
[2, 3, 7, 10, 8]+([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]-[2, 3, 7, 10, 8])
[2, 3, 7, 10, 8, 0, 1, 4, 5, 6, 9]
[2, 3, 7, 10, 8] ^ [1, 2, 3, 6, 8, 11]
[2, 3, 8]
[2, 3, 7, 10, 8] ^= [1, 2, 3, 6, 8, 11]
[2, 3, 8]
[2, 3, 7, 10, 8, 0, 1, 4, 5, 6, 9]
[2, 3, 7, 10, 8, 0, 1, 4, 5, 6, 9]
[2, 3, 7, 10, 8, 0, 1, 4, 5, 6, 9]
[]
 
C

Cameron Walsh

Paul said:
J. Clifford Dyer said:
ZeD said:
Paul Rubin wrote:

A = [0,1,2,3,4,5,6,7,8,9,10]
B = [2,3,7,8]

desired_result = [2,3,7,8,0,1,4,5,6,9,10]
How about:

desired_result = B + sorted(x for x in A if x not in B)
this. is. cool.
Cool, yes, but I'm not entirely sure it does what the OP wanted. Partly
because I'm not entirely sure what the OP wanted. Counter example:

Given these variables:

A = [0,1,2,3,4,5,6,8,9,10] # Note 7 is missing
B = [2,3,7,8]

which of the following should the function yield?

From the original post:

"I have two lists, A and B, such that B is a subset of A."

So this is not a case that needs to be supported.

I envisioned something like the OP had a sequence of items to start with,
did a random sampling from the list, and wanted to move the sampled items to
the front of the list.

-- Paul

The sequence of items I had was a list of files to be processed in a
particular order. I noted that the order in which the files were placed
into the list might not match the desired order of processing, and that
the filenames might not be sortable by filename into this desired order
(such are the vagaries of human file naming.)

So yes, Paul is correct.

However, Cliff is also correct in that the solution provided is not what
I (the OP) wanted. The solution provided is half right, in that it
places the elements also in B before the elements only in A, but it
sorts the remaining elements only in A instead of keeping the original
order.

A more correct solution (at least in terms of answering this particular
problem) would be:

desired_result = B + list(x for x in A if x not in B)
rather than
desired_result = B + sorted(x for x in A if x not in B)

This solution is also guaranteed to work if for some reason the sort
algorithm cannot be guaranteed to be stable in the future or on some
other implementation of Python.


Which brings me to the question, would this solution:

B = set(B)
A = B + list(x for x in A if x not in B)

be faster than this solution:

B = set(B)
A.sort(key=B.__contains__, reverse=True)


My guess is yes, since while the __contains__ method is only run once
for each object, the list still does not require sorting.

Thanks everyone for all your helpful replies,

Cameron.
 
S

Steven Bethard

Cameron said:
Which brings me to the question, would this solution:

B = set(B)
A = B + list(x for x in A if x not in B)

be faster than this solution:

B = set(B)
A.sort(key=B.__contains__, reverse=True)

My guess is yes, since while the __contains__ method is only run once
for each object, the list still does not require sorting.

You're probably going to need to time that for your particular data.
Here's what it looks like on an artificial data set:

$ python -m timeit -s "A = range(100); B = range(40, 60, 2); B_set =
set(B)" "A = B + list(x for x in A if x not in B_set)"
10000 loops, best of 3: 54.5 usec per loop

$ python -m timeit -s "A = range(100); B = range(40, 60, 2); B_set =
set(B)" "A.sort(key=B_set.__contains__, reverse=True)"
10000 loops, best of 3: 39.7 usec per loop

That said, I'd probably still use the first solution -- it's more
immediately obvious why that one works.

STeVe
 
P

Paul Rubin

Steven Bethard said:
Cameron said:
Which brings me to the question, would this solution:
B = set(B)
A = B + list(x for x in A if x not in B)
be faster than this solution:
B = set(B)
A.sort(key=B.__contains__, reverse=True)
[timings deleted]
That said, I'd probably still use the first solution -- it's more
immediately obvious why that one works.

Wait a minute, the first example looks wrong, B has gotten replaced by
a set and then it's added to a list.

Anyway how about timing

C = set(A) - set(B)
A = B + filter(C.__contains__, A)

This scans A twice, but it does more of the work in native code,
without sorting.
 
S

Steven Bethard

Paul said:
Steven Bethard said:
Cameron said:
Which brings me to the question, would this solution:
B = set(B)
A = B + list(x for x in A if x not in B)
be faster than this solution:
B = set(B)
A.sort(key=B.__contains__, reverse=True)
[timings deleted]
That said, I'd probably still use the first solution -- it's more
immediately obvious why that one works.

Wait a minute, the first example looks wrong, B has gotten replaced by
a set and then it's added to a list.

Yep. If you look back at the lines I actually timed, they were::

B_set = set(B)
A = B + list(x for x in A if x not in B_set)

and::

B_set = set(B)
A.sort(key=B_set.__contains__, reverse=True)

As you noted, you'll get an error if you try to concatenate B as a set
to the list.

Steve
 
C

Cameron Walsh

Steven said:
As you noted, you'll get an error if you try to concatenate B as a set
to the list.

Steve

Whoops, remind me to check things work before I type them. In the mean
time, here are some more interesting timing results:

With a larger data set, 500 elements instead of 100, the times were
almost the same:

python -m timeit -s "A=range(500); B=range(40,60,2); B_set=set(B)"
"A=B+list(x for x in A if x not in B_set)"
10000 loops, best of 3: 103 usec per loop

python -m timeit -s "A=range(500); B=range(40,60,2); B_set = set(B)"
"A.sort(key=B_set.__contains__, reverse=True)"
10000 loops, best of 3: 99.2 usec per loop

But for the last recommended one:

python -m timeit -s "A=range(500); B=range(40,60,2); C = set(A) -set(B)"
"A=B+filter(C.__contains__, A)"
10000 loops, best of 3: 38.3 usec per loop

Much faster! But that does not take into account the set creation and
set difference calculations. Let's try including that:

python -m timeit -s "A=range(500); B=range(40,60,2)" "C=set(A)-set(B)"
"a=B+filter(C.__contains__, A)"
10000 loops, best of 3: 105 usec per loop

and compare to our other two versions including set creation:

python -m timeit -s "A=range(500); B=range(40,60,2)" "B_set = set(B);
A=B+list(x for x in A if X not in B_set"
10000 loops, best of 3: 105 usec per loop

python -m timeit -s "A=range(500); B=range(40,60,2)" "B_set = set(B);
A.sort(key=B_set.__contains__, reverse=True)"
10000 loops, best of 3: 101 usec per loop

Pretty similar again.

However, converting each element to a string first ("A=list(str(x) for x
in range(500)" etc.), making it more similar to my data, gave:

102usec per loop for the A.sort() method
132usec per loop for the A=B+filter() method
104usec per loop for the A=B+list() method

It is interesting to note the increased time taken by the set
difference/filter method. It appears not to like strings as much as
ints. We can also shave another 7 usec off the filter time:

python -m timeit -s "A=list(str)x) for x in range(500)); B=list(str(x)
for x in range(40,60,2)); not_in_B_set=lambda x: x not in B_set"
"B_set=set(B); A=B+filter(not_in_B_set, A)"
10000 loops, best of 3: 125 usec per loop

In conclusion, it appears I should use the clearest of all the methods,
since for the data sizes I am dealing with, they are of approximately
the same speed, and the time taken is negligible anyway.

Thanks for all your help,

Cameron.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
474,432
Messages
2,571,680
Members
48,796
Latest member
Greg L.

Latest Threads

Top