What is the most efficient way to compare similar contents in two lists?

Zachary Dziura

similar_headers = 0
different_headers = 0
source_headers = sorted(source_mapping.headers)
target_headers = sorted(target_mapping.headers)

# Check if the headers between the two mappings are the same
if set(source_headers) == set(target_headers):
    similar_headers = len(source_headers)
else:
    # We're going to do two run-throughs of the tables, to find the
    # different and similar header names. Start with the source
    # headers...
    for source_header in source_headers:
        if source_header in target_headers:
            similar_headers += 1
        else:
            different_headers += 1
    # Now check target headers for any differences
    for target_header in target_headers:
        if target_header in source_headers:
            pass
        else:
            different_headers += 1

As you can probably tell, I make two passes: one over the
'source_headers' list, and another over the 'target_headers' list.
During the first pass, if a given header (bound to the variable
'source_header') exists in both lists, the 'similar_headers' variable
is incremented by one; otherwise, 'different_headers' is incremented
by one. The second pass only checks for different headers.

My code works as expected and there are no bugs; however, I get the
feeling that I'm not doing this comparison in the most efficient way
possible. Is there another way I can make this same comparison while
making my code more Pythonic and efficient? I would prefer not to
install an external module, though I will if I have to.

Thanks in advance for any and all answers!
 
Chris Angelico

if set(source_headers) == set(target_headers):
   similar_headers = len(source_headers)

Since you're making sets already, I'd recommend using set operations -
same_headers is the (length of the) intersection of those two sets,
and different_headers is the XOR.

# If you need the lists afterwards, use different variable names
source_headers = set(source_headers)
target_headers = set(target_headers)
similar_headers = len(source_headers & target_headers)
different_headers = len(source_headers ^ target_headers)
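
For illustration, with a couple of made-up header lists (the names here
are hypothetical, not from your tables), the counts come out like this:

source_headers = {"id", "name", "email", "phone"}
target_headers = {"id", "name", "email", "address"}

similar_headers = len(source_headers & target_headers)    # intersection: 3
different_headers = len(source_headers ^ target_headers)  # symmetric difference: 2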

Chris Angelico
 
Zachary Dziura

Since you're making sets already, I'd recommend using set operations -
similar_headers is the (length of the) intersection of those two sets,
and different_headers is the XOR.

# If you need the lists afterwards, use different variable names
source_headers = set(source_headers)
target_headers = set(target_headers)
similar_headers = len(source_headers & target_headers)
different_headers = len(source_headers ^ target_headers)

Chris Angelico

Wow! That was a lot easier than I thought it would be! I guess I
should have done a little bit more research into such operations.
Thanks a bunch!!
 
Chris Angelico

Wow! That was a lot easier than I thought it would be! I guess I
should have done a little bit more research into such operations.
Thanks a bunch!!

:)

Python: "Batteries Included".

(Although Python 3 is "Most of the batteries you're used to, included".)

ChrisA
 
Steven D'Aprano

Python: "Batteries Included".

(Although Python 3 is "Most of the batteries you're used to, included".)

Oh gods, you're not going to bring up sort(cmp=...) again are you????


/me ducks and covers
 
Chris Angelico

Oh gods, you're not going to bring up sort(cmp=...) again are you????

/me ducks and covers

Ha! That *lengthy* thread started fairly soon after I joined this
list. It was highly... informative. I learned a bit about Python, and
a lot about python-list, from the posts there.

Chris Angelico
 
Chris Angelico

This is a beautiful solution, and yet I feel compelled to mention that it
disregards duplicates within a given list.  If you need duplicate
detection/differencing, it's better to sort each list and then use an
algorithm similar to the merge step of mergesort.

The original example used the 'in' operator, which is effectively a
set operation anyway. As written, it would count all duplicates in the
source headers but only count one in the target. I'm guessing that the
context mandates no duplicates (say, if they're dictionary keys or
something - let's assume float("nan") is not a key).
Using sets as above is O(n), while the sorting version is usually
O(n log n), and O(n) is better than O(n log n).

And of course, the version based on sorting assumes order doesn't matter.
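
For reference, here's a minimal sketch of that sorted, merge-style
comparison (the function name is made up, both lists are assumed to be
already sorted, and duplicates are counted individually; it's an
illustration rather than code from this thread):

def count_similar_different(source_headers, target_headers):
    # Walk both sorted lists in lockstep, like the merge step of mergesort.
    similar = different = 0
    i = j = 0
    while i < len(source_headers) and j < len(target_headers):
        if source_headers[i] == target_headers[j]:
            similar += 1
            i += 1
            j += 1
        elif source_headers[i] < target_headers[j]:
            different += 1  # present only in source
            i += 1
        else:
            different += 1  # present only in target
            j += 1
    # Anything left over in either list has no match in the other.
    different += (len(source_headers) - i) + (len(target_headers) - j)
    return similar, different

Unlike the set version, a repeated header here only counts as similar
as many times as it appears in both lists.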

If order and duplicates matter, then you want a completely different
diff. I wrote one a while back, but not in Python. The algorithm went
something like this:

* Start with pointers to beginnings of both lists.
* See if current string is identical in each list; if so, increment
pointers and iterate.
* Once a difference is found, try to find a re-match by incrementing
pointer #1 until the two match. If the end of the first list is
reached, emit current line of #2 as a deletion and point pointer #1 to
just after where it started.
* On finding a re-match, emit all list #1 lines from first difference
to rematch as insertions.

(Since that was for comparing a source file with a user-supplied
modified source - a sort of diff/patch - a re-match was defined by N
consecutive matching lines, but in this, a re-match can simply be two
identical strings.)
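
A rough Python sketch of that algorithm, under one reading of the steps
above (list #1 is the modified text, list #2 the original; a re-match
here is simply two identical strings, and the names are made up):

def simple_diff(new_lines, old_lines):
    # Emit ('insert', line) and ('delete', line) operations that turn
    # old_lines (list #2) into new_lines (list #1).
    ops = []
    i = j = 0  # the two "pointers", as array indices
    while i < len(new_lines) and j < len(old_lines):
        if new_lines[i] == old_lines[j]:
            i += 1
            j += 1
            continue
        # Difference found: scan ahead in #1 for a re-match with #2's current line.
        k = i
        while k < len(new_lines) and new_lines[k] != old_lines[j]:
            k += 1
        if k == len(new_lines):
            # No re-match: the current line of #2 was deleted.
            ops.append(('delete', old_lines[j]))
            j += 1
        else:
            # Re-match found: everything in #1 from the first difference
            # up to the re-match is an insertion.
            ops.extend(('insert', line) for line in new_lines[i:k])
            i = k
    # Whatever remains in either list is all insertions or all deletions.
    ops.extend(('insert', line) for line in new_lines[i:])
    ops.extend(('delete', line) for line in old_lines[j:])
    return ops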

But for the situation given, I'm assuming that a simpler solution suffices.

Chris Angelico
 
Chris Angelico

The algorithm went
something like this:

* Start with pointers to beginnings of both lists.

PS. It wasn't C with pointers and the like; iirc it actually used
array indices as the "pointers". Immaterial to the algorithm though.

ChrisA
 
geremy condra

There was a huge thread about the removal of the cmp parameter to
sort() in Python 3.x. It was quite a wide-ranging discussion that touched
on many things. I think it got up to 90 posts in the first day, or
something. The main thing it taught me was that python-list is a
high-traffic list; the next most important thing was who the courteous
posters are.

I know that, but I mean what were you talking about before if you
weren't talking about cmp?
In contrast, the thread on NaNs in Python has taught me more about the
difference between floating point and 'real numbers' in a week or two
than all my previous computing training put together. Yep, this is an
interesting list.

I liked that thread. There should be more of those.

Geremy Condra
 
Chris Torek

If order and duplicates matter, then you want a completely different
diff. I wrote one a while back, but not in Python. ...

If order and duplicates matter, one might want to look into
difflib. :)
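
For instance, difflib.ndiff compares two sequences of strings in order,
duplicates included (the lists here are hypothetical, just to show the
output format):

import difflib

source = ["id", "name", "name", "email"]
target = ["id", "name", "email", "phone"]

for line in difflib.ndiff(source, target):
    print(line)
# Lines prefixed with "  " appear in both lists, "- " only in source,
# and "+ " only in target.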
 
Ethan Furman

Chris said:
Not sure what you mean.

I suspect Geremy is referring to your "most of the batteries you're used
to, included" comment -- which batteries are missing?

~Ethan~
 
Zachary Dziura

For this script, it's guaranteed that whatever tables the script goes
through and processes, there will be no duplicate headers. I didn't
include functionality to deal with duplicates because there won't be
any to begin with! I just wanted to find out the most efficient way of
checking for similar headers, which Chris A. told me! Thank you again,
Chris! You saved me a lot of headache!
 
Chris Angelico

I suspect Geremy is referring to your "most of the batteries you're used to,
included" comment -- which batteries are missing?

Oh! There's a handful of modules that aren't yet available in 3.x,
which might surprise someone who's moving from 2.x. But hey, I didn't
know about difflib (although I probably would have found it if I'd
gone searching). Python has more than expected, not less!

And Zachary: You're most welcome. Glad to have been of service!

ChrisA
 
