sorting with expensive compares?

T

Tim Peters

[Steven D'Aprano]
...
As others have pointed out, Python's sort never compares the same objects
more than once.

Others have pointed it out, and it's getting repeated now, but it's
not true. Since I wrote Python's sorting implementation, it's
conceivable that I'm not just making this up ;-)

Here's the shortest counter-example: [10, 30, 20].

The first thing sort() does is compute the length of the longest
natural run (whether increasing or decreasing) starting at index 0.
It compares 10 and 30, sees that the list starts with an increasing
run, then compares 30 and 20 to see whether this run can be extended
to length 3. It can't. Since the list has only 3 elements, it
decides to use a binary insertion sort to move the 3rd element into
place. That requires 2 comparisons, the first of which _again_
compares 30 and 20.

This doesn't bother me, and it would cost more to avoid this than it
could possibly save in general. It's true that the workhorse binary
insertion and merge sorts never compare the same elements more than
once, but there is some potential duplication between those and
comparisons done to identify natural runs.
 
P

Paul Rubin

Steven D'Aprano said:
Sure, but I don't he is deliberately trying to sabotage his own files :)

He might have downloaded a file created by a saboteur to have the same
md5 as some popular music file, but which contains a subliminal
hypnotic message which will brainwash him if played. Using a stronger
hash, such as sha256, should protect him from this fate.
 
A

Alex Martelli

Tim Peters said:
[Steven D'Aprano]
...
As others have pointed out, Python's sort never compares the same objects
more than once.

Others have pointed it out, and it's getting repeated now, but it's
not true. Since I wrote Python's sorting implementation, it's
conceivable that I'm not just making this up ;-) ...
could possibly save in general. It's true that the workhorse binary
insertion and merge sorts never compare the same elements more than
once, but there is some potential duplication between those and
comparisons done to identify natural runs.

Since I probably was the first one to point out the erroneous factoid in
question, I apologize -- obviously, my memories of the inner workings of
sort were faulty, and didn't consider that potential duplication.

In this case, the tricks I already (though dubiously;-) suggested in
order to avoid any avoidable comparisons might pay for themselves and
then some, if comparisons are indeed extremely costly.


Alex
 
T

Thomas Wouters

He might have downloaded a file created by a saboteur to have the same
md5 as some popular music file, but which contains a subliminal
hypnotic message which will brainwash him if played. Using a stronger
hash, such as sha256, should protect him from this fate.

But the odds of such a message having the same MD5 as an existing
song on his disk is quite a lot higher than 2**64, unless he has a really,
really large music collection ;) In the case you propose, two files don't
just need to have the same MD5, but they also need to have a whole lot of
other characterstics; both need to be (somewhat) valid MP3's, one needs to
be a piece of music (or other sound) that is somewhat to the target's
liking, and the other needs to be something playable with a subliminal
message the target is likely to respond to.

Calculate-me-odds-on-THAT-<wink>-ly 'yrs,
 
P

Paul Rubin

Thomas Wouters said:
But the odds of such a message having the same MD5 as an existing
song on his disk is quite a lot higher than 2**64, unless he has a really,
really large music collection ;) In the case you propose, two files don't
just need to have the same MD5, but they also need to have a whole lot of
other characterstics; both need to be (somewhat) valid MP3's, one needs to
be a piece of music (or other sound) that is somewhat to the target's
liking, and the other needs to be something playable with a subliminal
message the target is likely to respond to.

The way the known collision attack works, the saboteur has to
construct both files. However, the attacker does have a fair amount
of control over the content. So he can start an innocent file
circulating, then replace it with a sabotaged file on some network.
A user might possibly somehow end up with both versions.

See: http://www.cits.rub.de/MD5Collisions/ for how that kind of attack
can work.
 
S

Stuart D. Gathman

Which is a very strange thing to do, but I'll assume you have a good
reason for doing so.

I believe what the original poster wants to do is eliminate duplicate
content from a collection of ogg/whatever files with different names.
E.g., he has a python script that goes out and collects all the free music
it can find on the web. The same song may appear on many sites under
different names, and he wants only one copy of a given song.

In any case, as others have pointed out, sorting by MD5 is sufficient
except in cases far less probable than hardware failure - and deliberate
collisions. E.g., the RIAA creates collision pairs of MP3 files where one
member carries a freely redistributable license, and the other a "copy
this and we'll sue your ass off" license in an effort to trap the unwary.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,771
Messages
2,569,587
Members
45,097
Latest member
RayE496148
Top