sorting with expensive compares?


Dan Stromberg

Hi folks.

Python appears to have a good sort method, but when sorting array elements
that are very large, and hence have very expensive compares, is there some
sort of already-available sort function that will merge like elements into
a chain, so that they won't have to be recompared as many times?

Thanks!
 

gene tani

Dan said:
Hi folks.

Python appears to have a good sort method, but when sorting array elements
that are very large, and hence have very expensive compares, is there some
sort of already-available sort function that will merge like elements into
a chain, so that they won't have to be recompared as many times?

Thanks!

It might be simpler to memoize cmp(); look in the online cookbook, or at
something like this decorator:

http://mail.python.org/pipermail/python-list/2005-October/303035.html
http://aspn.activestate.com/ASPN/Python/Cookbook/
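
A minimal sketch of what memoizing the comparison might look like (the names
expensive_cmp and files below are placeholders, not anything from the original
post). Note that, as pointed out later in the thread, a single sort rarely asks
for the same pair twice, so a cache like this mostly pays off when the same
objects get compared across repeated sorts:

def expensive_cmp(a, b):
    # stand-in for a genuinely expensive comparison
    return cmp(a, b)

_cmp_cache = {}

def cached_cmp(a, b):
    # look the pair up first; only fall back to the real compare on a miss
    key = (id(a), id(b))
    try:
        return _cmp_cache[key]
    except KeyError:
        result = _cmp_cache[key] = expensive_cmp(a, b)
        _cmp_cache[(id(b), id(a))] = -result  # the reversed pair is just negated
        return result

files = ['banana', 'apple', 'cherry']   # stand-in data
files.sort(cached_cmp)                  # Python 2: list.sort() takes a cmp function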
 

bonono

Dan said:
Hi folks.

Python appears to have a good sort method, but when sorting array elements
that are very large, and hence have very expensive compares, is there some
sort of already-available sort function that will merge like elements into
a chain, so that they won't have to be recompared as many times?

Thanks!
Sounds like DSU time.

[a] -> [ (hash(a), a) ]
 

gene tani

Dan said:
Hi folks.

Python appears to have a good sort method, but when sorting array elements
that are very large, and hence have very expensive compares, is there some
sort of already-available sort function that will merge like elements into
a chain, so that they won't have to be recompared as many times?

Thanks!
Sounds like DSU time.

[a] -> [ (hash(a), a) ]

Aha! OR: take a log of the array, e.g. log base 10 or some other
monotonic transform and permutation order indexes
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/306862
 

bonono

gene said:
Dan said:
Hi folks.

Python appears to have a good sort method, but when sorting array elements
that are very large, and hence have very expensive compares, is there some
sort of already-available sort function that will merge like elements into
a chain, so that they won't have to be recompared as many times?

Thanks!
Sounds like DSU time.

[a] -> [ (hash(a), a) ]

Aha! OR: take a log of the array, e.g. log base 10 or some other
monotonic transform and permutation order indexes
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/306862
I may have made a mistake in that hash(a) should be some function that
returns the "order" of a, rather than the built-in hash() function.
 

Ben Sizer

Dan said:
Python appears to have a good sort method, but when sorting array elements
that are very large, and hence have very expensive compares, is there some
sort of already-available sort function that will merge like elements into
a chain, so that they won't have to be recompared as many times?

It's not really practical - if the list is unsorted, it's non-trivial
to determine how many times a given element is duplicated until you've
compared it with everything else. That is roughly an O(N*N/2) operation
whereas sorting typically is O(NlogN). This is why C++'s 'std::unique'
function only operates on sorted lists.

So instead, one alternative would be to use a comparison function that
takes the 2 objects and looks for the pair in a dictionary: if that
pair is not found, perform the normal comparison and store it in the
dictionary for next time, and if it is found, just return it. This way
the actual comparison is only done once for each pair.

Alternatively you might be able to produce a cheap comparison from the
expensive one if you can summarise the information in a simpler form.
Perhaps each sorting criterion can be represented as an integer, and
you can instead sort a list of lists containing integers.
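
A minimal sketch of that second idea (summarise() and items are illustrative
placeholders; the summary must sort in the same order as the real comparison
for this to be valid):

items = ['banana', 'apple', 'apricot']    # stand-ins for the large objects

def summarise(obj):
    # placeholder cheap summary: length first, then a short prefix
    return (len(obj), obj[:16])

decorated = [(summarise(obj), i, obj) for i, obj in enumerate(items)]
decorated.sort()     # compares only the cheap tuples; the index settles ties
items = [obj for summary, i, obj in decorated]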
 

Kent Johnson

Dan said:
Python appears to have a good sort method, but when sorting array elements
that are very large, and hence have very expensive compares, is there some
sort of already-available sort function that will merge like elements into
a chain, so that they won't have to be recompared as many times?

Sounds like DSU time.

[a] -> [ (hash(a), a) ]

This won't work - elements with different hashes will sort by hash and
elements with the same hash will still be compared which is exactly what
the OP is trying to avoid.

If there is some function of the arrays which sorts in the same order as
the natural comparison then that function can be used as a sort key.
sort(arrayList, key=some_proxy_function)

Kent
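
A minimal sketch of that, assuming a proxy that orders the same way the real
comparison would (the body of some_proxy_function and the contents of
arrayList here are only illustrative):

def some_proxy_function(array):
    # placeholder: must sort in the same order as comparing the arrays directly
    return (len(array), array[:64])

arrayList = ['some', 'large', 'strings']   # stand-in data
arrayList.sort(key=some_proxy_function)    # the key is computed once per element,
                                           # not once per comparison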
 

Paul Rubin

Kent Johnson said:
[a] -> [ (hash(a), a) ]

This won't work - elements with different hashes will sort by hash and
elements with the same hash will still be compared which is exactly
what the OP is trying to avoid.

ds = sorted([(hash(c), i) for i,c in enumerate(a)])
dsu = [a[i] for hc,i in ds]

Is that what you mean? I didn't answer the OP because I couldn't
understand the original question. The above brings together clusters
of elements with the same hash, so if the clusters are large you can
finish the sort with relatively few comparisons.
 

Aahz

Python appears to have a good sort method, but when sorting array
elements that are very large, and hence have very expensive compares,
is there some sort of already-available sort function that will merge
like elements into a chain, so that they won't have to be recompared as
many times?

I'll just note that Python's sort is specifically designed to reduce the
number of compares precisely because *all* compares in Python are
relatively expensive. I'll suggest a variant on the previous suggestion
of hash:

[a] -> [hash(a), index(a), a]
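
One way to read that decoration (purely illustrative data; note Kent's earlier
caveat that this clusters equal elements by hash rather than giving the natural
sort order):

a = ['red', 'green', 'red', 'blue']                   # stand-in data
decorated = [(hash(x), i, x) for i, x in enumerate(a)]
decorated.sort()     # equal elements share a hash and end up adjacent;
                     # the index settles ties without ever comparing the elements
result = [x for h, i, x in decorated]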
 

Dan Stromberg

Hi folks.

Python appears to have a good sort method, but when sorting array elements
that are very large, and hence have very expensive compares, is there some
sort of already-available sort function that will merge like elements into
a chain, so that they won't have to be recompared as many times?

Thanks!

I probably should've been more specific.

I'm wanting to sort a large number of files, like a bunch of output files
from a large series of rsh or ssh outputs on a large series of distinct
machines, a music collection in .ogg format (strictly redistributable and
legally purchased music), a collection of .iso cdrom images (strictly
redistributable and legally purchased software), and so forth.

I'm not sorting the content of each file individually.

I'm treating each file as a potentially very large string, and "sorting
the strings".

I've been using the following compare function, which in short checks, in
order:

1) device number
2) inode number
3) file length
4) the beginning of the file
5) an md5 hash of the entire file
6) the entire file

(If #1 and #2 are identical, then the file must be a hardlink to the other
file. Also, files that do not have the same length can never be
identical. And of course these items are generated on demand, making it
frequently possible to avoid doing full-file-length compare on a lot of
files)

However, my program is still doing more #6 comparisons than seems
strictly necessary when I could just toss all the filenames describing
identical files into a list, and avoid re-comparing files with identical
content over and over (I don't want to compare them to each other again
and again), when there are a lot of identical files in the input list.

Thanks!

def __cmp__(self,other):
    # sys.stderr.write('Comparing %s and %s\n' % (self.name, other.name))
    if verbosity >= 1:
        sys.stderr.write('Comparing file_class objects %s and %s\n' %
                         (self.name, other.name))
    if self.device == -1:
        if verbosity:
            sys.stderr.write('Getting stat data for file %s\n' % self.name)
        result = os.stat(self.name)
        self.device = result[stat.ST_DEV]
        self.inode = result[stat.ST_INO]
        self.size = result[stat.ST_SIZE]
    if other.device == -1:
        if verbosity:
            sys.stderr.write('Getting stat data for file %s\n' % other.name)
        result = os.stat(other.name)
        other.device = result[stat.ST_DEV]
        other.inode = result[stat.ST_INO]
        other.size = result[stat.ST_SIZE]
    if self.device == other.device and self.inode == other.inode:
        # same device, same inode, must be identical files return 0
    if self.length < other.length:
        return -1
    elif self.length > other.length:
        return 1
    # if we've reached this point, the files are not hardlinks, and their
    # lengths are identical, so slurp in the prefixes if needed, then compare
    # them
    if self.prefix == -1:
        if verbosity:
            sys.stderr.write('Getting prefix for file %s\n' % self.name)
        file = open(self.name, 'r')
        self.prefix = file.read(self.max_prefix_length)
        file.close()
    if other.prefix == -1:
        if verbosity:
            sys.stderr.write('Getting prefix for file %s\n' % other.name)
        file = open(other.name, 'r')
        other.prefix = file.read(self.max_prefix_length)
        file.close()
    if self.prefix < other.prefix:
        return -1
    elif self.prefix > other.prefix:
        return 1
    # if we've reached this point, we know that:
    # 1) The files are not hardlinks of each other
    # 2) They have identical sizes
    # 3) Their prefixes are identical
    # 4) We haven't yet tried doing a cryptographic hash, so compute them if
    #    needed, and compare them
    if self.hash == '':
        self.gen_hash()
    if other.hash == '':
        other.gen_hash()
    if self.hash < other.hash:
        return -1
    elif self.hash > other.hash:
        return 1
    # if we've reached this point, we know that:
    # 1) The files are not hardlinks of each other
    # 2) They have identical sizes
    # 3) Their prefixes are identical
    # 4) The cryptographic hashes are identical
    # 5) All that remains is a "long form" comparison, so bite the bullet
    #    and do the I/O
    if verbosity:
        sys.stderr.write('Doing byte for byte comparison of %s and %s\n' %
                         (self.name, other.name))
    self_fp = open(self.name, 'r')
    other_fp = open(other.name, 'r')
    while 1:
        self_buf = self_fp.read(self.bufsize)
        other_buf = other_fp.read(self.bufsize)
        if self_buf < other_buf:
            self_fp.close()
            other_fp.close()
            return -1
        elif self_buf > other_buf:
            self_fp.close()
            other_fp.close()
            return 1
        if not self_buf and not other_buf:
            self_fp.close()
            other_fp.close()
            return 0
        if not self_buf:
            self_fp.close()
            other_fp.close()
            return -1
        if not other_buf:
            self_fp.close()
            other_fp.close()
            return 1
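
A hedged sketch of the "toss all the filenames describing identical files into
a list" idea, separate from the class above: group filenames by (size, md5) so
any byte-for-byte confirmation only has to happen within a bucket. The helper
names here (md5_of_file, bucket_by_content) are illustrative, and the sketch
skips the device/inode hardlink shortcut:

import os
import md5     # Python 2-era module; hashlib.md5 in later versions

def md5_of_file(name, bufsize=65536):
    # hash the file in chunks rather than slurping it into memory whole
    digest = md5.new()
    fp = open(name, 'rb')
    while 1:
        buf = fp.read(bufsize)
        if not buf:
            break
        digest.update(buf)
    fp.close()
    return digest.hexdigest()

def bucket_by_content(filenames):
    # filenames that agree on (size, md5) land in the same bucket; files with
    # identical content then only need to be checked once per bucket
    buckets = {}
    for name in filenames:
        key = (os.path.getsize(name), md5_of_file(name))
        buckets.setdefault(key, []).append(name)
    return buckets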
 

bonono

Dan said:
I probably should've been more specific.

I'm wanting to sort a large number of files, like a bunch of output files
from a large series of rsh or ssh outputs on a large series of distinct
machines, a music collection in .ogg format (strictly redistributable and
legally purchased music), a collection of .iso cdrom images (strictly
redistributable and legally purchased software), and so forth.

I'm not sorting the content of each file individually.

I'm treating each file as a potentially very large string, and "sorting
the strings".

I've been using the following compare function, which in short checks, in
order:

1) device number
2) inode number
3) file length
4) the beginning of the file
5) an md5 hash of the entire file
6) the entire file

(If #1 and #2 are identical, then the file must be a hardlink to the other
file. Also, files that do not have the same length can never be
identical. And of course these items are generated on demand, making it
frequently possible to avoid doing full-file-length compare on a lot of
files)

However, my program is still doing more #6 comparisons than seems
strictly necessary when I could just toss all the filenames describing
identical files into a list, and avoid re-comparing files with identical
content over and over (I don't want to compare them to each other again
and again), when there are a lot of identical files in the input list.
Why would #5 not be enough as an indicator that the files are identical?
 

Alex Martelli

Dan Stromberg said:
I'm wanting to sort a large number of files, like a bunch of output files
from a large series of rsh or ssh outputs on a large series of distinct
machines, a music collection in .ogg format (strictly redistributable and
legally purchased music), a collection of .iso cdrom images (strictly
redistributable and legally purchased software), and so forth.

I'm not sorting the content of each file individually.

I'm treating each file as a potentially very large string, and "sorting
the strings".

OK, very clear.
I've been using the following compare function, which in short checks, in
order:

1) device number
2) inode number
3) file length
4) the beginning of the file
5) an md5 hash of the entire file
6) the entire file

Makes sense, including the possibility of computing some items on
demand. However, using a key-callable rather than a cmp-callable may
still make sense -- you just need a callable that extracts the
attributes on demand (and caches them thereafter... assuming you have
enough memory to keep all the caches, I guess [6] might be a problem).
I do see that key-callables won't necessarily work easily here, but
since the key-callable's __cmp__ will in turn be called, that might
still work better... but, on reflection, it's probably just a minor
gain.
However, my program is still doing more #6 comparisons than seems
strictly necessary when I could just toss all the filenames describing
identical files into a list, and avoid re-comparing files with identical
content over and over (I don't want to compare them to each other again
and again), when there are a lot of identical files in the input list.

The comparison should never get called twice on the same two objects,
anyway. So, I'm not sure what you think you could gain by optimizing
future comparisons of two objects once you've ascertained they are in
fact equal. Still, assuming for example that self.name is a unique
identifier (i.e. the so-called name is in fact a complete path), the
optimization (memoization) is quite easy to perform. Rename what you
currently have as __cmp__ by another name, say _real_cmp, add a
_compared dict to the class, and code the following __cmp__ ...:

_compared = {}
def __cmp__(self, other):
    try: return -self._compared[other.name, self.name]
    except KeyError: pass
    key = self.name, other.name
    if key in self._compared: return self._compared[key]
    result = self._compared[key] = self._real_cmp(other)
    return result

I would not expect this optimization to matter at all, because no key
should ever be already present in the self._compared dictionary (the
same two files should never, ever get compared twice anyway).

However, it might be possible to extend this idea by using the
properties you know an ordering should have -- if A and B have never
been compared, but both have been compared with C, in some cases you
don't need to compare A and B but may exploit already-known results.
For example, if A==C and B==C, you already know that A==B without any
actual comparison; if A<C and B==C, you already know that A<B; and so
on. Of course in some cases you still must work, e.g. if you know A<C
and B<C that doesn't help. I'm not sure if this would help at all,
either: it's quite possible that the sort algorithm already exploits
these possibilities internally. But, perhaps, it may be worth a try.


Alex
 

Peter Otten

Dan said:
I'm wanting to sort a large number of files, like a bunch of output files
from a large series of rsh or ssh outputs on a large series of distinct
machines, a music collection in .ogg format (strictly redistributable and
legally purchased music), a collection of .iso cdrom images (strictly
redistributable and legally purchased software), and so forth.

Are you really trying to establish an order, or do you want to eliminate the
duplicates?
True

doesn't make that much sense to me, regardless of what the comparison may
actually do.

Peter
 

Steve Holden

Dan Stromberg wrote: [...]
I've been using the following compare function, which in short checks, in
order:

1) device number
2) inode number
3) file length
4) the beginning of the file
5) an md5 hash of the entire file
6) the entire file
[...]
Why would #5 not be enough as an indicator that the files are identical?
Because it doesn't guarantee that the files are identical. It indicates,
to a very high degree of probability (particularly when the file lengths
are equal), that the two files are the same, but it doesn't guarantee it.

Technically there are an infinite number of inputs that can produce the
same md5 hash.

regards
Steve
 

Steven D'Aprano

Dan Stromberg wrote:
[snip]
I've been using the following compare function, which in short checks, in
order:

1) device number
2) inode number
3) file length
4) the beginning of the file
5) an md5 hash of the entire file
6) the entire file

(If #1 and #2 are identical, then the file must be a hardlink to the other
file. Also, files that do not have the same length can never be
identical. And of course these items are generated on demand, making it
frequently possible to avoid doing full-file-length compare on a lot of
files)

However, my program is still doing more #6 comparisons than seems
strictly necessary when I could just toss all the filenames describing
identical files into a list, and avoid re-comparing files with identical
content over and over (I don't want to compare them to each other again
and again), when there are a lot of identical files in the input list.

Why would #5 not be enough as an indicator that the files are identical?

(1) Because the MD5 algorithm does have collisions. I was going to say
"rare collisions", but there is actually an infinite number of them. The
collisions are just distributed thinly -- because MD5 is a good hash
algorithm, *very* thinly.

(Proof of this comes from the pigeon-hole principle: there is an infinite
number of possible files of arbitrary length, and only a finite number of
possible hashes. Therefore, an infinite number of files must hash to each
possible hash.)

(2) Having said that, unless the original poster is dealing with billions
(plus) of files, it is unlikely that he is finding any of the collisions
unless there is a bug in his sort routine. Since he claims to be doing
more comparisons-by-file-contents than expected (I would suggest *one*
is more than expected), I predict a bug in his code, his algorithm, or
both.
 

Steven D'Aprano

Are you really trying to establish an order, or do you want to eliminate the
duplicates?

True

doesn't make that much sense to me, regardless of what the comparison may
actually do.

If I have understood the poster's algorithm correctly, it gets even
weirder:


Sorted list of files ->

[parrot.ogg, redhat.iso, george.log, fred.log, rhino.ogg, cat.ogg,
debian.iso, sys_restore.iso, adrian.log, fox.ogg, ...]

It seems to this little black duck that by sorting by file contents in
this way, the effect to the human reader is virtually to randomise the
list of file names.

Even if you limit yourself to (say) a set of ogg files, and sort by the
binary contents ->

# album-track
[parrot-6.ogg, rhino-1.ogg, cat-12.ogg, fox-2.ogg, parrot-3.ogg, ...]

most people looking at the list would guess it had been shuffled, not
sorted. So I too don't know what the original poster hopes to accomplish
by sorting on the content of large binary files.
 

Paul Rubin

Dan Stromberg said:
I've been using the following compare function, which in short checks, in
order:

1) device number
2) inode number
3) file length
4) the beginning of the file
5) an md5 hash of the entire file
6) the entire file

(If #1 and #2 are identical, then the file must be a hardlink to the other
file. Also, files that do not have the same length can never be
identical. And of course these items are generated on demand, making it
frequently possible to avoid doing full-file-length compare on a lot of
files)

However, my program is still doing more #6 comparisons than seems

Just don't do any #6 comparisons. If the hashes are identical, the
files are identical--that's the point of a cryptographic hash. OK, to
be pedantic:

1) there's a microscopic chance of an accidental collision, but
it's much lower than the chance of a hardware failure making your
comparison function give the wrong answer.

2) there are known cryptanalytic attacks against md5 that can let
someone deliberately construct two distinct files with the same
hash, with a fair amount of effort. If you care about this, use
sha-1 instead, or better yet, sha-256. (sha-1 has an attack
similar to the md5 attack, but the amount of work required is not
really practical today, unlike the md5 attack).
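
If that advice is taken, the digest is a one-line swap in whatever helper
computes the hash; a minimal sketch using the standard hashlib module
(available from Python 2.5 on -- the function name here is just illustrative):

import hashlib

def sha256_of_data(data):
    # sha-256 in place of md5; for very large files, feed the data in chunks
    # with repeated update() calls instead of passing it all at once
    return hashlib.sha256(data).hexdigest()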
 

Steven D'Aprano

I probably should've been more specific.

I'm wanting to sort a large number of files, like a bunch of output files
from a large series of rsh or ssh outputs on a large series of distinct
machines, a music collection in .ogg format (strictly redistributable and
legally purchased music), a collection of .iso cdrom images (strictly
redistributable and legally purchased software), and so forth.

I'm not sorting the content of each file individually.

I'm treating each file as a potentially very large string, and "sorting
the strings".

Which is a very strange thing to do, but I'll assume you have a good
reason for doing so.
I've been using the following compare function,

My first suggestion is that you invest a bit of time in building a
small-scale set of data that you can test your sorting on, if you haven't
already done so. On multi-megabyte files, taking the MD5 hash or comparing
the content byte by byte is quite slow. So I would spend a few minutes
populating a directory with a dozen or so fake iso and ogg files, only a
few bytes in length each, and run my tests on them.
which in short checks, in order:

1) device number
2) inode number
3) file length
4) the beginning of the file
5) an md5 hash of the entire file
6) the entire file

(If #1 and #2 are identical, then the file must be a hardlink to the
other file. Also, files that do not have the same length can never be
identical. And of course these items are generated on demand, making it
frequently possible to avoid doing full-file-length compare on a lot of
files)

My second suggestion is to encapsulate vigorously, at least for testing.
At the moment, you have a single __cmp__ function that does everything
in-line. I would do something like this:

def __cmp__(self, other):
    result = compare_by_device_number_and_inode(self, other)
    if result is None:
        result = compare_by_file_length(self, other)
    if result is None:
        ...
    return result

You can always unroll the function calls later, but for now it helps in
debugging: you only need to think about one type of comparison at a time,
and it makes it easier to think about what each function needs to do.
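
For instance, one of those helpers might look like this (purely illustrative;
it assumes the size has already been fetched onto the objects, and returns
None to mean "no decision yet, fall through to the next test"):

def compare_by_file_length(self, other):
    # files of different lengths can be ordered immediately; equal lengths
    # mean we still don't know, so hand the decision to the next comparison
    if self.size < other.size:
        return -1
    elif self.size > other.size:
        return 1
    return None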

However, my program is still doing more #6 comparisons than seems
strictly necessary when I could just toss all the filenames describing
identical files into a list, and avoid re-comparing files with identical
content over and over (I don't want to compare them to each other again
and again), when there are a lot of identical files in the input list.

As others have pointed out, Python's sort never compares the same objects
more than once.

Some more comments follow interleaved with your code:

def __cmp__(self,other):
# sys.stderr.write('Comparing %s and %s\n' % (self.name, other.name))

Are you sure you don't want to just sort by file name? That would be much
simpler *wink*
    if verbosity >= 1:
        sys.stderr.write('Comparing file_class objects %s and %s\n' %
                         (self.name, other.name))
    if self.device == -1:
        if verbosity:
            sys.stderr.write('Getting stat data for file %s\n' % self.name)
        result = os.stat(self.name)
        self.device = result[stat.ST_DEV]
        self.inode = result[stat.ST_INO]
        self.size = result[stat.ST_SIZE]

Since you need to compare stat data for all these objects, why not
simply pre-fetch them when you create the object? That way you can
simplify your __cmp__ function significantly.

[snip]
if self.device == other.device and self.inode == other.inode:
# same device, same inode, must be identical files return 0

At first I thought that my news client had mangled your function, but
going back to the original post, either *your* client has mangled it, or
I've found a bug in your code. "return 0" is commented out.

Is it possible that all the excess file comparisons are being done only
for hard links?

[snip]

    if verbosity:
        sys.stderr.write('Getting prefix for file %s\n' % self.name)
    file = open(self.name, 'r')
    self.prefix = file.read(self.max_prefix_length)
    file.close()

"prefix" is probably not the best name. Maybe "header" (as in "file
header")?

If the prefix/header is short enough, don't worry about fetching on
demand, at least not while you are still debugging. Just pre-fetch it, get
your code working, and then convert back to fetching the header on demand.

    if other.prefix == -1:
        if verbosity:
            sys.stderr.write('Getting prefix for file %s\n' % other.name)
        file = open(other.name, 'r')
        other.prefix = file.read(self.max_prefix_length)
        file.close()
    if self.prefix < other.prefix:
        return -1
    elif self.prefix > other.prefix:
        return 1
    # if we've reached this point, we know that:
    # 1) The files are not hardlinks of each other
    # 2) They have identical sizes
    # 3) Their prefixes are identical
    # 4) We haven't yet tried doing a cryptographic hash, so compute them
    #    if needed, and compare them
    if self.hash == '':
        self.gen_hash()
    if other.hash == '':
        other.gen_hash()

Without knowing what gen_hash() does, it is hard to say whether the bug
might lie here. If you are using the tried-and-tested MD5 module, that
should be good, but if you have tried to roll your own, I would seriously
consider the possibility that you have a bug in the gen_hash() method.

    if self.hash < other.hash:
        return -1
    elif self.hash > other.hash:
        return 1
    # if we've reached this point, we know that:
    # 1) The files are not hardlinks of each other
    # 2) They have identical sizes
    # 3) Their prefixes are identical
    # 4) The cryptographic hashes are identical
    # 5) All that remains is a "long form" comparison, so bite the bullet
    #    and do the I/O

Or just accept that the probability of a collision is so small that you
aren't ever going to find one.

According to this website:

http://www.iusmentis.com/technology/encryption/pgp/pgpattackfaq/hash/

the expected number of random unique files you would need to compare
before finding a single collision in the MD5 hashes is (very roughly)
10**70, or ten billion trillion trillion trillion trillion trillion.

Somehow, I don't think you've got that many legal ogg files *grin*.

If you are finding even a single MD5 collision in a trillion comparisons,
it would be a miracle, or more likely a bug.

    if verbosity:
        sys.stderr.write('Doing byte for byte comparison of %s and %s\n' %
                         (self.name, other.name))
    self_fp = open(self.name, 'r')
    other_fp = open(other.name, 'r')
    while 1:
        self_buf = self_fp.read(self.bufsize)
        other_buf = other_fp.read(self.bufsize)
        if self_buf < other_buf:
            self_fp.close()
            other_fp.close()
            return -1
        elif self_buf > other_buf:
            self_fp.close()
            other_fp.close()
            return 1
        if not self_buf and not other_buf:
            self_fp.close()
            other_fp.close()
            return 0
        if not self_buf:
            self_fp.close()
            other_fp.close()
            return -1
        if not other_buf:
            self_fp.close()
            other_fp.close()
            return 1

But you've stated that an invariant of the problem is that if you get to
the point of comparing file contents, the files must be the same size.
That means that it is impossible for one buffer to be empty when the other
is not. If you find that condition, you should raise an error -- it means
the file has changed size in the time between checking that the lengths
are equal, and reading the file contents into a buffer.
 

Paul Rubin

Steven D'Aprano said:
http://www.iusmentis.com/technology/encryption/pgp/pgpattackfaq/hash/

the expected number of random unique files you would need to compare
before finding a single collision in the MD5 hashes is (very roughly)
10**70, or ten billion trillion trillion trillion trillion trillion.

That's not right, the right number is around 2**64 or 2e19. See the
section of that page about the birthday attack. It's still an awful lot.

There are also known ways of deliberately constructing md5 collisions
(i.e. md5 is broken). Whether the OP should care about that depends
on the application.
 

Steven D'Aprano

That's not right, the right number is around 2**64 or 2e19. See the
section of that page about the birthday attack. It's still an awful lot.

Oh poot, a stupid typo! I entered 1e60 in my calculation instead of 1e6,
which is doubly embarrassing because the correct number to use is 1e9:

In MD5's case, 2**64 messages need to be tried. This is not a feasible
attack given today's technology. If you could try 1,000,000 messages per
second, it would take 584,942 years to find a collision. A machine that
could try 1,000,000,000 messages per second would take 585 years, on
average.
[end quote]

So the expected number of files (messages) needed to check to find a
collision is:

585*365*24*60*60*1e9 = 1.8e19

or more than ten million million million, which of course equals 2**64.
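
A quick interpreter check of that arithmetic (the two figures agree to within
the rounding already present in the quoted 585 years):

>>> 585 * 365 * 24 * 60 * 60 * 10**9
18448560000000000000L
>>> 2 ** 64
18446744073709551616L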



There are also known ways of deliberately constructing md5 collisions
(i.e. md5 is broken). Whether the OP should care about that depends
on the application.

Sure, but I don't think he is deliberately trying to sabotage his own files :)
 
