Large Dictionaries


Thomas Ganss

Klaas said:
4. Insert your keys in sorted order.
This advice is questionable - whether inserting pre-sorted
values is better depends at least on the db vendor, and
probably sometimes on the sort method.

My gut feeling on this matter is:
IF the insert times for pre-sorted values are far better
than the times for unsorted values, there is a chance
that the resulting tree is unbalanced: only 1 compare
operation after each insert and no re-balancing of the tree.

Re-indexing will probably give you far better access times
in this case. Another option is to drop non-RI indexes used only
for query optimization and recreate them after the batch insert.
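A minimal sketch of that pattern (mine, not Thomas's), using the stdlib
sqlite3 module; the table and index names are invented for illustration:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (k TEXT PRIMARY KEY, v TEXT)")
conn.execute("CREATE INDEX idx_items_v ON items (v)")  # non-RI, query-only index

rows = [("key%08d" % i, "value%d" % i) for i in range(100000)]

conn.execute("DROP INDEX idx_items_v")                 # drop before the batch insert
conn.executemany("INSERT INTO items VALUES (?, ?)", rows)
conn.execute("CREATE INDEX idx_items_v ON items (v)")  # rebuild once, after the load
conn.commit()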

my 0.02 EUR

thomas
 

Scott David Daniels

Thomas said:
This advice is questionable -....

My gut feeling on this matter is:
IF the insert times for pre-sorted values are far better
than the times for unsorted values, there is a chance
that the resulting tree is unbalanced: only 1 compare
operation after each insert and no re-balancing of the tree.

Re-indexing will probably give you far better access times
in this case. Another option is to drop non-RI indexes used only
for query optimization and recreate them after the batch insert.

Don't use your gut for such issues. Pre-sorted data is such
a common special case (in fact, the only easily describable
special case) that many systems handle this specially. For
example, try timing timsort (Python's internal sort) on pre-sorted
data; you'll find it is handled faster than random data.
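A quick way to try that (a sketch of mine; timings will vary by machine
and Python version):

import random, timeit

data = [random.random() for _ in range(10**6)]
presorted = sorted(data)

print(timeit.timeit(lambda: sorted(data), number=10))       # random input
print(timeit.timeit(lambda: sorted(presorted), number=10))  # pre-sorted: roughly O(N)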

If you are using your gut without testing, go ahead and
presort. In any case, reading the documentation and testing beats
gut feelings every time.

--Scott David Daniels
 

Martin v. Löwis

Roy said:
My guess would be that each resize grows the table by a constant
factor, which IIRC, works out to amortized O(n).

Right. For <50000 entries, the size is multiplied by 4 each time,
and doubled each time for large dictionaries.
It's not as good as
creating the dict the right size in the first place, but it's really
not that bad. Somewhat more troubling is that it can lead to memory
fragmentation problems.

Depends on your operating system, of course. You get over a page size
fairly quickly, and then many systems use anonymous mappings, where
a range of pages gets mapped from swap on allocation, and unmapped
on deallocation. So while this can cause address space fragmentation,
it doesn't cause memory fragmentation (i.e. memory that is allocated
by the process but can't be used).
I don't understand why Python doesn't have a way to give a size hint
when creating a dict. It seems so obvious to be able to say "d = dict
(size=10*1000*1000)" if you know beforehand that's how many keys
you're going to add.

There is no need for it. If dicts don't work well in some cases, these
cases should be investigated and fixed, instead of working around the
problem by loading the problem onto the developer.

As you said, it *should* have linear time, with the number of
insertions. If it doesn't (and it appears not to), that may be due to
hash collisions, or due to the time for computing hashes not being
constant.
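One rough way to test that claim (my sketch, using trivially well-hashed
integer keys; real-world keys can behave worse):

import time

for n in (10**5, 10**6, 10**7):
    t0 = time.time()
    d = dict.fromkeys(range(n))
    per_key = (time.time() - t0) / n
    # per-key cost should stay roughly flat if insertion is amortized O(1)
    print(n, per_key)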

Regards,
Martin
 

Lawrence D'Oliveiro

Scott David Daniels said:
For example, time timsort (Python's internal sort) on pre-sorted
data; you'll find it is handled faster than random data.

But isn't that how a reasonable sorting algorithm should behave? Less
work to do if the data is already sorted?
 

Steve Holden

Lawrence said:
But isn't that how a reasonable sorting algorithm should behave? Less
work to do if the data is already sorted?

Isn't that just your definition of "reasonable"?

regards
Steve
 

Aahz

Lawrence said:
But isn't that how a reasonable sorting algorithm should behave? Less
work to do if the data is already sorted?

Read some of the old discussions in the python-dev archives.
 

Iain King

Lawrence said:
But isn't that how a reasonable sorting algorithm should behave? Less
work to do if the data is already sorted?

An already sorted list can be pathological for Quicksort, depending on
how you code it.
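For example, a quicksort that always takes the first element as the pivot
degenerates to O(N**2) on sorted input, and a naive recursive version blows
the recursion limit quickly (a sketch of mine, not anyone's production code):

import random

def quicksort(seq):
    if len(seq) <= 1:
        return seq
    pivot, rest = seq[0], seq[1:]        # naive pivot choice: first element
    left = [x for x in rest if x < pivot]
    right = [x for x in rest if x >= pivot]
    return quicksort(left) + [pivot] + quicksort(right)

data = list(range(2000))
random.shuffle(data)
quicksort(data)                          # fine on random input
try:
    quicksort(sorted(data))              # recursion depth ~ len(data)
except RecursionError:
    print("pathological on sorted input")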

Iain
 

Tim Peters

[Scott David Daniels]
O(N) vs O(N log N), in fact.

[Lawrence D'Oliveiro]
But isn't that how a reasonable sorting algorithm should behave? Less
work to do if the data is already sorted?

For example, the O(N log N) heapsort is unreasonable compared to the
O(N**2) bubblesort by that reasoning (pre-sorted is actually a bad
case for heapsort, but a good case for bubblesort)? O(N log N)
sorting algorithms helped by pre-existing order are uncommon, unless
they do extra work to detect and exploit pre-existing order.

Python's current sorting algorithm exploits pre-existing order of many
kinds. For example, take a sorted list, "cut it in half, and riffle
shuffle the two halves together"; e.g.,

[0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15]

That turns out to be a supernaturally good case too.
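Building that input for a larger N and timing it (my sketch; exact numbers
depend on the machine):

import random, timeit

n = 10**6
half = n // 2
# cut a sorted list in half and riffle the halves together, as above
riffled = [x for pair in zip(range(half), range(half, n)) for x in pair]

shuffled = riffled[:]
random.shuffle(shuffled)

print(timeit.timeit(lambda: sorted(riffled), number=10))   # riffled: handled much faster
print(timeit.timeit(lambda: sorted(shuffled), number=10))  # random: the full N log N work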
 

Jim Segrave

[Scott David Daniels]
O(N) vs O(N log N), in fact.

[Lawrence D'Oliveiro]
But isn't that how a reasonable sorting algorithm should behave? Less
work to do if the data is already sorted?

For example, the O(N log N) heapsort is unreasonable compared to the
O(N**2) bubblesort by that reasoning (pre-sorted is actually a bad
case for heapsort, but a good case for bubblesort)? O(N log N)
sorting algorithms helped by pre-existing order are uncommon, unless
they do extra work to detect and exploit pre-existing order.

Actually, presorted lists are not a bad case for heapsort - it's quite
immune to any existing order or lack thereof, whereas some other
sorts, quicksort being a prime example, require great care to
avoid pathological cases.
 

Tim Peters

[Jim Segrave]
Actually, presorted lists are not a bad case for heapsort - it's quite
immune to any existing order or lack thereof,

Write a heapsort and time it. It's not a difference in O() behavior,
but more memory movement is required for a sorted list because
transforming the list into a max-heap at the start requires shuffling
the largest elements from the end of the list to its start. Initially
reverse-sorted is a "good case" for heapsort in the same sense
(because the list is already a max-heap then; initially sorted is as
far from being a max-heap as is possible). "good" and "bad" are both
O(N log N) here, though.
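A small check of that point (my sketch): count parent/child pairs that
violate the max-heap property.

def max_heap_violations(seq):
    # parent of index i is (i - 1) // 2; in a max-heap the parent is never smaller
    return sum(1 for i in range(1, len(seq)) if seq[(i - 1) // 2] < seq[i])

n = 1023
print(max_heap_violations(list(range(n, 0, -1))))  # reverse-sorted: 0 violations
print(max_heap_violations(list(range(1, n + 1))))  # sorted: n - 1 violations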

BTW, most people have never heard of Smoothsort, which was Dijkstra's
excruciating attempt to make heapsort exploit pre-existing order:

http://www.cs.utexas.edu/users/EWD/ewd07xx/EWD796.PDF

That's one of a hundred algorithms I rejected for Python ;-)

[Jim Segrave]
whereas some other sorts, quicksort being a prime example, require
great care to avoid pathological cases.

Indeed, Python gave up on quicksort for that reason. While the
samplesort hybrid that came between quicksort and the current sort
also had theoretical O(N**2) worst-case behavior, it was exceedingly
difficult to create such an input, and (unlike as for every quicksort
variant Python tried) no real-life case of undue sloth was ever
reported against it.
 

Jim Segrave

[Jim Segrave]
Actually, presorted lists are not a bad case for heapsort - it's quite
immune to any existing order or lack thereof,

[Tim Peters]
Write a heapsort and time it. It's not a difference in O() behavior,
but more memory movement is required for a sorted list because
transforming the list into a max-heap at the start requires shuffling
the largest elements from the end of the list to its start. Initially
reverse-sorted is a "good case" for heapsort in the same sense
(because the list is already a max-heap then; initially sorted is as
far from being a max-heap as is possible). "good" and "bad" are both
O(N log N) here, though.

I did implement a crude heapsort before posting this and measured the
number of comparisons and the number of swaps.

==================================================

#!/usr/local/bin/python

import random

class HeapSorter(object):
    """In-place heapsort that counts comparisons and swaps."""
    def __init__(self, seq):
        self.compares = 0
        self.swaps = 0
        self.length = len(seq)
        self.seq = seq

    def __sift(self, i):
        # Sift seq[i] down until the max-heap property holds below it.
        while True:
            j = 2 * i + 1
            if j + 1 < self.length:
                self.compares += 1
                if self.seq[j + 1] > self.seq[j]:
                    j = j + 1
            if j >= self.length:
                return

            self.compares += 1
            if self.seq[i] > self.seq[j]:
                return

            self.swaps += 1
            self.seq[i], self.seq[j] = (self.seq[j], self.seq[i])
            i = j

    def __makeheap(self):
        for i in range(self.length / 2, -1, -1):
            self.__sift(i)

    def sort(self):
        # Repeatedly move the heap maximum to the end and re-sift the root.
        self.__makeheap()
        for i in range(self.length - 1, -1, -1):
            self.swaps += 1
            self.seq[i], self.seq[0] = (self.seq[0], self.seq[i])
            self.length -= 1
            self.__sift(0)

if __name__ == '__main__':

    s = list(range(100000))
    random.shuffle(s)

    heap = HeapSorter(s)
    heap.sort()
    print "Random list: compares %d, swaps %d" % \
        (heap.compares, heap.swaps)
    heap = HeapSorter(s)          # s was sorted in place by the previous run
    heap.sort()
    print "Sorted list: compares %d, swaps %d" % \
        (heap.compares, heap.swaps)

    s.reverse()
    heap = HeapSorter(s)
    heap.sort()
    print "Reverse Sorted list: compares %d, swaps %d" % \
        (heap.compares, heap.swaps)
==================================================

Which gave the following results:

/usr/home/jes% python ./heapsort
Random list: compares 3019969, swaps 1575263
Sorted list: compares 3112517, swaps 1650855
Reverse Sorted list: compares 2926640, swaps 1497435

Assuming compares and swaps dominate other bookkeeping, heapsort
is quite stable in its performance, although the constant in the O(N log N)
is much larger than for other sorts.
 

Lawrence D'Oliveiro

"Tim Peters said:
For example, the O(N log N) heapsort is unreasonable compared to the
O(N**2) bubblesort by that reasoning (pre-sorted is actually a bad
case for heapsort, but a good case for bubblesort)? O(N log N)
sorting algorithms helped by pre-existing order are uncommon, unless
they do extra work to detect and exploit pre-existing order.

Shellsort works well with nearly-sorted data. It's basically a smarter
version of bubblesort with much improved efficiency. It's also very
compact to implement.
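For reference, a compact shellsort sketch (mine), using the simple
gap-halving sequence; better gap sequences come up later in the thread:

def shellsort(seq):
    gap = len(seq) // 2
    while gap > 0:
        # gapped insertion sort for this gap
        for i in range(gap, len(seq)):
            item, j = seq[i], i
            while j >= gap and seq[j - gap] > item:
                seq[j] = seq[j - gap]
                j -= gap
            seq[j] = item
        gap //= 2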
 

Tim Peters

[Lawrence D'Oliveiro]
Shellsort works well with nearly-sorted data. It's basically a smarter
version of bubblesort with much improved efficiency. It's also very
compact to implement.

Shellsort is much more a refinement of insertion sort than of
bubblesort, but it is not an O(N log N) algorithm regardless. A nice,
succinct summary of what was known about shellsort's behavior as of
Sedgewick's 1996 "Analysis of Shellsort and Related Algorithms" can be
found here:

http://www.inf.fh-flensburg.de/lang/algorithmen/sortieren/shell/shellen.htm
 

Marcin Ciura

Duncan said:
That isn't what the reference says. It only covers N up to a few thousand.
Practical values of N need to go up at least into the millions.

Please look at the performance graph of Tokuda's increment sequence.
You can see that it scales pretty well at least up to 100 million.
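For the curious, Tokuda's gaps can be generated from the closed form
h_k = ceil((9**k - 4**k) / (5 * 4**(k-1))), which gives 1, 4, 9, 20, 46,
103, ... (my sketch, not Marcin's code):

def tokuda_gaps(n):
    # gaps below n, largest first, for use in a shellsort
    gaps, k = [], 1
    while True:
        h = -(-(9**k - 4**k) // (5 * 4**(k - 1)))   # integer ceiling division
        if h >= n:
            break
        gaps.append(h)
        k += 1
    return gaps[::-1]

print(tokuda_gaps(1000))   # [525, 233, 103, 46, 20, 9, 4, 1]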
Marcin
 

Duncan Booth

Marcin said:
Please look at the performance graph of Tokuda's increment sequence.
You can see that it scales pretty well at least up to 100 million.

Ah, sorry, I misread it. It was sequences with several thousand elements,
which corresponds to an N of around 100 million.
 

Klaas

Thomas said:
This advice is questionable - whether inserting pre-sorted
values is better depends at least on the db vendor, and
probably sometimes on the sort method.

The article I wrote that you quoted named a specific vendor (Berkeley
DB) for which my advice is unquestionable and well-known.
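A rough way to measure this (my sketch; it assumes the old bsddb-style
btopen interface, which was the stdlib bsddb module in Python 2 and is
available elsewhere as the third-party bsddb3 package; the key format and
file paths are made up):

import random, time
import bsddb   # or: import bsddb3 as bsddb

keys = [("%012d" % i).encode() for i in range(1000000)]
shuffled = keys[:]
random.shuffle(shuffled)

for name, batch in (("sorted", keys), ("random", shuffled)):
    db = bsddb.btopen("/tmp/test_%s.db" % name, "n")   # fresh btree file
    t0 = time.time()
    for k in batch:
        db[k] = b"x"
    db.close()
    print(name, time.time() - t0)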

-Mike
 
