sorteddict PEP proposal [started off as orderedict]

R

Raymond Hettinger

[Mark Summerfield]
Below is a PEP proposal for a sorteddict. It arises out of a
discussion on this list that began a few weeks ago with the subject of
"An ordered dictionary for the Python library?"

It is worth remembering that the long sought after datatype was a dict
that could loop over keys in *insertion* order, not based on some
function of the key itself.

If there is positive feedback I will submit the PEP to the reviewers,
so if you think it is a good idea please say so.

As one who was submitted many PEPs, I can advise that the quickest way
to irrevocably kill an idea is to submit a PEP to early (we almost
didn't get genexps because the PEP was submitted pre-maturely).
Instead, I advise posting an ASPN recipe so the details can be
hammered-out and a fan club can be grown.

For this particular idea, I am -1 on inclusion in the standard
library. The PEP focuses mainly on implementation but is weakest in
the rationale section. Its advantages over "for k in sorted(d)" are
miniscule. To be considered for the collections module, there needs
to be compelling use case advantages either in terms of usability or
performance.

After grepping through my accumulated code base for "sorted", I see
remarkably few places where sortdict would have be preferred and in
those few cases, the advantage was so slight that the code would not
be worth changing. The majority of dictionary sorts are characterized
by fully building a dictionary and then sorting it. In those cases,
sortdict merely offers a different spelling and adds a small
performance cost during key insertion. The sortdict has some
potential in the comparatively rare case of a long lived dictionary
where periodic sorted key requests are interleaved with operations
that mutate the dictionary. In those cases, the separate tracking of
added/removed keys may save some sorting effort; however, there is an
offsetting cost of many __hash__/__cmp__ calls in the list
comprehension update and the modest savings was in all cases
trivialized by cost of whatever was initiating the dictionary
mutations (user inputs or web updates).

Also, there were cases when the sortdict was at a distinct
disadvantage. In those cases, there were multiple key functions
(key=record.name or key=record.date or key=record.size). There
standard approach using sorted() creates three separate sorted lists
and only one dict. In contrast, the sortdict approach would need
three sortdicts each with their own dict and sorted list.

IMHO, the collections module should be saved for something much more
powerful and generic like a dictionary that generates a callback
whenever it is mutated. A datatype like that would have it much
easier for you to implement something like this sortdict or the long
sought after dict that remembers its key insertion order.


Raymond
 
P

Paul Rubin

Mark Summerfield said:
The sorteddict API that has emerged so far is (1) apart from the
constructor, everything is identical to dict,

I don't see this as necessary, it's ok if it resembles dict as much
as dbm does.
(2) the constructor takes the same args as sorted(), so if you want
to seed with a dict or with keywords you write
sorteddict(dict(a=1,b=2), ...), (or you could create a sorteddict
and use update() since that takes the same args as dict's
constructor).

sorteddict could itself return a constructor:

d = sorteddict(key=int).new((('12', 'a'), ('3', 'b')))

gives sorted version of {'3':'b', '12':'a'}
 
J

James Stroud

Raymond said:
As one who was submitted many PEPs, I can advise that the quickest way
to irrevocably kill an idea is to submit a PEP to early (we almost
didn't get genexps because the PEP was submitted pre-maturely).

Maybe its the PEP killers who act prematurely because I friggin' love
genexps and the little generators they generate. LC can be crude by
comparison and is usually the construct that requires a compelling
reason for usage.

James
 
R

Raymond Hettinger

[James Stroud ]
Maybe its the PEP killers who act prematurely because I friggin' love
genexps and the little generators they generate.

No, it's the PEP author who controls when they ask for pronouncement.
The natural instinct is to ask for pronouncement as soon as you have
an implementation and a couple of advocates, but the smarter play is
to put the code in the wild and let it mature so that the kinks get
worked out, a fan club can grow, the patterns of usage become
established, and acceptance/rejection becomes self-evident.

Genexps got hung-up on details like whether the loop variable should
be visible (like it is with list comps), was it faster or slower than
list comps (unknown until an implementation was provided), could it be
fit into the grammar unambiguously (the answer was yes *after*
introducing a requirement for enclosing parentheses in some cases).
Also, there were some cases that had (and still have) surprising
behaviors -- in the end, those were addressed by adding advice to
consumer genexps right away and not pass them around like regular
generators where the scoping was more self-evident.

The other disadvantage of asking for pronouncement prematurely is the
risk that everyone gives it a nod without deep thought about whether
it is truly necessary and useful. I got dict.fromkeys() accepted
without much fuss but now am now sure it was a good idea -- it added
to an already heavy dict API but did not offer compelling usability
advantages or performance advantages. Had the set() builtin already
been in place, it is much less likely that dict.fromkeys would have
been accepted.


Raymond
 
D

Duncan Booth

Paul Hankin said:
So should Duncan's

def __removekey(self, key):
if key in self.__addkeys:
del self.__addkeys[key]
else:
self.__delkeys.add(key)

be changed to:

def __removekey(self, key):
if key in self.__addkeys:
del self.__addkeys[key]
self.__delkeys.add(key)

Yes, and the same in __addkey: if it's in __delkeys it should be
removed from there, and added to __addkeys. There's an invariant: any
key is in at most one of __addkeys and __delkeys.

No, don't do that. The code was fine as I had written it, except for the
minor point that sets don't support 'del'! Also there need to be some tests
which actually exercise those two branches of the code: once you add
appropriate tests it will become obvious that you really do need the else.

A key which is in dict must be either in __keycache or in __addkeys, but
never in both.

A key which is not in the dict is either in none of __keycache, __addkeys,
and __delkeys, or it is in both __keycache and __delkeys.

If you don't have the else in both __removekey and __addkey, then deleting
a key leaves the key present in both __keycache and __delkeys (as it
should), and re-adding the key then leaves it in both __keycache and
__addkeys which breaks the first condition and means iterating will see
the key twice. Alternatively adding a key not in the dictionary and then
deleting it leaves it only in __delkeys, which isn't as serious. Actually
maybe you only need it in __addkey, or maybe I still don't have enough
tests.

Here's the code with set removal fixed and the extra test added: try
removing the else from __addkey and you'll find it gives repeated indexes
on iteration.

However, what we really need now is some code which actually uses both
versions of sorteddict, and an ordinary dict sorted when needed, and
demonstrates a real benefit from one or both sorteddict over the ordinary
dict. I think my version should be faster, but it is more complicated and
often in Python the fastest way is also the simplest.

------------- sorteddict.py -----------------------
#!/usr/bin/env python
# Copyright (c) 2007 Qtrac Ltd. All rights reserved.
# This module is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or (at
# your option) any later version.
# The API and implementation are based on ideas from Paul Hankin writing
# on comp.lang.python.

"""
A dictionary that is sorted by key or by the given cmp or key function.

Provides a dictionary with the same methods and behavior as a standard
dict and that can be used as a drop-in replacement for a dict (apart
from the constructor), but which always returns iterators and lists
(whether of keys or values) in sorted order. It does not matter when
items are inserted or when removed, the items in the sorteddict are
always returned in sorted order. The ordering is implicitly based on the
key's __lt__() (or failing that __cmp__()) method if no cmp or key
function is given.

The main benefit of sorteddicts is that you never have to explicitly
sort.

One use case is where you have a set of objects that you want to make
available in various sorted orders. You could, of course, sort on demand,
but if the number of objects is very large (and if you're prepared to
sacrifice some memory), it may be faster to use a sorteddict. For
example:
.... def __init__(self, forename, surname, elected):
.... self.forename = forename
.... self.surname = surname
.... self.elected = elected.... ("Ramsay", "MacDonald", "1924-01-22"),
.... ("Arthur", "Balfour", "1902-07-11"),
.... ("Herbert Henry", "Asquith", "1908-04-07"),
.... ("Stanley", "Baldwin", "1924-11-04"),
.... ("David", "Lloyd George", "1916-12-07"),
.... ("Andrew", "Bonar Law", "1922-10-23"),
.... ("Henry", "Campbell-Bannerman", "1905-12-05"),
.... ("Stanley", "Baldwin", "1923-05-23"),
.... ("Ramsay", "MacDonald", "1929-06-05")):
.... pm = PrimeMinister(forename, surname, elected)
.... byForename[pm] = pm
.... bySurname[pm] = pm
.... byName[pm] = pm
.... byElected[pm] = pm
[pm.forename for pm in byForename.values()]
['Andrew', 'Arthur', 'David', 'Henry', 'Herbert Henry', 'Ramsay', \
'Ramsay', 'Stanley', 'Stanley']
[pm.surname for pm in bySurname.values()]
['Asquith', 'Baldwin', 'Baldwin', 'Balfour', 'Bonar Law', \
'Campbell-Bannerman', 'Lloyd George', 'MacDonald', 'MacDonald']
["%s %s %s" % (pm.forename, pm.surname, pm.elected) \
for pm in byName.values()]
['Ramsay MacDonald 1929-06-05', 'Ramsay MacDonald 1924-01-22', \
'David Lloyd George 1916-12-07', 'Henry Campbell-Bannerman 1905-12-05', \
'Andrew Bonar Law 1922-10-23', 'Arthur Balfour 1902-07-11', \
'Stanley Baldwin 1924-11-04', 'Stanley Baldwin 1923-05-23', \
'Herbert Henry Asquith 1908-04-07']
["%s %s %s" % (pm.forename, pm.surname, pm.elected) \
for pm in byElected.values()]
['Arthur Balfour 1902-07-11', 'Henry Campbell-Bannerman 1905-12-05', \
'Herbert Henry Asquith 1908-04-07', 'David Lloyd George 1916-12-07', \
'Andrew Bonar Law 1922-10-23', 'Stanley Baldwin 1923-05-23', \
'Ramsay MacDonald 1924-01-22', 'Stanley Baldwin 1924-11-04', \
'Ramsay MacDonald 1929-06-05']

Thanks to Python's object references, even though there are four
sorteddicts referring to the same PrimeMinister objects, only one instance
of each object is held in memory.
files = ["README.txt", "readme", "MANIFEST", "test.py"]
d = sorteddict([(name, name) for name in files], .... cmp=lambda a, b: cmp(a.lower(), b.lower()))
d.keys()
['MANIFEST', 'readme', 'README.txt', 'test.py']

Here are a few tests for some of the base class methods that are not
reimplemented:
d = sorteddict(dict(s=1, a=2, n=3, i=4, t=5, y=6))
d.get("X", 21) 21
d.get("i") 4
d.has_key("a") True
d.has_key("x") False
"a" in d True
"x" in d False
len(d) 6
del d["n"]
del d["y"]
len(d) 4
d.clear()
len(d) 0
d = sorteddict(dict(s=1, a=2, n=3, i=4, t=5, y=6))
d["i"] 4
d["y"] 6
d["z"]
Traceback (most recent call last):
....
KeyError: 'z'

Check that a mix of adding and removing keys doesn't cause confusion:
d = sorteddict(dict(s=1, a=2, n=3, i=4, t=5, y=6))
d.keys() ['a', 'i', 'n', 's', 't', 'y']
del d["t"]
d["t"]=6
d.keys()
['a', 'i', 'n', 's', 't', 'y']

"""

__author__ = "Mark Summerfield"
__version__ = "1.1.0"


class sorteddict(dict):
"""A dictionary that is sorted by key or by the given cmp or key function.

The sorteddict always returns any list or iterator in sorted order.
For best performance prefer a key argument to a cmp one. If neither
is given __lt__() (falling back to __cmp__()) will be used for
ordering.

This particular implementation has reasonable performance if the
pattern of use is: lots of edits, lots of lookups, ..., but gives
its worst performance if the pattern of use is: edit, lookup, edit,
lookup, ..., in which case using a plain dict and sorted() will
probably be better.

If you want to initialize with a dict, either use
sorteddict(dict(...), ...), or create the sorteddict and then call
update with the arguments normally passed to a dict constructor.
"""

def __init__(self, iterable=None, cmp=None, key=None, reverse=False):
"""Initializes the sorteddict using the same arguments as sorted()
>>> d = sorteddict(dict(s=1, a=2, n=3, i=4, t=5, y=6))
>>> d.items() [('a', 2), ('i', 4), ('n', 3), ('s', 1), ('t', 5), ('y', 6)]
>>> str(sorteddict()) '{}'
>>> e = sorteddict(d)
>>> e.items()
[('a', 2), ('i', 4), ('n', 3), ('s', 1), ('t', 5), ('y', 6)]
"""
if iterable is None:
iterable = []
dict.__init__(self, iterable)
self.__cmp = cmp
self.__key = key
self.__reverse = reverse
self.__keycache = None
self.__addkeys = set()
self.__delkeys = set()

@property
def __keys(self):
if self.__keycache is None:
self.__keycache = dict.keys(self)
self.__addkeys = set()
self.__delkeys = set()
self.__keycache.sort(cmp=self.__cmp, key=self.__key,
reverse=self.__reverse)
else:
if self.__delkeys:
delkeys = self.__delkeys
self.__keycache = [key for key in self.__keycache if key not in delkeys]
self.__delkeys = set()

if self.__addkeys:
self.__keycache += list(self.__addkeys)
self.__addkeys = set()
self.__keycache.sort(cmp=self.__cmp, key=self.__key,
reverse=self.__reverse)
return self.__keycache


def __invalidate_key_cache(self):
self.__keycache = None
self.__addkeys = set()
self.__delkeys = set()

def __addkey(self, key):
if key in self.__delkeys:
self.__delkeys.remove(key)
else:
self.__addkeys.add(key)

def __removekey(self, key):
if key in self.__addkeys:
self.__addkeys.remove(key)
else:
self.__delkeys.add(key)

def update(self, *args, **kwargs):
"""Updates the sorteddict using the same arguments as dict
>>> d = sorteddict(dict(s=1, a=2, n=3, i=4, t=5))
>>> d.update(a=4, z=-4)
>>> d.items() [('a', 4), ('i', 4), ('n', 3), ('s', 1), ('t', 5), ('z', -4)]
>>> del d["a"]
>>> del d["i"]
>>> d.update({'g': 9}, a=1, z=3)
>>> d.items() [('a', 1), ('g', 9), ('n', 3), ('s', 1), ('t', 5), ('z', 3)]
>>> e = sorteddict(dict(p=4, q=5))
>>> del d["a"]
>>> del d["n"]
>>> e.update(d)
>>> e.items()
[('g', 9), ('p', 4), ('q', 5), ('s', 1), ('t', 5), ('z', 3)]
"""
self.__invalidate_key_cache()
dict.update(self, *args, **kwargs)


@classmethod
def fromkeys(cls, iterable, value=None):
"""A class method that returns an sorteddict whose keys are
from the iterable and each of whose values is value
>>> d = sorteddict()
>>> e = d.fromkeys("KYLIE", 21)
>>> e.items() [('E', 21), ('I', 21), ('K', 21), ('L', 21), ('Y', 21)]
>>> e = sorteddict.fromkeys("KYLIE", 21)
>>> e.items()
[('E', 21), ('I', 21), ('K', 21), ('L', 21), ('Y', 21)]
"""
dictionary = cls()
for key in iterable:
dictionary[key] = value
return dictionary


def copy(self):
"""Returns a shallow copy of this sorteddict
>>> d = sorteddict(dict(s=1, a=2, n=3, i=4, t=5, y=6))
>>> e = d.copy()
>>> e.items() [('a', 2), ('i', 4), ('n', 3), ('s', 1), ('t', 5), ('y', 6)]
>>> d = sorteddict()
>>> e = d.copy()
>>> e.items()
[]
"""
return sorteddict(dict.copy(self), cmp=self.__cmp,
key=self.__key, reverse=self.__reverse)


def clear(self):
"""Removes every item from this sorteddict
>>> d = sorteddict(dict(s=1, a=2, n=3, i=4, t=5, y=6))
>>> len(d) 6
>>> d.clear()
>>> len(d) 0
>>> d["m"] = 3
>>> d["a"] = 5
>>> d["z"] = 7
>>> d["e"] = 9
>>> d.keys()
['a', 'e', 'm', 'z']
"""
self.__invalidate_key_cache()
dict.clear(self)


def setdefault(self, key, value):
"""If key is in the dictionary, returns its value;
otherwise adds the key with the given value which is also
returned
>>> d = sorteddict(dict(s=1, a=2, n=3, i=4, t=5, y=6))
>>> d.setdefault("n", 99) 3
>>> d.values() [2, 4, 3, 1, 5, 6]
>>> d.setdefault("r", -20) -20
>>> d.items()[2:] [('n', 3), ('r', -20), ('s', 1), ('t', 5), ('y', 6)]
>>> d.setdefault("@", -11) -11
>>> d.setdefault("z", 99) 99
>>> d.setdefault("m", 50) 50
>>> d.keys()
['@', 'a', 'i', 'm', 'n', 'r', 's', 't', 'y', 'z']
"""
if key not in self:
self.__addkey(key)
return dict.setdefault(self, key, value)


def pop(self, key, value=None):
"""If key is in the dictionary, returns its value and removes it
from the dictionary; otherwise returns the given value
>>> d = sorteddict(dict(s=1, a=2, n=3, i=4, t=5, y=6))
>>> d.pop("n") 3
>>> "n" in d False
>>> d.pop("q", 41) 41
>>> d.keys() ['a', 'i', 's', 't', 'y']
>>> d.pop("a") 2
>>> d.pop("t") 5
>>> d.keys()
['i', 's', 'y']
"""
if key not in self:
return value
self.__removekey(key)
return dict.pop(self, key, value)


def popitem(self):
"""Returns and removes an arbitrary item from the dictionary
3
"""
item = dict.popitem(self)
self.__removekey(item[0])
return item


def keys(self):
"""Returns the dictionary's keys in sorted order
['a', 'i', 'n', 's', 't', 'y']
"""
return self.__keys[:]


def values(self):
"""Returns the dictionary's values in key order
>>> d = sorteddict(dict(s=1, a=2, n=3, i=4, t=5, y=6))
>>> d.values() [2, 4, 3, 1, 5, 6]
>>> d = sorteddict(dict(s=1, a=2, n=3, i=4, t=5, y=6), ... reverse=True)
>>> d.values() [6, 5, 1, 3, 4, 2]
>>> d = sorteddict(dict(S=1, a=2, N=3, i=4, T=5, y=6), ... cmp=lambda a, b: cmp(a.lower(), b.lower()))
>>> d.keys()
['a', 'i', 'N', 'S', 'T', 'y']
"""
return [self[key] for key in self.__keys]


def items(self):
"""Returns the dictionary's items in key order
[('a', 2), ('i', 4), ('n', 3), ('s', 1), ('t', 5), ('y', 6)]
"""
return [(key, self[key]) for key in self.__keys]


def __iter__(self):
"""Returns an iterator over the dictionary's keys
['a', 'i', 'n', 's', 't', 'y']
"""
return iter(self.__keys)


def iterkeys(self):
"""Returns an iterator over the dictionary's keys
['a', 'i', 'n', 's', 't', 'y']
"""
return iter(self.__keys)


def itervalues(self):
"""Returns an iterator over the dictionary's values in key order
[2, 4, 3, 1, 5, 6]
"""
for key in self.__keys:
yield self[key]


def iteritems(self):
"""Returns an iterator over the dictionary's values in key order
[('a', 2), ('i', 4), ('n', 3), ('s', 1), ('t', 5), ('y', 6)]
"""
for key in self.__keys:
yield key, self[key]


def __delitem__(self, key):
"""Deletes the item with the given key from the dictionary
>>> d = sorteddict(dict(s=1, a=2, n=3, i=4, t=5, y=6))
>>> del d["s"]
>>> del d["y"]
>>> del d["a"]
>>> d.keys()
['i', 'n', 't']
"""
self.__removekey(key)
dict.__delitem__(self, key)


def __setitem__(self, key, value):
"""If key is in the dictionary, sets its value to value;
otherwise adds the key to the dictionary with the given value
>>> d = sorteddict(dict(s=1, a=2, n=3, i=4, t=5, y=6))
>>> d["t"] = -17
>>> d["z"] = 43
>>> d["@"] = -11
>>> d["m"] = 22
>>> d["r"] = 5
>>> d.keys()
['@', 'a', 'i', 'm', 'n', 'r', 's', 't', 'y', 'z']
"""
if key not in self:
self.__addkey(key)
dict.__setitem__(self, key, value)


def __repr__(self):
""" sorteddict({'a': 1}, key=<function keyfn at ...>, reverse=True)
"""
args = []
if self:
args.append(dict.__repr__(self))
if self.__cmp is not None:
args.append('cmp=%r' % self.__cmp)
if self.__key is not None:
args.append('key=%r' % self.__key)
if self.__reverse is not False:
args.append('reverse=%r' % self.__reverse)
return "%s(%s)" % (self.__class__.__name__, ', '.join(args))


def __str__(self):
"""Returns a human readable string representation of the
dictionary

The returned string is proportional in size to the number of
items so could be very large.
"{1: 'x', 2: 'a', 3: 'm'}"
"""
return "{%s}" % ", ".join(
["%r: %r" % (key, self[key]) for key in self.__keys])


if __name__ == "__main__":
import doctest
doctest.testmod()

----------------------------------------------------
 
P

Paul Hankin

Paul Hankin said:
So should Duncan's
def __removekey(self, key):
if key in self.__addkeys:
del self.__addkeys[key]
else:
self.__delkeys.add(key)
be changed to:
def __removekey(self, key):
if key in self.__addkeys:
del self.__addkeys[key]
self.__delkeys.add(key)
Yes, and the same in __addkey: if it's in __delkeys it should be
removed from there, and added to __addkeys. There's an invariant: any
key is in at most one of __addkeys and __delkeys.

No, don't do that. The code was fine as I had written it, except for the
minor point that sets don't support 'del'! Also there need to be some tests
which actually exercise those two branches of the code: once you add
appropriate tests it will become obvious that you really do need the else.

A key which is in dict must be either in __keycache or in __addkeys, but
never in both.

Yes, I'm sorry: you're right.

But there's a different bug: if you delete a key that's not in the
dict, you'll add it to the deleted list before the exception for the
missing key is raised.

sd = sorteddict.sorteddict()
sd['a'] = 'a'
print sd.keys() = ['a']
try:
del sd['b']
except:
pass
sd['b'] = 'b'
print sd.keys()

The second print statement produces ['a'] rather than ['a', 'b']
 
A

Antoon Pardon

I think you missed some of the previous postings.

That is possible.
The sorteddict API that has emerged so far is (1) apart from the
constructor, everything is identical to dict, (2) the constructor
takes the same args as sorted(), so if you want to seed with a dict or
with keywords you write sorteddict(dict(a=1,b=2), ...), (or you could
create a sorteddict and use update() since that takes the same args as
dict's constructor).

Could your AVLTree support cmp and keys and reverse?

It already supports cmp. Although for the moment it is done by
subclassing. Like the following.

class myTree(AVLTree):
@staticmethod
def cmp(a, b);
return ....

I don't think adding support for keys will be difficult but it may be
tedious.

I would prefer not to support reverse because the effect can be acquired
by the other two and in contradiction with sorting, I think using it
will slow things down instead of speeding things up.
 
D

Duncan Booth

Paul Hankin said:
A key which is in dict must be either in __keycache or in __addkeys, but
never in both.

Yes, I'm sorry: you're right.

But there's a different bug: if you delete a key that's not in the
dict, you'll add it to the deleted list before the exception for the
missing key is raised.

sd = sorteddict.sorteddict()
sd['a'] = 'a'
print sd.keys() = ['a']
try:
del sd['b']
except:
pass
sd['b'] = 'b'
print sd.keys()

The second print statement produces ['a'] rather than ['a', 'b']

Yes, I think there are probably several cases where I need to account for
exceptions.

There's another serious problem: if Mark wants this module to be used and
stand any chance at all of eventually being incorporated into the standard
library, then he will have to change the license on his code: the GPL
simply isn't an option here.
 
B

Bryan Olson

Mark said:
The sorteddict API that has emerged so far is (1) apart from the
constructor, everything is identical to dict, (2) the constructor
takes the same args as sorted(), so if you want to seed with a dict or
with keywords you write sorteddict(dict(a=1,b=2), ...), (or you could
create a sorteddict and use update() since that takes the same args as
dict's constructor).

Could your AVLTree support cmp and keys and reverse?


I think you are expending too much effort on the no-brainers.
Yes, AVL-trees can support cmp and keys and reverse. I cannot
speak for Antoon, but have no doubt that adapting his code to
do so is within his ability.

Hrvoje Niksic is right that a simple layer over existing
hashed dicts "dooms the PEP to rejection". Thus the candidate
algorithms are the approximately-balanced tree structures:
AVL trees, red-black trees, 2-3 trees, b-trees, splay trees,
or skip-lists.

Implementing these is non-trivial, but easily within the
ability of many people here. Don't worry about the coding.
 
M

Mark Summerfield

Yes, I'm sorry: you're right.
But there's a different bug: if you delete a key that's not in the
dict, you'll add it to the deleted list before the exception for the
missing key is raised.
sd = sorteddict.sorteddict()
sd['a'] = 'a'
print sd.keys() = ['a']
try:
del sd['b']
except:
pass
sd['b'] = 'b'
print sd.keys()
The second print statement produces ['a'] rather than ['a', 'b']

Yes, I think there are probably several cases where I need to account for
exceptions.

There's another serious problem: if Mark wants this module to be used and
stand any chance at all of eventually being incorporated into the standard
library, then he will have to change the license on his code: the GPL
simply isn't an option here.

I have fixed __addkey() and __removekey():

def __addkey(self, key):
if key in self.__delkeys:
self.__delkeys.remove(key)
else:
self.__addkeys.add(key)


def __removekey(self, key):
if key in self.__addkeys:
self.__addkeys.remove(key)
else:
self.__delkeys.add(key)

I have also fixed __delitem__():

try:
dict.__delitem__(self, key)
except KeyError:
raise
else:
self.__removekey(key)

As for the license, while it is on PyPI, I'll leave it as GPL v 3. If
it was wanted for the standard library (and I can't see that ever
happening), I will happily change it to the one that is preferred for
Python modules.
 
M

Mark Summerfield

[Mark Summerfield]
Below is a PEP proposal for a sorteddict. It arises out of a
discussion on this list that began a few weeks ago with the subject of
"An ordered dictionary for the Python library?"

It is worth remembering that the long sought after datatype was a dict
that could loop over keys in *insertion* order, not based on some
function of the key itself.
If there is positive feedback I will submit the PEP to the reviewers,
so if you think it is a good idea please say so.

As one who was submitted many PEPs, I can advise that the quickest way
to irrevocably kill an idea is to submit a PEP to early (we almost
didn't get genexps because the PEP was submitted pre-maturely).
Instead, I advise posting an ASPN recipe so the details can be
hammered-out and a fan club can be grown.

For this particular idea, I am -1 on inclusion in the standard
library. The PEP focuses mainly on implementation but is weakest in
the rationale section. Its advantages over "for k in sorted(d)" are
miniscule. To be considered for the collections module, there needs
to be compelling use case advantages either in terms of usability or
performance.
[snip]

I gave up on the idea of submitting it as a PEP many postings ago:)
I've now changed the subject line to reflect that.

I will leave the pure python sorteddict on PyPI for anyone who wants
it.

I won't put it on the cookbook site for the time being, simply because
it is eating up much more time than I thought!
 
D

Duncan Booth

Mark Summerfield said:
As for the license, while it is on PyPI, I'll leave it as GPL v 3. If
it was wanted for the standard library (and I can't see that ever
happening), I will happily change it to the one that is preferred for
Python modules.

Ok, your choice, just be aware that by using such a restrictive license you
will dissuade a lot of people from using your code. You've prevented your
module being used in any existing projects with Python license, GPL v2, or
probably any license other than GPL v3.
 
P

Paul Rubin

Duncan Booth said:
Ok, your choice, just be aware that by using such a restrictive license you
will dissuade a lot of people from using your code. You've prevented your
module being used in any existing projects with Python license, GPL v2, or
probably any license other than GPL v3.

GPL2 projects seem to be migrating nicely to GPL3 so GPL2 shouldn't be
too much of a problem.

Since one of the Python license's goals is to allow re-use in
proprietary programs, and the GPL family's main goal is the exact
opposite, I'd say if someone chooses a GPL, then preventing the code
from being used in a Python-licensed project was part of the intention
and not an undesired consequence. While Python itself is unlikely to
switch to a new license in order to import GPL code, other projects
can and have done exactly that.
 
G

gatti

I don't see a focused discussion of computational complexity of a
sorted dict; its API cannot be simpler than sorting a dictionary and
it has issues and complications that have already been discussed
without completely satisfactory solutions, so the only possible reason
to adopt a sorted dict is that some important use case for mapping
types becomes significantly cheaper.

With n entries, the size of a non-sorted hashtable, of a hashtable
plus lists or sets of keys, and of reasonable sorted dict
implementations with trees are all O(n). No substantial space
advantage can be obtained by sorting dictionaries.

Iterating through all n entries of a mapping, once, in sorted order,
is O(n) time and O(1) space with an unsorted hash table, a hash table
with a sorted list of the keys and all types of tree that I know of.
If there is a performance gain, it must come from amortizing
insertions, deletions and index-building. (The other operation, value
updates for an existing key, doesn't matter: updates cause no
structural changes and they must not invalidate any iterator.)

Let's consider a very simple use case: n insertions followed by x
iterations through all entries and n*y lookups by key.

Cost for a hashtable and an ad hoc sorted list of the keys,
fundamentally equivalent to sorting a Python dict:

O(n) for insertions
O(n log n) for indexing
O(nx) for iterations
O(ny) for lookups

Cost for a tree:

O(n log n) for insertions
no indexing
O(nx) for iterations
O(ny log n) for lookups

The hashtable comes out ahead because of cheaper lookups, for any x
and y; note that without lookups there is no reason to use a mapping
instead of a list of (key,value) tuples.

With an equal number k of insertions and deletions between the
iterations, the hashtable must be reindexed x times:

O(n) for insertions
O(kx) for updates and deletions
O(nx log n) for indexing and reindexing
O(nx) for iterations
O(ny) for lookups

The tree might be cheaper:

O(n log n) for insertions
O(kx log n) for updates and deletions
no indexing and reindexing
O(nx) for iterations
O(ny log n) for lookups

For a fixed small k, or with k proportional to n, reindexing the
hashtable and lookups in the tree are equally mediocre.

Maybe we could make k changes in the middle of each iteration. For a
naively reindexed hashtable:

O(n) for insertions
O(kx) for updates and deletions
O(knx log n) for indexing and reindexing
O(nx) for iterations
O(ny) for lookups

For a tree, the costs remain as above: the new factor of n for the
hashtable is fatal. Clever updates of the existing index or use of a
heap would lower the cost, but they would need to be encapsulated as a
sorteddict implementation.

Is this a practical use case? When are sequential visits of all
elements in order frequently suspended to make insertions and
deletions, with a need for efficient lookup by key?
- Priority queues; but given the peculiar pattern of insertions and
deletions there are more suitable specialized data structures.
- A* and similar best-first algorithms.
It's a small but important niche; maybe it isn't important enough for
the standard library. Other absent datatypes like heaps, an immutable
mapping type similar to frozenset and tuple, or disjoint sets with
would be more fundamental and general, and a mapping that remembers
the order of insertion regardless of keys would be equally useful.
In the Java collections framework all these kinds of mapping and
others coexist peacefully, but Python doesn't have the same kitchen
sink approach to basic libraries.

Regarding the API, a sorted dict should not expose random access by an
entry's position in the sequence: it is a gratuitous difficulty for
the implementor and, more importantly, a perversion of the mapping
data type. For that purpose there are lists and tuples, or explicit
indices like those of the Boost multi-index containers (http://
www.boost.org/libs/multi_index).
The only differences with dict should be the constraint that items(),
keys(), values(), iteritems(), iterkeys(), itervalues() return entries
sorted by key.

Regards,

Lorenzo Gatti
 
A

Antoon Pardon

Is this a practical use case? When are sequential visits of all
elements in order frequently suspended to make insertions and
deletions, with a need for efficient lookup by key?

Does it need to be a sequential visit of *all* elements?

Suppose you have a mapping of start times to tasks. You can then want to
iterate over all tasks that need to be started between noon en 4 pm next
monday. If you have a hashtable you still will need to sort all the keys
even if you will visit only 10%. If you have a tree you can just visit the
specified keys.
 
D

Duncan Booth

Antoon Pardon said:
Does it need to be a sequential visit of *all* elements?

Suppose you have a mapping of start times to tasks. You can then want
to iterate over all tasks that need to be started between noon en 4 pm
next monday. If you have a hashtable you still will need to sort all
the keys even if you will visit only 10%. If you have a tree you can
just visit the specified keys.
It still depends on the exact pattern of use. If you have an implementation
which tracks additions and deletions and sorts them into the list when
required (as we came up with earlier) then this is much more efficient that
re-sorting the whole list every time. Sorting a large sorted list with a
few unsorted elements on the end is comparable to inserting the elements
individually into a tree, and you still have the hashtable benefits on
accessing elements to help level the playing field.

For me though, the most convincing use-case for a sorted dictionary is one
that I don't think has been mentioned yet:
There are situations when you want to use a dictionary with existing
library code that doesn't care about the random key ordering, but you have
additional requirements which the original code didn't know about.

For example, in the sorteddict code I added an implementation for the
__repr__ method and an associated doctest. Unlike iteration over sorteddict
itself, I didn't bother to make __repr__ stable, so in that particular
doctest it only tests the repr of a sorteddict with a single element.
If that was fixed to make the repr stable it would be a real benefit for
writing any doctests which want to produce a dictionary as a result.

Another example would be if you had a library which serialised a dictionary
to xml. There is nothing wrong with the library if it doesn't care about
order, but if you have some other reason why you want the xml to be stable
(e.g. because you store it in a version control system and want to compare
revisions) then a sorteddict would allow you to impose that behaviour on
the library from outside.

Contrary to my earlier insistence that sorteddict is only really useful if
you can have a key parameter, both of these examples simply want an
arbitrary but defined order of iteration for dictionary keys. A much
simpler sorteddict that has been discussed earlier would be sufficient.
 
T

thebjorn

Another example would be if you had a library which serialised a dictionary
to xml. There is nothing wrong with the library if it doesn't care about
order, but if you have some other reason why you want the xml to be stable
(e.g. because you store it in a version control system and want to compare
revisions) then a sorteddict would allow you to impose that behaviour on
the library from outside.

Contrary to my earlier insistence that sorteddict is only really useful if
you can have a key parameter, both of these examples simply want an
arbitrary but defined order of iteration for dictionary keys. A much
simpler sorteddict that has been discussed earlier would be sufficient.

In fact, a dictionary that maintains insertion order would work in
this case too.

-- bjorn
 
D

Duncan Booth

thebjorn said:
In fact, a dictionary that maintains insertion order would work in
this case too.
It would be better than the current situation, but say I have elements 'a',
'b', and 'c'. Next run of the program I delete 'b', then the run after that
I add 'b' back into the list so now I might get 'a', 'c', 'b'. Sorting
would ensure that I can tell that I actually reverted a change.

Right now I think there are probably three dict variants needed: sorteddict
(still waiting for a convincing use case), ordereddict (lots of use cases),
and this one: stabledict.
 
T

thebjorn

Right now I think there are probably three dict variants needed: sorteddict
(still waiting for a convincing use case), ordereddict (lots of use cases),
and this one: stabledict.

What's stabledict? I'm assuming that ordereddict is a mapping that
maintains insertion order(?)

The only other mapping type I use very frequently is a dict where the
keys are limited to valid identifiers, and where attribute lookup
(d.foo) is defined as key lookup (d['foo']). It makes lots of code
easier to read (and write).

In the Smalltalk collection hierarchy SortedCollection is a subclass
of OrderedCollection, which implies to me that it'd be better to add
an ordereddict first.

-- bjorn
 
P

Paul Rubin

Mark Summerfield said:
Below is a PEP proposal for a sorteddict. ...

Is this proposal dead? I'd been meaning to post some thoughts which I
still haven't gotten around to writing up, and am wondering whether to
keep it on my todo list.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,773
Messages
2,569,594
Members
45,119
Latest member
IrmaNorcro
Top