Proposed implementation for an Ordered Dictionary

Discussion in 'Python' started by Raymond Hettinger, Feb 26, 2009.

  1. Here's a proposed implementation for Py2.7 and Py3.1:

    http://code.activestate.com/recipes/576669/

    Would like you guys to kick the tires, exercise it a bit, and let me
    know what you think. The recipe runs under 2.6 and 3.0 without
    modification so it should be easy to play with.

    The main virtue of the proposed API is a near-zero learning curve.
    The API is the same as for regular dictionaries, but it remembers
    insertion order and uses that for iteration, repr, etc.
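    The described behavior can be sketched against the OrderedDict that
    this proposal eventually put into the stdlib (an illustration added
    with hindsight, not part of the recipe):

    ```python
    # Illustration: insertion order drives iteration and repr.
    from collections import OrderedDict

    d = OrderedDict()
    d['banana'] = 3
    d['apple'] = 4
    d['pear'] = 1

    print(list(d))    # keys iterate in insertion order
    d['banana'] = 10  # re-assigning an existing key keeps its position
    print(list(d))    # order unchanged
    ```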


    Raymond
     
    Raymond Hettinger, Feb 26, 2009
    #1

  2. Raymond Hettinger <python <at> rcn.com> writes:

    >
    > Here's a proposed implementation for Py2.7 and Py3.1:
    >
    > http://code.activestate.com/recipes/576669/
    >
    > Would like you guys to kick the tires, exercise it a bit, and let me
    > know what you think. The recipe runs under 2.6 and 3.0 without
    > modification so it should be easy to play with.


    Why not just inherit from collections.MutableMapping?
     
    Benjamin Peterson, Feb 26, 2009
    #2

  3. Raymond Hettinger <> writes:

    > Here's a proposed implementation for Py2.7 and Py3.1:
    >
    > http://code.activestate.com/recipes/576669/


    Several methods like __contains__() and __getitem__() are not
    overridden, so their performance is just as fast as a regular
    dictionary's.

    Methods like __setitem__ and __delitem__ are overridden but have a
    fast path depending on whether or not the key already exists.

    It seems that __delitem__ of an existing key is O(n), whereas it's
    amortized constant time for dicts. (__setitem__ is constant time for
    both.) Is there a way to avoid this?

    If not, it should probably be documented, since it differs from dict.
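    The asymmetry is easy to see in a stripped-down sketch (hypothetical
    code, not the recipe itself) where the ordering is kept in a plain
    list:

    ```python
    # Minimal list-based sketch: __setitem__ appends in amortized O(1),
    # but __delitem__ must call list.remove, which scans the whole
    # ordering list: O(n).
    class ListOrderedDict(dict):
        def __init__(self):
            dict.__init__(self)
            self._keys = []                    # insertion order

        def __setitem__(self, key, value):
            if key not in self:
                self._keys.append(key)         # amortized O(1)
            dict.__setitem__(self, key, value)

        def __delitem__(self, key):
            dict.__delitem__(self, key)
            self._keys.remove(key)             # linear scan: O(n)

        def __iter__(self):
            return iter(self._keys)

    d = ListOrderedDict()
    for k in 'bac':
        d[k] = None
    del d['a']
    print(list(d))  # ['b', 'c']
    ```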
     
    Hrvoje Niksic, Feb 26, 2009
    #3
  4. [Benjamin Peterson]
    > Why not just inherit from collections.MutableMapping?


    It makes the recipe shorter to inherit some methods from dict. Also,
    subclassing from dict gives a speedup for __getitem__(), __len__(),
    and get().
     
    Raymond Hettinger, Feb 26, 2009
    #4
  5. [Hrvoje Niksic]
    > It seems that __delitem__ of an existing key is O(n), whereas it's
    > amortized constant time for dicts.  (__setitem__ is constant time for
    > both.)  Is there a way to avoid this?


    I don't see any fast/clean way. It's possible to track pending
    deletions and do them all at once, but that's a bit messy and slow.


    > If not, it should probably be documented, since it differs from dict.


    That makes sense.


    Raymond
     
    Raymond Hettinger, Feb 26, 2009
    #5
  6. Raymond Hettinger <> writes:
    > I don't see any fast/clean way. It's possible to track pending
    > deletions and do them all at once, but that's a bit messy and slow.


    What about using a second dictionary (indexed by the incrementing
    counter) instead of a list to record the insertion order? Then you
    have O(1) deletion, and traversal takes an O(n log n) sorting
    operation.
     
    Paul Rubin, Feb 26, 2009
    #6
  7. [Paul Rubin]
    > What about using a second dictionary (indexed by the incrementing
    > counter) instead of a list to record the insertion order?  Then you
    > have O(1) deletion, and traversal takes an O(n log n) sorting
    > operation.


    My first attempt used exactly that approach and it works fine
    though it does complicate the code quite a bit and slows down
    all of the common operations by a constant factor.


    Raymond
     
    Raymond Hettinger, Feb 26, 2009
    #7
  8. > > What about using a second dictionary (indexed by the incrementing
    > > counter) instead of a list to record the insertion order?  Then you
    > > have O(1) deletion, and traversal takes an O(n log n) sorting
    > > operation.


    > My first attempt used exactly that approach and it works fine
    > though it does complicate the code quite a bit and slows down
    > all of the common operations by a constant factor.


    FWIW, here is the code for that approach.

    --------------------------------------
    from collections import MutableMapping

    class OrderedDict(dict):

        def __init__(self, *args, **kwds):
            if len(args) > 1:
                raise TypeError('expected at most 1 argument, got %d'
                                % len(args))
            self.clear()
            self.update(*args, **kwds)

        def clear(self):
            self.__cached_sort = None
            self.__cnt = 0
            self.__order = {}
            dict.clear(self)

        def __setitem__(self, key, value):
            if key not in self:
                self.__cached_sort = None
                self.__cnt += 1
                self.__order[key] = self.__cnt
            dict.__setitem__(self, key, value)

        def __delitem__(self, key):
            if key in self:
                self.__cached_sort = None
                del self.__order[key]
            dict.__delitem__(self, key)

        def __sorted(self):
            # return keys in insertion order and cache the result
            if self.__cached_sort is None:
                self.__cached_sort = sorted(dict.keys(self),
                                            key=self.__order.__getitem__)
            return self.__cached_sort

        def __iter__(self):
            return iter(self.__sorted())

        def __reversed__(self):
            return iter(reversed(self.__sorted()))

        def popitem(self):
            # O(n) cost on the first pop and O(1) on successive pops
            if not self:
                raise KeyError
            key = self.__sorted().pop()
            del self.__order[key]
            value = dict.pop(self, key)
            return key, value

        # Methods with indirect access via the above methods

        setdefault = MutableMapping.setdefault
        update = MutableMapping.update
        pop = MutableMapping.pop
        keys = MutableMapping.keys
        values = MutableMapping.values
        items = MutableMapping.items

        def __repr__(self):
            return '{' + ', '.join(map('%r: %r'.__mod__, self.items())) + '}'

        def copy(self):
            return self.__class__(self)

        @classmethod
        def fromkeys(cls, iterable, value=None):
            d = cls()
            for key in iterable:
                d[key] = value
            return d

        def __reduce__(self):
            # pickle an items list and any extra info in the instance dict
            items = list(self.items())
            inst_dict = vars(self).copy()
            for attr in vars(self.__class__):
                inst_dict.pop(attr, None)
            return (self.__class__, (items,), inst_dict)
     
    Raymond Hettinger, Feb 26, 2009
    #8
  9. Raymond Hettinger <> writes:
    > > What about using a second dictionary

    > My first attempt used exactly that approach and it works fine
    > though it does complicate the code quite a bit and slows down
    > all of the common operations by a constant factor.


    Ehh, I guess I'm not surprised at the slowdown and extra complexity
    from the second dict. Oh well. If the module really turns out to be
    used a lot, another (messy) approach would be to write a C extension
    that uses a doubly linked list some day.
     
    Paul Rubin, Feb 27, 2009
    #9
  10. [Paul Rubin]
    > Ehh, I guess I'm not surprised at the slowdown and extra complexity
    > from the second dict.  Oh well.  If the module really turns out to be
    > used a lot, another (messy) approach would be to write a C extension
    > that uses a doubly linked list some day.


    That seems like an ideal implementation to me.

    O(1): appending, popping, searching, and deletion
    O(n): traversal
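    A pure-Python sketch of that design (my illustration, written with
    hindsight; `LinkedOrderedDict`, its `_root` sentinel, and `_map` are
    names I made up, though the layout matches what the stdlib
    OrderedDict later used): each key maps to a `[prev, next, key]` node
    on a circular list, so deletion is an O(1) unlink and traversal just
    follows the links.

    ```python
    # Hedged sketch of the doubly-linked-list approach (illustration,
    # not the proposed C extension).
    class LinkedOrderedDict(dict):
        def __init__(self):
            dict.__init__(self)
            self._root = root = []          # circular sentinel: [prev, next, key]
            root[:] = [root, root, None]
            self._map = {}                  # key -> its node

        def __setitem__(self, key, value):
            if key not in self:
                root = self._root
                last = root[0]
                node = [last, root, key]    # link new node before the sentinel
                last[1] = root[0] = self._map[key] = node
            dict.__setitem__(self, key, value)

        def __delitem__(self, key):
            dict.__delitem__(self, key)
            prev, nxt, _ = self._map.pop(key)
            prev[1] = nxt                   # O(1) unlink, no scanning
            nxt[0] = prev

        def __iter__(self):
            node = self._root[1]
            while node is not self._root:
                yield node[2]
                node = node[1]

    d = LinkedOrderedDict()
    for k in 'xyz':
        d[k] = True
    del d['y']
    print(list(d))  # ['x', 'z']
    ```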

    Raymond
     
    Raymond Hettinger, Feb 27, 2009
    #10
  11. Raymond Hettinger:
    >Paul Rubin:
    >>another (messy) approach would be to write a C
    >>extension that uses a doubly linked list some day.

    >
    > That seems like an ideal implementation to me.


    This was my Python implementation, where deletion too is O(1), but
    it's slow:
    http://code.activestate.com/recipes/498195/

    I think the C extension with the doubly linked list is the best
    approach.
    Note that in modern CPUs, caches can change the performance a lot,
    so reducing memory use is very important. Using the XOR (or
    subtraction) trick to store only one pointer per key-value pair may
    help keep performance not too far from that of normal Python dicts:
    http://en.wikipedia.org/wiki/XOR_linked_list

    Bye,
    bearophile
     
    , Feb 27, 2009
    #11
  12. bearophile writes:
    > So using the XOR (or subtraction) trick to use only 1 pointer for
    > each key-value may be useful to keep performance not too much far
    > from the normal python dicts: http://en.wikipedia.org/wiki/XOR_linked_list


    I don't see how to delete a randomly chosen node if you use that
    trick, since the hash lookup doesn't give you two consecutive nodes
    in the linked list to xor together. Maybe you could do something
    intricate with skip lists and end up with O(log n) deletion.
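    The traversal half of the trick does work; a toy version over array
    indices (illustration only) shows both why iteration is fine and why
    arbitrary deletion is not:

    ```python
    # Toy XOR-linked list over array indices. Each node stores
    # link = prev_index ^ next_index, with -1 as the off-the-end
    # sentinel, so one field serves as both pointers.
    SENT = -1
    nodes = [
        ('a', SENT ^ 1),   # prev = SENT, next = 1
        ('b', 0 ^ 2),      # prev = 0,    next = 2
        ('c', 1 ^ SENT),   # prev = 1,    next = SENT
    ]

    def traverse():
        # Iteration works because we carry the previous index along,
        # recovering the next index from the combined link.
        order = []
        prev, cur = SENT, 0
        while cur != SENT:
            value, link = nodes[cur]
            order.append(value)
            prev, cur = cur, link ^ prev
        return order

    print(traverse())  # ['a', 'b', 'c']

    # But deleting, say, 'b' given only its index is not O(1):
    # recovering its neighbors requires already knowing prev -- exactly
    # the information a hash lookup does not provide.
    ```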
     
    Paul Rubin, Feb 27, 2009
    #12
  13. Paul Rubin:
    >I don't see how to delete a randomly chosen node if you use that trick, since the hash lookup doesn't give you two consecutive nodes in the linked list to xor together.<


    Thank you, I think you are right; I am sorry.
    So on 32-bit CPUs you need to add 8 bytes to each value.

    On 64-bit CPUs you may use a doubly indexed list instead of a doubly
    linked list. Each index can be stored in 6 bytes instead of 8
    (281_474_976_710_656 buckets may be enough for most people), so you
    need only 12 bytes for two indexes instead of 16. This helps the L1
    cache (and the small L2 cache on the Core i7) a bit, but it may slow
    down the ordered iteration too much.

    Bye,
    bearophile
     
    , Feb 27, 2009
    #13
  14. bearophile wrote:
    > Paul Rubin:
    >> I don't see how to delete a randomly chosen node if you use that trick, since the hash lookup doesn't give you two consecutive nodes in the linked list to xor together.<

    >
    > Thank you, I think you are right; I am sorry.
    > So on 32-bit CPUs you need to add 8 bytes to each value.
    >
    > On 64-bit CPUs you may use a doubly indexed list instead of a doubly
    > linked list. Each index can be stored in 6 bytes instead of 8
    > (281_474_976_710_656 buckets may be enough for most people), so you
    > need only 12 bytes for two indexes instead of 16. This helps the L1
    > cache (and the small L2 cache on the Core i7) a bit, but it may slow
    > down the ordered iteration too much.
    >

    A sort of premature pessimization, then.

    regards
    Steve
    --
    Steve Holden +1 571 484 6266 +1 800 494 3119
    Holden Web LLC http://www.holdenweb.com/
     
    Steve Holden, Feb 27, 2009
    #14
  15. Steve Holden:
    > A sort of premature pessimization, then.


    Maybe not: the memory savings may lead to higher speed anyway, so
    you need to test to know the overall balance. And in general-purpose
    data structures you want all the speed you can get.

    Bye,
    bearophile
     
    , Feb 27, 2009
    #15
  16. [Steve Holden]
    > A sort of premature pessimization, then.


    QOTW!

    _ ~
    @ @
    \_/


    Raymond
     
    Raymond Hettinger, Feb 28, 2009
    #16
  17. [Steve Holden]
    > A sort of premature pessimization, then.


    QOTW!

    +1

    Malcolm
     
    , Feb 28, 2009
    #17
  18. Raymond Hettinger wrote:
    > [Paul Rubin]
    >> Ehh, I guess I'm not surprised at the slowdown and extra complexity
    >> from the second dict. Oh well. If the module really turns out to be
    >> used a lot, another (messy) approach would be to write a C extension
    >> that uses a doubly linked list some day.

    >
    > That seems like an ideal implementation to me.
    >
    > O(1): appending, popping, searching, and deletion
    > O(n): traversal
    >
    > Raymond


    Sometimes, it's useful to be able to obtain the data in the sorted
    sequence.

    You might consider adding functionality like:

    def seqItems(self):
        '''To return the items, sorted by key.'''
        return [self[k] for k in self.seqKeys()]

    def seqKeys(self):
        '''To return the keys, sorted.'''
        k = self.keys()
        k.sort()
        return k

    def seqValues(self):
        '''To return the values, with their keys, sorted by value.'''
        v = [(it[1], it[0]) for it in self.items()]
        v.sort()
        return v

    Colin W.
     
    Colin J. Williams, Feb 28, 2009
    #18
  19. Colin J. Williams wrote:

    > Sometimes, it's useful to be able to obtain the data in the
    > sorted sequence.
    >
    > You might consider adding functionality like:
    >
    > def seqItems(self):
    >     '''To return the items, sorted by key.'''
    >     return [self[k] for k in self.seqKeys()]


    Amazingly, the Python time-machine provides such functionality! Using just a
    handful of pre-existing pieces, you can do this:

    sorted(mydict.items())
    sorted(mydict.keys())
    [mydict[x] for x in sorted(mydict.keys)]

    and any other sequence you might need. That's the beauty of a general
    purpose programming language. You don't need everything to be a built-in
    *wink*



    --
    Steven
     
    Steven D'Aprano, Feb 28, 2009
    #19
  20. Steven D'Aprano wrote:
    > Colin J. Williams wrote:
    >
    >> Sometimes, it's useful to be able to obtain the data in the
    >> sorted sequence.
    >>
    >> You might consider adding functionality like:
    >>
    >> def seqItems(self):
    >>     '''To return the items, sorted by key.'''
    >>     return [self[k] for k in self.seqKeys()]

    >
    > Amazingly, the Python time-machine provides such functionality! Using just a
    > handful of pre-existing pieces, you can do this:
    >
    > sorted(mydict.items())
    > sorted(mydict.keys())
    > [mydict[x] for x in sorted(mydict.keys)]
    >
    > and any other sequence you might need. That's the beauty of a general
    > purpose programming language. You don't need everything to be a built-in
    > *wink*
    >
    >

    Thanks, you are right; that's a neat way of handling the first two,
    and they work with an instance of Raymond Hettinger's OrderedDict.

    The third suggestion has a couple of problems, which are fixed below.

    if __name__ == '__main__':
        d = OrderedDict.fromkeys('abracadabra', value='zzz')
        print(d)
        print(d.seqKeys())
        print(d.seqItems())
        print(d.seqValues())
        # Steven D'Aprano's alternative
        mydict = d.copy()
        print sorted(mydict.items())
        print sorted(mydict.keys())
        # print [mydict[x] for x in sorted(mydict.keys)]  # instance object is not iterable
        print(sorted(iter([(x[1], x[0]) for x in mydict.iteritems()])))  # This works

    Colin W.
     
    Colin J. Williams, Mar 1, 2009
    #20
