general problem when subclassing a built-in class


K

kj

Suppose that you want to implement a subclass of built-in class, to
meet some specific design requirements.

Where in the Python documentation can one find the information
required to determine the minimal[1] set of methods that one would
need to override to achieve this goal?

In my experience, educated guesswork doesn't get one very far with
this question.

Here's a *toy example*, just to illustrate this last point.

Suppose that one wants to implement a subclass of dict, call it
TSDict, to meet these two design requirements:

1. for each one of its keys, an instance of TSDict should keep a
timestamp (as given by time.time, and accessible via the new method
get_timestamp(key)) of the last time that the key had a value
assigned to it;

2. other than the added capability described in (1), an instance
of TSDict should behave *exactly* like a built-in dictionary.

In particular, we should be able to observe behavior like this:
d = TSDict((('uno', 1), ('dos', 2)), tres=3, cuatro=4)
d['cinco'] = 5
d {'cuatro': 4, 'dos': 2, 'tres': 3, 'cinco': 5, 'uno': 1}
d.get_timestamp('uno')
1293046586.644436


OK, here's one strategy, right out of OOP 101:

#---------------------------------------------------------------------------
from time import time

class TSDict(dict):
def __setitem__(self, key, value):
# save the value and timestamp for key as a tuple;
# see footnote [2]
dict.__setitem__(self, key, (value, time()))

def __getitem__(self, key):
# extract the value from the value-timestamp pair and return it
return dict.__getitem__(self, key)[0]

def get_timestamp(self, key):
# extract the timestamp from the value-timestamp pair and return it
return dict.__getitem__(self, key)[1]
#---------------------------------------------------------------------------

This implementation *should* work (again, at least according to
OOP 101), but, in fact, it doesn't come *even close*:
d = TSDict((('uno', 1), ('dos', 2)), tres=3, cuatro=4)
d['cinco'] = 5
d {'cuatro': 4, 'dos': 2, 'tres': 3, 'cinco': (5, 1293059516.942985), 'uno': 1}
d.get_timestamp('uno')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/tmp/tsdict.py", line 23, in get_timestamp
return dict.__getitem__(self, key)[1]
TypeError: 'int' object is not subscriptable


From the above you can see that TSDict fails at *both* of the design
requirements listed above: it fails to add a timestamp to all keys in
the dictionary (e.g. 'uno', ..., 'cuatro' didn't get a timestamp), and
get_timestamp bombs; and it also fails to behave in every other
respect exactly like a built-in dict (e.g., repr(d) reveals the
timestamps and how they are kept).


So back to the general problem: to implement a subclass of a built-in
class to meet a given set of design specifications. Where is the
documentation needed to do this without guesswork?

Given results like the one illustrated above, I can think of only two
approaches (other than scrapping the whole idea of subclassing a
built-in class in the first place):

1) Study the Python C source code for the built-in class in the hopes
of somehow figuring out what API methods need to be overridden;

2) Through blind trial-and-error, keep trying different implementation
strategies and/or keep overriding additional built-in class methods
until the behavior of the resulting subclass approximates
sufficiently the design specs.

IMO, both of these approaches suck. Approach (1) would take *me*
forever, since I don't know the first thing about Python's internals,
and even if I did, going behind the documented API like that would
make whatever I implement very likely to break with future releases of
Python. Approach (2) could also take a very long time (probably much
longer than the implementation would take if no guesswork was
involved), but worse than that, one would have little assurance that
one's experimentation has truly uncovered all the necessary details;
IME, programming-by-guesswork leads to numerous and often nasty bugs.

Is there any other way?

TIA!

~kj


[1] The "minimal" bit in the question statement is just another way of
specifying a "maximal" reuse of the built-in's class code.

[2] For this example, I've accessed the parent's methods directly
through dict rather than through super(TSDict, self), just to keep
the code as uncluttered as possible, but the results are the same
if one uses super.
 
Ad

Advertisements

O

Owen Jacobson

Suppose that you want to implement a subclass of built-in class, to
meet some specific design requirements.

Where in the Python documentation can one find the information
required to determine the minimal[1] set of methods that one would
need to override to achieve this goal?

(Rest elided.)

The short answer is that you can't everything you want.

If some behaviour or implementation detail isn't documented, you can't
make any assumptions about how a class works, even if you really want
to. Whatever you figure out by "educated guesswork" (or, even better,
reading the source) is, absent documentation, not guaranteed, and
behaviour that's not guaranteed can and will change whenever the code's
maintaners feel like it. This is true for for builtins, for things in
the standard library, and for things in competently-maintained third
party libraries.

Requiring that every class document its internals extensively enough to
permit subclasses to modify behaviour in totally arbitrary ways has to
be balanced against allowing each class's interface to usefully
abstract its internal workings. Especially in a duck-typed language
like Python, there's very little need for subclassing simply to
intercept one or two method calls, and the resulting composition-based
system is both more flexible and, being based on documented and stable
interfaces, easier to maintain and support.

Subclassing is a powerful tool, not suited to every design problem. In
your case (tracking timestamps in a dict-like container), you're
probably better off implementing your own dict-like container, with
support for as much or as little of the protocol you want. You can do
that without subclassing dict: the protocol for dict is not difficult
to implement, especially if your backing storage is a real dict.

What forces do you imagine require you to use a subclass for this?

-o
 
Ad

Advertisements

B

bill

In said:
Where in the Python documentation can one find the information
required to determine the minimal[1] set of methods that one would
need to override to achieve this goal?
I don't know the answer to that question, but imho it's
the wrong question to ask. Instead you should be asking what
design will let me implement the subclass making the fewest possible
assumptions about the superclass. (hint: the design you're
thinking about leads to a world of pain.)
2. other than the added capability described in (1), an instance
of TSDict should behave *exactly* like a built-in dictionary.

(minor point: what about repr? it shouldn't look like a dict, imho)
#---------------------------------------------------------------------------
from time import time
class TSDict(dict):
def __setitem__(self, key, value):
# save the value and timestamp for key as a tuple;
# see footnote [2]
dict.__setitem__(self, key, (value, time()))
def __getitem__(self, key):
# extract the value from the value-timestamp pair and return it
return dict.__getitem__(self, key)[0]
def get_timestamp(self, key):
# extract the timestamp from the value-timestamp pair and return it
return dict.__getitem__(self, key)[1]
#---------------------------------------------------------------------------


man, you're asking for trouble! even if you knew all you want to know,
unless you had some guarantee that it wouldn't change in a later release,
you'd still have to override pretty much all the methods. the problem
here is not lack of information, but a horrible design (sorry to be blunt).

Just keep the timestamps in a parallel dict and get on with life:

from time import time
from collections import MutableMapping

class TSDict2(dict, MutableMapping):
def __init__(self, *args, **kwargs):
dict.__init__(self, *args, **kwargs)
self.__ts = dict.fromkeys(self.keys(), time())

def __setitem__(self, key, value):
self.__ts[key] = time()
dict.__setitem__(self, key, value)

setdefault = MutableMapping.setdefault
update = MutableMapping.update

def get_timestamp(self, key):
if self.has_key(key):
return self.__ts[key]
raise KeyError(key)


that just took the time needed to type it and worked the first time:
d = TSDict2((('uno', 1), ('dos', 2)), tres=3, cuatro=4); d['cinco'] = 5; d.setdefault('six', 6); 6
d.update((('tres', 'amigos'), ('seven', 7),), eight=8); d {'seven': 7, 'six': 6, 'cuatro': 4, 'cinco': 5, 'eight': 8, 'dos': 2, 'tres': 'amigos', 'uno': 1}
for p in sorted([(k, d.get_timestamp(k)) for k in d.keys()], key=lambda p: p[1]): print "%s\t%f" % p
....
cuatro 1293131496.917215
dos 1293131496.917215
uno 1293131496.917215
cinco 1293131496.917233
six 1293131496.917279
tres 1293131501.676962
seven 1293131501.676974
eight 1293131501.676981


If you insist on bashing your skull against your original problem,
take a look at collections.OrderedDict or collections.Counter to
see how they use ABCs to tame dict.

(still, even if you used all the remaining available
MutableMapping methods in your class, i don't know how you'd
get the dict constructor to return the right value, ie no timestamps,
when you pass it an
instance of your subclass as argument. i don't think
there's a TSDict.__method__ you can write for that... anyway
my TSDict2 doesn't have this problem either.)

take home message: respect the privacy of your superclasses and they'll
be nice to you.

--bill
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top