(I've been following this sub-thread, but I can't find a suitable place to jump
aboard, so I'll reply here)
Chris said:
I've been struggling to understand your position in this. The following is (in
part) an attempt to recover the argument that must underlie it. I'm reasonably
happy that it makes sense, but am less confident that it adequately captures
your argument -- not least because I'm unable to recover a theoretical
justification for an emphasis on immutability...
Starting in the abstract (not Java specific, and not assuming OO, although I
will use the word "objects" -- for lack of anything better).
We want to be able to express the notion of The One True Equality: that two
"objects" are truly equal if they cannot be distinguished in any way, and so
that "they" are fully interchangeable (since if we could distinguish between
them by using one of them in context, rather than the other, then they wouldn't
be interchangeable, and we'd have a test that could distinguish between them).
Call that "unconditional equivalence" or just "equivalence" (since the proper
application of the word "equality" is part of what's under discussion in this
thread).
Now let's introduce a little OO into the mix. That means that we get the idea
of object identity to play with. I don't yet want to make any assumptions
about the proper relationship between unconditional equivalence and object
identity except to note that obviously an object is "equivalent" to itself (we
can say that now that we've introduced objects and so can use concepts like
"itself").
Now let's introduce a new concept. Unfortunately I can't think of a good name
for it that doesn't bring conceptual baggage with it (and precisely the
conceptual baggage that I'm attempting to understand), so I'll use a bland
word. Some objects are (or play a role that is) "special", in that no other
object is allowed to play that role unless it is unconditionally equivalent to
the first. An example of this is any instance of java.lang.Class, say
java.lang.String.class. That object is "special" in that any "other" object
that purports to play that role /must/ be indistinguishable from it by any
possible test (that's expressible in Java).
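To make that concrete, here's a tiny Java sketch (assuming everything is
loaded by a single class loader) showing that every route to "the String
class" hands you the same, indistinguishable object:

```java
public class SpecialDemo {
    public static void main(String[] args) {
        // Within one class loader, every way of naming "the String class"
        // yields the very same object -- no test can tell "them" apart.
        Class<?> a = String.class;
        Class<?> b = "hello".getClass();
        System.out.println(a == b);       // true: one object plays the role
        System.out.println(a.equals(b));  // true: trivially, since a == b
    }
}
```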
Note that being "special" does not in any way imply being immutable. Also note
that I have not yet made any connection between what it is to be "special" and
the concept of object identity.
[As an almost completely irrelevant aside -- I'm not sure that "immutable" is
even a first rank OO concept, at least not in the hard-line version of OO that
I preach. "state" (and hence [im]mutability) is merely part of the
implementation of an object, and hence not to be accorded importance comparable
with that of identity and behaviour. "state" is just one of the ways that we
achieve behaviour that differs polymorphically between objects and changes
across time.]
Now for another concept that I don't have a theory-neutral word for. Call an
object "content defined" if it has the property that once you know a certain
amount about it, then you know /everything/. Intuitively you could say that if
two "objects" have the same state then they are unconditionally equivalent.
(But that is a bad approximation -- see the above caveat about "state" -- a
better formulation would probably talk about there being a fixed (in advance)
algorithm that compared finitely many behaviours of the "two" objects, and
which if it could not find a difference, then that would imply that no other
behaviour could ever exhibit any difference.) One example of this might be a
document (a PDF or similar). Once you know how many bytes it has (or
characters if you prefer) and which bytes they are, then (in many contexts)
you know /everything/ about that document -- there is simply no room for two
documents to have the same contents but nevertheless be "different". (And if
the documents are stored in a filesystem in such a way that the same document
may appear as (i.e. "in") more than one file, then that may well lead to
semantic errors since the semantics of "files" don't correctly represent those
of "documents".)
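As a sketch of that (the Document class here is hypothetical, invented just
for illustration), a "content defined" document in Java might compare nothing
but its bytes:

```java
import java.util.Arrays;

// Hypothetical "content defined" document: once you know the bytes,
// you know /everything/ -- equals() and hashCode() look at nothing else.
final class Document {
    private final byte[] bytes;

    Document(byte[] bytes) { this.bytes = bytes.clone(); }

    @Override public boolean equals(Object o) {
        return o instanceof Document
            && Arrays.equals(bytes, ((Document) o).bytes);
    }
    @Override public int hashCode() { return Arrays.hashCode(bytes); }
}
```

Two Document instances built from the same bytes compare equal even though
they are distinct objects -- there is simply no room for them to differ.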
Note that being "content defined" does not imply immutability. (Which is one
reason why I didn't call the property having "value semantics", since that --
to me -- does imply immutability. The other reason is that "value semantics"
may be taken as implying a definite stance w.r.t. object identity -- which I'm
still trying to avoid.) OTOH, to be "content defined" /is/ to be
"special" -- it's the limit case where the external part of the role of the
object (that contributes to "special"-ness) has shrunk to zero.
[BTW. I'm sorry that this is so slow and long-winded. I'm inching through it
because I think this area is something of a conceptual minefield, and I'd like
to retain the use of /both/ my legs ;-)]
That -- at last -- is the end of the groundwork. Now we can consider how to
implement these concepts, and specifically how to implement them in the world
of OO.
AFAIK, there are just three ways that we can implement "special"-ness. None is
perfect, and the three techniques have different trade-offs between them:
a) The VM itself can understand this concept and provide the correct semantics
directly.
b) We can implement "special" by /imposing/ uniqueness -- using object identity
to implement unconditional equivalence.
c) We can implement "special" by providing a customised version of the test for
equivalence, one that understands that two distinct objects can nevertheless be
considered to be equivalent. In this case you could say that we are
representing the abstract "special" object by an equivalence class of actual
objects.
Option (a) requires VM support, and is even then only applicable to certain
cases. Option (b) may require more housekeeping effort than it's worth (e.g.
you might have to maintain some sort of weak collection of existing objects,
and check, whenever a new object is created, that you aren't creating
illegitimate duplicates). Option (c) suffers from the problem that object
identity /isn't/ hidden. Not only does it "leak out" through == comparison (in
Java) but also through (at least) identity hashing, weak collections,
finalisation, synchronisation, and the operations wait() and
notify()/notifyAll().
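Here's a minimal sketch of the housekeeping that (b) involves -- a
canonicalising factory (the Token class and its pool are hypothetical; a
production version would hold the pool through weak references so that unused
canonical objects could be collected):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of technique (b): impose uniqueness via a canonicalising
// factory, so that == coincides with unconditional equivalence.
// (Real housekeeping would use weak references to avoid leaking the pool.)
final class Token {
    private static final Map<String, Token> POOL = new ConcurrentHashMap<>();
    private final String name;

    private Token(String name) { this.name = name; }  // no public constructor

    static Token of(String name) {
        // Either hand back the existing canonical object, or create
        // the one-and-only instance for this name.
        return POOL.computeIfAbsent(name, Token::new);
    }
    @Override public String toString() { return name; }
}
```

With this in place, Token.of("x") == Token.of("x") always holds, and plain
identity comparison implements the One True Equality for Tokens.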
I want to emphasise that (at least as I see it) these are /implementation/
techniques. Different ways of attempting to map our abstract notion of what it
is to be "special" onto the real behaviour of concrete objects.
[As yet another aside, and in an attempt to bolster the above assertion, I'll
mention that Smalltalk makes use of all three techniques to implement its
equivalents of Java's primitive types. The number 122 is undoubtedly
"special" -- you don't want there to be two distinguishable 122s in a system --
but that can be implemented in various ways. The Smalltalk implementation I
work with uses (a) for integers that can be expressed in 31 or less bits, and
uses (c) for integers outside that range and for floating point numbers. It
uses (b) for characters. This is an area where ST implementations vary, for
instance some use (a) for some floats and/or characters.]
Of course, the /big/ problem with option (c) is that it is incompatible with
object mutability. You might be able to live with the leakages I mentioned
above (if you are lucky and/or disciplined), but if you allow the objects to
change state in any detectable way, then you don't stand a hope of maintaining
the correct semantics for unconditional equivalence.
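A quick demonstration of that breakage, using a mutable List as an element of
a HashSet (both of which rely on content-based equals()/hashCode()):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class MutableEqualsDemo {
    public static void main(String[] args) {
        List<Integer> key = new ArrayList<>(List.of(1, 2));
        Set<List<Integer>> set = new HashSet<>();
        set.add(key);

        System.out.println(set.contains(List.of(1, 2)));    // true

        key.add(3);  // a detectable state change...

        // ...and the "equivalence" collapses: the element is stranded in
        // the bucket for its old hash, so neither lookup can find it.
        System.out.println(set.contains(List.of(1, 2, 3))); // false
        System.out.println(set.contains(List.of(1, 2)));    // false
    }
}
```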
One thing I want to emphasise a little is that although (c) is incompatible
with mutability, (b) is an implementation of the same concept that is not.
I.e. being "content defined" does not in itself imply immutability
(once some niggles about mutating one object to be "the same" as another are
ironed out). And of course it is perfectly possible to be "special" and
mutable.
Chris Smith mentioned the non-OO, or at best pseudo-OO nature of pure values.
If you have such concepts that you wish to map onto an OO implementation then
"they" are certainly "content defined" (in my sense) and hence "special". In
such cases (c) is a decent implementation choice -- you are unlikely to want to
change the state of such objects (what would that even mean ?) and the way that
(c) subverts the object-nature of the thing represented would not be a
significant problem when you aren't trying to represent some "thing" in the
first place. (Actually (a) might be even better, but that's not an available
option in Java.)
But (b) is still a valid alternative design, even for implementing value
semantics (it would amount to doing some sort of interning of the "values").
And for the wider case where all that's required is "content defined" it may
even be a better choice (at least if the content is also immutable). In the
wider still case where we're trying to implement equivalence for "special"
objects, technique (b) may be the only possible option.
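Java's own libraries do exactly this kind of interning in places.
Integer.valueOf() is documented to cache -- and so return identical objects
for -- values in at least the range -128..127, and String.intern() applies
the same idea on demand:

```java
public class InternDemo {
    public static void main(String[] args) {
        // Integer.valueOf is specified to cache at least -128..127,
        // so within that range the "values" are interned and == works.
        Integer a = Integer.valueOf(122);
        Integer b = Integer.valueOf(122);
        System.out.println(a == b);      // true: one canonical 122

        // String interning is the same technique, applied explicitly:
        // intern() hands back the pool's canonical copy.
        String s = new String("abc").intern();
        System.out.println(s == "abc");  // true: literals are pre-interned
    }
}
```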
Digressing slightly: how should "equivalence" be spelled in Java ? It doesn't
seem to me that Java serves us very well here. Ideally, the One True Equality
ought to be spelled '==' (or '=' but that's out for obvious reasons). Any
other spelling is likely to lead to fragile code -- it's simply not reasonable
to expect programmers to remember to call a method with a name like
equivalent() or equals() every time. So, ideally we would have some sort of =
operator that would expand out to a VM primitive when technique (a) was in use
(and for primitive types -- which are a sort of special-case of (a) anyway).
It would expand out to an identity comparison when (b) was used:
return (this issameobjectas that);
and would expand out to a comparison of the contents of "content defined"
objects (with maybe an identity comparison first as an optimisation) if they
were implemented using (c).
Of course, Java doesn't allow that. So we are stuck with trying to shoe-horn
our One True Equality into one or the other of:
explicit identity comparison
a call to equals()
(we could also define our own unconditionallyEquivalentTo() method, but I don't
think that offers much in the way of practical advantages over plain old
equals()). It's tempting to say that all comparison must be done via the
equals() method, since that can (internally) use a field-by-field comparison or
an identity comparison as appropriate to the implementation (b) or (c). But --
even without the problems caused by nulls -- I don't think that's very
practical. It seems (on the whole, and as the lesser of two evils) better to
lift the skirt of the implementation a little and allow programmers to "know"
which operation they are supposed to use to compare which objects.
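For what it's worth, the conventional shape of an equals() under technique (c)
already lifts the skirt a little by itself: an identity fast path first, then
the content comparison (the Point class here is hypothetical, just for
illustration):

```java
import java.util.Objects;

// Hypothetical "content defined" class showing the conventional
// equals() shape: identity short cut, type check, then content comparison.
final class Point {
    private final int x, y;

    Point(int x, int y) { this.x = x; this.y = y; }

    @Override public boolean equals(Object o) {
        if (this == o) return true;            // (b)-style identity fast path
        if (!(o instanceof Point)) return false;
        Point p = (Point) o;
        return x == p.x && y == p.y;           // (c)-style content comparison
    }
    @Override public int hashCode() { return Objects.hash(x, y); }
}
```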
BTW, I've restricted this discussion to the case where our notion of
equivalence has some non-trivial structure w.r.t. the notion of object identity.
In the most general case the only object that is unconditionally equivalent to
(cannot possibly be distinguished from) some object, is that object itself. So,
in general, unconditional equivalence degenerates (uncontentiously) to object
identity. The interesting cases are the ones where the semantics allow/require
a less degenerate solution.
Anyway, at the end of all that, I can (to a degree) accept that:
value semantics -> non-mutable -> override (and use) equals()
but I think it's also important to note that:
a) that's not the only possible implementation of value semantics.
b) value semantics is not the only legitimate application of that
implementation technique.
=========
Attempting to relate this stuff to the ostensible topic of the thread ;-) I
think that any object that is "special" /can/ legitimately implement clone() as
"return this;". Implementations that use (b) /must/ implement it like that, but an
implementation that uses (c) may also choose to do so for performance reasons
(it's a legitimate optimisation since "special" objects implemented using (c)
are necessarily immutable).
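A sketch of that optimisation (the Colour class is hypothetical): since the
object is immutable, no caller can ever detect that clone() didn't really copy
anything:

```java
// Hypothetical immutable, "content defined" class. Because no state can
// ever change, handing back the same object from clone() is undetectable.
final class Colour implements Cloneable {
    private final int rgb;

    Colour(int rgb) { this.rgb = rgb; }

    @Override public Colour clone() {
        return this;  // legitimate: no test can distinguish this from a copy
    }
    @Override public boolean equals(Object o) {
        return o instanceof Colour && rgb == ((Colour) o).rgb;
    }
    @Override public int hashCode() { return rgb; }
}
```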
=========
Now, to return to a couple of subjects that I skimped in the foregoing. I
rather assumed two things: one is that there is a need for a One True Equals
test at all; the second is that the method named equals() is intended to
correspond to that condition.
Let's define another comparison: "pretty similar" -- two objects will be
"pretty similar" iff (according to the class designer's intuition) the two
objects would be considered to be "the same" in many contexts. Being "the
same" is obviously application and context dependent; being "pretty similar" is
a specific, hardwired, choice of some context(s) as most significant or most
useful.
Obviously there's a progression of comparisons:
Object identity => unconditional equivalence => pretty similar(ity)
(where '=>' is logical implication). Similarly, the natural /default/
implementation of "pretty similar" is "unconditional equivalence", and that of
"unconditional equivalence" is object identity.
Incidentally, given those natural defaults, it is impossible to tell from the
actual implementations of equals() in the standard libraries whether they are
intended to implement equivalence or mere similarity. Is Object.equals() an
implementation of "pretty similar" (implemented as a fallback to equivalence),
or is it an implementation of "unconditional equivalence" that is mistakenly
overridden by implementations that look more like "pretty similar"?
I would like to claim that equals() should not in fact be viewed as an
implementation of "unconditional equivalence", and that it is easier to
understand and use if it is seen as meaning "pretty similar" (noting, of
course, that the two notions are often identical).
For instance, Chris Smith mentions the example of Date. If I'm interpreting
him correctly (and not reading too much into a throwaway remark) he sees Date
as an attempt to implement value semantics using the technique I've labelled
(c) (overriding equals() to conceal the difference between distinct objects
that represent the same date). Seen from that POV, Date is broken because
although it has a correct implementation of "unconditional equivalence" as
equals(), the objects are mutable, which means that implementation technique (c)
is not a valid way to represent "content defined" values. I don't dispute that
interpretation, but I also don't think it's the best interpretation to make. If
instead you see Dates as "just plain objects" (not value objects, not "content
defined"), then "unconditional equivalence" for Dates degenerates to object
identity (so there's no problem expressing that if/when we wish), and the
natural (or at least /a/ natural) interpretation of "pretty similar" is exposed
by the equals() method. Seen that way, Dates are full objects (not
pseudo-objects, i.e. their object nature is not merely an artefact of the way
Java works) -- which is certainly how /I/ see Dates -- there's nothing
particularly wrong with them being mutable (it may be a poor design, but it's
not a semantic error), and nothing is broken about the equals() method;
indeed, it does just what you'd expect.
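For the record, this is the concrete behaviour both readings have to account
for:

```java
import java.util.Date;

public class DateDemo {
    public static void main(String[] args) {
        Date d1 = new Date(0);
        Date d2 = new Date(0);

        System.out.println(d1 == d2);       // false: distinct objects
        System.out.println(d1.equals(d2));  // true: "pretty similar" right now

        d1.setTime(1_000L);                 // Dates are mutable...
        System.out.println(d1.equals(d2));  // false: the similarity was transient
    }
}
```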
OTOH, seeing things like this doesn't only lead to a new tolerance for the
standard library -- it does suggest criticisms too. If equals() is seen as
being the One True Equals, then it is at least understandable (if not
forgivable) that the standard collections do not always provide a way to use an
alternate comparison criterion. If you see it my way (which I think is "pretty
similar" to Thomas's position) then it becomes harder to understand why nobody
at Sun (or Apache) has apparently seen the need for a HashSet (etc) with a
pluggable "equality" criterion.
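Such a thing is not hard to sketch, either. Here's one hypothetical shape for
it (all the names are invented): wrap each element in a key that delegates
equals() and hashCode() to user-supplied functions, and keep the keys in an
ordinary HashSet:

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.BiPredicate;
import java.util.function.ToIntFunction;

// Hypothetical sketch of a set with a pluggable "equality" criterion:
// each element is wrapped in a Key whose equals()/hashCode() delegate
// to the functions supplied at construction time.
final class PluggableSet<E> {
    private final BiPredicate<E, E> eq;
    private final ToIntFunction<E> hash;
    private final Set<Key> keys = new HashSet<>();

    PluggableSet(BiPredicate<E, E> eq, ToIntFunction<E> hash) {
        this.eq = eq;
        this.hash = hash;
    }

    boolean add(E e)      { return keys.add(new Key(e)); }
    boolean contains(E e) { return keys.contains(new Key(e)); }

    private final class Key {
        final E value;
        Key(E value) { this.value = value; }

        @Override public int hashCode() { return hash.applyAsInt(value); }
        @Override @SuppressWarnings("unchecked")
        public boolean equals(Object o) {
            return o instanceof PluggableSet.Key
                && eq.test(value, ((Key) o).value);
        }
    }
}
```

For example, constructed with String::equalsIgnoreCase and a case-folding
hash function, such a set would treat "Foo" and "FOO" as the same element --
a comparison criterion that plain HashSet simply cannot express.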
[After all those words I feel I should end up with some sort of summary or
maybe draw a moral -- but I can't think of one...]
-- chris