What's the cleanest way to compare 2 dictionary?

J

John Henry

Hi list,

I am sure there are many ways of doing comparision but I like to see
what you would do if you have 2 dictionary sets (containing lots of
data - like 20000 keys and each key contains a dozen or so of records)
and you want to build a list of differences about these two sets.

I like to end up with 3 lists: what's in A and not in B, what's in B
and not in A, and of course, what's in both A and B.

What do you think is the cleanest way to do it? (I am sure you will
come up with ways that astonishes me :=) )

Thanks,
 
P

Paddy

John said:
Hi list,

I am sure there are many ways of doing comparision but I like to see
what you would do if you have 2 dictionary sets (containing lots of
data - like 20000 keys and each key contains a dozen or so of records)
and you want to build a list of differences about these two sets.

I like to end up with 3 lists: what's in A and not in B, what's in B
and not in A, and of course, what's in both A and B.

What do you think is the cleanest way to do it? (I am sure you will
come up with ways that astonishes me :=) )

Thanks,
I make it 4 bins:
a_exclusive_keys
b_exclusive_keys
common_keys_equal_values
common_keys_diff_values

Something like:

a={1:1, 2:2,3:3,4:4}
b = {2:2, 3:-3, 5:5}
keya=set(a.keys())
keyb=set(b.keys())
a_xclusive = keya - keyb
b_xclusive = keyb - keya
_common = keya & keyb
common_eq = set(k for k in _common if a[k] == b[k])
common_neq = _common - common_eq


If you now simple set arithmatic, it should read OK.

- Paddy.
 
J

John Henry

Paddy said:
John said:
Hi list,

I am sure there are many ways of doing comparision but I like to see
what you would do if you have 2 dictionary sets (containing lots of
data - like 20000 keys and each key contains a dozen or so of records)
and you want to build a list of differences about these two sets.

I like to end up with 3 lists: what's in A and not in B, what's in B
and not in A, and of course, what's in both A and B.

What do you think is the cleanest way to do it? (I am sure you will
come up with ways that astonishes me :=) )

Thanks,
I make it 4 bins:
a_exclusive_keys
b_exclusive_keys
common_keys_equal_values
common_keys_diff_values

Something like:

a={1:1, 2:2,3:3,4:4}
b = {2:2, 3:-3, 5:5}
keya=set(a.keys())
keyb=set(b.keys())
a_xclusive = keya - keyb
b_xclusive = keyb - keya
_common = keya & keyb
common_eq = set(k for k in _common if a[k] == b[k])
common_neq = _common - common_eq


If you now simple set arithmatic, it should read OK.

- Paddy.

Thanks, that's very clean. Give me good reason to move up to Python
2.4.
 
J

John Henry

John said:
Paddy said:
John said:
Hi list,

I am sure there are many ways of doing comparision but I like to see
what you would do if you have 2 dictionary sets (containing lots of
data - like 20000 keys and each key contains a dozen or so of records)
and you want to build a list of differences about these two sets.

I like to end up with 3 lists: what's in A and not in B, what's in B
and not in A, and of course, what's in both A and B.

What do you think is the cleanest way to do it? (I am sure you will
come up with ways that astonishes me :=) )

Thanks,
I make it 4 bins:
a_exclusive_keys
b_exclusive_keys
common_keys_equal_values
common_keys_diff_values

Something like:

a={1:1, 2:2,3:3,4:4}
b = {2:2, 3:-3, 5:5}
keya=set(a.keys())
keyb=set(b.keys())
a_xclusive = keya - keyb
b_xclusive = keyb - keya
_common = keya & keyb
common_eq = set(k for k in _common if a[k] == b[k])
common_neq = _common - common_eq


If you now simple set arithmatic, it should read OK.

- Paddy.

Thanks, that's very clean. Give me good reason to move up to Python
2.4.

Oh, wait, works in 2.3 too.

Just have to:

from sets import Set as set
 
J

John Machin

John said:
Hi list,

I am sure there are many ways of doing comparision but I like to see
what you would do if you have 2 dictionary sets (containing lots of
data - like 20000 keys and each key contains a dozen or so of records)
and you want to build a list of differences about these two sets.

I like to end up with 3 lists: what's in A and not in B, what's in B
and not in A, and of course, what's in both A and B.

What do you think is the cleanest way to do it? (I am sure you will
come up with ways that astonishes me :=) )

Paddy has already pointed out a necessary addition to your requirement
definition: common keys with different values.

Here's another possible addition: you say that "each key contains a
dozen or so of records". I presume that you mean like this:

a = {1: ['rec1a', 'rec1b'], 42: ['rec42a', 'rec42b']} # "dozen" -> 2 to
save typing :)

Now that happens if the other dictionary contains:

b = {1: ['rec1a', 'rec1b'], 42: ['rec42b', 'rec42a']}

Key 42 would be marked as different by Paddy's classification, but the
values are the same, just not in the same order. How do you want to
treat that? avalue == bvalue? sorted(avalue) == sorted(bvalue)? Oh, and
are you sure the buckets don't contain duplicates? Maybe you need
set(avalue) == set(bvalue). What about 'rec1a' vs 'Rec1a' vs 'REC1A'?

All comparisons are equal, but some comparisons are more equal than
others :)

Cheers,
John
 
J

John Henry

John,

Yes, there are several scenerios.

a) Comparing keys only.

That's been answered (although I haven't gotten it to work under 2.3
yet)

b) Comparing records.

Now it gets more fun - as you pointed out. I was assuming that there
is no short cut here. If the key exists on both set, and if I wish to
know if the records are the same, I would have to do record by record
comparsion. However, since there are only a handful of records per
key, this wouldn't be so bad. Maybe I just overload the compare
operator or something.

John said:
John said:
Hi list,

I am sure there are many ways of doing comparision but I like to see
what you would do if you have 2 dictionary sets (containing lots of
data - like 20000 keys and each key contains a dozen or so of records)
and you want to build a list of differences about these two sets.

I like to end up with 3 lists: what's in A and not in B, what's in B
and not in A, and of course, what's in both A and B.

What do you think is the cleanest way to do it? (I am sure you will
come up with ways that astonishes me :=) )

Paddy has already pointed out a necessary addition to your requirement
definition: common keys with different values.

Here's another possible addition: you say that "each key contains a
dozen or so of records". I presume that you mean like this:

a = {1: ['rec1a', 'rec1b'], 42: ['rec42a', 'rec42b']} # "dozen" -> 2 to
save typing :)

Now that happens if the other dictionary contains:

b = {1: ['rec1a', 'rec1b'], 42: ['rec42b', 'rec42a']}

Key 42 would be marked as different by Paddy's classification, but the
values are the same, just not in the same order. How do you want to
treat that? avalue == bvalue? sorted(avalue) == sorted(bvalue)? Oh, and
are you sure the buckets don't contain duplicates? Maybe you need
set(avalue) == set(bvalue). What about 'rec1a' vs 'Rec1a' vs 'REC1A'?

All comparisons are equal, but some comparisons are more equal than
others :)

Cheers,
John
 
P

Paddy

John said:
John said:
Hi list,

I am sure there are many ways of doing comparision but I like to see
what you would do if you have 2 dictionary sets (containing lots of
data - like 20000 keys and each key contains a dozen or so of records)
and you want to build a list of differences about these two sets.

I like to end up with 3 lists: what's in A and not in B, what's in B
and not in A, and of course, what's in both A and B.

What do you think is the cleanest way to do it? (I am sure you will
come up with ways that astonishes me :=) )

Paddy has already pointed out a necessary addition to your requirement
definition: common keys with different values.

Here's another possible addition: you say that "each key contains a
dozen or so of records". I presume that you mean like this:

a = {1: ['rec1a', 'rec1b'], 42: ['rec42a', 'rec42b']} # "dozen" -> 2 to
save typing :)

Now that happens if the other dictionary contains:

b = {1: ['rec1a', 'rec1b'], 42: ['rec42b', 'rec42a']}

Key 42 would be marked as different by Paddy's classification, but the
values are the same, just not in the same order. How do you want to
treat that? avalue == bvalue? sorted(avalue) == sorted(bvalue)? Oh, and
are you sure the buckets don't contain duplicates? Maybe you need
set(avalue) == set(bvalue). What about 'rec1a' vs 'Rec1a' vs 'REC1A'?

All comparisons are equal, but some comparisons are more equal than
others :)

Cheers,
John

Hi Johns,
The following is my attempt to give more/deeper comparison info.
Assume you have your data parsed and presented as two dicts a and b
each having as values a dict representing a record.
Further assume you have a function that can compute if two record level
dicts are the same and another function that can compute if two values
in a record level dict are the same.

With a slight modification of my earlier prog we get:

def komparator(a,b, check_equal):
keya=set(a.keys())
keyb=set(b.keys())
a_xclusive = keya - keyb
b_xclusive = keyb - keya
_common = keya & keyb
common_eq = set(k for k in _common if check_equal(a[k],b[k]))
common_neq = _common - common_eq
return (a_xclusive, b_xclusive, common_eq, common_neq)

a_xclusive, b_xclusive, common_eq, common_neq = komparator(a,b,
record_dict__equality_checker)

common_neq = [ (key,
komparator(a[key],b[key], value__equality_checker) )
for key in common_neq ]

Now we get extra info on intra record differences with little extra
code.

Look out though, you could get swamped with data :)

- Paddy.
 
J

John Machin

John said:
John,

Yes, there are several scenerios.

a) Comparing keys only.

That's been answered (although I haven't gotten it to work under 2.3
yet)

(1) What's the problem with getting it to work under 2.3?
(2) Why not upgrade?
b) Comparing records.

You haven't got that far yet. The next problem is actually comparing
two *collections* of records, and you need to decide whether for
equality purposes the collections should be treated as an unordered
list, an ordered list, a set, or something else. Then you need to
consider how equality of records is to be defined e.g. case sensitive
or not.
Now it gets more fun - as you pointed out. I was assuming that there
is no short cut here. If the key exists on both set, and if I wish to
know if the records are the same, I would have to do record by record
comparsion. However, since there are only a handful of records per
key, this wouldn't be so bad. Maybe I just overload the compare
operator or something.

IMHO, "something" would be better than "overload the compare operator".
In any case, you need to DEFINE what you mean by equality of a
collection of records, *then* implement it.

"only a handful":. Naturally 0 and 1 are special, but otherwise the
number of records in the bag shoudn't really be a factor in your
implementation.

HTH,
John
 
J

John Henry

John said:
(1) What's the problem with getting it to work under 2.3?
(2) Why not upgrade?

Let me comment on this part first, I am still chewing other parts of
your message.

When I do it under 2.3, I get:

common_eq = set(k for k in _common if a[k] == b[k])
^
SyntaxError: invalid syntax

Don't know why that is.

I can't upgrade yet. Some part of my code doesn't compile under 2.4
and I haven't got a chance to investigate further.
 
M

Marc 'BlackJack' Rintsch

John Henry said:
When I do it under 2.3, I get:

common_eq = set(k for k in _common if a[k] == b[k])
^
SyntaxError: invalid syntax

Don't know why that is.

There are no generator expressions in 2.3. Turn it into a list
comprehension::

common_eq = set([k for k in _common if a[k] == b[k]])

Ciao,
Marc 'BlackJack' Rintsch
 
J

John Henry

Thank you. That works.

John Henry said:
When I do it under 2.3, I get:

common_eq = set(k for k in _common if a[k] == b[k])
^
SyntaxError: invalid syntax

Don't know why that is.

There are no generator expressions in 2.3. Turn it into a list
comprehension::

common_eq = set([k for k in _common if a[k] == b[k]])

Ciao,
Marc 'BlackJack' Rintsch
 
P

Paddy

I have gone the whole hog and got something thats run-able:

========dict_diff.py=============================

from pprint import pprint as pp

a = {1:{'1':'1'}, 2:{'2':'2'}, 3:dict("AA BB CC".split()), 4:{'4':'4'}}
b = { 2:{'2':'2'}, 3:dict("BB CD EE".split()), 5:{'5':'5'}}
def record_comparator(a,b, check_equal):
keya=set(a.keys())
keyb=set(b.keys())
a_xclusive = keya - keyb
b_xclusive = keyb - keya
_common = keya & keyb
common_eq = set(k for k in _common if check_equal(a[k],b[k]))
common_neq = _common - common_eq
return {"A excl keys":a_xclusive, "B excl keys":b_xclusive,
"Common & eq":common_eq, "Common keys neq
values":common_neq}

comp_result = record_comparator(a,b, dict.__eq__)

# Further dataon common keys, neq values
common_neq = comp_result["Common keys neq values"]
common_neq = [ (key, record_comparator(a[key],b[key], str.__eq__))
for key in common_neq ]
comp_result["Common keys neq values"] = common_neq

print "\na =",; pp(a)
print "\nb =",; pp(b)
print "\ncomp_result = " ; pp(comp_result)

==========================================

When run it gives:

a ={1: {'1': '1'},
2: {'2': '2'},
3: {'A': 'A', 'C': 'C', 'B': 'B'},
4: {'4': '4'}}

b ={2: {'2': '2'}, 3: {'C': 'D', 'B': 'B', 'E': 'E'}, 5: {'5': '5'}}

comp_result =
{'A excl keys': set([1, 4]),
'B excl keys': set([5]),
'Common & eq': set([2]),
'Common keys neq values': [(3,
{'A excl keys': set(['A']),
'B excl keys': set(['E']),
'Common & eq': set(['B']),
'Common keys neq values': set(['C'])})]}


- Paddy.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
474,436
Messages
2,571,696
Members
48,796
Latest member
Greg L.
Top