More elegant solution for diffing two sequences

  • Thread starter Ulrich Eckhardt
  • Start date
U

Ulrich Eckhardt

Hi!

I'm trying to write some code to diff two fonts. What I have is every
character (glyph) of the two fonts in a list. I know that the list is sorted
by the codepoints of the characters. What I'd like to ask is whether there
is a more elegant solution to the loop below or whether there are any rough
corners in my code (see below). Note that I'm targeting Python 2, not 3 yet.

Thank you!

Uli


def diff_font(font1, font2):
it1 = iter(font1.glyphs)
it2 = iter(font2.glyphs)
try:
glyph1 = it1.next()
glyph2 = it2.next()
while True:
if glyph1.codepoint < glyph2.codepoint:
# glyph only in first font
diff_glyph(glyph1, None)
glyph1 = it1.next()
elif glyph2.codepoint < glyph2.codepoint:
# glyph only in second font
diff_glyph(None, glyph2)
glyph2 = it2.next()
else:
# glyph in both fonts
diff_glyph(glyph1, glyph2)
glyph1 = it1.next()
glyph2 = it2.next()
except StopIteration, msg:
# remaining glyphs in either font are exclusive to that font
for glyph in it1:
diff_glyph(glyph, None)
for glyph in it1:
diff_glyph(None, glyph)
 
L

Lie Ryan

I'm trying to write some code to diff two fonts. What I have is every
character (glyph) of the two fonts in a list. I know that the list is sorted
by the codepoints of the characters. What I'd like to ask is whether there
is a more elegant solution to the loop below or whether there are any rough
corners in my code (see below). Note that I'm targeting Python 2, not 3 yet.

Use sets:

glyph_1 = set(font1.glyphs)
glyph_2 = set(font2.glyphs)
only_in_1 = glyph_1 - glyph_2
only_in_2 = glyph_2 - glyph_1
in_both = glyph_1 & glyph_2

that is assuming font1.glyphs's value are hashable.
 
R

r0g

Lie said:
Use sets:

glyph_1 = set(font1.glyphs)
glyph_2 = set(font2.glyphs)
only_in_1 = glyph_1 - glyph_2
only_in_2 = glyph_2 - glyph_1
in_both = glyph_1 & glyph_2

that is assuming font1.glyphs's value are hashable.


Ooh that's lovely! I wonder why I've never stumbled across sets in
python before? These are SO going to be my favourite new things for the
next few weeks.

:D

Roger.
 
U

Ulrich Eckhardt

Lie said:
Use sets:

glyph_1 = set(font1.glyphs)
glyph_2 = set(font2.glyphs)
only_in_1 = glyph_1 - glyph_2
only_in_2 = glyph_2 - glyph_1
in_both = glyph_1 & glyph_2

that is assuming font1.glyphs's value are hashable.

Thinking about it, I perhaps should store the glyphs in a set from the
beginning. Question is, can I (perhaps by providing the right hash function)
sort them by their codepoint? I'll have to look at the docs...

Thank you for this nudge in the right direction!

Uli
 
N

Neil Cerutti

Lie Ryan wrote:
Thinking about it, I perhaps should store the glyphs in a set
from the beginning. Question is, can I (perhaps by providing
the right hash function) sort them by their codepoint? I'll
have to look at the docs...

No, sets are unordered in Python.

You'll need to sort them when you need them sorted, or keep a
sorted list separately.
 
M

MRAB

Ulrich said:
Thinking about it, I perhaps should store the glyphs in a set from the
beginning. Question is, can I (perhaps by providing the right hash function)
sort them by their codepoint? I'll have to look at the docs...

Thank you for this nudge in the right direction!
For sets you need __hash__ and __eq__, and for sorting you need __lt__.
Here's a simple example:

class Glyph(object):
def __init__(self, codepoint):
self.codepoint = codepoint
def __hash__(self):
return self.codepoint
def __eq__(self, other):
return self.codepoint == other.codepoint
def __lt__(self, other):
return self.codepoint < other.codepoint
def __repr__(self):
return "Glyph(%s)" % self.codepoint
 
L

Lie Ryan

Thinking about it, I perhaps should store the glyphs in a set from the
beginning. Question is, can I (perhaps by providing the right hash function)
sort them by their codepoint? I'll have to look at the docs...

Python does not guarantee that a particular characteristic of the hash
function will lead to a particular characteristic of the ordering of th
eset. Though AFAICT, the current set's ordering is determined by the
hash modulus the set's hashtable's real size, but if you rely on this
you're on your own. It's better if you sorted() them when you want a
sorted view (or turn to set just before finding the differences).

You can reduce the penalty of creating new data structure with something
like:

a = [...]
b = [...]
s_a = set(a)
s_a -= set(b)

that only creates two new sets (instead of three) and probably might be
faster too (though you'd need to profile to be sure).
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,767
Messages
2,569,570
Members
45,045
Latest member
DRCM

Latest Threads

Top