A bug in difflib module? (find_longest_match)

N

n00m

from random import randint

s1 = ''
s2 = ''

for i in xrange(1000):
s1 += chr(randint(97,122))
s2 += chr(randint(97,122))

print s1[:25]
print s2[:25]

import difflib

s = difflib.SequenceMatcher(None, s1, s2)

print s.find_longest_match(0, len(s1), 0, len(s2))


yymgzldocfaafcborxbpqyade
urvwtnkwfmcduybjqmrleflqx
(0, 0, 0)
I think it's line #314 in difflib "who's to blame" --
 
G

Gabriel Genellina

from random import randint

s1 = ''
s2 = ''

for i in xrange(1000):
s1 += chr(randint(97,122))
s2 += chr(randint(97,122))

print s1[:25]
print s2[:25]

import difflib

s = difflib.SequenceMatcher(None, s1, s2)

print s.find_longest_match(0, len(s1), 0, len(s2))


yymgzldocfaafcborxbpqyade
urvwtnkwfmcduybjqmrleflqx
(0, 0, 0)
I think it's line #314 in difflib "who's to blame" --

Me too. Could you think of some alternative? Simply disabling that
"popularity check" would slow down the algorithm, according to the
comments.
 
N

n00m

Gabriel Genellina:
from random import randint

s1 = ''
s2 = ''

for i in xrange(1000):
s1 += chr(randint(97,122))
s2 += chr(randint(97,122))

print s1[:25]
print s2[:25]

import difflib

s = difflib.SequenceMatcher(None, s1, s2)

print s.find_longest_match(0, len(s1), 0, len(s2))


============== RESTART ====================
yymgzldocfaafcborxbpqyade
urvwtnkwfmcduybjqmrleflqx
(0, 0, 0)
I think it's line #314 in difflib "who's to blame" --

Me too. Could you think of some alternative? Simply disabling that
"popularity check" would slow down the algorithm, according to the
comments.

No idea :)
 
G

Gabriel Genellina


The "ignore popular elements" is only an optmization, and it should not be
applied in your case because it forces the algorithm to yield an invalid
result.
I can think of two alternatives:
- tune up the conditions when the optimization is used, or
- make it user configurable

SequenceMatcher is a public class, and it is also internally used by
Differ and others to compare both sequences of lines *and* pairs of
similar lines (considered as sequences of characters). In this last usage
the "ignore popular elements" has no much sense, as shown in your example
feeding directly two dissimilar strings.
In principle one should disable the "populardict" stuff when dealing with
strings. Below is a simple attempt to detect that case:

(around line 311 in difflib.py)

b_is_string = isinstance(b, basestring) # add this line
for i, elt in enumerate(b):
if elt in b2j:
indices = b2j[elt]
if not b_is_string and n >= 200 and len(indices) * 100 >
n: # change this line
populardict[elt] = 1
del indices[:]
else:
indices.append(i)
else:
b2j[elt] =
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,769
Messages
2,569,582
Members
45,070
Latest member
BiogenixGummies

Latest Threads

Top