A bug in difflib module? (find_longest_match)

n00m · May 1, 2008

from random import randint

s1 = ''
s2 = ''

for i in xrange(1000):
    s1 += chr(randint(97,122))
    s2 += chr(randint(97,122))

print s1[:25]
print s2[:25]

import difflib

s = difflib.SequenceMatcher(None, s1, s2)

print s.find_longest_match(0, len(s1), 0, len(s2))

yymgzldocfaafcborxbpqyade
urvwtnkwfmcduybjqmrleflqx
(0, 0, 0)
I think line #314 in difflib is "who's to blame".

Gabriel Genellina · May 1, 2008

n00m said:
I think it's line #314 in difflib "who's to blame" --

Me too. Could you think of some alternative? Simply disabling that
"popularity check" would slow down the algorithm, according to the
comments.
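
To see why every candidate gets thrown away in a case like this, the popularity check can be approximated in a few lines of Python 3. This is only a sketch: `popular_elements` is a hypothetical helper, not part of difflib, and it applies the 1% rule to final counts rather than incrementally as difflib does.

```python
from collections import Counter
import string


def popular_elements(b):
    """Hypothetical helper: approximate difflib's 'popular element' test.

    An element counts as popular when the sequence has at least 200 items
    and the element makes up more than 1% of them.
    """
    n = len(b)
    if n < 200:
        return set()
    return {elt for elt, count in Counter(b).items() if count * 100 > n}


long_text = string.ascii_lowercase * 40   # 1040 characters, 26 distinct letters
short_text = "abc" * 10                   # too short for the check to apply

print(len(popular_elements(long_text)))   # every letter is popular
print(popular_elements(short_text))       # nothing is popular
```

With only 26 distinct letters spread over about a thousand characters, each letter takes up far more than 1% of the string, so all of them are treated as popular and nothing is left to match on.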

n00m · May 1, 2008

Gabriel Genellina:

Me too. Could you think of some alternative? Simply disabling that
"popularity check" would slow down the algorithm, according to the
comments.

No idea

Gabriel Genellina · May 1, 2008

n00m said:
No idea

The "ignore popular elements" step is only an optimization, and it should not be
applied in your case because it forces the algorithm to yield an invalid
result.
I can think of two alternatives:
- tune the conditions under which the optimization is applied, or
- make it user-configurable
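
For reference, later Python releases (2.7.1 and 3.2 onward) took the second route and added an `autojunk` argument to `SequenceMatcher`. Below is a minimal Python 3 sketch of the difference. It uses fixed strings instead of the random ones from the original post so that the result is reproducible; the strings and variable names are illustrative assumptions.

```python
import difflib
import string

letters = string.ascii_lowercase

# s1: the alphabet repeated 40 times (1040 characters).
s1 = letters * 40
# s2: the reversed alphabet repeated, with "abcdefghij" planted at index 520.
s2 = letters[::-1] * 20 + "abcdefghij" + letters[::-1] * 20

# Default behaviour: the popularity heuristic is active, every letter in s2
# exceeds the 1% threshold, and no match is found.
default = difflib.SequenceMatcher(None, s1, s2)
m_default = default.find_longest_match(0, len(s1), 0, len(s2))
print(m_default)   # Match(a=0, b=0, size=0)

# Heuristic switched off: the planted block is found.
full = difflib.SequenceMatcher(None, s1, s2, autojunk=False)
m_full = full.find_longest_match(0, len(s1), 0, len(s2))
print(m_full)      # Match(a=0, b=520, size=10)
```

Turning the heuristic off costs time on long inputs, which is the slowdown the comments in difflib warn about, so it is probably best reserved for cases like this one where the default gives a wrong answer.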

SequenceMatcher is a public class, and it is also used internally by
Differ and others to compare both sequences of lines *and* pairs of
similar lines (treated as sequences of characters). In this last usage
"ignore popular elements" does not make much sense, as your example of
directly feeding in two dissimilar strings shows.
In principle one should disable the "populardict" stuff when dealing with
strings. Below is a simple attempt to detect that case:

(around line 311 in difflib.py)

b_is_string = isinstance(b, basestring)  # add this line
for i, elt in enumerate(b):
    if elt in b2j:
        indices = b2j[elt]
        if not b_is_string and n >= 200 and len(indices) * 100 > n:  # change this line
            populardict[elt] = 1
            del indices[:]
        else:
            indices.append(i)
    else:
        b2j[elt] = [i]
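
The patched loop above can also be tried outside difflib. What follows is a minimal Python 3 sketch of it as a standalone function, not the actual difflib code: `build_b2j` and its `skip_for_strings` flag are hypothetical names, `str` stands in for Python 2's `basestring`, and the purge step after the loop is an assumption about how the surrounding code discards popular elements.

```python
import string


def build_b2j(b, skip_for_strings=True):
    """Hypothetical standalone version of the patched indexing loop.

    Returns (b2j, populardict): b2j maps each element of b to the list of
    indices where it occurs; populardict records elements dropped as popular.
    """
    n = len(b)
    b2j = {}
    populardict = {}
    # The patch: never apply the popularity heuristic to plain strings.
    b_is_string = skip_for_strings and isinstance(b, str)
    for i, elt in enumerate(b):
        if elt in b2j:
            indices = b2j[elt]
            if not b_is_string and n >= 200 and len(indices) * 100 > n:
                populardict[elt] = 1
                del indices[:]
            else:
                indices.append(i)
        else:
            b2j[elt] = [i]
    # Assumed cleanup: drop popular elements so leftover indices are not used.
    for elt in populardict:
        del b2j[elt]
    return b2j, populardict


text = string.ascii_lowercase * 40

as_string, popular_string = build_b2j(text)
print(len(as_string["a"]), len(popular_string))   # indices kept, nothing popular

as_list, popular_list = build_b2j(list(text))
print(len(as_list), len(popular_list))            # nothing indexed, 26 popular
```

The same characters passed as a list still go through the heuristic, so comparing sequences of lines would behave as before; only direct string comparisons change.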
