SequenceMatcher bug ?

E

eliben

Hello,

This is about Python 2.5.2 - I don't know if there were fixes to this
module in 2.6/3.0

I think I ran into a bug with difflib.SequenceMatcher class.
Specifically, its ratio() method. The following:

SequenceMatcher(None, [4] + [10] * 500 + [5], [10] * 500 + [5]).ratio
()

returns 0.0

While the same with 500 replaced by 100 returns .99... something
Looking at the code of SequenceMatcher there's some caching going on
when the sequences are longer than 200 elements (and indeed, I can
reproduce the bug above 200 but not below). Can anyone confirm that
this misbehaves and suggest a workaround ?

P.S. quick_ratio() works fine, it seems.

Thanks
Eli
 
R

rdmurray

This is about Python 2.5.2 - I don't know if there were fixes to this
module in 2.6/3.0

I think I ran into a bug with difflib.SequenceMatcher class.
Specifically, its ratio() method. The following:

SequenceMatcher(None, [4] + [10] * 500 + [5], [10] * 500 + [5]).ratio
()

returns 0.0

While the same with 500 replaced by 100 returns .99... something
Looking at the code of SequenceMatcher there's some caching going on
when the sequences are longer than 200 elements (and indeed, I can
reproduce the bug above 200 but not below). Can anyone confirm that
this misbehaves and suggest a workaround ?

Python 2.5.2 (r252:60911, Sep 29 2008, 20:34:04)
[GCC 4.3.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
from difflib import SequenceMatcher
SequenceMatcher(None, [4] + [10] * 500 + [5], [10] * 500 +
[5]).ratio()
0.99900299102691925

--RDM
 
E

eliben

This is about Python 2.5.2 - I don't know if there were fixes to this
module in 2.6/3.0
I think I ran into a bug with difflib.SequenceMatcherclass.
Specifically, its ratio() method. The following:
SequenceMatcher(None, [4] + [10] * 500 + [5], [10] * 500 + [5]).ratio
()
returns 0.0
While the same with 500 replaced by 100 returns .99... something
Looking at the code ofSequenceMatcherthere's some caching going on
when the sequences are longer than 200 elements (and indeed, I can
reproduce the bug above 200 but not below). Can anyone confirm that
this misbehaves and suggest a workaround ?

Python 2.5.2 (r252:60911, Sep 29 2008, 20:34:04)
[GCC 4.3.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.>>> from difflib importSequenceMatcher
SequenceMatcher(None, [4] + [10] * 500 + [5], [10] * 500 +
[5]).ratio()

0.99900299102691925

Strange. I could reproduce the problem both on ActiveState Python
2.5.2 for Windows, and in the online Try Python evaluator:

http://try-python.mired.org/

Eli
 
R

rdmurray

This is about Python 2.5.2 - I don't know if there were fixes to this
module in 2.6/3.0
I think I ran into a bug with difflib.SequenceMatcherclass.
Specifically, its ratio() method. The following:
SequenceMatcher(None, [4] + [10] * 500 + [5], [10] * 500 + [5]).ratio
()
returns 0.0
While the same with 500 replaced by 100 returns .99... something
Looking at the code ofSequenceMatcherthere's some caching going on
when the sequences are longer than 200 elements (and indeed, I can
reproduce the bug above 200 but not below). Can anyone confirm that
this misbehaves and suggest a workaround ?

Python 2.5.2 (r252:60911, Sep 29 2008, 20:34:04)
[GCC 4.3.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.>>> from difflib importSequenceMatcher
SequenceMatcher(None, [4] + [10] * 500 + [5], [10] * 500 +
[5]).ratio()

0.99900299102691925

Strange. I could reproduce the problem both on ActiveState Python
2.5.2 for Windows, and in the online Try Python evaluator:

http://try-python.mired.org/

My system is Gentoo, which installs python from source. Maybe gentoo
applies patches that the binary releases don't have.

--RDM
 
G

Gabriel Genellina

On Mon, 8 Dec 2008 at 23:46, eliben wrote:
This is about Python 2.5.2 - I don't know if there were fixes to this
module in 2.6/3.0

I think I ran into a bug with difflib.SequenceMatcherclass.
Specifically, its ratio() method. The following:

SequenceMatcher(None, [4] + [10] * 500 + [5], [10] * 500 + [5]).ratio
()

returns 0.0

While the same with 500 replaced by 100 returns .99... something
Looking at the code ofSequenceMatcherthere's some caching going on
when the sequences are longer than 200 elements (and indeed, I can
reproduce the bug above 200 but not below). Can anyone confirm that
this misbehaves and suggest a workaround ?

Python 2.5.2 (r252:60911, Sep 29 2008, 20:34:04)
[GCC 4.3.1] on linux2
Type "help", "copyright", "credits" or "license" for more
information.>>> from difflib importSequenceMatcher
SequenceMatcher(None, [4] + [10] * 500 + [5], [10] * 500 +
[5]).ratio()

0.99900299102691925

Strange. I could reproduce the problem both on ActiveState Python
2.5.2 for Windows, and in the online Try Python evaluator:

http://try-python.mired.org/

My system is Gentoo, which installs python from source. Maybe gentoo
applies patches that the binary releases don't have.

I can't reproduce the problem. I got exactly the same results (0.999...)
with all the releases I have at hand, ranging from 3.0 back to 2.1.3, all
on Windows.
And http://try-python.mired.org/ says the same thing.
 
E

eliben

My system is Gentoo, which installs python from source.  Maybe gentoo
I can't reproduce the problem. I got exactly the same results (0.999...)  
with all the releases I have at hand, ranging from 3.0 back to 2.1.3, all  
on Windows.
Andhttp://try-python.mired.org/says the same thing.

What ? This can't be.

1. Go to http://try-python.mired.org/
2. Type
import difflib
3. Type
difflib.SequenceMatcher(None, [4] + [5] * 200, [5] * 200).ratio()

Don't you get 0 as the answer ?

The same with ActivePython on Windows ?

Here's a snapshot from my run on a Linux box:

Python 2.5.2 (r252:60911, Oct 20 2008, 09:11:31)
[GCC 3.4.6 20060404 (Red Hat 3.4.6-10)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
import difflib

difflib.SequenceMatcher(None, [4] + [5] * 200, [5] * 200).ratio()
0.0

Executing exactly the same steps, do you not get 0.0 ?
 
S

Steve Holden

eliben said:
I can't reproduce the problem. I got exactly the same results (0.999...)
with all the releases I have at hand, ranging from 3.0 back to 2.1.3, all
on Windows.
Andhttp://try-python.mired.org/says the same thing.

What ? This can't be.

1. Go to http://try-python.mired.org/
2. Type
import difflib
3. Type
difflib.SequenceMatcher(None, [4] + [5] * 200, [5] * 200).ratio()

Don't you get 0 as the answer ?

The same with ActivePython on Windows ?

Here's a snapshot from my run on a Linux box:

Python 2.5.2 (r252:60911, Oct 20 2008, 09:11:31)
[GCC 3.4.6 20060404 (Red Hat 3.4.6-10)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
import difflib

difflib.SequenceMatcher(None, [4] + [5] * 200, [5] * 200).ratio()
0.0

Executing exactly the same steps, do you not get 0.0 ?
I get 0.0 from everything I've tried it on so far: Cygwin 2.5.1, Windows
2.4.4, Windows 2.5.2, Windows 2.6, Ubuntu 2.5.2, Red Hat 2.4.3,
try-python. All my own machines are Intel architecture, if that makes
any difference. I'm not sure how gentoo would have patched their 2.5.2
when the bug, if 2.6 still contains it.

I find it strange that try-python gives one person a different result
from everyone else. What is this bizarre influence on web sites?

regards
Steve
 
G

Gabriel Genellina

I can't reproduce the problem. I got exactly the same results
(0.999...)  
with all the releases I have at hand, ranging from 3.0 back to 2.1.3,
all  
on Windows.
Andhttp://try-python.mired.org/says the same thing.

What ? This can't be.

1. Go to http://try-python.mired.org/
2. Type
import difflib
3. Type
difflib.SequenceMatcher(None, [4] + [5] * 200, [5] * 200).ratio()

Don't you get 0 as the answer ?

Ah, but that isn't the same expression you posted originally:

SequenceMatcher(None, [4] + [10] * 500 + [5], [10] * 500 + [5]).ratio()

Using *that* expression I got near 1.0 always. But leaving out the [5] at
the end, it's true, it gives the wrong answer. Old Python 2.1.3 didn't
have this problem:

Python 2.1.3 (#35, Apr 8 2002, 17:47:50) [MSC 32 bit (Intel)] on win32
Type "copyright", "credits" or "license" for more information.
import difflib
difflib.SequenceMatcher(None, [4] + [5] * 500, [5] * 500).ratio() 0.99900099900099903
difflib.SequenceMatcher(None, [4] + [5] * 200, [5] * 200).ratio() 0.99750623441396513
difflib.SequenceMatcher(None, [4] + [5] * 100, [5] * 100).ratio()
0.99502487562189057

I've updated the tracker item.
 
T

Tim Roberts

Gabriel Genellina said:
What ? This can't be.

1. Go to http://try-python.mired.org/
2. Type
import difflib
3. Type
difflib.SequenceMatcher(None, [4] + [5] * 200, [5] * 200).ratio()

Don't you get 0 as the answer ?

Ah, but that isn't the same expression you posted originally:

SequenceMatcher(None, [4] + [10] * 500 + [5], [10] * 500 + [5]).ratio()

Using *that* expression I got near 1.0 always. But leaving out the [5] at
the end, it's true, it gives the wrong answer.
...
I've updated the tracker item.

Your assessment that it is the same problem as #1528074 is correct. It's
the "popularity" optimization. The key here is that the second sequence
consists of more than 200 identical items. For example, all of the
following give the same bad result:

difflib.SequenceMatcher(None, [4] + [5] * 200, [5] * 200).ratio()
difflib.SequenceMatcher(None, [4] + [5] , [5] * 200).ratio()
difflib.SequenceMatcher(None, [4] , [5] * 200).ratio()

If you print get_matching_blocks(), you'll see that there are none, because
the "b" sequence is optimized completely away. The #1528074 calls it
"working by designed" and suggests updating the doc. However, I would
argue that it's worth checking for this.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
474,431
Messages
2,571,679
Members
48,796
Latest member
Greg L.

Latest Threads

Top