Efficiently determine where documents differ

Richard

Hello,

I have been using the difflib library to find where two large HTML
documents differ. The Differ().compare() method does this, but it is
very slow: at least 100x slower than the Unix diff command.

How can I efficiently determine where two documents differ in Python?
(Ideally I am after the positions of the differences rather than the
actual text, i.e. the kind of result SequenceMatcher().get_opcodes()
returns.)

Richard
 
Gabriel Genellina

Differ compares sequences of lines *and* lines as sequences of characters
to provide intra-line differences. The diff command only processes lines.
If you aren't interested in intra-line differences, use a SequenceMatcher
on the two lists of lines instead. Or invoke the diff command using
subprocess.Popen and communicate().
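
As a rough illustration of the SequenceMatcher suggestion, here is a minimal sketch that compares two documents line by line and keeps only the opcodes that describe a difference. The helper name changed_line_ranges and the sample strings are made up for this example, and autojunk=False is my own choice (so that frequently repeated lines in long documents are not treated as junk), not something stated in the thread.

```python
from difflib import SequenceMatcher


def changed_line_ranges(old_text, new_text):
    """Return the non-equal opcodes (tag, i1, i2, j1, j2) in line indices.

    old_lines[i1:i2] is the differing block in the old document and
    new_lines[j1:j2] is the corresponding block in the new one.
    """
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    # Compare lists of lines, not raw strings, so no per-character work is done.
    # autojunk=False is an assumption: it disables the "popular element" heuristic.
    matcher = SequenceMatcher(None, old_lines, new_lines, autojunk=False)
    return [op for op in matcher.get_opcodes() if op[0] != "equal"]


# Hypothetical sample documents.
old = "<html>\n<p>one</p>\n<p>two</p>\n</html>\n"
new = "<html>\n<p>one</p>\n<p>2</p>\n</html>\n"

for tag, i1, i2, j1, j2 in changed_line_ranges(old, new):
    print(tag, (i1, i2), (j1, j2))
# prints: replace (2, 3) (2, 3)
```

Passing the raw strings instead of the line lists would make SequenceMatcher compare character by character, which is far more work on large documents.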
 
Richard

Thank you very much, Gabriel! Passing a list of the document lines
makes the efficiency comparable to the diff command.
Richard
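
Since the original question asked for positions rather than text, the line-based result can be mapped back to character offsets. This is only a sketch under my own assumptions: changed_char_spans is a hypothetical helper, the sample strings are invented, and autojunk=False is again a choice made for the example rather than anything from the thread.

```python
from difflib import SequenceMatcher
from itertools import accumulate


def changed_char_spans(old_text, new_text):
    """Return (tag, old_start, old_end, new_start, new_end) in character offsets."""
    # keepends=True keeps line terminators, so line lengths sum to len(text).
    old_lines = old_text.splitlines(keepends=True)
    new_lines = new_text.splitlines(keepends=True)
    # offsets[k] is the character position where line k starts;
    # the final entry equals the length of the whole text.
    old_off = [0] + list(accumulate(len(line) for line in old_lines))
    new_off = [0] + list(accumulate(len(line) for line in new_lines))
    matcher = SequenceMatcher(None, old_lines, new_lines, autojunk=False)
    spans = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            spans.append((tag, old_off[i1], old_off[i2], new_off[j1], new_off[j2]))
    return spans


# Hypothetical sample documents.
old = "a\nb\nc\n"
new = "a\nB!\nc\n"
spans = changed_char_spans(old, new)
print(spans)
# prints: [('replace', 2, 4, 2, 5)]
```

Each span can be sliced straight out of the original strings, e.g. old[2:4] is "b\n" and new[2:5] is "B!\n" for the sample above.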
 
