difflib-like library supporting moved blocks detection?

Discussion in 'Python' started by Vlastimil Brom, Jul 13, 2011.

  1. Hi all,
    I'd like to ask about the availability of a text diff library, like
    difflib, which would support the detection of moved text blocks.
    Currently I am almost happy with the results of
    difflib.SequenceMatcher in my app (comparing different versions of
    natural language texts), however the only drawback seems to be the
    missing detection of moves of the text parts. I was thinking of simply
    recomparing the deleted and inserted blocks using difflib again, but
    this obviously won't be a general solution.
    I found several algorithm discussions, but unfortunately no suitable
    python implementation. (E.g. Medite -
    http://www-poleia.lip6.fr/~ganascia/Medite_Project - seems to be
    implemented in Python but it targets some rather special and probably
    much more complex textological issues, than my current needs.)
    Does maybe someone know such python library (or possibly a way of
    enhancing difflib) for this task (i.e character-wise comparing of
    texts - detecting insertion, deletion, substitution and move of text
    blocks)?

    Thanks in advance,
    Vlastimil Brom
     
    Vlastimil Brom, Jul 13, 2011
    #1
    1. Advertising

  2. Vlastimil Brom

    Chris Torek Guest

    In article <>
    Vlastimil Brom <> wrote:
    >I'd like to ask about the availability of a text diff library, like
    >difflib, which would support the detection of moved text blocks.


    If you allow arbitrary moves, the "minimal edit distance" problem
    (string-to-string edit) becomes substantially harder. If you only
    allow insert, delete, or in-place-substitute, you have what is
    called the "Levenshtein distance" case. If you allow transpositions
    you get "Damerau-Levenshtein". These are both solveable with a
    dynamic programming algorithm. Once you allow move operations,
    though, the problem becomes NP-complete.

    See http://pages.cs.brandeis.edu/~shapird/publications/JDAmoves.pdf
    for instance. (They give an algorithm that produces "usually
    acceptable" results in polynomial time.)
    --
    In-Real-Life: Chris Torek, Wind River Systems
    Intel require I note that my opinions are not those of WRS or Intel
    Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W) +1 801 277 2603
    email: gmail (figure it out) http://web.torek.net/torek/index.html
     
    Chris Torek, Jul 14, 2011
    #2
    1. Advertising

  3. 2011/7/14 Chris Torek <>:
    > In article <>
    > Vlastimil Brom  <> wrote:
    >>I'd like to ask about the availability of a text diff library, like
    >>difflib, which would support the detection of moved text blocks.

    >
    > If you allow arbitrary moves, the "minimal edit distance" problem
    > (string-to-string edit) becomes substantially harder.  If you only
    > allow insert, delete, or in-place-substitute, you have what is
    > called the "Levenshtein distance" case.  If you allow transpositions
    > you get "Damerau-Levenshtein".  These are both solveable with a
    > dynamic programming algorithm.  Once you allow move operations,
    > though, the problem becomes NP-complete.
    >
    > See http://pages.cs.brandeis.edu/~shapird/publications/JDAmoves.pdf
    > for instance.  (They give an algorithm that produces "usually
    > acceptable" results in polynomial time.)
    > --
    > In-Real-Life: Chris Torek, Wind River Systems
    >
    >

    Thanks for the references and explanation!
    I do realise the added complexity with taking the moves into account;
    given that, my current needs and the usually satisfying results
    obtained easily with difflib, I am not going to try to implement some
    more complex diffing algorithm.
    However, it turns out that the mentioned naive approach with just
    recomparing the text additions and removals may be partially viable -
    with some luck, i.e. given, the relevant segments are identified as
    deletions and inserts and isolated by difflib in the first place (and
    not subsumed under larger changes or split).

    For illustration, the rough simplified code is attached (sorry for the
    style and possible quirks...)
    Just now the more similar text segments are just collected, it would
    be also possible to sort them on their similarity ratio; the current
    approach also allows to highlight potentially multiple moved segments.

    Comments and suggestions are, of course, welcome,
    regards,
    vbr

    # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
    # # # # # # # # # # # # # # # # # # # # #

    #! Python
    # -*- coding: utf-8 -*-

    import difflib
    import itertools

    def compare_moves(a, b, similarity_threshold=0.6):
    """
    Poor man's text comparison with simple moves check. Compares two
    strings using difflib
    and additionally tries to detect moved blocks
    by comparing similar deleted and inserted segments with each other
    - given the similarity_threshold.
    """

    seq_matcher = difflib.SequenceMatcher(isjunk=None, a=a, b=b, autojunk=False)
    diff_raw = [[tag, i1, i2, j1, j2, a[i1:i2], b[j1:j2]] for tag, i1,
    i2, j1, j2 in seq_matcher.get_opcodes()]

    deleted, inserted = {}, {}
    for tag, i1, i2, j1, j2 in seq_matcher.get_opcodes():
    if tag == 'delete':
    deleted[(i1, i2)] = [tag, i1, i2, j1, j2, a[i1:i2]]
    elif tag == 'insert':
    inserted[(i1, i2)] = [tag, i1, i2, j1, j2, b[j1:j2]]

    possibly_moved_blocks = []
    for deleted_item, inserted_item in
    itertools.product(deleted.values(), inserted.values()):
    similarity_ratio = difflib.SequenceMatcher(isjunk=None,
    a=deleted_item[5], b=inserted_item[5], autojunk=False).ratio()
    if similarity_ratio >= similarity_threshold:
    possibly_moved_blocks.append([deleted_item, inserted_item,
    similarity_ratio])

    print diff_raw
    print possibly_moved_blocks


    if __name__ == "__main__":
    compare_moves("abcXYZdeABfghijklmnopABBCq",
    "ABCDabcdeACfgXYXZhijklmnopq", similarity_threshold = 0.6)

    # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
    # # # # # #
    # output: #
    [['insert', 0, 0, 0, 4, '', 'ABCD'], ['equal', 0, 3, 4, 7, 'abc',
    'abc'], ['delete', 3, 6, 7, 7, 'XYZ', ''], ['equal', 6, 9, 7, 10,
    'deA', 'deA'], ['replace', 9, 10, 10, 11, 'B', 'C'], ['equal', 10, 12,
    11, 13, 'fg', 'fg'], ['insert', 12, 12, 13, 17, '', 'XYXZ'], ['equal',
    12, 21, 17, 26, 'hijklmnop', 'hijklmnop'], ['delete', 21, 25, 26, 26,
    'ABBC', ''], ['equal', 25, 26, 26, 27, 'q', 'q']]

    [[['delete', 21, 25, 26, 26, 'ABBC'], ['insert', 0, 0, 0, 4, 'ABCD'],
    0.75], [['delete', 3, 6, 7, 7, 'XYZ'], ['insert', 12, 12, 13, 17,
    'XYXZ'], 0.8571428571428571]]
     
    Vlastimil Brom, Jul 15, 2011
    #3
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Arjen
    Replies:
    3
    Views:
    456
    Scott Allen
    Feb 27, 2005
  2. kuangye
    Replies:
    3
    Views:
    338
    Erik Wikström
    Oct 11, 2008
  3. Vlastimil Brom
    Replies:
    4
    Views:
    334
    Paul Rubin
    Jun 24, 2010
  4. matt
    Replies:
    1
    Views:
    285
    George Ogata
    Aug 6, 2004
  5. David Mark
    Replies:
    7
    Views:
    129
    David Mark
    May 12, 2010
Loading...

Share This Page