Comparing two book chapters (text files)

Discussion in 'Python' started by Nick Matzke, Feb 5, 2009.

  1. Nick Matzke

    Nick Matzke Guest

    Hi all,

    So I have an interesting challenge. I want to compare two book
    chapters, which I have in plain text format, and find out (a) percentage
    similarity and (b) what has changed.

    Some features make this problem different from what seems to be the
    standard text-matching problem, which is solvable with e.g. difflib.
    Here is what I mean:

    * there is no guarantee that single lines from each file will be
    directly comparable -- e.g., if a few words are inserted into a
    sentence, then a chunk of the sentence will be moved to the next line,
    then a chunk of that line moved to the next, etc.

    * Also, there are cases where paragraphs have been moved around,
    sections re-ordered, etc. So it can't just be a "linear" match.

    I imagine this kind of thing can't be all that hard in the grand scheme
    of things, but I couldn't find an easily applicable solution readily
    available. I have advanced-beginner Python skills but am not quite
    where I could do this kind of thing from scratch without some guidance
    about the likely functions, libraries etc. to use.

    PS: I am going to have to do this for multiple book chapters, so various
    standalone software packages, e.g. for Windows, are not really usable.

    Any help is much appreciated!!

    Cheers,
    Nick



    --
    ====================================================
    Nicholas J. Matzke
    Ph.D. student, Graduate Student Researcher
    Huelsenbeck Lab
    Center for Theoretical Evolutionary Genomics
    4151 VLSB (Valley Life Sciences Building)
    Department of Integrative Biology
    University of California, Berkeley

    Lab websites:
    http://ib.berkeley.edu/people/lab_detail.php?lab=54
    http://fisher.berkeley.edu/cteg/hlab.html
    Dept. personal page:
    http://ib.berkeley.edu/people/students/person_detail.php?person=370
    Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html
    Lab phone: 510-643-6299
    Dept. fax: 510-643-6264
    Cell phone: 510-301-0179
    Email:

    Mailing address:
    Department of Integrative Biology
    3060 VLSB #3140
    Berkeley, CA 94720-3140

    -----------------------------------------------------
    "[W]hen people thought the earth was flat, they were wrong. When people
    thought the earth was spherical, they were wrong. But if you think that
    thinking the earth is spherical is just as wrong as thinking the earth
    is flat, then your view is wronger than both of them put together."

    Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer,
    14(1), 35-44. Fall 1989.
    http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm
    ====================================================
     
    Nick Matzke, Feb 5, 2009
    #1

  2. andrew cooke

    andrew cooke Guest

    On Feb 4, 10:20 pm, Nick Matzke wrote:
    > So I have an interesting challenge.  I want to compare two book
    > chapters, which I have in plain text format, and find out (a) percentage
    > similarity and (b) what has changed.


    no idea if it will help, but i found this yesterday - http://www.nltk.org/

    it's a python toolkit for natural language processing. there's a book
    at http://www.nltk.org/book with much more info.
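
    For the two wrinkles in the question (re-wrapped lines and moved
    paragraphs), here is a rough standard-library sketch rather than a
    finished tool. It compares words instead of lines, so re-wrapping does
    not matter, and it pairs each paragraph with its best match anywhere in
    the other file, so re-ordering does not matter. The helper names and the
    0.6 threshold are made up for illustration, and it assumes paragraphs
    are separated by blank lines.

```python
import difflib
import re


def words(text):
    """Lowercase word tokens; line breaks and punctuation are ignored."""
    return re.findall(r"\w+", text.lower())


def paragraphs(text):
    """Split on blank lines (an assumption about the files) and drop empties."""
    return [p for p in re.split(r"\n\s*\n", text) if p.strip()]


def similarity(a, b):
    """difflib ratio in [0, 1], computed over words rather than lines."""
    matcher = difflib.SequenceMatcher(None, words(a), words(b), autojunk=False)
    return matcher.ratio()


def match_paragraphs(old_text, new_text, threshold=0.6):
    """Pair each old paragraph with its most similar new paragraph.

    Returns (old_paragraph, index_of_best_new_paragraph_or_None, score).
    The search covers every new paragraph, so moved text still pairs up.
    The threshold is an arbitrary cut-off, not a recommended value.
    """
    new_paras = paragraphs(new_text)
    results = []
    for old in paragraphs(old_text):
        scored = [(similarity(old, new), j) for j, new in enumerate(new_paras)]
        score, j = max(scored, default=(0.0, None))
        if score < threshold:
            j = None  # treat as deleted or heavily rewritten
        results.append((old, j, score))
    return results


def word_changes(a, b):
    """List the non-equal difflib opcodes between two texts, word by word."""
    wa, wb = words(a), words(b)
    matcher = difflib.SequenceMatcher(None, wa, wb, autojunk=False)
    return [
        (tag, wa[i1:i2], wb[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

    A rough percentage for a whole chapter could be the average of the
    per-paragraph scores, and word_changes(old_para, new_para) lists what
    changed inside a matched pair. Comparing every paragraph against every
    other is quadratic, which should be acceptable for chapter-sized files.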

    andrew
     
    andrew cooke, Feb 5, 2009
    #2
