[NEWBIE] Multiline match on two files ignoring newlines, tabs & blank chars

Discussion in 'Perl Misc' started by Ga, Dec 15, 2003.

  1. Ga

    Ga Guest

    Hi all,

    I have two files (actually some thousands of files, in pairs), file1 and
    file2. File2 looks similar to file1, but:
    - it contains more data than file1 (and such info is what I need to get)
    - it is formatted differently

    Example:

    File1:

    Divitias alius fulvo sibi congerat auro Et teneat culti iugera multa soli,
    Me mea paupertas vita traducat inerti, Dum meus adsiduo luceat igne focus.

    File2:

    Divitias alius fulvo sibi congerat auro
    Et teneat culti iugera multa soli,
    Quem labor adsiduus vicino terreat hoste,
    Martia cui somnos classica pulsa fugent.
    Me mea paupertas vita traducat inerti,
    Dum meus adsiduo luceat igne focus.

    In the example, file2 contains the same *data* as file1 (rows 1, 2, 5, 6),
    but formatted differently. What I would like is a script that extracts rows
    3 and 4 of file2 without losing their formatting (tabs, spaces and newlines).

    I played a little with regexps and nested loops over both files, but it's
    becoming really complicated, and I think there must be some "easy way"
    I'm missing.

    Any hint, anybody?

    Thanks a lot.

    G.
     
    Ga, Dec 15, 2003
    #1

  2. Anno Siegel

    Anno Siegel Guest

    I don't think there is. Identifying text differences looks simple,
    intuitively, but it isn't.

    Let's forget about format differences for the moment and assume you have
    two identically formatted strings. Also assume the differences are only
    insertions: no deletions or other changes happen to the first text. Even
    then, the problem of identifying an insertion has no unique solution.

    Suppose one string is "the right way", and the other is "the right to go
    the right way". Has "right to go the" been inserted after "the", or has
    "to go the right" been inserted after "right"? Somehow your algorithm
    will have to decide. And that is only a single insertion; with multiple
    insertions the problem becomes more formidable.
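
To make the ambiguity concrete, here is a minimal sketch (in Python rather than Perl, purely for illustration) that lists every way the longer string can be produced from the shorter one by a single contiguous insertion of words. The function name `single_insertions` is made up for this example.

```python
def single_insertions(short, long):
    """Return every (position, inserted_words) pair such that inserting
    inserted_words into `short` after `position` words yields `long`."""
    a, b = short.split(), long.split()
    gap = len(b) - len(a)          # number of inserted words
    if gap <= 0:
        return []
    found = []
    for i in range(len(a) + 1):
        # Words before the gap must equal a[:i], words after it must equal a[i:].
        if b[:i] == a[:i] and b[i + gap:] == a[i:]:
            found.append((i, " ".join(b[i:i + gap])))
    return found


candidates = single_insertions("the right way", "the right to go the right way")
for position, words in candidates:
    print(f"after word {position}: {words!r}")
```

For these two strings the sketch reports three equally valid candidates: the two named above, plus "the right to go" inserted at the very start. Nothing in the data favours one over the others.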

    The diff program (Unix) tackles these problems on a line-by-line basis.
    A possible approach would be to split your files into one-word lines,
    run them through diff and interpret the output. There are also modules
    on CPAN that incorporate the diff algorithm without an external program.
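
A rough sketch of the one-word-per-line idea, applied to the two example files from the question. It is written in Python, and the standard library's `difflib.SequenceMatcher` stands in for diff here; it is not the same algorithm as Unix diff, so its choices on ambiguous input may differ. The function name `inserted_word_runs` is hypothetical, and only pure insertions are reported, in line with the assumption above.

```python
import difflib

file1 = (
    "Divitias alius fulvo sibi congerat auro Et teneat culti iugera multa soli,\n"
    "Me mea paupertas vita traducat inerti, Dum meus adsiduo luceat igne focus.\n"
)
file2 = (
    "Divitias alius fulvo sibi congerat auro\n"
    "Et teneat culti iugera multa soli,\n"
    "Quem labor adsiduus vicino terreat hoste,\n"
    "Martia cui somnos classica pulsa fugent.\n"
    "Me mea paupertas vita traducat inerti,\n"
    "Dum meus adsiduo luceat igne focus.\n"
)


def inserted_word_runs(old_text, new_text):
    """Compare the texts word by word and return each run of words
    that appears only in new_text."""
    old_words = old_text.split()   # splitting on whitespace ignores the layout
    new_words = new_text.split()
    matcher = difflib.SequenceMatcher(None, old_words, new_words, autojunk=False)
    runs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "insert":
            runs.append(" ".join(new_words[j1:j2]))
    return runs


runs = inserted_word_runs(file1, file2)
for run in runs:
    print(run)
```

On this input the sketch finds one run covering the twelve words of rows 3 and 4, but joined with single spaces: the original layout is lost at this stage.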

    Ambiguities like the one above will be resolved in one way or another.
    If it matters, a manual check can't be avoided. You will also have the
    problem of putting the insertions back into their original format, but
    that should be solvable.
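
One possible way to get the original format back, sketched in Python under the same assumptions (insertions only, `difflib` standing in for diff): remember the character offsets of every word in the second text, then slice the text from the first inserted word to the last. The sample strings and the name `inserted_spans` are invented for the illustration.

```python
import difflib
import re


def inserted_spans(old_text, new_text):
    """Return the inserted parts of new_text with their inner whitespace
    (tabs, spaces, newlines) exactly as it appears in new_text."""
    old_words = re.findall(r"\S+", old_text)
    tokens = list(re.finditer(r"\S+", new_text))   # keep offsets of each word
    new_words = [t.group() for t in tokens]
    matcher = difflib.SequenceMatcher(None, old_words, new_words, autojunk=False)
    spans = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "insert":
            start = tokens[j1].start()       # first character of first inserted word
            end = tokens[j2 - 1].end()       # just past the last inserted word
            spans.append(new_text[start:end])
    return spans


old = "alpha beta gamma delta\n"
new = "alpha\nbeta\n\tnew   words\n  here\ngamma\ndelta\n"
spans = inserted_spans(old, new)
print(spans)   # ['new   words\n  here']
```

Whitespace between the inserted words survives untouched. Whitespace on the border (the leading tab in the sample) belongs to neither side unambiguously, so the sketch leaves it out; widening the slice to the end of the previous word would be one way to keep it.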

    Anno
     
    Anno Siegel, Dec 15, 2003
    #2