[NEWBIE] Multiline match on two files ignoring newlines, tabs & blank chars

Discussion in 'Perl Misc' started by Ga, Dec 15, 2003.

  1. Ga

    Ga Guest

    Hi all,

    I have two files (some thousands of files in pairs, indeed...), file1 and
    file2. File2 looks similar to file1, but:
    - it contains more data than file1 (and such info is what I need to get)
    - it is formatted differently

    Example:

    File1:

    Divitias alius fulvo sibi congerat auro Et teneat culti iugera multa soli,
    Me mea paupertas vita traducat inerti, Dum meus adsiduo luceat igne focus.

    File2:

    Divitias alius fulvo sibi congerat auro
    Et teneat culti iugera multa soli,
    Quem labor adsiduus vicino terreat hoste,
    Martia cui somnos classica pulsa fugent.
    Me mea paupertas vita traducat inerti,
    Dum meus adsiduo luceat igne focus.

    In the example file2 contains the same *data* as file1 (rows 1,2,5,6), but
    formatted differently. What I would like is a script that extracts rows 3
    and 4 of file2 without losing their formatting (tabs, spaces and newlines).

    I played a little with regexps and nested loops over both files, but it's
    really becoming too complicated and I think there should be some "easy
    way" I'm missing.

    Any hint, anybody?

    Thanks a lot.

    G.
     
    Ga, Dec 15, 2003
    #1

  2. Anno Siegel

    Anno Siegel Guest

    Ga wrote in comp.lang.perl.misc:
    > Hi all,
    >
    > I have two files (some thousands of files in pairs, indeed...), file1 and
    > file2. File2 looks similar to file1, but:
    > - it contains more data than file1 (and such info is what I need to get)
    > - it is formatted differently
    >
    > Example:
    >
    > File1:
    >
    > Divitias alius fulvo sibi congerat auro Et teneat culti iugera multa soli,
    > Me mea paupertas vita traducat inerti, Dum meus adsiduo luceat igne focus.
    >
    > File2:
    >
    > Divitias alius fulvo sibi congerat auro
    > Et teneat culti iugera multa soli,
    > Quem labor adsiduus vicino terreat hoste,
    > Martia cui somnos classica pulsa fugent.
    > Me mea paupertas vita traducat inerti,
    > Dum meus adsiduo luceat igne focus.
    >
    > In the example file2 contains the same *data* as file1 (rows 1,2,5,6), but
    > formatted differently. What I would like is a script that extracts rows 3
    > and 4 of file2 without losing their formatting (tabs, spaces and newlines).
    >
    > I played a little with regexps and nested loops over both files, but it's
    > really becoming too complicated and I think there should be some "easy
    > way" I'm missing.


    I don't think there is. Identifying text differences looks simple,
    intuitively, but it isn't.

    Let's forget about format differences for the moment and assume you have
    two identically formatted strings. Also assume the only differences are
    insertions: nothing is deleted from or otherwise changed in the first
    text. Even then, the location of an insertion isn't uniquely determined.

    Suppose one string is "the right way", and the other is "the right to go
    the right way". Has "right to go the" been inserted after "the", or has
    "to go the right" been inserted after "right"? Somehow your algorithm
    will have to decide. And that is only a single insertion; with multiple
    ones the problem becomes more formidable.
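
    The ambiguity can be checked by brute force. The following is a minimal
    sketch in Python (not Perl, and not part of the original post); the
    function name single_insertions is made up for illustration. It assumes
    the longer string differs from the shorter one by exactly one contiguous
    run of inserted words and lists every position where that run could sit.

```python
def single_insertions(short, long):
    """Return every (position, inserted_words) pair such that inserting
    inserted_words into `short` before word index `position` yields `long`.

    Hypothetical helper: assumes a single contiguous insertion of whole words.
    """
    a, b = short.split(), long.split()
    extra = len(b) - len(a)
    results = []
    if extra <= 0:
        return results
    for pos in range(len(a) + 1):
        # The words before the insertion point must match,
        # and so must the words after the inserted run.
        if a[:pos] == b[:pos] and a[pos:] == b[pos + extra:]:
            results.append((pos, " ".join(b[pos:pos + extra])))
    return results


candidates = single_insertions("the right way",
                               "the right to go the right way")
for pos, words in candidates:
    print(f"insert before word {pos}: {words!r}")
```

    For this pair the search reports the two readings described above plus a
    third one, "the right to go" inserted at the very start, so even a single
    insertion leaves the algorithm with an arbitrary choice to make.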

    The diff program (Unix) tackles these problems on a line-by-line basis.
    A possible approach would be to split your files into one-word lines,
    run them through diff and interpret the output. There are also modules
    on CPAN that incorporate the diff algorithm without an external program.

    Ambiguities like the one above will be resolved in one way or another.
    If it matters, a manual check can't be avoided. You will also have the
    problem of putting the insertions back into their original format, but
    that should be solvable.
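
    As a rough illustration of the word-by-word approach, here is a sketch
    in Python using the standard difflib module instead of the Unix diff
    program or a CPAN module. difflib's matching algorithm is not the same
    as diff's, so it may resolve ambiguous cases differently. The name
    inserted_spans is hypothetical. The sketch compares the two texts as
    word lists, then uses the character offsets of the words in file2 to
    cut each inserted run out of the original text, so the tabs, spaces
    and newlines between the inserted words survive. Whitespace before the
    first and after the last inserted word is not included.

```python
import difflib
import re


def inserted_spans(text1, text2):
    """Return substrings of text2, interior whitespace intact, covering
    runs of words that have no counterpart in text1 (hypothetical helper)."""
    words1 = text1.split()
    # Remember where each word of text2 starts and ends in the original text.
    matches2 = list(re.finditer(r"\S+", text2))
    words2 = [m.group() for m in matches2]
    matcher = difflib.SequenceMatcher(None, words1, words2, autojunk=False)
    spans = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # "insert" and "replace" both mean words2[j1:j2] is new material.
        if tag in ("insert", "replace") and j2 > j1:
            start = matches2[j1].start()
            end = matches2[j2 - 1].end()
            spans.append(text2[start:end])
    return spans


file1 = (
    "Divitias alius fulvo sibi congerat auro Et teneat culti iugera multa soli,\n"
    "Me mea paupertas vita traducat inerti, Dum meus adsiduo luceat igne focus.\n"
)
file2 = (
    "Divitias alius fulvo sibi congerat auro\n"
    "Et teneat culti iugera multa soli,\n"
    "Quem labor adsiduus vicino terreat hoste,\n"
    "Martia cui somnos classica pulsa fugent.\n"
    "Me mea paupertas vita traducat inerti,\n"
    "Dum meus adsiduo luceat igne focus.\n"
)
extra = inserted_spans(file1, file2)
for span in extra:
    print(span)
```

    On the example from the question this yields a single span holding
    rows 3 and 4 of file2 with their line break. Ambiguous cases are still
    resolved in whatever way the matcher happens to choose.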

    Anno
     
    Anno Siegel, Dec 15, 2003
    #2