[NEWBIE] Multiline match on two files ignoring newlines, tabs & blank chars

Ga

Hi all,

I have two files (actually some thousands of file pairs), file1 and
file2. File2 looks similar to file1, but:
- it contains more data than file1 (and such info is what I need to get)
- it is formatted differently

Example:

File1:

Divitias alius fulvo sibi congerat auro Et teneat culti iugera multa soli,
Me mea paupertas vita traducat inerti, Dum meus adsiduo luceat igne focus.

File2:

Divitias alius fulvo sibi congerat auro
Et teneat culti iugera multa soli,
Quem labor adsiduus vicino terreat hoste,
Martia cui somnos classica pulsa fugent.
Me mea paupertas vita traducat inerti,
Dum meus adsiduo luceat igne focus.

In the example, file2 contains the same *data* as file1 (lines 1, 2, 5, 6),
but formatted differently. What I would like is a script that gets lines 3
and 4 of file2 without losing their formatting (tabs, spaces and newlines).

I played a little with regexps and nested loops over both files, but it's
really becoming too complicated, and I think there should be some "easy
way" I'm missing.

Any hint, anybody?

Thanks a lot.

G.
 
Anno Siegel

Ga said:
I played a little with regexps and nested loops over both files, but it's
really becoming too complicated, and I think there should be some "easy
way" I'm missing.

I don't think there is. Identifying text differences looks simple,
intuitively, but it isn't.

Let's forget about format differences for the moment and assume you have
two identically formatted strings. Also assume the differences are only
insertions: no deletions or other changes happen to the first text. Even
then, the problem of determining what was inserted doesn't have a unique
solution.

Suppose one string is "the right way", and the other is "the right to go
the right way". Has "right to go the" been inserted after "the", or has
"to go the right" been inserted after "right"? Somehow your algorithm
will have to decide. And that is only a single insertion; with multiple
ones the problem becomes more formidable.
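
To make the ambiguity concrete, here is a minimal sketch. It is in Python purely for illustration, and the helper name insert_at is made up for this example. Both readings of the insertion rebuild exactly the same longer string, so nothing in the text itself can tell them apart.

```python
base = "the right way"
target = "the right to go the right way"

def insert_at(text, pos, piece):
    """Return text with piece inserted at character offset pos (hypothetical helper)."""
    return text[:pos] + piece + text[pos:]

# Reading 1: "right to go the" was inserted after "the".
reading_1 = insert_at(base, len("the "), "right to go the ")

# Reading 2: "to go the right" was inserted after "right".
reading_2 = insert_at(base, len("the right "), "to go the right ")

# Both readings yield the same result, so either answer is "correct".
print(reading_1 == target, reading_2 == target)
```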

The diff program (Unix) tackles these problems on a line-by-line basis.
A possible approach would be to split your files into one-word lines,
run them through diff and interpret the output. There are also modules
on CPAN that incorporate the diff algorithm without an external program.
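
A rough sketch of the one-word-per-line idea follows. Note the substitutions: it is in Python rather than Perl, and it uses the standard difflib module in place of Unix diff or a CPAN module, so the matching strategy is not identical to diff's. The function name inserted_words is made up for this example, and it assumes file1 contains no text that is missing from file2.

```python
import difflib

def inserted_words(short_text, long_text):
    """Return the runs of words that appear in long_text but not in short_text."""
    a = short_text.split()   # splitting on any whitespace ignores the formatting
    b = long_text.split()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    runs = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "insert" means words only in b; "replace" is kept too, as a
        # signal that the two texts differ by more than pure insertions.
        if tag in ("insert", "replace"):
            runs.append(b[j1:j2])
    return runs

file1 = (
    "Divitias alius fulvo sibi congerat auro Et teneat culti iugera multa soli,\n"
    "Me mea paupertas vita traducat inerti, Dum meus adsiduo luceat igne focus.\n"
)
file2 = (
    "Divitias alius fulvo sibi congerat auro\n"
    "Et teneat culti iugera multa soli,\n"
    "Quem labor adsiduus vicino terreat hoste,\n"
    "Martia cui somnos classica pulsa fugent.\n"
    "Me mea paupertas vita traducat inerti,\n"
    "Dum meus adsiduo luceat igne focus.\n"
)

for run in inserted_words(file1, file2):
    print(" ".join(run))
```

On the example from the question this reports a single run holding the twelve words of lines 3 and 4, but with the original line breaks lost.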

Ambiguities like the one above will be resolved in one way or another.
If it matters, a manual check can't be avoided. You will also have the
problem of putting the insertions back into their original format, but
that should be solvable.
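
One way to get the original format back, sketched under the same assumptions as before (Python and difflib as stand-ins, a made-up function name inserted_spans): remember the character offsets of every word in file2, diff the words only, then cut the inserted region out of the untouched file2 text.

```python
import difflib
import re

def inserted_spans(short_text, long_text):
    """Return inserted regions of long_text with their inner whitespace intact."""
    a = short_text.split()
    # Keep a match object per word so each word's offsets in long_text are known.
    tokens = list(re.finditer(r"\S+", long_text))
    b = [m.group() for m in tokens]
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    spans = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("insert", "replace"):
            start = tokens[j1].start()      # first character of the first new word
            end = tokens[j2 - 1].end()      # one past the last new word
            spans.append(long_text[start:end])
    return spans

short = "a b c d"
long = "a b\n\tX  Y\nc d\n"
print(inserted_spans(short, long))
```

Whitespace between the first and last inserted word is preserved exactly as in file2. Whitespace before the first inserted word (the tab in this toy input) is not part of the span; if that matters, the start offset could be moved back to the beginning of its line.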

Anno
 
