How to compare two large text files?

www · Jun 19, 2007

Hi,

I have two text files, text1.xml and text2.xml. text1.xml is the benchmark
file and text2.xml is generated by my Java program. In my JUnit
test, I want to compare the generated text2.xml against the benchmark
file text1.xml. The benchmark file text1.xml is much bigger than
text2.xml because it contains additional lines. Every line in text2.xml,
except a few lines containing timing information, should have a match in
text1.xml.

In other words, lines 2 to 1101 in text2.xml should be identical to
lines 850 to 1949 in text1.xml; lines 1103 all the way to the end of
text2.xml should be identical to lines 2799 to the end of text1.xml.

One way to do the comparison is a for loop that uses these hard-coded line
numbers and compares line by line. But I hate hard-coding the line numbers.

I tried the approach below, but it is too slow:

1) Append all the lines in text1.xml into one huge String, e.g. benchLine.
2) Append lines 2 to 1101 of text2.xml into another huge String,
e.g. patternLine.
3) Use the huge line to generate a regular expression pattern:
Pattern pattern = Pattern.compile(".*" + patternLine + ".*");
pattern.matcher(benchLine).find(); // hope it is true
4) Repeat step 2) with lines 1103 to the end of text2.xml, then step 3).

But this is far too slow.

Thank you for your help.

Oliver Wong · Jun 20, 2007

www said:
[...]

In other words, lines 2 to 1101 in text2.xml should be identical to
lines 850 to 1949 in text1.xml; lines 1103 all the way to the end of
text2.xml should be identical to lines 2799 to the end of
text1.xml.

One way to do the comparison is a for loop that uses these hard-coded line
numbers and compares line by line. But I hate hard-coding the line numbers.

If the requirements are "lines 2 to 1101 in text2 should be identical
to lines 850 to 1949 in text1; etc.", then I recommend you hardcode the
line numbers.

If the requirements are "lines 2 to 1101 should be identical to some
portion of text1, but I don't want to specify where exactly", then you
STILL have to hardcode the line numbers "2" and "1101", even if you don't
hardcode the line numbers of text1.

So I recommend you hardcode the line numbers.
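
A minimal sketch of the hard-coded comparison, assuming both files fit in memory as lists of lines. The class name RangeCompare, the method sameBlock, and the constant names are made up for illustration; the line numbers are the ones from the original post.

```java
import java.util.List;

class RangeCompare {
    // 1-based, inclusive line numbers taken from the original post.
    static final int GEN_FIRST = 2;      // first line of the block in text2.xml
    static final int GEN_LAST = 1101;    // last line of the block in text2.xml
    static final int BENCH_FIRST = 850;  // where the block starts in text1.xml

    /**
     * Returns true if lines genFirst..genLast of generated are equal to the
     * same number of lines of bench starting at benchFirst.
     * All line numbers are 1-based and inclusive.
     */
    static boolean sameBlock(List<String> generated, int genFirst, int genLast,
                             List<String> bench, int benchFirst) {
        int count = genLast - genFirst + 1;
        if (genFirst < 1 || benchFirst < 1 || count < 0) {
            return false;
        }
        if (genLast > generated.size() || benchFirst - 1 + count > bench.size()) {
            return false; // a range runs past the end of a file
        }
        // subList takes a 0-based start (inclusive) and end (exclusive).
        return generated.subList(genFirst - 1, genLast)
                .equals(bench.subList(benchFirst - 1, benchFirst - 1 + count));
    }
}
```

With the lines loaded through java.nio.file.Files.readAllLines, the first check would be sameBlock(text2Lines, GEN_FIRST, GEN_LAST, text1Lines, BENCH_FIRST). For the second block, genLast would be text2Lines.size(). Lines outside the ranges are never compared.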

- Oliver

Christopher Benson-Manica · Jun 20, 2007

www said:
One way to do the comparison is a for loop that uses these hard-coded line
numbers and compares line by line. But I hate hard-coding the line numbers.

Why not specify them in a properties file, or possibly allow them to be
specified by command-line arguments? That should be fairly easy.
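
As a rough sketch of the properties-file idea: the key names generated.first, generated.last and benchmark.first, and the class LineRanges, are hypothetical and only meant to show the shape of it.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.Properties;

class LineRanges {
    final int genFirst;   // first line of the block in the generated file
    final int genLast;    // last line of the block in the generated file
    final int benchFirst; // first line of the block in the benchmark file

    LineRanges(Properties p) {
        genFirst = required(p, "generated.first");
        genLast = required(p, "generated.last");
        benchFirst = required(p, "benchmark.first");
    }

    /** Reads the ranges from a properties source, e.g. a FileReader. */
    static LineRanges load(Reader in) throws IOException {
        Properties p = new Properties();
        p.load(in);
        return new LineRanges(p);
    }

    // Fails with a clear message instead of a NullPointerException
    // when a key is missing from the file.
    private static int required(Properties p, String key) {
        String value = p.getProperty(key);
        if (value == null) {
            throw new IllegalArgumentException("missing property: " + key);
        }
        return Integer.parseInt(value.trim());
    }
}
```

The matching properties file would then hold three lines such as generated.first=2, generated.last=1101 and benchmark.first=850, so the test code itself carries no line numbers.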

Roedy Green · Jun 29, 2007

www said:
One way to do the comparison is a for loop that uses these hard-coded line
numbers and compares line by line. But I hate hard-coding the line numbers.

If you don't burn in a line number, you need to search for some
pattern.

Here are some ways you can proceed. This is a Chinese menu:

1. Use the SPLIT utility. Embed commands in your files for SPLIT to
work on to split out the useful sections, then have it extract the two
pieces that should be identical. SPLIT is very fast. See
http://mindprod.com/products.html#SPLIT

2. Count lines and squirt out the extracted juice to two files. Make
the line counts named constants so you won't feel so guilty, or
perhaps calculated named constants to assuage your guilt even further.

3. Scan for patterns using indexOf or regex and split the file.
See http://mindprod.com/jgloss/string.html
http://mindprod.com/jgloss/regex.html

4. Calculate a checksum of just the interesting parts and compare the
checksums. See http://mindprod.com/products1.html#UNTOUCH for code to
calculate a fast Adler checksum. See
http://mindprod.com/jgloss/adler.html
http://mindprod.com/jgloss/md5.html

5. Compare the two extracts with MS FC (File Compare).

6. Compare the two parts by first comparing lengths. Then read and
compare chunk by chunk, reading raw bytes. You just want a boolean,
not the offset or the text differences. See
http://mindprod.com/products1.html#HUNKIO

7. Feed the two extracts into a DIFF program.
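
Options 3 and 4 can be sketched with the standard library alone, assuming both files are small enough to hold in memory as lists of lines. The class ExtractCompare and its method names are invented for this example and are not taken from the linked pages. Collections.indexOfSubList compares lines literally, so no regex is involved and XML characters in the text need no escaping.

```java
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.List;
import java.util.zip.Adler32;

class ExtractCompare {
    /**
     * Option 3: returns the 0-based index in bench at which block occurs
     * as a run of consecutive, identical lines, or -1 if it never does.
     */
    static int findBlock(List<String> bench, List<String> block) {
        return Collections.indexOfSubList(bench, block);
    }

    /**
     * Option 4: Adler-32 checksum over the given lines, each followed by
     * a newline so that line boundaries are part of the checksum.
     */
    static long checksum(List<String> lines) {
        Adler32 adler = new Adler32();
        for (String line : lines) {
            adler.update(line.getBytes(StandardCharsets.UTF_8));
            adler.update('\n');
        }
        return adler.getValue();
    }
}
```

With findBlock, only the text2.xml line numbers need to be known; the position in text1.xml is discovered. Equal checksums make equality very likely but do not prove it, so a final line-by-line comparison is still the safe check when the checksums match.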
