How to compare two large text files?

www

Hi,

I have two text files, text1.xml and text2.xml. text1.xml is the
benchmark file and text2.xml is generated by my Java program. In my JUnit
test, I want to compare the generated text2.xml against the benchmark
file text1.xml. The benchmark file text1.xml is much bigger than
text2.xml, because it has other lines in it. All the lines in text2.xml,
except a few lines containing timing information, should find a match in
text1.xml.

In other words, lines 2 to 1101 in text2.xml should be identical to
lines 850 to 1949 in text1.xml; lines 1103 to the end of text2.xml
should be identical to lines 2799 to the end of text1.xml.

One way to compare them is a for loop that uses these hard-coded line
numbers and compares line by line. But I hate hard-coded line numbers.

I tried the approach below, but it is too slow:

1) Append all the lines in text1.xml into one huge String, e.g.
benchLine;
2) Append lines 2 to 1101 of text2.xml into another huge String,
e.g. patternLine;
3) Pattern pattern = Pattern.compile(".*" + patternLine + ".*"); // use
the huge line to generate a regular expression pattern
pattern.matcher(benchLine).find(); // hope it is true
4) Repeat step 2) with lines 1103 to the end of text2.xml, then step 3).

But this is far too slow.
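
A note on the approach above: the regex is probably doing more work than needed. The leading and trailing ".*" are redundant with find(), "." does not match line terminators by default, and XML text pasted into a pattern without Pattern.quote() is interpreted as regex syntax. If the question is only "does this block of lines appear contiguously in the benchmark?", a plain list search avoids regex entirely. The following is a minimal sketch, not a definitive implementation; the class name BlockSearch and the method findBlock are made up for illustration, and in a real test the lists would come from something like Files.readAllLines.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class BlockSearch {
    /**
     * Returns the 0-based index in bench where block first appears as a
     * contiguous run of lines, or -1 if it does not appear at all.
     */
    static int findBlock(List<String> bench, List<String> block) {
        return Collections.indexOfSubList(bench, block);
    }

    public static void main(String[] args) {
        // Tiny stand-ins for the lines of text1.xml and text2.xml.
        List<String> bench = Arrays.asList("<a>", "<b>", "<c>", "<d>");
        List<String> generated = Arrays.asList("<b>", "<c>");
        System.out.println(findBlock(bench, generated) >= 0);
    }
}
```

The returned index plus one is the line number in the benchmark where the block starts, so the benchmark-side line numbers would not need to be written down anywhere.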

Thank you for your help.
 
Oliver Wong

[...]
In other words, lines 2 to 1101 in text2.xml should be identical to
lines 850 to 1949 in text1.xml; lines 1103 to the end of text2.xml
should be identical to lines 2799 to the end of text1.xml.

One way to compare them is a for loop that uses these hard-coded line
numbers and compares line by line. But I hate hard-coded line numbers.

I tried the approach below, but it is too slow:

1) Append all the lines in text1.xml into one huge String, e.g.
benchLine;
2) Append lines 2 to 1101 of text2.xml into another huge String,
e.g. patternLine;
3) Pattern pattern = Pattern.compile(".*" + patternLine + ".*"); // use
the huge line to generate a regular expression pattern
pattern.matcher(benchLine).find(); // hope it is true
4) Repeat step 2) with lines 1103 to the end of text2.xml, then step 3).

But this is far too slow.

If the requirements are "lines 2 to 1101 in text2 should be identical
to lines 850 to 1949 in text1; etc.", then I recommend you hardcode the
line numbers.

If the requirements are "lines 2 to 1101 should be identical with some
portion of text1, but I don't want to specify where exactly", then you
STILL have to hardcode the line numbers "2" and "1101", if not the line
numbers of text1.

So I recommend you hardcode the line numbers.
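
To make that concrete, here is a rough sketch of a range comparison with the line numbers kept as named constants. The class name RangeCompare, the method rangeMatches, and the constant names are hypothetical; line numbers are 1-based and inclusive, as in the original post, and the lists are assumed to hold the lines of each file.

```java
import java.util.Arrays;
import java.util.List;

public class RangeCompare {
    // 1-based, inclusive line numbers taken from the original post.
    static final int GEN_FIRST = 2;
    static final int GEN_LAST = 1101;
    static final int BENCH_FIRST = 850;

    /**
     * True if lines genFirst..genLast of generated are identical to the
     * same number of lines in bench starting at line benchFirst.
     */
    static boolean rangeMatches(List<String> generated, int genFirst, int genLast,
                                List<String> bench, int benchFirst) {
        int count = genLast - genFirst + 1;
        int benchLast = benchFirst + count - 1;
        // Reject empty or out-of-bounds ranges instead of throwing.
        if (count <= 0 || genFirst < 1 || genLast > generated.size()
                || benchFirst < 1 || benchLast > bench.size()) {
            return false;
        }
        // subList takes a 0-based start (inclusive) and end (exclusive).
        return generated.subList(genFirst - 1, genLast)
                .equals(bench.subList(benchFirst - 1, benchLast));
    }

    public static void main(String[] args) {
        // Small stand-in data; a real test would pass the file contents
        // together with GEN_FIRST, GEN_LAST and BENCH_FIRST.
        List<String> generated = Arrays.asList("timing", "a", "b");
        List<String> bench = Arrays.asList("p", "q", "a", "b", "r");
        System.out.println(rangeMatches(generated, 2, 3, bench, 3));
    }
}
```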

- Oliver
 
Christopher Benson-Manica

www said:
One way to compare them is a for loop that uses these hard-coded line
numbers and compares line by line. But I hate hard-coded line numbers.

Why not specify them in a properties file, or possibly allow them to be
specified by command-line arguments? That should be fairly easy.
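
A sketch of the properties-file idea, under some assumptions: the keys gen.first, gen.last and bench.first are invented for this example, as is the class name RangeConfig. It reads from any Reader, so a test can load it from a file or, as here, from a string.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class RangeConfig {
    final int genFirst;
    final int genLast;
    final int benchFirst;

    RangeConfig(Properties p) {
        genFirst = intValue(p, "gen.first");
        genLast = intValue(p, "gen.last");
        benchFirst = intValue(p, "bench.first");
    }

    // Fails with a clear message if a key is missing.
    private static int intValue(Properties p, String key) {
        String value = p.getProperty(key);
        if (value == null) {
            throw new IllegalArgumentException("missing property: " + key);
        }
        return Integer.parseInt(value.trim());
    }

    static RangeConfig load(Reader in) {
        Properties p = new Properties();
        try {
            p.load(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new RangeConfig(p);
    }

    public static void main(String[] args) {
        // Stand-in for the contents of a .properties file.
        String text = "gen.first=2\ngen.last=1101\nbench.first=850\n";
        RangeConfig c = load(new StringReader(text));
        System.out.println(c.genFirst + ".." + c.genLast + " vs bench from " + c.benchFirst);
    }
}
```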
 
Roedy Green

One way to compare them is a for loop that uses these hard-coded line
numbers and compares line by line. But I hate hard-coded line numbers.

If you don't burn in a line number, you need to search for some
pattern.

Here are some ways you can proceed. This is a Chinese menu:

1. Use the SPLIT utility. Embed commands in your files for SPLIT to
work on to split out the useful sections. Then have it extract the two
pieces that should be identical. Split is very fast. See
http://mindprod.com/products.html#SPLIT

2. Count lines and squirt out the extracted juice to two files. Make
the line counts named constants so you won't feel so guilty, or
perhaps calculated named constants to assuage your guilt even further.

3. Scan for patterns using indexOf or regex and split the file.
See http://mindprod.com/jgloss/string.html
http://mindprod.com/jgloss/regex.html

4. Calculate a checksum of just the interesting parts and compare the
checksums. See http://mindprod.com/products1.html#UNTOUCH for code to
calculate a fast Adlerian checksum. See
http://mindprod.com/jgloss/adler.html
http://mindprod.com/jgloss/md5.html

5. Compare the two extracts with the MS FC (File Compare) utility.

6. Compare the two parts by first comparing length. Then read and
compare chunk by chunk, reading raw bytes. You just want a boolean,
not the offset or the text differences. See
http://mindprod.com/products1.html#HUNKIO

7. Feed the two extracts into a DIFF program.
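
As an illustration of option 4, here is a minimal sketch that checksums a run of lines with the JDK's own java.util.zip.Adler32 rather than the UNTOUCH code linked above. The class name LineChecksum and the method checksum are made up for this example. Equal checksums do not prove the text is identical, so treat a match as a quick filter and a mismatch as a definite difference.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.zip.Adler32;

public class LineChecksum {
    /** Adler-32 checksum over the given lines, each followed by '\n'. */
    static long checksum(List<String> lines) {
        Adler32 adler = new Adler32();
        for (String line : lines) {
            adler.update(line.getBytes(StandardCharsets.UTF_8));
            // Feed a separator so ["ab"] and ["a", "b"] hash differently.
            adler.update('\n');
        }
        return adler.getValue();
    }

    public static void main(String[] args) {
        // Stand-ins for the two extracted sections.
        List<String> fromBench = Arrays.asList("<a>", "<b>");
        List<String> fromGenerated = Arrays.asList("<a>", "<b>");
        System.out.println(checksum(fromBench) == checksum(fromGenerated));
    }
}
```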
 
