How to compare two large text files?

Discussion in 'Java' started by www, Jun 19, 2007.

  1. www

    www Guest

    Hi,

    I have two text files, text1.xml and text2.xml. text1.xml is the
    benchmark file and text2.xml is generated by my Java program. In my JUnit
    test, I want to compare the generated text2.xml against the benchmark
    file text1.xml. The benchmark file is much bigger than text2.xml because
    it contains additional lines. Every line in text2.xml, except a few lines
    containing timing information, should have a match in text1.xml.

    In other words, lines 2 to 1101 in text2.xml should be identical to
    lines 850 to 1949 in text1.xml, and lines 1103 through the end of
    text2.xml should be identical to lines 2799 through the end of text1.xml.

    One way to compare is a for loop that uses these hard-coded line
    numbers and compares line by line. But I hate hard-coding the line
    numbers.

    I tried the approach below, but it is too slow:

    1) Append all the lines in text1.xml into a single huge String, e.g.
    benchLine;
    2) Append lines 2 to 1101 in text2.xml into another huge String,
    e.g. patternLine;
    3) Pattern pattern = Pattern.compile(".*" + patternLine + ".*"); // use
    the huge line to generate a regular expression pattern
    pattern.matcher(benchLine).find(); // hope it is true
    4) Repeat step 2) with lines 1103 to the end of text2.xml, then step 3).

    But this is far too slow.

    Thank you for your help.
    www, Jun 19, 2007
    #1

  2. Oliver Wong

    Oliver Wong Guest

    www wrote in message
    news:f591ep$u2h$...
    [...]
    >
    > In another words, lines 2 to 1101 in text2.xml should be identical with
    > lines 850 to 1949 in text1.xml; lines 1103 all the way to the end in
    > text2.xml should be identical with lines from 2799 to the end in
    > text1.xml.
    >
    > One way for comparison is using for loop and with these hard-coded line
    > number, compare line by line. But I hate the hard-coded line number.
    >
    > I tried the way below, but is too slow:
    >
    > 1)append all the lines in text1.xml into a huge single String line, e.g.
    > benchLine;
    > 2)append lines 2 to 1101 in text2.xml into another huge String line,
    > e.g. patternLine;
    > 3) Pattern pattern = Pattern.compile(".*" + patternLine + ".*"); // use
    > the huge line to generate a regular expression pattern
    > pattern.matcher(benchLine).find(); // hope it is true
    > 4)repeat step 2) but with lines 1103 to the end in text2.xml and step 3
    >
    > But this is tooooo slow.


    If the requirement is "lines 2 to 1101 in text2 should be identical
    to lines 850 to 1949 in text1; etc.", then I recommend you hardcode the
    line numbers.
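
    A minimal sketch of that hard-coded comparison, assuming Java 7 or
    later for Files.readAllLines and that both files fit in memory. The
    class name, the constant names, and the sameRange helper are
    hypothetical; the line numbers are the ones from the question.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

final class FixedRangeCompare {
    // 1-based, inclusive line numbers taken from the original question.
    static final int GENERATED_FIRST = 2;
    static final int GENERATED_LAST = 1101;
    static final int BENCHMARK_FIRST = 850;

    /**
     * Returns true if lines [aFirst, aLast] of a equal the same number of
     * lines of b starting at bFirst. All line numbers are 1-based.
     */
    static boolean sameRange(List<String> a, int aFirst, int aLast,
                             List<String> b, int bFirst) {
        int count = aLast - aFirst + 1;
        int bLast = bFirst + count - 1;
        if (aFirst < 1 || count < 0 || aLast > a.size()
                || bFirst < 1 || bLast > b.size()) {
            return false; // a range runs off the end of one of the files
        }
        // subList takes a 0-based start (inclusive) and end (exclusive).
        return a.subList(aFirst - 1, aLast).equals(b.subList(bFirst - 1, bLast));
    }

    public static void main(String[] args) throws IOException {
        List<String> benchmark =
                Files.readAllLines(Paths.get("text1.xml"), StandardCharsets.UTF_8);
        List<String> generated =
                Files.readAllLines(Paths.get("text2.xml"), StandardCharsets.UTF_8);
        System.out.println(sameRange(generated, GENERATED_FIRST, GENERATED_LAST,
                benchmark, BENCHMARK_FIRST));
    }
}
```

    The second block from the question (line 1103 to the end of text2.xml
    against line 2799 to the end of text1.xml) would be a second call with
    generated.size() as the last line.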

    If the requirements are "lines 2 to 1101 should be identical with some
    portion of text1, but I don't want to specify where exactly", then you
    STILL have to hardcode the line numbers "2" and "1101", if not the line
    numbers of text1.
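
    For this second reading, one possible sketch uses
    Collections.indexOfSubList to locate the block in text1 without
    knowing its position. It compares whole lines with equals, so no regex
    is involved. The class and method names are hypothetical, and the
    lists in main are toy data standing in for the two files.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class FindBlock {
    /**
     * Returns the 1-based line number in benchmark where lines
     * [first, last] (1-based, inclusive) of generated begin, or -1 if
     * they never occur there as a contiguous run.
     */
    static int findBlock(List<String> benchmark, List<String> generated,
                         int first, int last) {
        List<String> block = generated.subList(first - 1, last);
        int index = Collections.indexOfSubList(benchmark, block); // 0-based, or -1
        return index < 0 ? -1 : index + 1;
    }

    public static void main(String[] args) {
        List<String> benchmark = Arrays.asList("<a>", "<b>", "<c>", "<d>");
        List<String> generated = Arrays.asList("<time/>", "<b>", "<c>");
        System.out.println(findBlock(benchmark, generated, 2, 3)); // prints 2
    }
}
```

    indexOfSubList is documented as a brute-force scan, which should still
    be quick for files of a few thousand lines.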

    So I recommend you hardcode the line numbers.

    - Oliver
    Oliver Wong, Jun 20, 2007
    #2

  3. Christopher Benson-Manica

    www wrote:

    > One way for comparison is using for loop and with these hard-coded line
    > number, compare line by line. But I hate the hard-coded line number.


    Why not specify them in a properties file, or possibly allow them to be
    specified by command-line arguments? That should be fairly easy.
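
    A rough sketch of the properties-file idea. The property keys
    (generated.first, generated.last, benchmark.first) and the class name
    are made up for illustration; the defaults are the line numbers from
    the question. UncheckedIOException needs Java 8 or later.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

final class LineRangeConfig {
    final int generatedFirst;
    final int generatedLast;
    final int benchmarkFirst;

    LineRangeConfig(Properties props) {
        // Hypothetical keys; fall back to the numbers from the question.
        generatedFirst = intValue(props, "generated.first", "2");
        generatedLast = intValue(props, "generated.last", "1101");
        benchmarkFirst = intValue(props, "benchmark.first", "850");
    }

    private static int intValue(Properties props, String key, String fallback) {
        // trim() guards against trailing spaces left in the file.
        return Integer.parseInt(props.getProperty(key, fallback).trim());
    }

    /** Reads the line numbers from any Reader, e.g. a FileReader in a real test. */
    static LineRangeConfig load(Reader reader) {
        Properties props = new Properties();
        try {
            props.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new LineRangeConfig(props);
    }

    public static void main(String[] args) {
        LineRangeConfig cfg = load(new StringReader(
                "generated.first=2\ngenerated.last=1101\nbenchmark.first=850\n"));
        System.out.println(cfg.generatedFirst + ".." + cfg.generatedLast
                + " starts at " + cfg.benchmarkFirst);
    }
}
```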

    --
    C. Benson Manica | I *should* know what I'm talking about - if I
    cbmanica(at)gmail.com | don't, I need to know. Flames welcome.
    Christopher Benson-Manica, Jun 20, 2007
    #3
  4. Roedy Green

    Roedy Green Guest

    On Tue, 19 Jun 2007 12:49:29 -0400, www wrote, quoted
    or indirectly quoted someone who said:

    >One way for comparison is using for loop and with these hard-coded line
    >number, compare line by line. But I hate the hard-coded line number.


    If you don't burn in a line number, you need to search for some
    pattern.

    Here are some ways you can proceed. This is a Chinese menu:

    1. Use the SPLIT utility. Embed commands in your files for SPLIT to
    work on to split out the useful sections, then have it extract the two
    pieces that should be identical. SPLIT is very fast. See
    http://mindprod.com/products.html#SPLIT

    2. Count lines and squirt out the extracted juice to two files. Make
    the line counts named constants so you won't feel so guilty, or
    perhaps calculated named constants to assuage your guilt even further.

    3. Scan for patterns using indexOf or regex and split the file.
    See http://mindprod.com/jgloss/string.html
    http://mindprod.com/jgloss/regex.html

    4. Calculate a checksum of just the interesting parts and compare the
    checksums. See http://mindprod.com/products1.html#UNTOUCH for code to
    calculate a fast Adler checksum. See
    http://mindprod.com/jgloss/adler.html
    http://mindprod.com/jgloss/md5.html

    5. Compare the two extracts with the MS FC (File Compare) utility.

    6. Compare the two parts by first comparing length. Then read and
    compare chunk by chunk, reading raw bytes. You just want a boolean,
    not the offset or the text differences. See
    http://mindprod.com/products1.html#HUNKIO

    7. Feed the two extracts into a DIFF program.
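
    As an illustration of option 4, here is a small sketch that uses the
    JDK's own java.util.zip.Adler32 instead of the UNTOUCH code linked
    above. The class and method names are hypothetical. Equal checksums
    only suggest that the extracts match; differing checksums prove they
    do not.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.zip.Adler32;

final class ExtractChecksum {
    /** Adler-32 over the lines, each followed by '\n' so line breaks count. */
    static long checksum(List<String> lines) {
        Adler32 adler = new Adler32();
        for (String line : lines) {
            adler.update(line.getBytes(StandardCharsets.UTF_8));
            adler.update('\n');
        }
        return adler.getValue();
    }

    /** Cheap check: same number of lines and same checksum. */
    static boolean probablyEqual(List<String> a, List<String> b) {
        return a.size() == b.size() && checksum(a) == checksum(b);
    }

    public static void main(String[] args) {
        List<String> extract1 = Arrays.asList("<a>", "<b>");
        List<String> extract2 = Arrays.asList("<a>", "<b>");
        System.out.println(probablyEqual(extract1, extract2)); // prints true
    }
}
```

    This mostly pays off if the checksum of the benchmark extract is
    stored once and reused; with both extracts already in memory, a plain
    List.equals gives an exact answer.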
    --
    Roedy Green Canadian Mind Products
    The Java Glossary
    http://mindprod.com
    Roedy Green, Jun 29, 2007
    #4