nebiyou1
I have two CSV files (file1 and file2).
Each file contains more than 50,000 lines, and each line is very long.
The first entry in each line is a key, as shown in the example below.
Example:
key1,val1,val2,val3....
key2,val1,val2,val3.....
etc.
I would like to do a Unix-like "diff" between file1 and file2 and
find out which lines are different. The entries in each file are not
necessarily sorted by key.
Approaches I have considered:
1. Load all the entries into a hash map, using the key as the map key and
the rest of the line as the value. This is a problem because of memory.
2. Generate the hashcode for each line, store the key and the hashcode
in a hash map, and do the comparison. The problem here is that
two different strings could have the same hashcode, so a changed line
could be reported as identical (false positive).
Any suggestions would be appreciated.
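
To make approach 2 concrete, here is a minimal sketch in Python of a variant that stores a SHA-256 digest of each line instead of a short hashcode. No SHA-256 collision is publicly known, so a false match becomes extremely unlikely in practice, though not impossible. The function names (load_digests, diff_by_key) are made up for illustration, and the sketch assumes that keys are unique within a file, that keys contain no commas or quoting, and that the files are UTF-8.

```python
import hashlib


def load_digests(path):
    """Map each key (first CSV field) to a SHA-256 digest of the rest of the line."""
    digests = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            # Assumption: the key is everything before the first comma.
            key, _, rest = line.partition(",")
            digests[key] = hashlib.sha256(rest.encode("utf-8")).digest()
    return digests


def diff_by_key(path1, path2):
    """Return (keys only in file1, keys only in file2, keys whose lines differ)."""
    d1 = load_digests(path1)
    d2 = load_digests(path2)
    only_in_1 = sorted(d1.keys() - d2.keys())
    only_in_2 = sorted(d2.keys() - d1.keys())
    changed = sorted(k for k in d1.keys() & d2.keys() if d1[k] != d2[k])
    return only_in_1, only_in_2, changed
```

Each entry costs the key plus a 32-byte digest rather than the whole line, so memory no longer grows with line length. Because the lookup is by key, the order of the lines in either file does not matter.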