Java algorithm (diff) question.

nebiyou1

I have two CSV files (file1 and file2).
Each file contains 50,000 + lines. Each line is very long.
The first entry in each line is a key as shown in the example below.

Example:
key1,val1,val2,val3....
key2,val1,val2,val3.....
etc

I would like to do a "diff"(unix like) between file1 and file2 and
find out
the lines that are different. The entries in each file are not
necessarily sorted by key.

Approaches I have considered:
1. Load all the entries into a HashMap, using the keys as map keys and the rest of each line as the value. This is a problem because of memory.

2. Generate the hashcode of each line and store the key and the hashcode in a HashMap, then compare the maps. The problem here is that two different strings can have the same hashcode (a false positive).

Any suggestions would be appreciated.
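Approach 2 would look roughly like this (a sketch only; the class and method names are mine, and it only detects keys changed or missing on one side):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of approach 2: map each key to the hashCode() of the rest of its line,
// then compare the two maps key by key.
public class HashCodeDiff {
    // Split a CSV line at the first comma and record rest.hashCode() under the key.
    static Map<String, Integer> index(Iterable<String> lines) {
        Map<String, Integer> map = new HashMap<>();
        for (String line : lines) {
            int comma = line.indexOf(',');
            map.put(line.substring(0, comma),
                    line.substring(comma + 1).hashCode());
        }
        return map;
    }

    // Keys of `a` whose hashcode differs from (or is missing in) `b`.
    // Note: keys that appear only in `b` are not reported by this one-sided pass.
    static Set<String> changedKeys(Map<String, Integer> a, Map<String, Integer> b) {
        Set<String> changed = new HashSet<>();
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            if (!e.getValue().equals(b.get(e.getKey()))) changed.add(e.getKey());
        }
        return changed;
    }
}
```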
 
nebiyou1

Made a clarification to the original post: the LINES in each file are not necessarily sorted by key.
 
Patricia Shanahan

Actually I can use the key+hashcode. Thanks for reading :)

I find the final remark a little bit worrying. If the key alone is
sufficient, why use hash code at all? On the other hand, if the key is
not unique, what prevents a hash code collision among lines with the
same key?

Patricia
 
Roedy Green

I would like to do a "diff"(unix like) between file1 and file2 and
find out
the lines that are different. The entries in each file are not
necessarily sorted by key.

If these files are too huge to put in RAM, here is a rough outline of
a diff algorithm.

Create entries like this:
line#, offset, len, digest-64

for each file.
sort by digest, line#

If the files are identical, you should see two identical lists of quads.

If they are not identical, you can identify the parts of the files that are identical in place, and also the parts that are identical but shifted.

You can then identify insertions and deletions, and build the traditional DIFF display of any portion of the two files.

The method assumes all lines in the original file are unique. If you have blank lines, or other non-unique lines, they need to be made unique.


see http://mindprod.com/jgloss/digest.html
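A rough sketch of the quad-building step in Java. The original outline just says "digest-64" without naming one, so 64-bit FNV-1a stands in here; the record name and field layout are mine:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Build one (line#, offset, len, digest) quad per line, then sort by (digest, line#).
// Identical files produce identical sorted quad lists.
public class QuadBuilder {
    record Quad(int lineNo, long offset, int len, long digest) {}

    // 64-bit FNV-1a over the line's characters (stand-in for any 64-bit digest).
    static long fnv1a64(String s) {
        long h = 0xcbf29ce484222325L;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 0x100000001b3L;
        }
        return h;
    }

    static List<Quad> quads(List<String> lines) {
        List<Quad> out = new ArrayList<>();
        long offset = 0;
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i);
            out.add(new Quad(i, offset, line.length(), fnv1a64(line)));
            offset += line.length() + 1;  // +1 for the newline
        }
        out.sort(Comparator.comparingLong(Quad::digest).thenComparingInt(Quad::lineNo));
        return out;
    }
}
```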
 
NebiyouGirma

If these files are too huge to put in RAM, here is a rough outline of
a diff algorithm.

Create entries like this:
line#, offset, len, digest-64

for each file.
sort by digest, line#

If the files are identical you should see

two identical lists of quads.

If they are not identical, you can identify the parts of the files
that are identical, and also the parts that are identical, but
shifted.

You can then identify insertions and deletions. You can then
build the traditional DIFF display of any portion of the two files.

The method assumes all lines in the original file are unique. If you
have blank lines, or other non-unique lines, they need to be made
unique

see http://mindprod.com/jgloss/digest.html

Thanks. The lines in each file are guaranteed to be unique. Further,
each line has also a unique key to identify it.
My requirement is only to compare whether lines with the same key in file1
and file2 are different.

For this I can build a HashMap<Key, Hashcode> for each file and do the comparison.
My worry is that I could get a false positive: two different lines with
the same key whose hashcodes happen to match.

Does digest-64 guarantee that two different strings always hash to
different values? How would it help?
 
NebiyouGirma

Thanks. The lines in each file are guaranteed to be unique. Further,
each line has also a unique key to identify it. The lines are not
sorted by key, however.
My requirement is only to compare whether lines with the same key in file1
and file2 are different.

For this I can build a HashMap<Key, Hashcode> for each file (key = the key,
value = the hashcode of the line).
My worry is that I could get a false positive: two different lines with
the same key whose hashcodes happen to match.

Does digest-64 guarantee that two different strings always hash to
different values? How would it help?

Can you elaborate on your algorithm?
 
Lew

Thanks. The lines in each file are guaranteed to be unique. Further,
each line has also a unique key to identify it.
My requirement is only to compare if lines with the same key in file1
and file2.

For this I can build,
HashMap<Key,Key-hashcode> for each line and do the comparison.
My worry here is that there is a chance that I could get a false
positive for two strings of the same Key.

You just said that the key is unique; doesn't that mean that there cannot be
two Strings with the same "Key"?

Note that HashMap uses the hashCode() of each key to locate the linked list of
possible values for that key. Note also that any Map promises exactly one
entry for each key that is equivalent according to equals(). In other words,
in any Map the key provides a unique lookup.

It is pathologically unusual to store the hash code of an item as the value of
that item in a Map. What do you think this will allow you to accomplish?
Does digest-64 guarantee that two different strings always hash to
different values?

By the pigeonhole principle, whenever the hash function's domain is larger
than its range, at least two items from the domain must map to the same hash code.
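Such collisions are easy to exhibit with java.lang.String itself. Two distinct two-character strings can share a hashCode():

```java
// "Aa" and "BB" are different strings with the same String.hashCode():
// 'A'*31 + 'a' = 65*31 + 97 = 2112, and 'B'*31 + 'B' = 66*31 + 66 = 2112.
public class CollisionDemo {
    public static void main(String[] args) {
        String a = "Aa", b = "BB";
        System.out.println(a.hashCode());            // prints 2112
        System.out.println(b.hashCode());            // prints 2112
        System.out.println(a.equals(b));             // prints false
    }
}
```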
 
Roedy Green

Does digest-64 guarantee that two different strings always hash to
different values? How would it help?

A 64-bit digest collision between any given pair, with a properly designed
hash function, has probability 1 in 2^64, which is roughly one in
18,000,000,000,000,000,000. Your odds of winning the lottery are
considerably better than that.

Even with 32-bit hashes (properly implemented), the odds of a collision
are about one in 4 billion for any given pair, but the odds get worse as
you compare more pairs.
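The "odds get worse as you compare more pairs" effect is the birthday problem. A back-of-the-envelope sketch (the approximation p ≈ pairs / 2^bits is only valid while p is small; the class and method names are mine):

```java
// Birthday-bound estimate: with n lines there are n*(n-1)/2 pairs, so the
// overall chance of at least one collision is roughly pairs / 2^bits.
// For 50,000 lines a 32-bit hash already gives a substantial collision
// probability overall, while a 64-bit digest stays vanishingly small.
public class BirthdayBound {
    static double collisionProbability(long n, int bits) {
        double pairs = (double) n * (n - 1) / 2.0;
        return Math.min(1.0, pairs / Math.pow(2.0, bits));
    }
}
```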
 
Lew

Roedy said:
A 64-bit digest collision between any given pair, with a properly designed
hash function, has probability 1 in 2^64, which is roughly one in
18,000,000,000,000,000,000. Your odds of winning the lottery are
considerably better than that.

Even with 32-bit hashes (properly implemented), the odds of a collision
are about one in 4 billion for any given pair, but the odds get worse as
you compare more pairs.

I, too, am curious how the digest will help.
 
Ingo R. Homann

Hi,

I have two CSV files (file1 and file2).
Each file contains 50,000 + lines. Each line is very long.
The first entry in each line is a key as shown in the example below.

Example:
key1,val1,val2,val3....
key2,val1,val2,val3.....
etc

I would like to do a "diff"(unix like) between file1 and file2 and
find out
the lines that are different. The entries in each file are not
necessarily sorted by key.

Approaches I considered.
1. To load all the entries into a hashmap using the key's as a key and
the rest of the line as a value. This is a problem because of memory.

2. To generate the hashcode for each line and use the key and the
hashcode to
store in a hash map and do the comparison. The problem here is that
two different strings could have the same hashcode (false positive).

Any suggestions would be appreciated.

IMHO the only clean solution is to store the data in a database and then
iterate over all keys.

Ciao,
Ingo
 
Daniel Pitts

actually i can use the key+hashcode. Thanks for reading:)


My suggestion is to actually sort the files based on key, and then
compare the files one line at a time, only moving to the next line in
a particular file if the key for that line is less than the key for
the current line in the other file.

Of course, this only works if the keys have some sort of natural
ordering, and you can manage to sort the files in the first place.
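The lock-step walk over two key-sorted files could be sketched like this (illustrative only; lines are assumed to be "key,rest" and keys compare as strings):

```java
import java.util.ArrayList;
import java.util.List;

// Two-pointer merge over key-sorted inputs: advance whichever side has the
// smaller key; when keys match, compare the full lines.
public class SortedMergeDiff {
    static String key(String line) { return line.substring(0, line.indexOf(',')); }

    // Returns keys present in both files whose lines differ.
    static List<String> diff(List<String> sortedA, List<String> sortedB) {
        List<String> changed = new ArrayList<>();
        int i = 0, j = 0;
        while (i < sortedA.size() && j < sortedB.size()) {
            int cmp = key(sortedA.get(i)).compareTo(key(sortedB.get(j)));
            if (cmp < 0) {
                i++;                       // key only in A: advance A
            } else if (cmp > 0) {
                j++;                       // key only in B: advance B
            } else {
                if (!sortedA.get(i).equals(sortedB.get(j)))
                    changed.add(key(sortedA.get(i)));
                i++; j++;
            }
        }
        return changed;
    }
}
```

Keys that appear in only one file fall out of the `cmp < 0` / `cmp > 0` branches, so additions and deletions could be collected there as well.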
 
Rogan Dawes

Lew said:
I, too, am curious how the digest will help.

Perhaps I misunderstand the OP, but IIUC, he wants to be able to
identify new lines and deleted lines between two large unsorted files.
In particular, each line is very long.

Each line begins with a unique key (but I think that the rest of the
line MAY change?). So, rather than storing the unique key, and the whole
line (which is large), store the unique key, and the hash of the line,
which is a fixed length (128 - 256 bits, depending on hash). Depending
on the collision space, you may want to choose a better hash.

e.g.
HashMap<UniqueKey, MD5(line-UniqueKey)>
HashMap<UniqueKey, SHA256(line-UniqueKey)>

So, if the hashes for a given key differ between the two files, report the
difference; otherwise move on to the next key.

Your diff report will be in the HashMap's iteration order, of course. If
you have specific ordering requirements, use a TreeMap with a suitable
Comparator.
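A minimal sketch of that idea with the JDK's own MessageDigest (class and method names are mine; SHA-256 keeps only 32 bytes per line instead of the whole long line):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Map each key to the SHA-256 digest of the rest of the line, then compare
// digests per key across the two maps.
public class DigestDiff {
    static byte[] sha256(String s) {
        try {
            return MessageDigest.getInstance("SHA-256")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError("SHA-256 is required in every JDK", e);
        }
    }

    static Map<String, byte[]> index(Iterable<String> lines) {
        Map<String, byte[]> map = new HashMap<>();
        for (String line : lines) {
            int comma = line.indexOf(',');
            map.put(line.substring(0, comma), sha256(line.substring(comma + 1)));
        }
        return map;
    }

    // True if the line for this key differs, or is present in only one file.
    static boolean changed(Map<String, byte[]> a, Map<String, byte[]> b, String k) {
        return !Arrays.equals(a.get(k), b.get(k));
    }
}
```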

Rogan
 
Lew

Rogan said:
So, if the hashes for the unique key change, then report the difference,
otherwise move to the next item.

But if the hashes are the same, that is not a guarantee that the item has not
changed. However unlikely, it may have changed to another value that
hashes to the same code.

The odds of this might be acceptable, but it does require very careful choice
of hashing algorithm.
 
Rogan Dawes

Lew said:
But if the hashes are the same, it's not a guarantee that the item has
not changed. It may, unlikely but it may have changed to another value
that hashes to the same value.

The odds of this might be acceptable, but it does require very careful
choice of hashing algorithm.

Of course. That is why I mentioned using something like SHA-256, which
would be VERY unlikely to collide. 1 in 2^256 = 1.15792089 × 10^77

Even SHA-1 is unlikely to collide: 1 in 2^160 = 1.46150164 × 10^48

Rogan
 
bugbear

I have two CSV files (file1 and file2).
Each file contains 50,000 + lines. Each line is very long.
The first entry in each line is a key as shown in the example below.

Example:
key1,val1,val2,val3....
key2,val1,val2,val3.....
etc

I would like to do a "diff"(unix like) between file1 and file2 and
find out
the lines that are different. The entries in each file are not
necessarily sorted by key.

Start here:

http://citeseer.ist.psu.edu/myers86ond.html

BugBear
 
