hashing function

D

Dan Jones

I'm working with some large (several hundred megs) flat database files. I
need to examine the records for duplicates. Obviously, I don't want to
store several hundred megs of data in a hash. What I'd like to do is to
read each record, generate a hash value for the record, store that hash
value and an index key rather than storing the entire record, and look for
collisions in the hash value.

Perl obviously uses an internal hashing function to create it's hash
variables. Is it possible to access this function or to get the actual
hash value it produces? If not, any pointers to a module or information on
writing a hashing function in Perl would be appreciated. Hashing functions
usually involve low level bit twiddling. While it's probably possible to
do this directly in Perl (what isn't?), I don't know enough Perl to do it.
Right now, I'm looking at using a C function, then having to integrate that
with Perl. I'd really prefer to keep this a pure Perl script if I can.

I've been through the Camel, the Panther, and the Ram without finding
anything relevant. The Cookbook does mention that using hashes to search
for dupes is memory intensive if you have large records but doesn't provide
any alternatives that I could find. Searching for information on hashing
functions and Perl on the web has proven to be an exercise in futility due
to the naming collision with the variable type.

Thanks in advance for any assistance.
 
J

Juha Laiho

Dan Jones said:
I'm working with some large (several hundred megs) flat database files. I
need to examine the records for duplicates. Obviously, I don't want to
store several hundred megs of data in a hash. ....
If not, any pointers to a module or information on writing a hashing
function in Perl would be appreciated.

MD5 ans SHA1 are two commonly used hashing functions, and implementations
for both are available as perl modules.
 
X

Xavier Noria

Storing the hash code and an index is what the $hash{$record}++ idiom
does, the post seem to assume that the record itself is being stored in
the hash. Just to avoid a misconception, that's not the case.

Did you already try that simple solution and it wasn't suitable?

-- fxn
 
X

Xavier Noria

I don't know what was I thinking about, that's completely wrong. I am
very sorry.

-- fxn
 
X

xhoster

Dan Jones said:
I'm working with some large (several hundred megs) flat database files.
I need to examine the records for duplicates. Obviously, I don't want to
store several hundred megs of data in a hash.

I don't consider that to be at all obvious. I often want to and often do
store several hundred meg of data in a hash.

Is there no subset of the record which would be suitable for identifying
duplicates?
What I'd like to do is to
read each record, generate a hash value for the record, store that hash
value and an index key

You would potentially need a list of index keys, not just one index key.
Otherwise what would you do when different records collide on the
same hash values?
rather than storing the entire record, and look
for collisions in the hash value.

And then use the index values to go back and look up the whole records
and compare them properly? Wouldn't it be easier to use system sort
routines or a proper database server?
Perl obviously uses an internal hashing function to create it's hash
variables. Is it possible to access this function or to get the actual
hash value it produces?

This seems to work. No gaurantees:

#!/usr/bin/perl -wl

use strict;
use Inline 'C' ;

foreach (1..1_000_000) {
print calc($_);
};

__DATA__
__C__

size_t calc(SV* sv) {
char * c;
size_t size;
size_t hash;

c=SvPV(sv,size);
PERL_HASH(hash,c,size);
return hash;
};


If not, any pointers to a module or information
on writing a hashing function in Perl would be appreciated. Hashing
functions usually involve low level bit twiddling. While it's probably
possible to do this directly in Perl (what isn't?), I don't know enough
Perl to do it. Right now, I'm looking at using a C function, then having
to integrate that with Perl. I'd really prefer to keep this a pure Perl
script if I can.

The C code behind PERL_HASH is given in PERLDOC PERLGUTS. You could easily
translate that into Perl.

Others have mentioned modules for hashing functions which are intended
more for security than efficiency.

Xho
 
M

Mike Deskevich

rather than trying to store the data in the hash (which you said you
didn't want to do), why not just use the hash to keep track of how many
records there are of each key. something like this

%hash={} #or however you make an empty hash, i don't use them often so
i don't know off the top of my head how to do it.
while (<DATAFILE>)
{
$key=get_key_from_record ($_); #or however you get the key from the
record just read in
$num=(%hash{$key}++);
if ($num>1)
{
#record is a dupe, do what you want with it
}
}
 
J

John W. Krahn

Dan said:
I'm working with some large (several hundred megs) flat database files. I
need to examine the records for duplicates. Obviously, I don't want to
store several hundred megs of data in a hash. What I'd like to do is to
read each record, generate a hash value for the record, store that hash
value and an index key rather than storing the entire record, and look for
collisions in the hash value.

Perl obviously uses an internal hashing function to create it's hash
variables. Is it possible to access this function or to get the actual
hash value it produces? If not, any pointers to a module or information on
writing a hashing function in Perl would be appreciated. Hashing functions
usually involve low level bit twiddling. While it's probably possible to
do this directly in Perl (what isn't?), I don't know enough Perl to do it.
Right now, I'm looking at using a C function, then having to integrate that
with Perl. I'd really prefer to keep this a pure Perl script if I can.

I've been through the Camel, the Panther, and the Ram without finding
anything relevant. The Cookbook does mention that using hashes to search
for dupes is memory intensive if you have large records but doesn't provide
any alternatives that I could find. Searching for information on hashing
functions and Perl on the web has proven to be an exercise in futility due
to the naming collision with the variable type.

Here is an example using Digest::MD5 that may suit your needs:

#!/usr/bin/perl
use warnings;
use strict;

use Digest::MD5 'md5';


my $file = 'database.file';

my ( $loc, %data ) = 0;

@ARGV = $file;

while ( <> ) {
push @{ $data{ md5 $_ } }, $loc;
$loc = tell ARGV;
}

close ARGV;

# remove all duplicates except the first one
my @dups = sort { $a <=> $b } map splice( @$_, 1 ), values %data;

# or alternately remove all except the last one
# my @dups = sort { $a <=> $b } map splice( @$_, 0, $#$_ ), values %data;


( $loc, $^I, @ARGV ) = ( 0, '.bck', $file );

while ( <> ) {
if ( @dups and $loc == $dups[ 0 ] ) {
shift @dups;
next;
}
print;
$loc = tell ARGV;
}

__END__



John
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Similar Threads


Members online

No members online now.

Forum statistics

Threads
473,876
Messages
2,569,932
Members
46,206
Latest member
BernardPer

Latest Threads

Top