hashing function

Discussion in 'Perl Misc' started by Dan Jones, Nov 28, 2004.

  1. Dan Jones

    Dan Jones Guest

    I'm working with some large (several hundred megs) flat database files. I
    need to examine the records for duplicates. Obviously, I don't want to
    store several hundred megs of data in a hash. What I'd like to do is to
    read each record, generate a hash value for the record, store that hash
    value and an index key rather than storing the entire record, and look for
    collisions in the hash value.

    Perl obviously uses an internal hashing function to create it's hash
    variables. Is it possible to access this function or to get the actual
    hash value it produces? If not, any pointers to a module or information on
    writing a hashing function in Perl would be appreciated. Hashing functions
    usually involve low level bit twiddling. While it's probably possible to
    do this directly in Perl (what isn't?), I don't know enough Perl to do it.
    Right now, I'm looking at using a C function, then having to integrate that
    with Perl. I'd really prefer to keep this a pure Perl script if I can.

    I've been through the Camel, the Panther, and the Ram without finding
    anything relevant. The Cookbook does mention that using hashes to search
    for dupes is memory intensive if you have large records but doesn't provide
    any alternatives that I could find. Searching for information on hashing
    functions and Perl on the web has proven to be an exercise in futility due
    to the naming collision with the variable type.

    Thanks in advance for any assistance.
    Dan Jones, Nov 28, 2004
    #1
    1. Advertising

  2. Dan Jones

    Juha Laiho Guest

    Dan Jones <> said:
    >I'm working with some large (several hundred megs) flat database files. I
    >need to examine the records for duplicates. Obviously, I don't want to
    >store several hundred megs of data in a hash.

    ....
    >If not, any pointers to a module or information on writing a hashing
    >function in Perl would be appreciated.


    MD5 ans SHA1 are two commonly used hashing functions, and implementations
    for both are available as perl modules.
    --
    Wolf a.k.a. Juha Laiho Espoo, Finland
    (GC 3.0) GIT d- s+: a C++ ULSH++++$ P++@ L+++ E- W+$@ N++ !K w !O !M V
    PS(+) PE Y+ PGP(+) t- 5 !X R !tv b+ !DI D G e+ h---- r+++ y++++
    "...cancel my subscription to the resurrection!" (Jim Morrison)
    Juha Laiho, Nov 28, 2004
    #2
    1. Advertising

  3. Dan Jones

    Xavier Noria Guest

    Storing the hash code and an index is what the $hash{$record}++ idiom
    does, the post seem to assume that the record itself is being stored in
    the hash. Just to avoid a misconception, that's not the case.

    Did you already try that simple solution and it wasn't suitable?

    -- fxn
    Xavier Noria, Nov 28, 2004
    #3
  4. Dan Jones

    Xavier Noria Guest

    I don't know what was I thinking about, that's completely wrong. I am
    very sorry.

    -- fxn
    Xavier Noria, Nov 28, 2004
    #4
  5. Dan Jones

    Guest

    Dan Jones <> wrote:
    > I'm working with some large (several hundred megs) flat database files.
    > I need to examine the records for duplicates. Obviously, I don't want to
    > store several hundred megs of data in a hash.


    I don't consider that to be at all obvious. I often want to and often do
    store several hundred meg of data in a hash.

    Is there no subset of the record which would be suitable for identifying
    duplicates?

    > What I'd like to do is to
    > read each record, generate a hash value for the record, store that hash
    > value and an index key


    You would potentially need a list of index keys, not just one index key.
    Otherwise what would you do when different records collide on the
    same hash values?

    > rather than storing the entire record, and look
    > for collisions in the hash value.


    And then use the index values to go back and look up the whole records
    and compare them properly? Wouldn't it be easier to use system sort
    routines or a proper database server?

    > Perl obviously uses an internal hashing function to create it's hash
    > variables. Is it possible to access this function or to get the actual
    > hash value it produces?


    This seems to work. No gaurantees:

    #!/usr/bin/perl -wl

    use strict;
    use Inline 'C' ;

    foreach (1..1_000_000) {
    print calc($_);
    };

    __DATA__
    __C__

    size_t calc(SV* sv) {
    char * c;
    size_t size;
    size_t hash;

    c=SvPV(sv,size);
    PERL_HASH(hash,c,size);
    return hash;
    };



    > If not, any pointers to a module or information
    > on writing a hashing function in Perl would be appreciated. Hashing
    > functions usually involve low level bit twiddling. While it's probably
    > possible to do this directly in Perl (what isn't?), I don't know enough
    > Perl to do it. Right now, I'm looking at using a C function, then having
    > to integrate that with Perl. I'd really prefer to keep this a pure Perl
    > script if I can.


    The C code behind PERL_HASH is given in PERLDOC PERLGUTS. You could easily
    translate that into Perl.

    Others have mentioned modules for hashing functions which are intended
    more for security than efficiency.

    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    Usenet Newsgroup Service $9.95/Month 30GB
    , Nov 28, 2004
    #5
  6. rather than trying to store the data in the hash (which you said you
    didn't want to do), why not just use the hash to keep track of how many
    records there are of each key. something like this

    %hash={} #or however you make an empty hash, i don't use them often so
    i don't know off the top of my head how to do it.
    while (<DATAFILE>)
    {
    $key=get_key_from_record ($_); #or however you get the key from the
    record just read in
    $num=(%hash{$key}++);
    if ($num>1)
    {
    #record is a dupe, do what you want with it
    }
    }
    Mike Deskevich, Nov 28, 2004
    #6
  7. Dan Jones wrote:
    > I'm working with some large (several hundred megs) flat database files. I
    > need to examine the records for duplicates. Obviously, I don't want to
    > store several hundred megs of data in a hash. What I'd like to do is to
    > read each record, generate a hash value for the record, store that hash
    > value and an index key rather than storing the entire record, and look for
    > collisions in the hash value.
    >
    > Perl obviously uses an internal hashing function to create it's hash
    > variables. Is it possible to access this function or to get the actual
    > hash value it produces? If not, any pointers to a module or information on
    > writing a hashing function in Perl would be appreciated. Hashing functions
    > usually involve low level bit twiddling. While it's probably possible to
    > do this directly in Perl (what isn't?), I don't know enough Perl to do it.
    > Right now, I'm looking at using a C function, then having to integrate that
    > with Perl. I'd really prefer to keep this a pure Perl script if I can.
    >
    > I've been through the Camel, the Panther, and the Ram without finding
    > anything relevant. The Cookbook does mention that using hashes to search
    > for dupes is memory intensive if you have large records but doesn't provide
    > any alternatives that I could find. Searching for information on hashing
    > functions and Perl on the web has proven to be an exercise in futility due
    > to the naming collision with the variable type.


    Here is an example using Digest::MD5 that may suit your needs:

    #!/usr/bin/perl
    use warnings;
    use strict;

    use Digest::MD5 'md5';


    my $file = 'database.file';

    my ( $loc, %data ) = 0;

    @ARGV = $file;

    while ( <> ) {
    push @{ $data{ md5 $_ } }, $loc;
    $loc = tell ARGV;
    }

    close ARGV;

    # remove all duplicates except the first one
    my @dups = sort { $a <=> $b } map splice( @$_, 1 ), values %data;

    # or alternately remove all except the last one
    # my @dups = sort { $a <=> $b } map splice( @$_, 0, $#$_ ), values %data;


    ( $loc, $^I, @ARGV ) = ( 0, '.bck', $file );

    while ( <> ) {
    if ( @dups and $loc == $dups[ 0 ] ) {
    shift @dups;
    next;
    }
    print;
    $loc = tell ARGV;
    }

    __END__



    John
    --
    use Perl;
    program
    fulfillment
    John W. Krahn, Nov 29, 2004
    #7
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Owen Jacobson
    Replies:
    3
    Views:
    4,558
    CBFalconer
    May 26, 2005
  2. Pat

    Hashing function

    Pat, May 18, 2004, in forum: C++
    Replies:
    2
    Views:
    456
    Ivan Vecerina
    May 18, 2004
  3. Matt Bull

    Hashing function (possibly OT)

    Matt Bull, Jul 7, 2003, in forum: C Programming
    Replies:
    2
    Views:
    319
    Matt Bull
    Jul 7, 2003
  4. Dave

    Hashing Function

    Dave, Nov 17, 2005, in forum: Python
    Replies:
    0
    Views:
    299
  5. Lawrence
    Replies:
    16
    Views:
    516
Loading...

Share This Page