Trying to convert small C++ subroutine (by Peter Weinberger) to Perl

D

David Filmer

Greetings.

In the book, "Programming in C++" by Stephen Dewhurst and Kathy Stark
(1989, Prentice Hall, page 14), I find a short hashing subroutine
("hash" as in cryptographic hash, not an associative array) by the
famous Peter Weinberger.

The subroutine accepts a string, and returns a numerical value. One
possible application of such a subroutine would be for storing (and
retrieving) of a vast number of files (many thousands or millions) -
the files could be stored in an arbitrary number of directories/
filesystems (ie, "hash buckets") and retrieved based on the hash value
of their filename.

I wish to convert this C++ subroutine to a Perl subroutine. But my
knowledge of C++ is limited to what I learned in a class in college -
I have never coded in C++ and I am severely deficient in this
language.

Based on my limited C++ skills, I have made an effort to do this
conversion, but it does not produce the expected results. I was
hoping someone here who was more familiar with C++ could point out the
error of my ways.

Here is subroutine from the book, based on Peter Weinberger's code:

int
hashpjw( char *s ) {
const prime = 211;
unsigned hash = 0, g;

for( char *p = s; *p, p++ ) {
hash = ( hash << 4 ) + *p;
// assumes 32 bit int size
if( g = hash & 0xf0000000 ) {
hash ^= g >> 24;
hsh ^= g;
}
}
return hash % prime;
}


Here is my <lame>attempt</lame> to convert it to Perl:

#!/usr/bin/perl
use strict;

print hashpjw( 'bar', 123 );

sub hashpjw {
my( $char, $s ) = @_;
my $prime = 211;
my ( $hash, $g, $p );
for my $char( $p = $s, $p, $p++ ) {
$hash = ( $hash << 4 ) + $p;
if( $g = $hash & 0xf0000000 ) {
$hash = $hash ^ ($g >> 24);
$hash = $hash ^ $g;
}
}
return $hash % $prime;
}
__END__

Any assistance is greatly appreciated!
 
J

Jens Thoms Toerring

David Filmer said:
In the book, "Programming in C++" by Stephen Dewhurst and Kathy Stark
(1989, Prentice Hall, page 14), I find a short hashing subroutine
("hash" as in cryptographic hash, not an associative array) by the
famous Peter Weinberger.
The subroutine accepts a string, and returns a numerical value. One
possible application of such a subroutine would be for storing (and
retrieving) of a vast number of files (many thousands or millions) -
the files could be stored in an arbitrary number of directories/
filesystems (ie, "hash buckets") and retrieved based on the hash value
of their filename.

Mmmm, but you also have to take into account possible "hash
collisions"...
I wish to convert this C++ subroutine to a Perl subroutine. But my
knowledge of C++ is limited to what I learned in a class in college -
I have never coded in C++ and I am severely deficient in this
language.
Based on my limited C++ skills, I have made an effort to do this
conversion, but it does not produce the expected results. I was
hoping someone here who was more familiar with C++ could point out the
error of my ways.
Here is subroutine from the book, based on Peter Weinberger's code:
int
hashpjw( char *s ) {
const prime = 211;
unsigned hash = 0, g;
for( char *p = s; *p, p++ ) {
hash = ( hash << 4 ) + *p;
// assumes 32 bit int size
if( g = hash & 0xf0000000 ) {
hash ^= g >> 24;
hsh ^= g;
}
}
return hash % prime;
}

Your 'prime' is awfully small, with that you never get more
than 211 different hash values...
Here is my <lame>attempt</lame> to convert it to Perl:
#!/usr/bin/perl
use strict;

use warnings;

;-)
print hashpjw( 'bar', 123 );

Why do you pass it two arguments if the C/C++ version only
takes a single one, the string of which you want to compute
the hash value?
sub hashpjw {
my( $char, $s ) = @_;
my $prime = 211;
my ( $hash, $g, $p );
for my $char( $p = $s, $p, $p++ ) {
$hash = ( $hash << 4 ) + $p;
if( $g = $hash & 0xf0000000 ) {
$hash = $hash ^ ($g >> 24);
$hash = $hash ^ $g;
}
}
return $hash % $prime;
}

Try this instead (and pass it just the string)

sub hashpjw {
my ( $s, $prime, $hash ) = ( shift, 211, 0 );

for ( split //, $s ) {
$hash = ( $hash << 4 ) + ord $_;
if ( my $g = $hash & 0xf0000000 ) {
$hash ^= $g >> 24;
$hash ^= $g;
}
}

return $hash % $prime;
}

One warning: the C/C++ version is rather likely meant to be used
with 8-bit characters. The Perl version as it is will happily
accept e.g. Unicode characters and then the result will differ
from the C/C++ version!
Regards, Jens
 
S

smallpond

#!/usr/bin/perl
use strict;

print hashpjw( 'bar', 123 );

Call hashpjw with two args, a string 'bar' and a number 123
sub hashpjw {
my( $char, $s ) = @_;

Get two args: $char = 'bar' and $s = 123

for my $char( $p = $s, $p, $p++ ) {

Here you define a new $char hiding the one that has 'bar'.

Also this loop is almost certainly not doing what you want, unless
what you want is very strange:

perl -e 'for $c ($p=123, $p, $p++) { print $c,"\n" }'
124
124
123
Any assistance is greatly appreciated!

Start by treating perl as a different language than C. First off,
there are no pointers, so trying to do pointer arithmetic is, well,
pointless.
 
U

Uri Guttman

hi dave!

DF> In the book, "Programming in C++" by Stephen Dewhurst and Kathy Stark
DF> (1989, Prentice Hall, page 14), I find a short hashing subroutine
DF> ("hash" as in cryptographic hash, not an associative array) by the
DF> famous Peter Weinberger.

well, perl hashes are called that because they use a hash function to
allow O(1) (or close to it) access!

DF> The subroutine accepts a string, and returns a numerical value. One
DF> possible application of such a subroutine would be for storing (and
DF> retrieving) of a vast number of files (many thousands or millions) -
DF> the files could be stored in an arbitrary number of directories/
DF> filesystems (ie, "hash buckets") and retrieved based on the hash value
DF> of their filename.

which is also how perl hashes work!

DF> int
DF> hashpjw( char *s ) {
DF> const prime = 211;
DF> unsigned hash = 0, g;

DF> for( char *p = s; *p, p++ ) {
DF> hash = ( hash << 4 ) + *p;
DF> // assumes 32 bit int size
DF> if( g = hash & 0xf0000000 ) {
DF> hash ^= g >> 24;
DF> hsh ^= g;
DF> }
DF> }
DF> return hash % prime;
DF> }


DF> print hashpjw( 'bar', 123 );

you pass a string to $char but a number to $s

DF> sub hashpjw {
DF> my( $char, $s ) = @_;

the c++ code declares only a single char pointer arg which is similar to
a perl string. and as i pointed out above, you pass in two args.

DF> my $prime = 211;
DF> my ( $hash, $g, $p );
DF> for my $char( $p = $s, $p, $p++ ) {

that redeclares $char so the string arg you pass in is never used. also
this loop just scans the string to be hashed. there are many easy idioms
for this. this s about the simplest but may be slow on very long
strings:

foreach my $char ( split //, $s ) {

this works better on long strings:

while( my ($char) = $s =~ /(.)/sg ) {


DF> $hash = ( $hash << 4 ) + $p;

you never set $p to anything. it needs to be the next char. so change $p
to $char. *p is accessing the next char in the c++ string.

DF> if( $g = $hash & 0xf0000000 ) {
DF> $hash = $hash ^ ($g >> 24);
DF> $hash = $hash ^ $g;
DF> }

that looks ok but i can't tell without testing.

DF> }
DF> return $hash % $prime;

same here.

well, you never scan the input string which is never properly passed
in. fix those bugs and see what happens. also test this out with a
single char string so you can debug all the steps more easily.

uri
 
S

sln

David Filmer said:
In the book, "Programming in C++" by Stephen Dewhurst and Kathy Stark
(1989, Prentice Hall, page 14), I find a short hashing subroutine
[snip]

Try this instead (and pass it just the string)

sub hashpjw {
my ( $s, $prime, $hash ) = ( shift, 211, 0 );

for ( split //, $s ) {
$hash = ( $hash << 4 ) + ord $_;
if ( my $g = $hash & 0xf0000000 ) {
$hash ^= $g >> 24;
$hash ^= $g;
}
}

return $hash % $prime;
}

The oputput for this is the same as the C function, for
'bar' its 168. However the C function as typed in won't compile
without some fixes.

If there were ever a need for speed, the last two $hash asignments
can be combined into a single asignment.

$hash = ($hash ^ $g >> 24) ^ $g;

-sln
 
S

sln

Greetings.

In the book, "Programming in C++" by Stephen Dewhurst and Kathy Stark
(1989, Prentice Hall, page 14), I find a short hashing subroutine
("hash" as in cryptographic hash, not an associative array) by the
famous Peter Weinberger.

The subroutine accepts a string, and returns a numerical value. One
possible application of such a subroutine would be for storing (and
retrieving) of a vast number of files (many thousands or millions) -
the files could be stored in an arbitrary number of directories/
filesystems (ie, "hash buckets") and retrieved based on the hash value
of their filename.

I wish to convert this C++ subroutine to a Perl subroutine. But my
knowledge of C++ is limited to what I learned in a class in college -
I have never coded in C++ and I am severely deficient in this
language.

Based on my limited C++ skills, I have made an effort to do this
conversion, but it does not produce the expected results. I was
hoping someone here who was more familiar with C++ could point out the
error of my ways.

Here is subroutine from the book, based on Peter Weinberger's code:

int
hashpjw( char *s ) {
const prime = 211;
unsigned hash = 0, g;

for( char *p = s; *p, p++ ) {
hash = ( hash << 4 ) + *p;
// assumes 32 bit int size
if( g = hash & 0xf0000000 ) {
hash ^= g >> 24;
hsh ^= g;
}
}
return hash % prime;
}
[snip]

Jen has a nice Perl solution.
But, the C code won't compile under C++, probably some typo's.

C++ does not support default-int
There is no default type for const.
The for loop has a ',' where it should have a ';'.

Fixed up and speedier version:

int
hashpjw( char *s ) {
const int prime = 211;
unsigned hash = 0, g;

for( char *p = s; *p; p++ ) {
hash = ( hash << 4 ) + *p;
// assumes 32 bit int size
if( g = hash & 0xf0000000 )
hash = (hash ^ g >> 24) ^ g;
}
return hash % prime;
}

-sln
 
T

Ted Zlatanov

On 24 Feb 2009 09:11:20 GMT (e-mail address removed) (Jens Thoms Toerring) wrote:

JTT> for ( split //, $s ) {
JTT> $hash = ( $hash << 4 ) + ord $_;
....
JTT> The Perl version as it is will happily accept e.g. Unicode
JTT> characters and then the result will differ from the C/C++ version!

An easy way to fix that is to replace the lines with

for ( unpack "C*", $s) {
$hash = ( $hash << 4 ) + $_;

which will probably also be faster, a nice property for a hashing
function (unpack over split, and no ord() call).

Ted
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,769
Messages
2,569,579
Members
45,053
Latest member
BrodieSola

Latest Threads

Top