16bit hash


Robin Becker

Josiah said:
hash(obj)&65535

- Josiah
yes I thought of that, but cannot figure out if the internal hash really
distributes the bits evenly. Particularly since it seems to treat integers etc
as special cases
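
For reference, the worry is concrete in CPython, where hash() of a small
integer is the integer itself, so the mask does no mixing at all; a quick
sketch:

# In CPython hash(n) == n for small ints, so hash(obj) & 65535 just
# keeps the low 16 bits: 1 and 65537 collide immediately.
for n in (1, 65537, 2, 131074):
    print(n, hash(n) & 65535)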
 

Nick Craig-Wood

Robin Becker said:
Is there any way to get an efficient 16bit hash in python?

Here is a 32 bit crc of which you could use the bottom 16 bits as a 16
bit hash...
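
A minimal sketch of that idea:

from binascii import crc32

def hash16(s):
    # Low 16 bits of the 32-bit CRC as a 16-bit hash.
    return crc32(s) & 0xffff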

And in case you are wondering how fast it is...

$ python -m timeit -s 's="."*1024*1024; from binascii import crc32' 'crc32(s)'
100 loops, best of 3: 4.25 msec per loop

Which is 235 MB/s

Ignore the bit in the binascii docs which says "Since the algorithm is
designed for use as a checksum algorithm, it is not suitable for use
as a general hash algorithm". Actually CRCs make pretty good hash
algorithms if you want something for making a hash table. They make
very bad cryptographic hash generators since they are linear and thus
easily forgeable. That said you aren't going to be doing any serious
crypto with only 16 bits.
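
That linearity is easy to demonstrate: for equal-length inputs CRC-32 is
affine over GF(2), so XOR-ing an odd number of messages XORs their CRCs; a
quick sketch:

from binascii import crc32

a, b, c = b'foo1', b'bar2', b'baz3'
abc = bytes(x ^ y ^ z for x, y, z in zip(a, b, c))
# The length-dependent affine constant cancels in an odd XOR.
print(crc32(abc) == (crc32(a) ^ crc32(b) ^ crc32(c)))  # True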
 

Dan Bishop

Robin Becker said:
yes I thought of that, but cannot figure out if the internal hash really
distributes the bits evenly. Particularly since it seems to treat integers etc
as special cases

And then hash(-1) is a SPECIAL special case:

>>> hash(-1)
-2
 

Paul Rubin

Robin Becker said:
yes I thought of that, but cannot figure out if the internal hash
really distributes the bits evenly. Particularly since it seems to
treat integers etc as special cases

The built-in hash function doesn't attempt even distribution; it tries
to make sure that inputs with small differences hash to distinct
values. If you want something really near-random, try

import sha
hash = int(sha.new(repr(obj)).hexdigest()[:4], 16)

It won't be super-fast but hey, you're using Python.
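
The sha module was removed in Python 3; an equivalent sketch using hashlib:

import hashlib

def hash16(obj):
    # First 4 hex digits = 16 bits of the SHA-1 digest of repr(obj).
    return int(hashlib.sha1(repr(obj).encode()).hexdigest()[:4], 16)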
 

Martin v. Löwis

Is there any way to get an efficient 16bit hash in python?

Unfortunately, that problem is underspecified. The simplest
and most efficient 16-bit hash I can come up with is

def super_efficient_hash(o):
    return 0

Now, you say: this gives collisions.
So yes, it does - but you didn't ask for it to be
perfect, just efficient.

But I don't think you are looking for a perfect hash,
but one that distributes the values evenly.

Unfortunately, that *still* would be underspecified:
you need to specify the distribution of the input values
to be able to tell whether the distribution of hash
values is even.

More generally: for most 16-bit hash functions H that
are capable of yielding all 65536 hash values, and for
any number N, there is a set S1 of N objects over which
H distributes evenly, and a set S2 of N objects where
all hash values collide.
(The only exceptions are hash functions which produce
some output values for only a finite number of inputs.)
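
Martin's point can be made concrete by brute-forcing a colliding set for any
given 16-bit hash, for instance the CRC-based one from earlier in the thread
(a sketch):

from binascii import crc32

def h16(s):
    return crc32(s) & 0xffff

# Build a set S2 of distinct inputs that all share one hash value.
colliding, i = [], 0
while len(colliding) < 5:
    s = ('input-%d' % i).encode()
    if h16(s) == 0:
        colliding.append(s)
    i += 1
print(colliding)  # five distinct inputs, one hash value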

So: what are your input data, and what is the
distribution among them?

Regards,
Martin
 

Paul Rubin

Martin v. Löwis said:
So: what are your input data, and what is the
distribution among them?

With good enough hash functions one shouldn't need to care about
the input distribution. Basically functions like SHA can be
used as extractors:

http://en.wikipedia.org/wiki/Extractor

If there's a concern that the input distribution is specially
concocted to give nonuniform results with some known hash function,
then use one unknown to the input provider, e.g.

import hmac
def hash(obj, key='some string unknown to the input source'):
    return int(hmac.HMAC(key, repr(obj)).hexdigest()[:4], 16)
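
A Python 3 equivalent needs bytes and an explicit digest; a sketch:

import hashlib, hmac

def keyed_hash16(obj, key=b'some string unknown to the input source'):
    # HMAC output stays uniform even for adversarial inputs, as long
    # as the key is secret; keep 16 bits as before.
    return int(hmac.new(key, repr(obj).encode(), hashlib.sha256).hexdigest()[:4], 16)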

Anyway I don't have the impression that the OP is concerned with this
type of issue. Otherwise s/he'd want much longer hashes than 16 bits.
 

Robin Becker

Martin v. Löwis wrote:

0, the ideal hash :) can't be argued with
.......
So: what are your input data, and what is the
distribution among them?

Regards,
Martin
I'm trying to create UniqueID's for dynamic postscript fonts. According to my
resources we don't actually need to use these, but if they are required by a
particular postscript program (perhaps to make a print run efficient) then the
private range of these ID's is 4000000<=UID<=4999999 ie a range of one million.

So I probably really need an 18 bit hash

The data going into the font consists of

fontBBox '[-415 -431 2014 2033]'
charmaps ['dup (\000) 0 get /C0 put',......]
metrics ['/C0 1251 def',.....]
bboxes ['/C29 [0 0 512 0] def',.......]
chardefs ['/C0 {newpath 224 418 m 234 336 ......def}',......]

ie a bunch of lists of strings which are eventually joined together and written
out with a template to make the postscript definition.

The UniqueID is used by PS interpreters to avoid recreating particular glyphs so
ideally I would number these fonts sequentially using a global count, but in
practice several processes separated by application and time can produce
postscript which eventually gets merged back together.

If the UID's clash then the printer produces very strange output.

I'm fairly sure there's no obvious python way to ensure the separated processes
can communicate except via the printer. So either I use a python based scheme
which reduces the risk of clashes (ie random or some data based hash scheme),
or I attempt to produce a postscript solution like looking for a private global
sequence number.

I'm not sure my postscript is really good enough to do the latter so I hoped to
pursue a python based approach which has a low probability of busting.
Originally I thought the range was a 16bit number which is why I started with
16bit hashes.
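
The "random" option mentioned above is a one-liner, with the same
birthday-style collision risk as a data-based hash; a sketch:

import random

def random_uid():
    # A UniqueID drawn uniformly from the private range 4000000..4999999.
    return random.randint(4000000, 4999999)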
 

Thomas Jollans

Robin said:
........


For identifying something, I suggest you use a hash function like sha1,
truncated to as many bits as you can use, similarly to what Jon Ribbens
suggested.
 

Robin Becker

Thomas said:
Robin Becker wrote: ........


For identifying something, I suggest you use a hash function like sha1
truncating it to as much as you can use, similarly to what Jon Ribbens
suggested.

that is in fact what I'm doing; my function looks like this

4000000+(reduce(operator.xor,struct.unpack('i'*4,md5.md5(s).digest()))&0x3ffff)

whether it's any better than using the lowest bits I have no real idea. I
suppose (sha being flavour of the month) I should really use

4000000+(reduce(operator.xor,struct.unpack('i'*5,sha.sha(s).digest()))&0x3ffff)
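
Since a cryptographic digest's bits are already well mixed, the XOR folding
shouldn't buy anything over plain truncation; a minimal Python 3 sketch of
the same scheme:

import hashlib

def font_uid(data):
    # 18 bits of the SHA-1 digest, shifted into the private
    # UniqueID range 4000000 <= UID <= 4999999.
    if isinstance(data, str):
        data = data.encode()
    digest = hashlib.sha1(data).digest()
    return 4000000 + (int.from_bytes(digest[:3], 'big') & 0x3ffff)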
 

Martin v. Löwis

I'm trying to create UniqueID's for dynamic postscript fonts. According
to my resources we don't actually need to use these, but if they are
required by a particular postscript program (perhaps to make a print run
efficient) then the private range of these ID's is 4000000<=UID<=4999999
ie a range of one million.

So I probably really need an 18 bit hash

I don't fully understand the example, but ISTM that "no, you don't need
a hash at all. You need a unique identification".
The data going into the font consists of

fontBBox '[-415 -431 2014 2033]'
charmaps ['dup (\000) 0 get /C0 put',......]
metrics ['/C0 1251 def',.....]
bboxes ['/C29 [0 0 512 0] def',.......]
chardefs ['/C0 {newpath 224 418 m 234 336 ......def}',......]

ie a bunch of lists of strings which are eventually joined together and
written out with a template to make the postscript definition.

And the UniqueID should be unique within this file, right?

Why don't you just use a serial number then?

Regards,
Martin
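
Within a single generating process, Martin's suggestion is trivial; a sketch
(only safe when one process produces every font in a job, which is exactly
what Robin says he cannot guarantee):

import itertools

# A per-process serial number in the private UniqueID range.
_uids = itertools.count(4000000)

def next_uid():
    return next(_uids)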
 

Robin Becker

Martin v. Löwis wrote:
...........
And the UniqueID should be unique within this file, right?

Why don't you just use a serial number then?
.......
I do need a unique identifier for each font; unfortunately they are required to
be unique across more than one file. Effectively these fonts are dynamically
generated using only the glyphs actually used, rather than just dumping the
whole of a ttf into the postscript.

If the UniqueID is used it must be unique when used in the printer (where I
cannot control which other UniqueID's might be present).

If I knew enough about postscript I might generate a UniqueID at the point of
use (by inspecting some global state); unfortunately my postscript is poor, so
I'm attempting to give the fonts an id that is likely to be unique by obtaining
an integer dependent on the font data and then shifting it into the private
range.

Luckily the cheap option of not using the UniqueID at all is available, but
chances are some printer ps interpreter will barf if it's not present and then I
need a fairly robust way to generate reasonable candidates.
 

Martin v. Löwis

If the UniqueID is used it must be unique when used in the printer
(where I cannot control which other UniqueID's might be present).

I don't know much about PostScript; googling around gives

http://fontforge.sourceforge.net/UniqueID.html

that says that you should not use them anymore, and

http://www.adobe.com/devnet/opentype/archives/id.html

says you should not make up values, but officially register them;
a certain range is available for testing without registration;
that should not be used in production mode.
Luckily the cheap option of not using the UniqueID at all is available,
but chances are some printer ps interpreter will barf if it's not
present and then I need a fairly robust way to generate reasonable
candidates.

See above article. It was always the case that a printer should not
barf; it just might have to re-render all glyphs. However, it seems
that today's printers have advanced caching, making the mechanism
irrelevant.

Regards,
Martin
 

Paul Rubin

Robin Becker said:
whether it's any better than using the lowest bits I have no real
idea. I suppose (sha being flavour of the month) I should really use

I think if you have more than a few fonts you really have to assign
the id's uniquely instead of using hashes, to avoid collisions.
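
A quick birthday-paradox estimate backs this up: with 2**18 = 262144 possible
IDs, the probability of at least one collision passes 50% at only around 600
fonts.

import math

def p_collision(n, space=2**18):
    # Birthday approximation: P ~= 1 - exp(-n*(n-1) / (2*space))
    return 1 - math.exp(-n * (n - 1) / (2.0 * space))

for n in (100, 600, 2000):
    print(n, p_collision(n))  # ~0.019, ~0.50, ~0.9995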
 
