Hashing function (possibly OT)

M

Matt Bull

Hi,

I apologise in advance if this is off topic, but would appreciate any
pointers you chaps might be able to provide.

I'm relatively novice in the art of C so am after any suggestions about
implementing a very simple hash (or quasi hash)

I need to be able to store a counter and a timestamp for requests from
particular device MAC addresses (6 bytes, presented as xx:xx:xx:xx:xx:xx
where x is a hex digit), ideally in a dynamic data structure. Additionally I
need to be able to access the data quickly, hence my thoughts to use a
hashing function.

This may be naive but I'd thought to code something along the line of a tree
structure, with each octet representing a node holding 256 pointers to the
next set of nodes (from most significant to least significant octect) with
the last octect holding pointers to a simple record holding the data.

This would mean that at most i would need to traverse 6 tree nodes before
hitting the data whatever the volume of addresses.(?)

Does this seem feasible?

If so (and this is where I get unstuck :eek:)), in C i'd have a structure
defined as an array of 255 pointers, then allocate space for one of these
structures as I parse each octect of the mac address. I'm having difficulty
in defining the structures correctly so that the last but one octet
references the actual data rather than another set of node pointers.

Can someone please advise on the best way to do this? Or if you think the
above is complete pap suggest any easier or more sensible ( please :p)
solutions. I'm open to advice/flaming/suggestions!

Cheers,
Matt
 
E

Eric Sosman

Matt said:
I need to be able to store a counter and a timestamp for requests from
particular device MAC addresses (6 bytes, presented as xx:xx:xx:xx:xx:xx
where x is a hex digit), ideally in a dynamic data structure. Additionally I
need to be able to access the data quickly, hence my thoughts to use a
hashing function.

This may be naive but I'd thought to code something along the line of a tree
structure, with each octet representing a node holding 256 pointers to the
next set of nodes (from most significant to least significant octect) with
the last octect holding pointers to a simple record holding the data.

<off-topic>

This approach would probably waste a lot of memory. For
example, suppose your entire data base contains just two six-byte
keys that happen to differ in the first byte. You'll allocate
one 256-entry node for the first byte and two for each of the
second through sixth bytes -- eleven 256-entry nodes in all, just
for twelve bytes of key data?

Incidentally, the scheme you describe is known as a "trie,"
and doesn't sound like a hashing method at all.

If so (and this is where I get unstuck :eek:)), in C i'd have a structure
defined as an array of 255 pointers,

Don't you mean 256?
then allocate space for one of these
structures as I parse each octect of the mac address. I'm having difficulty
in defining the structures correctly so that the last but one octet
references the actual data rather than another set of node pointers.

The first step is to define your "payload," the data
structure you'll find at the eventual end of whatever search
path you'll take. A typical way to do this in C is with a
struct, something like

typedef struct {
TimeStampType when;
CounterType count;
OtherType other;
} Payload;

Next, you need something capable of pointing to 256
different Payloads. An array of pointers is just the trick:

typedef Payload *Level6[256];

That is, a `Level6' object is an array of 256 pointers to
`Payload' objects. At the next level you'll need a bunch of
pointers to Level6's, and so on up to the Level1 root:

typedef Level6 *Level5[256];
typedef Level5 *Level4[256];
typedef Level4 *Level3[256];
typedef Level3 *Level2[256];
typedef Level2 *Level1[256];

/* finally, here's the root of your trie: */
Level1 root;

.... and there you have it: The lone Level1 object points to 256
Level2's, each Level2 points to 256 Level3's, and so on until
the chosen Level6 object points to the desired Payload.

However, this is an incredibly verbose way to proceed! The
novice C programmer is usually comfortable with the idea of a
"pointer to something" but starts to get queasy when confronted
with a "pointer to pointer to something." The LevelN types have
no real function other than to assign names to various "somethings"
so the novice can visualize them more easily -- throw away all
those names and you get the exact same structure, expressed in
terms of multiple levels of pointers:

Payload *****root[256];

(Actually, there's a subtle difference between the two
formulations: the LevelN version carries the "256-ness" information
along while the latter version doesn't, and this will affect the way
you calculate how much memory to allocate for each node. Still,
the essential structure of a six-level trie is the same.)
Can someone please advise on the best way to do this? Or if you think the
above is complete pap suggest any easier or more sensible ( please :p)
solutions. I'm open to advice/flaming/suggestions!

<off-topic>

Choice of data structure and search/update method really hasn't
anything to do with the C language. Once a choice has been made,
the question of how best to express the solution in C would be
topical, but the selection itself is driven by criteria that would
be just as valid (or invalid) for Fortran, Java, BASIC, COBOL, ...

Also, you haven't revealed enough about your situation to allow
anything more than informed guesswork. For example, how many of
these six-byte keys do you need to store? You say you need to
access them "quickly," but you don't mention how many million queries
per second you need to sustain. Once entered, do the keys persist
forever or can they be deleted? And so on, and so forth -- and
NONE of these apparently vital considerations has anything at all
to do with C. Look elsewhere for advice, and return here if you're
having difficulty with coding the chosen solution in C.

</off-topic>

Good luck!
 
M

Matt Bull

Eric Sosman said:
<off-topic>

This approach would probably waste a lot of memory. For
example, suppose your entire data base contains just two six-byte
keys that happen to differ in the first byte. You'll allocate
one 256-entry node for the first byte and two for each of the
second through sixth bytes -- eleven 256-entry nodes in all, just
for twelve bytes of key data?

Incidentally, the scheme you describe is known as a "trie,"
and doesn't sound like a hashing method at all.

I've analysed most of the data we are processing at the moment and the keys
would be in quite a low number of most significant byte "chunks", where the
vast majority of difference lies in lower 3 octets. So it might be ok.

This method seemed to offer the best of both ease of programming and memory
use at the time, although I'm open to suggestions for another method that
wouldn't make my brain melt :eek:)
Don't you mean 256?

Yes I did! hehe
The first step is to define your "payload," the data
structure you'll find at the eventual end of whatever search
path you'll take. A typical way to do this in C is with a
struct, something like

(Actually, there's a subtle difference between the two
formulations: the LevelN version carries the "256-ness" information
along while the latter version doesn't, and this will affect the way
you calculate how much memory to allocate for each node. Still,
the essential structure of a six-level trie is the same.)

Ahhh thas excellent, just the info I was after!
<off-topic>

Choice of data structure and search/update method really hasn't
anything to do with the C language. Once a choice has been made,
the question of how best to express the solution in C would be
topical, but the selection itself is driven by criteria that would
be just as valid (or invalid) for Fortran, Java, BASIC, COBOL, ...

Yes, apologies again for the off topic sway of the query. I'm somewhat stuck
with implementing in C as thats the only supported API the application this
code must interact with supports.
Also, you haven't revealed enough about your situation to allow
anything more than informed guesswork. For example, how many of
these six-byte keys do you need to store? You say you need to
access them "quickly," but you don't mention how many million queries
per second you need to sustain. Once entered, do the keys persist
forever or can they be deleted? And so on, and so forth -- and
NONE of these apparently vital considerations has anything at all
to do with C. Look elsewhere for advice, and return here if you're
having difficulty with coding the chosen solution in C.

Well, at the moment it needs to retain around 100,000 individual "payloads",
persisting in memory until the application is terminated, but generally only
seeing 100-500 new or duplicated keys per second.
</off-topic>

Good luck!

Thanks very much for the information Eric, you've been very helpful!

I'll let you know how I get on.

:eek:)

Cheers,
Matt
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,764
Messages
2,569,564
Members
45,039
Latest member
CasimiraVa

Latest Threads

Top