Hash Code Computation

Hash code computation of 32 bit codes for word-aligned strings, null padded to be integral multiples of 12 bytes.

The C source code this version is based on, along with an explanatory article, was found at http://burtleburtle.net/bob/hash/evahash.html .

It came with the following notice:

By Bob Jenkins, 1996. bob_jenkins@burtleburtle.net. You may use this code any way you wish, private, educational, or commercial. It's free.

Use for hash table lookup, or anything where one collision in $2^{32}$ is acceptable. Do NOT use for cryptographic purposes.

I've adaptated it to C++ and specialized it for 32-bit x86. In particular, the hash code arithmetic has been rewritten to first copy the key to a word-aligned, null-padded, multiple-of-12-byte buffer, which input works out to be much more tractable than raw, unaligned, arbitrary length C-style character strings.

Judging by the article, Jenkins is clearly aware that the word aligned computation is easier to read and write; he evidently chose the unaligned approach in order to satisfy the widest possible range of clients. Note, by the way, that he gives source for 64-bit hash code computation as well.

Word alignment allows each group of four character operations to be folded into a single unsigned integer expression, and the null padding allows the switch logic for string remainders to be folded into the main loop.

Although the loop traversal code is considerably different in style from that of Bob Jenkins' original, each iteration of the loop still works with 12-byte character sequences, and in Hash::State::operator+=(nat_0 key), the first three assignments correspond to Jenkins' combine() step, and the following loop, to the mix(), or permutation, step. Note that in order to fold the post-loop remainder computation into the main loop body, I chose to: (1) regularize the string remainder suffix combining step to conform with that used in the loop body; the original varied slightly between loop body and post-loop; and (2) use the string length as the salt, rather than attempt to combine it with the state vector part way through the computation.

As a result, the hash codes that are computed for strings of length greater than 8 differ from Jenkins' original code. Since I still use all the original input information, and the permutation step is unchanged, the likelihood of collisions should also be unchanged.

Bill Pippin 2010-01-14