Designing a Hash Function

History_Buff

Deity
Joined
Aug 12, 2001
Messages
6,529
For a Comp-Sci course, I have to build a Hash Function, that generates a value for ~10,000 English words, and place them into a table, with 9000-16000 elements. The actual working of the code aren't that tough, but if someone could tell me how to build a good hash function, I'd really appreciate it. Doesn't have to be one-to-one, but the closer to it the better.

And just for the record, I'm writing it in C++.
 
Well, it all depends... ;)
By "building a hash function" do you mean only the function itself or also concepts for managing collisions? How collision resistent does it have to be? Do you know the range of values the function has to point to and is it static? How Do you manage collisions? How full is your table going to be?
There are different hashing concepts like Static Hashing, Dynamic Hashing and Expansive Hashing.
The problem with designing a good hash function (and I now mean just the function itself) is that it has to be useful in your specific case.
It doesn't help to know an awesome concept for designing an expansive hashing function when you are working with a static table that can't be modified. And a super collision resistent function (like cryptographic hashes) doesn't help if it's slow. Hashing is supposed to speed up thing. If the hashing function is collision resistent but very complex and slow you'll end up slowing down you program because the hash has to be calculated on every read operation on that table.

Let's assume your table contains pointers to dictionary entries that describe the words you just hased but also contain the words itself. THe last part is important for collision management.
Code:
Word with n letters.
x(n) = the unsigned char value of n-th character in the word
H = Sum(x(n)*x^n)  % M
x = a number, preferably a prime number. The bigger the table the bigger the number should be.
M = size of your table
Sum means you add up the values of all the characters in a word.
e.g:
word: test
M = 100
x = 3
n=1234
  test
H= 116*3^1 + 101*3^2 + 115*3^3 + 116*3^4 % 100
H= 58
In case of a collision during a write operation you move on to another table index. During read operations you check the word in the dictionary entry against your input to see if it's really the word you where looking for and if it isn't you move on the another table index in the same way you do during write operations. The simplest way for would be to just increment the index by one until you find an empty entry.
The Example above also generates huge number (prior to the modolo) compared to the actual number of indexes (the ASCII value of 't' alone is bigger than M) so it should work well enough ;)
But how collision resistent that function is depends on the words you use. I'm sure you could come up with 100 words that all have the same hash but you could also come up with 100 words that have 100 different hashes. But that is to be expected since there are more thatn 100 words in the english language.
Of course you don't have to use that funtion. You could also XOR the caracters of a word against each otehr and end up with a 8 bit values that you map to the table or do some other type of key folding (like H(first half) XOR H(last half) ).

So, to sum it up: it all depends. :blush:
Since you're talking about a maximum of 16000 entries while the English language contains a few words more than that you shouldn't worry too much about the function itself and instead come up with an efficient concept for collision management.
 
Well, actually, we were simply told to use chaining in case of collisions, probably to simplify our job a little.

We are supposed to have the table at least 70% full, ( we can arbitrary choose the size of it, it'll be static.) and generally the fuller the table is, the better we go overall, so the function itself is very important ;)
 
Ok,
So H(x) has to give you a number (= table index) for a given word. Since words are composed of an arbitrary number of characters you a) have to find a function that can turn an arbitrary number of chars into one number and b) use modulo to stay within your table boundaries.
Since you can choose the size of your table and it'll be your modulo I'd go for a prime number or (2^k)-1 because they tend to give good results.
As for the function.. well. The problem is (like I said in the first post) that there are much more words than you have table entries. So even if you design the perfect hash function it's perfromance will still greatly depend on the words.
You can also take a look at Fibonacci Hashing as an alternative to the simple modulo hashing, although that won't solve the characters -> number problem :/
IMO two good base concepts for the function are the one I gave you above (which is basically the same thigng that's used for ISBN numbers) and the preprocessing methods of the SHA functions (fips180-2).
 
Top Bottom