Hash Function Speed

I have created my own hash table class, but am looking to speed it up. My current hash function is:

 int HashTable::hashFunc(const string &key) const
        {
		int tableSize = theLists.size();
            int hashVal = 0;
			for(int i = 0; i<key.length();  i++)
			hashVal = 37*hashVal+key[i];
			hashVal %= tableSize;
			if(hashVal<0)
			hashVal += tableSize;
			return hashVal;
        }

I am looking for an alternative function for hashing a string that will probably yield similar results.

Your hash function is, um, unusual. Your for loop does nothing. Assuming that is what you really wanted, this is what your function is actually doing:

int HashTable::hashFunc(const string &key) const
{
  int tableSize = theLists.size();
  int hashVal = 0;
      			
  hashVal = 37*hashVal+key[key.length()];
  return hashVal % tableSize;
}

which is considerably faster, since it never iterates i over the length of the string.

Are you sure of that? The indenting looks almost random, but if you ignore it, the loop looks like it should iterate over the line below it, modifying the value of hashVal further for every character in the key string.

It is missing { }, IMO.

---------- Post updated at 09:09 AM ---------- Previous update was at 09:07 AM ----------

int HashTable::hashFunc(const string &key) const
        {
		int tableSize = theLists.size();
            int hashVal = 0;
			for(int i = 0; i<key.length();  i++) {
			hashVal = 37*hashVal+key[i];
			hashVal %= tableSize;
			if(hashVal<0)
			hashVal += tableSize;
                                      }

			return hashVal;
        }

Without { }, the loop will operate on the following statement instead of the following code block. If the loop had a semicolon on the end, then it would be truly pointless. Try the following code:

int n;
for(n=0; n<10; n++)
  printf("loop A %d\n", n);

for(n=0; n<10; n++);
  printf("loop B %d\n", n);

It looks to me like the intent was for the loop to operate only on the line directly below it (why reduce the value to the table size on every iteration instead of just once at the end?), so the braces were left out on purpose. Better indenting would have shown that intent, and for maximum clarity braces could have been put around even that single line, as in the sketch below.
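
Under that reading, the intended function, with the braces written out and the reduction to the table size done once after the loop, would be something like this (a sketch of what the posted code appears to mean, not a confirmed fix):

int HashTable::hashFunc(const string &key) const
{
    int tableSize = theLists.size();
    int hashVal = 0;

    // fold every character of the key into the running value
    for (size_t i = 0; i < key.length(); i++) {
        hashVal = 37 * hashVal + key[i];
    }

    // reduce to the table range once, at the end
    hashVal %= tableSize;
    if (hashVal < 0)
        hashVal += tableSize;

    return hashVal;
}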

---------- Post updated at 09:34 AM ---------- Previous update was at 09:10 AM ----------

As for improving the hash function itself, there's not a lot to it as it stands, and the slowdown may be coming from something else entirely. How full are your hash tables, and how many collisions are you getting? (A rough way to check is sketched below.)
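
For what it's worth, a diagnostic along these lines would show the load factor and chain lengths. It assumes theLists is a std::vector< std::list<std::string> > (a guess based on the member name; adjust to whatever your class actually stores) and needs <iostream>, <vector>, and <list>:

// Hypothetical diagnostic; assumes a separate-chaining table where
// theLists is a std::vector< std::list<std::string> >.
void HashTable::printStats() const
{
    size_t items = 0, usedBuckets = 0, longest = 0;

    for (size_t b = 0; b < theLists.size(); b++) {
        size_t len = theLists[b].size();
        items += len;
        if (len > 0)
            usedBuckets++;
        if (len > longest)
            longest = len;
    }

    std::cout << "buckets:       " << theLists.size() << "\n"
              << "items:         " << items << "\n"
              << "load factor:   " << (double)items / theLists.size() << "\n"
              << "used buckets:  " << usedBuckets << "\n"
              << "longest chain: " << longest << std::endl;
}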

As to hash improvement in general - test avalanche/distribution for your data sets on these:

// XOR-Bernstein hash

unsigned xor_b_hash ( const void *key, 
                      const int len,
                      const unsigned tablesize )
{
  const unsigned char *p = (const unsigned char *)key;
  unsigned hval = 0;
  int i=0;

  for ( i = 0; i < len; i++, p++ )
    hval = 33 * hval ^ *p;

  return hval % tablesize;
}

// Fowler/Noll/Vo (FNV-1) hash

unsigned fnv_hash ( const void *key, 
                    const int len,
                    const unsigned tablesize )
{
  const unsigned char *p = (const unsigned char *)key;
  unsigned hval = 2166136261U;
  int i=0;

  for ( i = 0; i < len; i++, p++ )
    hval = ( hval * 16777619 ) ^ *p;

  return hval % tablesize;
}
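
Both take a raw pointer plus a length, so from the std::string key in the original member function the call would look something like this (still assuming the theLists member from the first post for the table size):

int HashTable::hashFunc(const string &key) const
{
    // table size from the chained-list vector, as in the original code
    unsigned tableSize = theLists.size();

    // pass the raw bytes and length of the string
    return fnv_hash(key.c_str(), key.length(), tableSize);
}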

The OP's additive hash fails to distinguish permutations, i.e., "xyz", "zyx", and "xzy" all result in the same hash value.

And if the original hash is "slow", then these will be too. Did you try instrumenting your code or running it under a profiler before deciding that the hash algorithm was the bottleneck? A crude timing check is sketched below.
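
For example, a standalone timing run (using std::clock and a made-up batch of keys, purely as a sketch) would at least show how long the hashing alone takes compared to your overall insert/lookup times:

#include <cstdio>
#include <ctime>
#include <string>
#include <vector>

// the 37-multiplier hash from the first post, without the table reduction
static unsigned rawHash(const std::string &key)
{
    unsigned h = 0;
    for (size_t i = 0; i < key.length(); i++)
        h = 37 * h + key[i];
    return h;
}

int main(void)
{
    // made-up sample keys; substitute real keys from your application
    std::vector<std::string> keys;
    char buf[32];
    for (int i = 0; i < 1000; i++) {
        std::sprintf(buf, "key%d", i);
        keys.push_back(buf);
    }

    volatile unsigned sink = 0;          // keeps the calls from being optimized away
    std::clock_t start = std::clock();

    for (int rep = 0; rep < 10000; rep++)
        for (size_t i = 0; i < keys.size(); i++)
            sink += rawHash(keys[i]);

    double secs = (double)(std::clock() - start) / CLOCKS_PER_SEC;
    std::printf("10000000 hash calls in %.3f seconds\n", secs);
    return 0;
}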

Once again, sorry, but look closer: it's not an additive hash. Order alters the value, because it multiplies hashVal by 37 on each iteration before adding the next character. With a table size of 2048, it gives me:

hash("xyz") == 943
hash("zyx") == 1631
hash("xzy") == 979

Corona is right - I stand corrected.