
Zobrist keys - measure of quality?

Posted: Tue Feb 24, 2015 9:32 am
by mar
I always thought that one can't do better than using "truly" random keys (or at least a statistically good PRNG, i.e. one that passes various statistical tests).
However, as Steven showed recently (while working on Oscar), even a PRNG that fails some statistical tests produces Zobrist keys that just work.

So I tried to measure something else: given any two features (=keys), I measured the maximum number of identical bits.
In my case I have 1049 keys (yes, I know that's much more than necessary); for my PRNG the maximum was 50, and for random numbers I got from random.org it was 49.
The average is, as expected, very close to 32 (half of my key size).
Then I used a brute-force loop to improve on this and managed to reduce it so that no two features have more than 42 bits in common,
with the idea that this should (in theory) lower the possibility of cancellation,
but one would have to conduct collision tests to see whether this is actually true (which I haven't).
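
In code, the measurement looks roughly like this (a rough sketch; std::mt19937_64 stands in for my actual PRNG, and std::popcount needs C++20):

Code: Select all

#include <bit>      // std::popcount (C++20)
#include <cstdint>
#include <cstdio>
#include <random>
#include <vector>

int main() {
    // Fill the table with keys from a PRNG (stand-in for the engine's own generator).
    std::mt19937_64 rng(12345);
    std::vector<uint64_t> keys(1049);
    for (auto &k : keys) k = rng();

    // For every pair, count the bit positions where both keys agree:
    // identical bits = 64 - popcount(a ^ b).
    int maxCommon = 0;
    for (size_t i = 0; i < keys.size(); ++i)
        for (size_t j = i + 1; j < keys.size(); ++j) {
            int common = 64 - std::popcount(keys[i] ^ keys[j]);
            if (common > maxCommon) maxCommon = common;
        }
    std::printf("max common bits over all pairs: %d\n", maxCommon);
}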

Re: Zobrist keys - measure of quality?

Posted: Tue Feb 24, 2015 10:13 am
by Evert
I once looked at this briefly, and came to the conclusion that it doesn't really matter.

There is some discussion on the topic at https://chessprogramming.wikispaces.com/Zobrist+Hashing under "Linear independence".

Re: Zobrist keys - measure of quality?

Posted: Tue Feb 24, 2015 10:41 am
by hgm
mar wrote:..., with the idea that this should (in theory) lower the possibility of cancellation,
This theory has already been debunked here many times. E.g. with four 12-bit keys 0xF00, 0x0F0, 0x00F and 0xFFF, any pair differs in 8 out of 12 bits (the 'Hamming distance'), yet the keys are dependent (the XOR of all of them is 0). A simple 1, 2, 4, 8 would have provided collision-free keys, even though any two of them have 10 bits in common.

In fact there exists an algorithm to generate a set of keys with maximum Hamming distance. This results in a set where any key is the XOR of 2 others, which is absolutely the worst possible key set.

A good way to test the quality of your key set is to store all XORs of 3 keys in a hash table (using that XOR as the hash). With 1000 keys you get 166M XORs. If two different triples give the same XOR (i.e. you get a collision in that hash table), your key set contains a 6th-order dependency, which is pretty bad. To detect 7th-order dependencies you can check for every XOR of 4 keys whether it occurs in the table. (This takes a very long time with 1000 keys, as you have ~40G combinations.)
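
A minimal sketch of that test (with std::unordered_set standing in for the hash table, and only 100 keys so it runs quickly; with 1000 keys and 166M entries you'd want something more memory-friendly):

Code: Select all

#include <cstdint>
#include <cstdio>
#include <random>
#include <unordered_set>
#include <vector>

int main() {
    std::mt19937_64 rng(1);
    std::vector<uint64_t> keys(100);   // small count so the demo runs quickly
    for (auto &k : keys) k = rng();

    // Store the XOR of every triple; a repeated value means two triples
    // cancel each other, i.e. a 6th-order dependency.
    std::unordered_set<uint64_t> seen;
    for (size_t a = 0; a < keys.size(); ++a)
        for (size_t b = a + 1; b < keys.size(); ++b)
            for (size_t c = b + 1; c < keys.size(); ++c) {
                uint64_t x = keys[a] ^ keys[b] ^ keys[c];
                if (!seen.insert(x).second) {
                    std::printf("6th-order dependency found\n");
                    return 1;
                }
            }
    std::printf("no 6th-order dependency among %zu keys\n", keys.size());
}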

Re: Zobrist keys - measure of quality?

Posted: Tue Feb 24, 2015 12:18 pm
by mar
hgm wrote: This theory has already been debunked here many times. E.g. with four 12-bit keys 0xF00, 0x0F0, 0x00F and 0xFFF, any pair differs in 8 out of 12 bits (the 'Hamming distance'), yet the keys are dependent (the XOR of all of them is 0). A simple 1, 2, 4, 8 would have provided collision-free keys, even though any two of them have 10 bits in common.
Thanks, so now I see that it's nonsense (or the wrong measure)... Well, I also found in the links posted by Evert that this was already discussed back in 2001.

Re: Zobrist keys - measure of quality?

Posted: Tue Feb 24, 2015 1:25 pm
by Rein Halbersma
hgm wrote:
mar wrote:..., with the idea that this should (in theory) lower the possibility of cancellation,
This theory has already been debunked here many times. E.g. with four 12-bit keys 0xF00, 0x0F0, 0x00F and 0xFFF, any pair differs in 8 out of 12 bits (the 'Hamming distance'), yet the keys are dependent (the XOR of all of them is 0). A simple 1, 2, 4, 8 would have provided collision-free keys, even though any two of them have 10 bits in common.

In fact there exists an algorithm to generate a set of keys with maximum Hamming distance. This results in a set where any key is the XOR of 2 others, which is absolutely the worst possible key set.

A good way to test the quality of your key set is to store all XORs of 3 keys in a hash table (using that XOR as the hash). With 1000 keys you get 166M XORs. If two different triples give the same XOR (i.e. you get a collision in that hash table), your key set contains a 6th-order dependency, which is pretty bad. To detect 7th-order dependencies you can check for every XOR of 4 keys whether it occurs in the table. (This takes a very long time with 1000 keys, as you have ~40G combinations.)
Zobrist hashing is also known as tabulation hashing in the literature, and in the last few years it has been shown to have a lot of nice statistical properties, even though it is not even 4-independent in general. See e.g. http://arxiv.org/pdf/1011.5200.pdf or google the authors' names for more accessible introductions.
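
For reference, simple tabulation hashing of a 32-bit key is exactly the Zobrist construction: one table of random words per input byte, XORed together. A minimal sketch (the seed and random source are just placeholders):

Code: Select all

#include <cstdint>
#include <random>

// One table of 256 random 64-bit entries per input byte; the hash is the
// XOR of one entry per byte -- exactly the Zobrist construction.
static uint64_t table[4][256];

void initTables() {
    std::mt19937_64 rng(42);  // stand-in random source
    for (auto &t : table)
        for (auto &e : t) e = rng();
}

uint64_t tabulationHash(uint32_t x) {
    return table[0][x & 0xFF] ^
           table[1][(x >> 8) & 0xFF] ^
           table[2][(x >> 16) & 0xFF] ^
           table[3][(x >> 24) & 0xFF];
}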

Re: Zobrist keys - measure of quality?

Posted: Tue Feb 24, 2015 1:26 pm
by SMIRF
mar wrote:Thanks, so now I see that it's nonsense (or the wrong measure)... Well, I also found in the links posted by Evert that this was already discussed back in 2001.
Whether it is nonsense or not, I am using 64-bit Zobrist atoms consisting of 32 zero bits and 32 one bits. Thus applying a XORed Zobrist atom always flips exactly 32 bits.

Random values traditionally chosen as Zobrist atoms could, in the extreme case, include a zero value, which would flip nothing at all. Thus I concluded: the bigger the absolute difference between the number of zero and one bits, the greater the chance of producing identical overlaid keys.

This is pure intuition, but I am sure that using such balanced (randomized) Zobrist atoms, as I do, cannot harm anything.
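
Generating such balanced atoms is simple, e.g. by shuffling a fixed population of 32 one-bits (a rough sketch of the idea, not my actual code):

Code: Select all

#include <algorithm>
#include <cstdint>
#include <random>

// Returns a random 64-bit value with exactly 32 bits set:
// shuffle a bit array containing 32 ones and 32 zeros.
uint64_t balancedAtom(std::mt19937_64 &rng) {
    int bits[64];
    for (int i = 0; i < 64; ++i) bits[i] = (i < 32);
    std::shuffle(bits, bits + 64, rng);
    uint64_t v = 0;
    for (int i = 0; i < 64; ++i)
        if (bits[i]) v |= 1ull << i;
    return v;
}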

Re: Zobrist keys - measure of quality?

Posted: Tue Feb 24, 2015 1:45 pm
by AlvaroBegue
I studied the number of collisions for a hash function in a different setting (hashing short strings, like ticker symbols in the stock market) and I found that cyclic redundancy codes (CRC) performed better than could be expected from purely random numbers. You can describe CRC as a Zobrist key set by making the i-th entry in the Zobrist table the coefficients of the polynomial x^(i+64) modulo P(x), where P(x) is a 64th-degree polynomial.

If you want to test out this idea, here's some C++ code to generate the Zobrist table:

Code: Select all

#include <cstdio>

int main() {
  unsigned long long const P = 0x42F0E1EBA9EA3693ull;
  unsigned long long h = P;
  for (int i = 0; i < 1000; ++i) {
    std::printf("%016llx\n", h);
    h = (h & 0x8000000000000000u) ? (h << 1) ^ P : h << 1;
  }
}

Re: Zobrist keys - measure of quality?

Posted: Tue Feb 24, 2015 2:04 pm
by sean_vn
Interesting discussion. This seems like a good forum for the sort of programming I am doing at the moment. Has anyone done any work on quasi-random numbers and hash tables? I tried Marsaglia's simple Weyl number idea with hash tables. It seems better than using a linear probe offset of 1.
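
By Marsaglia's Weyl idea I mean stepping the probe index by a fixed odd constant instead of +1; on a power-of-two table any odd stride is coprime to the table size, so the probe sequence still visits every slot exactly once. A minimal sketch of that probing scheme (the constant and the struct are just illustrative):

Code: Select all

#include <cstdint>
#include <vector>

// Open addressing over a power-of-two table: instead of stepping the probe
// index by +1 (linear probing), step by a fixed odd "Weyl" increment.
struct WeylProbeTable {
    static const uint64_t kStep = 0x9E3779B97F4A7C15ull;  // arbitrary odd constant
    std::vector<uint64_t> slots;  // 0 = empty sentinel
    explicit WeylProbeTable(size_t pow2Size) : slots(pow2Size, 0) {}

    void insert(uint64_t key) {
        uint64_t mask = slots.size() - 1;
        uint64_t i = key & mask;
        while (slots[i] != 0 && slots[i] != key)
            i = (i + kStep) & mask;  // Weyl step instead of +1
        slots[i] = key;
    }
};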

Re: Zobrist keys - measure of quality?

Posted: Tue Feb 24, 2015 2:06 pm
by AlvaroBegue
Arg, there is a typo and the forum won't let me edit the message. Here's the corrected code:

Code: Select all

#include <cstdio>

int main() {
  unsigned long long const P = 0x42F0E1EBA9EA3693ull;
  unsigned long long h = P;
  for (int i = 0; i < 1000; ++i) {
    std::printf("%016llx\n", h);
    h = (h & 0x8000000000000000ull) ? (h << 1) ^ P : (h << 1);
  }
}

Re: Zobrist keys - measure of quality?

Posted: Tue Feb 24, 2015 2:20 pm
by mar
AlvaroBegue wrote:I studied the number of collisions for a hash function in a different setting (hashing short strings, like ticker symbols in the stock market) and I found that cyclic redundancy codes (CRC) performed better than could be expected from purely random numbers.
Interesting, I did a similar test of hash functions some time ago (looking for a good general-purpose hash function) and found that CRC-32 indeed has a pretty good distribution (but it is also quite slow unless you use slicing tables).
I also found that Hsieh's SuperFastHash is (a) not that super fast and (b) produces more collisions than the other hashes I tried.
I used an English wordlist of 1.6M words, hashing strings.
I'm posting my hash function here (originally inspired by MurmurHash3); I hereby place it in the public domain. It only has 32-bit output though.
Another disadvantage right now is that it requires 32-bit aligned input and is not incremental.
It also passes the smhasher tests.

Code: Select all

// "random" keys
// primes are essential for multiplicative keys
static const UInt KEY1 = 0xdb91908du;	// prime
static const UInt KEY2 = 0x6be5be6fu;	// prime
static const UInt KEY3 = 0x53a28fc9u;	// prime
static const UInt KEY4 = 0xf8ffd50bu;	// prime

// rotate &#40;left, could be right, doesn't matter because left = 32-right&#41;
static inline UInt Rotate&#40; UInt u, Byte amount )
&#123;
	CORE_ASSERT&#40; amount && amount < 32 );
	return &#40;u << amount&#41; | &#40;u >> &#40;Byte&#41;&#40;32-amount&#41;);
&#125;

// faster than Murmur3 32-bit
static inline UInt MyHashUInt&#40; UInt u, UInt h )
&#123;
	// passes smhasher tests
	u = Rotate&#40; u*KEY1, 1 );
	return u -= Rotate&#40; h ^ KEY2, 11 );
&#125;

static inline UInt MyHash&#40; const void *buf, size_t len, UInt seed = 0 )
&#123;
	const Byte *b = &#40;const Byte *&#41;buf;
	CORE_ASSERT&#40; len < 4 || !(&#40;UIntPtr&#41;b & 3&#41; );
	const UInt *u = &#40;const UInt *&#41;b;
	size_t ulen = len/4;

	UInt h = KEY3 ^ seed;
	h ^= len * KEY4;

	b += ulen*4;
	len &= 3;

	while ( ulen-- ) &#123;
		h = MyHashUInt&#40; *u++, h );
	&#125;

	// process remaining data if any&#58;
	UInt last = 0;
	switch&#40; len )
	&#123;
	case 3&#58;
		last = &#40;UInt&#41;b&#91;2&#93; << 16;
	case 2&#58;
		last |= &#40;UInt&#41;b&#91;1&#93; << 8;
	case 1&#58;
		last |= &#40;UInt&#41;b&#91;0&#93;;
		h = MyHashUInt&#40; last, h );
	default&#58;
		break;
	&#125;

	// finalize &#40;avalanche&#41;
	h ^= h >> 7;
	h ^= h << 21;
	return h;
&#125;
This comes directly from my new simple framework, so UInt = uint32_t, Byte = uint8_t, CORE_ASSERT = assert.
It's also quite fast, except that clang unfortunately fails to fold one of the rotations into a rotate instruction (I assume because another optimization is applied before the rotate-folding).
I will look at your code in more detail later, thanks.
EDIT: it also produces different results on little- vs big-endian machines, but since I use it as a general-purpose hash function for my hashset, I don't really care as long as it's fast and has a good distribution.
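
For anyone who wants to try the code outside my framework, the typedefs mentioned above translate to roughly this (assuming standard headers):

Code: Select all

#include <cassert>
#include <cstddef>
#include <cstdint>

typedef uint32_t  UInt;
typedef uint8_t   Byte;
typedef uintptr_t UIntPtr;
#define CORE_ASSERT(x) assert(x)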