how to measure frequency of hash collisions.

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

how to measure frequency of hash collisions.

Post by Daniel Shawul »

Is there a way to directly measure the number of hash collisions (not just their effect)? I was thinking of storing the FEN along with the hash key in the transposition table entry, so that if the signatures match but the FENs don't, it would be a collision. Is there a better way?
Edit: Right after writing this I realized that storing _two_ hash keys is better (one could be a string hash of the FEN). If the primary keys match but the secondary ones don't, it is a collision.
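A minimal sketch of the two-key check (the type and function names here are illustrative, not from any particular engine):

```cpp
#include <cstdint>

// TT entry carrying a second, independent signature purely for
// collision detection.
struct Entry {
    uint64_t key1;   // primary key: indexes and tags the entry
    uint64_t key2;   // independent secondary key, used only as a check
};

// A primary-key match with a secondary-key mismatch is a provable
// collision; matching on both keys is treated as the same position.
bool is_collision(const Entry& e, uint64_t key1, uint64_t key2) {
    return e.key1 == key1 && e.key2 != key2;
}
```

Note this only detects collisions of the primary key; the vanishingly rare case where both keys collide at once goes unnoticed.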
User avatar
marcelk
Posts: 348
Joined: Sat Feb 27, 2010 12:21 am

Re: how to measure frequency of hash collisions.

Post by marcelk »

Daniel Shawul wrote:Is there a way to directly measure the number of hash collisions (not just their effect)? I was thinking of storing the FEN along with the hash key in the transposition table entry, so that if the signatures match but the FENs don't, it would be a collision. Is there a better way?
Edit: Right after writing this I realized that storing _two_ hash keys is better (one could be a string hash of the FEN). If the primary keys match but the secondary ones don't, it is a collision.
You can get a good estimate by measuring the number of near-collisions.
For example, each time you have a miss, count how many of the left-most bits match: XOR the two keys and look for the highest bit set, call that index 'b'. Then increment count[b]. Then make a histogram with b on one axis and log(count) on the other. You should get a linear plot once you have enough measurements. The number of false hits would be approximately count[0]/2.

There are variants on this that are essentially the same (instead of XOR use subtraction, or count Hamming distance, or count the right-most match, etc.)
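A minimal sketch of the bucketing step described above, assuming 64-bit keys (the function name is my own):

```cpp
#include <cstdint>

// On a miss, bucket the pair by the index of the highest differing bit
// between the stored key and the probing key: a small index means many
// leading bits matched, i.e. a near-collision.
int near_collision_bucket(uint64_t stored, uint64_t probe) {
    uint64_t diff = stored ^ probe;
    if (diff == 0) return -1;         // full match: a hit, not a near-miss
    int b = 63;
    while (!((diff >> b) & 1)) b--;   // highest set bit of the XOR
    return b;                         // b == 0: all but the lowest bit matched
}
```

Incrementing count[b] per miss and plotting log(count) against b should then give the linear plot described above.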
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Some tests: Attn Don

Post by Daniel Shawul »

marcelk wrote:
Daniel Shawul wrote:Is there a way to directly measure the number of hash collisions (not just their effect)? I was thinking of storing the FEN along with the hash key in the transposition table entry, so that if the signatures match but the FENs don't, it would be a collision. Is there a better way?
Edit: Right after writing this I realized that storing _two_ hash keys is better (one could be a string hash of the FEN). If the primary keys match but the secondary ones don't, it is a collision.
You can get a good estimate by measuring the number of near-collisions.
For example, each time you have a miss, count how many of the left-most bits match: XOR the two keys and look for the highest bit set, call that index 'b'. Then increment count[b]. Then make a histogram with b on one axis and log(count) on the other. You should get a linear plot once you have enough measurements. The number of false hits would be approximately count[0]/2.

There are variants on this that are essentially the same (instead of XOR use subtraction, or count Hamming distance, or count the right-most match, etc.)

Thanks for the suggestion. I think that will be a good way to measure for PRNG-generated hash keys, where collisions are very rare. Because of that it may be better to test for similarity of keys...

I did a quick test over the formulas we have been examining, and the collision rate is embarrassing :( The keys differ in a subtle way and look good visually, but there are many collisions.
FNV-1 => 25000 collisions/sec
FNV-1a => 15 collisions/sec
The one I suggested with a rotate & multiply => 1600 collisions/sec !!

Code: Select all

const U64 C1 = U64(0x109951162821a737);
U64 h = U64(0xeadf018b615d3be8);
h = (h << sq) ^ (h >> (64 - sq));
h *= C1;
h = (h << cp) ^ (h >> (64 - cp));
return h;
I changed the third line to

Code: Select all

h += C1;
and collisions dropped to 0 collisions/second so far...
Visual inspection is definitely misleading. And the power of addition is revealed again ;)

More later
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Some tests: Attn Don

Post by Daniel Shawul »

I tested separate tables for [square] and [piece] combined by either multiply or add; both gave 0 collisions/sec after 200 million positions. The rotating code which does h += C1 also gave 0 collisions after the same number of positions, though with a shorter C1 it can give 1-2 collisions per second, and using multiplication there is surprisingly bad. I guess the rotates by square & cp are somehow related to the multiplication. I am trying many formulas now, deliberately modifying some of them to give overlapping signatures, and the tool spots them right away with an astounding 20000 collisions/second... It would have been much easier to devise a hash function with this tool in hand. If you get 0 collisions after a few seconds, the hash function is probably good enough.

What I do is randomly select a number of pieces and their placement (no overlaps), calculate the two hash keys, and store them in the TT. Before replacing an entry I check for a collision with the stored position, and it works pretty well. It doesn't require a chess engine or perft to test. I don't know if I should limit the domain to similar positions to catch collisions faster, but this works pretty well so far.
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: how to measure frequency of hash collisions.

Post by Don »

Daniel Shawul wrote:Is there a way to directly measure the number of hash collisions (not just their effect)? I was thinking of storing the FEN along with the hash key in the transposition table entry, so that if the signatures match but the FENs don't, it would be a collision. Is there a better way?
Edit: Right after writing this I realized that storing _two_ hash keys is better (one could be a string hash of the FEN). If the primary keys match but the secondary ones don't, it is a collision.
You can estimate the collision rate by reserving N bits for checking. So if your key is 64 bits, pretend it's only 60 bits and 4 bits are for collision testing. If the 4 bits do not match, it was a collision. You can extrapolate to get the 64-bit collision rate estimate: each time you add a bit you can expect half the number of collisions.

It's arbitrary how many bits to use for the collision test; the more you use, the less often it will be wrong. If you use just 1 bit it will be wrong a lot; if you use 4, it will be wrong 1/16 of the time on average. Note that it will never return false positives: if it claims a collision, there IS a collision. But it will occasionally say there is no collision when there is one, since it's only accurate to the 64-bit resolution of your full key.
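A sketch of the bit-borrowing check, assuming 4 check bits are taken from the top of the existing 64-bit key (names are illustrative):

```cpp
#include <cstdint>

const int CHECK_BITS = 4;
const uint64_t KEY_MASK = ~0ull >> CHECK_BITS;   // low 60 bits

// Flag a 60-bit collision: the truncated keys match while the borrowed
// check bits differ, so the full 64-bit keys provably belong to
// different positions.
bool is_60bit_collision(uint64_t stored, uint64_t probe) {
    return (stored & KEY_MASK) == (probe & KEY_MASK) &&
           (stored >> (64 - CHECK_BITS)) != (probe >> (64 - CHECK_BITS));
}
```

Dividing the observed 60-bit collision count by 2^CHECK_BITS = 16 then gives the estimate for the full 64-bit key.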

Don
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
User avatar
Rebel
Posts: 6991
Joined: Thu Aug 18, 2011 12:04 pm

Re: how to measure frequency of hash collisions.

Post by Rebel »

Daniel Shawul wrote:Is there a way to directly measure the number of hash collisions (not just their effect)? I was thinking of storing the FEN along with the hash key in the transposition table entry, so that if the signatures match but the FENs don't, it would be a collision. Is there a better way?
Edit: Right after writing this I realized that storing _two_ hash keys is better (one could be a string hash of the FEN). If the primary keys match but the secondary ones don't, it is a collision.
Three things I have done in the past:

1. Create an extra byte (a mini hash-key) in the TT entry, serving as a collision reporter.

2. Store the position itself as an extra test.

3. For creating a 64-bit hash key, use two different 32-bit randomizing routines: one via the standard C library (srand + rand) and another one I found on the net.

Even with a 48-bit hash key I hardly got collisions, but it was a long time ago. Perhaps the magic is in (3): using different randomizer code for the two 32-bit halves, and combining them into a U64 after generation.
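A sketch of idea (3), with a xorshift32 generator standing in for the second routine (the generator choice and names are my own assumptions):

```cpp
#include <cstdint>
#include <cstdlib>

// Second, independent 32-bit generator (xorshift32).
static uint32_t xs_state = 0x12345678u;   // arbitrary nonzero seed

uint32_t xorshift32() {
    xs_state ^= xs_state << 13;
    xs_state ^= xs_state >> 17;
    xs_state ^= xs_state << 5;
    return xs_state;
}

// Build a 64-bit key from two different randomizers, one per half.
uint64_t rand64_twosources() {
    // rand() may deliver as few as 15 random bits, so fold three calls
    uint32_t lo = (uint32_t)rand() ^ ((uint32_t)rand() << 15) ^ ((uint32_t)rand() << 30);
    uint32_t hi = xorshift32();
    return ((uint64_t)hi << 32) | lo;
}
```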
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: how to measure frequency of hash collisions.

Post by Daniel Shawul »

Rebel wrote:
Daniel Shawul wrote:Is there a way to directly measure the number of hash collisions (not just their effect)? I was thinking of storing the FEN along with the hash key in the transposition table entry, so that if the signatures match but the FENs don't, it would be a collision. Is there a better way?
Edit: Right after writing this I realized that storing _two_ hash keys is better (one could be a string hash of the FEN). If the primary keys match but the secondary ones don't, it is a collision.
Three things I have done in the past:

1. Create an extra byte (a mini hash-key) in the TT entry, serving as a collision reporter.

2. Store the position itself as an extra test.

3. For creating a 64-bit hash key, use two different 32-bit randomizing routines: one via the standard C library (srand + rand) and another one I found on the net.

Even with a 48-bit hash key I hardly got collisions, but it was a long time ago. Perhaps the magic is in (3): using different randomizer code for the two 32-bit halves, and combining them into a U64 after generation.
Yes, that seems to be the case. I think (3) is the best. If two different hash functions happen to collide (give the same 64-bit hash signature) at the same position, it is a miracle, in fact a miracle multiplied by a miracle. It is already very rare to get a hash collision for two different positions with one decent hash function, so the chance of two different hash functions colliding at the same position is practically zero. For my test I used the same random number generator, but stored two sequences of primary & secondary hash keys as a quick hack.

For those interested in trying out their hash functions for collisions, here is the code. There could be a bug though.

Code: Select all

#include <stdio.h>
#include <time.h>

//U64
#ifdef _MSC_VER
	typedef unsigned __int64 U64;
	typedef unsigned int U32;
#	define U64(x) (x##ui64)
#	define FMTU64 "0x%016I64x"
#else
#   include <inttypes.h>
	typedef uint64_t U64;
	typedef uint32_t U32;
#	define U64(x) (x##ull)
#	define FMTU64 "0x%016llx"
#endif

#define unitBB(x) ((U64)1 << (x))

//PRNG
#define MY_RAND_MAX  0x7fff

struct PRNG {
	U32 randn;

	void seed(int sd) {
		randn = sd;
	}

	U32 rand() {
		randn *= 214013;
		randn += 2531011;
		return ((randn >> 16) & MY_RAND_MAX);
	}

	U64 rand64() {
		return ((U64)rand()) ^
			  ((U64)rand() << 15) ^ ((U64)rand() << 30) ^
			  ((U64)rand() << 45) ^ ((U64)rand() << 60);
	}
};

//TT entry
struct ENTRY {
	U64 key1;
	U64 key2;
	U32 stores;
};

//pieces
int piece[32];
int square[32];
int npieces;
U64 secondary_hkey[14][64];
U64 cp_key[14];
U64 sq_key[64];
ENTRY* table;

//put your hash function here (the /*/ comment toggles select one variant)
U64 hashf(int cp,int sq) {
	/*/
	const U64 C1 = U64(0xeadf018b615d3be8);
	const U64 C2 = U64(0x109951162821a737);
	U64 h = C1;
	h = (h << sq) ^ (h >> (64 - sq));
	h += C2;
	h = (h << cp) ^ (h >> (64 - cp));
	return h;
	/*
	const U64 FNV = U64(1099511628211);
	U64 h = U64(19282138887198);
	h = h * FNV;
	h = h ^ cp;
	h = h * FNV;
	h = h ^ sq;
	return h;
	/*
	return cp_key[cp] * sq_key[sq];
	/*/
	return cp_key[cp] + sq_key[sq];
	//*/
}

//primary key
U64 get_key1() {
	U64 key = 0;
	for(int i = 0;i < npieces;i++)
		key ^= hashf(piece[i],square[i]);
	return key;
}
//secondary key to compare to
U64 get_key2() {
	U64 key = 0;
	for(int i = 0;i < npieces;i++)
		key ^= secondary_hkey[piece[i]][square[i]];
	return key;
}

int main() {
	const int TT_SIZE = 1 << 16;
	const unsigned int TT_MASK = TT_SIZE - 1;
	const int PIECES_LIMIT = 32; //reduce this for faster tests (endgame positions)

	//table (value-initialized, so stale garbage can't fake a key match)
	table = new ENTRY[TT_SIZE]();

	//secondary hashkeys
	PRNG prng;
	prng.seed(0);
	for(int i = 0;i < 14;i++)
		for(int j = 0;j < 64;j++)
			secondary_hkey[i][j] = prng.rand64();

	for(int i = 0;i < 14;i++)
		cp_key[i] = prng.rand64();
	for(int i = 0;i < 64;i++)
		sq_key[i] = prng.rand64();

	//test for collisions of primary hash keys
	U64 all,key1,key2,bb;
	U32 index;
	U32 collisions = 0;
	int sq,millions = 0,j = 0;
	ENTRY* pentry;
	clock_t start,end;
	start = clock();

	while(true) {
		j++;

		//set up random position
		all = 0;
		npieces = (prng.rand() % PIECES_LIMIT) + 1;
		for(int i = 0;i < npieces;i++) {
			piece[i] = (prng.rand() % 12);
			do {
				sq = (prng.rand() % 64);
				bb = unitBB(sq);
			} while((all & bb));
			square[i] = sq;
			all |= bb;
		}

		//get the two keys & store them
		key1 = get_key1();
		key2 = get_key2();

		index = U32(key1 & TT_MASK);
		pentry = &table[index];
		if(pentry->key1 == key1) {
			if(pentry->key2 != key2) {
				collisions++;
			}
		} else {
			pentry->key1 = key1;
			pentry->key2 = key2;
			pentry->stores++;
		}

		//display stat
		if((j % 1000000) == 0) {
			end = clock();
			double time = double(end - start) / CLOCKS_PER_SEC; //portable timing
			printf("Time %.0fs Positions %d\n",time,j);
			printf("  Collisions = %d. Rate = %.2f collisions per sec.\n",
				collisions,collisions / (time > 0 ? time : 1.0));
			millions++;
			if(millions == 100) break;
		}
	}
	/*
	FILE* log = fopen("test.txt","w");
	for(int i = 0;i < TT_SIZE;i++) {
		fprintf(log,"%d\n",table[i].stores);
	}
	fclose(log);
	*/
	/*
	for(int i = 0;i < 64;i++)
		printf(FMTU64"\n",hashf(i,5));
	*/
	delete[] table;
	return 0;
}

Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: how to measure frequency of hash collisions.

Post by Daniel Shawul »

Don wrote: You can estimate the collision rate by reserving N bits for checking. So if your key is 64 bits, pretend it's only 60 bits and 4 bits are for collision testing. If the 4 bits do not match, it was a collision. You can extrapolate to get the 64-bit collision rate estimate: each time you add a bit you can expect half the number of collisions.
Don
But that won't work, because in a hash collision the hash signature (all 64 bits) is the same for two completely different positions... You need a key from another sequence of random numbers (be it from the same or a different hash function). Am I missing something?
User avatar
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: how to measure frequency of hash collisions.

Post by hgm »

Daniel Shawul wrote:Is there a way to directly measure the number of hash collisions (not just their effect)?
I was just asking myself the same question. I wrote a program to generate random positions (assign random squares and piece types to a number of pieces, taking care that no square is used twice), and compared the number of collisions I got with truly random Zobrist keys against keys derived with your pieceKey+squareKey idea. I checked with a second key, but that wasn't really needed, because all primary-key matches were indeed collisions, as the chance of randomly generating the same position twice is about zero. I could not detect a significant difference in collision frequency between the random keys and the composite keys.

I don't think this is a decisive test, though. What is important is how many collisions you get when you have a set of very similar positions, as occur in a search tree.

A method I used in the past to test the quality of a set of Zobrist keys is to count the number of 'dependency loops': subsets of N different keys that XOR to zero. For a given key length and a much larger total number of keys, there is a certain N beyond which such cycles cannot be avoided. The quality of the set of keys is determined by whether you already have cycles of lower N (i.e. an unnecessary dependency between a small number of keys).

E.g. with 768 keys you have 768*767*766/6 = 75M different 'products' (XORs, actually) of 3 keys, which conceivably could all be different if the key length is 27 bits (75M < 2^27). This means you can have 27-bit key sets where no product of 6 keys is zero. You have 14G different products of 4 keys, so with 32-bit keys there must be duplicates, meaning there must be products of 8 keys that are zero. Keys of realistic length (such as 64 bits) are hard to test, because collisions are extremely rare. So it is best to test model systems of short key length, where collisions are frequent, and you can efficiently detect them by storing the complete key space in a table (so you can easily detect how often a product of N keys is repeated). Cycles of odd length (say 7) can be detected by putting all products of 3 keys in a hash table, and then checking all products of 4 keys against that table, to see if they occur there.

Note that this is not yet the full story: in practice a very bad dependency (like 3 keys XORing to 0) could be completely hidden by putting the bad keys in the table for a piece type that occurs only once (like the King).
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: how to measure frequency of hash collisions.

Post by Don »

Daniel Shawul wrote:
Don wrote: You can estimate the collision rate by reserving N bits for checking. So if your key is 64 bits, pretend it's only 60 bits and 4 bits are for collision testing. If the 4 bits do not match, it was a collision. You can extrapolate to get the 64-bit collision rate estimate: each time you add a bit you can expect half the number of collisions.
Don
But that won't work, because in a hash collision the hash signature (all 64 bits) is the same for two completely different positions... You need a key from another sequence of random numbers (be it from the same or a different hash function). Am I missing something?
Yes, what you are missing is that you are not testing the 64-bit key; you are testing a 60-bit key. In my example, just pretend that you are generating a 60-bit key and then a totally independent 4-bit key. Let's say, for example, that you get a collision on the 60-bit key once out of 1 million matches (as judged by the 4-bit verify key). You can expect that, had it been a 64-bit key instead of 60 bits, you would get only 1/16 as many collisions, since 2^4 is 16. Each extra bit cuts the number of expected collisions in half.

This is superior to any of the other suggested methods because it does not require modifying the program to add more key space; you just borrow a few bits from the already existing key. So it could be added to the program with no additional overhead. In fact, the program could continue to use the full 64-bit key while doing the 60-bit collision-detection test for statistical purposes, and the result could even be put in the program's log file. If you use 4 bits for collision detection, you count how many of the 60-bit keys collided and divide by 16 to get an ESTIMATE of how often a 64-bit key would have collided.
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.