how to measure frequency of hash collisions.

Daniel Shawul · Post by **Daniel Shawul** » Sun Jun 17, 2012 4:00 am

Using a hash key length of 28 , I was able to confirm that multiplication gives less collisions than addition. With the full 64 bit length I was not getting any collisions compared to the composite method. Multiplications shows about 150 collisons/sec and addition 220 collisions/sec. So indeed it is possible to quantify performance of hash functions using smaller key length. I also checked AND and OR, worst is OR . Same pattern seem to exist for other key lengths

Code: Select all

          28 bit              20 bit 
MUL    150 coll/sec     40335 coll/sec 
ADD   220 coll/sec     55371 coll/sec 
AND   250 coll/sec     57219 coll/sec 
OR     457 coll/sec     60240 coll/sec

And the one below that doesn't use tables has 120 coll/sec! even better than multiplying two pre-generated tables. So atleast it seems to work.

Code: Select all

U64 hashf&#40;int cp,int sq&#41; &#123;
	const U64 C1 = U64&#40;0xeadf018b615d3be8&#41;;
	const U64 C2 = U64&#40;0x109951162821a737&#41;;
	U64 h = C1;
	h = &#40;h << sq&#41; ^ &#40;h >> &#40;64 - sq&#41;); 
	h += C2;
	h = &#40;h << cp&#41; ^ &#40;h >> &#40;64 - cp&#41;);
	return h;
&#125;

All collision tests used the composite key as secondary key.
Both FNV-1 and 1a formulas we tested are poor giving thousands of collisions per sec.

lucasart · Post by **lucasart** » Sun Jun 17, 2012 5:02 am

Daniel Shawul wrote:Is there a way to directly measure the number of hash collisions (not its effect). I was thinking of storing the fen along with the hash key in transposition table entry, so that if the signatures match but the fens don't it would be collision. Is there a better way ?
Edit: Right after I write this I realized storing _two_ hash keys is better than this (one could be a string hash of the fen). If the primary keys match but the second one don't it is a collision.

Be careful with comparing FEN. The FEN also contains the 50 move counter, and the play count. Those must be discarded, or course.
Probably easier would be to have a benchmark method, uwing 64 bits of a known good PRNG (like the Mersenne Twister or Bob Jenkins' 64 bit KISS). These hould have no collisions, or collisions would be so rare that you probably wouldn't see one in all your testing time. Now you can compare other PRNG by reducing the number of bits.

I wonder for example, how often you get collisions with 32 bits of an MT or KISS. For example Stockfish uses 32 bits only of Bob Jenkins' KISS 64 bit generator, and I use 62 bits of the same generator. Probably 32 is enough, but I wanted to be on the safe side.

marcelk · Post by **marcelk** » Sun Jun 17, 2012 9:55 am

Don wrote: Honestly, when I saw the problem posed I was suggesting a good way to accurately estimate the hash collision rate, none of the suggestions posted on this thread can judge the quality of the Zobrist hashing because they all require separate verification bits and the method falls apart no matter what unless those verification bit's are properly generated. If you can do that then you might as well generate all the random bits that way.

Then I think I have not expressed my method clearly enough because it doesn't add bits. In fact, your 60-4 proposal is a special case of it.

I agree that if the hash quality can't be trusted at all then this method also doesn't work anymore (but you don't need a very good RNG at all for zobrist hashing to work). That is easy to see when you consider the most extreme case of that: the keys all all-zero. You will have no misses but you have no idea how many are false. Under such conditions one has to include the FEN in the comparison.

Rémi Coulom · Post by **Rémi Coulom** » Sun Jun 17, 2012 10:30 am

Daniel Shawul wrote:Using a hash key length of 28 , I was able to confirm that multiplication gives less collisions than addition.

Interesting experiment, thanks for running it.

Can you give the number of collisions for the usual method so that we can compare?

Rémi

Daniel Shawul · Post by **Daniel Shawul** » Sun Jun 17, 2012 12:49 pm

There is one flaw in the experiment I just noticed. Collisions/sec is not a good measurement! Multiplication is slow while addition is fast. So this biases the result in favor of multiplication since it takes longer time. I redid the experiments with collisions per million,and added mersenne twister and stockfish kiss to the mix. Now it got harder to differentiate..

Code: Select all

            28 bit key          
MUL         243 coll/mil    
ADD         247 coll/mil      
AND         542 coll/mil     
OR          410 coll/mil 
MY_CODE     240 coll/mil    
MT19927     240 coll/mil
RKISS       240 coll/mil

Now MUL and ADD are almost on the same level, while AND/OR are significantly poorer. The code I gave before that does rotations with sq & cp is only slightly better this time around. Mersenne and RKISS don't bring that much to the mix. I always doubted the need for complex random generators...

Let me know if you see flaws in the setup.

Daniel Shawul · Post by **Daniel Shawul** » Sun Jun 17, 2012 4:58 pm

I got an interesting post of yours about frequency of collisions you made a long time ago. It seems to be the birthday paradox problem with different m and n. Wikipedia also cites the same formula:

Code: Select all

Pr&#40;n,m&#41; = 1 - exp&#40;m * m / 2 * n&#41;

So for 64 bit keys (n = 2^64), one or more collision happens at the level of sqrt(n) as you said = 2^32 = 4G. It is not as bad as I thought. But that will require us to compare each key with the rest which makes it less likely to get a collision doing just that. It will probably take a lot more than 4G.

I understood that part but got lost with further discussion that involve hash tables. In this case the comparisons are applied only to the key stored at the same index. I can't even make a general statement as to whether using a larger/small hash table will increase or decrease the chances of finding a collision... I mean with larger tables the comparisons will be lower but then smaller tables would replace entries quite often. So what is an optimal size for that?

Another doubt I have is with using search to get more collisions since the keys will be more "dependent". If I was able to compare all the keys with each other, then I think the brute force approach that scans the set of keys uniformly would be better. Also supposedly zobrist XOR supposedly changes the key significantly even after a single move. So why would search that has dependency show more collisions than brute force? Infact it may narrow the test domain since correlated positions are being tested.
I thik that one XOR changes the hashkey as much as 12 moves (XORs) do so it may not be better to use search. It is just a doubt I have and use of hash tables for collision search is making it difficult for me to understand.

syzygy · Post by **syzygy** » Sun Jun 17, 2012 5:08 pm

lucasart wrote:I wonder for example, how often you get collisions with 32 bits of an MT or KISS. For example Stockfish uses 32 bits only of Bob Jenkins' KISS 64 bit generator, and I use 62 bits of the same generator. Probably 32 is enough, but I wanted to be on the safe side.

With a full hashtable, bucket size of 4 and assuming the Zobrist hash is effectively random, 32 bits should give a collission about once every one billion (= 2 ^ 30) nodes (or actually a bit less often, since hash hits canot collide). Seems acceptable to me, as long as collissions cannot crash the engine.

With very large hash tables and corresponding long searches and/or high nps, even storing 64 bits will not avoid collissions, since most of the extra bits will be used as index into the table anyway.

syzygy · Post by **syzygy** » Sun Jun 17, 2012 5:41 pm

Daniel Shawul wrote:I got an interesting post of yours about frequency of collisions you made a long time ago. It seems to be the birthday paradox problem with different m and n. Wikipedia also cites the same formula:
Code: Select all
Pr&#40;n,m&#41; = 1 - exp&#40;m * m / 2 * n&#41;

I don't like this approach too much when applied to hash tables, since even if two nodes in the tree have the same hash value, due to the limited size of the hash table it can easily be the case that there is no actual collission during the search. When the second node is encountered, chances are high that the first node has already been evicted from the hash table.

Therefore I prefer an easier approach. Just assume the hash table is full (either with nodes from the current search or from a previous search). Collission rate is then 1 in every 2 ^ (number of random hash bits stored) / bucket_size nodes. (See also this post by Rémi.)

Daniel Shawul · Post by **Daniel Shawul** » Sun Jun 17, 2012 7:58 pm

syzygy wrote:
Daniel Shawul wrote:I got an interesting post of yours about frequency of collisions you made a long time ago. It seems to be the birthday paradox problem with different m and n. Wikipedia also cites the same formula:
Code: Select all
Pr&#40;n,m&#41; = 1 - exp&#40;m * m / 2 * n&#41;
I don't like this approach too much when applied to hash tables, since even if two nodes in the tree have the same hash value, due to the limited size of the hash table it can easily be the case that there is no actual collission during the search. When the second node is encountered, chances are high that the first node has already been evicted from the hash table.

Therefore I prefer an easier approach. Just assume the hash table is full (either with nodes from the current search or from a previous search). Collission rate is then 1 in every 2 ^ (number of random hash bits stored) / bucket_size nodes. (See also this post by Rémi.)

Ok that will be 2^64 with a bucket size of 1 for a 64 bits hashkey (no change). My doubt is with the replacement scheme. In the previous birthday formula, we assume that each key is compared against all available keys which is what brings up the probablity higher. But for this case I don't know how to quantitiy how many useful comparisons that lead to collision will be done. With a hashtable many positions map to the same entry (assume butcket_size=1). The entry is home to many hash signatures that may actually be real collisions or not in most cases. To detect collision the stored key should happen to be exactly same as the one we are going to store. Say A,B,C map to the same entry but only A == C. First we store C, then B, then A. Thus we will not detect collision. So don't you think the replacement scheme will have an effect ?

Daniel Shawul · Post by **Daniel Shawul** » Sun Jun 17, 2012 8:12 pm

lucasart wrote:
Daniel Shawul wrote:Is there a way to directly measure the number of hash collisions (not its effect). I was thinking of storing the fen along with the hash key in transposition table entry, so that if the signatures match but the fens don't it would be collision. Is there a better way ?
Edit: Right after I write this I realized storing _two_ hash keys is better than this (one could be a string hash of the fen). If the primary keys match but the second one don't it is a collision.
Be careful with comparing FEN. The FEN also contains the 50 move counter, and the play count. Those must be discarded, or course.
Probably easier would be to have a benchmark method, uwing 64 bits of a known good PRNG (like the Mersenne Twister or Bob Jenkins' 64 bit KISS). These hould have no collisions, or collisions would be so rare that you probably wouldn't see one in all your testing time. Now you can compare other PRNG by reducing the number of bits.

I wonder for example, how often you get collisions with 32 bits of an MT or KISS. For example Stockfish uses 32 bits only of Bob Jenkins' KISS 64 bit generator, and I use 62 bits of the same generator. Probably 32 is enough, but I wanted to be on the safe side.

I tested both and it seems I get 240 collisions/million on average for 28-bit key length with any decent PRNG. So p = 0.24*10^-3 collision rate for 28 bit key and a hash table of 1<<16 entries with bucket size of 1. I am quoting this detail for anyone that wants to do some calculations to arrive at an approximate figure. Once the mean rate of collision is known poisson equation can be used for prediction.

how to measure frequency of hash collisions.

Re: how to measure frequency of hash collisions.

Re: how to measure frequency of hash collisions.

Re: how to measure frequency of hash collisions.

Re: how to measure frequency of hash collisions.

Re: how to measure frequency of hash collisions.

Re: how to measure frequency of hash collisions.

Re: how to measure frequency of hash collisions.

Re: how to measure frequency of hash collisions.

Re: how to measure frequency of hash collisions.

Re: how to measure frequency of hash collisions.