how to measure frequency of hash collisions.

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

syzygy
Posts: 5563
Joined: Tue Feb 28, 2012 11:56 pm

Re: how to measure frequency of hash collisions.

Post by syzygy »

Daniel Shawul wrote:Ok that will be 2^64 with a bucket size of 1 for a 64-bit hashkey (no change).
Only the bits that are not part of the index into the hash table count. If the hash table has say 2^24 entries, only 40 bits of the hash key are available to distinguish between nodes mapped to the same entry. Collision rate is 1 in every 2^40 nodes. If the hash table has 2^28 entries, the collision rate will be 1 in every 2^36 nodes. (Of course if the index comes from a different hash key, the full 64 bits count.)
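This arithmetic can be sketched in a few lines (the function name and the power-of-two table size are illustrative assumptions, not from the post):

```python
# Illustrative sketch: per-node collision rate when the table index
# is taken from the low bits of the same key (power-of-two table).
def collision_rate(key_bits, table_entries):
    index_bits = table_entries.bit_length() - 1   # log2(table_entries)
    distinguishing_bits = key_bits - index_bits   # bits that "count"
    return 1.0 / (1 << distinguishing_bits)       # collisions per probed node

print(collision_rate(64, 2**24) == 2**-40)  # 2^24 entries: 1 in 2^40
print(collision_rate(64, 2**28) == 2**-36)  # 2^28 entries: 1 in 2^36
```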
My doubt is with the replacement scheme.
The replacement scheme has no effect on the collision rate (if we ignore the rate of hash hits). With a full table and a bucket size of 1, for each node searched there is only a collision if the node currently stored in its bucket happens to have the same hashkey, which is determined by the distinguishing bits (40 bits or 36 bits in the examples above).
In the previous birthday formula, we assume that each key is compared against all available keys, which is what pushes the probability higher. But for this case I don't know how to quantify how many useful comparisons that lead to collisions will be done. With a hash table, many positions map to the same entry (assume bucket_size=1). The entry is home to many hash signatures that may or may not be real collisions. To detect a collision, the stored key has to be exactly the same as the one we are about to store. Say A, B, C map to the same entry but only A == C. First we store C, then B, then A. Thus we will not detect the collision. So don't you think the replacement scheme will have an effect?
Of course the replacement scheme will affect individual collisions (if B does not overwrite C, there will be a collision between A and C, if B does overwrite C, there is no collision), but it will not affect the overall collision rate. The probability that A == C is the same as the probability that B == C. The point is that each time you compare to only one key. With a very small hash table (and the full 64 bits stored, so that the number of bits that "count" is high), collisions will be extremely rare even if the total tree searched has many nodes with identical hash keys.

Of course everything under the assumption that the Zobrist keys are independently and uniformly distributed.
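The claim that the rate depends only on the distinguishing bits, not on the table's replacement history, can be checked with a small Monte Carlo sketch (the setup and all parameters here are illustrative, not from the post):

```python
import random

# Monte Carlo sketch: distinct "positions" get random k-bit keys; the
# low index_bits select a bucket-size-1 slot; a collision is a probe
# whose full key matches the stored entry's key while the stored
# position is a different one.
def measure_collisions(key_bits, index_bits, n_probes, seed=1):
    rng = random.Random(seed)
    table = {}                      # index -> (key, position id)
    collisions = 0
    for pos in range(n_probes):
        key = rng.getrandbits(key_bits)
        idx = key & ((1 << index_bits) - 1)
        if idx in table:
            stored_key, stored_pos = table[idx]
            if stored_key == key and stored_pos != pos:
                collisions += 1
        table[idx] = (key, pos)     # always replace
    return collisions

# 20-bit keys, 2^10 slots -> 10 distinguishing bits, so roughly one
# collision per 2^10 = 1024 probes once the table has filled up.
print(measure_collisions(20, 10, 1_000_000))
```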
syzygy
Posts: 5563
Joined: Tue Feb 28, 2012 11:56 pm

Re: how to measure frequency of hash collisions.

Post by syzygy »

Daniel Shawul wrote:I tested both and it seems I get 240 collisions/million on average for a 28-bit key length with any decent PRNG. So p = 0.24*10^-3 is the collision rate for a 28-bit key and a hash table of 1<<16 entries with bucket size of 1. I am quoting this detail for anyone who wants to do some calculations to arrive at an approximate figure. Once the mean collision rate is known, the Poisson distribution can be used for prediction.
If the total key has 28 bits, and you use 16 for the index into the hash table, there are 12 bits left to distinguish between nodes mapped to the same entry. 2^12 = 4096, so the expected number of collisions per million of nodes is 10^6 / 4096 = 244.
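As a quick check of the arithmetic:

```python
# Check of the numbers above: 28-bit key, 16 bits used as the index.
distinguishing_bits = 28 - 16              # 12 bits left per entry
print(2**distinguishing_bits)              # 4096
print(10**6 // 2**distinguishing_bits)     # 244 collisions per million
```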

Your slightly lower number might be explained by actual hash hits among the 10^6 probes, which by definition cannot lead to a collision.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: how to measure frequency of hash collisions.

Post by Daniel Shawul »

Only the bits that are not part of the index into the hash table count. If the hash table has say 2^24 entries, only 40 bits of the hash key are available to distinguish between nodes mapped to the same entry. Collision rate is 1 in every 2^40 nodes. If the hash table has 2^28 entries, the collision rate will be 1 in every 2^36 nodes. (Of course if the index comes from a different hash key, the full 64 bits count.)
Thanks! That looks about right. I couldn't work it out from the example Rémi gave. The use of the term "hash codes" is a bit confusing: the first formula below suggests "hash codes" is the total number of entries, but later the number of TT bits is added to it, implying otherwise.

Code: Select all

pc = (number of entries) / (number of hash codes)
Each of them containing 32 bits of hash codes, with a rehash of 2. 
I had pc = 2 * 2^18 / 2^(18 + 32) hence 1/pc = 2^31
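Plugging the numbers in (as a sketch; reading "number of hash codes" as 2^(index bits + stored bits) is an assumption):

```python
# Daniel's numbers: 2^18 entries with a rehash of 2, and 32 stored
# hash-code bits on top of the 18 index bits.
entries = 2 * 2**18               # rehash of 2 doubles the probed slots
hash_codes = 2**(18 + 32)
pc = entries / hash_codes
print(1 / pc == 2**31)            # True
```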
Of course the replacement scheme will affect individual collisions (if B does not overwrite C, there will be a collision between A and C, if B does overwrite C, there is no collision), but it will not affect the overall collision rate. The probability that A == C is the same as the probability that B == C. The point is that each time you compare to only one key. With a very small hash table (and the full 64 bits stored, so that the number of bits that "count" is high), collisions will be extremely rare even if the total tree searched has many nodes with identical hash keys.
Indeed you are right. I tested by changing the always-replace strategy to never-replace (i.e. a one-time store) and I still get the same 240 collisions per million.
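A way to reproduce this experiment in miniature (the setup is assumed, not taken from the post): 28-bit keys, a 2^16-entry table, bucket size 1, one million probes of distinct positions, comparing the two replacement schemes.

```python
import random

# Count collisions under always-replace or never-replace with a
# bucket-size-1 table indexed by the low 16 bits of a 28-bit key.
def count_collisions(n_probes, always_replace, seed=7):
    rng = random.Random(seed)
    table = {}                       # index -> (key, position id)
    collisions = 0
    for pos in range(n_probes):
        key = rng.getrandbits(28)
        idx = key & 0xFFFF           # low 16 bits index the 2^16 table
        if idx in table:
            stored_key, stored_pos = table[idx]
            if stored_key == key and stored_pos != pos:
                collisions += 1
            if always_replace:
                table[idx] = (key, pos)
        else:
            table[idx] = (key, pos)  # empty slot: always store
    return collisions

# Both schemes should land near 10^6 / 2^12 = ~244 collisions.
print(count_collisions(1_000_000, True), count_collisions(1_000_000, False))
```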
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: how to measure frequency of hash collisions.

Post by Daniel Shawul »

syzygy wrote:
Daniel Shawul wrote:I tested both and it seems I get 240 collisions/million on average for a 28-bit key length with any decent PRNG. So p = 0.24*10^-3 is the collision rate for a 28-bit key and a hash table of 1<<16 entries with bucket size of 1. I am quoting this detail for anyone who wants to do some calculations to arrive at an approximate figure. Once the mean collision rate is known, the Poisson distribution can be used for prediction.
If the total key has 28 bits, and you use 16 for the index into the hash table, there are 12 bits left to distinguish between nodes mapped to the same entry. 2^12 = 4096, so the expected number of collisions per million of nodes is 10^6 / 4096 = 244.

Your slightly lower number might be explained by actual hash hits among the 10^6 probes, which by definition cannot lead to a collision.
Great! I skipped over the assumption of a full hash table, but that will probably not have a significant effect since the table is much smaller than the number of trials.
User avatar
jshriver
Posts: 1342
Joined: Wed Mar 08, 2006 9:41 pm
Location: Morgantown, WV, USA

Re: how to measure frequency of hash collisions.

Post by jshriver »

When I was testing mine out I just fed it a couple of gigs of FEN positions and output the hash keys to a text file.

Then did a sort output.txt followed by uniq > clean.txt.

wc -l output.txt to get the total count of entries.
wc -l clean.txt to see how many are left. Subtract the two to see how many entries collided.
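The same bookkeeping can be sketched in a few lines of Python (the list stands in for the hash keys read from output.txt; file handling is omitted):

```python
# Count how many entries repeat a key already seen earlier, which
# mirrors subtracting the uniq'd line count from the total.
def count_duplicate_keys(keys):
    seen = set()
    duplicates = 0
    for k in keys:
        if k in seen:               # key already recorded once before
            duplicates += 1
        else:
            seen.add(k)
    return duplicates

print(count_duplicate_keys([1, 2, 2, 3, 3, 3]))   # 3 of the 6 entries repeat
```

Note that a raw duplicate count also treats repeats of the same position (transpositions) as "collisions", which can inflate the figure well beyond the true collision rate.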

My first couple of attempts had 8-10% collisions, even with 64-bit keys.
syzygy
Posts: 5563
Joined: Tue Feb 28, 2012 11:56 pm

Re: how to measure frequency of hash collisions.

Post by syzygy »

jshriver wrote:My first couple attempts had 8-10% collisions, even with 64bit keys.
For how many positions in total?
User avatar
jshriver
Posts: 1342
Joined: Wed Mar 08, 2006 9:41 pm
Location: Morgantown, WV, USA

Re: how to measure frequency of hash collisions.

Post by jshriver »

syzygy wrote:
jshriver wrote:My first couple attempts had 8-10% collisions, even with 64bit keys.
For how many positions in total?
I forget, it's been a while, but it was in the trillions.

A lot of my interest and work goes into retaining chess knowledge, and I work with over 100 GB of PGN data, plus maybe another 4-5 GB of home-crafted data. So this has been an interesting obstacle for me.

I need the data to be small for storage and lookups, but when working with large volumes, traditional hash algorithms break down.

When I tried with smaller amounts, say less than a gig, I had no problems. That is fine for opening books and such, or transposition tables, which are even smaller.