how to measure frequency of hash collisions.

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: how to measure frequency of hash collisions.

Post by Don »

Don wrote:
Daniel Shawul wrote:
Don wrote: You can estimate the collision rate by reserving N bits for checking. So if your key is 64 bits, pretend it's only 60 bits and 4 bits are for collision testing. If the 4 bits do not match it was a collision. You can extrapolate to get the 64 bit collision rate estimate - each time you add a bit you can expect half the number of collisions.
Don
But that won't work because in a hash collision, the hash signature (all 64 bits) is the same for two completely different positions... You need a key from another sequence of random numbers (be it from the same or a different hash function). Am I missing something?
Yes, what you are missing is that you are not testing the 64 bit key. You are testing a 60 bit key. In my example just pretend that you are generating a 60 bit key and then a totally independent 4 bit key. Let's say for example that you get a collision on the 60 bit key once out of 1 million matches (as judged by the 4 bit verify key). You can expect that had this been a 64 bit key instead of 60 bits you would get only 1/16 as many collisions, since 2^4 is 16. Each extra bit cuts the number of expected collisions in half.

This is superior to any of the other methods proposed because it does not require modification of the program in order to add more key space; you just borrow a few bits from the already existing key. So this could be added to the program with no additional overhead. In fact the program could still continue to use the full 64 bit key while doing the 60 bit collision detection test for statistical purposes, and it could even be put in the log file of the program. If you use 4 bits for collision detection then you count how many of these 60 bit keys collided and divide by 16 to get an ESTIMATE of how often a 64 bit key would have collided.
P.S. I would actually favor more than 4 bits - assuming that 60 bits does not generate very many collisions. That is because if you have to run the program for 10 minutes just to get 2 or 3 collisions you have too much noise to get a very accurate estimate. So I would experiment with the number of bits to borrow from the 64 bit key to make this produce perhaps a few dozen collisions in a minute or two. It could be a parameter. I actually don't have any clue about how many collisions a program like Komodo will produce per hour of run-time.
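For illustration, a minimal C sketch of how such a counter could be wired into a probe routine; the entry layout, names and counters here are hypothetical, not taken from Komodo or any other engine:

#include <stdint.h>

/* Hypothetical layout: the hash entry stores the full 64 bit Zobrist key. */
typedef struct {
    uint64_t key;
    /* ... move, score, depth, etc. ... */
} TTEntry;

#define CHECK_BITS 4                          /* bits borrowed for verification  */
#define MAIN_MASK  (~0ULL >> CHECK_BITS)      /* low 60 bits play the "real" key */

static uint64_t matches60    = 0;  /* probes where the 60 bit key matched        */
static uint64_t collisions60 = 0;  /* ... but the 4 borrowed verify bits did not */

/* Call this on every probe, in addition to the normal full-key comparison
   the engine already does; it is bookkeeping only. */
static void count_collisions(const TTEntry *e, uint64_t poskey)
{
    if ((e->key & MAIN_MASK) == (poskey & MAIN_MASK)) {
        matches60++;
        if ((e->key >> (64 - CHECK_BITS)) != (poskey >> (64 - CHECK_BITS)))
            collisions60++;                   /* certain 60 bit collision */
    }
}

/* Each verify bit halves the expected collisions, so the 64 bit estimate
   is simply the observed 60 bit count divided by 2^CHECK_BITS. */
static double estimated_collisions_64bit(void)
{
    return (double)collisions60 / (double)(1ULL << CHECK_BITS);
}

The engine keeps using the full 64 bit key for its real decisions; the counter only records how often the low 60 bits matched while the borrowed bits disagreed, and divides by 16 to extrapolate.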
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: how to measure frequency of hash collisions.

Post by Daniel Shawul »

hgm wrote:
Daniel Shawul wrote:Is there a way to directly measure the number of hash collisions (not its effect).
I was just asking myself the same question. I wrote a program to generate random positions (assign random squares and piece types to a number of pieces, taking care that no square is used twice), and compared the number of collisions I got with truly random Zobrist keys, and keys derived with your pieceKey+squareKey idea. I checked with a second key, but that wasn't really needed, because all primary key matches were indeed collisions, as the chances to randomly generate the same position twice are about zero. I could not detect a significant difference in collision frequency between the random key and the composite key.
Exactly - the chance that it is not a collision is almost zero. BTW, addition is sounding less and less like a brilliant idea, as OR and even AND also perform well. If the two keys (piece & square) are generated with a good random number generator, it seems that anything goes. The keys are long enough (64 bits) to hide the deficiency of the operator.
I don't think this is a decisive test, though. What is important is how many collisions you get when you have a set of very similar positions, as occur in a search tree.
Indeed what I do now is pure brute force. It has no insight to try two similar positions, so a search tree would be more appropriate. For example the OR and AND that showed no collisions in the brute force test would probably show some collisions when tested within a tree. They change the frequency of bits to 75% and 25% respectively, but still no collisions. I always doubted the excessive attention paid to the PRNG used to reduce hash collisions. For example you share a lot of bits for keys of different squares, as you mentioned previously, but that didn't introduce many collisions. A simple rotate by 1 changes everything significantly, and Zobrist XOR compares only adjacent bits.
A method I used in the past to test the quality of a set of Zobrist keys is to count the number of 'dependency loops': subsets of N different keys that XOR to zero. For a given key length and a much larger total number of keys there is a certain N beyond which such cycles cannot be avoided. The quality of the set of keys is determined by whether you already have cycles of lower N (i.e. an unnecessary dependency between a small number of keys).

E.g. with 768 keys you have 768*767*766/6 = 75M different 'products' (XORs, actually) of 3 keys, which conceivably could all be different if the key length is 27 bit (75M < 2^27). This means that you can have key sets of 27 bit where no product of 6 keys can be zero. You have 14G different products of 4 keys. So with 32-bit keys there must be duplicates, meaning there must be products of 8 keys that are zero. Keys of realistic length (such as 64 bit) are hard to test, because collisions are extremely rare. So it is best to test model systems of short key length, where collisions are frequent, and you can efficiently detect them by storing the complete key space in a table (so you can easily detect how often a product of N keys is repeated). Cycles of odd length (say 7) can be detected by putting all products of 3 keys in a hash table, and then checking all products of 4 keys against that table, to see if they occur there.

Note that this is not yet the full story: in practice a very bad dependency (like 3 keys XORing to 0) could be completely hidden by putting the bad keys in the table for a piece type that only occurs once (like the King).
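As an illustration of the model-system idea, here is a rough C sketch under the assumption of 16 bit keys (so all XOR products fit in a small counting table) and 768 keys; the absolute count it prints matters less than comparing it between different key-generation schemes:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Model system: NKEYS keys of only KEYBITS bits, so dependency cycles are
   frequent and the whole product space fits in a small counting table. */
#define NKEYS   768
#define KEYBITS 16
#define SPACE   (1u << KEYBITS)

static uint16_t key[NKEYS];
static uint32_t seen3[SPACE];   /* how often each value occurs as a XOR of 3 keys */

int main(void)
{
    /* Replace this with whatever key-generation scheme is being judged. */
    for (int i = 0; i < NKEYS; i++)
        key[i] = (uint16_t)(rand() & (SPACE - 1));

    /* Tabulate all XOR products of 3 distinct keys. Two triples with the same
       product form a dependency cycle of length 6 (or shorter, if the triples
       share keys), so the pair count below measures key quality. */
    long long dependent = 0;
    for (int a = 0; a < NKEYS; a++)
        for (int b = a + 1; b < NKEYS; b++)
            for (int c = b + 1; c < NKEYS; c++) {
                uint16_t v = key[a] ^ key[b] ^ key[c];
                dependent += seen3[v];   /* pairs with every earlier equal product */
                seen3[v]++;
            }
    printf("dependent pairs of 3-key products: %lld\n", dependent);
    return 0;
}

Checking all products of 4 keys against the same table in a second pass would expose the odd-length cycles mentioned above.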
I will have to look to this carefully before I comment.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: how to measure frequency of hash collisions.

Post by Daniel Shawul »

I still don't see how testing only the lower 60 bits changes anything. The problem is you generated all 64 bits of the hash key using the _same_ hash function. You need to use another hash function to generate those 4 bit keys, otherwise those 4 bits are *as likely to match* as the 60 bits.

E.g. take the wrong hash function that uses XOR of [piece] and [square] keys, as is done in Zobrist hashing. Now positions like Pe4^Bd4 == Pd4^Be4 are collisions. All the bits match, so how do you tell the keys apart when everything matches? You have to use a different hash function (probably closer to something that gives 0 collisions) to get a secondary hash key. The key could be very short, but still it should not be generated using the same hash function.
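A tiny C illustration of that degenerate [piece]^[square] scheme; the key values below are made up on the spot, and the point is only that the piece and square contributions commute under XOR, so no choice of random numbers can help:

#include <stdint.h>
#include <stdio.h>

/* Degenerate scheme: hash(piece on square) = pieceKey[piece] ^ squareKey[square],
   instead of one key per (piece, square) pair as in proper Zobrist hashing. */
uint64_t pieceKey[16], squareKey[64];

enum { PAWN = 1, BISHOP = 3 };
enum { D4 = 27, E4 = 28 };

int main(void)
{
    /* Any values at all exhibit the problem. */
    pieceKey[PAWN] = 0x123456789abcdef0ULL;  pieceKey[BISHOP] = 0x0fedcba987654321ULL;
    squareKey[D4]  = 0x1111222233334444ULL;  squareKey[E4]    = 0x5555666677778888ULL;

    uint64_t posA = (pieceKey[PAWN] ^ squareKey[E4]) ^ (pieceKey[BISHOP] ^ squareKey[D4]);
    uint64_t posB = (pieceKey[PAWN] ^ squareKey[D4]) ^ (pieceKey[BISHOP] ^ squareKey[E4]);

    /* Pe4+Bd4 and Pd4+Be4 are different positions but always hash identically. */
    printf("%d\n", posA == posB);   /* prints 1 */
    return 0;
}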

BTW if you got this idea from the method used to resolve hash collisions for parallel search, then it is a little bit different from the one we are talking about here...

Edit: I edited my post to stress that a different hash function should be used. A different sequence alone will not suffice in all cases.
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: how to measure frequency of hash collisions.

Post by Don »

Daniel Shawul wrote:I still don't see how testing only the lower 60 bits changes anything. The problem is you generated all 64 bits of the hash key using the _same_ hash function. You need to use another hash function to generate those 4 bit keys, otherwise those 4 bits are *as likely to match* as the 60 bits.
No, you miss the point. With zobrist hashing using xor you really have 64 1 bit keys, not 1 64 bit key. The random number tables are not 64 bit random numbers, they each are arrays of 64 random bits. There is no interaction across bits. So the 60 bit key and the 4 bit key can be thought of as totally separate.

Now if there is a serious problem with the quality of the random bits, then you cannot fix it by generating separate keys. That does not address the problem. The issue is not the independence of the keys, it's the quality of the verification bits.

So if you have an issue with this, then generate the 60 bits, then generate the 4 bits separately and combine them into a single 64 bit random number for each entry. It's exactly what you think will "fix" the problem. Semantically there is no difference between combining them or storing them in a separate table.
But the 4 bits have to be high quality (generating them independently doesn't "fix" anything.) It doesn't matter if the 60 bits are crap because the test will still work if the 4 bits are good and it will just help you see that the crappy 60 bit keys are generating a lot of collisions.

Honestly, when I saw the problem posed I was suggesting a good way to accurately estimate the hash collision rate. None of the suggestions posted on this thread can judge the quality of the Zobrist hashing, because they all require separate verification bits, and the method falls apart unless those verification bits are properly generated. If you can do that, then you might as well generate all the random bits that way.

E.g. take the wrong hash function that uses XOR of [piece] and [square] keys, as is done in Zobrist hashing. Now positions like Pe4^Bd4 == Pd4^Be4 are collisions. All the bits match, so how do you tell the keys apart when everything matches? You have to use a different hash function (probably closer to something that gives 0 collisions) to get a secondary hash key. The key could be very short, but still it should not be generated using the same hash function.

BTW if you got this idea from the method used to resolve hash collisions for parallel search, then it is a little bit different from the one we are talking about here...

Edit: I edited my post to stress that a different hash function should be used. A different sequence alone will not suffice in all cases.
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
kbhearn
Posts: 411
Joined: Thu Dec 30, 2010 4:48 am

Re: how to measure frequency of hash collisions.

Post by kbhearn »

I think what he's getting at is that by generating the extra bits in a different way you make it unlikely that both types of bits will have the same type of weakness (i.e. they both may generate more collisions than you want, but it's unlikely they'll collide under the same sort of changes).

Also you may not want to generate all your bits using the second generator because your second generator might be slow - i.e. you could feed your entire set of bitboards or your mailbox into something like murmur for a verification that doesn't depend on whatever you're using to generate zobrist keys.
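As a sketch of that approach, here is a verification hash over a hypothetical mailbox board, using FNV-1a simply as a stand-in for MurmurHash; none of the names come from a real engine:

#include <stdint.h>
#include <stddef.h>

/* Hypothetical mailbox board: one byte per square plus a little extra state. */
typedef struct {
    uint8_t square[64];        /* 0 = empty, otherwise a piece code */
    uint8_t side_to_move;
    uint8_t castling;
    uint8_t ep_square;
} Board;

/* FNV-1a over the raw board bytes: it shares nothing with the Zobrist tables,
   so the chance it collides on the same pair of positions is negligible. */
static uint64_t verify_hash(const Board *b)
{
    const uint8_t *p = (const uint8_t *)b;
    uint64_t h = 14695981039346656037ULL;      /* FNV-1a offset basis */
    for (size_t i = 0; i < sizeof *b; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;                 /* FNV-1a prime */
    }
    return h;
}

Storing a few bits of verify_hash() next to the Zobrist signature then flags primary-key collisions without depending on the Zobrist PRNG at all.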
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: how to measure frequency of hash collisions.

Post by Daniel Shawul »

No, you miss the point. With zobrist hashing using xor you really have 64 1 bit keys, not 1 64 bit key. The random number tables are not 64 bit random numbers, they each are arrays of 64 random bits. There is no interaction across bits. So the 60 bit key and the 4 bit key can be thought of as totally separate.
I didn't miss a point. Like I said, the 4 bit keys have to be generated from a different hash function. Now, after I explained it to you, you start changing your words. You were saying that taking the lower 60 bits of the 64 bit key would suffice...
So if you have an issue with this, then generate the 60 bits, then generate the 4 bits separately and combine them into a single 64 bit random number for each entry. It's exactly what you think will "fix" the problem. Semantically there is no difference between combining them or storing them in a separate table.
It is not a matter of "me having an issue"; it is a _must_ that you generate them using different hash functions. Storing them in a separate table is the same thing, so why you want to make that an issue is beyond me. I suggest you read your first post again and check whether you said anything about generating them with a different hash function. Look at Ed's post, point number (3), to see how you should have said it; it is clear I agree with him.
Honestly, when I saw the problem posed I was suggesting a good way to accurately estimate the hash collision rate. None of the suggestions posted on this thread can judge the quality of the Zobrist hashing, because they all require separate verification bits, and the method falls apart unless those verification bits are properly generated. If you can do that, then you might as well generate all the random bits that way.
Man, the way you change your words borders on being a liar... Read your first post again. You made no mention of the 60 bit and 4 bit keys needing to be generated separately!!
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: how to measure frequency of hash collisions.

Post by Don »

kbhearn wrote:I think what he's getting at is that by generating the extra bits in a different way you make it unlikely that both types of bits will have the same type of weakness (i.e. they both may generate more collisions than you want, but it's unlikely they'll collide under the same sort of changes).
The independent hashes do not correct each other - this is not a case of combining two RNGs to get a single improved one. If the 4 bit generator sucks then having a good 60 bit generator does not make the frequency test "more accurate" because it's only the 4 bit quality that has any impact on how good the measurement is.

Whether the 60 bit numbers are good or bad has no impact on the quality of the test. The 4 bit numbers do.

If it makes you feel secure, then just use my idea but generate the 4 bits differently and combine them with the 60 bits from the other generator into a single 64 bit number.

Also you may not want to generate all your bits using the second generator because your second generator might be slow - i.e. you could feed your entire set of bitboards or your mailbox into something like murmur for a verification that doesn't depend on whatever you're using to generate zobrist keys.
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: how to measure frequency of hash collisions.

Post by Don »

Daniel Shawul wrote:
No, you miss the point. With zobrist hashing using xor you really have 64 1 bit keys, not 1 64 bit key. The random number tables are not 64 bit random numbers, they each are arrays of 64 random bits. There is no interaction across bits. So the 60 bit key and the 4 bit key can be thought of as totally separate.
I didn't miss a point. Like I said, the 4 bit keys have to be generated from a different hash function. Now, after I explained it to you, you start changing your words. You were saying that taking the lower 60 bits of the 64 bit key would suffice...
So if you have an issue with this, then generate the 60 bits, then generate the 4 bits separately and combine them into a single 64 bit random number for each entry. It's exactly what you think will "fix" the problem. Semantically there is no difference between combining them or storing them in a separate table.
It is not a matter of "me having an issue"; it is a _must_ that you generate them using different hash functions. Storing them in a separate table is the same thing, so why you want to make that an issue is beyond me. I suggest you read your first post again and check whether you said anything about generating them with a different hash function. Look at Ed's post, point number (3), to see how you should have said it; it is clear I agree with him.
Honestly, when I saw the problem posed I was suggesting a good way to accurately estimate the hash collision rate. None of the suggestions posted on this thread can judge the quality of the Zobrist hashing, because they all require separate verification bits, and the method falls apart unless those verification bits are properly generated. If you can do that, then you might as well generate all the random bits that way.
Man, the way you change your words borders on being a liar... Read your first post again. You made no mention of the 60 bit and 4 bit keys needing to be generated separately!!
Can someone please arbitrate this for us? Not a moderator, but just someone who understands what I am saying and what he is saying, and can figure out who is misunderstanding what. If I am wrong I will admit it. If I lied, I didn't know it, but if I can see it I will apologize.
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: how to measure frequency of hash collisions.

Post by Daniel Shawul »

I think what he's getting at is that by generating the extra bits in a different way you make it unlikely that both types of bits will have the same type of weakness (i.e. they both may generate more collisions than you want, but it's unlikely they'll collide under the same sort of changes).

Also you may not want to generate all your bits using the second generator because your second generator might be slow - i.e. you could feed your entire set of bitboards or your mailbox into something like murmur for a verification that doesn't depend on whatever you're using to generate zobrist keys.
Thank you. Hopefully it helps when another person says it! In fact I was planning to store the FEN (or use a string hash of it), since that has absolutely nothing whatsoever to do with the Zobrist hashing scheme. This is the key, since two miracles don't happen: a collision with 64 bit keys is very rare, and the chance that another hash function also collides at the same position is almost zero. Otherwise the example I gave before still produces the same hash key even if a different set of keys is used. I am inclined to believe that using rand() from Microsoft and Linux (which use different constants) as primary and secondary keys would generate the same hash key for the wrong hashing example I gave (the one with XOR of [piece] and [square]), since both use the same form of LCG (algorithm-wise).
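A minimal sketch of that FEN idea, assuming a hypothetical board_to_fen() helper and using djb2 purely as an example of a string hash unrelated to the Zobrist tables:

#include <stdint.h>

/* Hypothetical helper: returns the FEN string of the current position. */
extern const char *board_to_fen(void);

/* djb2 over the FEN text - any string hash that shares nothing with the
   Zobrist scheme works equally well as a secondary verification key. */
static uint64_t fen_verify_key(void)
{
    uint64_t h = 5381;
    for (const char *s = board_to_fen(); *s; s++)
        h = h * 33 + (uint8_t)*s;
    return h;
}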
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: how to measure frequency of hash collisions.

Post by Don »

Daniel Shawul wrote:
Don wrote: You can estimate the collision rate by reserving N bits for checking. So if your key is 64 bits, pretend it's only 60 bits and 4 bits are for collision testing. If the 4 bits do not match it was a collision. You can extrapolate to get the 64 bit collision rate estimate - each time you add a bit you can expect half the number of collisions.
Don
But that won't work because in a hash collision, the hash signature (all 64 bits) is the same for two completely different positions... You need a key from another sequence of random numbers (be it from the same or a different hash function). Am I missing something?
Daniel,

I'm going to do a reset here and answer this post again. I think I understand what caused the misunderstanding. I actually am pretty sure we both misunderstood each other.

My idea is probably a lot simpler than you assumed it to be. Essentially you have 64 hash signatures and 64 random number tables. The signatures are all 1 bit and the random number tables are each 1 bit tables. Due to the magic of 64 bit processors they can all be computed in parallel but that is just a detail.

You can check each of these 64 signatures against each other. If the least significant bit is the same, it might be a match or it may be a collision. You can use the second bit to test that it's a collision with a zero percent false positive but 50% false negative. In other words if the second bit does not match then the first bit is a collision for sure. The more bits you use to test, the more accurate your assessment of whether it's a collision.

You could take 60 of those hash signatures and if they all agree with the 60 signatures of the position being tested against, they probably represent the same position. But you can use 4 more of these independent signatures to see if they agree too. If they don't, you have detected a collision with 100% certainty. By using 4 of these signatures you will detect actual collisions 15 out of 16 times. But once in a while one will slip through undetected. If you want better, you must use more bits.
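A small C demonstration of this "signatures" view: bit k of the combined key depends only on bit k of the table entries, so the 64 bits really are 64 independent 1 bit signatures computed in parallel (the table size and the "position" below are arbitrary illustrations):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Illustration only: a made-up Zobrist table and a made-up "position"
   consisting of a handful of (piece, square) indices into it. */
#define TABLE 768
static uint64_t zobrist[TABLE];

int main(void)
{
    for (int i = 0; i < TABLE; i++)
        zobrist[i] = ((uint64_t)rand() << 32) ^ (uint64_t)rand();

    int position[] = { 5, 130, 261, 399, 612 };   /* arbitrary occupied entries */
    int n = sizeof(position) / sizeof(position[0]);

    /* Full 64 bit key. */
    uint64_t key = 0;
    for (int i = 0; i < n; i++)
        key ^= zobrist[position[i]];

    /* Bit k of the key, computed using only bit k of each table entry:
       there is no interaction across bit positions. */
    for (int k = 0; k < 64; k++) {
        uint64_t bit = 0;
        for (int i = 0; i < n; i++)
            bit ^= (zobrist[position[i]] >> k) & 1;
        if (bit != ((key >> k) & 1))
            printf("bit %d disagrees (never happens)\n", k);
    }
    printf("all 64 one-bit signatures agree with the 64 bit key\n");
    return 0;
}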

You did a total context switch which confused me, because I was talking about this and you started talking about using separate hash functions. I may have been referring to my 64 hash signatures as 64 different hash functions, but the function is really the same XOR function operating on 64 different pieces of data, so this was probably confusing to you. I was careful this time to call them "signatures".

Does it make sense to you now?
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.