Zenmastur wrote:
mar wrote:
Hamming distance is not a good measure of hash key quality (this has been shown to me by hgm some time ago).
Actually you can craft Zobrist hashes to maximize Hamming distance, but such keys will perform very badly.
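For reference, here is a quick sketch (mine, not from the thread) of what Hamming distance between keys actually looks like: the mean pairwise distance of plain random 64-bit keys is already about 32 bits, so "maximize the distance" is a different objective from "make the keys behave independently".

```python
import random

def hamming(a, b):
    """Number of bit positions in which two 64-bit keys differ."""
    return bin(a ^ b).count("1")

random.seed(1)  # deterministic for the demo
keys = [random.getrandbits(64) for _ in range(784)]  # 784 keys, as discussed below

pairs = [(i, j) for i in range(len(keys)) for j in range(i + 1, len(keys))]
mean_dist = sum(hamming(keys[i], keys[j]) for i, j in pairs) / len(pairs)
print(round(mean_dist))  # random keys already average about 32 bits apart
```

The 784-key count matches the piece/square table size used later in the thread; the seed is only there to make the demo repeatable.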
About a year ago I went through the posts looking for hash collision info. I thought I did a reasonably good search, but I don't remember seeing HGM giving an exposé on the subject. Can you point out the discussion?
bob wrote:I am not sure what there is to decide. You can't represent all chess positions with just 64 bits. So far the best estimate has been around 160 bits or so. Mapping 2^160 positions onto 2^64 signatures clearly means there will be literally gazillions of positions with a Hamming distance of zero between their keys. Something like 2^96 such positions per signature, roughly...
Collisions (false matches with 64-bit signatures) absolutely happen. If an illegal move will crash your engine, you'd better check 'em or suffer the occasional crash; like death and taxes, there is no escaping them.
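The pigeonhole arithmetic above checks out directly; the birthday-style estimate below is my addition (a standard approximation, not from the post) for the expected number of colliding pairs among n random 64-bit signatures.

```python
# ~160 bits of position information squeezed into 64-bit signatures:
positions_per_signature = 2 ** (160 - 64)
print(positions_per_signature == 2 ** 96)  # True: ~2^96 positions per signature

def expected_colliding_pairs(n, bits=64):
    """Birthday approximation: E[pairs] ~= n(n-1) / 2^(bits+1)."""
    return n * (n - 1) / 2 ** (bits + 1)

# Even a billion distinct positions expect only a few hundredths of a colliding pair:
print(expected_colliding_pairs(10**9))
```

This is why collisions were a non-issue at 60K nodes per second but show up routinely at the multi-billion-node searches quoted later in the thread.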
As I recall, a few years back you initiated a rather long discussion on this topic. You were trying to produce keys with the maximum possible Hamming distance until someone (offline) demonstrated that this would produce horrible collision problems, i.e. XOR'ing any two keys would produce a key already in the set. Unfortunately that conversation degenerated and ended before determining what the optimal Hamming distance was. I remember thinking to myself that the optimal distance was (hash_key_size_in_bits - log2(number_of_Zobrist_keys))/4, e.g. for 784 Zobrist keys and a 64-bit hash it would be (64 - log2(784))/4 = 13.6. If this is true, then it would also seem that it's impossible to create a set of 64-bit Zobrist keys that can guarantee that no collisions will occur when two positions are separated by 4 or more plies. I.e. 13.6 / 2^4 < 1.
It does seem possible, at least in theory, to create a set of 64-bit Zobrist keys which could guarantee that no collisions occur between two positions separated by 3 plies or fewer. I.e. 13.6 / 2^3 > 1. Although I have no clue how to go about constructing such a set.
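The arithmetic behind the conjectured optimum is easy to verify; this is just a restatement of the formula and the 3-ply/4-ply boundary argued above, nothing more.

```python
import math

key_bits = 64
n_keys = 784  # piece/square Zobrist keys, as in the example above

# Zenmastur's conjectured optimal minimum Hamming distance:
optimal_distance = (key_bits - math.log2(n_keys)) / 4
print(round(optimal_distance, 1))  # 13.6

# ...hence the 3-ply vs 4-ply boundary:
print(optimal_distance / 2**3 > 1)  # True: 3 plies look protectable
print(optimal_distance / 2**4 < 1)  # True: 4 plies do not
```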
Do these sound like reasonable conclusions?
Yes. And I think it was HGM that led me to conclude the idea was bad. Burton Wendroff and Tony Warnock wrote a paper in the ICGA in the early 80's that addressed just this issue. And they discussed the problem in terms of "plies". And they had a scheme that would let you protect N consecutive plies from collisions, but the down-side was that after that, the probability of collisions went up sharply, perhaps exponentially.
On a somewhat related subject...
You did a collision rate test a few years back in which you showed that very high collision rates (on the order of 10^-6) had a relatively minor if any effect on engine move selection. I tried a few months ago to calculate what the collision rate would be for a given TT size and Zobrist key size, but I couldn't find any way to do it. A guess would be 1 / 2^(key_size_in_bits - log2(#_of_table_entries)). This would work if all table entries were IID. They're not, of course, and you can't have more collisions than table hits. Since the number of table hits in the middle game runs between 15% and 20%, this would seem to put an upper limit on possible collision rates during this part of the game. So it's likely the only way to find out is to run some tests. The piece of information that would be nice to know is: with a 64-bit key, what number of table entries will cause a measurable decline in engine Elo due to excess collisions?
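Zenmastur's guessed formula is easy to state in code. The IID caveat from the post applies; this encodes the guess itself, not a validated collision rate.

```python
import math

def guessed_collision_rate(key_bits, table_entries):
    """1 / 2^(key_bits - log2(entries)): false-match chance per probe,
    assuming table entries were IID (they are not, as noted above)."""
    return 1.0 / 2 ** (key_bits - math.log2(table_entries))

# e.g. a 2^26-entry table with 64-bit keys:
print(guessed_collision_rate(64, 2**26))  # 2^-38 per probe
```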
You CAN have more collisions than hits, because EVERY node in the tree results in a TT probe, and a collision would produce a false match. It's a two-step process: compute the address from the signature (a huge number of different signatures map to the same table address), then compare the signature to the one stored in the table. True collisions could drive the hit rate up, or not, since I (and probably others) only count a position as a hit if the signature matches AND the draft is useful. There are far more "hits" where the draft is too small. It is hard to figure out how to even measure hits in such an environment. If you end up using it, it would be a hit in my terminology, which means collisions would certainly fall under "hits", since only the collisions that mean something would have the ability to change anything.
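The two-step probe described above can be sketched as follows. This is a minimal illustration; the names and entry layout are mine, not Crafty's.

```python
# Step 1: the low bits of the signature index the table.
# Step 2: the full signature stored in the entry is compared to detect false matches.

TABLE_BITS = 20
TABLE_SIZE = 1 << TABLE_BITS

class Entry:
    __slots__ = ("signature", "draft", "score", "move")
    def __init__(self):
        self.signature = 0
        self.draft = -1
        self.score = 0
        self.move = None

table = [Entry() for _ in range(TABLE_SIZE)]

def store(signature, draft, score, move):
    entry = table[signature & (TABLE_SIZE - 1)]
    entry.signature, entry.draft, entry.score, entry.move = signature, draft, score, move

def probe(signature, needed_draft):
    entry = table[signature & (TABLE_SIZE - 1)]  # step 1: address from signature
    if entry.signature != signature:             # step 2: verify full signature
        return None                              # miss (or a collision screened out)
    if entry.draft < needed_draft:
        return None  # a "hit" with useless draft, in Bob's terminology
    return entry
```

Note that a collision surviving step 2 (two positions sharing the full 64-bit signature) is exactly the false match the thread is worried about; this check only filters positions that differ in the stored bits.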
But the big problem today is speed. On the 20 core box I use a lot, 100M nodes per second is typical. That produces a billion nodes every 10 seconds, which allows PLENTY of probes to encounter a collision. 20 years ago, a 64 bit signature might never produce a collision. Today it is not uncommon. I ran a test yesterday on a single position (fine 70) and after 5-6 minutes Crafty reported a single "bad move from hash table" error which signifies a collision. But that is a lower bound on that position because there are almost certainly more collisions where the hash move just happened to be legal.
64 bits is a lot of "protection" until you begin to see searches where the total nodes can't be counted in a 32-bit counter. I see 15-25 billion node searches on ICC all the time...
time=3:23(99%) nodes=18300692792(18.3B) fh1=91% pred=11 nps=90.0M
time=3:06(99%) nodes=16568754647(16.6B) fh1=91% pred=11 nps=89.0M
time=4:57(99%) nodes=27050978878(27.1B) fh1=91% pred=14 nps=90.8M
time=3:46(99%) nodes=19049838535(19.0B) fh1=89% pred=11 nps=84.3M
Those were from the last ACCA tournament. Those kinds of numbers tend to drive up collisions dramatically compared to 20 years ago when 60K nodes per second was very fast.
If it takes a collision rate > ~10^-6 to cause a noticeable change in engine Elo, and 20 excess bits in the key versus the number of table entries produce an error rate of about 10^-6, then it would appear that table sizes as large as about 17.6 trillion entries (2^44 entries) would be possible.
I know what a lot of you are thinking: when the number of table entries exceeds the square root of the number of possible Zobrist keys, collisions become almost guaranteed. While this may be true, it doesn't tell you how many collisions you will have, just that the probability of having at least one becomes very large. In a table of 2^44 entries you would have to have more than 2^24 probes returning invalid data due to a false positive per 2^44 probes to reach an error rate of ~10^-6.
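The square-root (birthday) intuition and the error-rate distinction drawn above can be put side by side. These are standard approximations; the specific numbers are my worked examples.

```python
import math

def p_any_collision(n, bits=64):
    """Birthday approximation: P(at least one) ~= 1 - exp(-n(n-1)/2^(bits+1))."""
    return 1.0 - math.exp(-n * (n - 1) / 2 ** (bits + 1))

# Around sqrt(2^64) = 2^32 keys the probability of SOME collision climbs fast:
print(p_any_collision(2**32))  # still under one half
print(p_any_collision(2**33))  # well over one half

# ...but the error *rate* is a separate question:
false_positive_rate = 2**24 / 2**44  # >2^24 bad probes per 2^44 probes
print(false_positive_rate)           # ~9.5e-7, i.e. the ~1e-6 threshold
```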
In any case, I don't think this can be tested on a full 64-bit key because of the amount of TT memory needed and the amount of time required to generate the required number of table probes. I.e. a machine probing at 100,000,000 nodes per second would require over 2.5 months to completely fill the table.
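The fill-time figure can be sanity-checked with my own arithmetic below: "completely fill" with random probes is a coupon-collector problem, which is why it takes far longer than simply entries divided by probes per second.

```python
import math

entries = 2**44      # ~1.76e13 slots
nps = 100_000_000    # probes per second, as in the post

# Lower bound if every probe hit a fresh slot:
days_sequential = entries / nps / 86_400
print(round(days_sequential, 1))  # about 2 days

# Random probes revisit slots; touching every slot at least once takes ~N ln N probes:
months_random = entries * math.log(entries) / nps / (30 * 86_400)
print(round(months_random, 1))  # about 2 months, the same order as the figure above
```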
Thoughts?
Regards,
Zen
It is an issue. But from the paper Cozzie and I wrote, I don't worry about collisions causing problems with game play. In my case, though, I have to be ultra-careful about illegal moves that can corrupt the board, since Crafty doesn't do copy/make. If I make a castling move when it is not legal (king or rook has moved), I am doomed: there is now at least one extra piece on the board, I might have lost an opponent's piece, and there is no way to recover, leaving the board corrupted, which will eventually lose the game. Or I might create two kings, after which that side can never be checkmated.
So I screen moves to the best of my ability (within computational time constraints) and don't give a second thought to the occasional incorrect evaluation.
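A cheap screen of the kind described above might look like this. This is a hedged sketch, not Crafty's code; the board representation and all names are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Piece:
    color: str  # "w" or "b"
    kind: str   # "K", "Q", "R", "B", "N", "P"

@dataclass
class Move:
    from_sq: int
    to_sq: int
    is_castle: bool = False

def hash_move_is_plausible(board, move, side, castling_ok):
    """Cheap screen of a hash-table move before making it, so a
    collision can never corrupt the board. board: dict square -> Piece.
    Not a full legality check, just the corruption-preventing subset."""
    piece = board.get(move.from_sq)
    if piece is None or piece.color != side:
        return False  # no friendly piece on the from-square: stale or colliding entry
    target = board.get(move.to_sq)
    if target is not None and target.color == side:
        return False  # would "capture" our own piece
    if move.is_castle and (piece.kind != "K" or not castling_ok):
        return False  # Bob's example: castling after the king or rook has moved
    return True

# Demo: a castle pulled from a colliding entry gets screened out.
board = {4: Piece("w", "K"), 0: Piece("w", "R")}
bad_castle = Move(from_sq=4, to_sq=6, is_castle=True)
print(hash_move_is_plausible(board, bad_castle, "w", castling_ok=False))  # False
```

A real engine would do this with bitboard tests at near-zero cost, which is the "within computational time constraints" trade-off mentioned above.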