Zobrist key random numbers

Discussion of chess software programming and technical issues.

Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Zobrist key random numbers

Post by Michael Sherwin »

bob wrote:Has anyone done any sort of study (excluding the Wendroff/Warnock paper in the JICCA) on the random numbers you use for hashing?

I was looking for an assignment for a parallel programming course, and decided to use the random number generation of the Zobrist values as a relatively simple project. Crafty needs 788 numbers. 768 are used by the piece tables: 6 piece types for each side, 64 squares per piece, which is 12 * 64 or 768. I also have 16 EP target square random numbers that are used to alter the hash when an EP capture on a specific square is possible. Now up to 784. I then have 4 more for castling rights, since each side can castle to either side.

I first tested the basic numbers I get from the PRNG I use in Crafty, and discovered that the min hamming distance between any two numbers in that set was 14 bits which seemed low. I tested the numbers in two of the other programs I use for testing and found similar results. I then wrote a pretty brain-dead (but parallel) algorithm to start generating random numbers and replacing one of the original 788 if the new number increased the min hamming distance. I've had this running for about 2 hours and have raised the min hamming distance from 14 to 23. But at 23, the changes are becoming _very_ infrequent, and while each update improves the overall hamming distance average, pulling 23 up to 24 has not happened yet. I'm going to let this grind overnight on an 8-cpu machine to see if 24 is reachable.

My questions are:

(1) does anyone know how to compute the largest minimum Hamming distance that can be achieved in a set of 788 64-bit numbers? In my case, can this be raised to 24 or beyond? (I have the test set to stop when the PRNG starts to "cycle".)

(2) has anyone tested their numbers and tried alternative generators to see if this can be improved on? I am currently using the existing Crafty PRNG, but am going to replace it with a 64 bit version from the internet when the current run goes long enough to convince me it is not going to improve things any further.

Once I get this done, I am going to release Crafty version 23.0, which will include these hard-coded numbers. The version number will change because this will break old opening books that used different RNs to produce the hash signatures stored in those books. Anyone can copy them if they want. But I am curious just how "good" these numbers can be...
In RomiChess the Zobrist keys are built one random bit at a time. I was told that this was the worst possible way to do it and that I should use a GOOD random generator instead. I tried it and it made no difference at all. So I went back to generating one random bit at a time.
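For reference, here is a minimal sketch (not Crafty's actual code; the array name and its contents are assumptions for illustration) of the minimum-Hamming-distance measurement bob describes above:

Code: Select all

/* Sketch of the measurement described above: the smallest Hamming distance
   between any two keys in a set of 788 64-bit Zobrist keys. The array name
   and its initialization are illustrative assumptions. */
#include <stdint.h>

#define NKEYS 788

extern uint64_t zobrist_keys[NKEYS];   /* filled elsewhere by the PRNG */

static int popcount64(uint64_t x)
{
    int n = 0;
    while (x) { x &= x - 1; n++; }   /* clear the lowest set bit */
    return n;
}

int min_hamming_distance(void)
{
    int best = 64;
    for (int i = 0; i < NKEYS; i++)
        for (int j = i + 1; j < NKEYS; j++) {
            int d = popcount64(zobrist_keys[i] ^ zobrist_keys[j]);
            if (d < best)
                best = d;
        }
    return best;
}

The improvement loop bob mentions then simply generates a fresh random key, substitutes it for one of the 788, and keeps the substitution only when this minimum increases.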
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Zobrist key random numbers

Post by Michael Sherwin »

wgarvin wrote:Maximizing Hamming distance between any 2 keys in the set of 788 (or however many) does not seem like a good way to optimize the Zobrist keys, because it is not a great model for how the keys are actually used. All interesting board positions contain more than 2 of the keys from the entire set of 788 (one white king + one black king + several others). Since all of the used entries are XORed together to make the Zobrist key, it seems reasonable to take a group of related keys (e.g. the 64 keys for "white bishop on square X") and XOR together all subsets of up to N of those keys, where N is the maximum number of that piece that typically appears in a position. I was hasty in reading the thread, but I think this was more or less what hgm suggested already.

Now: Most positions will contain a small number of the 64 keys for each piece (for pawns, the position will contain 0 to 8 keys out of 48 keys for each color, and for pieces it will usually contain 0 to 2 keys out of 64 keys for each color). Maybe you could take advantage of that property to design keys that have fewer collisions on actual chess positions than randomly-generated keys do (though I doubt it will make much difference, in the end).

So I would suggest that you first try to design two sets of 48 Zobrist keys for the pawns, where (following hgm's idea) you can take all subsets of N=6 or N=7 of those 48 keys, XOR them together, and then measure the Hamming distance between every one of those subsets. (N=6 seems like it would be >8 billion combinations... using N=4 might be more practical, and hopefully good enough). First try to maximize that, so that at least several bits are being flipped by each placement of a pawn. Keep the final list of subsets for the two sets of pawn keys (call them X and Y).

After making two sets of pawn keys, it should be relatively simple to create sets of 64 keys for each other piece, with the property that all subsets of N=2 have decent Hamming distances from each other (and from all of the values on the X and Y lists).

Hmm, another idea (that might be dumb): When measuring bitcounts or Hamming differences or whatever... Maybe give a higher weight to the lowest 28 bits or so of the key? Because the lower bits control the hardware address in the transposition table, while the higher bits are only used to disambiguate the inevitable collisions. You don't want situations where the positions from a search tree are competing for the available transposition table entries any more than necessary. I'm not sure if that would matter much, though. And ultimately I'll be surprised if any of this results in lower collision rates than you get from randomly-assigned Zobrist keys.
Initialize the Zobrist keys at the beginning of each search by searching 3 ply deep, so that there are no collisions or clumping, and then just go from there.
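To make the subset suggestion above concrete, here is a rough sketch for the simplest case, N=2 (function and array names are illustrative, not from any particular engine): XOR every pair of the 64 keys for one piece type, then measure the smallest Hamming distance among those combinations.

Code: Select all

/* Sketch of the subset idea for N = 2: combine every pair of the 64 keys for
   one piece type by XOR, then find the minimum Hamming distance between any
   two of those combinations. Larger N (e.g. the pawn case) enumerates far
   more subsets, as noted in the post above. */
#include <stdint.h>

static int popcount64(uint64_t x)
{
    int n = 0;
    while (x) { x &= x - 1; n++; }
    return n;
}

int min_distance_over_pairs(const uint64_t keys[64])
{
    static uint64_t combos[64 * 63 / 2];   /* C(64,2) = 2016 pairwise XORs */
    int ncombos = 0;

    for (int i = 0; i < 64; i++)
        for (int j = i + 1; j < 64; j++)
            combos[ncombos++] = keys[i] ^ keys[j];

    int best = 64;
    for (int a = 0; a < ncombos; a++)
        for (int b = a + 1; b < ncombos; b++) {
            int d = popcount64(combos[a] ^ combos[b]);
            if (d < best)
                best = d;
        }
    return best;
}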
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Zobrist key random numbers

Post by bob »

hgm wrote:Of course!

I measured that first:

The memory access has a latency that is 2 clocks longer than for the aligned access. But it is not a 2-cycle pipeline stall; execution of other instructions continues as usual during these cycles. Even another memory load could be done in one of the two cycles. (Not in the other, of course, as there are really two loads needed when the data is distributed over two words.)

So the cost is fairly small, in any case much smaller than the latency of an L1 cache miss that has to be satisfied from L2 (20 clocks on an AMD, or on an Intel when the replaced line is dirty). So it saves a lot of time if I make sure the Zobrist table will reside in L1 always, because it is small enough to fit there together with the other core data.

Remember that a Zobrist table on a board of 81 squares with 13 different piece types for each color can be rather big, and waste most of your L1 cache, if not already overflowing it on some machines. So it comes in very handy that this is possible.
That is a _long_ way from right. Unaligned memory accesses are _far_ worse than that. In your case, a significant number of them span two cache blocks in fact. But ignoring that, you can find a ton of tests that have been run that show that unaligned accesses are beyond bad. Some recent tests show a 2x slowdown when comparing the two...

I think you are considering a pure cache hit case, which is _not_ the real issue. But even for L1 cache, 2 cycles won't cut it... Not quite sure where the "2 cycle penalty" comes from...

Intel claims 2x-3x the number of microops for unaligned memory accesses, which, again, is not two clock cycles...

It is bad. Really bad...
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: Zobrist key random numbers

Post by Gerd Isenberg »

Greg Strong wrote:
Gerd Isenberg wrote:One may try unaligned SSE2 or SSE3 load instruction and 15 (or 16 for padding) additional bytes.

Code: Select all

/* requires <emmintrin.h> (SSE2); GCC spells _mm_cvtsi64x_si128 as _mm_cvtsi64_si128 */
char Zobrist[12*64+16];   /* byte-indexed key table; 16 bytes of padding for the 16-byte load */

__m128i zob;
unsigned long long hashkey = ...;
zob = _mm_loadu_si128( (__m128i*)(Zobrist + 64*pieceType + squareNr) );
zob = _mm_xor_si128( zob, _mm_cvtsi64x_si128(hashkey) );   /* only the low 64 bits are used */
hashkey = _mm_cvtsi128_si64( zob );
What's the advantage of doing this with SSE instructions?
Probably the unaligned load instruction via LDDQU:
http://software.intel.com/en-us/article ... hitecture/
http://forums.amd.com/devblog/blogpost. ... adid=94377
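For comparison, the scalar update that the SSE fragment replaces would look roughly like this (a sketch assuming the conventional layout of one aligned 64-bit key per piece/square; names are illustrative):

Code: Select all

/* Conventional formulation: one aligned 64-bit key per (piece, square).
   No unaligned load is needed, but the table is 12*64*8 = 6144 bytes
   instead of the 12*64+16 bytes of the byte-indexed table above. */
#include <stdint.h>

uint64_t zobristKeys[12][64];
uint64_t hashkey;

void toggle_piece(int pieceType, int squareNr)
{
    hashkey ^= zobristKeys[pieceType][squareNr];
}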
hgm
Posts: 27808
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Zobrist key random numbers

Post by hgm »

Well, this is what I measured. I can't take any responsibility for what you read or what Intel says.

This number seems to be highly CPU dependent anyway, so it would always be good to check if what you read in fact applies to your own model. I just repeated the test for my Celeron-M and guess what:

There isn't a mis-alignment penalty at all there! :shock:

If I put the instruction movl a+N(%edx),%edx in a loop and time it (the cache-line aligned array a is pre-filled with all zeros, so that %edx will remain zero during the entire test), it always takes 3 clocks, whether N=0, 7, or 30. Only when N = 61, 62, or 63 does it take 12 clocks.

So indeed it seems to be quite expensive if the misaligned access straddles a cache line, on Celeron M (9-cycle penalty). If I embed the instruction in an execute-bound loop (of 20 register increments) it increases the duration of that loop by 4 cycles (from 13 to 17), so it is not a pure 9-cycle stall. Anyway, a 9-cycle penalty (making the access latency 12 clocks) is still a lot faster than a 20-cycle L2 access.

For randomly scattered unaligned accesses, 3 out of 64 accesses would incur this 9-cycle penalty. That would be an average penalty of less than half a clock cycle.

But as I am using a 0x88 board, the square numbering is in fact such that I can arrange it so that there would never be a cache-line straddling misalignment when fetching Zobrist keys.
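For anyone who wants to repeat this kind of measurement in C rather than assembly, a rough sketch follows. It is not the exact test described above; it assumes GCC or Clang on x86-64 (for __rdtsc and the alignment attribute) and chains each load through the previous result so that latency rather than throughput is measured.

Code: Select all

/* Rough sketch: time a dependent chain of 8-byte loads at every offset 0..63
   inside a 64-byte-aligned, zero-filled buffer. Offsets 57..63 straddle a
   64-byte cache line. A serious test would pin the core, fix the clock
   frequency and serialize around rdtsc; this only gives a rough picture. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <x86intrin.h>

#define ITERS 1000000L

unsigned char buf[128] __attribute__((aligned(64)));   /* zero-initialized */

int main(void)
{
    for (int offset = 0; offset < 64; offset++) {
        uint64_t v = 0;
        uint64_t t0 = __rdtsc();
        for (long i = 0; i < ITERS; i++) {
            /* the address depends on the previous load (v stays 0 because the
               buffer is all zeros), forcing the loads to execute back to back */
            memcpy(&v, buf + offset + (v & 1), 8);
        }
        uint64_t t1 = __rdtsc();
        printf("offset %2d: %5.1f cycles per load (v=%llu)\n",
               offset, (double)(t1 - t0) / ITERS, (unsigned long long)v);
    }
    return 0;
}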
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Zobrist key random numbers

Post by bob »

hgm wrote:Well, this is what I measured. I can't take any responsibility for what you read or what Intel says.

This number seems to be highly CPU dependent anyway, so it would always be good to check if what you read in fact applies to your own model. I just repeated the test for my Celeron-M and guess what:

There isn't a mis-alignment penalty at all there! :shock:
All this shows is that your measurement is flawed. Simple case: An L1 cache line is 32 bytes long. You are going to try to do 32 fetches from that line, 7 of which will overlap into the _next_ cache line. So 7 out of every 32 fetches are going to result in _two_ L1 cache hit penalties. That is not "zero cost" (shock). More importantly, this data won't be in L1 for every access. So now you get one or two _much_ slower L2 cache hits to fill in the missing data for L1. _Again_ not "zero" cost (again, shock). And when one of those two blocks is not in L2, and since there are two blocks you have 2x the probability of this happening, now you get two long memory stalls.

I'd suggest you check your measurements... We have no "zero-cycle" L1 caches around. I7 is a 4-cycle _hit_ penalty, for reference. You just doubled that for 1/4 of your random data accesses...



If I put the instruction movl a+N(%edx),%edx in a loop and time it (the cache-line aligned array a is pre-filled with all zeros, so that %edx will remain zero during the entire test), it always takes 3 clocks, whether N=0, 7, or 30. Only when N = 61, 62, or 63 does it take 12 clocks.
No idea what you are measuring. First, L1 cache is 32 bytes wide per line. Second, you are reading unaligned dwords (32 bit values). The Zobrist numbers are 64 bit values which will turn the above into something 3.5x longer. Why you see this at 64 is unknown since the L1 cache is 32. L2 is 64 on most Intel processors (PIV was 128).

So indeed it seems to be quite expensive if the misaligned access straddles a cache line, on Celeron M (9-cycle penalty). If I embed the instruction in an execute-bound loop (of 20 register increments) it increases the duration of that loop by 4 cycles (from 13 to 17), so it is not a pure 9-cycle stall. Anyway, a 9-cycle penalty (making the access latency 12 clocks) is still a lot faster than a 20-cycle L2 access.

For randomly scattered unaligned accesses, 3 out of 64 accesses would incur this 9-cycle penalty. That would be an average penalty of less than half a clock cycle.
Again, you are using the _wrong_ word size. These are 64 bit values. It is also unreasonable to assume these are all L1 cache hits. Your L1 might have 32k bytes of data, or 16K depending. You use these values once when you Make a move, then you do all the other work of generating moves, updating search state, and evaluating a position. It is unlikely that for any reasonable implementation, the Zobrist random numbers are going to remain in L1 for the next use...



But as I am using a 0x88 board, the square numbering is in fact such that I can arrange it so that there would never be a cache-line straddling misalignment when fetching Zobrist keys.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Zobrist key random numbers

Post by bob »

Gerd Isenberg wrote:
Greg Strong wrote:
Gerd Isenberg wrote:One may try unaligned SSE2 or SSE3 load instruction and 15 (or 16 for padding) additional bytes.

Code: Select all

/* requires <emmintrin.h> (SSE2); GCC spells _mm_cvtsi64x_si128 as _mm_cvtsi64_si128 */
char Zobrist[12*64+16];   /* byte-indexed key table; 16 bytes of padding for the 16-byte load */

__m128i zob;
unsigned long long hashkey = ...;
zob = _mm_loadu_si128( (__m128i*)(Zobrist + 64*pieceType + squareNr) );
zob = _mm_xor_si128( zob, _mm_cvtsi64x_si128(hashkey) );   /* only the low 64 bits are used */
hashkey = _mm_cvtsi128_si64( zob );
What's the advantage of doing this with SSE instructions?
Probably the unaligned load instruction via LDDQU:
http://software.intel.com/en-us/article ... hitecture/
http://forums.amd.com/devblog/blogpost. ... adid=94377
The problem is, this just makes it _work_. It doesn't make it _efficient_ to do unaligned accesses...
hgm
Posts: 27808
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Zobrist key random numbers

Post by hgm »

bob wrote:I'd suggest you check your measurements... We have no "zero-cycle" L1 caches around. I7 is a 4-cycle _hit_ penalty, for reference. You just doubled that for 1/4 of your random data accesses...
I suggest you check my measurements, as it is obvious that:

1) you are totally ignorant on this topic, not even knowing the length of a cache line
2) you don't properly read what I write, confusing a 0-cycle misalignment penalty with a total latency
3) you are too stubborn to believe what someone else writes anyway

Good luck!
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Zobrist key random numbers

Post by bob »

hgm wrote:
bob wrote:I'd suggest you check your measurements... We have no "zero-cycle" L1 caches around. I7 is a 4-cycle _hit_ penalty, for reference. You just doubled that for 1/4 of your random data accesses...
I suggest you check my measurements, as it is obvious that:

1) you are totally ignorant on this topic, not even knowing the length of a cache line
2) you don't properly read what I write, confusing a 0-cycle misalignment penalty with a total latency
3) you are too stubborn to believe what someone else writes anyway

Good luck!
OK, I am totally ignorant. What do _you_ think the length of an L1 cache line is? L2?

I _specifically_ said "improper alignment is bad" and I explained why. The biggest penalty comes because of the double cache line access, requiring _two_ cache hit cycles to transfer _one_ 8 byte value.

But go on in ignorance, or perhaps go to Intel's web site and read a bit.

I'll be waiting to hear your cache line length answers...

Be sure to give the CPU type since they vary even in Intel...
hgm
Posts: 27808
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Zobrist key random numbers

Post by hgm »

Everyone knows that the cache-line length has been 64 bytes for nearly a decade! (Ever since DDR memory was introduced, and the memory burst cycle was increased from 4 to 8 words.)