Incremental Zobrist - slow?

vladstamate · Post by **vladstamate** » Sat Jun 20, 2009 6:00 am

Hi all,

I hit a bit of a wall regarding speed of Zobrist key/hash management. For a long time my program generated the entire hash for a given board from scratch (when needed, like for rep detection - every makeMove and for evaluation). I decided to improve that and take advantage of the Zobrist key property that you can just keep it up to date as you make move by zor-ing the required keys. It turns out I lose about 30% of speed doing that!!

I must be doing something wrong. The code to update the hash of the board is done in make move, and looks something like this (for a non-capture) for example:

Code: Select all

m_kHashKey ^= g_kPieceCodes&#91;m_iSideToMove&#93;&#91;kMove.m_piece&#93;&#91;fromSq&#93; ^ g_kPieceCodes&#91;m_iSideToMove&#93;&#91;kMove.m_piece&#93;&#91;toSq&#93;;

This turns out to generate a lot of code in assembler. I am compiling for a 64 bit platform. Also when using AMD's CodeAnalyst it does indeed point out that a lot of cycles (about 30%) of the makeMove function are spent in there.

Is there anything that I am missing? Some tricks that people do to generate those xors faster? Store the keys in a better way?

Thanks in advance,
Vlad.

wgarvin · Post by **wgarvin** » Sat Jun 20, 2009 7:09 am

vladstamate wrote:

Code: Select all

m_kHashKey ^= g_kPieceCodes&#91;m_iSideToMove&#93;&#91;kMove.m_piece&#93;&#91;fromSq&#93; ^ g_kPieceCodes&#91;m_iSideToMove&#93;&#91;kMove.m_piece&#93;&#91;toSq&#93;;

This probably means that your array has dimensions [2][6][64] or something like that.

To index into it at g_kPieceCodes[side][piece][square], the compiler has to calculate the address, something like:

Code: Select all

address  =  g_kPieceCodes + sizeof&#40;bitboard&#41; * &#40;side*6*64 + piece*64 + square&#41;;

Multiplying by sizeof(bitboard) is a shift left by 3, which can be done with the addressing mode on x86, so you can think of it as "free" here.

Multiplying by 64 is the same as shift left by 6, but multiplying by 6*64 is going to take several smaller math instructions (or an actual multiply instruction, which is kind of slow).

It would help if you used a pieceAndSide value that had [side] and [piece] combined together into one. For example, if the least-significant bit of each piece number represented the side. Then it would be more like this:

Code: Select all

address = g_kPieceCodes + sizeof&#40;bitboard&#41; * &#40;pieceAndSide*64 + square&#41;;

That is just a shift, an add and then the memory lookup. The expensive multiply should be gone.

If you don't want to change how you assigned the piece numbers, you could try just changing the array dimensions to [2*6][64] and the lookup code to something like

Code: Select all

m_kHashKey ^= g_kPieceCodes&#91;kMove.m_piece*2+m_iSideToMove&#93;&#91;fromSq&#93; ^ g_kPieceCodes&#91;kMove.m_piece*2+m_iSideToMove&#93;&#91;toSq&#93;;

The compiler will do something smart for the *2 (maybe add it to itself, or shift it left by one). Its optimizer will figure out that (kMove.m_piece*2+m_iSideToMove) only needs to be computed once, but if you want to be on the safe side you could do that computation into a local int variable first.

Note that there might be something else wrong here, because I don't see how that multiply could slow it down enough to be a 30% speed hit (unless maybe you're just testing perft or something)

[Edit: another thing.. make sure you're compiling with optimizations on! Most compilers produce pretty terrible code at -O0 or -O1. I assumed it was for x86-64, but if this is for Itanium or PPC or something like that, maybe the addressing mode stuff is not "for free" either. Also, optimizing compilers will mix together the instructions from lots of different statements to get faster code. So if you're looking at a disassembly of this function, make sure that the instructions you think are part of this xor calculation are really a part of it, and not something unrelated that the compiler happened to sneak in between.]

[Edit 2: what is kMove.piece ? If it is a bitfield, be aware that that is probably slow on every compiler ever made. Most compilers generate kind of crappy code for bitfields. Even if its a structure, some compilers will not handle it especially well.]

wgarvin · Post by **wgarvin** » Sat Jun 20, 2009 9:16 am

I looked up the latency of integer multiply on Core2 and its faster than I thought, only 3-4 cycles of latency. So even with the array arranged in the original order, I can't imagine what could be slowing this down so much.

Do you use lots of arrays or data tables for other things (like eval)? Perhaps the zobrist key entries are not able to stay in the L1 cache. But even if it has to get them from L2 all the time, that is also pretty damn fast on Core2 chips, taking 14 or 15 cycles (instead of 3 cycles for L1 hit).

If you can, post the assembly code? I'm really curious now.

vladstamate · Post by **vladstamate** » Sat Jun 20, 2009 6:56 pm

Here is the assembly obtained using a "release", hence optimized build for x64 arhitecture. I double checked the compile flags and all optimizations seem to be turned on. It seems to generate an awful lot of code for 2 xors and 2 memory address calculation (albeit yes, they do need multiplications).

m_kHashKey ^= g_kPieceCodes[m_iSideToMove][kMove.m_piece][fromSq] ^
g_kPieceCodes[m_iSideToMove][kMove.m_piece][toSq];
000000014000C85E lea r11,[140000000h]
000000014000C865 lea rdx,[rax+rcx*2]
000000014000C869 movzx ecx,byte ptr [rdi+1]
000000014000C86D movzx eax,byte ptr [rdi+3]
000000014000C871 lea ecx,[rax+rcx*8]
000000014000C874 mov rax,rsi
000000014000C877 shl rax,cl
000000014000C87A not rax
000000014000C87D and qword ptr [rbx+rdx*8],rax
000000014000C881 movzx eax,byte ptr [rbx+84h]
000000014000C888 lea rcx,[rax+rax*2]
000000014000C88C movzx eax,byte ptr [rdi]
000000014000C88F lea rdx,[rax+rcx*2]
000000014000C893 movzx ecx,byte ptr [rdi+2]
000000014000C897 movzx eax,byte ptr [rdi+4]
000000014000C89B lea ecx,[rax+rcx*8]
000000014000C89E mov rax,rsi
000000014000C8A1 shl rax,cl
000000014000C8A4 or qword ptr [rbx+rdx*8],rax
000000014000C8A8 movzx eax,byte ptr [rbx+84h]
000000014000C8AF lea rcx,[rax+rax*2]
000000014000C8B3 movzx eax,byte ptr [rdi]
000000014000C8B6 lea rdx,[rax+rcx*2]
000000014000C8BA movzx eax,r14b
000000014000C8BE movzx ecx,r13b
000000014000C8C2 shl rdx,6
000000014000C8C6 add rax,rdx
000000014000C8C9 add rcx,rdx
000000014000C8CC mov rax,qword ptr [r11+rax*8+0C4770h]
000000014000C8D4 xor rax,qword ptr [r11+rcx*8+0C4770h]
000000014000C8DC xor qword ptr [rbx+3080h],rax
000000014000C8E3 or r9d,0FFFFFFFFh

NOTE: yes, kMove is a structure. I will try to combine 2 dimensions of each array. And see if that will help. Thanks for you feedback.

Regards,
Vlad.

Gerd Isenberg · Post by **Gerd Isenberg** » Sat Jun 20, 2009 7:38 pm

vladstamate wrote:Here is the assembly obtained using a "release", hence optimized build for x64 arhitecture. I double checked the compile flags and all optimizations seem to be turned on. It seems to generate an awful lot of code for 2 xors and 2 memory address calculation (albeit yes, they do need multiplications).

//dump code snipped

NOTE: yes, kMove is a structure. I will try to combine 2 dimensions of each array. And see if that will help. Thanks for you feedback.

Regards,
Vlad.

You can not trust the source code assembly mapping in optimized code. The variable shifts by cl, complement, and setting r9d to -1 seem to have nothing to do with hashkeys, unless your compiler has serious problems maybe with bitfield structures. Probably there are some more source lines involved in the assembly. Otherwise there is no mul, but a bunch of leas, which is also not fast. As Wylie mentioned, better make your g_kPieceCodes dimension 16*64 or vice versa. What compiler is that?

Sven · Post by **Sven** » Sat Jun 20, 2009 7:56 pm

vladstamate wrote:For a long time my program generated the entire hash for a given board from scratch (when needed, like for rep detection - every makeMove and for evaluation). I decided to improve that and take advantage of the Zobrist key property that you can just keep it up to date as you make move by zor-ing the required keys. It turns out I lose about 30% of speed doing that!!

[...]
Code: Select all
m_kHashKey ^= g_kPieceCodes&#91;m_iSideToMove&#93;&#91;kMove.m_piece&#93;&#91;fromSq&#93; ^ g_kPieceCodes&#91;m_iSideToMove&#93;&#91;kMove.m_piece&#93;&#91;toSq&#93;;
[...] Also when using AMD's CodeAnalyst it does indeed point out that a lot of cycles (about 30%) of the makeMove function are spent in there.

If you still have the old "from scratch" code you could maintain a counter to find out how often that code was called for a given fixed search scenario, compared to how often the incremental code runs after switching to the incremental version again. Maybe the incremental code runs much more often? This might be one explanation.

You could extend your piece codes array to [2][8][64] and spend another 2k memory, which *might* make your array accesses slightly faster since the compiler can insert one more constant shift instead of a multiplication.

Making the move data structure (kMove) e.g. 2, 4 or 8 bytes large could help, too - just in case it isn't yet.

Also your hashkey update code will surely consist of much more than only that one line you showed, since you are handling several cases like ep, castle, promotion, or capture. And that code is probably full of branches. Maybe you can improve the logic there somehow?

Do you undo the same way, or do you save the old hashkey and simply fall back by decrementing a stack pointer?

Sven

vladstamate · Post by **vladstamate** » Sat Jun 20, 2009 7:57 pm

Hi Gerd,

Yes you are right. A closer look at the code (from bottom up) it looks like there are 2 64bit xors plus about another 10-12 instructions for setting up the rax,rdx offset registers. So indeed, not as many instructions I've listed there.

I will work on "compressing" the arrays by making them 1 dimensional rather than 3-dimensional.

I am using the Microsoft's compiler from Visual Studio 2008.

As a general question though, for chess engine programmers here: I am assuming the practice of incrementally maintaining the hash keys by xoring the correct zobrist keys on makeMove is popular? I mean, makeMove is a pretty "hot" point in my engine being called more times than anything else.

Regards,
Vlad.

bob · Post by **bob** » Sat Jun 20, 2009 8:49 pm

vladstamate wrote:Hi Gerd,

Yes you are right. A closer look at the code (from bottom up) it looks like there are 2 64bit xors plus about another 10-12 instructions for setting up the rax,rdx offset registers. So indeed, not as many instructions I've listed there.

I will work on "compressing" the arrays by making them 1 dimensional rather than 3-dimensional.

I am using the Microsoft's compiler from Visual Studio 2008.

As a general question though, for chess engine programmers here: I am assuming the practice of incrementally maintaining the hash keys by xoring the correct zobrist keys on makeMove is popular? I mean, makeMove is a pretty "hot" point in my engine being called more times than anything else. As a general rule, things used as subscripts ought to be reasonable close to a word size on the architecture in question to avoid the sign extension stuff.

Regards,
Vlad.

You could do the same thing again with no optimizations. That will give you a good idea of how the code for one instruction translates, without co-mingling asm from several C/C++ source lines together.

BTW if that "structure" is a group of bit fields, get rid of that for certain. You should do dyour own and/or/etc stuff. Whenever I see movsx/movzx type instructions, that's a warning that things are not as efficiently laid out as possible.

hgm · Post by **hgm** » Sat Jun 20, 2009 9:25 pm

vladstamate wrote:I must be doing something wrong. The code to update the hash of the board is done in make move, and looks something like this (for a non-capture) for example:
Code: Select all
m_kHashKey ^= g_kPieceCodes&#91;m_iSideToMove&#93;&#91;kMove.m_piece&#93;&#91;fromSq&#93; ^ g_kPieceCodes&#91;m_iSideToMove&#93;&#91;kMove.m_piece&#93;&#91;toSq&#93;;
This turns out to generate a lot of code in assembler. I am compiling for a 64 bit platform. Also when using AMD's CodeAnalyst it does indeed point out that a lot of cycles (about 30%) of the makeMove function are spent in there.

This should not be possible. The differential update just does three times what a full key calculation does 32 times, and you likely do call the full key calculation after every MaeMove(), to do the hash probe.

Are you sure the differential update produces exacty the same key as the full calculation?

vladstamate · Post by **vladstamate** » Sat Jun 20, 2009 10:25 pm

hgm wrote: This should not be possible. The differential update just does three times what a full key calculation does 32 times, and you likely do call the full key calculation after every MaeMove(), to do the hash probe.

Are you sure the differential update produces exacty the same key as the full calculation?

Indeed. A deeper inspection (using AMD CodeAnalyst with a capture for both cases) and some debugging turns out I might have a bug and that the keys produced by both methods are not matching. What that causes is a lot more evaluations (I am missing hash hits so I am evaluating instead of just retrieving the value) hence slowing down code a lot. I am still debugging.

Regards,
Vlad.

Incremental Zobrist - slow?

Incremental Zobrist - slow?

Re: Incremental Zobrist - slow?

Re: Incremental Zobrist - slow?

Re: Incremental Zobrist - slow?

Re: Incremental Zobrist - slow?

Re: Incremental Zobrist - slow?

Re: Incremental Zobrist - slow?

Re: Incremental Zobrist - slow?

Re: Incremental Zobrist - slow?

Re: Incremental Zobrist - slow?