Incremental updating for positional evaluation


Incremental updating for positional evaluation

Post by sje »

Many programs do not perform a full positional evaluation at nodes where the current material score and the alpha/beta window suggest that the calculation probably won't make a difference. That's a big time saver, although it can make mistakes.
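Something like this, schematically (a sketch only; all names and the margin value are illustrative, not from any particular engine):

Code: Select all

// Lazy-eval shortcut: if material plus a safety margin cannot reach
// the alpha/beta window, skip the expensive positional part.
extern int materialScore();    // cheap; typically maintained incrementally
extern int positionalScore();  // the expensive part

int evaluate(int alpha, int beta) {
    int lazy = materialScore();
    const int MARGIN = 300;    // largest plausible positional swing, centipawns
    if (lazy + MARGIN <= alpha || lazy - MARGIN >= beta)
        return lazy;           // cannot affect the window either way
    return lazy + positionalScore();
}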

One idea is to have incremental updating for positional evaluation after each move is made or unmade, just as is done with incremental bitboard database updating. This could save a large amount of time for complex positional evaluation code.

Consider: Say that the bulk of the positional evaluation code consists of scanning occupied squares and then summing the values produced for each piece of a given color on a square. For each piece/square evaluation, have the program generate and store a bitboard of all those squares that must retain their occupation status over a move for the generated evaluation value to remain valid. On the next call to the evaluator, the code can quickly check if a move affects the subscore by checking the from and to squares against the stored bitboard for the piece. If there's no change, then the previously computed value for that piece can be simply retrieved instead of recalculated.

I'd guess that for most pieces, a good start on establishing the effect bitboard would be to OR together the AttackFrom and AttackTo bitboards for the square and then add in the square itself.

Now the above scheme won't be worthwhile if the positional evaluation code is very simple. But it seems to me that the bitboard operations, assuming that the AttackFrom and AttackTo database is available, will be cheap enough to pay off in the long run.
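Schematically, the per-piece caching might look like this (a sketch only; attacksFrom/attacksTo stand for whatever attack database the engine maintains, and the other names are made up):

Code: Select all

// Per-piece effect bitboards: cache each piece/square evaluation term
// together with the set of squares whose occupancy it depends on.
#include <cstdint>

typedef uint64_t Bitboard;

extern Bitboard attacksFrom(int sq);   // squares attacked from sq
extern Bitboard attacksTo(int sq);     // squares attacking sq
extern int evaluatePieceOn(int sq);    // the expensive per-piece term

struct PieceEvalCache {
    Bitboard effect[64];   // squares the cached value depends on
    int      value[64];    // cached term for the piece on that square
    bool     valid[64];
};

int pieceScore(PieceEvalCache& c, int sq, Bitboard moveMask) {
    // moveMask = bit(from) | bit(to) of the move just made or unmade
    if (c.valid[sq] && (c.effect[sq] & moveMask) == 0)
        return c.value[sq];            // relevant occupancy unchanged

    // Recompute; effect = AttackFrom | AttackTo | the square itself
    c.effect[sq] = attacksFrom(sq) | attacksTo(sq) | (Bitboard(1) << sq);
    c.value[sq]  = evaluatePieceOn(sq);
    c.valid[sq]  = true;
    return c.value[sq];
}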

Re: Incremental updating for positional evaluation

Post by Harald Johnsen »

sje wrote:...
For each piece/square evaluation, have the program generate and store a bitboard of all those squares that must retain their occupation status over a move for the generated evaluation value to remain valid. On the next call to the evaluator, the code can quickly check if a move affects the subscore by checking the from and to squares against the stored bitboard for the piece. If there's no change, then the previously computed value for that piece can be simply retrieved instead of recalculated.
So one question is whether the saving is worth it. For each move played you need to copy a few bitboards (at least one per piece) and do one or more bitwise computations. It reminds me of the incremental updating of attack tables - except that the position of one piece has more influence than just a delta in attacks. You still need to re-compute terms that depend on piece placement (king safety, passed pawns, etc.).
My impression is that too much depends on piece placement. Of course, it should not be hard to gather some statistics.

HJ.

Re: Incremental updating for positional evaluation

Post by Guetti »

Currently with my engine I get the following results for the starting position:
Evals: 15181272
LazyEvals: 7771564 (51.19% of Evals)

50% of the evals are lazy-eval cutoffs, which only consider material and some essential evaluation terms that might win or lose the game. I guess in tactical positions this percentage increases to maybe 80%. That means that in 80% of the positions you would incrementally update positional stuff that you are not going to use. Wouldn't that make the incremental updates inefficient?
I have to mention, however, that I do some incremental updating of piece-square values because it is easy to do.
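(For illustration, that kind of piece-square updating is just a couple of additions per move - a sketch with made-up names, ignoring color/sign conventions:)

Code: Select all

// Incremental piece-square-table updating in make/unmake.
// pst[][] and the piece encoding are illustrative; color handling omitted.
extern int pst[16][64];   // value of piece type p on square sq
int pstScore = 0;         // running sum, maintained on every move

void makeMovePst(int piece, int from, int to, int captured) {
    pstScore += pst[piece][to] - pst[piece][from];
    if (captured)
        pstScore -= pst[captured][to];   // capture removes a piece
}

void unmakeMovePst(int piece, int from, int to, int captured) {
    pstScore -= pst[piece][to] - pst[piece][from];
    if (captured)
        pstScore += pst[captured][to];
}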

Re: Incremental updating for positional evaluation

Post by hgm »

If you want to go this road, why not update the evaluation terms differentially all the way, rather than only checking whether they changed? Normally you know which evaluation terms could depend on the position of the moved piece. (E.g. if I move a Pawn one step forward, it could create a passer bonus for enemy Pawns left and right of its fromSquare, and a backward-pawn penalty for own Pawns there.) So you would only have to compute the effect of the move on those terms.

If your evaluation has very many terms which kick in only rarely, you could keep a list of the terms that actually contribute. E.g. if you give points for soft pins on the Queen, each occurrence of an enemy slider aligned with your Queen, with exactly one of your pieces in between, could be stored in a list, hooked up to the 3 involved pieces. If any of the involved pieces moves, you would immediately know it contributed this term, subtract it from the differential evaluation score, and unhook it from the lists.
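Roughly like this, as a sketch (data layout and names invented; undo handling and slot reuse omitted):

Code: Select all

// A list of currently contributing evaluation terms, hooked up to the
// pieces involved, so a move immediately tells which terms to retract.
#include <vector>

struct EvalTerm {
    int  score;       // contribution to the differential eval
    bool live;        // false once retracted
    int  pieces[3];   // ids of the involved pieces
};

std::vector<EvalTerm> terms;
std::vector<int> termsOf[32];   // per piece id: indices into terms
int diffEval = 0;               // differentially maintained score

// E.g. on detecting a soft pin: slider - blocker - queen.
void addTerm(int score, int a, int b, int c) {
    int idx = (int)terms.size();
    terms.push_back({score, true, {a, b, c}});
    termsOf[a].push_back(idx);
    termsOf[b].push_back(idx);
    termsOf[c].push_back(idx);
    diffEval += score;
}

// Whenever one of the involved pieces moves, retract its terms.
void retractTermsOf(int piece) {
    for (int idx : termsOf[piece])
        if (terms[idx].live) {            // stale indices left in the other
            terms[idx].live = false;      // pieces' lists are skipped here
            diffEval -= terms[idx].score;
        }
    termsOf[piece].clear();
}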

Re: Incremental updating for positional evaluation

Post by Gerd Isenberg »

My feeling is that minimal updating, with computation on the fly when needed, is faster in the long run - if we use a separate hash table for evaluation and/or store eval scores in the main hash table as well - and/or perform lazy-eval stuff.

OTOH, if you probe huge tables everywhere, you may do a lot of fancy stuff (incrementally or on the fly) to hide the huge latency - after the hash key has already been updated by the move, followed immediately by a prefetch of the hash entry - and before the probe actually occurs. Using prefetch or not makes a difference of about 200 KN/second in my program on my "old" AMD-K8 machine.
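Schematically (a sketch; only _mm_prefetch is a real API, the other names are made up):

Code: Select all

// Prefetch the hash entry right after the hash key has been updated
// in MakeMove, so the cache line is in flight long before the probe.
#include <xmmintrin.h>   // _mm_prefetch, _MM_HINT_T0
#include <cstdint>

struct HashEntry { uint64_t key; uint64_t data; };
extern HashEntry* hashTable;
extern uint64_t   hashMask;

inline void prefetchHashEntry(uint64_t hashKey) {
    const HashEntry* e = &hashTable[hashKey & hashMask];
    _mm_prefetch(reinterpret_cast<const char*>(e), _MM_HINT_T0);
}
// Called in MakeMove, immediately after the incremental key update;
// the actual probe happens much later, at the top of the daughter node.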

Gerd

Re: Incremental updating for positional evaluation

Post by hgm »

This is interesting. I never could get prefetch to work for search. How do you do it? After I generate moves and pick the first one to search (the earliest time I would know what to prefetch), I basically need the probe result immediately: even before MakeMove if it will be a cutoff, and immediately after MakeMove to try the hash move in the daughter node. And MakeMove does not take enough time to hide anything under.

In perft I only got a beneficial effect when I prefetched a few moves ahead (2 or 3) in pre-horizon nodes. But in search such prefetching can easily backfire, as you might not want to search the move you started to prefetch. And then your subsequent accesses would have to wait for that unnecessary prefetch to complete.

Do you only do the prefetch when you are reasonably sure that you are in an alpha node? (E.g. for the late moves.)

Re: Incremental updating for positional evaluation

Post by Gerd Isenberg »

hgm wrote:This is interesting. I never could get prefetch to work for search. How do you do it? After I generate moves and pick the first one to search (the earliest time I would know what to prefetch), I basically need the probe result immediately: even before MakeMove if it will be a cutoff, and immediately after MakeMove to try the hash move in the daughter node. And MakeMove does not take enough time to hide anything under.

In perft I only got a beneficial effect when I prefetched a few moves ahead (2 or 3) in pre-horizon nodes. But in search such prefetching can easily backfire, as you might not want to search the move you started to prefetch. And then your subsequent accesses would have to wait for that unnecessary prefetch to complete.

Do you only do the prefetch when you are reasonably sure that you are in an alpha node? (E.g. for the late moves.)
I use _mm_prefetch(hash_address) at the beginning of each node.
After I have checked for repetitions, I call a huge routine to do a lot of SSE2 fill stuff (C++, not assembly), reading one cache line and writing a few cache lines on an external stack[ply]. One or two internal stack cache lines are used as well inside that routine (mostly for saving/restoring the callee-saved xmm6...xmm13 and some GP registers, plus the return address of course), plus one cache line shared by all nodes of that "search object". Near the end of this routine, I use movdqa four times to read one cache line, i.e. four 16-byte slots, from the prefetched hash_address, and store the key-matching slot (if any) to a local address inside the "search object" as well.

Before I further interpret the key-matching entry, I already have all legal moves generated, stored in 16 target bitboards, one per direction (including the knight directions), to decide about check and single-reply extensions - plus some other stuff (attack sets, pinned pieces, potential discovered checkers, etc.) that I use in eval.

Of course, if I get a sufficient hash hit with an early return, all the work done is thrown away. OTOH, that ALU-dominant stuff with only a few (likely L1) memory accesses seems appropriate for hiding the hash latency.

I also had no success with prefetch in my old 32-bit program (on the same K8 X2 hardware, btw), but that was designed differently and probably had too many other memory references (but fewer instructions) between the prefetch and the actual probe.
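In outline, the node then does something like this (a sketch of the ordering only, with invented names - not the actual code):

Code: Select all

// Ordering: prefetch first, speculative ALU-heavy work, probe last.
#include <cstdint>

extern void     prefetchHashEntry(uint64_t key);      // as sketched above
extern uint64_t currentHashKey();
extern bool     isRepetition();
extern void     generateMovesAndAttackInfo();         // the SSE2 fill stuff
extern bool     probeHash(uint64_t key, int& score);  // true on usable hit
extern int      searchMoves();                        // the normal move loop

int searchNode() {
    prefetchHashEntry(currentHashKey());   // issued at the top of the node

    if (isRepetition())
        return 0;

    // Speculative work while the hash cache line is in flight:
    // move generation into per-direction bitboards, attack sets,
    // pinned pieces, discovered-check candidates, ...
    generateMovesAndAttackInfo();

    int score;
    if (probeHash(currentHashKey(), score))
        return score;          // early return: speculative work wasted

    return searchMoves();      // otherwise all of it is needed anyway
}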

Re: Incremental updating for positional evaluation

Post by wgarvin »

Gerd Isenberg wrote:I use _mm_prefetch(hash_address) at the beginning of each node.
After I have checked for repetitions, I call a huge routine to do a lot of SSE2 fill stuff (C++, not assembly), reading one cache line and writing a few cache lines on an external stack[ply]. One or two internal stack cache lines are used as well inside that routine (mostly for saving/restoring the callee-saved xmm6...xmm13 and some GP registers, plus the return address of course), plus one cache line shared by all nodes of that "search object". Near the end of this routine, I use movdqa four times to read one cache line, i.e. four 16-byte slots, from the prefetched hash_address, and store the key-matching slot (if any) to a local address inside the "search object" as well.

Before I further interpret the key-matching entry, I already have all legal moves generated, stored in 16 target bitboards, one per direction (including the knight directions), to decide about check and single-reply extensions - plus some other stuff (attack sets, pinned pieces, potential discovered checkers, etc.) that I use in eval.

Of course, if I get a sufficient hash hit with an early return, all the work done is thrown away. OTOH, that ALU-dominant stuff with only a few (likely L1) memory accesses seems appropriate for hiding the hash latency.

I also had no success with prefetch in my old 32-bit program (on the same K8 X2 hardware, btw), but that was designed differently and probably had too many other memory references (but fewer instructions) between the prefetch and the actual probe.
So to summarize: you know you're going to pay the latency for the hash-table probe anyway, so you take other work that you would have to do after the table probe and speculatively do it while the table probe is being prefetched. If you cut off using the hash table, this speculative work was wasted, but as long as the speculative work was not much larger than the cost of an L2 cache miss, it is not really worse than just doing the cache miss. OTOH, if you don't cut off using the hash table, some of the needed work has already been done.

In the best case, the speculative work and the L2 cache miss take the same amount of time, so the cut-off nodes are not any slower, and non-cut-off nodes are faster (their L2 cache miss latency is completely hidden). I guess it will rarely be this ideal, but if you're anywhere in the ballpark it is still a net win.

Re: Incremental updating for positional evaluation

Post by Gerd Isenberg »

wgarvin wrote: So to summarize: you know you're going to pay the latency for the hash-table probe anyway, so you take other work that you would have to do after the table probe and speculatively do it while the table probe is being prefetched. If you cut off using the hash table, this speculative work was wasted, but as long as the speculative work was not much larger than the cost of an L2 cache miss, it is not really worse than just doing the cache miss. OTOH, if you don't cut off using the hash table, some of the needed work has already been done.

In the best case, the speculative work and the L2 cache miss take the same amount of time, so the cut-off nodes are not any slower, and non-cut-off nodes are faster (their L2 cache miss latency is completely hidden). I guess it will rarely be this ideal, but if you're anywhere in the ballpark it is still a net win.
Exactly, that is the basic idea: hide the random memory read, which might cost up to two or three times the memory latency if TLB misses occur.

Software prefetch is tricky and may vary from processor to processor - and the exact circumstances under which it pays off are a bit unclear to me. It interacts with other memory accesses and, I guess, with the hardware prefetcher as well. It seems that register-intensive stuff with only a few other reads, but a few more writes - likely from/to L1 - does the trick.

I am curious how it behaves with multiple threads - but that still takes some time.

Re: Incremental updating for positional evaluation

Post by wgarvin »

Gerd Isenberg wrote:
wgarvin wrote: So to summarize: you know you're going to pay the latency for the hash-table probe anyway, so you take other work that you would have to do after the table probe and speculatively do it while the table probe is being prefetched. If you cut off using the hash table, this speculative work was wasted, but as long as the speculative work was not much larger than the cost of an L2 cache miss, it is not really worse than just doing the cache miss. OTOH, if you don't cut off using the hash table, some of the needed work has already been done.

In the best case, the speculative work and the L2 cache miss take the same amount of time, so the cut-off nodes are not any slower, and non-cut-off nodes are faster (their L2 cache miss latency is completely hidden). I guess it will rarely be this ideal, but if you're anywhere in the ballpark it is still a net win.
Exactly, that is the basic idea: hide the random memory read, which might cost up to two or three times the memory latency if TLB misses occur.

Software prefetch is tricky and may vary from processor to processor - and the exact circumstances under which it pays off are a bit unclear to me. It interacts with other memory accesses and, I guess, with the hardware prefetcher as well. It seems that register-intensive stuff with only a few other reads, but a few more writes - likely from/to L1 - does the trick.

I am curious how it behaves with multiple threads - but that still takes some time.
I just realized I was thinking of it in the context of an in-order processor (like a PPC one I use at work), where you would have to try and match the L2 cache miss cost (610 cycles on the one I work with) to the amount of speculative work being done.

Since it's an out-of-order processor, I guess the costs are pretty fungible, and it's a lot less clear how to get "optimal" performance. But I guess as long as you give it lots of stuff to chew on between the prefetch and the actual read, it will end up being faster. Some of the latency is hidden because the read doesn't issue until a while after the prefetch is issued. More of it might be hidden by issuing and executing instructions in between that would otherwise have to be issued and executed later, but I'm not as certain about that - shouldn't the chip be able to speculate past the branch anyway? Unless it mispredicts it... maybe that is the source of your speedup?

[Edit: oh, and writes are effectively wicked cheap - unless you try to do large numbers of them, or something - because of store buffers, write combining and store forwarding. The store forwarding in particular is something I miss on this in-order PPC chip. You can't even convert an int to a float on the damn thing without writing to memory and paying a 40-cycle Load-Hit-Store penalty. Super annoying.]