The mailbox trials

Post by **hgm** » Sun Apr 04, 2021 12:00 am

Mike Sherwin wrote: ↑Sat Apr 03, 2021 8:52 pm: It is only a move ordering experiment. At one extreme we can do zero work ahead of time and just do a full move generation to a list of moves and do a one-pass sort to hopefully get the best move the first time. However, if the tt move (or the pv move, if there is no tt move) is played and causes a cutoff, then zero pre-work is done for all those moves not searched. At the other extreme a boatload of work can be done to incrementally keep attack tables up to date, only for all that work to prove useless if the tt/pv move proves to be sufficient.

This is not the right way of looking at it. If the hash/PV move is sufficient for a cutoff, it means that the node it leads to was an all-node. You would have needed the entire move list there. The question is what is the fastest way to get that move list: generating it from scratch, or taking the list you were using two ply earlier and making a few changes to it to account for the effect of the latest two ply. (I.e. for the evacuation of two squares, and the replacement of two of the pieces by an opponent.) If you do that work in two steps, each accounting for one evacuation and one substitution, you cannot say that one of the steps was done in vain. And doing the work in two steps does make sense, because if you don't get a cutoff on the first move, the work of the first step would be shared by the alternatives you try. The task of updating the move list by nature consists of a sequence of steps anyway. (Each step removing or adding a piece.) So there is no overhead in dividing that task over two ply; it is just a matter of deciding which of the steps to do where.

The only thing that matters is if it is faster to get the move list by modifying the one from your grandparent node, or generating one from scratch. If deriving it from the one in the grandparent is faster, the work involved in the derivation is never done 'in vain', not even if the node where you did it got a cutoff on a hash move. Because it was used in the child.

Mike Sherwin wrote: For what Harm is trying to demonstrate to prove favorable it must result in an overall reduction in the number of nodes searched to reach depth in a time per node that does not cancel the benefits of searching fewer nodes.

No, not at all. This has absolutely nothing to do with what I am investigating. The only thing of interest here is how to search the same tree in as short a time as possible, by increasing the nps.

Mike Sherwin wrote: So on average per root position it must search fewer nodes and substantially so. So in this endeavor there must be a gain in search depth in a given time to be able to show a gain in strength. It will prove futile if time per node searched is 'doubled' if the tree size is not more than halved.

This is completely the opposite of what is going on. Of course we are not going to double the time per node. The entire effort is directed at cutting that time in half, or better.

Mike Sherwin wrote: Edit: The above logic supposes that there are some cheap moves like tt/pv or killer moves.

In QS that is usually not the case. It is at the tip of the branches, and breaks new ground that was not visited in the previous iteration. And even if it was, the entries for those nodes in the TT are not likely to survive to the next iteration. They have no depth that would protect them from replacement.

Mike Sherwin wrote: ↑Sat Apr 03, 2021 9:45 pm: I did not sleep well last night so I probably should not be posting on such a complicated subject. Anyway, the technique I use for move ordering, which no one seems to want to try, is, if depth > 3 (> 2 may be better), to do a reduced search with a widened window on each of the remaining moves. The formula (depth > 3) is depth/4. All I can do is say that it works, gaining on average ~1.5 ply deeper searches in a given time.

It is well known that move ordering is crucial for tree size. What you do in d>3 nodes has virtually no impact on speed (nps) at all. Which fraction of the nodes has d>3? 0.1%? Make them 10 times slower and you still only lose 1% in speed overall.
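The estimate can be made explicit. As a rough sketch, assume every node costs the same time before the change, let $f$ be the fraction of nodes with d>3, and let $k$ be the factor by which those nodes become slower (both symbols are introduced here only for illustration):

$$\frac{T'}{T} = (1 - f) + f\,k = 1 + f\,(k - 1)$$

Taking the guessed $f = 0.001$ and $k = 10$ gives $T'/T = 1.009$, a slowdown of about 0.9%, in line with the 1% figure above.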

Post by **hgm** » Sun Apr 04, 2021 12:06 pm

I got the incremental update of the attack map working. But the result was very disappointing: there was virtually no speedup compared to from-scratch generation of the attack map in every node. I don't quite understand that, because the from-scratch generation calls AddMoves() for all 16 pieces, while the incremental update only calls it for 3 pieces: the moved one in its old and new location, and the captured one. I guess it is the need to unmake the changes that breaks the idea: for every move that AddMoves() changes in the attack map it also has to remember the old value. I did that by pushing value plus index on a stack, and then looping through that stack on unmake to put everything back. I guess the large number of loads and stores this requires just doesn't make it competitive. I could still try a copy-make approach; if the copy can be made fast enough this is bound to be faster than a clear-generate approach, and I won't have any unmake overhead. The attack map consists of 128 ints = 512 bytes. Using YMM registers, which are 256 bits (IIRC), this could be done with relatively few memory operations (16 loads and 16 stores of 32 bytes each), because I think the CPU can communicate with the L1 cache through a 256 or even 512-bit-wide data path.
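To make the comparison concrete, here is a minimal sketch of the two bookkeeping schemes described above: an undo stack of (index, old value) pairs versus copy-make. It is not the actual engine code. The names `AttackMap`, `set_entry`, `unmake_to`, `copy_map` and the stack capacity `UNDO_MAX` are all hypothetical, and only the map size (128 ints) comes from the post.

```c
#include <assert.h>
#include <string.h>

#define MAP_SIZE 128   /* 128 ints = 512 bytes, as stated in the post */
#define UNDO_MAX 1024  /* hypothetical capacity of the undo stack */

typedef struct { int index; int old; } UndoEntry;

typedef struct {
    int map[MAP_SIZE];        /* the attack map itself */
    UndoEntry undo[UNDO_MAX]; /* log of overwritten entries */
    int top;                  /* number of entries currently on the log */
} AttackMap;

/* Make side: every store costs an extra load (old value) and two extra
 * stores (index and old value pushed on the undo stack). */
void set_entry(AttackMap *a, int index, int value)
{
    assert(index >= 0 && index < MAP_SIZE && a->top < UNDO_MAX);
    a->undo[a->top].index = index;
    a->undo[a->top].old = a->map[index];
    a->top++;
    a->map[index] = value;
}

/* Unmake side: pop entries in reverse order until the stack is back at
 * the mark saved before the move was made. */
void unmake_to(AttackMap *a, int mark)
{
    while (a->top > mark) {
        a->top--;
        a->map[a->undo[a->top].index] = a->undo[a->top].old;
    }
}

/* Copy-make alternative: the child works on its own copy, so the parent's
 * map never has to be restored. */
void copy_map(int *child, const int *parent)
{
    memcpy(child, parent, MAP_SIZE * sizeof(int));
}
```

With the undo stack a node would save `mark = a.top` before updating and call `unmake_to(&a, mark)` afterwards; with copy-make it would call `copy_map()` once and simply discard the child copy. Whether a fixed-size `memcpy` like this is actually compiled to wide vector moves depends on the compiler and target, so that part would need checking in the generated code.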

But I think I am on the wrong track with this. The basic problem is that the attack map is way too big in this design. The information was stored in a quite sparse format, just to save a few instructions during its utilization in move generation. This turns out to be a bad tradeoff. So let me go back to the drawing board, for something completely different.

To make the map as small as possible we will store all attacks (true attacks as well as protections) on a piece in a single int. There are 32 pieces, and each piece gets a fixed bit associated with it. White pieces use the even bits, black pieces use the odd bits, and the bits are assigned in order of increasing piece value (for extraction in LVA order):