I recently tried an approach I had in mind for how neural networks could be used for quiet move ordering:
You have a neural network: Position+Move → 30 neurons → 20 neurons → 1 output. It should output a number that can be used to sort quiet moves.
I did the training by running depth-2 searches on 1 million random positions from the CCRL 4040 database. I added random fluctuations to the current best neural network parameters, and each time the total number of nodes searched decreased, the new parameters were accepted. This could be optimized in quite a few ways, but this simple approach worked insofar as, without the neural model, the total node count on these 1 million positions was about 153 million, while after a few hundred tries the model had reduced this number to 148 million.
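The accept-only-if-node-count-drops loop can be sketched as plain random hill climbing. Here `count_nodes` is a hypothetical stand-in for running the depth-2 searches over the test positions with the given parameters:

```python
import random

def hill_climb(params, count_nodes, iterations=300, sigma=0.01):
    """Random-perturbation hill climbing: mutate all parameters slightly
    and keep the mutation only when the total node count decreases."""
    best = list(params)
    best_nodes = count_nodes(best)
    for _ in range(iterations):
        candidate = [w + random.gauss(0.0, sigma) for w in best]
        nodes = count_nodes(candidate)
        if nodes < best_nodes:  # fewer nodes searched -> accept new params
            best, best_nodes = candidate, nodes
    return best, best_nodes
```

Each iteration needs a full pass over the position set, which is why this gets expensive quickly compared to gradient-based updates.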
Now I tested the engine with the neural model against one without. It turned out to be about -10 ± 7 Elo. Even when I only use the neural network at depth 2 and lower, it is still only 3 ± 4. Additionally, the neural model comes with a speed penalty of halving NPS, so it's a fail.
It could be that the neural model is overfitting to the "only" 1 million positions, or maybe more training is needed to reduce the node count even further. However, both of these could only be addressed with significantly more computing power (I already used 40 threads) and time (it already took a whole day).
Another approach could be to train a neural network with gradient descent on the preferred move of a low-depth search. Gradient descent is probably a lot quicker than trying random additions. However, I am now skeptical how much a simple neural network can even improve on the history heuristic. I think that for move ordering the dominant factor is finding good/bad tactics, as opposed to strategic ideas, for which I believe neural networks are better suited.
Especially considering that even a very simple neural network for only quiet moves brings down NPS significantly (700k-1000k NPS instead of 1300k-1600k NPS).
Failure of trivial approach to neural network move ordering
Moderator: Ras
- Posts: 263
- Joined: Wed Jun 16, 2021 2:08 am
- Location: Berlin
- Full name: Jost Triller
- Posts: 1062
- Joined: Tue Apr 28, 2020 10:03 pm
- Full name: Daniel Infuehr
Re: Failure of trivial approach to neural network move ordering
j.t. wrote: ↑Wed Feb 16, 2022 7:55 pm
Another approach could be to train a neural network with gradient descent on the preferred move of a low-depth search. Gradient descent is probably a lot quicker than trying random additions. However, I am now skeptical how much a simple neural network can even improve on the history heuristic. I think that for move ordering the dominant factor is finding good/bad tactics, as opposed to strategic ideas, for which I believe neural networks are better suited.
Especially considering that even a very simple neural network for only quiet moves brings down NPS significantly (700k-1000k NPS instead of 1300k-1600k NPS).

Random additions are WAY too noisy to get any good results. I think you definitely need gradient descent, as you can get good results with just a few iterations.
I don't know whether you are using 8-bit or float weights; I would recommend float to get started.
Your score function also has to be correct. It must not be the number of times the network guessed correctly, but the Elo of your engine at a fixed thinking time. (Sometimes you train against one proxy metric when in reality you want to improve strength = score!)
Maybe you can post your code, because neural networks are really just multiplying a matrix with a vector, i.e. a GEMM. Then the outputs have to be transformed with ReLU for the next layer, etc.
So really it's just one big chain of matrix-vector multiply + activation function, which is why a BLAS library helps a lot in terms of performance.
Gradient descent is the trick that the gradient of your activation function is defined and the delta to the correct output (the score) is known, so that delta can be propagated backwards through the transposed weight matrices to update all weights in the correct direction at once.
TL;DR: Use BLAS GEMM, share your code. Random trying improves 1 weight per iteration; gradient descent improves 10000 weights per iteration (or however many you have). Maybe you can create a Visio or hand-drawn image of your input-layer bits and how you are using the single output to decide among the up to 128 moves.
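The forward/backward chain described above can be sketched in a few lines of NumPy; the layer sizes, squared-error loss, and learning rate here are illustrative assumptions, not the poster's actual setup:

```python
import numpy as np

def forward(ws, bs, x):
    """Forward pass: a chain of matrix-vector multiplies (GEMM) with
    ReLU between layers and a linear output."""
    acts = [x]
    for i, (W, b) in enumerate(zip(ws, bs)):
        z = W @ acts[-1] + b
        acts.append(np.maximum(z, 0.0) if i < len(ws) - 1 else z)
    return acts

def backprop_step(ws, bs, x, target, lr=0.01):
    """One gradient-descent step on squared error: the output delta is
    pushed backwards through W.T, updating every weight at once."""
    acts = forward(ws, bs, x)
    delta = acts[-1] - target  # d(0.5 * err^2) / d(output)
    for i in reversed(range(len(ws))):
        # Delta for the previous layer, using the pre-update weights
        # and the ReLU gradient mask.
        prev = (ws[i].T @ delta) * (acts[i] > 0) if i > 0 else None
        ws[i] -= lr * np.outer(delta, acts[i])
        bs[i] -= lr * delta
        delta = prev
    return 0.5 * float(np.sum((acts[-1] - target) ** 2))
```

A single call touches every weight in the network, which is the "10000 weights per iteration" point in contrast to the one-accepted-mutation-at-a-time hill climbing.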
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
- Posts: 263
- Joined: Wed Jun 16, 2021 2:08 am
- Location: Berlin
- Full name: Jost Triller
Re: Failure of trivial approach to neural network move ordering
I used float32. All the optimization of the code I would have done at later stages; I probably would've looked at libraries like Arraymancer, which implement very fast matrix multiplications using BLAS. I didn't think the code was worth sharing, but it is now here (optimizer here).
The problem with obtaining a gradient is that you need a target score to minimize against. In the case of move ordering, such a differentiable score simply doesn't exist (or is at least practically impossible to calculate). So when targeting Elo gain directly, I will have to rely on random "explorations".
What I previously assumed was that making move ordering worse (in the sense that objectively good moves are searched late) would automatically increase the node count, but I realized that that's not necessarily true. So you're right that I should've targeted Elo directly.
I think I will need to learn a bit of reinforcement learning again. No need to trip over all the stones that other people have already found solutions for.
EDIT: I am also not 100% sure that a super-optimized matrix multiplication is best for the first layer, because the input is usually very sparse (~35 inputs are set to one, while all the other thousand(s) of inputs are zero).
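For such a sparse binary input, the first-layer product collapses to summing the weight columns of the active features, which can beat a dense GEMM. A small sketch (the layer sizes are made up):

```python
import numpy as np

def first_layer_sparse(W1, b1, active):
    """With only ~35 of thousands of binary inputs set, the dense
    product W1 @ x + b1 reduces to a sum over the active columns."""
    acc = b1.copy()
    for idx in active:
        acc += W1[:, idx]
    return acc
```

This is also the idea behind incrementally updated first layers: when a move flips only a handful of input features, only those columns need to be added or subtracted.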
- Posts: 1062
- Joined: Tue Apr 28, 2020 10:03 pm
- Full name: Daniel Infuehr
Re: Failure of trivial approach to neural network move ordering
Well, what exactly are you using as the input layer that it's so sparse?
It's best to provide as much information as possible to the NN, and the inputs don't need to be binary at all.
Here is something that blew my mind once:
A one-layer network together with these weights https://www.chessprogramming.org/Simpli ... on#Knights
is mathematically identical to the static evaluation function using those weights! That's because the matrix dot product is a multiply-and-sum per element. So if you have a single-layer network and feed in the piece types one by one as a list of [12*64] inputs, a one-layer learning algorithm should converge to weights that look like the ones in the link.
This way you can confirm learning.
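A minimal check of that equivalence: a [12*64] one-hot (piece, square) encoding dotted with a weight vector is exactly a piece-square-table evaluation. The weights below are random stand-ins, not the actual tables from the link:

```python
import numpy as np

N_FEATURES = 12 * 64  # one feature per (piece type, square) pair

def encode(pieces):
    """pieces: list of (piece_index 0..11, square 0..63) tuples."""
    x = np.zeros(N_FEATURES)
    for piece, square in pieces:
        x[piece * 64 + square] = 1.0
    return x

def table_eval(weights, pieces):
    """Classic piece-square-table lookup-and-sum evaluation."""
    return sum(weights[p * 64 + s] for p, s in pieces)

# Stand-in for learned weights; a trained one-layer net should
# converge toward the published piece-square values.
weights = np.random.default_rng(1).normal(size=N_FEATURES)
pieces = [(0, 12), (5, 4), (6, 52), (11, 60)]
# weights @ encode(pieces) equals table_eval(weights, pieces) exactly.
```

So verifying that the learned weight vector resembles known piece-square tables is a cheap sanity check that learning works at all.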
Once this is done you can add more layers, so that the network can learn evaluation by higher-order patterns too (a two-layer network might learn knight forks and maybe even pins). But efficient learning is a must, since you need millions of positions to get ONE score, and that single score should update the network in a highly efficient way (gradient descent).
In the end you should have a network that effectively sees a few moves ahead by itself, giving a really efficient, good move-ordering NN.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
- Posts: 1062
- Joined: Tue Apr 28, 2020 10:03 pm
- Full name: Daniel Infuehr
Re: Failure of trivial approach to neural network move ordering
Any progress on that topic?
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
- Posts: 263
- Joined: Wed Jun 16, 2021 2:08 am
- Location: Berlin
- Full name: Jost Triller
Re: Failure of trivial approach to neural network move ordering
Nope.
I read a little bit about reinforcement learning (I found this quite useful), but much of it is not immediately applicable to this problem: usually the network is the one deciding what to do in an environment, while here the network only assists an algorithm in finding the right decision, in a way that is not entirely transparent to humans because of the different pruning strategies (the best move-ordering model doesn't necessarily need to select the best move first, but maybe the move that most needs a deep search to make the search accurate).
My rough idea right now is that for training I will play a few games, but during move ordering the outputs of the move-eval network are randomly mutated slightly. These mutated outputs and the corresponding inputs are stored in memory. Then, depending on whether the player for whom this move-ordering mutation was done during search won or lost the game, the mutation will be used for backpropagation. I have no clue whether that'll work out from a machine-learning standpoint, and I am not sure it fits within the memory/computing constraints.
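That scheme (perturb the network's outputs during play, then use the game result to turn each recorded perturbation into a backpropagation target) could be sketched roughly as below. The function names and the won/lost sign convention are my assumptions, not a tested design:

```python
import random

def perturb_and_record(scores, memory, sigma=0.05):
    """During search: add small noise to the move-ordering outputs and
    remember the (original scores, noise) pair for after the game."""
    noise = [random.gauss(0.0, sigma) for _ in scores]
    memory.append((list(scores), noise))
    return [s + n for s, n in zip(scores, noise)]

def result_targets(memory, won):
    """After the game: build backpropagation targets that push the
    outputs toward the mutated values if won, away from them if lost."""
    sign = 1.0 if won else -1.0
    return [[s + sign * n for s, n in zip(scores, noise)]
            for scores, noise in memory]
```

This resembles evolution-strategy / REINFORCE-style learning with a single terminal reward, so the credit assignment would be very noisy.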
Before I start coding, I have to wait for the merge of a big pull request in a library that I intend to use (Arraymancer), because I won't implement subpar backpropagation myself again.
Do you have any new ideas?
- Posts: 544
- Joined: Sun Sep 06, 2020 4:40 am
- Full name: Connor McMonigle
Re: Failure of trivial approach to neural network move ordering
This is a topic I find interesting and there's definitely potential to use a policy network as a substitute for conventional move ordering heuristics. Additionally, the actual logits predicted by the policy network could have other uses such as node budgeting through reductions and extensions (https://www.chessprogramming.org/Giraffe). One consideration is that existing combined butterfly/counter/follow/capture history table based move ordering yields shockingly good results in practice such that actually outperforming this approach will be far from trivial. Perhaps just using the superior policy network move ordering on PV nodes (where ordering is most important) could be worthwhile.
In my experimentation, the architecture I've had the most success with takes the form of a fully connected network with a wide, incrementally updated first layer serving as the only hidden layer, followed by a ReLU activation and two output heads for queries and keys respectively (both yielding lower-dimensional vectors of size D). Let Q(S) refer to the query vector predicted by the query head and K(S) refer to the key vector predicted by the key head for an arbitrary state S. To calculate the probability of action A_i from state S, with actions A_1..A_n for S and corresponding children of S, S_1..S_n:
Pr(A_i | S) = exp(Q(S) dot K(S_i)) / (sum over j from 1 to n: exp(Q(S) dot K(S_j)))
Note that this is just a standard softmax over the query-key vector dot products so the denominator is given by the partition function. For the training loss, I've been using negative log-likelihood under the model of the best move as predicted by a fixed depth search. Notably, the incrementally updated first layer can be shared with a neural network used for position evaluation. Additionally, queries and keys can be cached separately provided D is sufficiently small.
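The softmax over query-key dot products can be sketched directly; here `q_parent` stands for Q(S) and the rows of `keys_children` for K(S_1)..K(S_n):

```python
import numpy as np

def move_probabilities(q_parent, keys_children):
    """Pr(A_i | S) = exp(Q(S)·K(S_i)) / sum_j exp(Q(S)·K(S_j)):
    a softmax over one dot product per child position."""
    logits = keys_children @ q_parent  # shape (n,)
    logits -= logits.max()             # numerical stability
    e = np.exp(logits)
    return e / e.sum()
```

Because the keys depend only on the child positions, they can be cached per position and reused across parents, as noted above.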
- Posts: 1062
- Joined: Tue Apr 28, 2020 10:03 pm
- Full name: Daniel Infuehr
Re: Failure of trivial approach to neural network move ordering
Giraffe looks very interesting!
The important part is this: evaluating a position is not the same as move ordering!
One is evaluating a position.
The other is scoring a position in such a way that the probability of it leading to a higher evaluation is greatest.
It's a subtle difference, but it's there.
This also answers the OP's question, since the GitHub link you provided solves this.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer