I am training a NN for chess to replace the evaluation function only, no policy network. I am not interested in the zero approach either, so I am using a set of quiet labeled EPD positions and CCRL games. It might be interesting to train on the 10 million LCZero games later.
Anyway, my question is which approach to follow for the input features: the Giraffe or the AlphaZero method. Giraffe has demonstrated that you can get a pretty good evaluation function with a 2- or 3-layer NN if you give it helpful input features like the number of pieces, attack maps, etc. With the zero approach, I have 8x8 bitmaps with 13 channels (12 for the pieces and 1 for the side to move). I do not look at castling status or move history in my current evaluation, so those are ignored in the NN.

If I wanted to train a convolutional neural network, I am stuck with this naive representation. For instance, I cannot give it the number of pieces as an input, because it would be meaningless when convolved. Attack maps are 8x8, so they could probably go into the convolution pipeline and give meaningful results. So far the 18-layer ResNet I trained on 2 million positions doesn't yet seem to have figured out the value of the pieces (e.g. this idiot gives an 80% winning probability from the start position).
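The 13-channel bitmap input described above can be sketched as a small FEN-to-planes encoder. This is only an illustrative sketch; the channel ordering (white P N B R Q K, then black) is my assumption, not necessarily the layout used here:

```python
import numpy as np

# Channel layout (assumed): white P N B R Q K = 0..5, black p n b r q k = 6..11,
# channel 12 = constant side-to-move plane.
PIECE_TO_CHANNEL = {p: i for i, p in enumerate("PNBRQKpnbrqk")}

def fen_to_planes(fen):
    """Encode the piece placement of a FEN string as a 13x8x8 array."""
    fields = fen.split()
    placement, side_to_move = fields[0], fields[1]
    planes = np.zeros((13, 8, 8), dtype=np.float32)
    for rank_idx, rank in enumerate(placement.split("/")):  # rank 8 first
        file_idx = 0
        for ch in rank:
            if ch.isdigit():
                file_idx += int(ch)  # run of empty squares
            else:
                planes[PIECE_TO_CHANNEL[ch], rank_idx, file_idx] = 1.0
                file_idx += 1
    planes[12, :, :] = 1.0 if side_to_move == "w" else 0.0  # colour plane
    return planes
```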
Please give your best design for a value neural network that would capture major evaluation features quickly, without needing 44 million games. We would also like the neural network to figure out advanced features for itself, so it cannot be as simple as Giraffe's.
Daniel
chess evaluation neural network design
Moderators: hgm, Rebel, chrisw
-
- Posts: 4185
- Joined: Tue Mar 14, 2006 11:34 am
- Location: Ethiopia
-
- Posts: 931
- Joined: Tue Mar 09, 2010 3:46 pm
- Location: New York
- Full name: Álvaro Begué (RuyDos)
Re: chess evaluation neural network design
I think it's very reasonable to use this for input:
* 12 planes indicating where the pieces are
* 12 planes indicating how many pieces of each type attack each square
* 1 plane indicating castling rights (just mark the rooks that can still be involved in castling)
I would start with a CNN with something like 10 layers, using the ResNet skip connections (so something similar to what LCZero calls "5 blocks"). 64 filters will get you started. You then need a "value head" (i.e., something that reduces down to a single number). You can just copy LCZero here. Use one more 3x3 convolution with 32 filters, then interpret the 32*(8x8) as a vector of size 2048, then have a fully connected layer that reduces that to 128, then one final layer with a single output and tanh non-linearity. Or you can end with 3 values and use SoftMax, so you get W/D/L probabilities.
It is possible that trying to predict the next move makes the network easier to train and more resilient to overfitting, because you'll have many more labels for training.
How many actual games do you have in your training DB? How exactly did you generate them? I generated 3M games of SF8-vs-SF8 at very fast time control. Let me know if you want them.
I think using CCRL games is fine, but I would use the Elo difference between the players as an input (you can concatenate it with the vector of 2048 entries, for instance).
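The trunk and value head described above could be sketched in Keras roughly as follows; the input shape (8x8x25: 12 piece planes, 12 attack planes, 1 castling plane), channels-last layout, and the omission of batch normalization are my assumptions for brevity, not LCZero's exact recipe:

```python
from tensorflow.keras import layers, Model

def build_value_net(input_shape=(8, 8, 25), filters=64, blocks=5):
    """Five residual blocks, then the value head: a 3x3 conv with 32
    filters, flattened to a 2048-vector, reduced to 128, then a single
    tanh output."""
    inp = layers.Input(shape=input_shape)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(inp)
    for _ in range(blocks):  # ResNet-style skip connections
        y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        y = layers.Conv2D(filters, 3, padding="same")(y)
        x = layers.Activation("relu")(layers.Add()([x, y]))
    v = layers.Conv2D(32, 3, padding="same", activation="relu")(x)
    v = layers.Flatten()(v)                      # 32 * 8 * 8 = 2048
    v = layers.Dense(128, activation="relu")(v)
    out = layers.Dense(1, activation="tanh")(v)
    return Model(inp, out)
```

For the W/D/L variant, the last layer becomes `Dense(3, activation="softmax")` trained against one-hot game results.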
Daniel Shawul:
I forgot that you could just fill an input plane with a constant value for the number of pieces or any other single-number feature you may want to have. It would still have to do convolutions over those, but I think this is necessary, at least to help it figure out the value of the pieces quickly. I would also add the attack maps to help it figure out king attacks, centralization etc. quickly.

AlvaroBegue wrote:I think it's very reasonable to use this for input:
* 12 planes indicating where the pieces are
* 12 planes indicating how many pieces of each type attack each square
* 1 plane indicating castling rights (just mark the rooks that can still be involved in castling)
I modified the original resnet18 for a single output (sigmoid), average pooling, and 3x3 kernels. For training, I am using a set of labeled positions that merges your file and the Zurichess author's files. That gives me 2 million EPD positions that I already used to tune Scorpio's hand-written eval, and I am using it to train the neural network as well.

AlvaroBegue wrote:I would start with a CNN with something like 10 layers, using the ResNet skip connections (so something similar to what LCZero calls "5 blocks"). 64 filters will get you started. You then need a "value head" (i.e., something that reduces down to a single number). You can just copy LCZero here. Use one more 3x3 convolution with 32 filters, then interpret the 32*(8x8) as a vector of size 2048, then have a fully connected layer that reduces that to 128, then one final layer with a single output and tanh non-linearity. Or you can end with 3 values and use SoftMax, so you get W/D/L probabilities.
Since I do not plan to capture tactics with the NN anyway, quiet positions are fine.
Yes, I think I used an older version of your EPD files that had <1M games?

AlvaroBegue wrote:It is possible that trying to predict the next move makes the network easier to train and more resilient to overfitting, because you'll have many more labels for training.
How many actual games do you have in your training DB? How exactly did you generate them? I generated 3M games of SF8-vs-SF8 at very fast time control. Let me know if you want them.
Ok, noted.

AlvaroBegue wrote:I think using CCRL games is fine, but I would use the Elo difference between the players as an input (you can concatenate it with the vector of 2048 entries, for instance).
AlvaroBegue:
Here's the thread about those 3 million games: http://talkchess.com/forum/viewtopic.php?t=66681
And the direct link to the games: https://drive.google.com/drive/folders/ ... itamyJD5_k
Daniel Shawul:
Ok, thanks.

AlvaroBegue wrote:Here's the thread about those 3 million games: http://talkchess.com/forum/viewtopic.php?t=66681
And the direct link to the games: https://drive.google.com/drive/folders/ ... itamyJD5_k
I added 5 more channels for the difference in the number of queens, rooks, ..., pawns. This already made the 2-layer convnet give very good evaluation numbers. It gives about a 48% winning chance at the start position, and after e4, d5, exd5 it goes up to a 71% winning chance.
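Those difference channels amount to broadcasting each scalar into a constant 8x8 plane. A minimal sketch, assuming the 12 piece planes are ordered white P N B R Q K then black (my assumption, not necessarily the actual layout):

```python
import numpy as np

def material_diff_planes(piece_planes):
    """From 12 one-hot piece planes (assumed order: white P N B R Q K =
    0..5, black = 6..11), build 5 constant 8x8 planes holding the
    white-minus-black count difference for P, N, B, R and Q."""
    diffs = []
    for i in range(5):  # skip kings: always one per side
        diff = piece_planes[i].sum() - piece_planes[i + 6].sum()
        diffs.append(np.full((1, 8, 8), diff, dtype=np.float32))
    return np.concatenate(diffs, axis=0)
```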
I want to add the attack maps, but adding 12 more channels seems costly. So I am thinking of OR-ing the attack maps into the existing piece location bitmaps, or replacing them outright. I don't know whether that would be preferable to having 12 more separate channels, though.
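As one concrete example of what an attack-map channel contains, here is a sketch for knights; the row/column indexing is illustrative, and a real engine would of course generate these from its bitboards:

```python
import numpy as np

# The 8 knight move offsets as (rank, file) deltas.
KNIGHT_OFFSETS = [(-2, -1), (-2, 1), (-1, -2), (-1, 2),
                  (1, -2), (1, 2), (2, -1), (2, 1)]

def knight_attack_map(knight_plane):
    """Given an 8x8 0/1 plane of knight locations, return an 8x8 plane
    counting how many of those knights attack each square."""
    attacks = np.zeros((8, 8), dtype=np.float32)
    for r in range(8):
        for f in range(8):
            if knight_plane[r, f]:
                for dr, df in KNIGHT_OFFSETS:
                    nr, nf = r + dr, f + df
                    if 0 <= nr < 8 and 0 <= nf < 8:
                        attacks[nr, nf] += 1
    return attacks
```

Note that OR-ing such a map into the piece location plane collapses the attack counts to 0/1; a separate channel preserves them.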
AlvaroBegue:
If you are training on quiescent positions only, you need to use a quiescence search. What's the score after Qxd5?

Daniel Shawul wrote:[...] It gives about a 48% winning chance at the start position, and after e4, d5, exd5 it goes up to a 71% winning chance.
Daniel Shawul:
It goes down to 45%. That is OK, because I want to use the NN for evaluation purposes only, and here it is behaving just like a hand-written evaluation function would, without trying to resolve SEE-level tactics.

AlvaroBegue wrote:If you are training on quiescent positions only, you need to use a quiescence search. What's the score after Qxd5?

Also, I don't want it to do that, given the difficulty LCZero is facing with tactics anyway.
It seems Leela's network gives a 54% likelihood of winning for exd5, so it seems to have some tactical understanding (I used easy mode on the online play site). I am assuming that 54% is coming from the eval-head output after exd5.
AlvaroBegue:
I'm not sure, but I would think that "1 node" means that only the root is fed to the NN, so that's probably where the score is coming from. The move played is the one to which the policy head assigns the highest probability.

Daniel Shawul wrote:[...]
It seems leela's network gives 54% likelihood of winning for exd5 so it seems to have some tactical understanding (I used easy mode on the online play site). I am assuming that 54% is coming from the eval-head output after exd5.
Daniel Shawul:
I replaced the piece location bitmaps with attack maps instead, and the evaluation is getting very precise now. I think the attack maps can easily add king safety, centralization and mobility terms. I wonder what the point of multiple convolutions would be if you have attack maps...

AlvaroBegue wrote:I'm not sure, but I would think that "1 node" means that only the root is fed to the NN, so that's probably where the score is coming from. The move played is the one to which the policy head assigns the highest probability.
Here is how a game proceeds with a one-ply search and the 2-layer convnet evaluator:
Code: Select all
from keras.models import Sequential
from keras.layers import Conv2D, AveragePooling2D, Flatten, Dense

class ConvnetBuilder(object):
    @staticmethod
    def build(input_shape):
        model = Sequential()
        # 8x8xC input -> 6x6x32 (valid padding)
        model.add(Conv2D(32, (3, 3),
                         activation='relu',
                         input_shape=input_shape))
        # overlapping average pooling: 6x6 -> 5x5
        model.add(AveragePooling2D(pool_size=(2, 2), strides=(1, 1)))
        model.add(Conv2D(64, (3, 3), activation='relu'))  # 5x5 -> 3x3
        model.add(AveragePooling2D(pool_size=(2, 2)))     # 3x3 -> 1x1
        model.add(Flatten())
        model.add(Dense(256, activation='relu'))
        model.add(Dense(1, activation='sigmoid'))         # winning probability
        return model
Code: Select all
r n b q k b n r
p p p p p p p p
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
P P P P P P P P
R N B Q K B N R
Your move: e2e4
r n b q k b n r
p p p p p p p p
. . . . . . . .
. . . . . . . .
. . . . P . . .
. . . . . . . .
P P P P . P P P
R N B Q K B N R
g8h6 [0.5166185]
g8f6 [0.55204576]
b8c6 [0.5502181]
b8a6 [0.48420006]
h7h6 [0.48405892]
g7g6 [0.51393116]
f7f6 [0.50278157]
e7e6 [0.54343534]
d7d6 [0.5342053]
c7c6 [0.4777125]
b7b6 [0.48292392]
a7a6 [0.47985923]
h7h5 [0.47904932]
g7g5 [0.47555655]
f7f5 [0.4648322]
e7e5 [0.48346817]
d7d5 [0.4681068]
c7c5 [0.4720884]
b7b5 [0.46997124]
a7a5 [0.46296948]
My move: g8f6 Score [55.204575]
r n b q k b . r
p p p p p p p p
. . . . . n . .
. . . . . . . .
. . . . P . . .
. . . . . . . .
P P P P . P P P
R N B Q K B N R
Your move: g1f3
r n b q k b . r
p p p p p p p p
. . . . . n . .
. . . . . . . .
. . . . P . . .
. . . . . N . .
P P P P . P P P
R N B Q K B . R
h8g8 [0.45786917]
b8c6 [0.5779574]
b8a6 [0.47027504]
f6g8 [0.35249573]
f6h5 [0.40187335]
f6d5 [0.4030236]
f6g4 [0.4721533]
f6e4 [0.60681653]
h7h6 [0.48782092]
g7g6 [0.4834252]
e7e6 [0.4923743]
d7d6 [0.51799154]
c7c6 [0.47185832]
b7b6 [0.46224302]
a7a6 [0.46457154]
h7h5 [0.45992076]
g7g5 [0.4688589]
e7e5 [0.4579141]
d7d5 [0.45785898]
c7c5 [0.4763748]
b7b5 [0.4711253]
a7a5 [0.4444967]
My move: f6e4 Score [60.681652]
r n b q k b . r
p p p p p p p p
. . . . . . . .
. . . . . . . .
. . . . n . . .
. . . . . N . .
P P P P . P P P
R N B Q K B . R
The search I am going to use should have a quiescence search; otherwise the NN will misevaluate a lot. I wonder how much the policy + eval NN + multiple layers help it to resolve tactics ...
Daniel
AlvaroBegue:
Is your plan to filter out non-quiescent positions?
Ideally, you would train by picking a random position, running QS with the current NN and then use the leaf from that search. This is expensive, but it would optimize something very reasonable: the quality of the prediction of the result of the game given by QS.
Filtering out non-quiescent positions is a cheap approximation, and it's possible that it's perfectly fine. After all, what's important in the evaluation (e.g., is this passed pawn enough advantage to win the game? Is this king-side attack likely to win?) can be learned from looking at quiescent positions only. The search will handle the messy situations.
My intuitions about this have evolved over the years, mostly unencumbered by complicated things like "evidence".