
neural network architecture

Posted: Wed Dec 26, 2018 11:17 pm
by jackd
Giraffe, the author of this paper http://www.ai.rug.nl/~mwiering/Thesis_M ... atelli.pdf, and the authors of this paper https://www.cs.tau.ac.il/~wolf/papers/deepchess.pdf all had success using normal, fully connected neural networks to label positions. However, AlphaZero, LeelaZero and Scorpio all use residual and convolutional layers, and they seem to be used commonly in other domains. Are convolutional and residual connections a necessity?

Re: neural network architecture

Posted: Thu Dec 27, 2018 7:41 am
by matthewlai
Possibly not. We used convolutional networks because they are more generalisable to other board games. The bigger the board, the more useful convolutions are, and chess has a relatively small board.

Re: neural network architecture

Posted: Thu Dec 27, 2018 10:31 am
by Daniel Shawul
One thing that helps me a lot in reducing the number of training games is to 'shortcut' the deep ResNet by adding a layer of inputs (e.g. piece counts) to the dense network of the policy head, and also by using attack tables (which should in theory reduce the number of convolution layers required).

However, I noticed that with the 'shortcut' the NN becomes materialistic, unlike Leela's. Without the 'shortcut' layer it takes really long to learn piece values, so I have never managed to fully train a pure ResNet yet.
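
Roughly what I mean by the 'shortcut', as a PyTorch-style sketch (the layer sizes, names and extra inputs here are made up for illustration, not my actual code):

Code: Select all

import torch
import torch.nn as nn

class PolicyHeadWithShortcut(nn.Module):
    """Dense policy head that also sees a few hand-crafted inputs
    (piece counts, etc.) next to the flattened convolution features."""

    def __init__(self, conv_channels=32, squares=64, n_extra=12, n_moves=4096):
        super().__init__()
        self.fc = nn.Linear(conv_channels * squares + n_extra, n_moves)

    def forward(self, conv_features, extra):
        # conv_features: (batch, conv_channels, 8, 8) from the ResNet tower
        # extra:         (batch, n_extra) hand-crafted inputs, e.g. piece counts
        flat = conv_features.flatten(start_dim=1)
        x = torch.cat([flat, extra], dim=1)   # the 'shortcut' past the tower
        return self.fc(x)                     # move logits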

@Matthew Lai, how helpful is the policy network in chess?

Re: neural network architecture

Posted: Fri Dec 28, 2018 2:49 pm
by chrisw
jackd wrote: Wed Dec 26, 2018 11:17 pm Giraffe, the author of this paper http://www.ai.rug.nl/~mwiering/Thesis_M ... atelli.pdf, and the authors of this paper https://www.cs.tau.ac.il/~wolf/papers/deepchess.pdf all had success using normal, fully connected neural networks to label positions. However, AlphaZero, LeelaZero and Scorpio all use residual and convolutional layers, and they seem to be used commonly in other domains. Are convolutional and residual connections a necessity?
It may be that the convolution scanning, being somewhat regional, detects local situations well and is thus responsible for the unbalanced play. There are many examples of AZ taking advantage of underdevelopment or cramping on one side of the board while sacrificing on the other, or liking centre pawn attack fortresses, or an attacking pawn on f6 or h6, and so on. All localized, powerful patterns.

Re: neural network architecture

Posted: Fri Dec 28, 2018 5:27 pm
by jackd
I recently had Stockfish label 10 million positions. Stockfish was told "go depth 10" and I used the depth-10 centipawn score as the value of the position. I wanted to keep the network input simple and didn't include castling rights or the side to move as inputs, so the castling rights were invalidated before the position was sent to Stockfish, and I multiplied the labels for black positions by -1. After trying two different network sizes, the eval is decent but far too slow to make my program strong (it currently uses SEE move ordering, PVS, TT, null move and LMR). Some questions:

-What are some other ways to label positions? How do you label the positions using the result of the game? I am familiar with TD-leaf but that isn't used in any top programs right now.

-Can a residual block be summarized as follows: given a neural network with hidden layers a, b and c, connected via matrix multiplication, during the feed-forward step, after layers a and c are calculated normally, add a to c. During back-propagation, calculate dError/dLayerC (the derivative with respect to layer c after its nonlinearity) as you would in a normal fully connected net, then set dError/dLayerA equal to it and calculate the gradients for the weights Input-A and the weights B-C. Then go back to standard back-prop, calculate dError/dLayerA the normal way, and add it to the gradient at Input-A.

-Can a convolutional layer be summarized as follows: given hidden layers a and b, where a and b are of size 128 and their first 64 numbers represent the white pieces and their second 64 the black ones, set each half of b equal to a 3*3 kernel applied over its associated half of a. Then during back-prop, after you have dError/dNetLayerB (the derivative with respect to layer b before applying the non-linearity), set the gradient of the kernel equal to the average of the kernel's gradient at each place the kernel was used. (A rough sketch of what I mean follows after these questions.)

-What are some resources for learning how to program with cuDNN or OpenCL, and for setting up a GPU?
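
In code terms, this is what I am picturing for one 3*3 kernel over one 8*8 plane (a minimal numpy sketch, single input plane and single output plane, zero padding):

Code: Select all

import numpy as np

def conv3x3_same(plane, kernel):
    # Slide one 3x3 kernel over an 8x8 plane with zero padding,
    # reusing the same 9 weights at every square.
    padded = np.pad(plane, 1)            # 10x10, zero border
    out = np.zeros((8, 8))
    for r in range(8):
        for c in range(8):
            out[r, c] = np.sum(padded[r:r + 3, c:c + 3] * kernel)
    return out

# e.g. a "white pawns" plane and a random kernel
pawns = np.zeros((8, 8))
pawns[6, :] = 1.0                        # pawns on their starting rank
out = conv3x3_same(pawns, np.random.randn(3, 3))

A full convolutional layer, as I understand it, just has one such kernel per (input plane, output plane) pair and sums the results over the input planes, and the kernel's gradient is accumulated over every square where it was applied (whether you sum or average there only rescales the learning rate). Please correct me if that is off.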

Re: neural network architecture

Posted: Sat Dec 29, 2018 12:02 pm
by matthewlai
Daniel Shawul wrote: Thu Dec 27, 2018 10:31 am @Matthew Lai, how helpful is the policy network in chess?
Extremely important. The vast majority of nodes in a typical tree have only 1 or 2 child nodes that have been visited, and which 1 or 2 is (almost) entirely decided by the policy network (the policy head of the network).
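
To give an idea of why: with AlphaZero-style PUCT selection, a child that has never been visited is scored almost entirely by its prior from the policy head. A simplified sketch of the selection score (not our exact code, and the constant is made up):

Code: Select all

import math

def puct_score(q, prior, parent_visits, child_visits, c_puct=1.5):
    # AlphaZero-style selection: value estimate plus a prior-weighted
    # exploration bonus that shrinks as the child gets visited.
    u = c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)
    return q + u

# At a node whose children are all unvisited (q = 0, 0 visits),
# the ranking is decided purely by the policy priors.
priors = {"e2e4": 0.45, "d2d4": 0.40, "a2a3": 0.001}
best = max(priors, key=lambda m: puct_score(0.0, priors[m], 1, 0))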

Re: neural network architecture

Posted: Sat Dec 29, 2018 12:13 pm
by Henk
There are too many different moves, so implementing a policy network is impossible. Unless you own a supercomputer.

Re: neural network architecture

Posted: Sat Dec 29, 2018 4:58 pm
by jackd
This question, and my previous one, were better suited for Stack Overflow, but I am going to post again here to clear up my last post.

Consider the following pseudocode for standard backprop

Code: Select all


layer1 = input

netlayer2 = layer1 * matrix1
layer2 = act( netlayer2 )

netlayer3 = layer2 * matrix2
layer3 = act( netlayer3 )

netlayer4 = layer3 * matrix3
layer4 = act( netlayer4 )

netlayer5 = layer4 * matrix4
layer5 = act( netlayer5)

calculate derror/dlayer5
derror/dnetlayer5 = dlayer5/dnetlayer5 * derror/dlayer5

derror/dlayer4 = matrix4.backprop( derror/dnetlayer5 )
derror/dnetlayer4 = dlayer4/dnetlayer4 * derror/dlayer4

derror/dlayer3 = matrix3.backprop( derror/dnetlayer4 )
derror/dnetlayer3 = dlayer3/dnetlayer3 * derror/dlayer3

derror/dlayer2 = matrix2.backprop( derror/dnetlayer3 )
derror/dnetlayer2 = dlayer2/dnetlayer2 * derror/dlayer2

for j in 1..4

	matrixj.update(layerj, derror/dnetlayer(j+1) )

I am saying that to make a ResNet we change two things:

Code: Select all


layer4 = act( netlayer4 )

to

Code: Select all


layer4 = act( netlayer4 ) + layer2

and also

Code: Select all


derror/dlayer2 = matrix2.backprop( derror/dnetlayer3 )

to

Code: Select all


derror/dlayer2 = matrix2.backprop( derror/dnetlayer3 ) + matrix4.backprop( derror/dnetlayer5 )

Does this look right?
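
For what it's worth, this is the kind of sanity check I have in mind for the extra term (a tiny numpy sketch with made-up sizes and tanh just as an example activation; the two printed numbers should agree to several decimals):

Code: Select all

import numpy as np

rng = np.random.default_rng(0)
act = np.tanh

def dact(z):
    return 1.0 - np.tanh(z) ** 2          # derivative of tanh

n = 5                                      # every layer the same width so the skip fits
W = [rng.standard_normal((n, n)) * 0.5 for _ in range(4)]   # matrix1..matrix4
x = rng.standard_normal(n)
target = rng.standard_normal(n)

def forward(W):
    net2 = x @ W[0];  l2 = act(net2)
    net3 = l2 @ W[1]; l3 = act(net3)
    net4 = l3 @ W[2]; l4 = act(net4) + l2          # the skip: add layer2 into layer4
    net5 = l4 @ W[3]; l5 = act(net5)
    err = 0.5 * np.sum((l5 - target) ** 2)
    return err, (net2, l2, net3, l3, net4, l4, net5, l5)

err, (net2, l2, net3, l3, net4, l4, net5, l5) = forward(W)

# backward pass as described above
dl5 = l5 - target
dnet5 = dl5 * dact(net5)
dl4 = dnet5 @ W[3].T
dnet4 = dl4 * dact(net4)
dl3 = dnet4 @ W[2].T
dnet3 = dl3 * dact(net3)
dl2 = dnet3 @ W[1].T + dl4        # normal path + the skip's extra term
dnet2 = dl2 * dact(net2)
dW1 = np.outer(x, dnet2)          # gradient for matrix1, where the skip term ends up

# finite-difference check of one entry of dW1
eps = 1e-6
Wp = [w.copy() for w in W]; Wp[0][0, 0] += eps
Wm = [w.copy() for w in W]; Wm[0][0, 0] -= eps
numeric = (forward(Wp)[0] - forward(Wm)[0]) / (2 * eps)
print(dW1[0, 0], numeric)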

jack