neural network architecture
neural network architecture
The author of Giraffe, the author of this paper http://www.ai.rug.nl/~mwiering/Thesis_M ... atelli.pdf and the authors of this paper https://www.cs.tau.ac.il/~wolf/papers/deepchess.pdf all had success using a plain, fully connected neural network to label positions. However, AlphaZero, LeelaZero and Scorpio all use residual and convolutional layers, which also seem to be common in other domains. Are convolutional layers and residual connections a necessity?

Re: neural network architecture
Possibly not. We used convolutional networks because they are more generalisable to other board games. The bigger the board, the more useful convolutions are, and chess has a relatively small board.
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.

Re: neural network architecture
One thing that helps me a lot in reducing the number of training games is to 'shortcut' the deep ResNet by adding a layer of inputs (e.g. piece counts) to the dense network of the policy head, and also by using attack tables, which should in theory reduce the number of convolution layers required.
However, I noticed that with the 'shortcut' the NN becomes unlike Leela and turns materialistic. Without the 'shortcut' layer, it takes a really long time to learn piece values, so I have never managed to fully train a pure ResNet yet.
@Matthew Lai, how helpful is the policy network in chess?
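As a rough illustration of the 'shortcut' idea described above, the sketch below appends hand-crafted piece counts to the flattened output of the residual tower before it reaches a dense head. The function names (`piece_counts`, `head_input`) and the use of the FEN board field are assumptions made for this example, not Scorpio's actual code.

```python
# Hypothetical sketch: feed piece counts straight into the dense head,
# bypassing the convolutional tower.
PIECES = "PNBRQKpnbrqk"  # white pieces first, then black

def piece_counts(fen_board):
    """Count each piece type in the board field of a FEN string."""
    return [float(fen_board.count(p)) for p in PIECES]

def head_input(tower_features, fen_board):
    """Concatenate the tower output with the shortcut features."""
    return list(tower_features) + piece_counts(fen_board)

start = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR"
x = head_input([0.0] * 4, start)  # 4 stands in for the real tower output size
print(len(x))  # 16
```

Because the head sees material directly, it can learn piece values without waiting for the tower to discover them, which fits the observation that the net then leans materialistic.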
Re: neural network architecture
jackd wrote: Wed Dec 26, 2018 10:17 pm
The author of Giraffe, the author of this paper http://www.ai.rug.nl/~mwiering/Thesis_M ... atelli.pdf and the authors of this paper https://www.cs.tau.ac.il/~wolf/papers/deepchess.pdf all had success using a plain, fully connected neural network to label positions. However, AlphaZero, LeelaZero and Scorpio all use residual and convolutional layers, which also seem to be common in other domains. Are convolutional layers and residual connections a necessity?

It may be that the convolutions, which scan somewhat regionally, detect local situations well and are thus responsible for imbalance. There are many examples of AZ taking advantage of underdevelopment or cramping on one side of the board and sacrificing on the other, or liking centre pawn attack fortresses, or an attacking pawn on f6 or h6, and so on. All powerful localized patterns.
Re: neural network architecture
I recently had Stockfish label 10 million positions. Stockfish was told "go depth 10" and I then used the depth 10 centipawn score as the value of the position. I wanted to keep the network input simple and didn't include castling rights or the side to move as input. So the castling rights were invalidated before the position was sent to Stockfish, and I multiplied the labels for black positions by -1. After trying two different network sizes, the eval is decent but not nearly fast enough to make my program, which currently uses SEE move ordering, PVS, TT, null move and LMR, strong. Some questions:
What are some other ways to label positions? How do you label the positions using the result of the game? I am familiar with TDLeaf but that isn't used in any top programs right now.
Can a residual block be summarized as follows? Given a neural network with hidden layers a, b and c, connected via matrix multiplication: during the feedforward step, after layers a and c are calculated normally, add a to c. During backpropagation, calculate dError/dLayerC (the derivative with respect to layer c after the nonlinearity) as you would in a normal fully connected net, then set dError/dLayerA equal to it and calculate the gradients for weights InputA and weights BC. Then go back to standard backprop, calculate dError/dLayerA, and add it to the gradient at InputA.
Can a convolutional layer be summarized as follows? Given hidden layers a and b, each of size 128, where the first 64 numbers represent the white pieces and the second 64 the black ones: set each half of b equal to a 3*3 kernel applied to its associated half of a. Then during backprop, once you have dError/dNetLayerB (the derivative with respect to layer b before the nonlinearity), set the gradient of the kernel equal to the average of the kernel gradients from each place the kernel was used.
What are some resources for learning how to program in cuDNN or OpenCL, and for setting up a GPU?
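On the convolution question, one way to check the backprop rule is a small gradient test. In the usual formulation the gradient of a shared kernel is the sum, not the average, of the contributions from every square where the kernel was applied (averaging only rescales it by a constant the learning rate could absorb). The sketch below is a single-plane toy under stated assumptions: zero padding, one input plane, one output plane, and a made-up squared-output loss. In a typical convolutional layer each kernel also spans all input planes, rather than treating the white and black halves separately.

```python
import random

N = 8  # 8x8 board

def conv3x3(plane, kernel):
    """Slide one shared 3x3 kernel over an 8x8 plane with zero padding."""
    out = [[0.0] * N for _ in range(N)]
    for r in range(N):
        for c in range(N):
            for dr in range(3):
                for dc in range(3):
                    rr, cc = r + dr - 1, c + dc - 1
                    if 0 <= rr < N and 0 <= cc < N:
                        out[r][c] += kernel[dr][dc] * plane[rr][cc]
    return out

def kernel_grad(plane, d_out):
    """dLoss/dKernel: summed over all 64 placements of the shared kernel."""
    g = [[0.0] * 3 for _ in range(3)]
    for r in range(N):
        for c in range(N):
            for dr in range(3):
                for dc in range(3):
                    rr, cc = r + dr - 1, c + dc - 1
                    if 0 <= rr < N and 0 <= cc < N:
                        g[dr][dc] += d_out[r][c] * plane[rr][cc]
    return g

def loss(plane, kernel):
    """Toy loss (an assumption for the test): half the sum of squared outputs."""
    return 0.5 * sum(v * v for row in conv3x3(plane, kernel) for v in row)

random.seed(0)
plane = [[random.uniform(-1, 1) for _ in range(N)] for _ in range(N)]
kernel = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(3)]

d_out = conv3x3(plane, kernel)  # for the toy loss, dLoss/dOut equals the output
analytic = kernel_grad(plane, d_out)

# Central finite difference on the centre weight of the kernel.
eps = 1e-6
kernel[1][1] += eps
up = loss(plane, kernel)
kernel[1][1] -= 2 * eps
down = loss(plane, kernel)
kernel[1][1] += eps
numeric = (up - down) / (2 * eps)
print(abs(analytic[1][1] - numeric) < 1e-5)
```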

Re: neural network architecture
Daniel Shawul wrote: Thu Dec 27, 2018 9:31 am
@Matthew Lai, how helpful is the policy network in chess?

Extremely important. The vast majority of nodes in a typical tree have only 1 or 2 children that have been visited. Which 1 or 2 is (almost) entirely decided by the policy network (the policy head of the network).
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
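To make the point about visited children concrete, here is a minimal PUCT-style selection sketch. It is a simplified variant written for illustration, not AlphaZero's actual code: the constant `c_puct=1.5`, the `+ 1` under the square root (so the very first selection is not a tie at zero) and the priors are all assumptions.

```python
import math

def puct_select(priors, visits, values, c_puct=1.5):
    """Return the index of the child maximising Q + U.

    priors: policy-head probabilities per child
    visits: visit counts per child
    values: summed backed-up values per child
    """
    total = sum(visits)
    best, best_score = None, -math.inf
    for i, p in enumerate(priors):
        q = values[i] / visits[i] if visits[i] else 0.0
        u = c_puct * p * math.sqrt(total + 1) / (1 + visits[i])
        if q + u > best_score:
            best, best_score = i, q + u
    return best

priors = [0.70, 0.20, 0.05, 0.05]  # made-up policy output for four moves
visits = [0, 0, 0, 0]
values = [0.0, 0.0, 0.0, 0.0]
for _ in range(4):
    i = puct_select(priors, visits, values)
    visits[i] += 1  # pretend every evaluation came back neutral (value 0)
print(visits)  # [3, 1, 0, 0]
```

With neutral values the search never touches the two low-prior moves in the first four simulations, so at a node with only a handful of visits the priors effectively choose which children exist at all.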
Re: neural network architecture
There are too many different moves, so implementing a policy network is impossible unless you own a supercomputer.
Re: neural network architecture
This question, and my previous one, were better suited to Stack Overflow, but I am going to post again here to clear up my last post.
Consider the following pseudocode for standard backprop:
Code:
    layer1 = input
    netlayer2 = layer1 * matrix1
    layer2 = act( netlayer2 )
    netlayer3 = layer2 * matrix2
    layer3 = act( netlayer3 )
    netlayer4 = layer3 * matrix3
    layer4 = act( netlayer4 )
    netlayer5 = layer4 * matrix4
    layer5 = act( netlayer5 )
    calculate derror/dlayer5
    derror/dnetlayer5 = dlayer5/dnetlayer5 * derror/dlayer5
    derror/dlayer4 = matrix4.backprop( derror/dnetlayer5 )
    derror/dnetlayer4 = dlayer4/dnetlayer4 * derror/dlayer4
    derror/dlayer3 = matrix3.backprop( derror/dnetlayer4 )
    derror/dnetlayer3 = dlayer3/dnetlayer3 * derror/dlayer3
    derror/dlayer2 = matrix2.backprop( derror/dnetlayer3 )
    derror/dnetlayer2 = dlayer2/dnetlayer2 * derror/dlayer2
    for j in 1..4
        matrixj.update( layerj, derror/dnetlayer(j+1) )
I am saying that to make a resnet we change two things:
Code:
    layer4 = act( netlayer4 )
to
Code:
    layer4 = act( netlayer4 ) + layer2
and also
Code:
    derror/dlayer2 = matrix2.backprop( derror/dnetlayer3 )
to
Code:
    derror/dlayer2 = matrix2.backprop( derror/dnetlayer3 ) + matrix4.backprop( derror/dnetlayer5 )
Does this look right?
jack
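One way to test the two proposed changes is a numerical gradient check. The sketch below is a toy version of the five-layer net above, under assumptions not in the post: tanh as `act`, a squared-error loss, tiny layer sizes, and weights stored so that a layer is `matrix @ vector` rather than `layer * matrix`. It applies exactly the two changes and compares the resulting gradient for one weight of matrix1 against a finite difference. Two details matter: layer2 and layer4 must have the same size, and the activation derivative at layer 4 has to be taken from `act( netlayer4 )` alone, before layer2 is added.

```python
import math
import random

def matvec(m, v):
    """Forward step: m @ v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def backprop(m, g):
    """Backward step through a weight matrix: m^T @ g."""
    return [sum(m[i][j] * g[i] for i in range(len(m))) for j in range(len(m[0]))]

def act(v):
    return [math.tanh(x) for x in v]

def dact(a):
    """Derivative of tanh, expressed through its output a = tanh(x)."""
    return [1.0 - y * y for y in a]

def mul(a, b):
    return [x * y for x, y in zip(a, b)]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def forward(mats, x):
    m1, m2, m3, m4 = mats
    l2 = act(matvec(m1, x))
    l3 = act(matvec(m2, l2))
    a4 = act(matvec(m3, l3))
    l4 = add(a4, l2)                  # change 1: layer4 = act(netlayer4) + layer2
    l5 = act(matvec(m4, l4))
    return l2, l3, a4, l4, l5

def loss(mats, x, target):
    l5 = forward(mats, x)[-1]
    return 0.5 * sum((o - t) ** 2 for o, t in zip(l5, target))

def grad_matrix1(mats, x, target):
    m1, m2, m3, m4 = mats
    l2, l3, a4, l4, l5 = forward(mats, x)
    dnet5 = mul([o - t for o, t in zip(l5, target)], dact(l5))
    dl4 = backprop(m4, dnet5)
    dnet4 = mul(dl4, dact(a4))        # act' uses act(netlayer4), not layer4
    dnet3 = mul(backprop(m3, dnet4), dact(l3))
    dl2 = add(backprop(m2, dnet3), dl4)  # change 2: add matrix4.backprop(dnet5)
    dnet2 = mul(dl2, dact(l2))
    return [[g * xi for xi in x] for g in dnet2]

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(1)
mats = [rand_matrix(4, 3), rand_matrix(4, 4), rand_matrix(4, 4), rand_matrix(2, 4)]
x, target = [0.5, -0.3, 0.8], [0.2, -0.1]

analytic = grad_matrix1(mats, x, target)[0][0]

# Central finite difference on the same weight.
eps = 1e-5
mats[0][0][0] += eps
up = loss(mats, x, target)
mats[0][0][0] -= 2 * eps
down = loss(mats, x, target)
mats[0][0][0] += eps
numeric = (up - down) / (2 * eps)
print(abs(analytic - numeric) < 1e-6)
```

If the check prints True, the pair of changes is consistent for this toy setup; dropping the `dl4` term from `dl2` should make the two gradients disagree.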