A0 policy head ambiguity

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

A0 policy head ambiguity

Post by Daniel Shawul »

There is the following in the AlphaZero paper:

The neural network consists of a “body” followed by both policy and value “heads”. The body consists of a rectified batch-normalized convolutional layer followed by 19 residual blocks (48). Each such block consists of two rectified batch-normalized convolutional layers with a skip connection. Each convolution applies 256 filters of kernel size 3 × 3 with stride 1. The policy head applies an additional rectified, batch-normalized convolutional layer, followed by a final convolution of 73 filters for chess or 139 filters for shogi, or a linear layer of size 362 for Go, representing the logits of the respective policies described above. The value head applies an additional rectified, batch-normalized convolution of 1 filter of kernel size 1 × 1 with stride 1, followed by a rectified linear layer of size 256 and a tanh-linear layer of size 1.
My confusion is with the description of the policy head.

Is there an additional convolutional layer (3x3 of 256 filters) inside the policy head? AlphaGo Zero does a 1x1 convolution of 2 filters followed by a dense layer of 362 outputs, while Lc0 does a 1x1 convolution of 32 filters followed by a dense layer of 1858 outputs.

A0 policy head: 3x3 of 256 filters => 1x1 of 73 filters => 8x8x73 outputs
Lc0 policy head: 1x1 of 32 filters => dense layer of 1858 outputs

Did I get this right ?

It also makes sense to me that we do more convolutions in the policy head than in the value head.
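
In Keras terms, the two readings would look roughly like this (the 1x1 kernel for A0's final convolution is my assumption; the paper only says "a final convolution of 73 filters"):

Code: Select all

from tensorflow.keras import layers

# Body output of the residual tower: an 8x8x256 tensor.
body = layers.Input(shape=(8, 8, 256))

# A0-style head (as I read it): 3x3 conv of 256 filters + BN + ReLU,
# then a 1x1 conv of 73 filters taken directly as the 8x8x73 move logits.
a = layers.Conv2D(256, 3, padding='same')(body)
a = layers.Activation('relu')(layers.BatchNormalization()(a))
a0_logits = layers.Conv2D(73, 1, padding='same')(a)

# Lc0-style head: 1x1 conv of 32 filters + BN + ReLU,
# flattened and fed into a dense layer of 1858 move logits.
b = layers.Conv2D(32, 1, padding='same')(body)
b = layers.Activation('relu')(layers.BatchNormalization()(b))
lc0_logits = layers.Dense(1858)(layers.Flatten()(b))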
brianr
Posts: 536
Joined: Thu Mar 09, 2006 3:01 pm

Re: A0 policy head ambiguity

Post by brianr »

AZ and Lc0 have different "all possible moves" lists and lengths, although I have not looked at the details in a while.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: A0 policy head ambiguity

Post by Daniel Shawul »

brianr wrote: Mon Jan 21, 2019 11:41 am AZ and Lc0 have different "all possible moves" lists and lengths, although I have not looked at the details in a while.
Yes, Lc0 uses what they call a flat representation: after the 1x1 convolution of 32 filters, the 8x8x32 output is fed to a dense layer with 1858 outputs.
The fully connected layer alone has 1858*8*8*32 ≈ 3.8M weights, which adds significantly to the network size.

A0's policy head does not have a fully connected layer; the final 1x1 convolution with 73 filters needs only about 256*73 ≈ 19k weights.
Maybe they added the extra 3x3 convolution of 256 filters in the policy head to compensate for the absence of a fully connected layer (just speculating).
Aren't ConvNets and ResNets supposed to have a fully connected layer anyway?
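
A quick back-of-the-envelope count for a 256-filter body (biases ignored) shows where the size difference comes from:

Code: Select all

# Rough weight counts for the two policy heads on a 256-filter body.
body_filters = 256

# A0 style: 3x3 conv of 256 filters, then 1x1 conv of 73 filters.
a0 = 3 * 3 * body_filters * 256 + 1 * 1 * 256 * 73      # 608,512 weights (~2.3 MB fp32)

# Lc0 style: 1x1 conv of 32 filters, then dense 8*8*32 -> 1858.
lc0 = 1 * 1 * body_filters * 32 + 8 * 8 * 32 * 1858     # 3,813,376 weights (~14.5 MB fp32)

print(a0, lc0)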
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: A0 policy head ambiguity

Post by Daniel Shawul »

I implemented both methods and tested them on 2x32 and 6x64 nets. There doesn't seem to be much difference.

A0's policy head (as I understood it):
64 filters 3x3 convolution
73 filters 1x1 convolution (take the output directly)


Lc0's policy head:
73 filters 1x1 convolution
1858 dense layer

The dense layer added about 30MB to the network size compared to the A0 style.
It seemed like the A0-style policy head learns faster initially, but after 50 iterations it had only marginally better accuracy.
So the Lc0 method is at the very least not significantly worse, aside from the increased network size due to the dense layer.
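
The metric names in the logs below are just what a two-output Keras model prints during training (exact names vary a bit by Keras version). Purely as illustration, a skeleton along these lines produces that kind of output; the input planes, losses and optimizer here are placeholders, not my actual training script:

Code: Select all

from tensorflow.keras import layers, models

# Minimal two-headed skeleton: Keras reports the combined loss plus a per-output
# loss and accuracy for each head, which is where the value_loss / policy_loss /
# value_acc / policy_acc columns in the logs below come from.
inp = layers.Input(shape=(8, 8, 24))               # number of input planes assumed
x = layers.Conv2D(64, 3, padding='same', activation='relu')(inp)

value = layers.Dense(1, activation='tanh', name='value')(layers.Flatten()(x))
policy = layers.Dense(4672, activation='softmax', name='policy')(layers.Flatten()(x))  # 8x8x73 move slots

model = models.Model(inp, [value, policy])
model.compile(optimizer='adam',
              loss={'value': 'mse', 'policy': 'categorical_crossentropy'},
              metrics=['accuracy'])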

A0 style policy head

Code: Select all

Training on chunk  50  ending at position  4096000  with lr  0.001
Generating input planes using 32 cores
Time 10 sec
Fitting model 0
Epoch 1/1
81920/81920 [==============================] - 2s 30us/step - loss: 3.6678 - value_loss: 0.8111 - policy_loss: 2.8261 - value_acc: 0.6155 - policy_acc: 0.2299
Fitting model 1
Epoch 1/1
81920/81920 [==============================] - 3s 37us/step - loss: 3.5377 - value_loss: 0.7677 - policy_loss: 2.6468 - value_acc: 0.6399 - policy_acc: 0.2758
L0 style policy head

Code: Select all

Training on chunk  50  ending at position  4096000  with lr  0.001
Generating input planes using 32 cores
Time 10 sec
Fitting model 0
Epoch 1/1
81920/81920 [==============================] - 2s 22us/step - loss: 3.4798 - value_loss: 0.8087 - policy_loss: 2.6408 - value_acc: 0.6177 - policy_acc: 0.2487
Fitting model 1
Epoch 1/1
81920/81920 [==============================] - 2s 28us/step - loss: 3.3840 - value_loss: 0.7644 - policy_loss: 2.5067 - value_acc: 0.6444 - policy_acc: 0.2755
Daniel
chrisw
Posts: 4315
Joined: Tue Apr 03, 2012 4:28 pm

Re: A0 policy head ambiguity

Post by chrisw »

Daniel Shawul wrote: Mon Jan 21, 2019 7:38 pm I implemented both methods and tested them on 2x32 and 6x64 nets. There doesn't seem to be much difference.

A0's policy head (as I understood it):
64 filters 3x3 convolution
73 filters 1x1 convolution (take the output directly)


Lc0's policy head:
73 filters 1x1 convolution
1858 dense layer

The dense layer added about 30MB to the network size compared to the A0 style.
It seemed like the A0-style policy head learns faster initially, but after 50 iterations it had only marginally better accuracy.
So the Lc0 method is at the very least not significantly worse, aside from the increased network size due to the dense layer.

I'm finding much the same with Othello as the underlying game at the moment. Messing around with differing net architectures doesn't make a great difference.

I also find the loss metrics not very useful (other than to show the NN is actually working). Nothing beats game testing. For quick analysis without running lots of games, a scatter diagram of net value output against the desired target gives good information. Mine churns one out every 250,000 learning positions; it's a useful monitor of progress.
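
Roughly this, in matplotlib terms (the names are purely illustrative):

Code: Select all

import matplotlib.pyplot as plt

def value_scatter(predicted, target, path="value_scatter.png"):
    """Scatter of net value output against the desired target for a batch of positions."""
    plt.figure(figsize=(5, 5))
    plt.scatter(target, predicted, s=2, alpha=0.3)
    plt.plot([-1, 1], [-1, 1], "r--")       # perfect-prediction diagonal
    plt.xlabel("target value")
    plt.ylabel("net value output")
    plt.savefig(path)
    plt.close()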