My confusion is with the description of the policy head. The paper says:

The neural network consists of a “body” followed by both policy and value “heads”. The
body consists of a rectified batch-normalized convolutional layer followed by 19 residual blocks (48).
Each such block consists of two rectified batch-normalized convolutional layers with a skip connection.
Each convolution applies 256 filters of kernel size 3 × 3 with stride 1. The policy head
applies an additional rectified, batch-normalized convolutional layer, followed by a final convolution
of 73 filters for chess or 139 filters for shogi, or a linear layer of size 362 for Go,
representing the logits of the respective policies described above. The value head applies an
additional rectified, batch-normalized convolution of 1 filter of kernel size 1 × 1 with stride 1,
followed by a rectified linear layer of size 256 and a tanh-linear layer of size 1.
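The quoted description can be sanity-checked with a minimal shape-tracing sketch. This is not the paper's code: it only tracks (height, width, channels) through each layer. The 119 input planes for chess are taken from the AlphaZero paper, and the kernel size of the final 73-filter policy convolution is my assumption (the quoted text does not state it; with "same" padding, 3x3 and 1x1 both preserve the 8x8 board).

```python
def conv_shape(h, w, _c, filters, k, stride=1):
    """Output shape of a 'same'-padded convolution (odd kernel size)."""
    pad = k // 2
    return ((h + 2 * pad - k) // stride + 1,
            (w + 2 * pad - k) // stride + 1,
            filters)

shape = (8, 8, 119)                           # chess input planes (per the paper)
shape = conv_shape(*shape, filters=256, k=3)  # body: initial conv layer
for _ in range(19):                           # 19 residual blocks, shape-preserving
    shape = conv_shape(*shape, filters=256, k=3)
    shape = conv_shape(*shape, filters=256, k=3)

# Policy head: one more 256-filter conv, then the final 73-filter conv.
policy = conv_shape(*shape, filters=256, k=3)
policy = conv_shape(*policy, filters=73, k=3)

# Value head: 1x1 conv of 1 filter (the FC-256 and tanh layers follow).
value = conv_shape(*shape, filters=1, k=1)

print(policy)                                 # (8, 8, 73)
print(policy[0] * policy[1] * policy[2])      # 4672 move logits
print(value)                                  # (8, 8, 1) before the linear layers
```

Every convolution preserves the 8x8 spatial dimensions, so the policy head's output is 8x8x73 regardless of whether the final conv is 3x3 or 1x1.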
Is there an additional convolutional layer (3x3 of 256 filters) inside the policy head? AlphaGo Zero does
a 1x1 convolution of 2 filters -> 362 outputs, while Lc0 does a 1x1 convolution of 32 filters.
A0 policy head: 3x3 of 256 filters => 1x1 of 73 filters => 8x8x73 outputs
Lc0 policy head: 1x1 of 32 filters => 1858 outputs
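The three output counts being compared can be cross-checked arithmetically. A sketch, assuming the standard interpretations: AlphaZero's chess policy is 73 move-type planes per board square, AlphaGo Zero's is one logit per Go point plus a pass move, and Lc0's 1858 is a flat enumeration of possible moves.

```python
# Cross-check the policy-output sizes quoted above.
a0_chess = 8 * 8 * 73    # AlphaZero chess: 73 move-type planes per square
ag0_go   = 19 * 19 + 1   # AlphaGo Zero: one logit per point, plus pass
lc0      = 1858          # Lc0: flat list of encodable moves

print(a0_chess, ag0_go, lc0)  # 4672 362 1858
```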
Did I get this right?
Also, it makes sense to me that we do more convolution in the policy head than in the value head.