However, in the training pseudocode it looks like there is a temperature cutoff at ply 30:
def select_action(config: AlphaZeroConfig, game: Game, root: Node):
  visit_counts = [(child.visit_count, action)
                  for action, child in root.children.iteritems()]
  if len(game.history) < config.num_sampling_moves:
    _, action = softmax_sample(visit_counts)
  else:
    _, action = max(visit_counts)
  return action
There is a temperature cutoff in the description of the play against Stockfish, though (not during training), so maybe that is what was meant in the pseudocode:
"by softmax sampling with a temperature of 10.0 among moves for which the value was no more than 1% away from the best move for the first 30 plies".
That raises some more questions (not as important as the training question, though):
- "Softmax sampling with temperature 10" is a bit ambiguous; my best guess is that it means "proportional to exp(N / 10)".
- Values no more than 1% away: is that the value of Q or of N? (I guess it's Q?)
- If it's Q, what does "1% away" mean? Is it just 1% of the Q range (i.e. 0.02, as Q goes from -1 to 1, so if Q for the best move is -0.015, then moves with Q >= -0.035 are taken)?
Or is it a relative percentage? E.g. if Q = -0.015, then nodes with Q >= -0.01515 are sampled? (That doesn't look correct.)
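To make the guesses above concrete, here is a minimal sketch of the first reading: filter on Q with an absolute margin of 1% of the Q range, then sample proportionally to exp(N / 10). All of it is my assumption, not something confirmed by the paper or the pseudocode: the function names, the dict inputs keyed by move, and the 0.02 margin are purely illustrative.

```python
import math
import random

def candidate_moves(q_values, margin=0.02):
    """Keep moves whose Q is within `margin` of the best Q.

    Assumption: "1% away" means 1% of the full Q range [-1, 1], i.e. 0.02.
    """
    best = max(q_values.values())
    return [move for move, q in q_values.items() if q >= best - margin]

def softmax_weights(visit_counts, moves, temperature=10.0):
    """Weights proportional to exp(N / temperature) (assumed reading).

    The largest N is subtracted first so exp() cannot overflow;
    this does not change the resulting probabilities.
    """
    top = max(visit_counts[m] for m in moves)
    return [math.exp((visit_counts[m] - top) / temperature) for m in moves]

def sample_move(q_values, visit_counts, margin=0.02, temperature=10.0, rng=random):
    """Sample one move among the candidates using the weights above."""
    moves = candidate_moves(q_values, margin)
    weights = softmax_weights(visit_counts, moves, temperature)
    return rng.choices(moves, weights=weights, k=1)[0]

# Example from the post: best Q is -0.015, so the cutoff is -0.035.
q = {"e4": -0.015, "d4": -0.030, "a3": -0.040}
n = {"e4": 100, "d4": 90, "a3": 300}
print(candidate_moves(q))   # "a3" falls outside the margin
print(sample_move(q, n))    # either "e4" or "d4"
```

Under the relative reading, the filter would instead compare against something like best - 0.01 * abs(best), which with Q = -0.015 gives the -0.01515 cutoff mentioned above.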