Alphazero news

crem · Post by **crem** » Fri Dec 07, 2018 12:34 pm

The paper says that during training, moves are selected "in proportion to the root visit count", without mentioning that this happens only for the first 30 plies (so I assume it happens for the entire game).

However, in the training pseudocode it looks like there is a temperature cutoff at ply 30:

def select_action(config: AlphaZeroConfig, game: Game, root: Node):
  visit_counts = [(child.visit_count, action)
                  for action, child in root.children.iteritems()]
  if len(game.history) < config.num_sampling_moves:
    _, action = softmax_sample(visit_counts)
  else:
    _, action = max(visit_counts)
  return action

matthewlai, would it be possible to clarify whether a temperature cutoff was used during training or not?

There is a temperature cutoff in the description of the play versus Stockfish, though (not during training), so maybe that's what was meant in the pseudocode:
"by softmax sampling with a temperature of 10.0 among moves for which the value was no more than 1% away from the best move for the first 30 plies".

That raises some more questions (less important than the training question):
- "Softmax sampling with temperature 10" is a bit ambiguous; my best guess is that it means "proportional to exp(N / 10)".
- "Values no more than 1% away": is that the value of Q or of N? (I guess it's Q?)
- If it's Q, what does "1% away" mean? Is it just 1% of the Q range (i.e. 0.02, as Q runs from -1 to 1, so if Q for the best move is -0.015, then moves with Q >= -0.035 are taken)? Or is it a relative percentage, e.g. if Q = -0.015, then nodes with Q >= -0.01515 are sampled? (That doesn't look correct.)

matthewlai · Post by **matthewlai** » Fri Dec 07, 2018 12:49 pm

crem wrote: ↑Fri Dec 07, 2018 12:34 pm The paper says that during training, moves are selected "in proportion to the root visit count", without mentioning that this happens only for the first 30 plies (so I assume it happens for the entire game).

However, in the training pseudocode it looks like there is a temperature cutoff at ply 30:
def select_action(config: AlphaZeroConfig, game: Game, root: Node):
  visit_counts = [(child.visit_count, action)
                  for action, child in root.children.iteritems()]
  if len(game.history) < config.num_sampling_moves:
    _, action = softmax_sample(visit_counts)
  else:
    _, action = max(visit_counts)
  return action
matthewlai, would it be possible to clarify whether a temperature cutoff was used during training or not?

There is a temperature cutoff in the description of the play versus Stockfish, though (not during training), so maybe that's what was meant in the pseudocode:
"by softmax sampling with a temperature of 10.0 among moves for which the value was no more than 1% away from the best move for the first 30 plies".

That raises some more questions (less important than the training question):
- "Softmax sampling with temperature 10" is a bit ambiguous; my best guess is that it means "proportional to exp(N / 10)".
- "Values no more than 1% away": is that the value of Q or of N? (I guess it's Q?)
- If it's Q, what does "1% away" mean? Is it just 1% of the Q range (i.e. 0.02, as Q runs from -1 to 1, so if Q for the best move is -0.015, then moves with Q >= -0.035 are taken)? Or is it a relative percentage, e.g. if Q = -0.015, then nodes with Q >= -0.01515 are sampled? (That doesn't look correct.)

During training, we do softmax sampling by visit count up to move 30. There is no value cutoff. Temperature is 1.

In those games against SF to increase diversity (this is the only place we used softmax sampling in normal gameplay), we did the same but with a higher temperature, and only consider moves within 1% value.

Definition of temperature is the standard definition in softmax - exp(N / 10) is correct.

By 1% we mean in absolute value. All our values are between 0 and 1, so if the best move has a value of 0.8, we would sample from all moves with values >= 0.79.
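
A minimal sketch of the selection rule described above, taking "exp(N / 10)" literally as the sampling weight. The names `Stats`, `candidate_moves` and `sample_move` are made up for illustration, and treating the most-visited root child as the "best move" is an assumption, since the post does not say how the best move is picked. Subtracting the largest visit count before exponentiating does not change the probabilities; it only keeps `exp` from overflowing.

```python
import math
import random
from typing import Dict, List, Optional, Tuple

# move -> (visit count N, value on the [0, 1] search scale)
Stats = Dict[str, Tuple[int, float]]


def candidate_moves(stats: Stats, value_margin: Optional[float]) -> List[str]:
    """Moves whose value is within value_margin (absolute) of the best move."""
    if value_margin is None:
        # No value cutoff (as described for training): every move is eligible.
        return list(stats)
    # Assumption: the "best move" is the most-visited root child.
    best = max(stats, key=lambda m: stats[m][0])
    threshold = stats[best][1] - value_margin
    return [m for m in stats if stats[m][1] >= threshold]


def sample_move(stats: Stats,
                temperature: float = 10.0,
                value_margin: Optional[float] = 0.01,
                rng: Optional[random.Random] = None) -> str:
    """Sample a move with probability proportional to exp(N / temperature)."""
    rng = rng or random.Random()
    moves = candidate_moves(stats, value_margin)
    max_n = max(stats[m][0] for m in moves)
    # Shift by max_n for numerical stability; the distribution is unchanged.
    weights = [math.exp((stats[m][0] - max_n) / temperature) for m in moves]
    return rng.choices(moves, weights=weights, k=1)[0]
```

With the defaults this corresponds to the setting described for the games against SF (temperature 10, 1% value cutoff); calling it with `temperature=1.0, value_margin=None` for the first 30 moves would correspond to the training setting as described.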

Astatos · Post by **Astatos** » Fri Dec 07, 2018 1:01 pm

OK, what we know:
1) Stockfish is the best engine in the world
2) The LC0 guys did manage to reverse engineer A0 successfully
3) LC0 and A0 are roughly at the same strength
4) NNs are no less resource-hungry than alpha-beta
5) Scalability is about the same in both methods
6) Google has unacceptable behaviour, hiding data, obfuscating opponents and hyping results

crem · Post by **crem** » Fri Dec 07, 2018 1:02 pm

matthewlai wrote: ↑Fri Dec 07, 2018 12:49 pm All our values are between 0 and 1, so if the best move has a value of 0.8, we would sample from all moves with values >= 0.79.

The paper says: "At the end of the game, the terminal position sT is scored according to the rules of the game to compute the game outcome z: −1 for a loss, 0 for a draw, and +1 for a win.".

So it's not like that? Is it 0 for a loss, 1 for a win and 0.5 for a draw?
Also, the paper says that initial Q=0 (and the pseudocode also says "if self.visit_count == 0: return 0"). Does that mean it's initialized to the "loss" value?

Whether it's -1 to 1 or 0 to 1 also matters for Cpuct scaling (or C(s) in the latest version of the paper). Do the c_base and c_init values assume that the Q range is -1..1 or 0..1?

USGroup1 · Post by **USGroup1** » Fri Dec 07, 2018 1:34 pm

OneTrickPony wrote: ↑Fri Dec 07, 2018 12:13 pm ...What is important is that the games show that there are paths in chess which SF is still unable to understand and the losses are very different than just playing against another alpha-beta engine on 4x or 10x the hardware.

That does not mean we can't improve alpha-beta engines so they could handle those paths efficiently.

Jouni · Post by **Jouni** » Fri Dec 07, 2018 2:00 pm

So far I have only looked at the TCEC opening games. A0 sometimes seems to play like a patzer and loses in 22 moves to an outdated SF.

[pgn]
[Event "Computer Match"]
[Site "London, UK"]
[Date "2018.01.18"]
[Round "255"]
[White "Stockfish 8"]
[Black "AlphaZero"]
[Result "1-0"]
[PlyCount "43"]
[EventDate "2018.??.??"]

1. e4 {book} e6 {book} 2. d4 {book} d5 {book} 3. Nc3 {book} Nf6 {book} 4. Bg5 {book} Be7 {book} 5. e5 {book} Nfd7 {book} 6. h4 {book} Bxg5 {book} 7. hxg5 {book} Qxg5 {book} 8. Nh3 {book} Qe7 {book} 9. Qg4 g6 10. Ng5 h6 11. O-O-O Nc6 12. Nb5 Nb6 13. Rd3 h5 14. Rf3 a6 15. Qg3 Nd8 16. Nc3 Nd7 17. Bd3 Nf8 18. Rh4 Rg8 19. Bc4 Qd7 20. Nce4 dxe4 21. Nxe4 Nh7 22. Rxh5 1-0
[/pgn]

noobpwnftw · Post by **noobpwnftw** » Fri Dec 07, 2018 3:16 pm

Jouni wrote: ↑Fri Dec 07, 2018 2:00 pm So far I have only looked at the TCEC opening games. A0 sometimes seems to play like a patzer and loses in 22 moves to an outdated SF.

g4 with a queen, can't defend.

Laskos · Post by **Laskos** » Fri Dec 07, 2018 3:47 pm

OneTrickPony wrote: ↑Fri Dec 07, 2018 12:13 pm
I am not convinced the newest SF would win against it. The ELO is calculated against a pool of similar engines. It's not clear if 50 or 100 ELO more against this pool is equal to 50-100 ELO more against an opponent of a different type.

While that's true, Lc0 with the best nets on my powerful GPU and average CPU (a "Leela Ratio" of, say, 2.5) heavily beats SF8 but loses slightly to SF10 from regular openings. Against SF8, the result is similar to what happens in this paper. My guess is that this particular "old" A0, in those TCEC conditions, is somewhat weaker than SF10.
Lc0 needs a "Leela Ratio" of 2.5 to have similar results to A0 ("Leela Ratio" 1 by definition), so Lc0 (with the best nets) is still lagging pretty significantly behind A0.
In some games it becomes apparent that they are fairly similar in playing style, strengths and weaknesses.

Damir · Post by **Damir** » Fri Dec 07, 2018 4:20 pm

Where can we see all 1000 games that were played?

matthewlai · Post by **matthewlai** » Fri Dec 07, 2018 4:50 pm

crem wrote: ↑Fri Dec 07, 2018 1:02 pm
matthewlai wrote: ↑Fri Dec 07, 2018 12:49 pm All our values are between 0 and 1, so if the best move has a value of 0.8, we would sample from all moves with values >= 0.79.
The paper says: "At the end of the game, the terminal position sT is scored according to the rules of the game to compute the game outcome z: −1 for a loss, 0 for a draw, and +1 for a win.".

So it's not like that? Is it 0 for a loss, 1 for a win and 0.5 for a draw?
Also, the paper says that initial Q=0 (and the pseudocode also says "if self.visit_count == 0: return 0"). Does that mean it's initialized to the "loss" value?

Whether it's -1 to 1 or 0 to 1 also matters for Cpuct scaling (or C(s) in the latest version of the paper). Do the c_base and c_init values assume that the Q range is -1..1 or 0..1?

All the values in the search are [0, 1]. We store them as [-1, 1] only for network training, to have training targets centered around 0. At play time, when network evaluations come back, we shift them to [0, 1] before doing anything with them.

Yes, all values are initialized to loss value.
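
A small sketch of the convention described above. The post only says the network values are shifted to [0, 1], so the linear map (v + 1) / 2 is an assumption, and `to_search_value` and the `value_sum` field are illustrative names. The `visit_count == 0` check mirrors the pseudocode line quoted earlier in the thread.

```python
def to_search_value(network_value: float) -> float:
    """Map a network output in [-1, 1] onto the [0, 1] search scale.

    Assumed to be a linear shift: -1 (loss) -> 0, 0 (draw) -> 0.5, +1 (win) -> 1.
    """
    return (network_value + 1.0) / 2.0


class Node:
    """Minimal search node that only tracks value statistics."""

    def __init__(self) -> None:
        self.visit_count = 0
        self.value_sum = 0.0  # sum of backed-up values on the [0, 1] scale

    def value(self) -> float:
        # An unvisited node reports 0, which is the loss value on the [0, 1] scale.
        if self.visit_count == 0:
            return 0.0
        return self.value_sum / self.visit_count
```

On this scale an unvisited child looks like a loss rather than a draw, which is what "initialized to loss value" amounts to.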
