SL vs RL

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

chrisw
Posts: 4313
Joined: Tue Apr 03, 2012 4:28 pm

SL vs RL

Post by chrisw »

I found my little test MCTS SL program screwing up in game play because of this sort of situation:

Black Q gives check. The checking piece is en prise. The king has some moves.

The SL-trained policy gave the king-move evasions normal probabilities, 0.25, 0.12 sort of values, but the obvious capture of the queen (which would be outright winning) got a probability of 0.005 or whatever. This happened often enough to warrant an investigation.

I think it’s because SL games are sort of sensible, comp-comp or human: there are virtually no positions generated where a checking queen is simply en prise, so no sensible game contains the continuation where the queen gets captured, and xQ never gets promoted in training. Hence xQ is not favored by the policy.

The result in search is that the little program gives an idiotic check with its queen en prise and is very happy, because PUCT puts the recapture so far down its list, or misses it entirely, so the en-prise queen never gets captured.

AZ is supposed to generate “weak” games to give experience in stupid situations; is that the solution? And are those games weak enough to contain outright blunder positions?

This is an extreme example, but I’m seeing variants of it all the time, all stemming from the fact that SL only sees sensible examples.
Another failure is SEE-favored captures. Sensible games have many positions where the side that just moved is material ahead, but that material isn’t kept, because the move was only the first in a SEE exchange sequence. The problem is that the value head interprets “material down but on move” as equal because, well, it only got to see examples where that is true.
I think I found a way around this, but hack-by-hack fixes usually leave holes. Is RL just better? Maybe.
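
To see how a near-zero prior starves the recapture under PUCT, here is a small toy calculation in Python; the move names, leaf values, c_puct and FPU are all illustrative assumptions, not anything from the program described above.

import math

# Toy node: four king-move evasions with "normal" priors and the winning
# recapture with a 0.005 prior. Leaf values are deterministic stand-ins for
# a value head; FPU (value of an unvisited child) is taken as 0.
priors     = {"Kf1": 0.25, "Ke2": 0.12, "Kd1": 0.10, "Kd2": 0.08, "RxQ": 0.005}
leaf_value = {"Kf1": 0.0,  "Ke2": 0.0,  "Kd1": 0.0,  "Kd2": 0.0,  "RxQ": 1.0}
C_PUCT = 1.5

def run(num_sims):
    visits = {m: 0 for m in priors}
    value_sum = {m: 0.0 for m in priors}
    for _ in range(num_sims):
        n_parent = 1 + sum(visits.values())
        def score(m):
            q = value_sum[m] / visits[m] if visits[m] else 0.0      # FPU = 0
            u = C_PUCT * priors[m] * math.sqrt(n_parent) / (1 + visits[m])
            return q + u
        best = max(priors, key=score)                # PUCT selection
        visits[best] += 1
        value_sum[best] += leaf_value[best]
    return visits

print(run(100))   # the 0.005-prior recapture gets zero visits at this budget

With all the evasions sharing the same Q, the recapture is only selected once the best evasion has gathered roughly 0.25 / 0.005 = 50 visits, which a 100-simulation budget never provides, so in this toy the winning move is never even tried.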
Rémi Coulom
Posts: 438
Joined: Mon Apr 24, 2006 8:06 pm

Re: SL vs RL

Post by Rémi Coulom »

RL has the same problem. Weak moves of early self-play games are rapidly forgotten.

In Go, the AlphaZero method has a very severe problem with ladders. It is similar to what you describe.

A fundamental flaw of the AlphaZero approach is that it learns only from games between strong players. When the neural network is used inside the search tree, it often has to find refutations to bad moves that would never have been played between strong players.

I have not yet found a good way to overcome this problem. But I will try some ideas to include good refutations to bad moves into the training set.
hgm
Posts: 27787
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: SL vs RL

Post by hgm »

The solution seems obvious: make sure the training set contains sufficiently many games where a strong player crushes a patzer in the most efficient way. Or games between strong players where you make one of the moves of one player a random one.
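
A minimal sketch of the second idea (a single deliberately random move injected into an otherwise normal game), using python-chess; engine_move is a hypothetical stand-in for whatever move selection the training pipeline already uses:

import random
import chess   # python-chess

def play_training_game(engine_move, random_move_ply=None):
    """Self-play as usual, except that at one chosen ply a uniformly random
    legal move replaces the engine's choice."""
    board = chess.Board()
    record = []                                   # (fen, move) pairs for training
    while not board.is_game_over(claim_draw=True):
        if board.ply() == random_move_ply:
            move = random.choice(list(board.legal_moves))   # the injected lapse
        else:
            move = engine_move(board)
        record.append((board.fen(), move.uci()))
        board.push(move)
    return record, board.result(claim_draw=True)

# For example, inject the random move somewhere in the middlegame of a
# fraction of the games:
# positions, result = play_training_game(engine_move,
#                                        random_move_ply=random.randrange(20, 60))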
chrisw
Posts: 4313
Joined: Tue Apr 03, 2012 4:28 pm

Re: SL vs RL

Post by chrisw »

hgm wrote: Mon Apr 29, 2019 12:41 pm The solution seems obvious: make sure the training set contains sufficiently many games where a strong player crushes a patzer in the most efficient way. Or games between strong players where you make one of the moves of one player a random one.
Then what do you use as the game result for this diversionary sequence? And is one random move in, say, 80 enough to create the necessary bad game segment?
chrisw
Posts: 4313
Joined: Tue Apr 03, 2012 4:28 pm

Re: SL vs RL

Post by chrisw »

Rémi Coulom wrote: Sun Apr 28, 2019 4:19 pm RL has the same problem. Weak moves of early self-play games are rapidly forgotten.

In Go, the AlphaZero method has a very severe problem with ladders. It is similar to what you describe.

A fundamental flaw of the AlphaZero approach is that it learns only from games between strong players. When the neural network is used inside the search tree, it often has to find refutations to bad moves that would never have been played between strong players.

I have not yet found a good way to overcome this problem. But I will try some ideas to include good refutations to bad moves into the training set.
Lc0 at nodes=0 or any low value versus itself, SF10 with limited depth, or one’s own simple net shows up many, many blundering cases. It’s like watching games from the 1980s. The Lc0 policy is good, but in many cases it is not. Significantly, it can have no idea what to do when either strongly ahead or strongly behind; I guess these sorts of games just don’t appear in training.
If one had a net trained also on many bad positions, then there’s the Homer Simpson brain-capacity problem: every time I hear something new, a bit of the old drops out the other side.
trulses
Posts: 39
Joined: Wed Dec 06, 2017 5:34 pm

Re: SL vs RL

Post by trulses »

chrisw wrote: Sun Apr 28, 2019 3:57 pm ...

The SL-trained policy gave the king-move evasions normal probabilities, 0.25, 0.12 sort of values, but the obvious capture of the queen (which would be outright winning) got a probability of 0.005 or whatever. This happened often enough to warrant an investigation.

I think it’s because SL games are sort of sensible, comp-comp or human: there are virtually no positions generated where a checking queen is simply en prise, so no sensible game contains the continuation where the queen gets captured, and xQ never gets promoted in training. Hence xQ is not favored by the policy.
Seems like the entropy of your policy is too low; try training with entropy regularization. If you have the test position, an immediate band-aid fix is to find which softmax temperature helps PUCT try this move earlier in your search. Your value net should enjoy the resulting position when up a queen, so this move should eventually prove very convincing to PUCT independently of your policy net.
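
For illustration, a small sketch of the temperature band-aid (the numbers are made up, loosely matching the probabilities quoted above): raising T flattens the net’s output before PUCT sees it, so a 0.005 move gets a workable share.

import numpy as np

def apply_temperature(priors, temperature):
    """Re-normalize p_i -> p_i**(1/T); T = 1 leaves the distribution unchanged."""
    p = np.asarray(priors, dtype=np.float64) ** (1.0 / temperature)
    return p / p.sum()

priors = np.array([0.40, 0.25, 0.20, 0.145, 0.005])   # last entry = the recapture
for T in (1.0, 2.0, 4.0):
    print(T, apply_temperature(priors, T).round(3))   # 0.005 -> ~0.035 at T=2

Entropy regularization works on the training side instead, adding a bonus term proportional to the entropy of the policy to the objective so the net itself keeps some probability mass on rarely seen moves.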
trulses
Posts: 39
Joined: Wed Dec 06, 2017 5:34 pm

Re: SL vs RL

Post by trulses »

Rémi Coulom wrote: Sun Apr 28, 2019 4:19 pm RL has the same problem. Weak moves of early self-play games are rapidly forgotten.

In Go, the AlphaZero method has a very severe problem with ladders. It is similar to what you describe.

A fundamental flaw of the AlphaZero approach is that it learns only from games between strong players. When the neural network is used inside the search tree, it often has to find refutations to bad moves that would never have been played between strong players.

I have not yet found a good way to overcome this problem. But I will try some ideas to include good refutations to bad moves into the training set.
What have you tried so far to address this? Do you use a large replay buffer? Have you tried playing against older agents? RL should give you a nice spectrum from weak to strong agents to play against.
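
For concreteness, a minimal sketch of those two ideas as they are commonly implemented (the class, the capacity and the 0.8 probability are illustrative choices, not anything from Rémi's setup):

import random
from collections import deque

class ReplayBuffer:
    """Training positions from many past generations; old data ages out slowly."""
    def __init__(self, capacity=500_000):
        self.data = deque(maxlen=capacity)

    def add_game(self, positions):
        # positions: list of (state, policy_target, game_result) tuples
        self.data.extend(positions)

    def sample(self, batch_size):
        return random.sample(self.data, batch_size)

def pick_opponent(latest, older_checkpoints, p_latest=0.8):
    """Mostly the newest net, occasionally an older (weaker) checkpoint, so
    refutations of weak play keep showing up in fresh self-play games."""
    if random.random() < p_latest or not older_checkpoints:
        return latest
    return random.choice(older_checkpoints)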
chrisw
Posts: 4313
Joined: Tue Apr 03, 2012 4:28 pm

Re: SL vs RL

Post by chrisw »

trulses wrote: Thu May 09, 2019 4:36 pm
chrisw wrote: Sun Apr 28, 2019 3:57 pm ...

The SL-trained policy gave the king-move evasions normal probabilities, 0.25, 0.12 sort of values, but the obvious capture of the queen (which would be outright winning) got a probability of 0.005 or whatever. This happened often enough to warrant an investigation.

I think it’s because SL games are sort of sensible, comp-comp or human: there are virtually no positions generated where a checking queen is simply en prise, so no sensible game contains the continuation where the queen gets captured, and xQ never gets promoted in training. Hence xQ is not favored by the policy.
Seems like the entropy of your policy is too low; try training with entropy regularization. If you have the test position, an immediate band-aid fix is to find which softmax temperature helps PUCT try this move earlier in your search. Your value net should enjoy the resulting position when up a queen, so this move should eventually prove very convincing to PUCT independently of your policy net.
I have so many things to fix that I applied a quick kludge to this and moved on. The basis of the kludge is that the policy score is inaccurate anyway, but inaccuracies where p is close to zero are potentially catastrophic, since zero is not a good multiplier. The kludge was to add an absolute 0.005 to p before using it in (my version of) PUCT, so that nothing gets completely overlooked.
I also have ideas to handcraft move selection while waiting for a policy batch and/or to add a handcrafted term to p that decreases with visits. All on the todo list.
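
A minimal sketch of that prior-floor kludge (the 0.005 figure is the one mentioned above; c_puct is illustrative):

import math

def puct_score(q, prior, parent_visits, child_visits,
               c_puct=1.5, prior_floor=0.005):
    """PUCT with an absolute floor added to the prior, so a near-zero policy
    output cannot starve a move of visits ("zero is not a good multiplier")."""
    p = prior + prior_floor
    return q + c_puct * p * math.sqrt(parent_visits) / (1 + child_visits)

In the toy node sketched earlier in the thread, this floor is enough to get the recapture its first visit well inside a 100-simulation budget, after which its winning value takes over the selection.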
Albert Silver
Posts: 3019
Joined: Wed Mar 08, 2006 9:57 pm
Location: Rio de Janeiro, Brazil

Re: SL vs RL

Post by Albert Silver »

chrisw wrote: Sun Apr 28, 2019 3:57 pm I found my little test MCTS SL program screwing up in game play because of this sort of situation:

Black Q gives check. The checking piece is en prise. The king has some moves.

The SL-trained policy gave the king-move evasions normal probabilities, 0.25, 0.12 sort of values, but the obvious capture of the queen (which would be outright winning) got a probability of 0.005 or whatever. This happened often enough to warrant an investigation.

I think it’s because SL games are sort of sensible, comp-comp or human: there are virtually no positions generated where a checking queen is simply en prise, so no sensible game contains the continuation where the queen gets captured, and xQ never gets promoted in training. Hence xQ is not favored by the policy.
There is a vast difference between human games and comp games when it comes to NN training. What was your SL source material?
"Tactics are the bricks and sticks that make up a game, but positional play is the architectural blueprint."