Understanding the power of reinforcement learning

Discussion of anything and everything relating to chess playing software and machines.

Moderators: hgm, Rebel, chrisw

Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Understanding the power of reinforcement learning

Post by Michael Sherwin »

The following position is a bit dated, as most strong engines will now find the best move using normal search. However, 30 years ago, just throwing a dart at the calendar, the best engines could not find it. Even RomiChess in 2006 could not find it, though Phalanx 22 could. In 2005 this was one of the positions I hoped Romi could solve with normal search, but that did not happen. After I added reinforcement learning, and before I added MSMD learning, I tested Romi playing the black pieces to see if she could find the best move after a number of training games. It took Romi 40 games to find the best move, but once she found it (learned it through reinforcement learning) she won every game from then on. I know that TSCP could also find this winning move after enough training games in the position. The point is that if TSCP had reinforcement learning and won a game against SF from this position, it would look superhuman. It would look as if TSCP thought like a human and did the 'impossible'. It would look as incredible as AlphaZ, except it would have done it on equal hardware.

[d]r5k1/pp1bppbp/3p1np1/q5B1/2r1PP2/1NN5/PPPQ2PP/1K1R3R b - - 1 16
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
shrapnel
Posts: 1339
Joined: Fri Nov 02, 2012 9:43 am
Location: New Delhi, India

Re: Understanding the power of reinforcement learning

Post by shrapnel »

Do you have a UCI engine that we can test?
i7 5960X @ 4.1 Ghz, 64 GB G.Skill RipJaws RAM, Twin Asus ROG Strix OC 11 GB Geforce 2080 Tis
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Understanding the power of reinforcement learning

Post by Michael Sherwin »

shrapnel wrote:Do you have a UCI engine that we can test?
No, sorry. UCI does not support the result command, so learning would have to be triggered differently.
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Understanding the power of reinforcement learning

Post by Michael Sherwin »

Michael Sherwin wrote:The following position is a bit dated, as most strong engines will now find the best move using normal search. However, 30 years ago, just throwing a dart at the calendar, the best engines could not find it. Even RomiChess in 2006 could not find it, though Phalanx 22 could. In 2005 this was one of the positions I hoped Romi could solve with normal search, but that did not happen. After I added reinforcement learning, and before I added MSMD learning, I tested Romi playing the black pieces to see if she could find the best move after a number of training games. It took Romi 40 games to find the best move, but once she found it (learned it through reinforcement learning) she won every game from then on. I know that TSCP could also find this winning move after enough training games in the position. The point is that if TSCP had reinforcement learning and won a game against SF from this position, it would look superhuman. It would look as if TSCP thought like a human and did the 'impossible'. It would look as incredible as AlphaZ, except it would have done it on equal hardware.

[d]r5k1/pp1bppbp/3p1np1/q5B1/2r1PP2/1NN5/PPPQ2PP/1K1R3R b - - 1 16
I've tried my best to help everyone understand reinforcement learning, both 11 years ago and now that AlphaZ has rocked 'your' world. To those that listened, believed, and became fans, even though there were not all that many, I appreciated it. To the majority of engine authors who pooh-poohed my message and wrote it off as some kind of cheap book-tuning trick: I guess you got your asses handed to you with the crushing defeat of Stockfish. I'm not referring to all engine authors, only the ones it pertains to. Fans of your engines have asked for "Romi-style learning" for over a decade now and were not paid much attention. I have explained how it works and why it is so powerful until I'm blue in the face. But whenever 'you' mention any details about it while pooh-poohing it, you demonstrate a total lack of understanding and just continue on your merry way as if you had mastered the discussion, even though you had no clue what you were talking about. RomiChess can't demonstrate what AlphaZ did because there is no way for Romi to play millions of training games; I just don't have the computer power. Once again, I'm only talking to certain people. The future has left you behind and now you will have to play catch-up. As for me, I'm done with this issue; I leave you to go back to your bias. I've done my part! Just know this: the power in AlphaZ is the learning, not the hardware (except for quick training), the MCTS, or the deep NN.
tsmiller1
Posts: 6
Joined: Wed Dec 13, 2017 3:37 pm
Location: Kingsport, TN

Re: Understanding the power of reinforcement learning

Post by tsmiller1 »

Hey, I for one love this kind of learning, and I think AlphaZero's conquest was a major step forward for AI. Thanks for your own contributions to the field.
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: Understanding the power of reinforcement learning

Post by jdart »

What is "normal search" has changed drastically in the past 30 years, or even in the past 10, as Moore's Law has made what used to be considered supercomputer computing power available to PC users.

Algorithms are important, but now even a program with poor algorithms can reach depths >25 very quickly. I remember when that was quite unattainable.

--Jon
giovanni
Posts: 142
Joined: Wed Jul 08, 2015 12:30 pm

Re: Understanding the power of reinforcement learning

Post by giovanni »

Michael Sherwin wrote:The following position is a bit dated, as most strong engines will now find the best move using normal search. However, 30 years ago, just throwing a dart at the calendar, the best engines could not find it. Even RomiChess in 2006 could not find it, though Phalanx 22 could. In 2005 this was one of the positions I hoped Romi could solve with normal search, but that did not happen. After I added reinforcement learning, and before I added MSMD learning, I tested Romi playing the black pieces to see if she could find the best move after a number of training games. It took Romi 40 games to find the best move, but once she found it (learned it through reinforcement learning) she won every game from then on. I know that TSCP could also find this winning move after enough training games in the position. The point is that if TSCP had reinforcement learning and won a game against SF from this position, it would look superhuman. It would look as if TSCP thought like a human and did the 'impossible'. It would look as incredible as AlphaZ, except it would have done it on equal hardware.

[d]r5k1/pp1bppbp/3p1np1/q5B1/2r1PP2/1NN5/PPPQ2PP/1K1R3R b - - 1 16
Thanks, Michael. Could you elaborate a little more on this post? I mean, how does reinforcement learning apply to this position, what is MSMD, and so on?
tpoppins
Posts: 919
Joined: Tue Nov 24, 2015 9:11 pm
Location: upstate

Re: Understanding the power of reinforcement learning

Post by tpoppins »

Going by Michael's earlier posts here, MSMD is Monkey See, Monkey Do.
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Understanding the power of reinforcement learning

Post by Michael Sherwin »

giovanni wrote:
Michael Sherwin wrote:The following position is a bit dated, as most strong engines will now find the best move using normal search. However, 30 years ago, just throwing a dart at the calendar, the best engines could not find it. Even RomiChess in 2006 could not find it, though Phalanx 22 could. In 2005 this was one of the positions I hoped Romi could solve with normal search, but that did not happen. After I added reinforcement learning, and before I added MSMD learning, I tested Romi playing the black pieces to see if she could find the best move after a number of training games. It took Romi 40 games to find the best move, but once she found it (learned it through reinforcement learning) she won every game from then on. I know that TSCP could also find this winning move after enough training games in the position. The point is that if TSCP had reinforcement learning and won a game against SF from this position, it would look superhuman. It would look as if TSCP thought like a human and did the 'impossible'. It would look as incredible as AlphaZ, except it would have done it on equal hardware.

[d]r5k1/pp1bppbp/3p1np1/q5B1/2r1PP2/1NN5/PPPQ2PP/1K1R3R b - - 1 16
Thanks, Michael. Could you elaborate a little more on this post? I mean, how does reinforcement learning apply to this position, what is MSMD, and so on?
MSMD is, as the above post indicates, Monkey See, Monkey Do learning. It simply replays winning lines from past experience, up to 180 plies deep in RomiChess, so Romi can play some very deep lines while using virtually no time on the clock.
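
A rough sketch of the idea, with made-up names (this is not RomiChess's actual code): remember the winning side's moves keyed by the move sequence that preceded them, and replay the stored reply instantly whenever the current game matches.

# Hypothetical sketch of Monkey See, Monkey Do line replay.
class MSMDBook:
    def __init__(self):
        # map: tuple of moves played so far -> reply that led to a win
        self.lines = {}

    def learn_win(self, moves, winner_is_white):
        # store every (prefix -> next move) pair for the winning side
        for i, move in enumerate(moves):
            mover_is_white = (i % 2 == 0)
            if mover_is_white == winner_is_white:
                self.lines[tuple(moves[:i])] = move

    def probe(self, moves_so_far):
        # return a remembered winning reply, or None to fall back to search
        return self.lines.get(tuple(moves_so_far))

book = MSMDBook()
book.learn_win(["e4", "c5", "Nf3", "d6"], winner_is_white=False)
print(book.probe(["e4"]))   # "c5", replayed with virtually no clock time
print(book.probe(["d4"]))   # None, so the engine searches normally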

How reinforcement learning applies to the Dragon position above is this: unless Black finds the winning move, and plays some other move instead, Black's position is losing. When learning is triggered, the entire game is overlaid onto the tree stored on the hard disk. Each node stored on the hard disk carries a reinforcement value. The nodes of the winning side are adjusted upward and the nodes of the losing side are adjusted downward. This means bad moves can gain value and good moves can lose value, but over time this corrects itself. Since higher nodes are given a larger reward/penalty, they affect the search sooner, and eventually the values backpropagate to the root of the current position. When all the alternatives to the winning move look worse than Qxc3, Romi plays Qxc3 and wins, and since those moves then get rewarded she simply keeps playing the winning move as long as it continues to win. And because that line is loaded into the hash before each search, the winning move starts to affect the search from an earlier node in the game. As long as any subtree is stored in the learn file, no matter how small, those nodes with their accumulated scores affect the search.
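
A rough sketch of that kind of per-game overlay (the reward size, decay, and names are illustrative guesses, not Romi's actual constants):

from collections import defaultdict

learn_values = defaultdict(float)   # position key -> accumulated bonus/penalty

def overlay_game(position_keys, white_won, reward=4.0, decay=0.9):
    """Adjust the stored reinforcement values along one finished game.

    position_keys[i] is the position reached by move i of the game; moves of
    the winning side are rewarded, moves of the losing side are penalized,
    and earlier (higher) nodes receive larger adjustments, so they influence
    the search first and the effect backs up toward the root over many games.
    """
    bonus = reward
    for i, key in enumerate(position_keys):
        mover_was_white = (i % 2 == 0)
        sign = 1.0 if mover_was_white == white_won else -1.0
        learn_values[key] += sign * bonus
        bonus *= decay   # deeper nodes get a smaller reward/penalty

# At search time, the stored value for a known position is simply added to the
# evaluation of the move leading to it, nudging the engine toward lines that won.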

The learning in RomiChess was intended for use as a sparring partner for humans. Winboard sends a result command in human-versus-computer games; Arena does not, or at least did not 11 years ago. So set up a position, or start from the starting position, and play against Romi. If you beat Romi, Romi will play differently next time. Change sides and Romi will play your winning moves against you, and then you will have to win again and teach Romi better moves. Switch sides once more and if you win Romi will learn yet more; if Romi wins, then Romi is the teacher. It is hard to put into words, but basically the engine and the human teach each other, and it is especially good for learning a chosen opening. Anyway, in the last 11 years I have received zero reports of Romi being used as intended. That is a shame, really, because there is no other training system like it in existence as far as I know.
trulses
Posts: 39
Joined: Wed Dec 06, 2017 5:34 pm

Re: Understanding the power of reinforcement learning

Post by trulses »

Michael Sherwin wrote:How reinforcement learning applies to the Dragon position above is this: unless Black finds the winning move, and plays some other move instead, Black's position is losing. When learning is triggered, the entire game is overlaid onto the tree stored on the hard disk. Each node stored on the hard disk carries a reinforcement value. The nodes of the winning side are adjusted upward and the nodes of the losing side are adjusted downward. This means bad moves can gain value and good moves can lose value, but over time this corrects itself. Since higher nodes are given a larger reward/penalty, they affect the search sooner, and eventually the values backpropagate to the root of the current position. When all the alternatives to the winning move look worse than Qxc3, Romi plays Qxc3 and wins, and since those moves then get rewarded she simply keeps playing the winning move as long as it continues to win. And because that line is loaded into the hash before each search, the winning move starts to affect the search from an earlier node in the game. As long as any subtree is stored in the learn file, no matter how small, those nodes with their accumulated scores affect the search.

The learning in RomiChess was intended for use as a sparring partner for humans. Winboard sends a result command in human-versus-computer games; Arena does not, or at least did not 11 years ago. So set up a position, or start from the starting position, and play against Romi. If you beat Romi, Romi will play differently next time. Change sides and Romi will play your winning moves against you, and then you will have to win again and teach Romi better moves. Switch sides once more and if you win Romi will learn yet more; if Romi wins, then Romi is the teacher. It is hard to put into words, but basically the engine and the human teach each other, and it is especially good for learning a chosen opening. Anyway, in the last 11 years I have received zero reports of Romi being used as intended. That is a shame, really, because there is no other training system like it in existence as far as I know.
Hey Michael, very interesting stuff. This looks like a table-based Monte Carlo policy evaluation. Impressive that you discovered such a thing independently.

Did you ever try self-play in RomiChess using this method?

Did you ever try using one learn file with multiple different engines?

Did you always adjust the table by the same centipawn amount, or did you try lowering the centipawn bonus as you got more visits to each position?

You might experience some issues going from one engine to another, since the evaluation becomes fit not just to RomiChess but also to the opponent's policy. However, this is indeed a first step towards the policy evaluation used in A0.

If you wanted to speed up the learning process, you could look into TD(lambda), which uses a mixture of the episode return (win/loss/draw) and the table values visited over the course of the episode to update the table values.
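
For reference, a minimal tabular sketch of that TD(lambda) idea (backward view, with eligibility traces); the table, step size, and position keys are illustrative, not taken from any engine:

def td_lambda_update(episode, z, values, alpha=0.1, lam=0.8, gamma=1.0):
    """Update a value table from one finished game.

    episode: position keys s_0 .. s_(T-1), all valued from White's point of view
    z: final result from White's point of view (+1 win, 0 draw, -1 loss)
    lam=1 recovers the pure Monte Carlo (final-result) update; lam=0 is one-step TD.
    """
    eligibility = {}
    for t, s in enumerate(episode):
        # value of the successor; the terminal position's value is the result z
        v_next = values.get(episode[t + 1], 0.0) if t + 1 < len(episode) else z
        delta = gamma * v_next - values.get(s, 0.0)   # TD error for this step
        eligibility[s] = eligibility.get(s, 0.0) + 1.0
        for key, e in eligibility.items():
            values[key] = values.get(key, 0.0) + alpha * delta * e
            eligibility[key] = gamma * lam * e

values = {}
td_lambda_update(["startpos", "after_e4", "after_e4_e5"], z=+1.0, values=values)
print(values)   # small positive values, with the largest credit nearest the game's end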