Understanding the power of reinforcement learning

Discussion of anything and everything relating to chess playing software and machines.

Moderators: hgm, Rebel, chrisw

Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Understanding the power of reinforcement learning

Post by Michael Sherwin »

The following position is a bit dated, as most strong engines will now find the best move using normal search. However, 30 years ago, just throwing a dart at the calendar, the best engines could not find it. Even RomiChess in 2006 could not find it, though Phalanx 22 could. In 2005 this was one of the positions I hoped Romi could solve with normal search, but that did not happen. After I added reinforcement learning, and before I added MSMD learning, I tested Romi playing the black pieces to see if she could find the best move after a number of training games. It took Romi 40 games to find the best move, but once she found it (learned it through reinforcement learning) she won every game from then on. I know that TSCP could also find this winning move after enough training games in the position. The point is that if TSCP had reinforcement learning and won a game against SF from this position, it would look superhuman. It would look as if TSCP thought like a human and did the 'impossible'. It would look as incredible as AlphaZ, except it would have done it on equal hardware.

[d]r5k1/pp1bppbp/3p1np1/q5B1/2r1PP2/1NN5/PPPQ2PP/1K1R3R b - - 1 16
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
shrapnel
Posts: 1339
Joined: Fri Nov 02, 2012 9:43 am
Location: New Delhi, India

Re: Understanding the power of reinforcement learning

Post by shrapnel »

Do you have a UCI engine that we can test?
i7 5960X @ 4.1 Ghz, 64 GB G.Skill RipJaws RAM, Twin Asus ROG Strix OC 11 GB Geforce 2080 Tis
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Understanding the power of reinforcement learning

Post by Michael Sherwin »

shrapnel wrote:Do you have a UCI engine that we can test?
No, sorry. UCI does not support the result command, so learning would have to be triggered differently.
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Understanding the power of reinforcement learning

Post by Michael Sherwin »

Michael Sherwin wrote:The following position is a bit dated, as most strong engines will now find the best move using normal search. However, 30 years ago, just throwing a dart at the calendar, the best engines could not find it. Even RomiChess in 2006 could not find it, though Phalanx 22 could. In 2005 this was one of the positions I hoped Romi could solve with normal search, but that did not happen. After I added reinforcement learning, and before I added MSMD learning, I tested Romi playing the black pieces to see if she could find the best move after a number of training games. It took Romi 40 games to find the best move, but once she found it (learned it through reinforcement learning) she won every game from then on. I know that TSCP could also find this winning move after enough training games in the position. The point is that if TSCP had reinforcement learning and won a game against SF from this position, it would look superhuman. It would look as if TSCP thought like a human and did the 'impossible'. It would look as incredible as AlphaZ, except it would have done it on equal hardware.

[d]r5k1/pp1bppbp/3p1np1/q5B1/2r1PP2/1NN5/PPPQ2PP/1K1R3R b - - 1 16
I've tried my best to help everyone understand reinforcement learning, both 11 years ago and now that AlphaZ has rocked 'your' world. To those that listened, believed, and became fans, even though there were not all that many, I appreciated it. To the majority of engine authors who pooh-poohed my message and wrote it off as some kind of cheap book-tuning trick: I guess you got your asses handed to you with the crushing defeat of Stockfish. I'm not referring to all engine authors, only the ones it pertains to. Fans of your engines have asked for "Romi-style learning" for over a decade now and were not paid much attention. I have explained how it works and why it is so powerful until I'm blue in the face. But whenever 'you' mention any details about it while pooh-poohing it, you demonstrate a total lack of understanding and just continue on your merry way as if you had mastered the discussion, even though you had no clue what you were talking about. RomiChess can't demonstrate what AlphaZ did because there is no way for Romi to play millions of training games; I just don't have the computer power. Once again, I'm only talking to certain people. The future has left you behind and now you will have to play catch-up. As for me, I'm done with this issue; I leave you to go back to your bias. I've done my part! Just know this: the power in AlphaZ is the learning, not the hardware (except for quick training), the MCTS, or the deep NN.
tsmiller1
Posts: 6
Joined: Wed Dec 13, 2017 3:37 pm
Location: Kingsport, TN

Re: Understanding the power of reinforcement learning

Post by tsmiller1 »

Hey, I for one love this kind of learning, and I think AlphaZero's conquest was a major step forward for AI. Thanks for your own contributions to the field.
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: Understanding the power of reinforcement learning

Post by jdart »

What is "normal search" has changed drastically in the past 30 years, or even in the past 10, as Moore's Law has made what used to be considered supercomputer computing power available to PC users.

Algorithms are important, but now even a program with poor algorithms can reach depths >25 very quickly. I remember when that was quite unattainable.

--Jon
giovanni
Posts: 142
Joined: Wed Jul 08, 2015 12:30 pm

Re: Understanding the power of reinforcement learning

Post by giovanni »

Michael Sherwin wrote:The following position is a bit dated, as most strong engines will now find the best move using normal search. However, 30 years ago, just throwing a dart at the calendar, the best engines could not find it. Even RomiChess in 2006 could not find it, though Phalanx 22 could. In 2005 this was one of the positions I hoped Romi could solve with normal search, but that did not happen. After I added reinforcement learning, and before I added MSMD learning, I tested Romi playing the black pieces to see if she could find the best move after a number of training games. It took Romi 40 games to find the best move, but once she found it (learned it through reinforcement learning) she won every game from then on. I know that TSCP could also find this winning move after enough training games in the position. The point is that if TSCP had reinforcement learning and won a game against SF from this position, it would look superhuman. It would look as if TSCP thought like a human and did the 'impossible'. It would look as incredible as AlphaZ, except it would have done it on equal hardware.

[d]r5k1/pp1bppbp/3p1np1/q5B1/2r1PP2/1NN5/PPPQ2PP/1K1R3R b - - 1 16
Thanks, Michael. Could you elaborate a little more on this post? I mean, how does reinforcement learning apply to this position, what is MSMD, and so on?
tpoppins
Posts: 919
Joined: Tue Nov 24, 2015 9:11 pm
Location: upstate

Re: Understanding the power of reinforcement learning

Post by tpoppins »

Going by Michael's earlier posts here, MSMD is Monkey See, Monkey Do.
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Understanding the power of reinforcement learning

Post by Michael Sherwin »

giovanni wrote:
Michael Sherwin wrote:The following position is a bit dated, as most strong engines will now find the best move using normal search. However, 30 years ago, just throwing a dart at the calendar, the best engines could not find it. Even RomiChess in 2006 could not find it, though Phalanx 22 could. In 2005 this was one of the positions I hoped Romi could solve with normal search, but that did not happen. After I added reinforcement learning, and before I added MSMD learning, I tested Romi playing the black pieces to see if she could find the best move after a number of training games. It took Romi 40 games to find the best move, but once she found it (learned it through reinforcement learning) she won every game from then on. I know that TSCP could also find this winning move after enough training games in the position. The point is that if TSCP had reinforcement learning and won a game against SF from this position, it would look superhuman. It would look as if TSCP thought like a human and did the 'impossible'. It would look as incredible as AlphaZ, except it would have done it on equal hardware.

[d]r5k1/pp1bppbp/3p1np1/q5B1/2r1PP2/1NN5/PPPQ2PP/1K1R3R b - - 1 16
Thanks, Michael. Could you elaborate a little more on this post? I mean, how does reinforcement learning apply to this position, what is MSMD, and so on?
MSMD is, as the above post indicates, Monkey See, Monkey Do learning. It simply replays winning lines from past experience, up to 180 plies deep in RomiChess, so Romi can play some very deep lines while using virtually no time on the clock.
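
A rough sketch of the idea, with made-up names (this is not RomiChess's actual code): remember the winning side's moves keyed by the move sequence that preceded them, and replay the stored reply instantly whenever the current game matches.

# Hypothetical sketch of Monkey See, Monkey Do line replay.
class MSMDBook:
    def __init__(self):
        # map: tuple of moves played so far -> reply that led to a win
        self.lines = {}

    def learn_win(self, moves, winner_is_white):
        # store every (prefix -> next move) pair for the winning side
        for i, move in enumerate(moves):
            mover_is_white = (i % 2 == 0)
            if mover_is_white == winner_is_white:
                self.lines[tuple(moves[:i])] = move

    def probe(self, moves_so_far):
        # return a remembered winning reply, or None to fall back to search
        return self.lines.get(tuple(moves_so_far))

book = MSMDBook()
book.learn_win(["e4", "c5", "Nf3", "d6"], winner_is_white=False)
print(book.probe(["e4"]))   # "c5", replayed with virtually no clock time
print(book.probe(["d4"]))   # None, so the engine searches normally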

How reinforcement learning applies to the Dragon position above is this: unless Black finds the winning move, and plays some other move instead, Black's position is losing. When learning is triggered, the entire game is overlaid onto the tree stored on the hard disk. Each node stored on the hard disk carries a reinforcement value. The nodes of the winning side are adjusted upward and the nodes of the losing side are adjusted downward. This means bad moves can gain value and good moves can lose value, but over time this corrects itself. Since higher nodes are given a larger reward/penalty, they affect the search sooner, and eventually the values backpropagate to the root of the current position. When all the alternatives to the winning move look worse than Qxc3, Romi plays Qxc3 and wins, and since those moves then get rewarded she simply keeps playing the winning move as long as it continues to win. And because that line is loaded into the hash before each search, the winning move starts to affect the search from an earlier node in the game. As long as any subtree is stored in the learn file, no matter how small, those nodes with their accumulated scores affect the search.
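
A rough sketch of that kind of per-game overlay (the reward size, decay, and names are illustrative guesses, not Romi's actual constants):

from collections import defaultdict

learn_values = defaultdict(float)   # position key -> accumulated bonus/penalty

def overlay_game(position_keys, white_won, reward=4.0, decay=0.9):
    """Adjust the stored reinforcement values along one finished game.

    position_keys[i] is the position reached by move i of the game; moves of
    the winning side are rewarded, moves of the losing side are penalized,
    and earlier (higher) nodes receive larger adjustments, so they influence
    the search first and the effect backs up toward the root over many games.
    """
    bonus = reward
    for i, key in enumerate(position_keys):
        mover_was_white = (i % 2 == 0)
        sign = 1.0 if mover_was_white == white_won else -1.0
        learn_values[key] += sign * bonus
        bonus *= decay   # deeper nodes get a smaller reward/penalty

# At search time, the stored value for a known position is simply added to the
# evaluation of the move leading to it, nudging the engine toward lines that won.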

The learning in RomiChess was intended for use as a sparring partner for humans. Winboard sends a result command in human-versus-computer games; Arena does not, or at least did not 11 years ago. So set up a position, or start from the starting position, and play against Romi. If you beat Romi, Romi will play differently next time. Change sides and Romi will play your winning moves against you, and then you will have to win again and teach Romi better moves. Switch sides once more and if you win Romi will learn yet more; if Romi wins, then Romi is the teacher. It is hard to put into words, but basically the engine and the human teach each other, and it is especially good for learning a chosen opening. Anyway, in the last 11 years I have received zero reports of Romi being used as intended. That is a shame, really, because there is no other training system like it in existence as far as I know.
trulses
Posts: 39
Joined: Wed Dec 06, 2017 5:34 pm

Re: Understanding the power of reinforcement learning

Post by trulses »

Michael Sherwin wrote:How reinforcement learning applies to the Dragon position above is this: unless Black finds the winning move, and plays some other move instead, Black's position is losing. When learning is triggered, the entire game is overlaid onto the tree stored on the hard disk. Each node stored on the hard disk carries a reinforcement value. The nodes of the winning side are adjusted upward and the nodes of the losing side are adjusted downward. This means bad moves can gain value and good moves can lose value, but over time this corrects itself. Since higher nodes are given a larger reward/penalty, they affect the search sooner, and eventually the values backpropagate to the root of the current position. When all the alternatives to the winning move look worse than Qxc3, Romi plays Qxc3 and wins, and since those moves then get rewarded she simply keeps playing the winning move as long as it continues to win. And because that line is loaded into the hash before each search, the winning move starts to affect the search from an earlier node in the game. As long as any subtree is stored in the learn file, no matter how small, those nodes with their accumulated scores affect the search.

The learning in RomiChess was intended for use as a sparring partner for humans. Winboard sends a result command in human-versus-computer games; Arena does not, or at least did not 11 years ago. So set up a position, or start from the starting position, and play against Romi. If you beat Romi, Romi will play differently next time. Change sides and Romi will play your winning moves against you, and then you will have to win again and teach Romi better moves. Switch sides once more and if you win Romi will learn yet more; if Romi wins, then Romi is the teacher. It is hard to put into words, but basically the engine and the human teach each other, and it is especially good for learning a chosen opening. Anyway, in the last 11 years I have received zero reports of Romi being used as intended. That is a shame, really, because there is no other training system like it in existence as far as I know.
Hey Michael, very interesting stuff. This looks like a table-based Monte Carlo policy evaluation. Impressive that you discovered such a thing independently.

Did you ever try self-play in RomiChess using this method?

Did you ever try using one learn file with multiple different engines?

Did you always adjust the table by the same centipawn amount, or did you try lowering the centipawn bonus as you got more visits to each position?

You might experience some issues going from one engine to another, since the evaluation becomes fit not just to RomiChess but also to the opponent's policy. However, this is indeed a first step towards the policy evaluation used in A0.

If you wanted to speed up the learning process, you could look into TD(lambda), which uses a mixture of the episode return (win/loss/draw) and the table values visited over the course of the episode to update the table values.
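
For reference, a minimal tabular sketch of that TD(lambda) idea (backward view, with eligibility traces); the table, step size, and position keys are illustrative, not taken from any engine:

def td_lambda_update(episode, z, values, alpha=0.1, lam=0.8, gamma=1.0):
    """Update a value table from one finished game.

    episode: position keys s_0 .. s_(T-1), all valued from White's point of view
    z: final result from White's point of view (+1 win, 0 draw, -1 loss)
    lam=1 recovers the pure Monte Carlo (final-result) update; lam=0 is one-step TD.
    """
    eligibility = {}
    for t, s in enumerate(episode):
        # value of the successor; the terminal position's value is the result z
        v_next = values.get(episode[t + 1], 0.0) if t + 1 < len(episode) else z
        delta = gamma * v_next - values.get(s, 0.0)   # TD error for this step
        eligibility[s] = eligibility.get(s, 0.0) + 1.0
        for key, e in eligibility.items():
            values[key] = values.get(key, 0.0) + alpha * delta * e
            eligibility[key] = gamma * lam * e

values = {}
td_lambda_update(["startpos", "after_e4", "after_e4_e5"], z=+1.0, values=values)
print(values)   # small positive values, with the largest credit nearest the game's end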