Reinforcement learning project

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

mvanthoor
Posts: 1784
Joined: Wed Jul 03, 2019 4:42 pm
Location: Netherlands
Full name: Marcel Vanthoor

Re: Reinforcement learning project

Post by mvanthoor »

hgm wrote: Wed Feb 03, 2021 3:20 pm Strong Chess seems to be ugly / boring Chess. Blame the inventor of the game, not the engine! And then start playing Shogi. :wink:
Of course strong chess is boring, because for most people it moves into the realm of the incomprehensible. For amateur players, chess consists of some general knowledge about opening principles, material value, positional principles, endgame rules, some mating and combination patterns, and the capacity to think about 6-8 plies ahead. That will be enough to get you to around 2000 Elo, but to go much beyond that, you'll need to start studying.

Engines nowadays can see so far ahead that the moves they make are often too subtle for grandmasters to understand, let alone 'normal' people. Look at AlphaGo Zero, which disproved a lot of "you can't do that" theory in Go and rewrote half the opening knowledge of the last 300 years. For a pro who can (still) somewhat understand this, that is fascinating; for the average Go player it feels like he studied his entire life for nothing, because now nothing seems to be true anymore. Chess engines reached the point long ago where a normal human can't compete tactically or positionally.
Author of Rustic, an engine written in Rust.
Releases | Code | Docs | Progress | CCRL
hgm
Posts: 27793
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Reinforcement learning project

Post by hgm »

I realized stage 1 of this project: a toy engine that plays Janggi. It is still very basic (e.g. no hash table yet), but it plays. And to my inexperienced eyes the moves don't look too crazy, so I hope most bugs are gone now. (First I had a tremendous search explosion in QS, because I had forgotten to define the move list as unsigned, so that the MVV/LVA sorting worked the wrong way around! :oops: ) I uploaded the engine to http://hgm.nubati.net/j2.exe ; it can be played in the latest WinBoard-AA version, which supports Janggi as a standard variant.

I spent most of the time thinking about how I would handle the evaluation. Janggi is much like Team-Mate Chess: no single piece has mating potential; even KRK is insufficient mating material. Pairs of pieces can only force mate against a bare King when they contain a Pawn or Rook. So where in Chess you can get away with not knowing that KBK and KNK are draws, because you won't fall into that trap very often, in Janggi not recognizing the dead draws would mean you draw virtually every game, even those where you have a 3-piece advantage. (Which would be enough against a bare King, but cannot beat an opponent who still has Advisors in the Palace.) So the evaluation determines, for all combinations of 3 or fewer pieces, how much of a Palace they can beat (KAA, KA, bare K, or none at all), and whether the defender has enough, or even more Advisors than needed, to survive.

If both players still have enough material to overwhelm the enemy Palace, I do a 'normal' evaluation, based on piece values and PST. The assumption here is that it doesn't hurt in this case that the Advisors cannot reach the opponent's Palace to attack it: I would need to assign some pieces to the defense of my own Palace anyway, and I might as well assign that task to the Advisors, so that my other material is free to attack. This way the Advisors indirectly contribute to the attack power just like every other piece.

The situation becomes different when one of the players can no longer overwhelm the enemy Palace. The opponent can then throw all his remaining material against the Palace of that player, except of course his Advisors. So Advisors in excess of what was needed to defend become worthless for that attacking player. E.g. if I only have a Rook, and the opponent still has a KAA Palace, both his Advisors are completely useless, as the Rook cannot even checkmate a bare King. (And perpetually checking it would be forbidden as well.) Losing the value of his redundant Advisors is a disadvantage for the attacker, but this is often more than compensated by the fact that the defender's Pawns become mostly useless: they are almost always too far in front of the Palace to help in the defense, and the attacker simply goes around and behind them. Only a pair of Pawns on the initial rank directly in front of the Palace (protecting each other) can be a hindrance for attacking Elephants. So the evaluation ignores the defender's Pawns and the attacker's redundant Advisors. Of course when both players lack the material to overwhelm each other's Palace, the game is a dead draw.
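In pseudo-C the top level of that evaluation would look something like this (just a sketch to show the structure; the helper names are made up, and CanOverwhelm() stands for the lookup in the tabulated piece combinations):

/* Sketch of the evaluation flow described above (illustrative names only).  */
int CanOverwhelm(int side);      /* can 'side' still beat the enemy Palace?   */
int NormalEval(int side);        /* piece values + PST, from 'side' POV       */
int AttackEval(int attacker);    /* as NormalEval, but ignores the defender's */
                                 /* Pawns and the attacker's spare Advisors   */

int Evaluate(int stm)            /* score from the side to move's POV         */
{
    int weCan   = CanOverwhelm(stm);
    int theyCan = CanOverwhelm(!stm);

    if (!weCan && !theyCan) return 0;                   /* dead draw          */
    if ( weCan &&  theyCan) return NormalEval(stm);     /* both can attack    */
    return weCan ? AttackEval(stm) : -AttackEval(!stm); /* one-sided attack   */
}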

I hope this method of evaluation captures the essence of the game; in the future it can be made more elaborate by also tabulating the material needed to overcome Palace + one other piece.

Next step would be to use this engine in self-play to generate games, and in particular quiet positions from these games, to serve as learning material to Texel-tune the evaluation. (Which currently operates based on guessed piece values and other parameters.)
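For those unfamiliar with it: Texel tuning boils down to minimizing the squared difference between the game results and a sigmoid of the static evaluation over the collected quiet positions. A generic sketch (the constant K and the Position/Eval names are placeholders, nothing from this engine):

#include <math.h>

typedef struct Position Position;        /* whatever the engine uses          */
int Eval(const Position *p);             /* static evaluation, in centipawns  */

double Sigmoid(double score, double K)   /* maps a score to a win expectancy  */
{
    return 1.0 / (1.0 + pow(10.0, -K * score / 400.0));
}

/* result[i] is the game outcome for the side to move in pos[i]:
   1 = win, 0.5 = draw, 0 = loss.  Minimize this over the eval parameters.   */
double TuningError(int n, const Position *pos[], const double result[], double K)
{
    double e = 0.0;
    for (int i = 0; i < n; i++) {
        double d = result[i] - Sigmoid(Eval(pos[i]), K);
        e += d * d;
    }
    return e / n;
}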
Raphexon
Posts: 476
Joined: Sun Mar 17, 2019 12:00 pm
Full name: Henk Drost

Re: Reinforcement learning project

Post by Raphexon »

Aren't the repetition rules for Janggi really complicated (or even subjective in nature)?
How are those coded?
hgm
Posts: 27793
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Reinforcement learning project

Post by hgm »

To be frank, those rules are not really clear to me at the moment; the information provided by my Korean contact is a bit ambiguous about this. Janggi seems unique in that it does not only recognize repetition of the same game state, but also repetition of the location of a single piece during a consecutive series of moves of that piece. So basically you would have to store a sequence of from-squares of the preceding moves next to the sequence of hash keys of the preceding positions, and then search back through the from-squares for as long as the same piece has been moving, just as you would search back through the position keys for repeated game states. In general both kinds of repetition are forbidden, but Kings and Advisors are exempted from that. (Which would make a perpetual checker always at fault.)
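If I were to code it, the single-piece test could look roughly like this (this reflects only my reading of the rule; the array names are made up, and the exemption for Kings and Advisors would have to be handled separately):

#define MAXPLY 1024

int fromSq[MAXPLY], toSq[MAXPLY];   /* origin/destination of every move played */

/* Has the piece that just moved (at 'ply') returned to a square it already
   occupied during its current consecutive series of moves?                   */
int SinglePieceRepeat(int ply)
{
    int sq = toSq[ply];                          /* where the piece stands now */
    for (int p = ply - 2;                        /* previous move of same side */
         p >= 0 && toSq[p] == fromSq[p + 2];     /* same piece kept moving     */
         p -= 2)
        if (fromSq[p] == sq) return 1;           /* it stood on this square before */
    return 0;
}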

I have not coded that yet, as it doesn't seem to me that the exact repetition rules would affect the game much, as long as they cause repetitions to be avoided and outlaw perpetual checking. So I currently score a position repetition as a draw (so that one of the engines in self-play will avoid it, unless the position is so nearly equal that a draw seems a good result anyway), except when the repetition was brought about by a check evasion, in which case it is a loss for the checker.

Since this is only intended as a temporary solution, I implemented it as 'lazy rep detection': I store the key signatures of visited positions in a small always-replace hash table (but saving the old value, and restoring it on unmake). If the new key signature is equal to what was already stored in the entry it hashes to, it is a repetition. There is a small chance that a key is overwritten before you revisit the position, so that you would not recognize it as a repetition. But the chance that this happens for all positions in a loop is very small, so you would then simply detect the repetition on the next move of the loop.
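In code the scheme amounts to something like this (a sketch with made-up names and table size):

#define REPBITS 9
#define REPSIZE (1 << REPBITS)

unsigned long long repTab[REPSIZE];             /* small always-replace table  */

/* Store 'key' in its slot; report a repetition if the slot already held it.
   The old slot contents are returned through *saved so the unmake can
   restore them.                                                              */
int PushRepKey(unsigned long long key, unsigned long long *saved)
{
    unsigned long long *slot = &repTab[key & (REPSIZE - 1)];
    int repeated = (*slot == key);
    *saved = *slot;                             /* remember for unmake         */
    *slot  = key;                               /* always replace              */
    return repeated;
}

void PopRepKey(unsigned long long key, unsigned long long saved)
{
    repTab[key & (REPSIZE - 1)] = saved;        /* restore old contents        */
}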
Mike Sherwin
Posts: 860
Joined: Fri Aug 21, 2020 1:25 am
Location: Planet Earth, Sol system
Full name: Michael J Sherwin

Re: Reinforcement learning project

Post by Mike Sherwin »

hgm wrote: Sun Jan 31, 2021 11:37 pm I am new to reinforcement learning, but I decided to give it a try. My plan is roughly this:

1) I will make a simple alpha-beta engine to play Janggi (Korean Chess), based on the move generator I have already written. (This currently exists as a CECP engine that moves randomly.)
2) I will give it a hand-crafted evaluation that takes all its parameters from an ini file (starting with piece values, PST, mobility).
3) I will equip it with a self-play command, which will write the position at the end of the PV of every move to a file.
4) I will have it play a huge number of super-fast games (e.g. a few thousand nodes per move). The first N moves (N=20?) will be played purely randomly, to guarantee game diversity and very unbalanced positions.
5) I will use these positions for Texel-tuning the evaluation.
6) I will then loop back to (4), using the tuned evaluation parameters.
7) When the evaluation parameters do not significantly change anymore, I am done tuning.
8) I will then play a few long-TC games, and have a Janggi expert comment on what strategic errors the engine makes. This should produce ideas for new evaluation terms to add, which would recognize the involved patterns.
9) With this extended evaluation, I then go back to (4), using the already optimized values for the old parameters, to see if beneficial weights for the new terms can be found.

I am curious what this will lead to!
It is of course a type of learning, but it is not Reinforcement Learning. RL as defined has a reward and a penalty for each and every situation. If a child touches a hot burner on the kitchen stove, the penalty the child receives reinforces what their parent told them: "don't touch the burner". In Pavlov's dog experiments a wrong action got the dog an electric shock and a right action got it a treat. Each decision the dog made resulted in either positive or negative reinforcement. Adapted to chess, every move is a decision and therefore every move should receive positive or negative reinforcement. It is just experiential learning: even though the evaluator combined with a limited-depth search says a move is good or bad, a hundred or a thousand games played to the end may show something different. Therefore each game played should nudge the evaluation for that position just slightly, so that an accumulation of experience will eventually (maybe after ten games from that position, maybe four) cause a different move to be scored higher. Each game's moves can be stored in a hash or in a tree on the hard drive. A hash could detect transpositions but may lose data due to collisions; a tree remembers all moves but may not be able to handle transpositions.

The true power of RL comes from the fact that accumulated information far away from the limited search is used to affect the search. Since modified valuations are stored, the entire subtree that has been experienced can be loaded into the search hash. Therefore the RL does not dictate to the search, it only influences it. And as long as the hash is not cleared after each game move, the influence remains even if there is no more subtree to load.

But that is so yesterday. "We" want to move past end-of-game learning to learning in real time. Let me start by asking a question: in the case of SF, how much is a 40-ply search worth over a 20-ply search if there are no tactics to find? And does not a null move (I do not mean the normal null move) find tactical threats anyway? So imagine that the iteration depth called for is 40 ply. A standard AB search is also so yesterday. Instead, search each root move only 20 ply deep. Play the best move and repeat until 40 plies of moves have been made. Now there is information from 60 ply deep, which means that the root move will have the score from the last 20-ply search, which carries information from 60 ply away. And since a 20-ply search completes in a tiny fraction of the time a 40-ply search takes, the iterative depth can maybe reach 60 ply instead of 40. Then the 60-ply "search" that uses 30-ply searches returns information from 90 ply away.

That could be the framework in which RT RL can be done. So let's say that a list of moves searched 20 ply deep has a score for each move. Those moves can be stored in memory as a tree structure. RL adjustments can be made to each move, and the 40-ply move sequences can be played many times, thus enabling RT RL. Now maybe the iterative search only reaches 50 ply instead of 60, but it will still contain information from 75 ply away. And it will be learned information. Okay, now the null move: if we are at, say, the 40-ply iterative level and we are doing 20-ply searches, it is trivial to insert a null move at each step to judge the threat and either cut that line off or possibly extend it, etc.
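A bare-bones sketch of that rolling search, just to pin the idea down (SearchBestMove, MakeMove and UnmakeMove are stand-ins for the usual engine machinery, not anything that exists yet):

typedef struct { int move, score; } Result;

/* stand-ins for the usual engine machinery */
Result SearchBestMove(int depth);      /* ordinary fixed-depth AB search      */
void   MakeMove(int move);
void   UnmakeMove(int move);

/* From the current position, repeatedly do a shallow search (e.g. depth=20)
   and play its best move, 'steps' times (e.g. 40).  The score of the last
   shallow search then carries information from steps+depth ply beyond the
   original root.  line[] receives the moves played, which could be fed to
   the RL bookkeeping.                                                        */
int RollingSearch(int steps, int depth, int line[])
{
    int score = 0, n;
    for (n = 0; n < steps; n++) {
        Result r = SearchBestMove(depth);
        line[n] = r.move;
        score   = r.score;
        MakeMove(r.move);
    }
    while (n > 0) UnmakeMove(line[--n]);   /* restore the original position   */
    return score;
}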

This is the future of search in games like chess once it can be demonstrated.
hgm
Posts: 27793
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Reinforcement learning project

Post by hgm »

Mike Sherwin wrote: Wed Feb 17, 2021 8:20 pm It is of course a type of learning but it is not Reinforcement Learning.
You are wrong about that. It is reinforcement learning because I tune the evaluation on data generated by the engine itself. It would be supervised learning if I showed it data from good games and told it: "this shows you what you should do!".

But here I let the engine behave the way it wants, and depending on whether the spontaneously exhibited behavior was good (= game won) or bad (= game lost) I reward or punish the engine by increasing or decreasing the eval parameters that led to the behavior (i.e. to a high score for the positions in the game).

It is just that I do this for an entire batch of games at once, rather than adjusting the parameters after every individual game (or every individual move). But that doesn't alter what type of learning it is. It just judges the behavior on a larger scale, to make the whole procedure less sensitive to statistical noise, by making sure the behavior is representatively sampled before coming to a verdict. This also has the advantage that you can see better how much you have to reward or punish it, without overshooting or hardly progressing at all.

What you describe in the second paragraph is already how top engines work, with their aggressive late-move reductions. They search enormously deeper on the PV than on side branches. You just propose a reduction scheme that says: never deepen a deviation from the PV to more than 20 ply. Of course this means that you will never find a move that has a surprising result after 21 ply. So top engines usually do not impose such an absolute limit, but rather something like: reduce this to 1/3 of the PV depth. Then they eventually get to see everything, even though they have to search the PV 3 times deeper than a fixed-depth search would have done. See the discussion in the general forum section about the 5-mover. Stockfish and Komodo need 10 minutes. Myrddin solves it in 21 sec, Fairy-Max in 13 sec.
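To illustrate the difference between the two rules (a toy formulation of my own, not any engine's actual reduction scheme):

/* "never deepen a deviation from the PV to more than 20 ply" */
int CappedDepth(int pvDepth, int isPV)
{
    if (isPV) return pvDepth;
    return pvDepth < 20 ? pvDepth : 20;      /* blind beyond 20 ply forever   */
}

/* "reduce this to 1/3 of the PV depth": side lines still deepen as the
   nominal depth grows, so eventually everything is seen                      */
int ProportionalDepth(int pvDepth, int isPV)
{
    if (isPV) return pvDepth;
    return pvDepth / 3 > 1 ? pvDepth / 3 : 1;
}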
Mike Sherwin
Posts: 860
Joined: Fri Aug 21, 2020 1:25 am
Location: Planet Earth, Sol system
Full name: Michael J Sherwin

Re: Reinforcement learning project

Post by Mike Sherwin »

hgm wrote: Wed Feb 17, 2021 9:38 pm
Mike Sherwin wrote: Wed Feb 17, 2021 8:20 pm It is of course a type of learning but it is not Reinforcement Learning.
You are wrong about that. It is reinforcement learning because I tune the evaluation on data generated by the engine itself. It would be supervised learning if I showed it data from good games and told it: "this shows you what you should do!".

But here I let the engine behave the way it wants, and depending on whether the spontaneously exhibited behavior was good (= game won) or bad (= game lost) I reward or punish the engine by increasing or decreasing the eval parameters that led to the behavior (i.e. to a high score for the positions in the game).

It is just that I do this for an entire batch of games at once, rather than adjusting the parameters after every individual game (or every individual move). But that doesn't alter what type of learning it is. It just judges the behavior on a larger scale, to make the whole procedure less sensitive to statistical noise, by making sure the behavior is representatively sampled before coming to a verdict. This also has the advantage that you can see better how much you have to reward or punish it, without overshooting or hardly progressing at all.
Then you are extending the definition of what RL is. In RL there is an actor (child, dog, chess engine, etc.) that performs actions in the "world", whatever that world is. Each and every action whose consequences are learned is taken in a specific situation: the child at the stove, the dog at the triggering device, or the chess engine in any given position. Tuning evaluation parameters the way you describe is decades old. Real RL, if done correctly, eventually seeks and reaches absolute truth: touching a hot burner will burn you, activating the wrong trigger will shock you, playing the best move will have this result. I don't know enough about Alpha(x) or its derivatives to say whether what they do is real RL by definition or not. Maybe they have muddied the waters, idk. Maybe it is a type of fuzzy RL that leads to better general performance but not to any absolute truth for any given position.

Okay, so it is a type of RL even though it exceeds the original definition of what RL is. Who am I to argue against that?
Mike Sherwin
Posts: 860
Joined: Fri Aug 21, 2020 1:25 am
Location: Planet Earth, Sol system
Full name: Michael J Sherwin

Re: Reinforcement learning project

Post by Mike Sherwin »

hgm wrote: Wed Feb 17, 2021 9:38 pm
Mike Sherwin wrote: Wed Feb 17, 2021 8:20 pm It is of course a type of learning but it is not Reinforcement Learning.
What you describe in the second paragraph is already how top engines work, with their aggressive late-move reductions. They search enormously deeper on the PV than on side branches. You just propose a reduction scheme that says: never deepen a deviation from the PV to more than 20 ply. Of course this means that you will never find a move that has a surprising result after 21 ply. So top engines usually do not impose such an absolute limit, but rather something like: reduce this to 1/3 of the PV depth. Then they eventually get to see everything, even though they have to search the PV 3 times deeper than a fixed-depth search would have done. See the discussion in the general forum section about the 5-mover. Stockfish and Komodo need 10 minutes. Myrddin solves it in 21 sec, Fairy-Max in 13 sec.
No, what I describe is a framework in which RT RL can be done.
hgm
Posts: 27793
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Reinforcement learning project

Post by hgm »

Mike Sherwin wrote: Wed Feb 17, 2021 10:10 pm Then you are extending the definition of what RL is. In RL there is an actor (child, dog, chess engine, etc.) that performs actions in the "world", whatever that world is.
Well, the 'action' here is playing a few thousand games. I don't see how that falls outside the definitions. Actions are typically complex, or they wouldn't be worth learning. If you want to teach a dog that he has to fetch a stick, you would also not reward him for every step he takes towards it; you give him a cookie after he has performed the entire task.
Tuning evaluation parameters the way you describe is decades old.
Well, reinforcement learning is as old as humanity.
Mike Sherwin
Posts: 860
Joined: Fri Aug 21, 2020 1:25 am
Location: Planet Earth, Sol system
Full name: Michael J Sherwin

Re: Reinforcement learning project

Post by Mike Sherwin »

hgm wrote: Wed Feb 17, 2021 10:33 pm
Mike Sherwin wrote: Wed Feb 17, 2021 10:10 pm Then you are extending the definition of what RL is. In RL there is an actor (child, dog, chess engine, etc.) that performs actions in the "world", whatever that world is.
Well, the 'action' here is playing a few thousand games. I don't see how that falls outside the definitions. Actions are typically complex, or they wouldn't be worth learning. If you want to teach a dog that he has to fetch a stick, you would also not reward him for every step he takes towards it; you give him a cookie after he has performed the entire task.
Tuning evaluation parameters the way you describe is decades old.
Well, reinforcement learning is as old as humanity.
"Well, the 'action' here is playing a few thousand games."

In RL an action is defined as any action the actor can take at any specific moment in time, like playing g4 on the first move and f3 on the second move. RL is very quickly going to reveal 2. f3 not to be any good. This supposes competence on the part of the adversary. So you do not do any learning at all for thousands of games, and then merely adjust, keep, or discard those parameters. That by itself won't lead to much, because it is too general. Stockfish only gained, what, a little over 100 Elo? Now if there were a set of parameters for every node ever reached in real games, and those nodes that apply had their parameters adjusted after every real game, then that would be far superior to just storing a modifiable valuation.

"If you want to teach a dog that he has to apport a stick you would also not award him for every step he takes towards it. You give it a cookie after he performed the entire task."

In chess the entire task is the next move, and the game record is many tasks performed in linear time. The main problem for a chess engine is often not knowing which move (or moves) led to the other side being able to win. That is why all the winning side's moves (good or bad) get a very small positive adjustment, and all the losing side's moves (good or bad) get a very small negative adjustment. Wrongly adjusted moves are corrected over time, provided the adversaries are competent.
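As a minimal sketch of that adjustment step (the table, the names and the size of the nudge are all made up for illustration):

#define LEARNSIZE (1 << 20)
#define LEARN_DELTA 2                    /* a couple of centipawns per game   */

typedef struct { unsigned long long key; int bias; } LearnEntry;
LearnEntry learnTab[LEARNSIZE];          /* per-position score nudges         */

/* keys[i] / stm[i] describe the position before move i of a finished,
   decisive game (call this only after wins/losses); winner is 0 or 1.        */
void LearnFromGame(const unsigned long long keys[], const int stm[],
                   int nMoves, int winner)
{
    for (int i = 0; i < nMoves; i++) {
        LearnEntry *e = &learnTab[keys[i] & (LEARNSIZE - 1)];
        if (e->key != keys[i]) {         /* new or collided slot: start fresh */
            e->key  = keys[i];
            e->bias = 0;
        }
        e->bias += (stm[i] == winner) ? LEARN_DELTA : -LEARN_DELTA;
    }
}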