Reinforcement learning project

Discussion of chess software programming and technical issues.

Moderators: hgm, Dann Corbit, Harvey Williamson

hgm
Posts: 26134
Joined: Fri Mar 10, 2006 9:06 am
Location: Amsterdam
Full name: H G Muller
Contact:

Re: Reinforcement learning project

Post by hgm » Thu Feb 18, 2021 8:26 am

Mike Sherwin wrote:
Wed Feb 17, 2021 10:46 pm
In RL an action is defined as any action the actor can do at any specific moment in time like playing g4 on the first move and f3 on the second move.
Says who? Certainly not the definitions I have seen. These are completely abstract, defining actions as (probabilistic) transitions between 'states' of an environment, without putting any requirement on what these states can be. For instance, identifying a 'state' with a pair of integers (nr_of_wins, nr_of_games_played) and an 'action' with playing a game using a certain strategy (e.g. a set of evaluation parameters) satisfies all requirements of the abstract model, just to name one counter-example to your claim.
In chess the entire task is the next move. And the game record is many tasks performed in linear time. The main problem for a chess engine is often not knowing which move (or moves) led to the other side being able to win. That is why all the winning side's moves (good or bad) get a very small positive adjustment. All the losing side's moves (good or bad) are given a very small negative adjustment. Wrongly adjusted moves are corrected over time, provided the adversaries are competent.
The problem is that game results in chess are probabilistic, and whether a strategy was good or bad can only be judged after sufficiently many games have been played to get a statistically significant result. Otherwise the learning process would mainly be driven by noise, and would converge only when the learning steps are made so small that they cannot significantly modify the strategy before sufficiently many games have been played to make their accumulated result statistically significant. In that limit it doesn't really make a difference whether I first collect many games under a constant strategy and then apply the sum of all the strategy corrections the individual games suggest, or apply them one by one.
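To make that last point concrete, a toy sketch (made-up numbers, not engine code) of why the two orders of applying corrections coincide for plain additive updates under a constant strategy:

Code: Select all

```c
/* For plain additive learning steps, applying each per-game correction
 * immediately or accumulating them over the whole batch and applying
 * the sum once yields the same parameter. */
double apply_one_by_one(double param, const double *delta, int n, double rate) {
    for (int i = 0; i < n; i++)
        param += rate * delta[i];   /* adjust after every game */
    return param;
}

double apply_batched(double param, const double *delta, int n, double rate) {
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += delta[i];            /* first collect all corrections */
    return param + rate * sum;      /* then apply their sum once */
}
```

Of course once the strategy is allowed to change between games the two orders differ; the argument is only about the small-step limit.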

Mike Sherwin
Posts: 281
Joined: Thu Aug 20, 2020 11:25 pm
Full name: Michael J Sherwin

Re: Reinforcement learning project

Post by Mike Sherwin » Thu Feb 18, 2021 12:15 pm

Reinforcement learning (RL) is an area of machine learning concerned with how intelligent agents (a chess engine) ought to take actions (a move) in an environment (rules of chess) in order to maximize the notion of cumulative reward (from previous experience).[1] Reinforcement learning is one of three basic machine learning paradigms, alongside supervised learning and unsupervised learning.

Reinforcement learning differs from supervised learning in not needing labelled input/output pairs to be presented, and in not needing sub-optimal actions to be explicitly corrected (moves getting a wrong adjustment immediately corrected). Instead the focus is on finding a balance between exploration (under an evaluator) (of uncharted territory) and exploitation (of current knowledge) (after accumulated RL adjustments).[2]

The environment is typically stated in the form of a Markov decision process (MDP), because many reinforcement learning algorithms for this context use dynamic programming techniques.[3] The main difference between the classical dynamic programming methods and reinforcement learning algorithms is that the latter do not assume knowledge (RL is only concerned with results, not general knowledge) of an exact mathematical model of the MDP, and they target large MDPs where exact methods become infeasible. (RL is not for improving an evaluator, because an evaluator can never be precise enough in all situations. The goal of RL is to learn over time to differentiate between immediate actions at a given moment in time.)

The problems of interest in reinforcement learning have also been studied in the theory of optimal control, which is concerned mostly with the existence and characterization of optimal solutions, and algorithms for their exact computation, and less with learning or approximation ... (RL is not about training an approximator. It is about finding optimal actions in exact circumstances)

A reinforcement learning agent interacts with its environment in discrete time steps (moves). At each time t, the agent receives the current state s_t and reward (from accumulated experience) r_t. It then chooses an action a_t from the set of available actions, which is subsequently sent to the environment (move made on board). The environment moves to a new state s_{t+1} and the reward r_{t+1} associated with the transition (s_t, a_t, s_{t+1}) is determined. The goal of a reinforcement learning agent is to learn a policy π : A × S → [0,1], π(a,s) = Pr(a_t = a | s_t = s), which maximizes the expected cumulative reward. (Its policy is to play the move that scores highest after the accumulated rewards are factored in.)

Two elements make reinforcement learning powerful: the use of samples (can be real-time samples of games or game segments, because they return information further away from the root) to optimize performance, and the use of function approximation (evaluation function) to deal with large environments (as in not solvable, like chess). Thanks to these two key components, reinforcement learning can be used in large environments in the following situations:

A model of the environment is known, but an analytic solution is not available; (like in chess)
Only a simulation model of the environment is given (the subject of simulation-based optimization);[5]
The only way to collect information about the environment is to interact with it. (Through real-time sampling)
The first two of these problems could be considered planning problems (since some form of model is available), while the last one could be considered to be a genuine learning problem. However, reinforcement learning converts both planning problems to machine learning problems.

Reinforcement learning requires clever exploration mechanisms (an AB search with an evaluator); randomly selecting actions, without reference to an estimated probability distribution, shows poor performance (Monte Carlo is not very good). The case of (small) finite Markov decision processes is relatively well understood. However, due to the lack of algorithms that scale well with the number of states (or scale to problems with infinite state spaces), simple exploration methods are the most practical. (Shallow fast searches that can intelligently examine the search space by playing many game segments under control of what it learns after each segment.)

The brute force approach entails two steps:

For each possible policy, sample returns while following it (a sample, a game segment, is taken for each possible action, i.e. move)
Choose the policy with the largest expected return (then play the move with the best adjusted score, and then another and another, to see the outcome further away in time)

Deep reinforcement learning
This approach extends reinforcement learning by using a deep neural network and without explicitly designing the state space.[29] The work on learning ATARI games by Google DeepMind increased attention to deep reinforcement learning or end-to-end reinforcement learning. So what AlphaZero and the like do is extend RL. That means that RL itself does not directly cause a differentiation between moves. Instead it trains a general approximator to be a bit stronger. The general approximator is NOT RL. It is an extension of RL. The approximator cannot contain learned rewards for each specific board position and therefore is no longer directly RL.
This is from Wikipedia. I never read it before. And yet it says exactly everything I have said albeit in more general terms. I acknowledge it is not the only source of information. But it makes very clear what RL is and what RL is not. It defines RL exactly how I defined it for myself when I reinvented it in the second half of 2005 based solely on what I remembered from high school about Pavlov's dog experiments.

hgm
Posts: 26134
Joined: Fri Mar 10, 2006 9:06 am
Location: Amsterdam
Full name: H G Muller
Contact:

Re: Reinforcement learning project

Post by hgm » Thu Feb 18, 2021 12:40 pm

Just to clear things up: the black text you quote is from Wikipedia. All the blue stuff sprang from your own imagination.

Note I am not denying that what you describe (where actions are moves, etc.) is a case of reinforcement learning. But it is just one particular example of it. An action can also be a game, or in my case, a match. It could even be swinging a baseball bat with a certain speed at a certain angle.

It seems to me that you want to usurp the term 'reinforcement learning', and narrow its meaning to a very specific case of what you are doing.

hgm
Posts: 26134
Joined: Fri Mar 10, 2006 9:06 am
Location: Amsterdam
Full name: H G Muller
Contact:

Re: Reinforcement learning project

Post by hgm » Fri Feb 19, 2021 8:20 pm

I came to realize that I overlooked one thing: apart from being forced into a defense-only role, because your total material is not enough to overwhelm the opponent's Palace, you can also voluntarily choose to defend. That then makes your own Pawns practically useless, but it also makes the opponent's Advisors useless, as he cannot use those for attack. But if you have few Pawns, and the opponent many Advisors, that could actually be a very good deal. E.g. suppose both players have two Advisors, a Rook, two Horses and an Elephant each. The opponent has 3 Pawns, and you only one. So in total material you are two Pawns behind. But by using all your material for defense it has become a battle between his 3 Pawns and your two Advisors, with otherwise equal material. Conventional piece values are such that an Advisor is worth more than a Pawn (and indeed two Advisors would easily draw three Pawns if they were the only material; you would need 5 Pawns to overcome such a defense). So you would really be less than a Pawn behind.

So the switchover to a 'defensive strategy', where you no longer count friendly Pawns and enemy Advisors, should not only occur when your own 'attack surplus' has dwindled below zero, but whenever the opponent's attack surplus is smaller than his total advantage. (And your attack surplus is smaller than his, for otherwise you could still play for a win.)
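As a condition in code, the rule might look like this (a hypothetical helper with illustrative names, not from any actual engine; 'attack surplus' being the material a side can actually use for attack):

Code: Select all

```c
/* Switch to the defensive evaluation when the opponent cannot convert
 * his full advantage into attack, and we cannot out-attack him either. */
int should_defend(int myAttackSurplus, int oppAttackSurplus, int oppTotalAdvantage) {
    return oppAttackSurplus < oppTotalAdvantage
        && myAttackSurplus < oppAttackSurplus;
}
```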

hgm
Posts: 26134
Joined: Fri Mar 10, 2006 9:06 am
Location: Amsterdam
Full name: H G Muller
Contact:

Re: Reinforcement learning project

Post by hgm » Sun Mar 07, 2021 9:49 pm

I finally seem to have the tuning process working correctly. There were lots of stupid errors, and because the whole process happened 'under the hood', I had a hard time identifying them. It also revealed a bad bug in the engine: the evaluation forgot to add the piece value of the Pawns. This could happen because Pawns need special treatment: since Pawns do not promote in Janggi, they essentially become worthless when they reach the last two ranks (where the King can sneak behind them), and extra Pawns on the 7th rank could also be useless.

I made a small program to generate synthetic training positions, which calculates the win probability of each based on some standard piece values, and then determines a win/loss outcome with that probability. On a file of 10,000 such positions the Texel tuner was able to extract piece values that were close to those used in the generation of the results (starting from all-zero piece values). Since the tuner currently works by random trial and error, keeping any improvement, it is excruciatingly slow. But at least I get a sensible result now. (No more negative Pawn values!)

I am now having the engine, with the Pawn-value bug fixed, self-play 3000 games at a nominal thinking 'time' of 100K nodes/move. This should produce about 250K quiet positions in a few hours. The plan is to first use this training set to optimize piece values and their dependence on game phase, keeping the other eval parameters fixed at the value they had during generation of the games. If the resulting piece values look reasonable, I will use those as a starting point for tuning the complete set of parameters with a smaller step. The current evaluation has 40 parameters (including the opening and end-game piece values). I suppose I will be able to use the same set of training positions also for tuning the weights of eval terms I have still to add, such as mobility. Many of the eval terms had been set to 0 for the self-play games anyway, to make sure the various values for each term would be equally sampled. For terms that are known to strongly affect the possibility to win I did not do that, though, otherwise too many games might end in a draw even in the presence of a winning advantage. (The comparable issue in orthodox Chess would be to not encourage pushing of Pawns, so that the engine at the fast TC used to generate the games would never realize it could promote them.)

I hope to have some results tomorrow.

hgm
Posts: 26134
Joined: Fri Mar 10, 2006 9:06 am
Location: Amsterdam
Full name: H G Muller
Contact:

Re: Reinforcement learning project

Post by hgm » Mon Mar 08, 2021 10:01 am

The first tuning attempt looks promising. For instance, the Palace PST for King and Advisors came out like

Code: Select all

King                Advisor
-12 -12 -12         37  37  37             2
 14  49  14         19  22  19             1
 33  33  33         7   7   7              0
 d   e   f          d   e   f
So it sees very well that the King should hide in the back of the Palace, behind Advisors. That it likes to be in the center of the Palace makes sense too, as it has 8 moves there instead of 3. During the games that generated the 235k training positions, these PSTs were completely neutral. (I guess I use one degree of freedom too many here; I could have fixed the rank-1 bonus at 0.)

Now I must improve the efficiency of the tuner, to be able to do the future experiments faster; so far it optimises the MSE in the result prediction by random trial and error. Of course directed methods like gradient descent would get there much faster. But the trial-and-error method was totally trivial to program, and I wanted to be sure there were no bugs there while there could be (and in fact were) bugs everywhere else. Now that I know this training set does have an optimum with reasonable parameter values, I can start to improve the tuner, to see how much I can speed it up. After that I can start adding other eval terms (e.g. mobility), and tune these too (on the same training set).

hgm
Posts: 26134
Joined: Fri Mar 10, 2006 9:06 am
Location: Amsterdam
Full name: H G Muller
Contact:

Re: Reinforcement learning project

Post by hgm » Sun Mar 14, 2021 9:52 pm

The training set seemed a bit small, probably also because the positions were not independent. So I generated a new training set from 10k games, sampled every 9 half-moves (instead of every 3). This gave me some 945k quiet positions.

The training set seemed a bit small, probably also because the positions were not independent, so I generated a new one; and I have added mobility to the evaluation: a bonus per non-capture for each piece type, a bonus for attacking something independent of piece type, and a bonus for protecting higher-valued and one for equal-valued pieces. (A Korean player suggested to me this could be important.) I did not want to diversify that by piece type (although the evaluation is written such that it can; it is just that the table is initialized with many identical values in it), because the tuner is still quite slow if I tune too many variables. For the same reason my PSTs are also generated from just a few free parameters, rather than having 90 (well, 50, because of symmetry) independent values for each square. This gave me 59 parameters, which I did optimize. This also includes a non-linear distortion of the total score, to create the proper trading incentive. I did a long optimization run on that, and the piece values that came out look very reasonable.
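In code, the new terms amount to something like this (illustrative placeholder weights and names, not the tuned values; in the real evaluation the non-capture bonus is per piece type, here initialized identically as described):

Code: Select all

```c
/* mobility terms: per-piece-type non-capture bonus, a flat attack
 * bonus, and protection bonuses split by the protected piece's value */
int mobilityBonus[7] = { 0, 2, 2, 2, 2, 2, 2 };  /* per non-capture, by type */
int attackBonus      = 3;                        /* attacking anything */
int protectHigher    = 4;                        /* protecting higher-valued */
int protectEqual     = 2;                        /* protecting equal-valued */

int mobilityScore(int type, int nonCaptures, int attacks,
                  int higherProtects, int equalProtects) {
    return mobilityBonus[type] * nonCaptures
         + attackBonus * attacks
         + protectHigher * higherProtects
         + protectEqual * equalProtects;
}
```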

Then I tried to tune complete PSTs (50 parameters each for R, C, H and E, 35 for P), while keeping the other 59 parameters constant. (Those used to determine the original PSTs would no longer be used anyway.) First I tuned the PST for each type in isolation. Since I started from the previously optimized parameter set, this of course lowered the MSE a bit. This gave very sensible results: Pawns become worth vastly more when they approach the Palace; when they enter it on d7 or e7 the PST value even saturates at 127 (I use signed char). Same for Horses and Elephants: squares from which these attack two Palace squares get the highest value, while in the corners of your own back rank you get a very large penalty, etc. So this seems to work.

When I combined all individually optimized PSTs, though, the MSE was significantly worse than with the parametrized tables I started from. This was a bit unexpected. It seems there is a correlation between different pieces being on good squares. Probably because a player who is winning can force all his pieces to good squares. (Or perhaps that is why he is winning.) So if I only optimize the Elephant PST, the Elephant starts to take credit for the Rook also being on above-average squares when the Elephant is. And vice versa. So when I then use both tables, both E and R get overweighted. I really must tune them all at once.

Fortunately tuning PST values is not as difficult as some of the others: each square is only rarely occupied, and thus contributes little. And the values are small enough and close enough to each other that they don't really expose the non-linearity of the sigmoid that translates scores to expected outcome. So I am doing an optimization run for all 235 PST values. And indeed the MSE quickly dropped below the best MSE I got when optimizing a single PST.

This is an intermediate result

Code: Select all

// Pawn, King, Advisor
             -3, -3, -3,-19,-19,-19,    21, 20, 40, 44, 67, 44, 40, 20, 21,
            -50,  7,-50, -3,-18, -3,    39, 45, 32, 36, 58, 36, 32, 45, 39,
            -98,-98,-98,  7,  7,  7,    12, 17,  9,101, 44,101,  9, 17, 12,
-33,-18,  9, 13, 41, 13,  9,-18,-33,     5, 26, 55, 65, 38, 65, 55, 26,  5,
-25,-30,  0, 15, 43, 15,  0,-30,-25,     7, -6, 21, 24, 33, 24, 21, -6,  7,
  7, -6, 21, 24, 33, 24, 21, -6,  7,   -25,-30,  0, 15, 43, 15,  0,-30,-25,
  5, 26, 55, 65, 38, 65, 55, 26,  5,   -33,-18,  9, 13, 41, 13,  9,-18,-33,
 12, 17,  9,101, 44,101,  9, 17, 12,     7,  7,  7,-98,-98,-98,
 39, 45, 32, 36, 58, 36, 32, 45, 39,    -3,-18, -3,-50,  7,-50,
 21, 20, 40, 44, 67, 44, 40, 20, 21,   -19,-19,-19, -3, -3, -3,
// Elephant
-10,-12, -9,  1, 28,  1, -9,-12,-10,    22, 12,  6, 22, 24, 22,  6, 12, 22,
-16,-23,-28,-20, 29,-20,-28,-23,-16,    12, 16, 10, 20, 15, 20, 10, 16, 12,
-12,-15, -6, 33, 22, 33, -6,-15,-12,     4,  5, 11, 10, 17, 10, 11,  5,  4,
-23,-26,  1, 13, 12, 13,  1,-26,-23,     0, 21, 10, 26,  2, 26, 10, 21,  0,
 -6, -5, 21, -2,  5, -2, 21, -5, -6,    17, 38, -9, -7, 18, -7, -9, 38, 17,
 17, 38, -9, -7, 18, -7, -9, 38, 17,    -6, -5, 21, -2,  5, -2, 21, -5, -6,
  0, 21, 10, 26,  2, 26, 10, 21,  0,   -23,-26,  1, 13, 12, 13,  1,-26,-23,
  4,  5, 11, 10, 17, 10, 11,  5,  4,   -12,-15, -6, 33, 22, 33, -6,-15,-12,
 12, 16, 10, 20, 15, 20, 10, 16, 12,   -16,-23,-28,-20, 29,-20,-28,-23,-16,
 22, 12,  6, 22, 24, 22,  6, 12, 22,   -10,-12, -9,  1, 28,  1, -9,-12,-10,
// Horse
-44,-45,-40,-41, -5,-41,-40,-45,-44,    10,  2,  5, 25, 47, 25,  5,  2, 10,
-46,-40,-25,-29,  4,-29,-25,-40,-46,    17, 15, 18, 31, 30, 31, 18, 15, 17,
-49,-38,  7,  4, -2,  4,  7,-38,-49,     7,  9, 35, 39, 37, 39, 35,  9,  7,
-20,-14,-14, -8,  9, -8,-14,-14,-20,     0, 24, 22, 49, 32, 49, 22, 24,  0,
-18, 24,  8, 11, 12, 11,  8, 24,-18,     9,  7, 34, 43, 31, 43, 34,  7,  9,
  9,  7, 34, 43, 31, 43, 34,  7,  9,   -18, 24,  8, 11, 12, 11,  8, 24,-18,
  0, 24, 22, 49, 32, 49, 22, 24,  0,   -20,-14,-14, -8,  9, -8,-14,-14,-20,
  7,  9, 35, 39, 37, 39, 35,  9,  7,   -49,-38,  7,  4, -2,  4,  7,-38,-49,
 17, 15, 18, 31, 30, 31, 18, 15, 17,   -46,-40,-25,-29,  4,-29,-25,-40,-46,
 10,  2,  5, 25, 47, 25,  5,  2, 10,   -44,-45,-40,-41, -5,-41,-40,-45,-44,
// Cannon
  1,-23,-11, 95, 79, 95,-11,-23,  1,   -42,-28,-28,  7, 51,  7,-28,-28,-42,
 -2, -3,  0, 50, 44, 50,  0, -3, -2,   -30,-24,-41,-27, 51,-27,-41,-24,-30,
-11, -8,  4, 32,116, 32,  4, -8,-11,   -19, -4, -5, 33, 61, 33, -5, -4,-19,
-12,-11,-23, 34, 40, 34,-23,-11,-12,    -8,-20,-15, 19, 62, 19,-15,-20, -8,
-25,-18,-14, 21, 72, 21,-14,-18,-25,    -2,  3,-10,  8, 61,  8,-10,  3, -2,
 -2,  3,-10,  8, 61,  8,-10,  3, -2,   -25,-18,-14, 21, 72, 21,-14,-18,-25,
 -8,-20,-15, 19, 62, 19,-15,-20, -8,   -12,-11,-23, 34, 40, 34,-23,-11,-12,
-19, -4, -5, 33, 61, 33, -5, -4,-19,   -11, -8,  4, 32,116, 32,  4, -8,-11,
-30,-24,-41,-27, 51,-27,-41,-24,-30,    -2, -3,  0, 50, 44, 50,  0, -3, -2,
-42,-28,-28,  7, 51,  7,-28,-28,-42,     1,-23,-11, 95, 79, 95,-11,-23,  1,
// Rook
-55,-50,-25, 32, 68, 32,-25,-50,-55,    27, 25, 35, 51, 38, 51, 35, 25, 27,
-48,-45,-16, 27, 60, 27,-16,-45,-48,   -26,-13,-12,  1, 37,  1,-12,-13,-26,
-27, -9, -2,  9, 63,  9, -2, -9,-27,    17, 33, 22,  2, 43,  2, 22, 33, 17,
-34,-21,-28, -8, 35, -8,-28,-21,-34,   -13,-13, 10,  2, 35,  2, 10,-13,-13,
-36, -6,-18, 11, 20, 11,-18, -6,-36,   -23,-29,-39,-20, 34,-20,-39,-29,-23,
-23,-29,-39,-20, 34,-20,-39,-29,-23,   -36, -6,-18, 11, 20, 11,-18, -6,-36,
-13,-13, 10,  2, 35,  2, 10,-13,-13,   -34,-21,-28, -8, 35, -8,-28,-21,-34,
 17, 33, 22,  2, 43,  2, 22, 33, 17,   -27, -9, -2,  9, 63,  9, -2, -9,-27,
-26,-13,-12,  1, 37,  1,-12,-13,-26,   -48,-45,-16, 27, 60, 27,-16,-45,-48,
 27, 25, 35, 51, 38, 51, 35, 25, 27,   -55,-50,-25, 32, 68, 32,-25,-50,-55,
//a   b   c   d   e   f   g   h   i      a   b   c   d   e   f   g   h   i
We see for instance that the Elephant gets a high bonus for being on b5/h5.

Joost Buijs
Posts: 1296
Joined: Thu Jul 16, 2009 8:47 am
Location: Almere, The Netherlands

Re: Reinforcement learning project

Post by Joost Buijs » Mon Mar 15, 2021 10:56 am

What do you use as optimizer? Is it still a Monte Carlo approach or something like SGD?

For tuning the weights of the HCE of my chess engine I use Adam because it converges very fast. Recently I found that when the MSE gets very low it can be worthwhile to switch over from Adam to SGD with momentum. Adam sometimes gets stuck in a local minimum.

hgm
Posts: 26134
Joined: Fri Mar 10, 2006 9:06 am
Location: Amsterdam
Full name: H G Muller
Contact:

Re: Reinforcement learning project

Post by hgm » Mon Mar 15, 2021 12:19 pm

I wrote something myself, mainly based on MC, with some improvements. One is that it samples not centered on the current best, but first adds the direction of the previous successful step to it. Also, when the trial step makes the MSE worse, it tries the exact opposite step next. The difference between the MSE in the two opposite directions is an estimate of the derivative in that direction. It uses these estimates to guess the gradient, and then biases the steps in the direction of that gradient. For the PST that did not work very well, though.
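A stripped-down 1-D version of that sampling scheme (without the gradient biasing; the objective f stands in for the MSE, and everything here is illustrative):

Code: Select all

```c
#include <stdlib.h>

/* Trial steps are centered on current-best plus the previous successful
 * step; a failed step is immediately followed by its exact opposite. */
double optimize(double (*f)(double), double x, int iters) {
    double prevStep = 0.0, best = f(x);
    for (int i = 0; i < iters; i++) {
        double step = prevStep + ((double)rand() / RAND_MAX - 0.5);
        double e = f(x + step);
        if (e >= best) {             /* failed: try the opposite step */
            step = -step;
            e = f(x + step);
        }
        if (e < best) { best = e; x += step; prevStep = step; }
        else prevStep = 0.0;         /* both directions failed */
    }
    return x;
}

/* example objective standing in for the MSE */
double parabola(double x) { return (x - 3.0) * (x - 3.0); }
```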

Joost Buijs
Posts: 1296
Joined: Thu Jul 16, 2009 8:47 am
Location: Almere, The Netherlands

Re: Reinforcement learning project

Post by Joost Buijs » Mon Mar 15, 2021 4:03 pm

This looks very much like what I was doing 4 years ago when I first started with Texel tuning. It did the job; the only drawback is that it was slow: tuning about 800 weights from scratch (including PSTs) with 4 million examples took almost a day on my old 10-core machine. When I tuned from scratch I kept the pawn value in the opening fixed at 70 cp; after the tuning process had finished, the other piece values always looked very reasonable to me.

I used to keep the PSTs averaged at zero; in fact piece values are not needed at all if you incorporate them into the PSTs.
This is what they look like; the pawn value is now 69 cp because I usually re-tuned without keeping the pawn value fixed.

Code: Select all

// opening
int valuePawnMG = 69;
int valueKnightMG = 324;
int valueBishopMG = 340;
int valueRookMG = 514;
int valueQueenMG = 1053;

// endgame
int valuePawnEG = 117;
int valueKnightEG = 337;
int valueBishopEG = 368;
int valueRookEG = 624;
int valueQueenEG = 1198;
It needs good sample data preparation though; this is the most time-consuming part.
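The folding itself is trivial; a sketch (hypothetical names and table size, not my engine's code):

Code: Select all

```c
/* Fold a piece value into its PST: shift the table to zero mean, then
 * add the piece value to every square. Afterwards eval can use pst[sq]
 * alone, with no separate piece value; the value can be recovered as
 * the table's average. */
#define NSQ 64

void fold_value(int pst[NSQ], int pieceValue) {
    int sum = 0;
    for (int i = 0; i < NSQ; i++) sum += pst[i];
    int avg = sum / NSQ;                       /* current table average */
    for (int i = 0; i < NSQ; i++) pst[i] += pieceValue - avg;
}
```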
