Positional learning

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bhlangonijr
Posts: 482
Joined: Thu Oct 16, 2008 4:23 am
Location: Milky Way

Positional learning

Post by bhlangonijr »

What is the common approach to implementing persistent positional learning in a chess engine?
I want to try it out in my engine, and I was thinking about writing code that could automatically adjust PST values and other factors after each game and then save them to a "learning" file. Obviously the learning algorithm which adjusts the parameters is the most important part here.

Is there any open source engine which implements positional learning?

The main question is: How would you do that? Any thoughts?

Thanks,
UncombedCoconut
Posts: 319
Joined: Fri Dec 18, 2009 11:40 am
Location: Naperville, IL

Re: Positional learning

Post by UncombedCoconut »

This made me think of Sjeng 11.2 by Gian-Carlo Pascutto (a dated but open-source version of his engine). But now that I look at the code, its learning seems to be just a flat file of TT entries with a capacity of 50000. Crafty's learning looks much more sophisticated; it can use search to build opening books and use game outcomes to judge in-game positions.
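For the curious, that kind of learn file is essentially a fixed-capacity array of scored positions dumped to disk and probed like a small auxiliary hash table. A minimal sketch of the shape (the struct layout and all names are my own guesses for illustration, not Sjeng's actual on-disk format):

Code: Select all

#include <cstdint>
#include <cstdio>

// Hypothetical learn-file record: a fixed-capacity flat file of TT-style entries.
// Fields and names are illustrative, not Sjeng's real format.
struct LearnEntry {
    uint64_t key;    // Zobrist hash of the position
    int16_t  score;  // score corrected by game outcomes
    int8_t   depth;  // depth at which the score was obtained
};

const int LEARN_CAPACITY = 50000;

// The whole table is written out after a game and read back at startup.
bool save_learn_file(const char* path, const LearnEntry* entries, int count) {
    std::FILE* f = std::fopen(path, "wb");
    if (!f) return false;
    std::fwrite(entries, sizeof(LearnEntry), count, f);
    std::fclose(f);
    return true;
}

int load_learn_file(const char* path, LearnEntry* entries) {
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return 0;
    int count = (int)std::fread(entries, sizeof(LearnEntry), LEARN_CAPACITY, f);
    std::fclose(f);
    return count;
}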
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Positional learning

Post by Don »

bhlangonijr wrote: What is the common approach to implementing persistent positional learning in a chess engine?
I want to try it out in my engine, and I was thinking about writing code that could automatically adjust PST values and other factors after each game and then save them to a "learning" file. Obviously the learning algorithm which adjusts the parameters is the most important part here.

Is there any open source engine which implements positional learning?

The main question is: How would you do that? Any thoughts?

Thanks,
An interesting and perhaps workable approach could be based on something H.G. Muller proposed, which he called orthogonal multi-testing. The primary characteristic of this approach is that you can test several things simultaneously. Let's say you had 50 parameters to be tuned and you wanted to try one of two different values for each of them. You could play, for example, 10,000 games where in each game each player gets a random combination of the 50 settings.

After the 10,000 games, about half of them will have a given parameter set one way and half will have it set the other way. You are interested in the games where one side has the parameter set one way and its opponent has it set the other way, which will be about half the games on average. The score in those implicit matches gives you an indication of the direction of adjustment. What is good about this idea is that you are testing a huge number of evaluation terms at the same time, so in some sense it's orders of magnitude more efficient than testing a single change with a dedicated auto-test.
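To make the bookkeeping concrete, here is a minimal sketch of how the per-parameter tallies might be accumulated, assuming two candidate values per parameter; all names and types are illustrative, not H.G.'s actual setup:

Code: Select all

#include <vector>

// Illustrative bookkeeping for orthogonal multi-testing (all names are made up).
// Each of the 50 parameters has two candidate values, A and B; before every game
// each player is dealt a random bit per parameter selecting A or B.
const int NUM_PARAMS = 50;

struct ParamTally {
    double scoreA = 0.0, scoreB = 0.0;  // accumulated game points for each candidate
    int games = 0;                      // games where the two sides disagreed on this parameter
};

// Called once per finished game. whiteResult is 1.0 / 0.5 / 0.0 from White's view.
// Only games where the two sides actually differ on a parameter say anything
// about it, which is about half the games on average.
void record_game(std::vector<ParamTally>& tally,
                 const int white[NUM_PARAMS], const int black[NUM_PARAMS],
                 double whiteResult) {
    for (int i = 0; i < NUM_PARAMS; ++i) {
        if (white[i] == black[i]) continue;  // this game says nothing about parameter i
        double resultA = (white[i] == 0) ? whiteResult : 1.0 - whiteResult;
        tally[i].scoreA += resultA;
        tally[i].scoreB += 1.0 - resultA;
        tally[i].games++;
    }
}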

This does not meet your criteria, however. To make it persistent and incremental you could do something very similar. Each positional term that you want to be adjustable has a canonical value, but before every game you modify all values slightly, either one step lower or one step higher. For example, if your bishop mobility value is 8, you change it to either 7 or 9, uniformly at random.

After the game is complete, the result can be viewed as a vote in favor of, or against, ALL the adjustments you made. You could accumulate the win/loss/draw record for each term (or use floating-point values) and adjust the canonical value of a term once enough evidence has accumulated in favor of a change. To give another example, let's say you are using 8 for the bishop and testing 7 and 9, and at some point 9 gains a 50-game lead over 7. Then you might decide to start using 9 as the "canonical" or reference value from then on.
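A sketch of that perturb-and-vote cycle, using the 50-game lead from the example as the promotion threshold; the struct and names are illustrative, and the canonical value plus the running lead are what you would persist in the "learning" file between games:

Code: Select all

#include <cstdlib>

// Illustrative sketch of the persistent variant (names and threshold are made up).
struct LearnedTerm {
    int canonical;  // current reference value, e.g. 8 for a bishop mobility weight
    int trial;      // value actually used this game: canonical - 1 or canonical + 1
    int lead;       // net wins of (canonical + 1) over (canonical - 1)
};

// Before each game: nudge every term one step up or down, uniformly at random.
void perturb(LearnedTerm& t) {
    t.trial = t.canonical + ((rand() & 1) ? 1 : -1);
}

// After each game: the result is a vote for or against the direction we tried.
// result is 1.0 / 0.5 / 0.0 from this engine's point of view; draws vote for neither.
void vote(LearnedTerm& t, double result) {
    int dir = (t.trial > t.canonical) ? 1 : -1;
    if (result == 1.0) t.lead += dir;
    if (result == 0.0) t.lead -= dir;
    // Once one direction has built up a 50-game lead, shift the canonical value
    // (e.g. 9 overtakes 7, so 9 becomes the new reference) and start counting again.
    if (t.lead >= 50)  { t.canonical++; t.lead = 0; }
    if (t.lead <= -50) { t.canonical--; t.lead = 0; }
}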

Like every learning method, this one has strengths and weaknesses. In multi-testing one of the weaknesses is the assumption that the weight of one term has no correlation with the weight of another term. You will find that a LOT of learning algorithms have this same assumption built into them to one degree or another. In chess I'm sure there is correlation, but it's fairly weak (in my opinion) in most cases, especially when you are talking about minor adjustments. Even where there is correlation, the method may be somewhat self-correcting, as no term is likely to get far out of line.

Another downside is that you are never using the "ideal" values while learning is on. This method is basically a "try it and test it" approach, so every time you play a game the program is a "different" program.

I'm sure there are many improvements to the basic idea I outlined above, and of course I give no guarantees about how it will perform. It will require many games to make changes that are meaningful, but I don't know of ANY learning method that does not require a lot of games to produce meaningful results.

Another approach that may be more in line with what you want is called "temporal difference learning", and it is based on feedback from each move to the move that precedes it. For example, if you play move 35 and the program thinks the position is equal, but then on move 36 you find that you are winning a pawn, that indicates the evaluation at move 35 was in error: the position was better than the program thought. Tiny incremental adjustments are made to the evaluation function so that it is biased ever so slightly toward being more positive in this case, or more negative in the case where the score is dropping. This is done recursively back through the moves of the game, so that winning the game gives some credit to all the positions in it. Look on the web and read up on the "credit assignment problem" and temporal difference learning. It's probably ideal for what you are looking for. It can be done once at the end of the game, with the scores then updated. If you are not using floating-point evaluation, you may have to figure out how to adapt this to make it workable.
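For illustration, here is a minimal end-of-game TD(lambda) pass, assuming a linear evaluation (so the gradient with respect to the weights is just the feature vector) and evals squashed onto the same 0-to-1 scale as the game result; alpha, lambda, and all names are made up for the sketch:

Code: Select all

#include <vector>

// One TD(lambda) pass over a finished game, assuming a linear evaluation
// eval(s) = sum_i w[i] * x[i](s) mapped to the same 0..1 scale as the result.
void td_lambda(std::vector<double>& w,
               const std::vector<std::vector<double>>& x,  // feature vector per position
               const std::vector<double>& v,               // engine's eval of each position, in 0..1
               double result,                              // final game result: 1.0 / 0.5 / 0.0
               double alpha = 0.001, double lambda = 0.7)
{
    std::vector<double> trace(w.size(), 0.0);  // eligibility trace: decaying credit per weight
    for (size_t t = 0; t < v.size(); ++t) {
        // For a linear eval, the gradient with respect to w is the feature vector itself.
        for (size_t i = 0; i < w.size(); ++i)
            trace[i] = lambda * trace[i] + x[t][i];
        // Each position's target is the next position's eval; the last one's is the result.
        double target = (t + 1 < v.size()) ? v[t + 1] : result;
        double error  = target - v[t];  // "move 36 shows move 35's eval was off"
        for (size_t i = 0; i < w.size(); ++i)
            w[i] += alpha * error * trace[i];
    }
}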
zamar
Posts: 613
Joined: Sun Jan 18, 2009 7:03 am

Re: Positional learning

Post by zamar »

Don wrote: Another approach that may be more in line with what you want is called "temporal difference learning", and it is based on feedback from each move to the move that precedes it. For example, if you play move 35 and the program thinks the position is equal, but then on move 36 you find that you are winning a pawn, that indicates the evaluation at move 35 was in error: the position was better than the program thought. Tiny incremental adjustments are made to the evaluation function so that it is biased ever so slightly toward being more positive in this case, or more negative in the case where the score is dropping. This is done recursively back through the moves of the game, so that winning the game gives some credit to all the positions in it. Look on the web and read up on the "credit assignment problem" and temporal difference learning. It's probably ideal for what you are looking for. It can be done once at the end of the game, with the scores then updated. If you are not using floating-point evaluation, you may have to figure out how to adapt this to make it workable.
The obvious problem here is that we are tuning the evaluation function using sane positions from real games. However, most of the time the evaluation function deals with completely crazy positions, which result from one side trying to NULL_MOVE as much as possible and the other side trying to play every possible move.

That's why I have great doubts that any learning system based on positions from real games can ever work effectively. (I've tried some, but all have failed miserably.)
Joona Kiiski
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Positional learning

Post by Don »

zamar wrote:
Don wrote: Another approach that may be more in line with what you want is called "temporal difference learning", and it is based on feedback from each move to the move that precedes it. For example, if you play move 35 and the program thinks the position is equal, but then on move 36 you find that you are winning a pawn, that indicates the evaluation at move 35 was in error: the position was better than the program thought. Tiny incremental adjustments are made to the evaluation function so that it is biased ever so slightly toward being more positive in this case, or more negative in the case where the score is dropping. This is done recursively back through the moves of the game, so that winning the game gives some credit to all the positions in it. Look on the web and read up on the "credit assignment problem" and temporal difference learning. It's probably ideal for what you are looking for. It can be done once at the end of the game, with the scores then updated. If you are not using floating-point evaluation, you may have to figure out how to adapt this to make it workable.
The obvious problem here is that we are tuning the evaluation function using sane positions from real games. However, most of the time the evaluation function deals with completely crazy positions, which result from one side trying to NULL_MOVE as much as possible and the other side trying to play every possible move.

That's why I have great doubts that any learning system based on positions from real games can ever work effectively. (I've tried some, but all have failed miserably.)
I feel the same, but I wanted to present some options. Off-line learning is probably more feasible.
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Positional learning

Post by diep »

bhlangonijr wrote: What is the common approach to implementing persistent positional learning in a chess engine?
I want to try it out in my engine, and I was thinking about writing code that could automatically adjust PST values and other factors after each game and then save them to a "learning" file. Obviously the learning algorithm which adjusts the parameters is the most important part here.

Is there any open source engine which implements positional learning?

The main question is: How would you do that? Any thoughts?

Thanks,
There is a paper by Marty Hirsch about this for opening books. Very interesting approach. It's a very vague type of business to be in.
bhlangonijr
Posts: 482
Joined: Thu Oct 16, 2008 4:23 am
Location: Milky Way

Re: Positional learning

Post by bhlangonijr »

Don wrote:
zamar wrote:
Don wrote: Another approach that may be more in line with what you want is called "temporal difference learning", and it is based on feedback from each move to the move that precedes it. For example, if you play move 35 and the program thinks the position is equal, but then on move 36 you find that you are winning a pawn, that indicates the evaluation at move 35 was in error: the position was better than the program thought. Tiny incremental adjustments are made to the evaluation function so that it is biased ever so slightly toward being more positive in this case, or more negative in the case where the score is dropping. This is done recursively back through the moves of the game, so that winning the game gives some credit to all the positions in it. Look on the web and read up on the "credit assignment problem" and temporal difference learning. It's probably ideal for what you are looking for. It can be done once at the end of the game, with the scores then updated. If you are not using floating-point evaluation, you may have to figure out how to adapt this to make it workable.
The obvious problem here is that we are tuning the evaluation function using sane positions from real games. However, most of the time the evaluation function deals with completely crazy positions, which result from one side trying to NULL_MOVE as much as possible and the other side trying to play every possible move.

That's why I have great doubts that any learning system based on positions from real games can ever work effectively. (I've tried some, but all have failed miserably.)
I feel the same, but I wanted to present some options. Off-line learning is probably more feasible.
Thanks for the explanations. I understand that the on-line approach is not going to be effective, but I wanted to hear what the common approach is among the people who have implemented it.
bhlangonijr
Posts: 482
Joined: Thu Oct 16, 2008 4:23 am
Location: Milky Way

Re: Positional learning

Post by bhlangonijr »

zamar wrote:
Don wrote: Another approach that may be more in line with what you want is called "temporal difference learning", and it is based on feedback from each move to the move that precedes it. For example, if you play move 35 and the program thinks the position is equal, but then on move 36 you find that you are winning a pawn, that indicates the evaluation at move 35 was in error: the position was better than the program thought. Tiny incremental adjustments are made to the evaluation function so that it is biased ever so slightly toward being more positive in this case, or more negative in the case where the score is dropping. This is done recursively back through the moves of the game, so that winning the game gives some credit to all the positions in it. Look on the web and read up on the "credit assignment problem" and temporal difference learning. It's probably ideal for what you are looking for. It can be done once at the end of the game, with the scores then updated. If you are not using floating-point evaluation, you may have to figure out how to adapt this to make it workable.
The obvious problem here is that we are tuning the evaluation function using sane positions from real games. However, most of the time the evaluation function deals with completely crazy positions, which result from one side trying to NULL_MOVE as much as possible and the other side trying to play every possible move.

That's why I have great doubts that any learning system based on positions from real games can ever work effectively. (I've tried some, but all have failed miserably.)
Maybe you haven't tried hard enough? :)

IMO the problem is that we are trying to fit a learning system into the current evaluation structure, where all evaluation terms are combined linearly to produce the static evaluation. With all these human-like evaluation terms, a learning system is really very unlikely to come up with a good balance between them.
I think a better approach would be to take the many evaluation terms as we know them and translate them into more "fundamental" rules. I can imagine doing that for pawn structure, for example. Then it would be more feasible to model a workable learning system.
OliverUwira
Posts: 170
Joined: Mon Sep 13, 2010 9:57 am
Location: Frankfurt am Main

Re: Positional learning

Post by OliverUwira »

bhlangonijr wrote: IMO the problem is that we are trying to fit a learning system into the current evaluation structure, where all evaluation terms are combined linearly to produce the static evaluation. With all these human-like evaluation terms, a learning system is really very unlikely to come up with a good balance between them.
I think a better approach would be to take the many evaluation terms as we know them and translate them into more "fundamental" rules. I can imagine doing that for pawn structure, for example. Then it would be more feasible to model a workable learning system.
I'm trying to go in a similar direction. In fact I haven't written a line of code for four weeks, but I have spent quite some time brainstorming ideas about evaluation design.

The approach I want to try is expressing positional characteristics as functional relations.

Take bishops, for example. Their strength increases when pawns come off the board and decreases with pawns blocked or hemmed in on their colour. I try to model notions like this as a term V = Base * F(x). In the above cases, x might be the number of pawns or, respectively, the number of blocked pawns.

In the case of blocked pawns I would use a logarithmic function, so that the penalty does not grow too aggressively. One blocked pawn on the wrong colour is often enough to give the other side a serious advantage (e.g. in ram positions, i.e. isolated pawns on d4/d5 and light-squared bishops), but if there are more, the "marginal utility" is likely diminishing.

The idea I have is that if the F(x) are continuously differentiable, the feature sum will be too, which should facilitate the application of curve-fitting algorithms and similar techniques, maybe even TDL.
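A minimal sketch of one such term, with a log-shaped F and a made-up base weight:

Code: Select all

#include <cmath>

// One V = Base * F(x) term with a logarithmic F (the constant is illustrative).
// log1p grows fastest at the first step and then flattens out, matching the idea
// that the first blocked pawn on the bishop's colour hurts most and further ones
// have diminishing marginal cost. F is also smooth in x, which keeps the feature
// sum friendly to curve fitting and TDL.
double blocked_pawn_penalty(int blockedOnBishopColour) {
    const double base = 20.0;  // centipawns, made up for the example
    return base * std::log1p(static_cast<double>(blockedOnBishopColour));
}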

I'm quite excited to find out how this approach is going to work out :D
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Positional learning

Post by Don »

bhlangonijr wrote:
zamar wrote:
Don wrote: Another approach that may be more in line with what you want is called "temporal difference learning", and it is based on feedback from each move to the move that precedes it. For example, if you play move 35 and the program thinks the position is equal, but then on move 36 you find that you are winning a pawn, that indicates the evaluation at move 35 was in error: the position was better than the program thought. Tiny incremental adjustments are made to the evaluation function so that it is biased ever so slightly toward being more positive in this case, or more negative in the case where the score is dropping. This is done recursively back through the moves of the game, so that winning the game gives some credit to all the positions in it. Look on the web and read up on the "credit assignment problem" and temporal difference learning. It's probably ideal for what you are looking for. It can be done once at the end of the game, with the scores then updated. If you are not using floating-point evaluation, you may have to figure out how to adapt this to make it workable.
The obvious problem here is that we are tuning the evaluation function using sane positions from real games. However, most of the time the evaluation function deals with completely crazy positions, which result from one side trying to NULL_MOVE as much as possible and the other side trying to play every possible move.

That's why I have great doubts that any learning system based on positions from real games can ever work effectively. (I've tried some, but all have failed miserably.)
Maybe you haven't tried hard enough? :)
You added a smiley there, but I think there is a lot to that. Generally we give up on most ideas after we have invested a certain amount of time, and then we move on to something else. I think if we spent more time we could make a lot of ideas work, unless of course we can prove there is something unsound about the general idea we are working on, and I don't think that is the case here.
IMO the problem is that we are trying to fit a learning system into the current evaluation structure, where all evaluation terms are combined linearly to produce the static evaluation. With all these human-like evaluation terms, a learning system is really very unlikely to come up with a good balance between them.
I think a better approach would be to take the many evaluation terms as we know them and translate them into more "fundamental" rules. I can imagine doing that for pawn structure, for example. Then it would be more feasible to model a workable learning system.
Very few learning systems actually figure out new concepts on their own; they start with raw materials we provide them - stuff that has to be engineered by hand. It would be refreshing to see something different - something that can figure things out on its own and create new concepts.

The concepts we use are arbitrary rules of thumb that we made up anyway, based on observation. They are not first principles, even though we think of them as the fundamentals. The only fundamental is trying to mate the opponent, and everything else flows backwards from that. That is easier to do if you have more material, better mobility, etc.