test positions for texel tuning

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Ferdy
Posts: 4833
Joined: Sun Aug 10, 2008 3:15 pm
Location: Philippines

Re: test positions for texel tuning

Post by Ferdy »

brtzsnr wrote: Are there any other ideas I could try to pick better positions for tuning?
1. Take Frank's pgn; the games are played at a time control of 40 moves in 10 minutes on good hardware.
2. Separate it by ECO code and try to balance the number of games in every ECO.
3. Extract positions with a move score <= 6 pawns and >= -0.5 pawns where the side to move went on to win. (Be aware that the move score in that pgn is from White's POV.)
4. Also extract positions from drawn games, around 4 to 6 positions per game, just so the tuner also sees that, for example, not every two-bishop advantage is a win.
5. If the engine under tuning uses the qsearch score and does not generate check evasions in qsearch, then extract only positions where the side to move is not in check.

Also take positions from other sources, such as games started from 2-move, 3-move, 4-move, 5-move, 6-move, 7-move, 8-move, 9-move and 10-move opening positions; Frank's pgn is probably in the range of 10 to 14 book moves. This way you get good coverage of positions.
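A minimal sketch of that kind of extraction filter, covering points 3 and 5 of the list above, assuming python-chess and engine scores stored in the PGN comments in pawns from White's point of view (the comment parsing and the exact thresholds are assumptions, not Frank's actual format):

```python
# Sketch of an extraction filter for points 3 and 5 above.
# Assumes python-chess and PGN move comments containing an engine score
# in pawns from White's point of view; the regex is a placeholder.
import re
import chess
import chess.pgn

SCORE_RE = re.compile(r"[-+]?\d+\.\d+")

def extract_positions(pgn_path, max_score=6.0, min_score=-0.5):
    positions = []
    with open(pgn_path) as f:
        while True:
            game = chess.pgn.read_game(f)
            if game is None:
                break
            result = game.headers.get("Result", "*")
            for node in game.mainline():
                board = node.board()
                match = SCORE_RE.search(node.comment or "")
                if match is None:
                    continue
                score_white = float(match.group())  # white POV, in pawns
                score_stm = score_white if board.turn == chess.WHITE else -score_white
                stm_won = (result == "1-0" and board.turn == chess.WHITE) or \
                          (result == "0-1" and board.turn == chess.BLACK)
                # keep not-lost, not-too-won positions where the side to move won,
                # and skip in-check positions if qsearch has no check evasions
                if stm_won and min_score <= score_stm <= max_score and not board.is_check():
                    positions.append(board.fen())
    return positions
```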

From experience with the Texel method, the move-eval method and Stefano's GA method, these positions give an improvement in Deuterium's performance.
brtzsnr
Posts: 433
Joined: Fri Jan 16, 2015 4:02 pm

Re: test positions for texel tuning

Post by brtzsnr »

I did some experiments and recorded them at the link below. For Zurichess I get the best results when I drop opening positions in favor of late endgame positions.

The last tests should finish tonight, so let me know if there are any ideas I missed that I should put to the test.

https://goo.gl/S5lk46
matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: test positions for texel tuning

Post by matthewlai »

brtzsnr wrote: I did some experiments and recorded them at the link below. For Zurichess I get the best results when I drop opening positions in favor of late endgame positions.

The last tests should finish tonight, so let me know if there are any ideas I missed that I should put to the test.

https://goo.gl/S5lk46
It would make sense to drop positions from the opening, since small eval differences in the opening probably have almost zero impact on the result of the game.

However, if you drop those positions, the engine is no longer tuned for good opening play.

IMHO, a more generic (and more powerful) approach is temporal-difference learning. In positions close to the endgame it works similarly to TTM (the Texel tuning method), but in middlegame and opening positions it instead trains the evaluator to predict its own prediction further in the future, which is more meaningful.

Most games are decided by one or two important moves. Training all the positions that come before those key moves toward the game result just adds noise. TD learning, on the other hand, can capture this successfully (only the moves immediately before the key moves will get high error signals).
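For illustration, here is a rough sketch of the difference between the two kinds of targets; evaluate() is a stand-in for any eval mapped to an expected score in [0, 1], and the lambda value is arbitrary (none of this is taken from an actual TTM or TD implementation):

```python
# Illustrative contrast: result-based targets train every position toward the
# final game outcome; TD(lambda)-style targets train each position toward a
# blend of the evaluator's own later predictions and the final outcome.
# Side-to-move sign handling is ignored for clarity.

def result_targets(positions, result):
    return [(pos, result) for pos in positions]

def td_lambda_targets(positions, result, evaluate, lam=0.7):
    preds = [evaluate(pos) for pos in positions]
    targets = []
    for t in range(len(positions)):
        target, weight = 0.0, 1.0
        # lambda-weighted mix of the predictions at later positions...
        for k in range(t + 1, len(positions)):
            target += (1.0 - lam) * weight * preds[k]
            weight *= lam
        # ...with the remaining weight assigned to the final game result
        target += weight * result
        targets.append((positions[t], target))
    return targets
```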
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
tpetzke
Posts: 686
Joined: Thu Mar 03, 2011 4:57 pm
Location: Germany

Re: test positions for texel tuning

Post by tpetzke »

If I understood right, you are using self-play games from version N against version N. I tried to avoid that because it probably gives you the highest draw rate you can get; I used N vs N-1, which still has a high but somewhat lower draw rate. If you have too many drawn games in your set the tuner cannot work at all, and if you have too few it does not work very well. As the draw rate in the resulting game set is still over 50%, I did not pick positions from drawn games with the same probability as from won games; I use a split of roughly 60% won/lost and 40% drawn.
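A minimal sketch of that kind of split, assuming the positions have already been extracted and tagged by game result (the exact ratio and set size below are just placeholders):

```python
import random

def sample_with_result_split(decisive_positions, drawn_positions,
                             total=1_000_000, draw_share=0.4, seed=0):
    # Build a tuning set with roughly 60% positions from won/lost games and
    # 40% from drawn games, instead of sampling proportionally to a game set
    # whose draw rate is above 50%.
    rng = random.Random(seed)
    n_draw = min(int(total * draw_share), len(drawn_positions))
    n_decisive = min(total - n_draw, len(decisive_positions))
    sample = rng.sample(drawn_positions, n_draw) + rng.sample(decisive_positions, n_decisive)
    rng.shuffle(sample)
    return sample
```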

It might also be beneficial to use the verification games, where the tuned engine plays against the original, to form another tuning set. This gives the tuner some feedback about where it tuned the weights poorly in the first run.

However, in my experience TTM is only able to improve your eval up to a certain limit. The better your eval gets, the higher the chance that TTM will decrease the error on the tuning set but make the engine play worse.
Thomas...

=======
http://macechess.blogspot.com - iCE Chess Engine
Evert
Posts: 2929
Joined: Sat Jan 22, 2011 12:42 am
Location: NL

Re: test positions for texel tuning

Post by Evert »

tpetzke wrote: If I understood right, you are using self-play games from version N against version N. I tried to avoid that because it probably gives you the highest draw rate you can get; I used N vs N-1, which still has a high but somewhat lower draw rate. If you have too many drawn games in your set the tuner cannot work at all, and if you have too few it does not work very well. As the draw rate in the resulting game set is still over 50%, I did not pick positions from drawn games with the same probability as from won games; I use a split of roughly 60% won/lost and 40% drawn.
What do you do for positions that occur in more than one game?
Do you just include them multiple times, or do you include them once with an outcome that is just a (weighted?) average of the outcomes of the individual games (say, a position that is won for white 70% of the time and drawn 30% of the time gets an outcome of 0.7)?
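For what it's worth, collapsing duplicates into one entry with an averaged outcome could look something like this (a sketch only; the (fen, outcome) representation, one pair per game the position occurred in, is an assumption):

```python
from collections import defaultdict

def merge_duplicates(labelled_positions):
    # labelled_positions: iterable of (fen, outcome) pairs, one per game the
    # position occurred in. Returns one entry per unique position with the
    # average outcome over those games.
    totals = defaultdict(lambda: [0.0, 0])
    for fen, outcome in labelled_positions:
        totals[fen][0] += outcome
        totals[fen][1] += 1
    return [(fen, score_sum / count) for fen, (score_sum, count) in totals.items()]
```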
However, in my experience TTM is only able to improve your eval up to a certain limit. The better your eval gets, the higher the chance that TTM will decrease the error on the tuning set but make the engine play worse.
Does it matter how many evaluation terms there are to optimise?
I would guess that for this method more is better (should reduce the chance of overfitting the data, right?) but I don't know that.

I still haven't tried the method myself, but I'm planning to at some point. Discussions such as this are very useful for getting an idea of the pitfalls.
tpetzke
Posts: 686
Joined: Thu Mar 03, 2011 4:57 pm
Location: Germany

Re: test positions for texel tuning

Post by tpetzke »

What do you do for positions that occur in more than one game?
I don't include positions from the first 16 plies; those are usually handled via book and don't provide a strong signal (even if the game was lost, it usually wasn't because of an error there). This reduces duplicates, but otherwise duplicated positions simply stay duplicated; I don't check whether I already have them in the set.
Does it matter how many evaluation terms there are to optimise?
I usually tune all of my main weights, but I have also done runs where I only tuned PSQ values per piece, with a weight for each square for MG and EG.

The more weights you expose, the lower the error gets, because the tuner gets more knobs to turn; but this is actually overfitting. You can only counter that by including more positions in the set, so that it has to generalize.

Also, the way you tune has an influence. I just increment a value, and if the error drops I increment it again until the error no longer drops. If incrementing does not work, I try decrementing. Then I proceed to the next value. The order in which you try the values has an influence on the outcome, as does the starting set you use. Sometimes I randomized the starting set a bit by adding/subtracting a small number from my initial weights, just to shake the eval out of its local optimum.
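In code, that one-value-at-a-time search might look roughly like this (a sketch only; compute_error() stands in for the evaluation of the tuning-set error and is an assumption, as is the jitter helper for the randomized starting weights):

```python
import random

def jitter(weights, amount=3, seed=1):
    # Optional: perturb the starting weights slightly to shake the eval
    # out of a possible local optimum before the search starts.
    rng = random.Random(seed)
    return [w + rng.randint(-amount, amount) for w in weights]

def local_search(weights, compute_error, step=1, seed=0):
    # One-weight-at-a-time hill climbing: keep incrementing a weight while
    # the tuning-set error drops, otherwise try decrementing, then move on.
    rng = random.Random(seed)
    best_error = compute_error(weights)
    order = list(range(len(weights)))
    rng.shuffle(order)  # the visiting order affects the result, so vary it
    for i in order:
        for delta in (step, -step):
            while True:
                weights[i] += delta
                error = compute_error(weights)
                if error < best_error:
                    best_error = error
                else:
                    weights[i] -= delta  # undo the step that did not help
                    break
    return weights, best_error
```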

So there is a lot to try. Overall it is fun but I haven't really gained much from it.
Thomas...

=======
http://macechess.blogspot.com - iCE Chess Engine
matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: test positions for texel tuning

Post by matthewlai »

tpetzke wrote: Also, the way you tune has an influence. I just increment a value, and if the error drops I increment it again until the error no longer drops. If incrementing does not work, I try decrementing. Then I proceed to the next value. The order in which you try the values has an influence on the outcome, as does the starting set you use. Sometimes I randomized the starting set a bit by adding/subtracting a small number from my initial weights, just to shake the eval out of its local optimum.
It may also be possible to do standard gradient descent even if the function is not differentiable, by numerically estimating direction of gradients (using Newton's method to estimate partial derivatives). This may perform better than ad-hoc methods, and is very well studied.
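A sketch of that idea with plain finite differences (the error function, the step size eps and the learning rate are all placeholders, not a tested recipe):

```python
def numerical_gradient(weights, compute_error, eps=1.0):
    # Estimate each partial derivative of the tuning-set error with a
    # central difference; no analytic derivative of the eval is needed.
    grad = []
    for i in range(len(weights)):
        original = weights[i]
        weights[i] = original + eps
        error_plus = compute_error(weights)
        weights[i] = original - eps
        error_minus = compute_error(weights)
        weights[i] = original
        grad.append((error_plus - error_minus) / (2.0 * eps))
    return grad

def gradient_descent(weights, compute_error, learning_rate=50.0, iterations=100):
    # Plain gradient descent on the numerically estimated gradient.
    for _ in range(iterations):
        grad = numerical_gradient(weights, compute_error)
        weights = [w - learning_rate * g for w, g in zip(weights, grad)]
    return weights
```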
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
brtzsnr
Posts: 433
Joined: Fri Jan 16, 2015 4:02 pm

Re: test positions for texel tuning

Post by brtzsnr »

Evert wrote: What do you do for positions that occur in more than one game?
Do you just include them multiple times, or do you include them once with an outcome that is just a (weighted?) average of the outcomes of the individual games (say, a position that is won for white 70% of the time and drawn 30% of the time gets an outcome of 0.7)?
Most duplicate positions appear at the beginning of the game or in endgames with few pieces. I have a test in which I dropped duplicates and got a slight improvement on all three runs.
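Dropping duplicates can be as simple as keying on the FEN; a sketch, where whether to strip the move counters from the key is a judgment call:

```python
def drop_duplicates(labelled_positions):
    # Keep only the first occurrence of each position. The key uses the first
    # four FEN fields (placement, side to move, castling, en passant) so that
    # occurrences differing only in the move counters still count as duplicates.
    seen, unique = set(), []
    for fen, outcome in labelled_positions:
        key = " ".join(fen.split()[:4])
        if key not in seen:
            seen.add(key)
            unique.append((fen, outcome))
    return unique
```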
mvk
Posts: 589
Joined: Tue Jun 04, 2013 10:15 pm

Re: test positions for texel tuning

Post by mvk »

matthewlai wrote:It may also be possible to do standard gradient descent even if the function is not differentiable, by numerically estimating direction of gradients (using Newton's method to estimate partial derivatives). This may perform better than ad-hoc methods, and is very well studied.
Gradient descent looks very expensive for this application because it requires you to look around in all dimensions before making a step. Evaluating one vector is an expensive operation, even with short-cuts, and then it makes only a small step. The standard solution is to apply it to varying subsets of the training set, but that doesn't work when you need to cross a large, almost flat plateau. I'm toying with the idea of tuning on linear combinations of small groups of parameters instead, much like the Deep Blue team did.

Btw, the effectiveness of the method does not just depend on the test set but also on the structure of the evaluation function. Preferably all features are binary. For example, don't have a single pawn value, because the number of pawns is not evenly distributed over the test positions (8 pawns, for example, is rare). With a single pawn value parameter the tuner will bias towards the most common pawn counts at the cost of the other game phases. This doesn't get resolved in general by adding just a few terms (e.g. a quadratic pawn term). With binary properties each segment gets tuned on its own merit. Of course, you still need enough positions in each segment that are sensitive to the parameter to extract a useful signal.
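One way to read the "binary features" suggestion in code (illustrative only; the function names are made up):

```python
# Scalar formulation: one tunable pawn value applied linearly to the count.
def pawn_term_scalar(own_pawns, pawn_value):
    return own_pawns * pawn_value

# "Binary" formulation: one indicator feature (and one tunable weight) per
# pawn count, so each count is fitted on its own positions rather than being
# dominated by whichever counts happen to be most frequent in the test set.
def pawn_term_binary(own_pawns, weight_per_count):
    # weight_per_count has nine entries, for 0..8 own pawns
    return weight_per_count[own_pawns]
```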

The only real thing to worry about, from my point of view, is selection bias from the played games ("feature X is only good once it appears in games, but not in general"). But there are ways to handle that, including yours.
[Account deleted]
matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: test positions for texel tuning

Post by matthewlai »

mvk wrote:
matthewlai wrote:It may also be possible to do standard gradient descent even if the function is not differentiable, by numerically estimating direction of gradients (using Newton's method to estimate partial derivatives). This may perform better than ad-hoc methods, and is very well studied.
Gradient descent looks very expensive for this application because it requires you to look around in all dimensions before making a step. Evaluating one vector is an expensive operation, even with short-cuts, and then it makes only a small step. The standard solution is to apply it to varying subsets of the training set, but that doesn't work when you need to cross a large, almost flat plateau. I'm toying with the idea of tuning on linear combinations of small groups of parameters instead, much like the Deep Blue team did.

Btw, the effectiveness of the method does not just depend on the test set but also on the structure of the evaluation function. Preferably all features are binary. For example, don't have a single pawn value, because the number of pawns is not evenly distributed over the test positions (8 pawns, for example, is rare). With a single pawn value parameter the tuner will bias towards the most common pawn counts at the cost of the other game phases. This doesn't get resolved in general by adding just a few terms (e.g. a quadratic pawn term). With binary properties each segment gets tuned on its own merit. Of course, you still need enough positions in each segment that are sensitive to the parameter to extract a useful signal.

The only real thing to worry about, from my point of view, is selection bias from the played games ("feature X is only good once it appears in games, but not in general"). But there are ways to handle that, including yours.
Gradient descent is indeed much more expensive, but it is also much more accurate. Many optimization problems would be much easier if we could just optimize one parameter at a time. Unfortunately, as I am sure you know, it doesn't really work like that: changing the second parameter will change the optimal value of the first parameter, and so on.
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.