Gerd Isenberg wrote:Isn't Joel Veness' RootStrap or even TreeStrap as applied in Meep supposed to gain better or faster results than TD-Leaf?
I just started looking at that - very interesting. However, it seems that TreeStrap's biggest strength is being able to get quickly to reasonable values from a random start. I'd be curious to see how well it trains relative to TD-Leaf when we already have reasonable values, a more common jumping-off point for most of us.
Evaluation Tuning
Moderators: hgm, Rebel, chrisw
-
- Posts: 558
- Joined: Sat Mar 25, 2006 8:27 pm
Re: Evaluation Tuning
-
- Posts: 4367
- Joined: Fri Mar 10, 2006 5:23 am
- Location: http://www.arasanchess.org
Re: Evaluation Tuning
Nobody doubts that reinforcement learning can improve evaluation functions, especially if you are starting from a really basic one.
But there isn't much published evidence (that I know of) that this method can improve such functions to the point that they outperform a good hand-coded evaluation. But maybe I'm wrong and this is possible.
--Jon
-
- Posts: 4833
- Joined: Sun Aug 10, 2008 3:15 pm
- Location: Philippines
Re: Evaluation Tuning
Fabio Gobbato wrote:I'm trying a new tuning method. I have made only 2 attempts so far, both with good results, so I'm not sure yet that it's a very good method, but the first attempt gave +12 Elo and the second +8 Elo.
I extract 150 million positions; the quality of the games doesn't matter much, but it's better if the positions are quite different from each other.
From them I remove these positions:
- positions in check
- positions with fewer than 7 pieces (useful if the engine uses the tablebases)
- positions where the static eval differs from the qsearch score (i.e. I keep only quiet positions)
Then I run a depth-2 search on every position and store the position in a file together with the score returned.
Once I have this file, I compare every position's score with the static evaluation.
I compute the error for a position as the absolute difference between the two scores.
Then I use an optimization algorithm to minimize the total error.
The idea is that it's difficult for the evaluation to predict the result of the game, but it's much easier for it to approximate the eval from 2 plies ahead.
Another advantage is that this method minimizes the difference between related positions.
I have made only 2 iterations of this method, with a total improvement of +20 Elo.
If someone else would like to try it, we can compare results.
Nice idea, especially the collection of training positions - and you are using 150 million!! We could be similar in the use of eval: you are using the score from a depth-2 search, while I am using the score from the move comments produced by the engine. And you are using the static eval, similar to what Miguel is using, while I am using qsearch, similar to Texel. How many eval parameters have you allowed the tuner to tune?
Thanks for sharing.
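Fabio's scheme above could be sketched roughly as follows. This is a minimal illustration, not his actual code: the linear `static_eval(pos, params)` callback, the precomputed depth-2 target scores, and the naive coordinate-descent optimizer are all assumptions standing in for whatever he really uses.

```python
def tuning_error(positions, params, static_eval, target_scores):
    # Mean absolute difference between the static eval and the
    # stored depth-2 search score, over all training positions.
    total = 0.0
    for pos, target in zip(positions, target_scores):
        total += abs(static_eval(pos, params) - target)
    return total / len(positions)

def coordinate_descent(positions, params, static_eval, target_scores,
                       step=1, iterations=2):
    # Naive local search: nudge one parameter at a time and keep
    # the change only if the total error decreases.
    best = tuning_error(positions, params, static_eval, target_scores)
    for _ in range(iterations):
        for i in range(len(params)):
            for delta in (step, -step):
                params[i] += delta
                err = tuning_error(positions, params, static_eval,
                                   target_scores)
                if err < best:
                    best = err
                    break           # keep the improvement
                params[i] -= delta  # revert
    return params, best
```

Note that such a local search can stall in a local optimum; any optimizer that minimizes the same error would fit the description in the post.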
-
- Posts: 793
- Joined: Sun Aug 03, 2014 4:48 am
- Location: London, UK
Re: Evaluation Tuning
jdart wrote:Nobody doubts that reinforcement learning can improve evaluation functions, especially if you are starting from a really basic one.
But there isn't much published evidence (that I know of) that this method can improve such functions to the point that they outperform a good hand-coded evaluation. But maybe I'm wrong and this is possible.
--Jon
I went from material-only to near Stockfish level (eval only), so it's possible. However, my learned eval function is also orders of magnitude slower than Stockfish's.
This is with an almost completely-generic function, though (neural network), and not tuning parameters on hand-coded features.
But no, it hasn't been published yet.
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
-
- Posts: 793
- Joined: Sun Aug 03, 2014 4:48 am
- Location: London, UK
Re: Evaluation Tuning
Gerd Isenberg wrote:Isn't Joel Veness' RootStrap or even TreeStrap as applied in Meep supposed to gain better or faster results than TD-Leaf?
That's very interesting! They somehow escaped my literature review.
TreeStrap does sound very interesting. It requires quite an invasive change to the engine, though, since training is also done on internal nodes. I will give it a try when I have time.
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
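For reference, the TreeStrap(minimax) idea from Veness et al. boils down to this: after each search, regress the static eval of every interior node toward the minimax value of the subtree below it. A minimal sketch for a linear evaluation follows; the function name, learning rate, and sample format are assumptions for illustration, not Meep's actual code.

```python
def treestrap_update(params, node_samples, lr=0.1):
    # TreeStrap(minimax)-style update sketch for a linear static eval
    #   v(s) = params . features(s)
    # node_samples: (feature_vector, minimax_value) pairs, one per
    # interior node visited by the last search.
    for features, target in node_samples:
        v = sum(p * f for p, f in zip(params, features))
        err = target - v
        for i, f in enumerate(features):
            params[i] += lr * err * f  # gradient step on squared error
    return params
```

The invasive part the post mentions is collecting `node_samples`: the search has to hand back features and minimax values for internal nodes, not just the root or the leaf of the principal variation as in TD-Leaf.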
-
- Posts: 433
- Joined: Fri Jan 16, 2015 4:02 pm
Re: Evaluation Tuning
As Ferdinand noted, this is similar to Texel's tuning method. I have my own tuner https://bitbucket.org/zurichess/txt that was discussed here http://talkchess.com/forum/viewtopic.php?t=55696.
I used a set of 1,000,000 EPD positions to tune my engine. Parsing normally takes about 25% of the time, which is fine given that the tuner can be applied to any UCI engine and added almost no code to mine (except for some minor micro-optimizations).
Things that I noticed with my tuner:
* Going over 1 million positions doesn't help much, Elo-wise.
* The Elo improvement depends a lot on the starting conditions (starting values, games played, etc.).
* For the same starting conditions, a lower tuning error correlates strongly with higher Elo.
* Using quiet positions is better (though I haven't done as much testing on tactical positions).
* Positions from hyperbullet games are better, probably because the evaluation is more relevant there.
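For comparison, the Texel-style objective alluded to above (the "score" that correlates with Elo) is usually the mean squared error between the game result and a sigmoid of the engine's search score. A minimal sketch, where the scaling constant `k` and the `qscore` callback are placeholders to be fitted to your own engine:

```python
def texel_error(entries, qscore, k=1.2):
    # entries: (epd_string, result) pairs with result 1.0 / 0.5 / 0.0
    # from White's point of view.
    # qscore: callback returning the qsearch score in centipawns.
    # k maps centipawns to a win probability; it is normally fitted
    # once, before any eval parameter is touched.
    total = 0.0
    for epd, result in entries:
        p = 1.0 / (1.0 + 10.0 ** (-k * qscore(epd) / 400.0))
        total += (result - p) ** 2
    return total / len(entries)
```

Minimizing this error over the parameter set is what the bullet "lower score strongly correlates with higher ELO" refers to.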
-
- Posts: 217
- Joined: Fri Apr 11, 2014 10:45 am
- Full name: Fabio Gobbato
Re: Evaluation Tuning
All my evaluation parameters are in an array, so I could tune all of them at once, but in these tests I tuned about 80 parameters.
-
- Posts: 858
- Joined: Wed Mar 08, 2006 9:24 pm
- Location: Germany
- Full name: Daniel Mehrmann
Re: Evaluation Tuning
Thanks for sharing your ideas.
Well, I'm working with Jörg's tuning idea with EPDs. At the moment it's just a playground. I need to find more time to work on it.
My code isn't ready yet, but it is usable.
However, here is some cleanly written code in the Fruit style, if someone needs it.
https://www.dropbox.com/sh/781g8fuonz6i ... aSvaLrLFTa
Regards
Daniel
-
- Posts: 4833
- Joined: Sun Aug 10, 2008 3:15 pm
- Location: Philippines
Re: Evaluation Tuning
Fabio Gobbato wrote:All my evaluation parameters are in an array so I could tune all at once, but in these tests I tune about 80 parameters.
I am not into tuning now; I am reviewing and refactoring my code. I realized tuning is useless when the code is not thoroughly checked, because later you will find some bugs. I have done some tuning before with only around 20 params, just to test the tuner, and I could get +20 wins in around 700 games, with only 1.8 million training positions, after around 5 hours of tuning. I have difficulty growing my training positions because I am limited to the Shredder GUI's pgn output, where the games have move comments, as I extract the move evals from those. I will probably try your method of collecting games. It might cost me the 2-ply analysis time, but I could get a lot of positions - not 150 million of course. Or perhaps I will just run a tourney of different engines using cutechess-cli at a very fast TC and then extract the positions and move evals from there, with no need to run a depth-2 analysis.
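The last idea - pulling move evals out of the PGN comments that cutechess-cli writes - could be sketched like this. The assumed comment form is `{+0.31/10 0.015s}` (score/depth, then time); the exact format depends on the engine and tool settings, so treat the regex as an assumption to adapt.

```python
import re

# Matches annotations like "{+0.31/10 0.015s}": a signed decimal score
# in pawns, a slash, and the search depth. Book moves ("{book}") and
# plain comments are skipped because they don't match.
EVAL_RE = re.compile(r"\{\s*([+-]?\d+\.\d+)/(\d+)")

def extract_move_evals(pgn_text):
    # Returns a list of (score_in_pawns, depth) for each annotated move.
    return [(float(s), int(d)) for s, d in EVAL_RE.findall(pgn_text)]
```

To build a training set, you would pair each extracted eval with the position reached after the commented move, which requires replaying the game with a move generator; the snippet only handles the score extraction.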