Parameter tuning with multi objective optimization

crybotmark · Post by **crybotmark** » Mon May 08, 2017 12:34 am

Hi all,
Time has passed since I last worked on my engine (and consequently posted something here), but I recently resumed developing Napoleon and I'm currently trying to automatically tune search parameters. I implemented a very simple genetic algorithm to only tune a binary coded string which encapsulates some (and not all) of the search parameters, including late move reduction margins, futilty pruning parameters, ecc.
My goal was to derive a method for evaluating a string fitness (i.e. a set of search parameters) that wouldn't require to run any game at all. So I downloaded the Encyclopedia of Chess Middlegames (ECM.epd) which consists of 879 positions, and used that to compute a string score. Since we do want to have a transfer from the fitness score to the actual playing strength of the tuned engine, optimizing only for node count (i.e. choosing the set of parameters that leads to a smaller tree) would be too risky. There should also be, then, a component of the score that indicates how many correct moves the engine actually finds.
The question is then: someone has ever tried to combine correctness and selectivity (respectively, #correct_moves and #nodes_visited) into some linear(or not) function and use that as a fitness function inside an evolutionary framework? Since multiobjective optimization is a rather obscure field to me, I didn't try implementing sofisticated forms of genetic algorithms.
I just linearly combined the two parameters above and used the resulting function to determine the strings fitness.
I could post the code, if anyone is interested, but it should be clear that no revolutionary technique has been implemented.
What I found is that after many iterations of the algorithm the tuned parameters looked very different from what I expected them to be... for example, as it should be, I use a constant margin to determine if a move should be pruned as a consequence of futility pruning. I originally set this parameter to 250 centipawns for depth=1 and 500 for depth=2, while the tuned ones do not exceed 10 centipawns... Of course tests are running to exactly determine if there was an increase in playing strength, but I guess something is not working as it should.
Does anyone has any experience in that?
Thank you all

jdart · Post by **jdart** » Mon May 08, 2017 3:23 am

I tuned for test results for years and years, and I can pretty much tell you that it is a waste of time.

Even 1000 test positions or so are a tiny subset of the positions that will appear in a game.

If you run a 64,000 game test gauntlet and each move visits millions of positions, your engine will visit billions of positions. That will still be a tiny amount compared to the universe of possible positions but it is enough for tuning.

Similarly if you want to use gradient descent (Texel method) and have a million position training set, and each search visits even 100 positions, that is 100 million positions. Still way more than what a limited test suite gives.

--Jon

Ferdy · Post by **Ferdy** » Mon May 08, 2017 3:24 am

I have not known such method of using correct count and node count. This could be interesting to try.

I would suggest to use the STS test suite, 1500 positions with scaled scoring.

stegemma · Post by **stegemma** » Mon May 08, 2017 10:54 am

In Satana I use GA to tune parameters but in a different way. I have a pool of engines that play all vs all, starting from random positions.

I have always get weaker engines, since now, with GA.

crybotmark · Post by **crybotmark** » Mon May 08, 2017 12:17 pm

jdart wrote:I tuned for test results for years and years, and I can pretty much tell you that it is a waste of time.

Even 1000 test positions or so are a tiny subset of the positions that will appear in a game.

If you run a 64,000 game test gauntlet and each move visits millions of positions, your engine will visit billions of positions. That will still be a tiny amount compared to the universe of possible positions but it is enough for tuning.

Similarly if you want to use gradient descent (Texel method) and have a million position training set, and each search visits even 100 positions, that is 100 million positions. Still way more than what a limited test suite gives.

--Jon

What's the de facto standard then? Despite of the unusual output, I'm actually getting some elo increases with Napoleon... I've only ran 130 self games at 40/4' for now, but the results are promising (+56 elo), yet uncertain. What do you suggest to do?

I have not known such method of using correct count and node count. This could be interesting to try.

I would suggest to use the STS test suite, 1500 positions with scaled scoring.

I forgot to mention that Napoleon has a very poor evaluation function, since I'm not a chess player I would rather work on the search framework, instead of studying chess positional play (which I'm totally unaware of). So it might be pointless to use STS as a test suite, since the goal here is to optimize tactical play by enhancing search selectivity.

For the parameters I'm currently testing I used this genetic algorithm:

Code: Select all

population_size=10;
generations=+INF &#40;I quit arbitrarily as soon as I realize the algorithm converges&#41;
generate 10 random chromosomes;
while&#40;true&#41;
&#123;
  for each chromosome&#58;
     compute fitness;

  sort chromosomes with descending order of fitness;
  take 1 random chromosome out of the second half of the list and save it;
  truncate the list and keep only the first half;
  randomly shuffle the remaining chromosomes;
  generate pairs of crossovers by picking adjacent elements of the list and add them to back of the list;
  // by now population size is back to 10
&#125;

The fitness function is computed as follows:

Code: Select all

f&#40;chromosome&#41;
&#123;
   apply chromosome.parameters;
   for each position in suite&#58;
      run iterative deepening with depth_limit=7&#58;
         if correct move is found before depth=7, then return from search;
      if &#40;move is correct&#41; then correct_count++;
      add node_count to total_node_count;

    return score as&#58; correct_count - total_node_count/50000;
&#125;

Now, I also tried non linear functions of correct_count and node_count, but this linear sum approach seems to work better. The scaling factor is empirical and by no means it should be considered optimal. Also depth limit of 7 is used to limit search time, since greater depths would require much more time to compute.

Ferdy · Post by **Ferdy** » Mon May 08, 2017 12:52 pm

Back in 2015 we have a fun tuning eval parameters using GA.

http://talkchess.com/forum/viewtopic.ph ... 37&t=57246

jdart · Post by **jdart** » Mon May 08, 2017 3:25 pm

I don't think it impossible this could improve things if you start with a very poor starting set of parameters. But the fact that it is tuning your futility margins down to a very low value should indicate something is wrong.

For search parameters such as futility margins I suggest using self-play at reasonably long time controls. You can start by using a linear formula by depth and just vary the base value and depth increment. Stockfish uses game/1 minute + 0.6 seconds increment for its "LTC" tests. For some parameters you may need even longer time controls. And run until you get a positive or negative SPRT result. This could be a large number of games.

For eval you have a few approaches. Self-play as above. Texel method (gradient descent), based on a large training set. These are now established approaches that work if implemented correctly. You can try CLOP (https://www.remi-coulom.fr/CLOP/) but most here have had poor results from that.

If you have multiple interacting parameters then you can consider something like SPSA. I have occasionally used an RBF based nonlinear optimizer. These do not work for large numbers of parameters but can tune a few at a time, as CLOP does.

crybotmark · Post by **crybotmark** » Mon May 08, 2017 6:55 pm

The fact that it has possibly improved with such futility values may be due to the deeper search that derives from that parameters. It actually gained 4 plies on average over the not tuned version, and that may overcome the pruned search untrustworthiness...
That makes me think that optimizing only for small subsets of related parameters may leads to better results, since we would ignore any form of selectivity interdependence. This way the tuned parameters would probably be closer to the manual set ones.
Do you think that disabling the parameters not subject to optimization would make the whole process more accurate?
Also, I think that if I used a much bigger test set and randomly picked a subset for each iteration (or even chromosome), that would cause the algorithm to ignore local maximum.
There exists a tactical test position much bigger than ECM?

jdart · Post by **jdart** » Mon May 08, 2017 10:11 pm

There is no really huge tactical test suite that I know of. ECM is not nearly enough positions. By the way too - I hope you know, the original ECM set as published by Chess Informant has numerous errors (wrong bm).

As I said I think test suites are a flawed method.

--Jon

tpetzke · Post by **tpetzke** » Wed May 10, 2017 2:27 pm

Hi,

I use genetic algorithms for some years now to tune iCE and for me it works in general. In general means it improves a not tuned evaluation quiet a lot. The better the evaluation gets the more likely further runs will not be able to improve it further or might even come up with slightly worse weights.

Fitness functions that don't factor in game play have not worked for me. I always needed to play matches. However I use a set of test positions to filter out population members that have an absurd fitness (e.g. if the value of the queen is less than a knight it will not score very well in the set) just to save compute time. I also use it to decide draws. As engines that score better in the set have a higher chance of survival the final genom is also solving the set better but is not overfitted towards it (e.g. sacrificing playing strength to solve 5 additional positions).

The computational effort for a single run (1000 generations, population size 128 - 256) in my framework is very high (2 to 3 weeks) so I tried a lot of different cheaper fitness functions unfortunately without any success.

Parameter tuning with multi objective optimization

Parameter tuning with multi objective optimization

Re: Parameter tuning with multi objective optimization

Re: Parameter tuning with multi objective optimization

Re: Parameter tuning with multi objective optimization

Re: Parameter tuning with multi objective optimization

Re: Parameter tuning with multi objective optimization

Re: Parameter tuning with multi objective optimization

Re: Parameter tuning with multi objective optimization

Re: Parameter tuning with multi objective optimization

Re: Parameter tuning with multi objective optimization