Evaluation Tuning

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Evaluation Tuning

Post by Desperado »

Introduction
=============

Now, working quietly on my new engine, one of my goals is to get rid of the guesswork for evaluation parameters.
There are some well-explored features, but most positional parameters are simply crap, based on human intuition.
Time has shown that parameters like doubled pawns, to pick just one example, were valued completely wrong only a decade ago.
The standard approach, playing several thousand games to get roughly correct values for a feature, is doable on today's hardware,
which allows extremely fast games. But starting with reasonable parameters derived from data analysis (and tuning them continuously)
should give more freedom to an engine developer and goes along with some other basic considerations.

So, I do not expect the values to be better in the short run (maybe in the long run) than values that are already used in an engine.
The main difference will be knowing where the values come from and being able to close the gap by understanding how values
from data analysis differ from parameters used in an engine. At the end of the post this point may become clearer, because
"Tapered Eval" is one subject on this matter.

Another point, from the developer's point of view, is to stop twiddling back and forth between the search and evaluation components.
Of course there is a strong correlation between search and static evaluation, without any doubt.
Nonetheless, it does not make sense to repair a broken search feature by tuning the evaluation to compensate for it,
nor the other way around. Therefore a definition is needed of what is correct for each of the two components, so
it becomes clear where a feature needs to be tuned/fixed/introduced.
For this purpose I consider evaluation parameters to be correct when they are based on data analysis.
Well, I need to examine this idea more closely, but if evaluation parameters do not work in conjunction with the search,
the idea is to make the search sensitive to the parameters and not the other way around. A fix to the evaluation parameters
or their conditions would only occur if the database is flawed/insufficient.

The following sections contain some basic ideas and explanations of my algorithm, and various hints.

1: Starting with a vector set to zero (without normalization / bias)
=======================================================================

Code: Select all

1a: Human 2500+
---------------------

Finished:   1000 games /   42045 positions / 115 240 275 405  825 Time:   3.98s
Finished:   2000 games /   83494 positions / 100 210 245 365  725 Time:   6.93s
Finished:   4000 games /  167823 positions / 100 220 235 365  750 Time:  15.63s
Finished:   8000 games /  335833 positions / 100 230 250 385  795 Time:  20.90s
Finished:  16000 games /  686395 positions /  90 210 225 350  705 Time:  62.99s
Finished:  32000 games / 1415813 positions /  90 220 235 360  730 Time: 196.14s
Finished:  64000 games / 2845859 positions /  85 205 225 350  705 Time: 222.24s
Finished: 128000 games / 5794124 positions /  85 205 220 340  690 Time: 522.15s

1b: CCRL 2800+
-----------------------

Finished:   1000 games /    95093 positions / 110 320 355 465 835 Time: 13.28s
Finished:   2000 games /   186361 positions / 100 290 325 440 850 Time: 17.94s
Finished:   4000 games /   372961 positions / 100 300 340 465 900 Time: 36.27s
Finished:   8000 games /   760902 positions / 100 280 310 445 860 Time: 95.25s
Finished:  16000 games /  1534765 positions /  95 275 305 435 860 Time: 127.27s
Finished:  32000 games /  3139159 positions / 105 275 305 440 860 Time: 304.26s
Finished:  64000 games /  6322204 positions / 105 270 305 440 860 Time: 521.43s
Finished: 128000 games / 12670934 positions /  95 255 285 415 825 Time: 1202.57s

1c: Human 2500+ Excluding draws
--------------------------------

Finished:   4000 games / 212790 positions  / 170 430 465 710 1430 Time:  36.22s
Finished:   8000 games / 424052 positions  / 160 385 420 635 1285 Time:  52.67s
Finished:  16000 games / 851718 positions  / 150 385 420 635 1275 Time: 100.86s

1d: CCRL 2800+ Excluding draws
--------------------------------

Finished:  8000 games /  686922 positions / 175 530 585 820 1620 Time:  90.62s
Finished: 16000 games / 1412146 positions / 180 520 575 810 1585 Time: 460.81s

The results are given for two different databases (human, machine). Looking at the human values,
my interpretation is not that humans use different values, but more likely that they aren't able
to exploit them adequately. Machines seem to be stronger and more accurate/consistent.
This interpretation comes to mind because an early-stage engine produces values closer to the human values.

Besides the useless but interesting interpretation of the data, the most important point is
that the time used to derive such values is very low. Even the inaccurate levels of 1000/2000/4000 games are pretty useful.
A database of 10,000,000 high-level games could be analyzed within a day if someone needs to be
that accurate, but that seems pointless, because a feature might be tuned again later with respect to orthogonality.
Needless to say, the results depend strongly on the database, as shown by the human/machine differences above.

Of course the runtime depends on a handful of settings, but in any case the range of the values factors in,
and for material it is higher than for nearly any other feature. You can see the material values above with a resolution
of "only" 5 points. More on that later.


2: Error computation
=====================

Code: Select all

double Database::getFitness(uint32_t id,Position* pos)
{
    // Evaluate current game position
    double score = getWhitePositionScore(id,pos);
           score = Misc::sigmoid(score,400);

    // Game result 1.0/0.5/0.0
    double result = getGameResult(id);

    // Fitness logic
    return 1.0 - pow(result-score,2);
}
This is the standard squared-error computation, but expressed as fitness.
The idea is that you can have different objective functions, like a combination of correct-move counts
and squared-error computation; a correct move is usually counted as positive. It is just a little flexibility
for the controller functions and something to play with in the future.

The sigmoid function used here is the basis for the result values above and is arbitrary in the sense that I just picked
it from the Elo formula. Using another mapping to winning probabilities will give different results.
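For reference, this is a sketch of what such a mapping could look like, assuming the logistic shape of the Elo expectancy formula with a scale of 400; the actual Misc::sigmoid may be implemented differently.

Code: Select all

#include <cmath>

// Sketch only: maps a centipawn score to a winning probability in (0,1).
// score = 0 gives 0.5, +400 gives about 0.91 (Elo expectancy shape).
double sigmoid(double score, double scale /* = 400 */)
{
    return 1.0 / (1.0 + std::pow(10.0, -score / scale));
}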

The "correctness" of a value is computed without any draw ratio of which i read in Amir Ban's paper two days before.
Looking at the result values at the beginning, it is pretty obvious that draws play a role, but it is sufficient
to have a sample set that includes a real distribution of the game results.

3: Evaluation
=============

Code: Select all


double Database::getWhitePositionScore(uint32_t id,Position* pos)
{
    return (double) _gCache[id].staticEval(pos);
    //return (double) _gCache[id].fixDepth(pos,1);
    //return (double) _gCache[id].fixNodes(pos,1000);
}

The score used in the fitness function can be determined with the full power of the engine.
By far the fastest method is to use the engine's static evaluation, but the results
will be more accurate with a more complex search.

An important point is that the evaluation needs to be deterministic. Any type of collected data
needs to be reset before searching (transposition tables, material hash tables, pawn hash tables, ...),
especially if the parameter is part of such a component.

Another finding is that increasing the number of positions/games will not necessarily improve an inaccurate result.
A small, arbitrary example: looking at 4000 games may produce mobility values 6,6,6,4 (n,b,r,q) with
a fitness of 3725.1234 points, while some other combination like 5,6,6,4 produces a fitness of 3723.1234 points.
Increasing to 32000 games may then give 30000.1234 points and 29990.1234 points respectively, so no change in the ranking.
You can continue increasing the game count, but the vector 6,6,6,4 remains the better one.
However, if you change the quality, say by doing the analysis with fixDepth(1), this may change and the vector 5,6,6,4 becomes better.
So having a more accurate statistic does not help in that case.

Being aware of that, it will be interesting to find a middle way, especially in the context of the guiding ideas mentioned above.

4: Extracted control
======================

4a: Position analysis
----------------------

Code: Select all


void Database::playout(uint32_t id)
{
    restore_t restore;
    Position pos;
    pos.setup("startpos");

    error_t gameError;
    gameError.clear();

    for(int i=0;i<_gCache[id].mCount;i++)
    {
        pos.doMove(&restore,_gCache[id].move[i]);

        // At this point a lot of filters can be used.
        // A restriction to move count / game phase / whatever you like
        // eg: if(i<20) continue; This would ignore book moves for example

        ...

        // Error type holds error/fitness
        gameError.total += getFitness(id,&pos);
        gameError.count++;
    }

    // At this point the game fitness/error is added to the total fitness/error
    // The simplest form is _error.total += gameError.total

    ...
}

Some words on how the data is accessed. I read some posts on parameter tuning, and a lot of people extract data
from PGN files and make EPD exports out of them. I started the same way, but it is not handy, it needs a lot of
memory, and, more importantly, most of the positions are taken out of context. Some decisions are already made at this
stage, like removing duplicates, and a lot of information that tools do not handle correctly when converting gets lost
on the way, like the halfmove clock counter and so on.

My proposal is to access positions in game context, for several reasons. Excluding data from the analysis is simply done by filters,
like opening moves (if there is a need to do so at all, because these positions are full of information too). Implicitly the tuner
is sensitive to game results for duplicate positions.
The position after 1.e4 (for example) is not "0" or "1" or "1/2"; it is closer to 55% or something like that.

There is no practical limit. Think of backward analysis from the end of the game to the start, which may improve the
error computation based on the result by computing intermediate results between the last computation and the game result.
Creating books out of the data, ... there are so many possibilities.

But one of the first practical improvements is that you store positions relative to the game: you only need to
store the move list (2 bytes per move) and set up the start position for any game analysis, then play the moves out. Finally, it is fast too, and any
other information about the game is available, like the "bestMove" as the next move and so on.

HINT: pgn-extract-17-21 -tFall.txt my.pgn -Wuci -oAll.pgn
The -Wuci option in pgn-extract works very well and should make it very easy for all UCI authors to get UCI-compatible move lists.
Here is an example of a PGN game converted to a UCI game file.

Code: Select all


[Event "CCRL 40/40"]
[Site "CCRL"]
[Date "2006.04.25"]
[Round "19.1.1"]
[White "Rybka 1.1 64-bit"]
[Black "Zap!Chess Paderborn 64-bit 2CPU"]
[Result "1-0"]
[WhiteElo "2894"]
[BlackElo "2808"]
[ECO "B48"]
[Opening "Sicilian"]
[Variation "Taimanov variation"]
[PlyCount "107"]

e2e4 c7c5 g1f3 b8c6 d2d4 c5d4 f3d4 e7e6 b1c3 d8c7 c1e3 a7a6 f1d3 g8f6 e1g1 c6e5 h2h3 f8c5 g1h1 d7d6 f2f4 e5g6 d1d2 e8g8 d4b3 c5b4 a2a3 
b4c3 d2c3 c7c3 b2c3 e6e5 f4f5 g6e7 b3d2 c8d7 c3c4 h7h6 a3a4 d7c6 h1h2 a6a5 g2g4 g8h7 f1e1 f6d7 d2b3 b7b6 h2g2 g7g6 g2g3 e7g8 h3h4 g8f6
e3d2 h7g7 g4g5 f6h5 g3f2 g7h7 d2e3 a8e8 f5f6 e8b8 f2f3 f8d8 e1b1 d8e8 b1d1 e8d8 d3f1 d7c5 b3c5 d6c5 d1d3 h7g8 d3d5 h5f4 d5e5 f4h5 e5e7
d8d6 a1b1 h6g5 h4g5 c6a4 e4e5 a4c6 f3f2 d6d7 e3c5 h5f4 c5d6 b8c8 b1b6 c6a4 b6a6 d7e7 f6e7 a4d7 c4c5 a5a4 a6a7 d7c6 f1a6 c8e8 a7c7 1-0
 
4b: Looping the dataset
------------------------

Code: Select all


void Database::analyze(uint32_t gameLimit)
{
    // Reset total error
    _error.clear();

    // Check database limitation
    gameLimit = min(gamecount,gameLimit);

    // Alternatively the control can be _error.gamecount < gamelimit if filters are active.
    // For example if draws are excluded and you want to have 4000 games anyway.

    for(int i=0;i<gameLimit;i++)
    {
        // At this point more general filters can be used
        // for the games that should be analyzed.
        ...

        // Exclude draw games
        //if(_gCache[i].result == RESULT_DRAW ) continue;

        // Call position analysis
        playout(i);
    }
}

Instead of accessing a database of independent (EPD/FEN) positions, the control works
on two layers: the game layer, seen here, and the position layer, seen above.
It is easy to look at one random position, all positions of a game, phase-dependent positions of a game,
or positions with a special opening or positional feature. "4a" and "4b" are the corresponding controllers.

I already mentioned the paper by Amir Ban, who picks one position out of each game, but there exist more elegant
solutions to avoid the noise caused by unequally distributed move counts, which in effect counts game results multiple times.
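One simple variant, sketched here with the error_t fields from the playout() code above (an illustration only, not necessarily what Amir Ban does): add each game's average position fitness exactly once, so that long games do not outweigh short ones.

Code: Select all

// Sketch: replace "_error.total += gameError.total" with a per-game average,
// so every game contributes one vote regardless of its length.
void Database::addGameFitness(const error_t& gameError)
{
    if (gameError.count == 0) return;                    // nothing was analyzed
    _error.total += gameError.total / gameError.count;   // average fitness of this game
    _error.gamecount++;
}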


4c: The tuning algorithm
-------------------------

Code: Select all


void Database::tuneFast()
{
    const int STEPSIZE   = 5;
    const int RESOLUTION = 5;
    const int GAMELIMIT  = 16000;

    double randomStep,score,bestscore,startingscore;
    param_t cur,best;

    // Set best vector. Of course it can be set to values that are
    // already used. Example: for(int i=0;i<VECTORSIZE;i++) best.value[i]=0;
    ...

    // Starting values. "Param" is the global vector interface
    // which can be used at any place in the evaluation and sets
    // the values that will be tuned.
    param = best;

    // The initial vector needs to be valued
    analyze(GAMELIMIT);
    startingscore = bestscore = _error.total;

    // Stepsize: big to small changes
    // example: [5,-5,4,-4,3,-3,2,-2,1,-1]
    int step[STEPSIZE*2];
    ...

    for(int i=0;i<2*STEPSIZE;i++)
    {
        printf("\nGeneration: [%d,%d] %f (%f)",_error.gamecount,_error.count,bestscore,startingscore);

        for(int j=0;j<VECTORSIZE;j++)
        {
            cur = best;
            modifyParam(&cur,j,step[i]*RESOLUTION);
            param = cur;
            analyze(GAMELIMIT);

            if(_error.total > bestscore)
            {
                best = cur;
                bestscore = _error.total;
                best.dump();
                i=-1;
                break;
            }
        }
    }

    printf("\n\nFinished: %d games / %d positions / ",GAMELIMIT,_error.count);
    best.dump();
}

That looks easy, as it always does once it is done, but it was like the Hobbit: "There and Back Again".
There is nearly nothing left of the simple regression analysis of my first "solution" or of my previous gradient descent algorithm,
just a modified "delta" scoring scheme (for optional use), and some things I need to keep in mind.

First, having something like a delta evaluation is very useful functionality. If I want to measure a feature without
BIAS, the error is computed as evaluateAllFeatures - zeroVectorEvaluation. The delta shows the impact of the feature by itself,
in other words ignoring the orthogonality of other features.
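A sketch of that delta measurement (param and param_t are the interfaces from the code above; staticEvalOf() is a placeholder for the engine's static evaluation, not the exact implementation):

Code: Select all

// Sketch: BIAS-free impact of a feature, computed as
// evaluateAllFeatures - zeroVectorEvaluation for one position.
double featureDelta(Position* pos, const param_t& tuned, const param_t& zero)
{
    param = tuned;                       // vector including the feature under test
    double full = staticEvalOf(pos);     // evaluateAllFeatures
    param = zero;                        // same position, zeroed vector
    double base = staticEvalOf(pos);     // zeroVectorEvaluation
    return full - base;                  // contribution of the feature alone
}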

In the previous version my algorithm needed to be stopped after a fixed generation count.
Each parameter was checked in a "direction", with a random amount to avoid local extrema.
At first it followed the biggest gain; later it just picked any improvement and produced another random step.
Finally, it turned out to be best to do it the way you can see above, because it terminates
when none of the steps is able to produce progress.

At the beginning I mentioned the runtime, which is influenced by the range of the values and the number of games (positions) that are
analyzed, but of course the RESOLUTION, the STEPSIZE and the length of the vector are important too.

What is defined in the "step[]" array is something like this:

Code: Select all

        randomStep  = rand() % STEPSIZE;
        randomStep *= RESOLUTION;
        step        = rand() % 2 ? randomStep : -randomStep;
In my previous algorithm, where I used various values, a random direction/amount was useful.
With the algorithm above, and knowing the complexity of the problem now, a value with a range of 1000 can be solved
within the time you can see in the results above. Parameters like mobility and many others have ranges of up to 10, maybe 100.
Lower resolutions can easily be used for such ranges of up to 100. Compared to the material computation it is a lot faster,
but if you tune arrays of [8],[16],[32],[64] the computational cost rises again.


Future work
-----------

The base has been built now. The algorithm can actually be used to introduce parameter values without any guessing.
It is not about tuning 70, 700 or 7000 parameters in one run, but about introducing 4, 8, 16, 32 or 64 values at a time. The design of
the parameters is still important: a PST can be introduced as a combined file[8],rank[8] value or as an array[64].
Fewer parameters are still easier to tune. A re-tuning of a feature is possible at any time the conditions change or
the evaluation gets more complex, for example because it includes more overlapping features.

Countless improvements to the algorithm are left, and still some puzzles to solve.
The next topic will be "Tapered Eval", which is a real handicap for a tuner, at least in the common (Fruit-style) form.

Maybe some of you can pick up an idea, or start your own automated tuner based on this post.
:)
Ferdy
Posts: 4833
Joined: Sun Aug 10, 2008 3:15 pm
Location: Philippines

Re: Evaluation Tuning

Post by Ferdy »

Thanks for sharing.
I have such a tuner too, similar to the idea of Texel tuning, but designed to be used with other UCI engines. Something like: run
3 engines in parallel (for a 4-core machine), which are actually the same engine,
and then each engine gets an EPD of, say, 500k positions to analyze, all positions unique from the first EPD to the third. I just use the command go depth 0
after sending the position command and expect the engine to return a score from qsearch (at least this is what I did with Deuterium).
This is not standard, and I tried other UCI engines using go depth 1, but this is time-consuming.
The bottleneck I have observed is the time the engine takes to set up a position. I would normally send,

Code: Select all

ucinewgame
position fen z
go depth 0 or go movetime 20 or go depth 1.
Regarding the training positions, I collect games from PGN files with move
comments (actually from Frank). I extract positions and record the result, the move made by the engine in the actual game,
and also the engine's evaluation in the actual game. Something like this:

Code: Select all

br1r2k1/5pb1/6pp/1p1P2P1/p3NP2/P7/BP1R1K1P/3R4 b - - c9 "0-1"; sm "b8b6"; ce -51; 
So now from that data I can go for the result, the move made, or the eval.
Currently I use the eval; this is where I calculate the eval error of the
engine under test. I still have the option to use the result or the move.

I also have a tool for optimizing on the move in that sample EPD, just for
fun, creating a personality. What I did was get Capablanca games, extract
interesting positions, and save them to an EPD. I ran SF6 to remove blunders; what remains
are positions without serious mistakes, which are then used as training positions.
All positions are Capablanca to play, and the best eval parameters are those that
get the highest number of moves matching the sm opcode in the EPD (a sketch of this objective follows below).
In the tuning I also expose my null move pruning, and use the command
go movetime 100. I was surprised that in one of the best optimized parameter
combinations my null move pruning was turned off :).
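A minimal sketch of that move-matching objective; EpdRecord and searchMove are hypothetical placeholders, not Deuterium's actual code:

Code: Select all

#include <functional>
#include <string>
#include <vector>

struct EpdRecord { std::string fen; std::string sm; };   // sm = suggested-move opcode

// searchMove stands for whatever sends "position fen ..." + "go movetime 100"
// to the engine and returns its best move in UCI notation.
int matchingMoves(const std::vector<EpdRecord>& records,
                  const std::function<std::string(const std::string&)>& searchMove)
{
    int matches = 0;
    for (const EpdRecord& rec : records)
        if (searchMove(rec.fen) == rec.sm)
            matches++;
    return matches;                                       // higher is better
}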
matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: Evaluation Tuning

Post by matthewlai »

Ferdy wrote:Thanks for sharing.
[...]
Be sure to think about overfitting. It's a very common problem in machine learning (and you are essentially doing machine learning now).

The problem is that when you have a relatively small training set (like in your case), the model can evolve to solve your training positions perfectly, while not evolving knowledge that can be generalized to other positions.

http://qr.ae/RC7iyx

The typical solution is to withhold, say, 20% of the positions (not use them for training) and periodically evaluate the model on that set. Once the error on that validation set stops decreasing, stop the training, so the model won't become too specialized at solving the training set.
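A sketch of that hold-out scheme with early stopping; Game, trainOneStep() and errorOn() are placeholders for whatever tuner and error measure are actually used:

Code: Select all

#include <cstddef>
#include <vector>

struct Game { /* moves, result, ... */ };

double errorOn(const std::vector<Game>& set);    // evaluation error on a data set
void   trainOneStep(std::vector<Game>& set);     // one optimization pass

void tuneWithValidation(std::vector<Game>& games)
{
    // 80/20 split: the last fifth is never used for training
    std::size_t split = games.size() * 8 / 10;
    std::vector<Game> train(games.begin(), games.begin() + split);
    std::vector<Game> valid(games.begin() + split, games.end());

    double bestValidError = errorOn(valid);
    for (;;)
    {
        trainOneStep(train);                     // tune on the training set only
        double validError = errorOn(valid);      // check generalization
        if (validError >= bestValidError) break; // no longer improving: stop
        bestValidError = validError;
    }
}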
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: Evaluation Tuning

Post by jdart »

I have considerable experience with automated tuning now and it has also been discussed in this forum extensively.

Some observations:

1. The objective function: the most direct way to measure the effect of any change, including scoring parameters, is to run actual game matches. It is pretty well established now that "hyperbullet" matches produce results that can generally also apply to slower time-control games. But since many changes to parameters produce small changes in results, you still need a large number of games to get results that are not just noise. It is important to have enough games so that changes in inputs produce changes in outputs that are outside the error bars. This can be very expensive and time-consuming even using hyper-bullet games. Furthermore if you change > 1 parameter at once then finding the optimal combination of values is even more time consuming.

2. CLOP supposedly alleviates this problem by dynamically zeroing in on improved values for several parameters over the course of a single long series of matches but I have had generally poor results from it. It can be very slow to converge and values do not always pass verification.

3. I personally have had no luck using surrogate measures of fitness such as the difference between magnitude of evals or low-depth search results and game outcomes. I have observed many times that changes that reduced the error in this or similar fitness measures failed to produce improvements when measured by actual game matches.

4. All optimization methods suffer from the so-called "curse of dimensionality." The more parameters you are trying to tune simultaneously, the more difficult it is to have confidence that you are finding the globally best set of values. (But if your objective is only to find better values than your starting point, that is an easier problem).

5. However, you might take a look at this recent paper on parameter optimization for shogi, which has a different kind of fitness measure combined with some clever and apparently effective optimization techniques: https://www.jair.org/media/4217/live-4217-7792-jair.pdf. There is a little bit of material on experimentally applying these ideas to Crafty, but they used test suites to measure improvement, so I think that is bogus and more/different testing would be needed to verify that this actually works. Empirically it does work for Shogi programs, though.

--Jon
Ferdy
Posts: 4833
Joined: Sun Aug 10, 2008 3:15 pm
Location: Philippines

Re: Evaluation Tuning

Post by Ferdy »

matthewlai wrote:
Ferdy wrote:Thanks for sharing.
[...]
Be sure to think about overfitting. It's a very common problem in machine learning (and you are essentially doing machine learning now).
Yes, of course I know what over-fitting is.
matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: Evaluation Tuning

Post by matthewlai »

jdart wrote: 3. I personally have had no luck using surrogate measures of fitness such as the difference between magnitude of evals or low-depth search results and game outcomes. I have observed many times that changes that reduced the error in this or similar fitness measures failed to produce improvements when measured by actual game matches.
I've had a lot of success with TD-Leaf. I found that the trick is to apply a limit on the magnitude of the differences and/or to use MAE instead of MSE. The goal is the same: keep huge errors from completely dominating.
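A sketch of such an error term; the 300 centipawn cap is an arbitrary illustration, not Giraffe's actual constant:

Code: Select all

#include <algorithm>
#include <cmath>

// Absolute (MAE-style) difference between predicted and target score,
// clamped so a single huge mistake cannot dominate the objective.
double clampedAbsError(double predicted, double target, double maxDiff = 300.0)
{
    return std::min(std::fabs(predicted - target), maxDiff);
}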

I am able to train my eval from about 5500/15000 on the STS to now 8500/15000 on the STS, at extremely low node counts. This takes about 24 hours, for about 50K params (neural network).

I have abandoned optimizing using gameplay because it's just way too inefficient, and the signal-to-noise ratio is too low. It only takes one blunder to lose a game, and then all the decisions in the game get penalized, even though it's entirely possible that they were all good except for that one blunder.
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
Fabio Gobbato
Posts: 217
Joined: Fri Apr 11, 2014 10:45 am
Full name: Fabio Gobbato

Re: Evaluation Tuning

Post by Fabio Gobbato »

I'm trying a new tuning method. I have made only 2 attempts, with 2 good results so far, so I'm not sure it's a very good method, but in the first attempt I got +12 Elo and in the second +8 Elo.

I extract 150 million positions; the quality of the games doesn't matter much, but it's better if they are quite different from each other.
From them I remove these positions:
- positions in check
- positions with fewer than 7 pieces (useful if the engine uses tablebases)
- positions whose static eval differs from the qsearch score (to keep only quiet positions)
Then I run a depth-2 search for every position and store the position with the returned score in a file.

Once I have this file, I compare each position's stored score with the static evaluation.
I compute the error for a position as the absolute difference of the two scores.

Then I use an optimization algorithm to minimize the error.
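A minimal sketch of that error computation; LabeledPos and staticEval are placeholders for the stored record and the evaluation being tuned:

Code: Select all

#include <cmath>
#include <vector>

struct LabeledPos { /* position data */ double depth2Score; };

double staticEval(const LabeledPos& p);   // the evaluation being tuned (placeholder)

// Total error of the current parameter set: sum of absolute differences
// between the static eval and the stored depth-2 search score.
double totalError(const std::vector<LabeledPos>& data)
{
    double err = 0.0;
    for (const LabeledPos& p : data)
        err += std::fabs(staticEval(p) - p.depth2Score);
    return err;                           // the optimizer minimizes this
}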

The idea is that it's difficult for the evaluation to see the result of the game, but it's easier for it to see the eval from 2 plies ahead.
Another advantage is that this method minimizes the difference between related positions.

I have made only 2 iterations of this method, with a total improvement of +20 Elo.
If someone else would like to try it, we can compare the results.
matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: Evaluation Tuning

Post by matthewlai »

Fabio Gobbato wrote: I'm trying a new tuning method. [...]
This sounds similar to TD-Leaf. Have you seen that paper by any chance?

I am doing something somewhat similar.

I also start with a lot of random unlabelled positions.

Then for each position, I make a few moves with a 2-ply search to decide each move. I also record the PV of each search.

Then I optimize the eval of the leaf of the PV from the search on move 0 toward the eval of the leaf of the PV from the search on move 2, etc.

This is the TD-Leaf algorithm.
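Roughly, the update looks like this sketch; Leaf, eval() and nudgeEvalToward() are placeholders, and the usual lambda-weighting of later temporal differences is left out:

Code: Select all

#include <cstddef>
#include <vector>

struct Leaf { /* position at the end of a recorded PV */ };

double eval(const Leaf& l);                            // current eval output (placeholder)
void   nudgeEvalToward(const Leaf& l, double target);  // one gradient step (placeholder)

// One pass over the PV leaves of consecutive searches in a self-played line:
// each leaf's eval is pulled toward the eval of the next search's leaf.
void tdLeafPass(const std::vector<Leaf>& pvLeaves)
{
    for (std::size_t t = 0; t + 1 < pvLeaves.size(); ++t)
    {
        double target = eval(pvLeaves[t + 1]);   // score of the next PV leaf
        nudgeEvalToward(pvLeaves[t], target);    // reduce |eval(leaf_t) - target|
    }
}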

I got at least 400 Elo from it, starting from a material-only eval.
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
Fabio Gobbato
Posts: 217
Joined: Fri Apr 11, 2014 10:45 am
Full name: Fabio Gobbato

Re: Evaluation Tuning

Post by Fabio Gobbato »

No, I didn't know TD-Leaf. I thought I had discovered something :)
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: Evaluation Tuning

Post by Gerd Isenberg »

Isn't Joel Veness' RootStrap or even TreeStrap as applied in Meep supposed to gain better or faster results than TD-Leaf?