=============
Now, working quietly on my new engine, one of my goals is to get rid of the guesswork in evaluation parameters.
There are some well-explored features, but most positional parameters are simply guesses based on human intuition.
Time has shown that parameters like doubled pawns, to name just one example, were evaluated completely wrong only a decade ago.
The standard approach, playing several thousand games to get somewhat correct values for a feature, is doable on today's hardware,
which allows extremely fast games. But starting with reasonable parameters (and tuning them continuously) that are derived
from data analysis should give more freedom to an engine developer, and it goes along with some other basic considerations.
So, I do not expect these values to be better in the short run, maybe in the long run, than values that are already used in an engine.
The main difference will be knowing where the values come from and being able to close the gap by understanding how values
from data analysis differ from the parameters used in an engine. At the end of the post this point may become clearer, because
"Tapered Eval" is one subject on this matter.
Another point, from the developer's point of view, is to stop twiddling back and forth between the search and evaluation components.
Of course there is a strong correlation between search and static evaluation, without any doubt.
Nonetheless, it does not make sense to repair a broken search feature by tuning the evaluation to compensate for it,
nor the other way around. Therefore a definition is needed of what is correct for each of the two components, so
it becomes clear where a feature needs to be tuned/fixed/introduced.
For this purpose I consider evaluation parameters to be correct when they are based on data analysis.
Well, I need to examine this idea more closely, but if evaluation parameters do not work in conjunction with the search,
the idea is to make the search sensitive to the parameters and not the other way around. A fix of the evaluation parameters
or their conditions would only occur if the database is flawed or insufficient.
Now, the following sections include some basic ideas and explanations of my algorithm, and various hints.
1: Starting with a vector set to zero ( without normalization / Bias )
=======================================================================
Code: Select all
1a: Human 2500+
---------------------
Finished: 1000 games / 42045 positions / 115 240 275 405 825 Time: 3.98s
Finished: 2000 games / 83494 positions / 100 210 245 365 725 Time: 6.93s
Finished: 4000 games / 167823 positions / 100 220 235 365 750 Time: 15.63s
Finished: 8000 games / 335833 positions / 100 230 250 385 795 Time: 20.90s
Finished: 16000 games / 686395 positions / 90 210 225 350 705 Time: 62.99s
Finished: 32000 games / 1415813 positions / 90 220 235 360 730 Time: 196.14s
Finished: 64000 games / 2845859 positions / 85 205 225 350 705 Time: 222.24s
Finished: 128000 games / 5794124 positions / 85 205 220 340 690 Time: 522.15s
1b: CCRL 2800+
-----------------------
Finished: 1000 games / 95093 positions / 110 320 355 465 835 Time: 13.28s
Finished: 2000 games / 186361 positions / 100 290 325 440 850 Time: 17.94s
Finished: 4000 games / 372961 positions / 100 300 340 465 900 Time: 36.27s
Finished: 8000 games / 760902 positions / 100 280 310 445 860 Time: 95.25s
Finished: 16000 games / 1534765 positions / 95 275 305 435 860 Time: 127.27s
Finished: 32000 games / 3139159 positions / 105 275 305 440 860 Time: 304.26s
Finished: 64000 games / 6322204 positions / 105 270 305 440 860 Time: 521.43s
Finished: 128000 games / 12670934 positions / 95 255 285 415 825 Time: 1202.57s
1c: Human 2500+ Excluding draws
--------------------------------
Finished: 4000 games / 212790 positions / 170 430 465 710 1430 Time: 36.22s
Finished: 8000 games / 424052 positions / 160 385 420 635 1285 Time: 52.67s
Finished: 16000 games / 851718 positions / 150 385 420 635 1275 Time: 100.86s
1d: CCRL 2800+ Excluding draws
--------------------------------
Finished: 8000 games / 686922 positions / 175 530 585 820 1620 Time: 90.62s
Finished: 16000 games / 1412146 positions / 180 520 575 810 1585 Time: 460.81s
My interpretation is not that humans use different values, but more likely that they aren't able
to exploit them in an adequate way. Machines seem to be stronger and more accurate/consistent.
This interpretation comes to mind because an early-stage engine produces values closer to the human values.
Besides the useless but interesting interpretation of the data, the most important point is
that the time needed to derive such values is very low. Even the inaccurate levels of 1000/2000/4000 games are pretty useful.
A database of 10,000,000 high-level games could be analyzed within a day if someone needs to be
that accurate, but that seems pointless, because a feature might be tuned again later with respect to orthogonality anyway.
That the results depend strongly on the database goes without saying, and is shown by the human/machine differences above.
Of course the runtime depends on a handful of settings, but in any case the range of the values factors in, and
it is higher for material than for nearly any other feature. You can see the material values above with a resolution
of "only" 5 points. More on that later.
2: Error computation
=====================
Code: Select all
double Database::getFitness(uint32_t id, Position* pos)
{
    // Evaluate current game position
    double score = getWhitePositionScore(id, pos);
    score = Misc::sigmoid(score, 400);
    // Game result 1.0/0.5/0.0
    double result = getGameResult(id);
    // Fitness logic: 1 minus the squared error
    return 1.0 - pow(result - score, 2);
}
The idea is that you can have different objective functions, like a combination of correct-move counts
and squared-error computation. Having a correct move is usually counted as positive. This gives a little flexibility
to the controller functions and something to play with in the future.
The sigmoid function at this place is the basis for the result values above, and it is arbitrary in the sense that I just picked
it from the Elo formula. Using another mapping to winning probabilities will give different results.
The "correctness" of a value is computed without any draw ratio, which I read about in Amir Ban's paper a few days ago.
Looking at the result values at the beginning, it is pretty obvious that draws play a role, but it is sufficient
to have a sample set that includes a realistic distribution of the game results.
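Misc::sigmoid is not shown above; assuming it is the logistic curve behind the Elo expectation formula (my assumption, the exact shape in the engine may differ), a standalone version could look like this:

```cpp
#include <cmath>

// Logistic mapping from a centipawn score to a winning probability in [0,1].
// With k = 400 this is the curve from the Elo expectation formula:
// a 400-point advantage maps to an expected score of roughly 0.91.
double sigmoid(double score, double k)
{
    return 1.0 / (1.0 + std::pow(10.0, -score / k));
}
```

A score of 0 maps to 0.5, so an equal position predicts a draw-ish expected score, and the curve is symmetric around that point.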
3: Evaluation
=============
Code: Select all
double Database::getWhitePositionScore(uint32_t id, Position* pos)
{
    return (double) _gCache[id].staticEval(pos);
    //return (double) _gCache[id].fixDepth(pos,1);
    //return (double) _gCache[id].fixNodes(pos,1000);
}
By far the fastest method is to use the static evaluation of the engine, but the quality of the
results will be more accurate with a more complex search.
An important point is that the evaluation needs to be deterministic. Any type of collected data
needs to be reset before searching (transposition tables, material hash tables, pawn hash tables, ...),
especially if the parameter is part of such a component.
Another finding is that increasing the number of positions/games will not necessarily improve an inaccurate result.
Just a little arbitrary example: looking at 4000 games may produce mobility values 6,6,6,4 (n,b,r,q) and
an error/fitness of 3725.1234 points, while some other combination like 5,6,6,4 produces a fitness of 3723.1234 points.
Increasing to 32000 games may then give 30000.1234 points and 29990.1234 points, so no change in the ordering.
You can continue increasing the game number, but the vector 6,6,6,4 remains the better one.
However, if you change the quality, say doing the analysis with fixDepth(1), this may change and the vector 5,6,6,4 becomes better.
So, having a more accurate statistic does not help in that case.
Being aware of that, it will get interesting to find a way in between, especially in the context of the mentioned guiding ideas.
4: Extracted control
======================
4a: Position analysis
----------------------
Code: Select all
void Database::playout(uint32_t id)
{
    restore_t restore;
    Position pos;
    pos.setup("startpos");
    error_t gameError;
    gameError.clear();
    for(int i=0;i<_gCache[id].mCount;i++)
    {
        pos.doMove(&restore,_gCache[id].move[i]);
        // At this point a lot of filters can be used.
        // A restriction to move count / game phase / whatever you like
        // eg: if(i<20) continue; This would ignore book moves for example
        ...
        // Error type holds error/fitness
        gameError.total += getFitness(id,&pos);
        gameError.count++;
    }
    // At this point the game fitness/error is added to the total fitness/error
    // The simplest form is _error.total += gameError.total
    ...
}
A common approach is to collect positions from pgn files and make epd exports out of them. I started the same way, but it is not handy, needs a lot of
memory, and, more importantly, most of the positions are taken out of context. Some decisions are already made at this
stage, like removing duplicates. A lot of information gets lost on the way, which tools do not handle correctly when converting,
like the halfmove-clock counter and so on.
My proposal is to access positions in game context, for several reasons. Excluding data from analysis is simply done by filters,
like opening moves, well, if there is a need to do so, because these positions are full of information too. Implicitly the tuner
is sensitive to the game results of duplicate positions.
The position after 1.e4 (for example) is not "0" or "1" or "1/2"; it is closer to 55% or something like that.
There is no practical limit. Think of backward analysis from the end of the game to the start, which may improve the
error computation based on the result, computing intermediate results between the last computation and the game result.
Creating books out of the data, ..., so many possibilities.
But one of the first practical improvements is that you store the positions relative to the game: you can play them out, you only need to
store the movelist, 2 bytes per move, and set up the start position for each game analysis. Finally, it is fast too, and any
other information for the game is available, like the "bestMove" as the next move and so on.
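As a sketch of such a 2-byte move encoding (the field layout here is my own assumption, not necessarily the one the engine uses): 6 bits each for the from- and to-square leave 4 bits for a promotion piece or special flags.

```cpp
#include <cstdint>

// Pack a move into 16 bits: 6 bits from-square, 6 bits to-square,
// 4 bits promotion/flags. The layout is illustrative only.
uint16_t packMove(int from, int to, int promo)
{
    return (uint16_t)((from & 63) | ((to & 63) << 6) | ((promo & 15) << 12));
}

int moveFrom(uint16_t m)  { return m & 63; }
int moveTo(uint16_t m)    { return (m >> 6) & 63; }
int movePromo(uint16_t m) { return (m >> 12) & 15; }
```

With roughly 40-60 moves per game this keeps a whole game around 100 bytes, which is why millions of games fit comfortably in memory.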
HINT: pgn-extract-17-21 -tFall.txt my.pgn -Wuci -oAll.pgn
The -Wuci option in pgn-extract works very well and should make it very easy for all UCI engine authors to get UCI-compatible movelists.
Here is an example of a pgn game converted to a UCI gamefile.
Code: Select all
[Event "CCRL 40/40"]
[Site "CCRL"]
[Date "2006.04.25"]
[Round "19.1.1"]
[White "Rybka 1.1 64-bit"]
[Black "Zap!Chess Paderborn 64-bit 2CPU"]
[Result "1-0"]
[WhiteElo "2894"]
[BlackElo "2808"]
[ECO "B48"]
[Opening "Sicilian"]
[Variation "Taimanov variation"]
[PlyCount "107"]
e2e4 c7c5 g1f3 b8c6 d2d4 c5d4 f3d4 e7e6 b1c3 d8c7 c1e3 a7a6 f1d3 g8f6 e1g1 c6e5 h2h3 f8c5 g1h1 d7d6 f2f4 e5g6 d1d2 e8g8 d4b3 c5b4 a2a3
b4c3 d2c3 c7c3 b2c3 e6e5 f4f5 g6e7 b3d2 c8d7 c3c4 h7h6 a3a4 d7c6 h1h2 a6a5 g2g4 g8h7 f1e1 f6d7 d2b3 b7b6 h2g2 g7g6 g2g3 e7g8 h3h4 g8f6
e3d2 h7g7 g4g5 f6h5 g3f2 g7h7 d2e3 a8e8 f5f6 e8b8 f2f3 f8d8 e1b1 d8e8 b1d1 e8d8 d3f1 d7c5 b3c5 d6c5 d1d3 h7g8 d3d5 h5f4 d5e5 f4h5 e5e7
d8d6 a1b1 h6g5 h4g5 c6a4 e4e5 a4c6 f3f2 d6d7 e3c5 h5f4 c5d6 b8c8 b1b6 c6a4 b6a6 d7e7 f6e7 a4d7 c4c5 a5a4 a6a7 d7c6 f1a6 c8e8 a7c7 1-0
4b: Game analysis
------------------------
Code: Select all
void Database::analyze(uint32_t gameLimit)
{
    // Reset total error
    _error.clear();
    // Check database limitation
    gameLimit = min(gamecount,gameLimit);
    // Alternatively the control can be _error.gamecount < gamelimit if filters are active.
    // For example if draws are excluded and you want to have 4000 games anyway.
    for(int i=0;i<gameLimit;i++)
    {
        // At this point more general filters can be used
        // for the games that should be analyzed.
        ...
        // Exclude draw games
        //if(_gCache[i].result == RESULT_DRAW ) continue;
        // Call position analysis
        playout(i);
    }
}
The control works on 2 layers: the game layer, seen here, and the position layer, seen above.
It is easy to look at one random position, all positions of a game, phase-dependent positions of a game, or
positions with a special opening or position feature. "4a" and "4b" are the corresponding controllers.
I already mentioned the paper of Amir Ban, who picks one position out of each game, but there exist more elegant
solutions to avoid the noise due to unequally distributed movecounts, which effectively counts game results multiple times.
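One such solution could be to weight games rather than raw positions; the helper below is a hypothetical sketch, not part of the original code:

```cpp
// Hypothetical variant of the error accumulation: each game contributes
// its average position fitness instead of the raw sum, so games of
// different lengths carry equal weight and a game result is not
// implicitly counted once per move.
double gameContribution(double gameErrorTotal, int positionCount)
{
    return positionCount > 0 ? gameErrorTotal / (double)positionCount : 0.0;
}
```

In the playout() code above this would replace the simple _error.total += gameError.total accumulation.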
4c: The tuning algorithm
-------------------------
Code: Select all
void Database::tuneFast()
{
    const int STEPSIZE   = 5;
    const int RESOLUTION = 5;
    const int GAMELIMIT  = 16000;
    double randomStep,score,bestscore,startingscore;
    param_t cur,best;
    // Set best vector. Of course it can be set to values that are
    // already used. Example: for(int i=0;i<VECTORSIZE;i++)best.value[i]=0;
    ...
    // Starting values. "param" is the global vector interface
    // which can be used at any place in the evaluation and sets
    // the values that will be tuned.
    param = best;
    // The initial vector needs to be valued
    analyze(GAMELIMIT);
    startingscore = bestscore = _error.total;
    // Stepsize: big to small changes
    // example: [5,-5,4,-4,3,-3,2,-2,1,-1]
    int step[STEPSIZE*2];
    ...
    for(int i=0;i<2*STEPSIZE;i++)
    {
        printf("\nGeneration: [%d,%d] %f (%f)",_error.gamecount,_error.count,bestscore,startingscore);
        for(int j=0;j<VECTORSIZE;j++)
        {
            cur = best;
            modifyParam(&cur,j,step[i]*RESOLUTION);
            param = cur;
            analyze(GAMELIMIT);
            if(_error.total > bestscore)
            {
                best = cur;
                bestscore = _error.total;
                best.dump();
                // Improvement found: restart with the biggest step
                i = -1;
                break;
            }
        }
    }
    printf("\n\nFinished: %d games / %d positions / ",GAMELIMIT,_error.count);
    best.dump();
}
There is nearly nothing left of the simple regression analysis of my first "solution" or of my previous gradient-descent algorithm.
Just a modified "delta" scoring scheme (for optional use), and some things I need to keep in mind.
First, having something like a delta evaluation is very useful functionality. If I want to measure a feature without
bias, the error is computed as evaluateAllFeatures - zeroVectorEvaluation. The delta shows the impact of the feature by itself,
in other words ignoring the orthogonality of other features.
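A toy sketch of this delta idea (all names here are illustrative, not from the engine): with a linear evaluation, subtracting the zero-vector evaluation cancels the position-dependent rest of the score and leaves only the feature's own contribution.

```cpp
#include <vector>
#include <numeric>

// Toy evaluation: a position-dependent rest term plus a linear
// feature contribution dot(features, params).
double evaluate(double rest, const std::vector<double>& features,
                const std::vector<double>& params)
{
    return rest + std::inner_product(features.begin(), features.end(),
                                     params.begin(), 0.0);
}

// Delta scoring: evaluating against a zero vector cancels the rest
// term, isolating the impact of the tuned feature itself.
double deltaScore(double rest, const std::vector<double>& features,
                  const std::vector<double>& params)
{
    std::vector<double> zero(params.size(), 0.0);
    return evaluate(rest, features, params) - evaluate(rest, features, zero);
}
```

Note that the delta is independent of the rest term, which is exactly the "without bias" measurement described above.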
In the previous version my algorithm needed to be stopped after a fixed generation count.
Each parameter was checked in a "direction", with a random amount to avoid local extreme values.
It followed the biggest gain; later it just picked any improvement and produced another random step.
Finally, it turned out to be best to do it the way you can see above, because it terminates
when none of the steps is able to produce any progress.
At the beginning I mentioned the runtime, which is influenced by the range of the values and the number of games (positions) that are
analyzed, but of course the RESOLUTION, the STEPSIZE and the length of the vector are important too.
What is defined in the "step[]" array is something like this:
Code: Select all
randomStep = rand() % STEPSIZE;
randomStep *= RESOLUTION;
step = rand() % 2 ? randomStep : -randomStep;
With the algorithm above, and knowing the complexity of the problem now, a value with a range of 1000 can be solved
within the times you can see in the results above. Parameters like mobility and many others have ranges of up to 10, maybe 100.
Lower resolutions can easily be used for such ranges of up to 100. Compared to the material computation that is a lot faster,
but if you tune arrays of [8],[16],[32],[64] the computational costs rise again.
Future work
-----------
The base has been built now. The algorithm can already be used to introduce parameter values without any guessing.
It is not about tuning 70, 700 or 7000 parameters in one run, but about introducing 4, 8, 16, 32 or 64 values at a time. The design of the
parameters is still important. A PST table can be introduced as a combined file[8],rank[8] value or as an array[64].
Fewer parameters are still easier to tune. A re-tuning of a feature is possible at any time the conditions change or
the evaluation gets more complex, so that it includes more overlapping features for example.
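As an illustration of the file[8],rank[8] design (the names and square numbering are mine, not from the engine): the 64-entry table is derived from only 16 tunable values.

```cpp
#include <array>

// Build a 64-square piece-square table from 8 file bonuses and 8 rank
// bonuses (16 parameters instead of 64). Squares are numbered a1 = 0,
// h8 = 63, so file = sq & 7 and rank = sq >> 3.
std::array<int, 64> buildPst(const std::array<int, 8>& fileBonus,
                             const std::array<int, 8>& rankBonus)
{
    std::array<int, 64> pst{};
    for (int sq = 0; sq < 64; sq++)
        pst[sq] = fileBonus[sq & 7] + rankBonus[sq >> 3];
    return pst;
}
```

The trade-off is expressiveness: the combined form cannot represent every 64-entry table, but the tuner only has to search a 16-dimensional space.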
Countless improvements to the algorithm are left, and still some puzzles to solve.
The next topic will be about "Tapered Eval", which is a real handicap for a tuner, at least in the common (Fruit-style) form.
Maybe some of you can pick up an idea, or start with your own automated tuner based on this post.