Evaluation Tuning

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: Evaluation Tuning

Post by Desperado »

Ferdy wrote:
Desperado wrote:I would prefer a general solution that does not depend on restrictions on the values, but on an adaptation of the phase computation or, more generally, of the weighted evaluations.
That would be good indeed for positional factors but for piece values I have some doubts.
The crux is:

Code: Select all


// 1: simple parameter types

score_t evaluation()
{
   score = sumMaterialValues;   // as an example
   ...
   return score;
}

is different to

Code: Select all


// 2: mixed parameters for mg/eg scoring

score_t evaluation()
{
   scoreMG = sumMaterialValuesMG;   // as an example
   scoreEG = sumMaterialValuesEG;
   ...

   score = ((scoreMG * phase) + scoreEG * (MAXPHASE - phase)) / MAXPHASE;

   ...
   return score;
}

The latter leads to obscure results like [200,1400] for the queen, whereas the first function produces a usable result like 800.
Ferdy
Posts: 4833
Joined: Sun Aug 10, 2008 3:15 pm
Location: Philippines

Re: Evaluation Tuning

Post by Ferdy »

Desperado wrote:The latter leads to obscure results like [200,1400] for the queen, whereas the first function produces a usable result like 800.
Where is this [200,1400] coming from? Did you just make it up, or does it come from your tuner?
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: Evaluation Tuning

Post by Desperado »

I made this up, but believe me, the tuner results are close to this example, and it doesn't matter what my tuner configuration looks like. The effect shows up as the tuner approaches the end of its run.
If I stopped the tuner in the middle of the run, reasonable values would appear occasionally, but they are not the optimal values with respect to the fitness function.
I am certain this is not a problem of the tuner, because the same handicap appears with the different tuning algorithms I have played around with (genetic tuning and various other approaches).

It may be disguised by the runtime of the different tuning algorithms because, as I mentioned, the reasonable values do occur somewhere in between, but they are tuned away in favour of better results with respect to the fitness function (which finally is just a conversion into a winning probability).

So, while the tuner is doing what it should do, the results suggest that interpolating between game phases (based on a material computation) is a flawed idea,
especially if you use "optimal" values for such a pair of values (imho, and under my current impression).
Ferdy
Posts: 4833
Joined: Sun Aug 10, 2008 3:15 pm
Location: Philippines

Re: Evaluation Tuning

Post by Ferdy »

Desperado wrote:
Ferdy wrote:
Desperado wrote:Mentioned this... well, the fitness function returns winning probabilities, and it is pretty obvious that mg advantages/values are more exhalable.
It is the mg values that matter, not the winning probabilities. The winning probability is the more exhalable one.
I am not sure what you mean.

Code: Select all

double Database::getFitness(uint32_t id,Position* pos)
{
    // Evaluate current game position
    double score = getWhitePositionScore(id,pos);
           score = Misc::sigmoid(score,400);

    // Game result 1.0/0.5/0.0
    double result = getGameResult(id);

    // Fitness logic
    return 1.0 - pow(result-score,2);
}
Only the error of the computed score, expressed as a winning probability, is minimized. The probability function just maps the evaluation score, that's all.
Of course the mg score influences the direct output of the evaluation function, but then the probability will be different too.
What I mean is that the score is important, not the probability, because the probability is calculated from the score. I am reacting to the word exhalable.
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: Evaluation Tuning

Post by Desperado »

Ferdy wrote:
Desperado wrote:
Ferdy wrote:
Desperado wrote:Mentioned this... well, the fitness function returns winning probabilities, and it is pretty obvious that mg advantages/values are more exhalable.
It is the mg values that matter, not the winning probabilities. The winning probability is the more exhalable one.
I am not sure what you mean.

Code: Select all

double Database::getFitness(uint32_t id,Position* pos)
{
    // Evaluate current game position
    double score = getWhitePositionScore(id,pos);
           score = Misc::sigmoid(score,400);

    // Game result 1.0/0.5/0.0
    double result = getGameResult(id);

    // Fitness logic
    return 1.0 - pow(result-score,2);
}
Only the error of the computed score, expressed as a winning probability, is minimized. The probability function just maps the evaluation score, that's all.
Of course the mg score influences the direct output of the evaluation function, but then the probability will be different too.
What I mean is that the score is important, not the probability, because the probability is calculated from the score. I am reacting to the word exhalable.
I think we can agree that the output of the objective function always depends on the parameters being tuned, and of course on the definition of the function itself. Both elements depend on each other.

The objective is interpreted as the winning probability known from the Elo formula. So I begin to understand that the reported results do make sense, especially because all values drift in the same direction, but by different amounts.
That means the proportions between the parameters change. If I have some time left later in the evening, I can provide some real data, which I think will be easier to follow. If not today, then in the next few days for sure.
nionita
Posts: 175
Joined: Fri Oct 22, 2010 9:47 pm
Location: Austria

Re: Evaluation Tuning

Post by nionita »

Hi Michael,

Do I understand correctly that you try to tune only the material with this method?

If yes, then my opinion is that this will not work unless you are very lucky (which would actually be bad luck again in the end, because next time it will not work). And this is why I think it doesn't work:

The tuner tries to "explain" the differences between the training positions with too few parameters (the material values). But we all know that this is by far not enough. The result is then biased by the few hundred thousand positions used in the training. But the number of positions encountered in a chess game (in the analysis, not in the actual moves!) is much larger than a few hundred thousand, and their variation (especially in material imbalance) is much, much greater than what you see in a real game!

So what you actually get is a set of overfitted parameter values.

Just my 2 cents.

[Edit: overfitting must be the wrong term: it should be a model error - the model is too simple - and the training set is probably not representative]

Regards, Nicu
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: Evaluation Tuning

Post by Desperado »

nionita wrote:Hi Michael,

Do I understand correctly that you try to tune only the material with this method?

If yes, then my opinion is that this will not work unless you are very lucky (which would actually be bad luck again in the end, because next time it will not work). And this is why I think it doesn't work:

The tuner tries to "explain" the differences between the training positions with too few parameters (the material values). But we all know that this is by far not enough. The result is then biased by the few hundred thousand positions used in the training. But the number of positions encountered in a chess game (in the analysis, not in the actual moves!) is much larger than a few hundred thousand, and their variation (especially in material imbalance) is much, much greater than what you see in a real game!

So what you actually get is a set of overfitted parameter values.

Just my 2 cents.

[Edit: overfitting must be the wrong term: it should be a model error - the model is too simple - and the training set is probably not representative]

Regards, Nicu
Hello, Nicu,

I tune all parameters like this, but the idea is to tune only a subset of all parameters at the same time.
Currently I pick a feature (like mobility values, or file/rank values) as the subset, so the subset includes
4, 5, 8, ..., 16, 64 parameters to tune. Of course there would be no problem mixing parameters of different features.
Material values were mainly chosen in this thread to explain the ideas, but there are no restrictions at all.

The databases I use include 200000/400000/... games, up to 40 million positions. The starting post shows that
even small databases of only 100000 positions or fewer are able to give usable results.
Especially because I want a set of positions that is not biased by any property (like game phase),
the main control loop runs over games (not positions) to get a complete picture. So, looping over 1000 games means
looping over 1000 games * (let's say) 100 positions = 100000 positions with balanced characteristics.
The tuner tries to "explain" the differences between the training positions with too few parameters (the material values).
That is a good point; I will enable some more parameters, so the orthogonality of features may have more influence than I expect at this stage.
But it does not explain why it works for a "simple" evaluation and not when score interpolation is included.
Well, it works then too, but the results are somehow unexpected.

As I said, I need some time to produce some data and to formulate some thoughts on the results.

First, I need to go to work...
tpetzke
Posts: 686
Joined: Thu Mar 03, 2011 4:57 pm
Location: Germany

Re: Evaluation Tuning

Post by tpetzke »

Hi Michael,

my two cents after playing a bit with the Texel tuning method.

The outcome is very sensitive to the percentage of positions from drawn games that you include. As an extreme example, imagine a set with positions only from drawn games: the lowest error will be produced by a set with all weights being 0. That set will not win a lot of games.

A lower evaluation error does not necessarily mean better game-playing performance. The most common case in all my tests was a new set with a lower error that finally scores only 48% or so. You can also easily drive the evaluation error down by exposing more parameters; however, it does not improve the engine.

I was tuning MG and EG values at the same time, and I did not see crazy values like queen 200 and 1400, but I usually ended up with queen values bigger than 1200 (for both MG and EG).

One thing that always troubles me is that you use actual positions from the game. In most of those positions either both sides have a queen or the queens are already exchanged, so the material value of the queen is zeroed out for most of the positions. Positions with a material imbalance would probably be better, but the sets most likely don't contain enough of them. Most of the positions a chess engine has to deal with in eval never occur in the game later on, so fitting the eval to those positions might not be the best preparation for real life.

Thomas...

=======
http://macechess.blogspot.com - iCE Chess Engine