Ideas and questions about how to test evaluation functions


nkg114mc
Posts: 74
Joined: Sat Dec 18, 2010 5:19 pm
Location: Tianjin, China
Full name: Chao M.

Ideas and questions about how to test evaluation functions

Post by nkg114mc »

Hi, all.

I was thinking about how to test an evaluation function, or how to compare several eval functions to see whether one is stronger than another. I remember that Don Dailey published a post about “different kinds of testing” (http://www.talkchess.com/forum/viewtopic.php?t=30550), in which he mentioned three testing methods: 1) time-control games, 2) fixed depth, 3) fixed nodes. So I was wondering: if I want to compare two engines that have exactly the same search code but different eval functions (i.e. the only difference between the two is the eval function), which testing method should I use?

Here are some of my own points:
1) Use “fixed depth” testing with a shallow depth limit, say depth 2 or depth 3, so that the engines can play fast games. Lazy evaluation should be disabled.
2) Use an opening test suite, for example Dr. Hyatt’s EPD opening suite “openings.epd”, which contains 4000 opening positions, so that we can measure performance across many different games.
3) Play a tournament of N games using the two points above, and compare the Elo of the two engines to decide which one is better.
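
To make points 1)-3) concrete, here is a rough sketch of the kind of match driver I have in mind, assuming python-chess and two UCI builds that differ only in the eval; the engine paths, the depth, and the opening file name are placeholders:

Code: Select all

# Sketch of a fixed-depth match from EPD openings between two UCI builds that
# differ only in their eval. Assumes python-chess; "./engine_evalA",
# "./engine_evalB", "openings.epd" and the depth are placeholders.
import random
import chess
import chess.engine

DEPTH = 3
ENGINES = ["./engine_evalA", "./engine_evalB"]

def load_openings(path):
    boards = []
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) >= 4:
                # The first four EPD fields are the FEN position fields.
                boards.append(chess.Board(" ".join(fields[:4]) + " 0 1"))
    return boards

def play_game(white_path, black_path, opening):
    white = chess.engine.SimpleEngine.popen_uci(white_path)
    black = chess.engine.SimpleEngine.popen_uci(black_path)
    board = opening.copy()
    try:
        while not board.is_game_over(claim_draw=True):
            engine = white if board.turn == chess.WHITE else black
            board.push(engine.play(board, chess.engine.Limit(depth=DEPTH)).move)
        return board.result(claim_draw=True)   # "1-0", "0-1" or "1/2-1/2"
    finally:
        white.quit()
        black.quit()

scores = {path: 0.0 for path in ENGINES}
for opening in random.sample(load_openings("openings.epd"), 100):
    # Play each opening twice with colours reversed.
    for w, b in [(ENGINES[0], ENGINES[1]), (ENGINES[1], ENGINES[0])]:
        result = play_game(w, b, opening)
        if result == "1-0":
            scores[w] += 1.0
        elif result == "0-1":
            scores[b] += 1.0
        else:
            scores[w] += 0.5
            scores[b] += 0.5
print(scores)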

Now my question is: is this a valid method for concluding that “one eval is better than the other”?
And if it is valid, there are still some detailed questions:
About 1): what depth limit is appropriate?
Should we keep the qsearch, or disable it for the test?
About 3): how many games are enough to draw a conclusion? 1000? 10,000? Or more?

I know that it is not very meaningful to test an evaluation function by itself, without combining it with the search part of an engine. I am just curious about how accurate an evaluation function can be, and I want to try to implement one that is as accurate as I can make it.

Can anyone give me some advice about this? Thanks very much!
rbarreira
Posts: 900
Joined: Tue Apr 27, 2010 3:48 pm

Re: Ideas and questions about how to test evaluation functio

Post by rbarreira »

I don't like the fixed depth idea so much, as the evaluation has an effect on move ordering and therefore on search tree size. So in a real game different evals will let the engine reach different depths, which makes it unfair to test them at the same depth.

For tuning evals, it would seem that testing at a fixed number of nodes or with timed games would be best. A fixed number of nodes seems quite fine here: when you're tuning parameters the speed of the eval doesn't change, which means fixed nodes is basically equivalent to timed games (but with certain advantages, such as being hardware-independent and not being affected by system load).

If you want to compare evals that differ in more than just parameter values (i.e. one has different features from the other), then it would seem the only fair way is to run timed games. Fixed depth is unfair due to differences in eval quality and speed. Fixed nodes is unfair due to differences in eval speed.
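
For what it's worth, all three limits map directly onto the UCI "go" command, so switching between them is trivial once you have a match driver. A minimal sketch, assuming python-chess and a placeholder engine path:

Code: Select all

# Driving one UCI engine under each of the three limits. Assumes python-chess;
# "./myengine" and the limit values are placeholders.
import chess
import chess.engine

board = chess.Board()
engine = chess.engine.SimpleEngine.popen_uci("./myengine")
try:
    # Fixed depth  ->  "go depth 8"
    by_depth = engine.play(board, chess.engine.Limit(depth=8))
    # Fixed nodes  ->  "go nodes 200000" (hardware- and load-independent)
    by_nodes = engine.play(board, chess.engine.Limit(nodes=200_000))
    # Timed game   ->  "go wtime ... btime ... winc ... binc ..."
    by_clock = engine.play(board, chess.engine.Limit(
        white_clock=60.0, black_clock=60.0, white_inc=1.0, black_inc=1.0))
    print(by_depth.move, by_nodes.move, by_clock.move)
finally:
    engine.quit()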
nkg114mc
Posts: 74
Joined: Sat Dec 18, 2010 5:19 pm
Location: Tianjin, China
Full name: Chao M.

Re: Ideas and questions about how to test evaluation functio

Post by nkg114mc »

Hi, Ricardo. Thanks for your reply!

I understand what you mean, and agree with you on some points.

In fact, the real question I am curious about is: "Is there any method that allows us to compare the accuracy of two eval functions theoretically?" Here "theoretically" means we ignore computing time (though it should still be a static function without search) and its effect on the search (for example, we could use a 2-ply full minimax search without pruning and without a TT, so that the eval no longer changes the move ordering); we just want to see whether one is more accurate than the other.

Does this "theoretical method" exist?
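
Just to make the setup I have in mind concrete, here is a sketch of such a fixed 2-ply full-width search with a pluggable eval (python-chess is used only for move generation, and material_eval is a stand-in for a real evaluation function):

Code: Select all

# Plain fixed-depth minimax: no alpha-beta, no TT, no qsearch, so the eval
# cannot influence move ordering or tree shape. material_eval is only a
# placeholder; mate/stalemate scoring is left out to keep the sketch short.
import chess

PIECE_VALUES = {chess.PAWN: 100, chess.KNIGHT: 320, chess.BISHOP: 330,
                chess.ROOK: 500, chess.QUEEN: 900, chess.KING: 0}

def material_eval(board):
    """Score in centipawns from the side to move's point of view."""
    score = 0
    for piece in board.piece_map().values():
        value = PIECE_VALUES[piece.piece_type]
        score += value if piece.color == board.turn else -value
    return score

def minimax(board, depth, evaluate):
    if depth == 0 or board.is_game_over():
        return evaluate(board)
    best = -10**9
    for move in board.legal_moves:
        board.push(move)
        best = max(best, -minimax(board, depth - 1, evaluate))
        board.pop()
    return best

def best_move(board, evaluate, depth=2):
    best, best_score = None, -10**9
    for move in board.legal_moves:
        board.push(move)
        score = -minimax(board, depth - 1, evaluate)
        board.pop()
        if score > best_score:
            best, best_score = move, score
    return best

Two evals could then be run through best_move on the same set of positions and their choices (or scores) compared; the open question remains what to compare them against in order to call one "more accurate".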
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Ideas and questions about how to test evaluation functio

Post by Don »

nkg114mc wrote:Hi, all.

I was thinking about how to test an evaluation function, or how to compare several eval functions to see whether one is stronger than another. I remember that Don Dailey published a post about “different kinds of testing” (http://www.talkchess.com/forum/viewtopic.php?t=30550), in which he mentioned three testing methods: 1) time-control games, 2) fixed depth, 3) fixed nodes. So I was wondering: if I want to compare two engines that have exactly the same search code but different eval functions (i.e. the only difference between the two is the eval function), which testing method should I use?

Here are some of my own points:
1) Use “fixed depth” testing with a shallow depth limit, say depth 2 or depth 3, so that the engines can play fast games. Lazy evaluation should be disabled.
2) Use an opening test suite, for example Dr. Hyatt’s EPD opening suite “openings.epd”, which contains 4000 opening positions, so that we can measure performance across many different games.
3) Play a tournament of N games using the two points above, and compare the Elo of the two engines to decide which one is better.

Now my question is: is this a valid method for concluding that “one eval is better than the other”?
And if it is valid, there are still some detailed questions:
About 1): what depth limit is appropriate?
Should we keep the qsearch, or disable it for the test?
About 3): how many games are enough to draw a conclusion? 1000? 10,000? Or more?

I know that it is not very meaningful to test an evaluation function by itself, without combining it with the search part of an engine. I am just curious about how accurate an evaluation function can be, and I want to try to implement one that is as accurate as I can make it.

Can anyone give me some advice about this? Thanks very much!
Our own thinking on this has evolved over time.

We generally limit fixed-depth testing to things that have little impact on the tree, or simply use it as a study to learn something about a change and its impact on the speed of the program. Of course almost everything has some impact, but we always instrument any slowdown or speedup and any change in nodes.

Here is an example from a 12 ply test:

Code: Select all


Rank Name              Elo      +      -    games   score   oppo.   draws 
   1 kse-4325.00     3000.0    8.2    8.2    5738   50.4%  2997.7   56.9% 
   2 kse-4325.08-25  2998.8    8.2    8.2    5737   50.1%  2998.1   56.1% 
   3 kse-4325.08-30  2998.5    8.2    8.2    5736   50.0%  2998.2   56.0% 
   4 kse-4325.08-20  2995.7    8.3    8.3    5737   49.4%  2999.1   55.1% 


      TIME       RATIO    log(r)     NODES    log(r)  ave DEPTH    GAMES   PLAYER
 ---------  ----------  --------  --------  --------  ---------  -------   --------------
    0.3273       0.940    -0.062     0.272    -0.005    11.9989     5737   kse-4325.08-25
    0.3276       0.941    -0.061     0.272    -0.006    11.9988     5736   kse-4325.08-30
    0.3294       0.946    -0.056     0.274     0.003    11.9991     5737   kse-4325.08-20
    0.3483       1.000     0.000     0.273     0.000    11.9992     5738   kse-4325.00
So we don't just rate the change, we instrument its impact in time as well as in nodes.

We are trying parameter values of 25, 30, and 20 for some search feature we are testing here. Notice that for some versions there is a reduction in nodes and a speedup, so the slight reduction in Elo is easily compensated for. This will be followed up by a time-control test later, because we don't completely trust fixed depth.
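
For reference, the RATIO and log(r) columns above are just each version's average divided by the baseline's, and the natural log of that ratio; a sketch of the bookkeeping (the inputs here are copied from the rounded table, so the recomputed logs only match it approximately):

Code: Select all

# Time and node ratios of each version relative to the baseline, as in the
# table above. The averages are placeholders copied from the rounded output;
# in practice they are accumulated over all games of the test.
import math

baseline = {"name": "kse-4325.00",    "time": 0.3483, "nodes": 0.273}
versions = [
    {"name": "kse-4325.08-25", "time": 0.3273, "nodes": 0.272},
    {"name": "kse-4325.08-30", "time": 0.3276, "nodes": 0.272},
    {"name": "kse-4325.08-20", "time": 0.3294, "nodes": 0.274},
]

print(f"{'PLAYER':<16}{'TIME':>9}{'RATIO':>8}{'log(r)':>9}{'NODES':>8}{'log(r)':>9}")
for v in versions + [baseline]:
    t_ratio = v["time"] / baseline["time"]
    n_ratio = v["nodes"] / baseline["nodes"]
    print(f"{v['name']:<16}{v['time']:>9.4f}{t_ratio:>8.3f}"
          f"{math.log(t_ratio):>9.3f}{v['nodes']:>8.3f}{math.log(n_ratio):>9.3f}")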

In general evaluation weights do not affect the tree very much, and we usually use fixed depth for this, followed up by time-control games. However, we have discovered that even evaluation weights have funky scaling properties; I think we have proved to ourselves (empirically) that the best weight for a 3-ply search may be significantly different from the best weight for a deep search for some particular positional feature. So for weight tuning we still avoid shallow depths, and we can still get tens of thousands of games at intermediate depths such as 7-10 ply.

There is no testing that is completely valid, as you can get minor differences in results at different time controls, and of course there is always the scaling issue, something that is starting to plague chess programs today. As we reach higher and higher levels, chess programs become more sensitive to this. But time-control games are the most realistic way to test, and they are how we verify everything, because that is how programs are actually used.
hgm
Posts: 27793
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Ideas and questions about how to test evaluation functio

Post by hgm »

I think the main problem is how you define 'better'. By testing which evaluation performs better with the same search, you are optimizing it for that search, and with another search, another evaluation, which now tests as inferior, might turn out better. E.g. a search without check extension could be prone to tactical errors by the side that can check, because it will use the checks to push unavoidable losses over the horizon. So exposing your King to checks might be a profitable strategy with such a search.
nkg114mc
Posts: 74
Joined: Sat Dec 18, 2010 5:19 pm
Location: Tianjin, China
Full name: Chao M.

Re: Ideas and questions about how to test evaluation functio

Post by nkg114mc »

hgm wrote:I think the main problem is how you define 'better'. By testing which evaluation performs better with the same search, you are optimizing it for that search, and with another search, another evaluation, which now tests as inferior, might turn out better. E.g. a search without check extension could be prone to tactical errors by the side that can check, because it will use the checks to push unavoidable losses over the horizon. So exposing your King to checks might be a profitable strategy with such a search.
Thanks, Mr. Muller! You mentioned another thing that I have been thinking about for some time.
I don't know whether there is some method that allows us to develop the evaluation function and the search engine separately. For example, suppose a team of two people develops a chess engine, one mainly responsible for the search algorithm and the other for the evaluation. Can the member responsible for the eval develop the eval function independently of the other? It seems that some essential communication cannot be avoided, can it?
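
One way to at least decouple the code, if not the tuning, is to agree on a very narrow interface between the two parts. A sketch in Python for illustration (the names are invented; a real engine would do the same thing in C or C++, and in practice the boundary is rarely this clean, since things like lazy eval, incremental updates, and eval-dependent pruning margins all leak across it):

Code: Select all

# A narrow search/eval contract, so that the two halves can be developed
# against an agreed interface. Names and the toy 1-ply search are invented
# for illustration only.
from typing import Protocol
import chess

class Evaluator(Protocol):
    def evaluate(self, board: chess.Board) -> int:
        """Static score in centipawns, from the side to move's point of view."""
        ...

class MaterialEval:
    """The eval author's side: any class with evaluate() can be swapped in."""
    VALUES = {chess.PAWN: 100, chess.KNIGHT: 320, chess.BISHOP: 330,
              chess.ROOK: 500, chess.QUEEN: 900, chess.KING: 0}

    def evaluate(self, board):
        score = 0
        for piece in board.piece_map().values():
            value = self.VALUES[piece.piece_type]
            score += value if piece.color == board.turn else -value
        return score

def pick_move(board, evaluator: Evaluator):
    """The search author's side: a real search is far more than this 1-ply
    loop, but it only ever touches the eval through the interface."""
    best, best_score = None, -10**9
    for move in board.legal_moves:
        board.push(move)
        score = -evaluator.evaluate(board)
        board.pop()
        if score > best_score:
            best, best_score = move, score
    return best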
nkg114mc
Posts: 74
Joined: Sat Dec 18, 2010 5:19 pm
Location: Tianjin, China
Full name: Chao M.

Re: Ideas and questions about how to test evaluation functio

Post by nkg114mc »

Don wrote: Our own thinking on this has evolved over time.

We generally limit fixed-depth testing to things that have little impact on the tree, or simply use it as a study to learn something about a change and its impact on the speed of the program. Of course almost everything has some impact, but we always instrument any slowdown or speedup and any change in nodes.

Here is an example from a 12 ply test:

Code: Select all


Rank Name              Elo      +      -    games   score   oppo.   draws 
   1 kse-4325.00     3000.0    8.2    8.2    5738   50.4%  2997.7   56.9% 
   2 kse-4325.08-25  2998.8    8.2    8.2    5737   50.1%  2998.1   56.1% 
   3 kse-4325.08-30  2998.5    8.2    8.2    5736   50.0%  2998.2   56.0% 
   4 kse-4325.08-20  2995.7    8.3    8.3    5737   49.4%  2999.1   55.1% 


      TIME       RATIO    log(r)     NODES    log(r)  ave DEPTH    GAMES   PLAYER
 ---------  ----------  --------  --------  --------  ---------  -------   --------------
    0.3273       0.940    -0.062     0.272    -0.005    11.9989     5737   kse-4325.08-25
    0.3276       0.941    -0.061     0.272    -0.006    11.9988     5736   kse-4325.08-30
    0.3294       0.946    -0.056     0.274     0.003    11.9991     5737   kse-4325.08-20
    0.3483       1.000     0.000     0.273     0.000    11.9992     5738   kse-4325.00
So we don't just rate the change, we instrument its impact in time as well as in nodes.

We are trying parameter values of 25, 30, and 20 for some search feature we are testing here. Notice that for some versions there is a reduction in nodes and a speedup, so the slight reduction in Elo is easily compensated for. This will be followed up by a time-control test later, because we don't completely trust fixed depth.

In general evaluation weights do not affect the tree very much, and we usually use fixed depth for this, followed up by time-control games. However, we have discovered that even evaluation weights have funky scaling properties; I think we have proved to ourselves (empirically) that the best weight for a 3-ply search may be significantly different from the best weight for a deep search for some particular positional feature. So for weight tuning we still avoid shallow depths, and we can still get tens of thousands of games at intermediate depths such as 7-10 ply.

There is no testing that is completely valid, as you can get minor differences in results at different time controls, and of course there is always the scaling issue, something that is starting to plague chess programs today. As we reach higher and higher levels, chess programs become more sensitive to this. But time-control games are the most realistic way to test, and they are how we verify everything, because that is how programs are actually used.
Thanks for such a detailed reply, Mr. Dailey! It makes it much clearer to me which method we should choose for testing.
By the way, Don, how much time per game do you think is appropriate for a fixed-time test? I guess the time cannot be too long, unless we want to test our engine in all aspects before a formal match.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Ideas and questions about how to test evaluation functio

Post by Don »

nkg114mc wrote:
Thanks for such a detailed reply, Mr. Dailey! It makes it much clearer to me which method we should choose for testing.
By the way, Don, how much time per game do you think is appropriate for a fixed-time test? I guess the time cannot be too long, unless we want to test our engine in all aspects before a formal match.
We don't use fixed time, we use Fischer time controls.

There is no particular time we consider best. The longer the time control, the more relevant the results, but the tradeoff between game quantity and statistical validity is severe. So it depends a lot on the type of change we are testing. We use 1 minute + 1 second Fischer fairly often for our own final test, and we have testers who test for us at longer time controls.
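
To give a feel for how severe that tradeoff is, here is a rough sketch of the error bar you get from N games, using the usual normal approximation for the match score; draws are ignored in the variance, so the real error bars are somewhat smaller than these:

Code: Select all

# Approximate 95% error bar (in Elo) on a match of n games that scores close
# to 50%. This treats every game as an independent win/loss sample, so a high
# draw rate makes the true error bar smaller than shown here.
import math

def elo(score):
    """Logistic Elo difference implied by a match score in (0, 1)."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def error_bar_elo(n_games, score=0.5):
    """Half-width of an approximate 95% confidence interval on the Elo gap."""
    se = math.sqrt(score * (1.0 - score) / n_games)      # std. error of score
    hi = elo(min(score + 2.0 * se, 0.999))
    lo = elo(max(score - 2.0 * se, 0.001))
    return (hi - lo) / 2.0

for n in (1000, 10000, 100000):
    print(f"{n:>7} games: about +/- {error_bar_elo(n):.1f} Elo")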