Problem suite testing - how to extract a useful number

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Problem suite testing - how to extract a useful number

Post by Don »

I would like to be able to produce a series of simple tests that act as a kind of benchmark of the program. They would be similar to "unit tests," but instead of measuring correctness and finding bugs they would give a rough measurement of performance and alert me to problems if things suddenly changed. Ideally, I want these tests to produce a single number, some kind of measurement of goodness.

One of these simple tests would be performance on one or more problem suites. There are all kinds of problems with getting a single "performance number" from such a test. You can run the test at N seconds per move and count how many you got correct. This does not take into account that if you solve a problem faster, your score should improve. Also, if you miss it by 1/10 of a second you get 0 instead of 1.

Ideally, you want to measure your total speedup over all the problems. This is fine if you can solve ALL the problems: you can average the speedup or slowdown, or you can simply sum the times, for a useful performance number. But if you cannot solve certain problems, it becomes difficult to compare one version to another, especially when each version of your program solves a different subset of the problems, which is fairly common when testing extensions and many other ideas.

I think I've come up with something reasonable. There are two ways to look at this, but they amount to the same thing. The simple way is that you set a fixed time to run each problem in some set, and you sum the total time to run all the problems, whether you solve them or not. The ones that get solved will, of course, reduce the total run time, and any improvement in the times to solve certain problems will be reflected in a reduced time to run the set.

Another way to look at this is to frame the calculation differently. I produce a "score" where a problem solved instantly gets full credit, or 1.0 points. If it takes half the time I allocated, it gets 0.5 points. As long as it solves the problem within the goal time it will get some credit, although it may not be much if it solves it at the last possible moment.

Both of these methods amount to the same calculation; one just produces numbers where lower is better and the other produces numbers where higher is better. I like the second method because I can view the score as being equivalent to the "number of problems solved instantly," even though that's not quite correct. I convert this to a percentage, so that if ALL problems are solved instantly I get a 100% score, and if NONE are solved at all I get zero. It has a certain twisted logic to it, because ideally you want your program to solve every problem instantly, and only that deserves a 100% score, which is technically unattainable. But it's also good because there are no sudden jagged edges: if you barely miss solving a problem because you timed out a fraction of a second before finding the move, the score will not change much from what it would be if you happened to find it just in the nick of time. In both cases your score will be zero or very close to zero for that problem.

The score does change based on how you set your goal time. If you allow a great deal of time to solve the problems, you will get better scores. If you allow infinite time for each problem (assuming your program can solve any problem eventually), you will get a 100% score. So your "goal" is to solve each problem instantly, and the credit for each problem transitions smoothly from 1.0 down to 0.0 over some arbitrarily set time limit.
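To make the arithmetic concrete, here is a minimal sketch of that scoring in Python (the function name and the example numbers are mine, not part of any particular harness): a position solved instantly earns 1.0, one solved right at the goal time earns essentially 0.0, an unsolved position earns 0.0, and the average is reported as a percentage.

Code:

def suite_score(solve_times, goal_time):
    # solve_times: solution time in seconds per position, or None if unsolved
    # goal_time:   fixed per-position time limit in seconds
    total = 0.0
    for t in solve_times:
        if t is None or t >= goal_time:
            credit = 0.0                      # unsolved: no credit
        else:
            credit = 1.0 - t / goal_time      # instant solve = 1.0, linear falloff
        total += credit
    return 100.0 * total / len(solve_times)   # 100% = every position solved instantly

# Example: 30-second goal time, solved in 3 s and 15 s, one position unsolved
# -> (0.9 + 0.5 + 0.0) / 3 = about 46.7%
print(suite_score([3.0, 15.0, None], 30.0))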

I am curious what others are doing in this regard. Over the years we have seen test sets used to predict Elo ratings, and the most common approach I have seen is to run each position for some fixed time and count how many you got correct, but I think this is not a very good way to do it.
Harald Johnsen

Re: Problem suite testing - how to extract a useful number

Post by Harald Johnsen »

You can process your test with infinite time and post-process the resulting log file (if you use an external tool like Arena to run the test). A simple evaluation of the time used for a position can be something like log(1 + time_needed/scale). Then you can add all those numbers and get a global score (a score of 0 is perfect).

Log() is a bit arbitrary; perhaps another function would do it better.
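As a minimal sketch of that post-processing (assuming the solve times have already been pulled out of the log; the scale constant is arbitrary):

Code:

import math

def log_score(solve_times, scale=1.0):
    # solve_times: time to solution in seconds for each position
    #              (run to "infinite" time so every position has one)
    # Returns a global score where 0 is perfect and lower is better.
    return sum(math.log(1.0 + t / scale) for t in solve_times)

# Going from 2 s to 1 s on one position moves this score far more than
# going from 20 s to 19 s, unlike a plain sum of times.
print(log_score([1.0, 19.0]), log_score([2.0, 20.0]))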

HJ.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Problem suite testing - how to extract a useful number

Post by Don »

Harald Johnsen wrote:You can process your test with infinite time and post-process the resulting log file (if you use an external tool like Arena to run the test). A simple evaluation of the time used for a position can be something like log(1 + time_needed/scale). Then you can add all those numbers and get a global score (a score of 0 is perfect).

Log() is a bit arbitrary; perhaps another function would do it better.

HJ.
Yes, another problem with my linear method is that going from 20 to 19 seconds gives you the same improvement as going from 2 seconds to 1 second, when clearly 2 to 1 is a doubling in performance but 20 to 19 is only a small improvement. Your log function is better for this.

There is also the issue of how each problem gets weighted. If you double the speed of your program, using my method and running at infinite time, the results are dominated by just a few of the long-running problems. I would probably prefer to give all problems equal weight. Of course, in this example it wouldn't matter if every problem showed exactly the same speedup, but in general I don't want to run 100 positions and have the results of just 2 or 3 dominate the score, which can happen if you simply sum the times (and my system only minimizes this, it doesn't solve it).
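To illustrate the weighting problem with some invented numbers (this is only a sketch, not an established method): summing raw times rewards a 2x speedup on the two slow positions far more than the same speedup on the other ninety-eight, while an equal-weight average of per-position log-speedups rewards the broad improvement.

Code:

import math

old          = [1.0] * 98 + [300.0, 300.0]   # 98 fast positions, 2 very slow ones
fast_broad   = [0.5] * 98 + [300.0, 300.0]   # 2x faster on the 98 fast positions
fast_on_slow = [1.0] * 98 + [150.0, 150.0]   # 2x faster on the 2 slow positions only

def mean_log_speedup(new, base):
    # average of log(new/old); negative means faster, each position counts equally
    return sum(math.log(n / b) for n, b in zip(new, base)) / len(base)

print(sum(fast_broad), sum(fast_on_slow))     # 649.0 vs 398.0: raw sums favor the slow-position fix
print(mean_log_speedup(fast_broad, old),      # about -0.68
      mean_log_speedup(fast_on_slow, old))    # about -0.014: equal weighting favors the broad speedup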

I don't fully understand what you are suggesting, however. One of the problems I'm trying to address is that I cannot run the whole set to infinite time. I WANT there to be some challenging positions in the suite that future, improved versions of my program can be challenged by, but I also want a quick test I can run from time to time to measure general tactical progress.

So nothing I have done or considered so far solves all of these issues. I'm not sure an ideal solution exists. What I'm doing right now is probably reasonable but has some of the problems I mentioned.

I also don't know what you mean about "post-processing" the log file. I run the test in my own harness, a special script I wrote that reads an EPD file and communicates with the engine using UCI. Why can't the script, or the chess program itself, do what needs to be done in a special test mode? The thing I'm hung up on is why it needs to be "post-processed." I guess my script does this: it does some calculations after the test has completed. Is that your point, that the calculation has to be done after the results are already known?

- Don
Harald Johnsen

Re: Problem suite testing - how to extract a useful number

Post by Harald Johnsen »

You don't need post-processing if you are doing the test inside your engine, of course.
I suggested infinite time because, for a serious test, one cannot stop as soon as the solution is found. There are lots of positions where the correct move can be found by pure luck (but of course with an incorrect score). So it's often better to wait a few more plies before stopping the test; the engine must also play the same move at each ply after it finds it, or else the result is not valid.
So even if you have a test set where the solution should be found in 10 seconds, you must run it for something like 30 seconds or a minute to be sure that the engine does not change its mind.
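A minimal sketch of that kind of stability check (assuming a list of per-iteration results has already been collected; the tuple format and the settle count are my own choices):

Code:

def solution_time(iterations, key_move, settle_plies=3):
    # iterations: list of (depth, best_move, elapsed_seconds), one per completed iteration
    # Returns the time at which key_move was found and then held for settle_plies
    # further iterations, or None if the engine changed its mind or never found it.
    for i, (depth, move, t) in enumerate(iterations):
        if move != key_move:
            continue
        later = [m for _, m, _ in iterations[i + 1:i + 1 + settle_plies]]
        if len(later) >= settle_plies and all(m == key_move for m in later):
            return t
    return None

# Found at depth 9 but abandoned at depth 10, then found again at depth 11
# and held through depth 14 -> counted as solved at 12.0 seconds.
iters = [(8, 'Nf3', 1.0), (9, 'Bxh7+', 3.0), (10, 'Nf3', 6.0),
         (11, 'Bxh7+', 12.0), (12, 'Bxh7+', 20.0), (13, 'Bxh7+', 35.0),
         (14, 'Bxh7+', 60.0)]
print(solution_time(iters, 'Bxh7+'))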

HJ.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Problem suite testing - how to extract a useful number

Post by Don »

Harald Johnsen wrote:You don't need post-processing if you are doing the test inside your engine, of course.
I suggested infinite time because, for a serious test, one cannot stop as soon as the solution is found. There are lots of positions where the correct move can be found by pure luck (but of course with an incorrect score). So it's often better to wait a few more plies before stopping the test; the engine must also play the same move at each ply after it finds it, or else the result is not valid.
So even if you have a test set where the solution should be found in 10 seconds, you must run it for something like 30 seconds or a minute to be sure that the engine does not change its mind.

HJ.
There is no general solution to the problem of changing its mind; that can happen no matter how long you wait.

For my purposes this test is not intended to measure everything good and bad about a chess program; it's just a rough measurement of one aspect of the program, and I use it more for documentation purposes and for tracking anomalies and general tactical improvement, which is not to be taken too seriously (I really don't believe tactical performance can tell you with much accuracy how good a program is). So it's good enough for me to choose problems that have relatively unambiguous solutions, ones that normally wouldn't be played unless the tactics involved are understood. It's not possible to completely attain that goal, but in practice you can get close. The sets I've used in the past generally have key moves that the program will not play until the tactical theme is understood, and once that happens the program will not change its mind (unless there are bugs, of course). Programs with really heavy positional scoring can sometimes flounder a bit, finding the move and then changing their mind, but in general you can avoid most of that if you choose your problems carefully.

One thing I have always done is to not consider the position solved until the iteration completes. So I take the iteration time as the solution time. This gives more consistent results, in my opinion, and is a little extra sanity check against the engine changing its mind right away.
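In script form, that convention might look something like the sketch below (the UCI parsing is simplified: it treats each "info ... pv" line as a completed iteration, and the regular expression and example output are mine, not the actual harness):

Code:

import re

INFO_RE = re.compile(r'\bdepth (\d+)\b.*\btime (\d+)\b.*\bpv (\S+)')

def parse_iterations(uci_lines):
    # Extract (depth, seconds, first PV move) from the engine's UCI "info" lines.
    iters = []
    for line in uci_lines:
        m = INFO_RE.search(line)
        if m:
            iters.append((int(m.group(1)), int(m.group(2)) / 1000.0, m.group(3)))
    return iters

def iteration_solution_time(iters, key_move):
    # Time of the first completed iteration that gives key_move, counted only
    # if the engine never switches away from it in a later iteration.
    first = None
    for depth, t, move in iters:
        if move == key_move and first is None:
            first = t
        elif move != key_move:
            first = None       # changed its mind: the position is not solved yet
    return first

lines = ["info depth 8 score cp 35 time 900 pv Nf3 d5",
         "info depth 9 score cp 140 time 2400 pv Bxh7 Kxh7",
         "info depth 10 score cp 210 time 5800 pv Bxh7 Kxh7"]
print(iteration_solution_time(parse_iterations(lines), "Bxh7"))   # 2.4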

I may change my formula a bit based on your feedback, using the log of the times or something like that, so that a 1-second improvement means a lot more to a 1-second search than to a 100-second search.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Problem suite testing - how to extract a useful number

Post by bob »

Don, I have tried lots of ways to measure improvements, but I have never found anything that worked. Except for relying on playing _lots_ of games against a pool of opponents...
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Problem suite testing - how to extract a useful number

Post by Don »

bob wrote:Don, I have tried lots of ways to measure improvements, but I have never found anything that worked. Except for relying on playing _lots_ of games against a pool of opponents...
I do not use problem sets to tune or improve the play of the program.

That's not the issue here. I am building a suite of benchmarking tests of various kinds just to document the characteristics of the program over time as I improve it.

So this serves a similar purpose to "unit tests" except that you don't pass or fail. Before I commit a "release" version I will run a suite of various kinds of tests to document the state of the program, but I won't necessarily try to "interpret" the results, they will simply stand on their own.

It will be most useful if the score suddenly drops in a big way, for instance. Then I am alerted to a possible problem and must either justify it or fix it.
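For what it's worth, a minimal sketch of the "alert on a big drop" idea (the file name, format, and threshold are all invented for illustration):

Code:

import json

def record_suite_score(history_file, version, score, drop_threshold=5.0):
    # Append (version, score-in-percent) to a JSON history file and warn
    # if the newest score dropped sharply relative to the previous entry.
    try:
        with open(history_file) as f:
            history = json.load(f)
    except FileNotFoundError:
        history = []
    if history and history[-1][1] - score > drop_threshold:
        print("WARNING: score fell from %.1f%% (%s) to %.1f%% (%s)"
              % (history[-1][1], history[-1][0], score, version))
    history.append([version, score])
    with open(history_file, "w") as f:
        json.dump(history, f, indent=2)

record_suite_score("suite_scores.json", "some-new-version", 41.2)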
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Problem suite testing - how to extract a useful number

Post by michiguel »

Don wrote:I would like to be able to produce a series of simple tests that act as a kind of benchmark of the program. They would be similar to "unit tests," but instead of measuring correctness and finding bugs they would give a rough measurement of performance and alert me to problems if things suddenly changed. Ideally, I want these tests to produce a single number, some kind of measurement of goodness.

One of these simple tests would be performance on one or more problem suites. There are all kinds of problems with getting a single "performance number" from such a test. You can run the test at N seconds per move and count how many you got correct. This does not take into account that if you solve a problem faster, your score should improve. Also, if you miss it by 1/10 of a second you get 0 instead of 1.

Ideally, you want to measure your total speedup over all the problems. This is fine if you can solve ALL the problems: you can average the speedup or slowdown, or you can simply sum the times, for a useful performance number. But if you cannot solve certain problems, it becomes difficult to compare one version to another, especially when each version of your program solves a different subset of the problems, which is fairly common when testing extensions and many other ideas.

I think I've come up with something reasonable. There are two ways to look at this, but they amount to the same thing. The simple way is that you set a fixed time to run each problem in some set, and you sum the total time to run all the problems, whether you solve them or not. The ones that get solved will, of course, reduce the total run time, and any improvement in the times to solve certain problems will be reflected in a reduced time to run the set.

Another way to look at this is to frame the calculation differently. I produce a "score" where a problem solved instantly gets full credit, or 1.0 points. If it takes half the time I allocated, it gets 0.5 points. As long as it solves the problem within the goal time it will get some credit, although it may not be much if it solves it at the last possible moment.

Both of these methods amount to the same calculation; one just produces numbers where lower is better and the other produces numbers where higher is better. I like the second method because I can view the score as being equivalent to the "number of problems solved instantly," even though that's not quite correct. I convert this to a percentage, so that if ALL problems are solved instantly I get a 100% score, and if NONE are solved at all I get zero. It has a certain twisted logic to it, because ideally you want your program to solve every problem instantly, and only that deserves a 100% score, which is technically unattainable. But it's also good because there are no sudden jagged edges: if you barely miss solving a problem because you timed out a fraction of a second before finding the move, the score will not change much from what it would be if you happened to find it just in the nick of time. In both cases your score will be zero or very close to zero for that problem.

The score does change based on how you set your goal time. If you allow a great deal of time to solve the problems, you will get better scores. If you allow infinite time for each problem (assuming your program can solve any problem eventually), you will get a 100% score. So your "goal" is to solve each problem instantly, and the credit for each problem transitions smoothly from 1.0 down to 0.0 over some arbitrarily set time limit.

I am curious what others are doing in this regard. Over the years we have seen test sets used to predict Elo ratings, and the most common approach I have seen is to run each position for some fixed time and count how many you got correct, but I think this is not a very good way to do it.
This is what I started to do with nodes rather than time to solution (but the concept should be the same if all the positions are solved):

Let's say we have two versions of a program, versions A and B.
For each individual position, I measure the speedup tB/tA (which is the time to solution for B divided by the time to solution for A). If B is better than A, most of the numbers will be smaller than 1, but in fact we obtain a distribution. This distribution is not Gaussian; it is skewed. It looks more Gaussian if we take the log of the speedup (which actually makes sense). So log(tB/tA) distributes as a bell curve, and then we can apply regular statistical analysis. Averaging log(tB/tA) gives us a score that is better the more negative it is. What I do is take the sum of -log(tB/tA) and divide by n.
Because of the properties of the log, we could instead just take the sum of -log(tB) and get a number where higher is better. This is relatively simple.
Of course, we need a set of positions that the programs solve in a reasonable time, but that is not a big issue.
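A minimal sketch of that calculation (variable names are mine), assuming both versions solve every position:

Code:

import math

def mean_neg_log_speedup(times_a, times_b):
    # Average of -log(tB/tA) over all positions: positive means B is faster,
    # and every position carries equal weight regardless of how long it takes.
    ratios = [math.log(tb / ta) for ta, tb in zip(times_a, times_b)]
    return -sum(ratios) / len(ratios)

# B is twice as fast on every position -> score = log(2), about 0.69
ta = [2.0, 8.0, 40.0]
tb = [1.0, 4.0, 20.0]
print(mean_neg_log_speedup(ta, tb))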

I have another way to analyze this that I have not tried yet, but I will continue later.

Miguel
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Problem suite testing - how to extract a useful number

Post by bob »

Don wrote:
bob wrote:Don, I have tried lots of ways to measure improvements, but I have never found anything that worked. Except for relying on playing _lots_ of games against a pool of opponents...
I do not use problem sets to tune or improve the play of the program.

That's not the issue here. I am building a suite of benchmarking tests of various kinds just to document the characteristics of the program over time as I improve it.

So this serves a similar purpose to "unit tests" except that you don't pass or fail. Before I commit a "release" version I will run a suite of various kinds of tests to document the state of the program, but I won't necessarily try to "interpret" the results, they will simply stand on their own.

It will be most useful if the score suddenly drops in a big way, for instance. Then I am alerted to a possible problem and must either justify it or fix it.
I always have a few sanity tests (WAC at 1 sec/move, etc.). But I also use our cluster to "sanity-test" against a suite of opponents, so that if the score suddenly drops against one or more of them, it requires analysis...

I'm more interested in answering the question "is A' (new version) better than A (previous version)?" which is a tough one to answer with any accuracy...
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Problem suite testing - how to extract a useful number

Post by Don »

bob wrote:
Don wrote:
bob wrote:Don, I have tried lots of ways to measure improvements, but I have never found anything that worked. Except for relying on playing _lots_ of games against a pool of opponents...
I do not use problem sets to tune or improve the play of the program.

That's not the issue here. I am building a suite of benchmarking tests of various kinds just to document the characteristics of the program over time as I improve it.

So this serves a similar purpose to "unit tests" except that you don't pass or fail. Before I commit a "release" version I will run a suite of various kinds of tests to document the state of the program, but I won't necessarily try to "interpret" the results, they will simply stand on their own.

It will be most useful if the score suddenly drops in a big way, for instance. Then I am alerted to a possible problem and must either justify it or fix it.
I always have a few sanity tests (WAC at 1 sec/move, etc.). But I also use our cluster to "sanity-test" against a suite of opponents, so that if the score suddenly drops against one or more of them, it requires analysis...

I'm more interested in answering the question "is A' (new version) better than A (previous version)?" which is a tough one to answer with any accuracy...
You can do this if you are willing to test at fast levels, which amounts to about 1 second per game on average. The very best programs can probably approach master strength at this time control, so it's not that ridiculous. Overnight, you can get tens of thousands of games in at this time control and come within approximately 5 Elo, or something like that, with high confidence. If you have several CPUs lying around, you can get results relatively quickly. If someone were to pay you a million dollars to build a strong PC program, the first thing you would want to do is go out and buy a bunch of quad cores just for testing.

Of course, I realize this is only an approximation, because you normally don't play at 1 second per game! But it might work especially well for things like evaluation improvements.
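For reference, a rough sketch of the error-bar arithmetic behind a claim like "within about 5 Elo" (standard logistic Elo model and a normal approximation, ignoring draw-model subtleties and opponent-pool effects; the example numbers are invented):

Code:

import math

def elo_and_margin(wins, losses, draws):
    # Approximate Elo difference and 95% confidence half-width from a match result.
    n = wins + losses + draws
    score = (wins + 0.5 * draws) / n
    elo = -400.0 * math.log10(1.0 / score - 1.0)
    # variance of the per-game score, then standard error of the mean score
    var = (wins * (1 - score) ** 2 + draws * (0.5 - score) ** 2 + losses * score ** 2) / n
    se = math.sqrt(var / n)
    # convert score error to Elo via the slope of the logistic curve at this score
    slope = 400.0 / (math.log(10.0) * score * (1.0 - score))
    return elo, 1.96 * se * slope

# Example: 30,000 one-second games scoring 52% with about a third draws
# -> roughly +14 Elo, give or take about 3.5 Elo
print(elo_and_margin(wins=12100, losses=10900, draws=7000))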

- Don