Quick Performance Test

Henk · Post by **Henk** » Thu May 30, 2013 10:39 am

Developing a program means generate and test.
If generating is fast testing should be fast too. Is there a quick way to test whether your chess program has improved ?

Search depth and number of nodes does not say much.
Playing games takes much time.
Setting up a database with chess positions is a lot of developing effort.

So what to do ?

Evert · Post by **Evert** » Thu May 30, 2013 11:11 am

Henk wrote:Developing a program means generate and test.
If generating is fast testing should be fast too.

That doesn't follow. At all.

Is there a quick way to test whether your chess program has improved ?

Depends on the computational resources you have available. Doing a distributed SPRT test on a cluster is probably the fastest way to do it.
You still need many games though.

Search depth and number of nodes does not say much.

Depends on what you're testing. For some changes time-to-depth is a perfectly fine way to test. For other changes, not so much.

Playing games takes much time.

Can't be helped, unfortunately.

Setting up a database with chess positions is a lot of developing effort.

Other people have already done so though, so just make use of what's available (the STS suites are good, the Arasan suite is also useful).

Henk · Post by **Henk** » Thu May 30, 2013 12:29 pm

With " if generating is fast testing should be fast". I mean if you don't want a testing bottleneck in your development chain testing should be fast too.

Evert · Post by **Evert** » Thu May 30, 2013 1:00 pm

Henk wrote:I mean if you don't want a testing bottleneck in your development chain testing should be fast too.

Unfortunately testing is the bottle neck. If I spent an hour on engine development I can implement half-a dozen (say) evaluation features. Tuning each of those will take a day or so, if not longer.

Getting access to more computing power is the only way to speed that up.

Don · Post by **Don** » Mon Jun 03, 2013 7:26 pm

Henk wrote:Developing a program means generate and test.
If generating is fast testing should be fast too. Is there a quick way to test whether your chess program has improved ?

Search depth and number of nodes does not say much.
Playing games takes much time.
Setting up a database with chess positions is a lot of developing effort.

So what to do ?

As Evert said, testing is the bottleneck and there is no fast way to test. I think over the decades people have tried problem sets and other things but they are way too crude.

The state of the art these days is to run thousands of games at a very fast time control. Right now I am running a test at a time control of 3s+0.3 fischer and getting about 36 games per minute using a 6 core i7 and utilizing all 6 cores. After about 2500 games your error margin is around +/- 10 ELO. To make a long story short, 2500 games is not enough to prove a change is helping your program unless the change is worth well over 10 ELO. We typically run 60,000 games to test a change that might only be worth a couple of ELO.

Larry has a 16 core machine and this test could be sped up by a factor of almost 3x. We could also test even faster than this. We get an average depth of almost 10.6 at this time control on this hardware.

On Larry's machine I am running a 1 minute + 1 second test of something else and getting a little over 3.5 games per minute. It would take over a week of constant testing to get my 50,000 games so by necessity we run less games when we test that deeply.

AlvaroBegue · Post by **AlvaroBegue** » Mon Jun 03, 2013 8:01 pm

You can look at the score returned by a shallow search as a predictor of the result of the game. If you make a database of positions, play them to the end a bunch of times using a strong program and store the winning rates in a database, you can later use it to measure how accurate that predictor is. This should be much faster than playing gazillions of games to test your changes, but it requires some massive initial effort in generating the database of positions with winning rates.

The best feature of this approach is that it allows automated tuning of evaluation parameters.

I would still consider playing many games as the gold standard for accepting a change to the program, but measuring evaluation accuracy has its place, and you are looking for alternatives.

This paper is relevant: http://www.ratio.huji.ac.il/node/2362 ("Automatic Learning of Evaluation, with Applications to Computer Chess", by Amir Ban)

Henk · Post by **Henk** » Mon Jun 03, 2013 8:33 pm

Maybe it's better to go for the pounds (50 ELO) and not for the pennies (10 ELO) if testing is so expensive.

Of course if you cannot find the pounds then only pennies remain. What to do then if you want create the best playing chess program but you are 200 ELO behind. [I'm obviously not talking about my terrible chess program]

Resign ? Or create a super computer ? Or spending more time in trying to find/invent the (50 ELO optimizations). I Like the last possibility. Tried to do things with neural networks a few years ago, predicting stock quotes. [Was a bad idea]But you can wait for infinity until a global maximum is found. There are many algorithms for finding local maxima. But global maxima: I know only simulated annealing and branch and bound. Well if testing is the bottleneck I think I certainly can forget neural networks.

Don · Post by **Don** » Mon Jun 03, 2013 8:39 pm

Henk wrote:Maybe it's better to go for the pounds (50 ELO) and not for the pennies (10 ELO) if testing is so expensive.

Of course if you cannot find the pounds then only pennies remain. What to do then if you want create the best playing chess program but you are 200 ELO behind. [I'm obviously not talking about my terrible chess program]

The answer is good old fashioned hard work. To produce a top program you need several things, testing is probably low on the list but still important.

You could probably go far with just a quad and very fast testing. It's always better to have more but due to the law of diminishing returns a quad would be a good start.

Resign ? Or create a super computer ? Or spending more time in trying to find/invent the (50 ELO optimizations). I Like the last possibility. Tried to do things with neural networks a few years ago, predicting stock quotes. [Was a bad idea]But you can wait for infinity until a global maximum is found. There are many algorithms for finding local maxima. But global maxima: I know only simulated annealing and branch and bound. Well if testing is the bottleneck I think I certainly can forget neural networks.

Henk · Post by **Henk** » Mon Jun 03, 2013 9:24 pm

If tuning is done by a tuner (algorithm) you can also try to find a better tuning algorithm. Or try to constrain the parameters to be tuned to a smaller domain. I guess not all combinations are allowed.

Don · Post by **Don** » Mon Jun 03, 2013 9:34 pm

Henk wrote:If tuning is done by a tuner (algorithm) you can also try to find a better tuning algorithm. Or try to constrain the parameters to be tuned to a smaller domain. I guess not all combinations are allowed.

There has been progress, but nobody has come up with anything better than just playing games for testing a single change. One optimization algorithm that has some merit is CLOP, but guess what? It's based on playing thousands of games.

Consider this. To measure a small ELO change, you much play thousands of games. This is using the most direct measure possible, playing actual games.

Now is it reasonable to expect that you could measure this just as accurately using some indirect methods that requires far less effort?

Of course it isn't. If you want to explore this further, try to figure out how to get more out of your testing procedure. CLOP is one such way and HG proposed another methods called orthogonal multi-tuning. Both of these are based on playing games but try to squeeze more information out of those games.

Quick Performance Test

Quick Performance Test

Re: Quick Performance Test

Re: Quick Performance Test

Re: Quick Performance Test

Re: Quick Performance Test

Re: Quick Performance Test

Re: Quick Performance Test

Re: Quick Performance Test

Re: Quick Performance Test

Re: Quick Performance Test