A poor man's testing environment

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Rebel
Posts: 6997
Joined: Thu Aug 18, 2011 12:04 pm
Full name: Ed Schröder

Re: A poor man's testing environment

Post by Rebel »

Don wrote: But you can also do tuning with a technique that H.G. posted on this forum a few years ago which he called orthogonal multi-testing. It's brilliant.
Looked for it:
HGM wrote:A more efficient way of testing would prepare all possible combinations of the four changes enabled or disabled. That is 2^4 = 16 engines. You let each engine play the same gauntlet of 400 games. That is only 6400 games in total. Now for any of the four changes there are 8 engines that had the change, and 8 that had not. So you have played 3200 games with the change enabled, and 3200 games with that change disabled. The statistical error in each of these 3200-game totalled results is sqrt(0.5)% (=40%/sqrt(3200)), the statistical error in the difference is 1%, as desired. The fact that not all of the 3200 games were obtained with the same setting of the other changes does not affect the outcome (if the changes are independent and lead to additive improvement), as the reference had exactly the same mix of settings of the other changes. This is similar to testing against a number of different opponents to get more independent games, you can just as well play with a few different versions of the engine-under-test to create more diversity.

With this method you test 4 changes to the same accuracy with the same number of games as you would have tested a single change! In other words, you tested 3 changes for free! And it does not stop there: you could just as easily have evaluated 6 changes to 1% accuracy in those 6400 games, by making 64 versions of your engine and have each play 100-game gauntlets.

This really is a big time saver!
Makes more sense than CLOP. In an evolved program most of the changes you try are below the +1% expectation anyway.
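
Below is a minimal sketch of how the bookkeeping for such an orthogonal multi-test could look; the play_gauntlet() helper is hypothetical and stands for whatever runs the 400-game gauntlet for one engine version and returns its score in points.

Code: Select all

# Minimal sketch of orthogonal multi-testing bookkeeping.
# play_gauntlet() is a hypothetical helper that plays the 400-game
# gauntlet for one engine version and returns the points it scored.
from itertools import product

GAMES_PER_ENGINE = 400

# One engine version per combination of the four changes (on/off).
results = {}
for combo in product((0, 1), repeat=4):          # 2^4 = 16 versions
    results[combo] = play_gauntlet(combo)

# For each change, pool the 8 versions that had it enabled against
# the 8 that had it disabled (3200 games on each side).
for i, name in enumerate("ABCD"):
    on = sum(pts for combo, pts in results.items() if combo[i] == 1)
    off = sum(pts for combo, pts in results.items() if combo[i] == 0)
    games = 8 * GAMES_PER_ENGINE
    diff = 100.0 * (on - off) / games            # difference in score percentage
    print(f"change {name}: {diff:+.2f}%  (error bar roughly +/- 1%)")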
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: A poor man's testing environment

Post by Michel »

Makes more sense than CLOP. In an evolved program most of the changes you try are below the +1% expectation anyway.
Well, it is more or less what CLOP does if you turn all correlations off...

With correlations on, CLOP will additionally compute axes such that parameter changes along those axes are independent.
Rebel
Posts: 6997
Joined: Thu Aug 18, 2011 12:04 pm
Full name: Ed Schröder

Re: A poor man's testing environment

Post by Rebel »

Kempelen wrote:
Rebel wrote:I like to present a page for starters how to test a chess engine with limited hardware. I am interested in some feedback for further improvement.

http://www.top-5000.nl/tuning.htm
Hi Ed,

Very valuable and nice insights on your page. Thank you very much.

Reading it without going into deep thought, I came up with a few questions for you:
Hi Fermin,
a) At the end of the page you provide a set of PGNs that together add up to 6060 different positions. I suppose you don't play all positions against all opponents with both sides, because that would be, assuming 6 opponents, 72,720 games, which is a lot and which I suspect you don't play. So, do you choose them randomly, or what exactly are you doing?
95% of the time I use the 8 x 500 sets. The other (endgame-related) sets are meant for specific endgame changes. For instance, when you make a change in your (say) rook-ending code, the effect will hardly be measurable playing from middle-game positions; 95% (or so) of the result is just noise. So I will use the ROOK PGNs as a first taste. If the results are good, a normal match should confirm them.
b) You mention the disadvantage of playing eng-eng matches, namely that you lose "control" over the style of the program. This is one aspect of my engine I have always missed and would like to improve. One way is to look at games "manually, one by one" and tune by hand, but do you think there is a mixed method, or an automatic one, that could also take eng-eng results into account to improve results and style at the same time? I ask this because I suspect hand tuning alone is not enough to make a reasonably strong engine, even if it has a fantastic style.
That's an excellent question. There is this myth: the playing style of a chess program reflects the playing style of the programmer. A myth with probably some truth in it. If you as a human chess player like playing king attacks (who doesn't?) **and** with a little gambling, you will tend to program your evaluation that way, with extensive code and high values. Personally I like the balanced Karpov playing style and I feel my engine plays that way. Of the programs I know from the past, Chess System Tal was a reckless king attacker, resulting in fascinating games with high peaks but also deep falls. Another example is "The King"; its specialty: sound R vs BP (or R vs NP) sacrifices.

As to your second question, I have an example to offer. Before I got interested in chess programming again last year and joined the circus of playing thousands of games, my King Safety base parameter had been set to 150%, ever since the late 90's. I like my King Safety this way because it has little trouble with manoeuvres like 1.Rf3 2.Rh3 3.Qh5, all very natural. However, when tuning this parameter last year, a value of 75 performed clearly better in eng-eng matches; that's a division by 2 (!!). Manoeuvres like 1.Rf3 2.Rh3 3.Qh5 now need much more depth to find, and speculative (but in 90% of the cases sound) sacrifices are almost gone. This clearly is a loss in attractiveness.
Rebel
Posts: 6997
Joined: Thu Aug 18, 2011 12:04 pm
Full name: Ed Schröder

Re: A poor man's testing environment

Post by Rebel »

ilari wrote:
Rebel wrote: Tried "-concurrency 4"

Code: Select all

 c:\cc\cutechess-cli -engine name=MAIN1 cmd=yourengine.exe dir=C:\a\main1 proto=uci -engine name=WORK1 cmd=yourengine.exe dir=c:\a\work1 proto=uci -each tc=inf -draw 160 100 -resign 5 500 -rounds 1000 -repeat -pgnout c:\cc\all.pgn -pgnin c:\cc\1.pgn -pgndepth 20 -concurrency 4
What happens is that indeed 4 threads are started but... processor activity is only 25%, playing only 1 game. Aborting the match kills only one thread, leaving the other 6 executables idle in the task manager, and I must remove them manually. Totally unusable for me.

Perhaps a WB2UCI problem?

Anyway, it's not very encouraging to modify my starter-page with a "-concurrency" option advice.
The main question is: why are you using WB2UCI? Does cutechess-cli's Xboard/Winboard support lack features that WB2UCI provides?

Some past versions of cutechess-cli had bugs in the concurrency implementation but the latest version should work fine. I'll have to try WB2UCI myself to find out if the problem is there.
Tried Xboard; it behaves the same way: only 1 core active playing 1 game and the rest idle in the task manager.

I am using cutechess-cli (66.560 bytes) and the README states:

Code: Select all

CUTECHESS-CLI(6)
================
hgm
Posts: 27811
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: A poor man's testing environment

Post by hgm »

Strange, because XBoard should kill any engine process that does not terminate in response to 'quit' by sending it SIGKILL. Which in Linux has never let me down so far.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: A poor man's testing environment

Post by Don »

Rebel wrote:
Don wrote: But you can also do tuning with a technique that H.G. posted on this forum a few years ago which he called orthogonal multi-testing. It's brilliant.
Looked for it:
HGM wrote:A more efficient way of testing would prepare all possible combinations of the four changes enabled or disabled. That is 2^4 = 16 engines. You let each engine play the same gauntlet of 400 games. That is only 6400 games in total. Now for any of the four changes there are 8 engines that had the change, and 8 that had not. So you have played 3200 games with the change enabled, and 3200 games with that change disabled. The statistical error in each of these 3200-game totalled results is sqrt(0.5)% (=40%/sqrt(3200)), the statistical error in the difference is 1%, as desired. The fact that not all of the 3200 games were obtained with the same setting of the other changes does not affect the outcome (if the changes are independent and lead to additive improvement), as the reference had exactly the same mix of settings of the other changes. This is similar to testing against a number of different opponents to get more independent games, you can just as well play with a few different versions of the engine-under-test to create more diversity.

With this method you test 4 changes to the same accuracy with the same number of games as you would have tested a single change! In other words, you tested 3 changes for free! And it does not stop there: you could just as easily have evaluated 6 changes to 1% accuracy in those 6400 games, by making 64 versions of your engine and have each play 100-game gauntlets.

This really is a big time saver!
Makes more sense than CLOP. In an evolved program most of the changes you try are below the +1% expectation anyway.
We have used the same principle to fine-tune Komodo. We played about 200,000 games with minor modifications to all the weights. For example, if the original weight was 20, we try 18, 19, 20, 21 or 22 for this one single evaluation term. Each term is treated like this, and the weights are selected randomly subject to this constraint. If you have 200 evaluation terms in your program, you are testing variations of all of them simultaneously.

I think CLOP actually does something similar in the sense that it can test a huge number of terms simultaneously, but it does not test a fixed set of candidate values; instead it tries to zero in on the correct weight.
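
For what it's worth, a sketch of what such a randomized scheme could look like in code - not Komodo's actual tuner; the term names, the play_one_game()/make_engine() helpers and the final read-out of the best offset per term are assumptions for illustration only.

Code: Select all

# Sketch of randomized simultaneous tuning (illustration, not Komodo's tuner).
import random

base = {"knight": 325, "bishop": 335, "bishop_pair": 50}   # in practice ~200 terms

def random_variant():
    # every weight is drawn from {w-2, ..., w+2}, so all terms vary at once
    return {t: w + random.randint(-2, 2) for t, w in base.items()}

# per term and per offset, accumulate [points scored, games played]
tally = {t: {d: [0.0, 0] for d in range(-2, 3)} for t in base}

for _ in range(200_000):
    weights = random_variant()
    score = play_one_game(make_engine(weights))   # hypothetical helpers: 1, 0.5 or 0
    for t, w in weights.items():
        d = w - base[t]
        tally[t][d][0] += score
        tally[t][d][1] += 1

# one possible read-out: pick, for each term, the offset with the best average score
for t, offsets in tally.items():
    best = max(offsets, key=lambda d: offsets[d][0] / max(1, offsets[d][1]))
    print(t, "->", base[t] + best)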
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
Rebel
Posts: 6997
Joined: Thu Aug 18, 2011 12:04 pm
Full name: Ed Schröder

Re: A poor man's testing environment

Post by Rebel »

Don wrote:
Rebel wrote:
Don wrote: But you can also do tuning with a technique that H.G. posted on this forum a few years ago which he called orthogonal multi-testing. It's brilliant.
Looked for it:
HGM wrote:A more efficient way of testing would prepare all possible combinations of the four changes enabled or disabled. That is 2^4 = 16 engines. You let each engine play the same gauntlet of 400 games. That is only 6400 games in total. Now for any of the four changes there are 8 engines that had the change, and 8 that had not. So you have played 3200 games with the change enabled, and 3200 games with that change disabled. The statistical error in each of these 3200-game totalled results is sqrt(0.5)% (=40%/sqrt(3200)), the statistical error in the difference is 1%, as desired. The fact that not all of the 3200 games were obtained with the same setting of the other changes does not affect the outcome (if the changes are independent and lead to additive improvement), as the reference had exactly the same mix of settings of the other changes. This is similar to testing against a number of different opponents to get more independent games, you can just as well play with a few different versions of the engine-under-test to create more diversity.

With this method you test 4 changes to the same accuracy with the same number of games as you would have tested a single change! In other words, you tested 3 changes for free! And it does not stop there: you could just as easily have evaluated 6 changes to 1% accuracy in those 6400 games, by making 64 versions of your engine and have each play 100-game gauntlets.

This really is a big time saver!
Makes more sense than CLOP. In an evolved program most of the changes you try are below the +1% expectation anyway.
We have used the same principle to fine-tune Komodo. We played about 200,000 games with minor modifications to all the weights. For example, if the original weight was 20, we try 18, 19, 20, 21 or 22 for this one single evaluation term. Each term is treated like this, and the weights are selected randomly subject to this constraint. If you have 200 evaluation terms in your program, you are testing variations of all of them simultaneously.

I think CLOP actually does something similar in the sense that it can test a huge number of terms simultaneously, but it does not test a fixed set of candidate values; instead it tries to zero in on the correct weight.
Actually I am a fervent upholder of testing one term at a time, although my experience with multiple terms is not so bad. What I currently do is combine those minor one-term improvements in the range of 50.2-50.5% when I have 4 or 5 of them and see if together they can produce a more convincing percentage. I am still learning, this stuff is still relatively new to me, but fascinating it is.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: A poor man's testing environment

Post by Don »

Rebel wrote:
Don wrote:
Rebel wrote:
Don wrote: But you can also do tuning with a technique that H.G. posted on this forum a few years ago which he called orthogonal multi-testing. It's brilliant.
Looked for it:
HGM wrote:A more efficient way of testing would prepare all possible combinations of the four changes enabled or disabled. That is 2^4 = 16 engines. You let each engine play the same gauntlet of 400 games. That is only 6400 games in total. Now for any of the four changes there are 8 engines that had the change, and 8 that had not. So you have played 3200 games with the change enabled, and 3200 games with that change disabled. The statistical error in each of these 3200-game totalled results is sqrt(0.5)% (=40%/sqrt(3200)), the statistical error in the difference is 1%, as desired. The fact that not all of the 3200 games were obtained with the same setting of the other changes does not affect the outcome (if the changes are independent and lead to additive improvement), as the reference had exactly the same mix of settings of the other changes. This is similar to testing against a number of different opponents to get more independent games, you can just as well play with a few different versions of the engine-under-test to create more diversity.

With this method you test 4 changes to the same accuracy with the same number of games as you would have tested a single change! In other words, you tested 3 changes for free! And it does not stop there: you could just as easily have evaluated 6 changes to 1% accuracy in those 6400 games, by making 64 versions of your engine and have each play 100-game gauntlets.

This really is a big time saver!
Makes more sense than CLOP. In an evolved program most of the changes you try are below the +1% expectation anyway.
We have used the same principle to fine-tune Komodo. We played about 200,000 games with minor modifications to all the weights. For example, if the original weight was 20, we try 18, 19, 20, 21 or 22 for this one single evaluation term. Each term is treated like this, and the weights are selected randomly subject to this constraint. If you have 200 evaluation terms in your program, you are testing variations of all of them simultaneously.

I think CLOP actually does something similar in the sense that it can test a huge number of terms simultaneously, but it does not test a fixed set of candidate values; instead it tries to zero in on the correct weight.
Actually I am a fervent upholder of testing one term at a time, although my experience with multiple terms is not so bad. What I currently do is combine those minor one-term improvements in the range of 50.2-50.5% when I have 4 or 5 of them and see if together they can produce a more convincing percentage. I am still learning, this stuff is still relatively new to me, but fascinating it is.
There is a certain assumption with this kind of tuning that the terms are independent, but if you test one term at a time you are making the same assumption anyway, so there is nothing to lose.

A good example for illustration is the bishop, knight and bishop-pair weights your program uses. Most terms have some correlation (all the piece terms do) but these 3 terms are especially correlated. How do you go about tuning those 3 terms? Suppose that both the bishop and knight are too low. It may be that raising either term is a gain because it at least raises the average value of the minor pieces - even if raising one hurts the important bishop vs knight difference. In such cases you end up finding some local hill (local optimum) where you cannot improve the bishop and knight values unless you change BOTH. And even if you finally get the minor pieces right, you have to start all over when you turn to the bishop pair! Oh crap! If you raise the bishop pair you implicitly increase the value of the bishop, so now you have to lower that a little!

I think that tuning several weights simultaneously is MORE robust than tuning one at a time. In the bishop/knight/pair example there are still dependencies with the other pieces but if you tune them simultaneously the winning weights will tend to be more robust because you must find values that work well with a variety of values of other terms. I don't have a formal proof but I suspect that correlated terms when tuned together will come out better than if they were tuned separately.

I believe the problem with correlated terms is probably fairly minor, but Larry and I have often wondered if Komodo is stuck in some local optimum that would require major changes to the entire set of evaluation terms. I think this is likely, but the local optimum where we are at least has Larry's own guidance (and perhaps his biases). We also always give the benefit of the doubt to theory - if we can go one way or the other, we try to optimize Komodo to score the position like a human (or like theory if this is an opening position). In fact, many times Larry has noticed that chess programs (and Komodo specifically) might score some opening much higher or lower than theory suggests, and we will spend a significant amount of time trying to understand why and determining what corrections to make - which sometimes involves inventing more evaluation features.
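
A toy illustration of that local-hill effect, using an invented strength() function (not a real evaluation) that rewards keeping the knight and bishop values close together while also rewarding a higher average minor-piece value: every one-at-a-time step loses, so the search stays put, while perturbing both terms at once climbs out.

Code: Select all

# Toy illustration of correlated terms trapping one-at-a-time tuning.
import random

def strength(knight, bishop):
    # invented fitness: penalize a large knight/bishop gap, reward raising both
    return (knight + bishop) - 3 * abs(bishop - knight)

knight, bishop = 300, 300                        # deliberately low starting point

# one-at-a-time tuning: try +/-10 on a single term, keep it only if it helps
for _ in range(100):
    for term in ("knight", "bishop"):
        for delta in (+10, -10):
            trial_k = knight + (delta if term == "knight" else 0)
            trial_b = bishop + (delta if term == "bishop" else 0)
            if strength(trial_k, trial_b) > strength(knight, bishop):
                knight, bishop = trial_k, trial_b
print("one-at-a-time result:", knight, bishop)   # stays at 300/300: every single step loses

# joint tuning: perturb both terms at once, keep the pair if it helps
knight, bishop = 300, 300
for _ in range(100):
    trial_k = knight + random.choice((-10, 0, 10))
    trial_b = bishop + random.choice((-10, 0, 10))
    if strength(trial_k, trial_b) > strength(knight, bishop):
        knight, bishop = trial_k, trial_b
print("joint result:", knight, bishop)           # climbs: both values rise together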
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
Tom Likens
Posts: 303
Joined: Sat Apr 28, 2012 6:18 pm
Location: Austin, TX

Re: A poor man's testing environment

Post by Tom Likens »

Don wrote:I think that tuning several weights simultaneously is MORE robust than tuning one at a time. In the bishop/knight/pair example there are still dependencies with the other pieces but if you tune them simultaneously the winning weights will tend to be more robust because you must find values that work well with a variety of values of other terms. I don't have a formal proof but I suspect that correlated terms when tuned together will come out better than if they were tuned separately.
Don,

I tend to agree with this, but I think a lot of people try to use CLOP to tune 4, 5, 6+ terms simultaneously, without running enough games to let things really converge. If you're tuning one parameter you might be able to get away with 10,000 or so games to get a decent result, but as the number of terms goes up you need to run a whole lot more (in the 100,000+ range). I haven't run the math in a while, but the numbers become fairly daunting, fairly quickly. Most people simply don't have enough respect for the Law of Large Numbers.
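
A rough back-of-the-envelope version of that math, using the same 40%/sqrt(N) rule of thumb as H.G.'s post and the usual approximation that 1% near an even score is about 7 Elo:

Code: Select all

# Rough 1-sigma error bar of a match score as a function of the number of games.
import math

for n_games in (100, 1_000, 10_000, 100_000):
    error = 40.0 / math.sqrt(n_games)            # percentage points, ~1 sigma
    print(f"{n_games:>7} games: +/- {error:.2f}%  (~{error * 7:.0f} Elo)")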

regards,
--tom
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: A poor man's testing environment

Post by Don »

Tom Likens wrote:
Don wrote:I think that tuning several weights simultaneously is MORE robust than tuning one at a time. In the bishop/knight/pair example there are still dependencies with the other pieces but if you tune them simultaneously the winning weights will tend to be more robust because you must find values that work well with a variety of values of other terms. I don't have a formal proof but I suspect that correlated terms when tuned together will come out better than if they were tuned separately.
Don,

I tend to agree with this, but I think a lot of people try to use CLOP to tune 4, 5, 6+ terms simultaneously, without running enough games to let things really converge. If you're tuning one parameter you might be able to get away with 10,000 or so games to get a decent result, but as the number of terms goes up you need to run a whole lot more (in the 100,000+ range). I haven't run the math in a while, but the numbers become fairly daunting, fairly quickly. Most people simply don't have enough respect for the Law of Large Numbers.

regards,
--tom
You got that right. I have ranted about how people run 100-game matches and think they have proved something.
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.