An objective test process for the rest of us?

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

nczempin

An objective test process for the rest of us?

Post by nczempin »

Now that I have got Prof. Bob to agree with me :-) on one (small) thing (or merely have become able to point out what I had been saying all along), I'm better able to come out of my defensive stance and, in doing so, take a step back.


While I consider my test process sufficient for the current situation with Eden, it will not remain so forever. And since I'm doing a lot of things using my intuitive judgement (which I still consider to be working, i.e. not producing "release decisions" that would turn out to be incorrect after 20,000 games), I would like to pick up on his suggestion to make my tests more objective.

However, running 20,000 games is simply beyond the resources of most of us. So what I would like to do is identify the relevant parts of the process I'm using, formalize them, and then enlist the help of some of you who are more schooled in statistics than I am to work out the precise conditions under which my process would yield repeatable results.

I'm only suggesting using mine because it has already been discussed a little, and I think it could serve as a baseline for other engines that have reached the minimal maturity that Eden has.

So in the name of science I'll try to describe the process in terms that can actually be analyzed (in another post, later).
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

I view my cluster testing as very similar to other scientific discoveries. For example, the tunneling electron microscope. Until it came along, many things were invisible and we didn't know how some things worked. Now we do. Ditto for the regular microscope. And now we have the ability to test with large numbers of games, to actually measure the variance / standard deviation, rather than just guess based on how the same statistical analysis applies to human chess games.

I'm learning new things daily about this. It isn't an easy thing to work with, and I am not sure there is any real short-cut other than to just accept reduced accuracy. The issue is, how much is the accuracy reduced? My tests show it is far more than anyone guesses...
hgm
Posts: 27789
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: An objective test process for the rest of us?

Post by hgm »

Well, I don't have a tunneling microscope either. :lol: :lol: :lol:

What I dislike about excessively leaning on empirical testing for program improvements and tuning is that development effectively becomes an evolutionary algorithm, where you have little idea of what you are doing and are only making trial-and-error-based fitness improvements. And this makes you susceptible to getting trapped in a local optimum.

As in any scientific measurement, getting a better signal-to-noise ratio can improve your sensitivity beyond what you could ever achieve by just collecting more data and trying to average out the noise. This is why I like Uri's approach of eliminating all noise that is not absolutely necessary or intrinsic, like testing at a fixed number of nodes and clearing all tables between moves.
nczempin

Re: An objective test process for the rest of us?

Post by nczempin »

Okay, here's a first shot at getting to a more formalized description, we can always use a more mathematical language once we have some agreement on the content:

I have an engine E and another engine E'. I want to determine whether E' is (statistically) significantly stronger than E.

What does stronger mean in this context (some assumptions that I am making for my personal situation, YMMV)?

Stronger means that E' has a better result than E, one which is unlikely to be due to random factors alone (at some given confidence level; I think 95% is a good starting point).

For me, the goal is to test the engines "out of the box", which means their own book (or none, if they don't have one) is included. I'm looking to evaluate the whole package, not just the "thinking" part.
Actually I think using random or Nunn positions could skew the results, because they would show you how your engine plays in those positions. There may very well be positions that your engine doesn't "like", and that a sensible opening book would avoid. Of course, someone whose goal is the performance under those positions (or under "all likely positions", of which such a selected group is then assumed to be a good proxy) would need to use a lot more games. So in the discussion I would like this factor to be separated.
If Eden A achieves a certain result with a certain book, I would like to know if Eden B would achieve a better result with the same book.

I have already discussed why I consider opponents of far higher and of far lower strength to be inadequate indicators.
Just playing E against E' is also not sufficient. Even just playing both against some other engine D is not sufficient, because, as has been mentioned elsewhere, engine strength is not a transitive property (i.e. it may well be that A beats B consistently, B beats C, and C beats A; so to find out which is the strongest in a theoretical universe where only these three engines exist, a complete tournament would have to be played).

So it would make sense to play a sufficient number of games against a sufficient number of opponents of approximately the same strength.

Thus we have a set of opponents O, and E should play a number of games against each opponent, and E' should play the same number of games against the same opponents. For practical reasons the "other version" could be included in each set of opponents. This presumably does skew the results, but arguably it is an improvement over just playing E against E'.

So what we have now is a RR tournament which includes E, E' and the set O.

What I'd like to know is, assuming E' achieves more points than E, what result should make me confident that E' is actually stronger, and this is not merely a random result?

Without loss of generality, let's set the size of O to 8 (which would make the tourney have 10 contestants).

My first conjecture is that there is an inverse relationship between the number of games per match and the points differential needed between E and E' to establish that E' is stronger.

Opinions?

(Feel free to improve on my pseudo-mathematical terminology, and of course on other things).
hgm
Posts: 27789
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: An objective test process for the rest of us?

Post by hgm »

Unfortunately this is an inverse-square relationship. The standard error in the total score over an N-game match against a roughly equal opponent is about 0.4*sqrt(N). So for 100 games you would have a 4 point = 4% standard error, and each 1% score is about 7 Elo. So after playing 100 games against opponents of known strength you have a standard error of 28 Elo, and a 95% confidence of ~55 Elo in the rating of E.

Now if you play E' also for 100 games, it has its own, independent rating estimate with a 95%-confidence range of +/- 55 Elo. The 95% confidence of the difference of the two is then sqrt(2)*55 = 77 Elo. So if the difference between E and E' is more than 77 Elo, you could identify the better one by letting each play 100 games, with 95% confidence.

If you want to get 95% confidence for a smaller difference, say 11 Elo, this is 7 times smaller, so you need 49 times as many games. That is 4900 games for E, and 4900 games for E'. To as reliably resolve 5 Elo you would have to do ~20,000 games each. (Ouch!)
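
To make these numbers easy to reproduce, here is a minimal sketch in Python (the language, function names and the 1.96 z-value are my choices, not anything prescribed in the thread) that computes the standard error and the roughly inverse-square growth in the number of games needed:

Code:

import math

def score_stddev_games(n_games):
    """Standard deviation of the total score of an N-game match against a
    roughly equal opponent, in game points (~0.4*sqrt(N))."""
    return 0.4 * math.sqrt(n_games)

def rating_stderr_elo(n_games, elo_per_percent=7.0):
    """Standard error of the rating estimate in Elo: the score error as a
    percentage of N, converted at ~7 Elo per percentage point."""
    return elo_per_percent * 100.0 * score_stddev_games(n_games) / n_games

def games_needed(elo_diff, confidence_z=1.96):
    """Games per version so that the ~95% confidence interval of the
    *difference* of two independent ratings is smaller than elo_diff."""
    single_game_stderr = 7.0 * 100.0 * 0.4  # 280 Elo for N = 1
    # difference of two independent estimates: multiply the stderr by sqrt(2)
    return (math.sqrt(2) * confidence_z * single_game_stderr / elo_diff) ** 2

print(rating_stderr_elo(100))         # ~28 Elo, as in the post
print(1.96 * rating_stderr_elo(100))  # ~55 Elo (95% confidence)
print(games_needed(77))               # ~100 games per version
print(games_needed(11))               # ~5000 (the post rounds to 4900)
print(games_needed(5))                # ~24000 (the post's "~20,000")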

Playing a round-robin is an enormous waste of time, as you would also be playing engines against each other that you don't want to test. These results don't contribute any information to discriminating E vs E'. Never do it, just play gauntlets.

The only way out of this inverse-square trap is making the games less independent. As such, playing with a book (where the engines randomly select opening lines from the book) is a bad idea. The games will certainly be independent then, and there is nothing you can do about it. Even two identical, deterministically playing engines would play completely different and independent games, just because the opponents selected different openings.

Playing from fixed positions with deterministic engines and deterministic opponents would at least allow you to see in a small number of games whether E and E' are identical. If you had made a modification that only affects the end-game (e.g. let the evaluation recognize KBK and KNK, and score them as draws), they would end up in the same set of end-games, as the games would be move for move the same until they could make / refrain from that fatal mistake of 'winning' the opponent's piece at the expense of their last pawn. The early moves would just be a way to generate a representative sample of end-games, so that you can see how important it is to recognize this in practice.

So play from a (large) number of positions that are representative for early middle game.

If your opponents or your own engine are not deterministic enough to end up in the same end-game from the same initial position, force them to do so: play a game between E and O, take the first position from that game where there are only 8 pieces (incl. kings, excl. pawns) on the board, and use that as the starting point for the game between E' and O (if you want to test the above end-game eval). If you are worried that a set of early middle-game positions is not representative for your engine, generate the positions yourself by applying the same strategy to the early middle game: play a game between E and O with own books until they are out of book, and use those positions as starting points for E' vs O as well. You want to test the effect of your change, not compare the quality of two book lines, if your change was in the engine rather than the book.
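
A minimal sketch of that position-extraction idea, using the python-chess library (my choice of tool; the filename is a placeholder). It scans a finished E vs O game and returns the first position with at most 8 non-pawn pieces (kings included), which can then be used as the start position for the E' vs O game:

Code:

import chess
import chess.pgn

def first_endgame_position(pgn_path, max_pieces=8):
    """Return the FEN of the first position in the game's main line with at
    most `max_pieces` pieces excluding pawns (kings included), or None."""
    with open(pgn_path) as f:
        game = chess.pgn.read_game(f)
    board = game.board()
    for move in game.mainline_moves():
        board.push(move)
        non_pawns = sum(1 for p in board.piece_map().values()
                        if p.piece_type != chess.PAWN)
        if non_pawns <= max_pieces:
            return board.fen()
    return None

# e.g. start the E' vs O game from: first_endgame_position("E_vs_O_game1.pgn")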

Test the quality of your evaluation and search under time controls that do not interfere with determinism. Even if this would not be the best way to manage time, it is extremely unlikely that changes that give an improvement at deterministic time controls (fixed depth or fixed number of nodes, clear hash, exact-depth hash hits only) would not give a similar improvement at more advanced (but noisier) time controls.
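
As an illustration of such a deterministic time control, here is a sketch that plays one game at a fixed search depth via python-chess's UCI wrapper (engine paths and the depth are placeholder assumptions of mine, not anything specified in the thread):

Code:

import chess
import chess.engine

ENGINE_E = "./eden"      # placeholder paths: any UCI engines will do
OPPONENT = "./opponent"
FIXED_DEPTH = 8          # deterministic limit instead of wall-clock time

def play_fixed_depth_game(start_fen=chess.STARTING_FEN):
    """Play one game between two UCI engines at a fixed search depth and
    return the result string ('1-0', '0-1' or '1/2-1/2')."""
    board = chess.Board(start_fen)
    with chess.engine.SimpleEngine.popen_uci(ENGINE_E) as white, \
         chess.engine.SimpleEngine.popen_uci(OPPONENT) as black:
        while not board.is_game_over():
            engine = white if board.turn == chess.WHITE else black
            result = engine.play(board, chess.engine.Limit(depth=FIXED_DEPTH))
            board.push(result.move)
    return board.result()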

If opponents do not support a fully deterministic mode, try to force them to play equal games by selection. Play both E and E' against O simultaneously, letting O think only once, and use its move against both E and E' for as long as E and E' play the same moves. Let E and E' finish their iterations, let E determine the number of iterations from reading the clock, and then (for as long as they play the same game) force E' (and E'', and ...) to use the same number of iterations on a move-by-move basis. This is not as difficult as it seems: you would have one 'baseline version' of your engine that you use as a reference, and for every game in its gauntlet you would store moves and depth (as is usually done in a PGN file). You would program the test versions of your engine to read that PGN and, as long as the moves are the same as in any game in the PGN, use the same number of iterations (rather than a preset time) to search their next move as were used for the next move in that game. Measure the time they use, rather than enforcing it, and correct for small differences according to an empirical formula (e.g. 1 Elo penalty per 1% extra total time usage).
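
A rough sketch of the bookkeeping this describes, under my own assumption that the baseline PGN stores the search depth of each move in a comment like "depth=12"; while the test version's game still matches the baseline move for move, it searches to that stored depth instead of using the clock:

Code:

import re
import chess.pgn

DEPTH_RE = re.compile(r"depth=(\d+)")  # assumed comment format, e.g. {depth=12}

def load_baseline(pgn_path):
    """Map each move-sequence prefix of the baseline game (as a UCI string)
    to the search depth that was used for the move played from it."""
    with open(pgn_path) as f:
        game = chess.pgn.read_game(f)
    depth_by_prefix = {}
    prefix = []
    for node in game.mainline():
        m = DEPTH_RE.search(node.comment or "")
        if m:
            depth_by_prefix[" ".join(prefix)] = int(m.group(1))
        prefix.append(node.move.uci())
    return depth_by_prefix

def depth_for(moves_so_far, depth_by_prefix, default_depth=8):
    """While the test game still follows the baseline, reuse its depth;
    once it branches off, fall back to a default limit."""
    key = " ".join(m.uci() for m in moves_so_far)
    return depth_by_prefix.get(key, default_depth)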

After the two versions branch off, you can play multiple games starting from the first position that was different, to account for opponent variability (to make it less likely that you think E' made the better move just because the opponent's continuation was atypically poor). You could again play E and E' simultaneously against O (a number of times) from positions just after their original games diverged ('tree-wise comparison').

There are really lots of things one can do to improve testing accuracy without having a super-cluster, other than just throwing your hands up in the air (a condition known as MIPS addiction :lol: )!
nczempin

Re: An objective test process for the rest of us?

Post by nczempin »

hgm wrote: Playing a round-robin is an enormous waste of time, as you would also be playing engines against each other that you don't want to test. These results don't contribute any information to discriminating E vs E'. Never do it, just play gauntlets.
It is a waste of time if you look at just one iteration of this process. However, when I go from E' to E'', I don't have to repeat any of these games, so the additional effort converges towards zero with each new E version.

I am not using rating as the discriminator, but relative position in the tournament plus the points differential (although I have very recently started looking at EloStat).

The RR is a reflection of my observation that engine strength may indeed be non-transitive.
nczempin

Re: An objective test process for the rest of us?

Post by nczempin »

hgm wrote:Unfortunately this is an inverse-square relationship. The standard error in the total score over an N-game match against a roughly equal opponent is about 0.4*sqrt(N). So for 100 games you would have a 4 point = 4% standard error, and each 1% score is about 7 Elo. So after playing 100 games against opponents of known strength you have a standard error of 28 Elo, and a 95% confidence of ~55 Elo in the rating of E.
What is the standard error in the overall score against m opponents that are close in strength but not identical?
hgm
Posts: 27789
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: An objective test process for the rest of us?

Post by hgm »

nczempin wrote:What is the standard error in the overall score against m opponents that are close in strength but not identical?
Same thing: it does not depend on the number of opponents you play against, just on the total number of games you play, provided that they are not too different from your own strength (within +/- 300 Elo, i.e. score percentages of 15%-85%).

How many other engines end up between the two in the ranking has nothing to do with their relative strength. The latter is determined purely from the score difference. To be 95% certain that this score difference could not have been reached by a version that was equally strong, it needs to be about 2 standard deviations. So with 8 opponents and 20 games per match, you would have 160 games, and the standard error would be 0.4*sqrt(160) ~ 5 games. So you would need to score 10 points better to be 95% certain that it actually is better. With only 2 games against each opponent it would have to score roughly 3 points better.
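
Translated into a small helper (a sketch under the same assumptions as the numbers above, using 2 standard deviations for ~95% confidence):

Code:

import math

def required_margin(n_games, z=2.0):
    """Score margin (in game points) by which E' must beat E before we are
    ~95% confident it is genuinely stronger, using sigma ~ 0.4*sqrt(N)."""
    return z * 0.4 * math.sqrt(n_games)

print(required_margin(160))  # ~10 points for 8 opponents x 20 games
print(required_margin(16))   # ~3 points for 8 opponents x 2 games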
Uri Blass
Posts: 10269
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: An objective test process for the rest of us?

Post by Uri Blass »

nczempin wrote:Okay, here's a first shot at getting to a more formalized description, we can always use a more mathematical language once we have some agreement on the content:

I have an engine E and another engine E'. I want to determine whether E' is (statistically) significantly stronger than E.

What does stronger mean in this context (some assumptions that I am making for my personal situation, YMMV)?

Stronger means that E' has a better result than E, one which is unlikely to be due to random factors alone (at some given confidence level; I think 95% is a good starting point).

For me, the goal is to test the engines "out of the box", which means their own book (or none, if they don't have one) is included. I'm looking to evaluate the whole package, not just the "thinking" part.
Actually I think using random or Nunn positions could skew the results, because they would show you how your engine plays in those positions. There may very well be positions that your engine doesn't "like", and that a sensible opening book would avoid. Of course, someone whose goal is the performance under those positions (or under "all likely positions", of which such a selected group is then assumed to be a good proxy) would need to use a lot more games. So in the discussion I would like this factor to be separated.
If Eden A achieves a certain result with a certain book, I would like to know if Eden B would achieve a better result with the same book.

I have already discussed why I consider opponents of far higher and of far lower strength to be inadequate indicators.
Just playing E against E' is also not sufficient. Even just playing both against some other engine D is not sufficient, because, as has been mentioned elsewhere, engine strength is not a transitive property (i.e. it may well be that A beats B consistently, B beats C, and C beats A; so to find out which is the strongest in a theoretical universe where only these three engines exist, a complete tournament would have to be played).

So it would make sense to play a sufficient number of games against a sufficient number of opponents of approximately the same strength.

Thus we have a set of opponents O, and E should play a number of games against each opponent, and E' should play the same number of games against the same opponents. For practical reasons the "other version" could be included in each set of opponents. This presumably does skew the results, but arguably it is an improvement over just playing E against E'.

So what we have now is a RR tournament which includes E, E' and the set O.

What I'd like to know is, assuming E' achieves more points than E, what result should make me confident that E' is actually stronger, and this is not merely a random result?

Without loss of generality, let's set the size of O to 8 (which would make the tourney have 10 contestants).

My first conjecture is that there is an inverse relationship between the number of games per match and the points differential needed between E and E' to establish that E' is stronger.

Opinions?

(Feel free to improve on my pseudo-mathematical terminology, and of course on other things).
I still have not reached the high level where the book is very important, and Rybka without a book can easily win against Movei with a book, so I prefer to improve based on testing with no book.

I think that you can improve more by other means than by improving your book.

I expect a change that is productive without a book to be productive later with the right book, so you can safely ignore the book factor and add a book only when your program becomes stronger.

If I improve without a book, I probably have a basis for improvement with the right book.

Movei is not close to being finished, so it would clearly be a mistake to spend much energy on the book.

Uri
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

hgm wrote:Well, I don't have a tunneling microscope either. :lol: :lol: :lol:

What I dislike about excessively leaning on empirical testing for program improvements and tuning is that development effectively becomes an evolutionary algorithm, where you have little idea of what you are doing and are only making trial-and-error-based fitness improvements. And this makes you susceptible to getting trapped in a local optimum.

As in any scientific measurement, getting a better signal-to-noise ratio can improve your sensitivity beyond what you could ever achieve by just collecting more data and trying to average out the noise. This is why I like Uri's approach of eliminating all noise that is not absolutely necessary or intrinsic, like testing at a fixed number of nodes and clearing all tables between moves.
Uri's approach doesn't eliminate anything. If you make a one-line change, the nodes you search are different, and that moves you into a different sample where the results can change. So testing with a fixed number of nodes is no better than testing for a fixed amount of time, unless all you are doing is speeding things up. And in that case a fixed-node search is pointless as well.

Been there, tried it. It's just no good. You _still_ need a large number of games, but each sample needs to be with a different number of nodes now, to be sure your sample is representative of reality and not some local maximum.