Statistical Interpretation

D Sceviour
Posts: 570
Joined: Mon Jul 20, 2015 5:06 pm

Re: Statistical Interpretation

Post by D Sceviour »

Here is an interesting turning point. The Elo difference is now greater than the ±SD error margin. Can any conclusion be drawn at this point?

Code: Select all

CuteChess GUI 0.9.4 tournament
Rank Name                          ELO     +/-   Games   Score   Draws
   1 Counter_10                    119      49     149     66%     28%
   2 Blackburne1.9.0               -57      47     148     42%     31%
   3 Blackburne1.9.0a              -59      48     149     42%     27%

223 of 600 games finished.
Level: Tournament 60/1
D Sceviour
Posts: 570
Joined: Mon Jul 20, 2015 5:06 pm

Re: Statistical Interpretation

Post by D Sceviour »

This should be enough games to discuss. Combined with the initial 100 games, these 500 games indicate the history move is useless or detrimental to the search. Obviously, there is a big difference between 12"+.012" and longer time controls.

Code: Select all

Rank Name                          ELO     +/-   Games   Score   Draws
   1 Counter_10                     90      30     400     63%     27%
   2 Blackburne1.9.0a              -40      28     400     44%     32%
   3 Blackburne1.9.0               -48      28     400     43%     32%

600 of 600 games finished.
Level: Tournament 60/1
Are there any other reasons the longer time control would produce different results? How about this for a reason: the history buffer is only 16 pieces x 64 squares = 1024 entries. In most cases, the best history move is still the same as the first killer. The effect of the second history move is supposed to increase the chance of a cutoff. Internal debug statistics do show hits for the history move, so it is doing something. Faster time controls produce a shallow search, so the effective window for the history move is better than for a deeper search that produces more unique positions. With longer time controls, the history buffer becomes saturated and begins to produce bad moves instead of good ones.
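For readers who have not implemented one, here is a minimal sketch of the kind of history table described above, assuming the usual [piece][to-square] indexing and a simple halving ("aging") scheme to fight saturation. The names and thresholds are illustrative, not Blackburne's actual code.

Code: Select all

/* Minimal history-heuristic sketch (illustrative, not Blackburne's code). */
#include <string.h>

#define NPIECE 16          /* 16 piece codes x 64 squares = 1024 entries */
#define NSQ    64

static int history[NPIECE][NSQ];

/* Called when a quiet move produces a beta cutoff. */
void history_update(int piece, int to, int depth)
{
    history[piece][to] += depth * depth;
    if (history[piece][to] > (1 << 20)) {     /* counters getting large: */
        for (int p = 0; p < NPIECE; p++)      /* halve the whole table   */
            for (int s = 0; s < NSQ; s++)     /* so old, stale moves     */
                history[p][s] /= 2;           /* stop dominating         */
    }
}

/* Used in move ordering, after the hash move and killers. */
int history_score(int piece, int to)
{
    return history[piece][to];
}

/* Cleared at the start of a new game (or a new search, if preferred). */
void history_clear(void)
{
    memset(history, 0, sizeof history);
}
Without some aging step like the halving above, the counters only ever grow, and in a long search the ordering signal from the table flattens out, which is one way the saturation described above can happen.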
flok

Re: Statistical Interpretation

Post by flok »

Sorry to hijack the thread, but if I want to run a lot of games, is it required to:

- run all games on the same kind of hardware (processor, RAM)

- if not, is it required to run the same number of games on systems with spec A and the same number of games on systems with spec B?
D Sceviour
Posts: 570
Joined: Mon Jul 20, 2015 5:06 pm

Re: Statistical Interpretation

Post by D Sceviour »

flok wrote:Sorry to hijack the thread, but if I want to run a lot of games, is it required to:

- run all games on the same kind of hardware (processor, RAM)

- if not, is it required to run the same number of games on systems with spec A and the same number of games on systems with spec B?
It has been a very long time since I studied formal statistics, and I am sure I have forgotten more than I can remember. However, my approach has been to base the test on the number of degrees of freedom (DOF = V + 1) needed for making a decision. Thus, for a one-variable history test, the minimum number of engines required would be two. Further, to account for any unknown variables and the null hypothesis, I use three engines. The extra variables for the history test appear to be game time and depth of search.

When running a test to compare a change in a variable, it is assumed there is only one variable and it is orthogonal to all other variables. Such an assumption can be questionable, particularly when depth of search can influence the saturation of tables. Sometimes for comparing two engines only, I use a fixed depth of search. Thus, different hardware would make no difference with a fixed depth of search, but it might make a difference with a fixed time of search. Further, it is impossible for the system to steal clock cycles from a fixed depth of search.

I would imagine the more games the better, and there is no reason the number of games should be the same.
User avatar
Evert
Posts: 2929
Joined: Sat Jan 22, 2011 12:42 am
Location: NL

Re: Statistical Interpretation

Post by Evert »

Laskos wrote:When your difference between engines becomes significantly larger than error margins shown (2SD), you can say that the patch is either good or bad. Then you can stop the test and admit/reject the patch. It may take 100 games or 100,000 games, but if it needs 100,000 games, the patch is not important strength-wise.
It is of course important to remember that you should not use "reaching n-sigma" as a stopping condition, since that introduces a bias.

I use SPRT to decide whether patches are good or not, which I think has become fairly standard.
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Statistical Interpretation

Post by Laskos »

Evert wrote:
Laskos wrote:When your difference between engines becomes significantly larger than error margins shown (2SD), you can say that the patch is either good or bad. Then you can stop the test and admit/reject the patch. It may take 100 games or 100,000 games, but if it needs 100,000 games, the patch is not important strength-wise.
It is of course important to remember that you should not use "reaching n-sigma" as a stopping condition, since that introduces a bias.

I use SPRT to decide whether patches are good or not, which I think has become fairly standard.
3-sigma is a fairly good stopping rule; it has below 5% Type I error over a practical number of games (100-100,000). It does have unbounded Type I error in theory, for example above 5% at 10^8 games, but that is impractical. SPRT should indeed be the standard, but I was trying to simplify the procedure for those not very familiar with stopping rules.
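If anyone wants to check such numbers for themselves, a quick Monte Carlo sketch like the one below will do it. It assumes two equal-strength engines with a 30% draw rate, checks the 3-sigma condition after every game from game 30 onward, and stops a run at 10,000 games; all of those parameters are assumptions chosen for illustration.

Code: Select all

/* Monte Carlo sketch: Type I error of a "stop when 3 sigma is reached" rule.
   Assumptions: equal engines (H0 true), 30% draws, checks start at game 30,
   at most 10,000 games per run. Parameters are illustrative only. */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main(void)
{
    const int runs = 20000, min_games = 30, max_games = 10000;
    const double p_win = 0.35, p_draw = 0.30;   /* equal strength */
    int false_positives = 0;

    srand(12345);
    for (int r = 0; r < runs; r++) {
        double sum = 0.0, sumsq = 0.0;
        for (int g = 1; g <= max_games; g++) {
            double u = (double)rand() / RAND_MAX;
            double s = (u < p_win) ? 1.0 : (u < p_win + p_draw) ? 0.5 : 0.0;
            sum += s;
            sumsq += s * s;
            if (g < min_games)
                continue;
            double mean = sum / g;
            double var  = sumsq / g - mean * mean;  /* per-game variance */
            double sd   = sqrt(var / g);            /* SD of the mean    */
            if (fabs(mean - 0.5) > 3.0 * sd) {      /* "3 sigma" trigger */
                false_positives++;
                break;
            }
        }
    }
    printf("Type I error with 3-sigma stopping: %.3f\n",
           (double)false_positives / runs);
    return 0;
}
Raising max_games shows the slow creep in the false-positive rate that makes the rule unbounded in theory.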
D Sceviour
Posts: 570
Joined: Mon Jul 20, 2015 5:06 pm

Re: Statistical Interpretation

Post by D Sceviour »

Here are the results for a fixed depth search using Arena 3.0. It seems to be a reliable method. One thing noticed was the consistency of the score ratio throughout the tournament: 1.9.0a took a lead in the first 10 games and never varied from a 10-15% lead. Other types of tournaments often have a neck-and-neck race where it is difficult to tell which side stands better. Also observe that more games were necessary to establish the 2SD spread for two engines than were needed for three engines. It might be interesting if a second variable could be tested at the same time, with the patch in the third engine. Although this observation does not make anything conclusive yet, it seems worth mentioning.

Code: Select all

750 games played / Tournament is finished
Name of the tournament: Arena 3.0 tournament
Level: 8 Half moves
  Program               Elo    +   -   Games   Score               Draws

1 Blackburne1.9.0a    : 2342   20  20   750    404.0/750  53.9 %   33.1 %
2 Blackburne1.9.0     : 2316   20  20   750    346.0/750  46.1 %   33.1 %
One nice thing about a fixed depth search tournament is that different background processes can be run without any fear of interference or creation of bias. CuteChess does not appear to handle fixed depth searches. What objections are there to using a fixed depth of search for testing patches?

With the exception of the questionable 12"+.12" 1000-game tournament, all the results so far indicate the same conclusion: that the refutation history move is useless. I would rather estimate strength from 100 games of dependable quality than from 10,000 games of dubious quality.

Next, the SPRT option in cutechess has been recommended as a method of statistical interpretation. These values were suggested for cutechess in a previous topic:

Code: Select all

 -sprt elo0=10 elo1=25 alpha=0.05 beta=0.05

Misuse of the null hypothesis used to be a problem with these types of statistical studies, but I admit I am not up to date on the latest techniques. Another apparent problem is that only two engines (and one patch variable) can be tested at a time. A quick "SPRT stats for dummies" lesson is needed here. Can these figures be expected to yield a reasonable result before the test is started?
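As a rough "for dummies" picture of what those options do: cutechess keeps a running log-likelihood ratio (LLR) comparing the hypothesis "the patch is worth at least elo1" against "it is worth at most elo0", and stops once the LLR crosses bounds set by alpha and beta. The sketch below uses the widely used simplified trinomial approximation of that LLR; it is illustrative rather than cutechess's actual source, and the win/draw/loss totals are made up.

Code: Select all

/* Hedged sketch of SPRT bookkeeping for
   -sprt elo0=10 elo1=25 alpha=0.05 beta=0.05
   using the common simplified trinomial LLR approximation. */
#include <stdio.h>
#include <math.h>

static double expected_score(double elo)            /* logistic Elo model */
{
    return 1.0 / (1.0 + pow(10.0, -elo / 400.0));
}

/* Approximate log-likelihood ratio of H1 (elo1) versus H0 (elo0),
   given wins, draws and losses for the patched engine. */
static double llr(int w, int d, int l, double elo0, double elo1)
{
    double n = w + d + l;
    if (n == 0)
        return 0.0;
    double score = (w + 0.5 * d) / n;                /* observed score    */
    double m2    = (w + 0.25 * d) / n;
    double var   = (m2 - score * score) / n;         /* variance of score */
    double s0 = expected_score(elo0);
    double s1 = expected_score(elo1);
    return (s1 - s0) * (2.0 * score - s0 - s1) / (2.0 * var);
}

int main(void)
{
    const double alpha = 0.05, beta = 0.05;
    const double lower = log(beta / (1.0 - alpha));  /* accept H0 below   */
    const double upper = log((1.0 - beta) / alpha);  /* accept H1 above   */
    int w = 260, d = 220, l = 220;                   /* hypothetical W/D/L */

    double v = llr(w, d, l, 10.0, 25.0);
    printf("LLR = %.2f  (reject patch if < %.2f, accept if > %.2f, "
           "otherwise keep playing)\n", v, lower, upper);
    return 0;
}
With alpha = beta = 0.05 the bounds come out at roughly ±2.94, so the test simply keeps playing games until the evidence pushes the LLR past one of them.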
kbhearn
Posts: 411
Joined: Thu Dec 30, 2010 4:48 am

Re: Statistical Interpretation

Post by kbhearn »

The problem with fixed depth testing is that it bears no relation to time at all. That is, an engine that does a relatively wide depth-12 search, no matter how long it takes, will have an advantage over an engine that would reach depth 24 in the same time using aggressive pruning, because the deeper engine has to stop at 12. While this is a little extreme, null move pruning, for instance, would look like a loss under fixed depth. Furthermore, positions where you would usually get more depth than others, such as some endgames, do not get to take advantage of that, so you do not get a realistic view of how strong your engine would actually be at a time control where it reaches the fixed depth you picked for the middlegame.

Using nodes as a time control is a bit better. It still can't be used fairly between different engines, but changes in your own engine won't typically affect NPS much, and it has the same advantage for non-SMP tests of letting you compare results generated on different hardware or under unfair running conditions. It also wouldn't reflect NPS speedups/slowdowns in the various phases of the game and would thus create a bit of skew that way, but less so than ignoring the branching-factor changes across phases of the game. Both the WinBoard and UCI protocols have extensions to support such a mode, I believe.
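For what it's worth, here is a hedged sketch of how a node limit usually plugs into an engine: the limit is polled like a clock check inside the search, and the engine still reports the best move from the last completed iteration. Identifiers such as search_root and nodes_searched are illustrative, not from any engine in this thread; on the protocol side, UCI's "go nodes <x>" is the relevant extension.

Code: Select all

/* Hedged sketch: a node-count limit treated like a time control.
   Identifiers are illustrative, not from any engine in this thread. */
#include <stdint.h>

extern uint64_t nodes_searched;        /* incremented once per node        */
extern uint64_t node_limit;            /* e.g. taken from UCI "go nodes N" */
extern int search_root(int depth);     /* normal root search, aborts when
                                          stop_on_nodes() becomes true     */

int stop_on_nodes(void)
{
    return node_limit != 0 && nodes_searched >= node_limit;
}

/* Iterative deepening: only moves from fully completed iterations are
   kept, so an aborted final iteration does not throw away earlier work. */
int think(int max_depth)
{
    int best_move = 0;
    for (int depth = 1; depth <= max_depth; depth++) {
        int move = search_root(depth);
        if (stop_on_nodes())
            break;
        best_move = move;
    }
    return best_move;
}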
D Sceviour
Posts: 570
Joined: Mon Jul 20, 2015 5:06 pm

Re: Statistical Interpretation

Post by D Sceviour »

kbhearn wrote:The problem with fixed depth testing is that it bears no relation to time at all. That is, an engine that does a relatively wide depth-12 search, no matter how long it takes, will have an advantage over an engine that would reach depth 24 in the same time using aggressive pruning, because the deeper engine has to stop at 12.
Hello Kevin,
The fixed depth method is best used for engines being tested against themselves. Fixed depth is a very good way of exposing bad bugs in engine patches, not of producing good games.
While this is a little extreme, null move pruning, for instance, would look like a loss under fixed depth.
Good point. It depends on how the engine treats fixed depth. For my engines, the fixed depth is tested as the root iterative depth level, and is tested only at the root. The depth limitation does not affect extensions or null move search and there is no depth test inside the search itself (except the maximum depth of 96).
Furthermore, positions where you would usually get more depth than others, such as some endgames, do not get to take advantage of that, so you do not get a realistic view of how strong your engine would actually be at a time control where it reaches the fixed depth you picked for the middlegame.
True, but of course fixed depth search should never be expected to test anything related to time control. The games produced are corrupt. The purpose of the test is not to produce strong games, but to expose bad bugs. The games should never be entered into a database for openings, or used for calculating ratings. However, small fixed depths (depth=4) are much easier to trace for errors. It is easy to see where the engine's "plan" went wrong.
Using nodes as a time control is a bit better. It still can't be used fairly between different engines, but changes in your own engine won't typically affect NPS much, and it has the same advantage for non-SMP tests of letting you compare results generated on different hardware or under unfair running conditions. It also wouldn't reflect NPS speedups/slowdowns in the various phases of the game and would thus create a bit of skew that way, but less so than ignoring the branching-factor changes across phases of the game. Both the WinBoard and UCI protocols have extensions to support such a mode, I believe.
I have not experimented with nodes searched yet. Why would it expose the differences between two engines any better than a fixed depth search? A node count cutoff means that up to 90% of the search could be wasted if the search cuts off before the score is completed. A fixed depth search uses 100% of the nodes and there is no waste.
Robert Pope
Posts: 558
Joined: Sat Mar 25, 2006 8:27 pm

Re: Statistical Interpretation

Post by Robert Pope »

D Sceviour wrote:
kbhearn wrote:The problem with fixed depth testing is that it bears no relation to time at all. That is, an engine that does a relatively wide depth-12 search, no matter how long it takes, will have an advantage over an engine that would reach depth 24 in the same time using aggressive pruning, because the deeper engine has to stop at 12.
Hello Kevin,
The fixed depth method is best used for engines being tested against themselves. Fixed depth is a very good way of exposing bad bugs in engine patches, not of producing good games.
I have not experimented with nodes searched yet. Why would it expose the differences between two engines any better than a fixed depth search? A node count cutoff means that up to 90% of the search could be wasted if the search cuts off before the score is completed. A fixed depth search uses 100% of the nodes and there is no waste.
Think about his example of null-move pruning. A depth-10 search with null-move pruning is always going to perform worse than a depth-10 search without it, because the first one intentionally ignores parts of the tree so that it can go deeper faster.

So if you are doing fixed depth testing, it will always show null move as a loser and you will never incorporate it. But if you do fixed node testing, the picture changes. Instead of 10M nodes giving you a depth-10 search, maybe they give you depth 11. You go a full ply deeper in the same amount of time and uncover new tactics that would be missed if you could only get to 10.

At the extreme, fixed depth testing will conclude that alpha-beta search is no better than pure minimax. The engine performs identically with both algorithms at a fixed depth, and alpha-beta is more complicated. But obviously, sticking with minimax would be a dumb choice to make in practice.