Statistical Interpretation

D Sceviour
Posts: 570
Joined: Mon Jul 20, 2015 5:06 pm

Statistical Interpretation

Post by D Sceviour »

Help is needed interpreting the statistics from a test. The purpose of the test was to determine the significance of a history move. This is the refutation history called the countermove in Stockfish, and counter_move[ply - 1] in Crafty. The history move is stored in (piece type, to square) format.
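
For readers unfamiliar with the heuristic, here is a minimal sketch in C of such a countermove table, indexed by the previous move's (piece type, to square). The type and array names are illustrative assumptions, not Blackburne's actual code.

Code:

#define PIECE_NB  16
#define SQUARE_NB 64

typedef unsigned short Move;  /* hypothetical packed move encoding */

Move counter_move[PIECE_NB][SQUARE_NB];

/* When 'move' refutes the previous move (causes a beta cutoff),
   remember it, keyed by the previous move's piece and destination. */
void counter_update(int prev_piece, int prev_to, Move move)
{
    counter_move[prev_piece][prev_to] = move;
}

/* During move ordering, try the stored refutation early. */
Move counter_probe(int prev_piece, int prev_to)
{
    return counter_move[prev_piece][prev_to];
}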

Engine Blackburne 1.9.0 is a 32-bit program assembled with GAS, with a lot of debug code, and has the history move installed. Version 1.9.0a has no history installed. Popochin is a public-domain third engine of comparable strength, used to establish a baseline. A 100-game test was performed between the two otherwise identical versions, with Popochin as the comparison. The results are strange: the version with the history move performs well against its sibling, but badly against the comparison engine.

From the results, can it be concluded that the refutation history counter_move is useless or detrimental to the search?

Code:

   Engine           Score      S-B
1: Blackburne1.9.0a 53.5/100   2591.0
2: Popochin         50.0/100   2468.5
3: Blackburne1.9.0  46.5/100   2416.0
Name of the tournament: Arena 3.0 tournament
Level: Blitz 2/0

-----------------Blackburne1.9.0-----------------
Blackburne1.9.0 - Blackburne1.9.0a : 26.0/50 18-16-16 52% +14
Blackburne1.9.0 - Popochin : 20.5/50 17-26-7 41% -63
-----------------Blackburne1.9.0a-----------------
Blackburne1.9.0a - Blackburne1.9.0 : 24.0/50 16-18-16 48% -14
Blackburne1.9.0a - Popochin : 29.5/50 23-14-13 59% +63
-----------------Popochin-----------------
Popochin - Blackburne1.9.0 : 29.5/50 26-17-7 59% +63
Popochin - Blackburne1.9.0a : 20.5/50 14-23-13 41% -63

Games : 150 (finished)

White Wins : 66 (44.0 %)
Black Wins : 48 (32.0 %)
Draws : 36 (24.0 %)

White Perf. : 56.0 %
Black Perf. : 44.0 %
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Statistical Interpretation

Post by Laskos »

D Sceviour wrote:Help is needed interpreting the statistics from a test. The purpose of the test was to determine the significance of a history move. This is the refutation history called the countermove in Stockfish, and counter_move[ply - 1] in Crafty. The history move is stored in (piece type, to square) format.

Engine Blackburne 1.9.0 is a 32-bit program assembled with GAS, with a lot of debug code, and has the history move installed. Version 1.9.0a has no history installed. Popochin is a public-domain third engine of comparable strength, used to establish a baseline. A 100-game test was performed between the two otherwise identical versions, with Popochin as the comparison. The results are strange: the version with the history move performs well against its sibling, but badly against the comparison engine.

From the results, can it be concluded that the refutation history counter_move is useless or detrimental to the search?

Too few games. These three engines all fall within two standard deviations' error margins of one another. Play at a shorter time control, maybe with 10 times as many games.
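
To make the margins concrete, here is a rough sketch in C (my own illustration, not part of the thread) of the standard error of a match score computed from win/draw/loss counts, using the Blackburne1.9.0 vs Popochin line from the table above:

Code:

#include <math.h>
#include <stdio.h>

/* Standard error of the mean score of a match, scoring win = 1,
   draw = 0.5, loss = 0: se = sqrt((E[x^2] - mu^2) / n). */
double score_se(int wins, int draws, int losses)
{
    int n = wins + draws + losses;
    double mu  = (wins + 0.5  * draws) / n;
    double ex2 = (wins + 0.25 * draws) / n;
    return sqrt((ex2 - mu * mu) / n);
}

int main(void)
{
    /* Blackburne1.9.0 - Popochin : 17 wins, 26 losses, 7 draws */
    double se = score_se(17, 7, 26);
    printf("score = %.1f%%, 2-sigma margin = +/- %.1f%%\n",
           100.0 * (17 + 0.5 * 7) / 50.0, 200.0 * se);
    return 0;
}

On 50 games the two-sigma margin comes out near +/- 13 percentage points, which is why the three results above are statistically indistinguishable.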
D Sceviour
Posts: 570
Joined: Mon Jul 20, 2015 5:06 pm

Re: Statistical Interpretation

Post by D Sceviour »

Laskos wrote:Too few games. These three engines all fall within two standard deviations' error margins of one another. Play at a shorter time control, maybe with 10 times as many games.
Hello Laskos,
This is an interesting point. The number of games depends on what is being tested. If one were testing an opening book or a piece-square table value, then the number of games would be significant. For the history move, the number of nodes searched, or the number of different positions tested, seems more important. Thus, it is more important to have a longer time control to explore as many unique positions as possible. Peter Österlund (Texel) has developed a tuning method that considers the number of positions searched rather than the number of games.

The time control could be set shorter (Blitz, 12 seconds per game) for 1000 games, but the number of unique positions explored would be smaller given the same amount of time for the test. 100 games seems sufficient for the test. There are at least a hundred different parameters in a chess engine. It would be nice to test all 100! combinations of them, but there is not enough time to do this. There are claims that 10,000 games are sometimes played to test a parameter, but how is this achieved, and at what time control? Are 100 cores or different PCs available? What parameters are being tested?
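
For reference, the Texel tuning method mentioned above minimizes the mean squared error between game results and a logistic mapping of the static evaluation over a large set of positions, rather than relying on game counts directly. A minimal sketch in C, with the evaluations and results as placeholder inputs:

Code:

#include <math.h>

/* Logistic mapping from a centipawn eval to an expected score;
   K is a scaling constant fitted to the data set. */
double sigmoid(double eval_cp, double K)
{
    return 1.0 / (1.0 + pow(10.0, -K * eval_cp / 400.0));
}

/* Mean squared error between game results (1, 0.5, 0) and the
   predicted scores, over n positions. Tuning minimizes this. */
double tuning_error(const double *eval_cp, const double *result,
                    int n, double K)
{
    double e = 0.0;
    for (int i = 0; i < n; i++) {
        double d = result[i] - sigmoid(eval_cp[i], K);
        e += d * d;
    }
    return e / n;
}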
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Statistical Interpretation

Post by Laskos »

D Sceviour wrote:
Laskos wrote:Too few games. These three engines all fall within two standard deviations' error margins of one another. Play at a shorter time control, maybe with 10 times as many games.
Hello Laskos,
This is an interesting point. The number of games depends on what is being tested. If one were testing an opening book or a piece-square table value, then the number of games would be significant. For the history move, the number of nodes searched, or the number of different positions tested, seems more important. Thus, it is more important to have a longer time control to explore as many unique positions as possible. Peter Österlund (Texel) has developed a tuning method that considers the number of positions searched rather than the number of games.

The time control could be set shorter (Blitz, 12 seconds per game) for 1000 games, but the number of unique positions explored would be smaller given the same amount of time for the test. 100 games seems sufficient for the test. There are at least a hundred different parameters in a chess engine. It would be nice to test all 100! combinations of them, but there is not enough time to do this. There are claims that 10,000 games are sometimes played to test a parameter, but how is this achieved, and at what time control? Are 100 cores or different PCs available? What parameters are being tested?
If you want to know whether your search change is beneficial, I see no way other than checking it by strength improvement in games. If, for example, a change doesn't affect the outcome of the search but brings a higher NPS, that's easy: just check the NPS. But from what you say, it seems you are affecting the play of the engine, and your goal is strength. So a statistic based on game results (win/draw/loss) is essential.

With 100 parameters: first, most of them are orthogonal, so independent tuning of them is fine. For the non-orthogonal ones there are tuning methods like CLOP (which is hard to use and seems not to find the optimum) and others, but I am completely ignorant about them. I hope some better-informed people will reply.

As for your test, I consider it statistically irrelevant for what you want to measure (strength with a different search). Play 1000 games at 12''+0.12'', show the results.
D Sceviour
Posts: 570
Joined: Mon Jul 20, 2015 5:06 pm

Re: Statistical Interpretation

Post by D Sceviour »

Laskos wrote: Play 1000 games at 12''+0.12'', show the results.
There are doubts about the quality of the results. System interrupts have a tendency to disrupt one or the other engine at very short time controls, so two identical engines will not be given the same amount of resources. This results in uneven and biased scores. Longer time controls seem to avoid this. It is possible to adjust the scheduling priority of individual processes, but earlier experiments indicated this made no difference.

A 1000-game match at 12"+0.12" is being tested now, but it will be several hours before results are available. However, the control engine Popochin is forfeiting on time, which will obviously spoil the results.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Statistical Interpretation

Post by Laskos »

D Sceviour wrote:
Laskos wrote: Play 1000 games at 12''+0.12'', show the results.
There are doubts about the quality of the results. System interrupts have a tendency to disrupt one or the other engine at very short time controls, so two identical engines will not be given the same amount of resources. This results in uneven and biased scores. Longer time controls seem to avoid this. It is possible to adjust the scheduling priority of individual processes, but earlier experiments indicated this made no difference.

A 1000-game match at 12"+0.12" is being tested now, but it will be several hours before results are available. However, the control engine Popochin is forfeiting on time, which will obviously spoil the results.
What communication protocol or interface do the engines use? You can use Cutechess-Cli for testing if they are XBoard/WinBoard or UCI; I don't have any problems with this time control in it. You can use a move overhead or time margin, if available, to avoid time losses. Use a fixed time per move, say 0.2 seconds, with an overhead or time margin of 20 ms, if time management is on the extremes with these engines.
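
A possible Cutechess-Cli invocation along these lines; the engine paths are placeholders, and the options should be checked against your build's documentation:

Code:

cutechess-cli \
  -engine cmd=./blackburne proto=xboard \
  -engine cmd=./popochin proto=uci \
  -each st=0.2 timemargin=20 \
  -rounds 500 -games 2 -repeat \
  -pgnout results.pgn

Here st is the fixed time per move in seconds, and timemargin is the number of milliseconds an engine may overstep the clock before being forfeited.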
D Sceviour
Posts: 570
Joined: Mon Jul 20, 2015 5:06 pm

Re: Statistical Interpretation

Post by D Sceviour »

Laskos wrote:What communication protocol or interface do the engines use?
Blackburne is XBoard, Popochin is UCI.
You can use Cutechess-Cli for testing if they are XBoard/WinBoard or UCI.
I was not aware that Cutechess-Cli could be used with XBoard protocol. Maybe with a WB to UCI adaptor?
You can use a move overhead or time margin, if available, to avoid time losses. Use a fixed time per move, say 0.2 seconds, with an overhead or time margin of 20 ms, if time management is on the extremes with these engines.
Blackburne does not support fixed time per move (the XBoard "st" command), but does support fixed depth per move (the "sd" command). Popochin's time controls are unknown, but the engine can be replaced with something else, or ignored.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Statistical Interpretation

Post by Laskos »

D Sceviour wrote:
Laskos wrote:What communication protocol or interface do the engines use?
Blackburne is XBoard, Popochin is UCI.
You can use Cutechess-Cli for testing if they are XBoard/WinBoard or UCI.
I was not aware that Cutechess-Cli could be used with XBoard protocol. Maybe with a WB to UCI adaptor?
You can use a move overhead or time margin, if available, to avoid time losses. Use a fixed time per move, say 0.2 seconds, with an overhead or time margin of 20 ms, if time management is on the extremes with these engines.
Blackburne does not support fixed time per move (the XBoard "st" command), but does support fixed depth per move (the "sd" command). Popochin's time controls are unknown, but the engine can be replaced with something else, or ignored.
Yes, Cutechess-Cli does support XBoard. Use 12''+0.12''; it is among the safest settings if there are no move overhead or timemargin options. I guess it will work with minimal or no time losses if the engine can use increments. You can also use something like 40/12'' in Cutechess.
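
In Cutechess-Cli's time-control syntax those settings would look roughly like this (my reading of the tc option, worth double-checking against the documentation):

Code:

-each tc=12+0.12    # 12 seconds per game plus a 0.12 s increment
-each tc=40/12      # 40 moves in 12 seconds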
D Sceviour
Posts: 570
Joined: Mon Jul 20, 2015 5:06 pm

Re: Statistical Interpretation

Post by D Sceviour »

Here are the combined results of 1000 games at 12"+0.12". Arena only saved the PGN files and debug output for one half of the tournament, and Popochin forfeited approximately 24 games on time. There are still doubts about the quality of the results. Would it be useful to complete a larger test at Blitz 2/0 as a comparison? It may take several days, and that cannot be done every time an engine adjustment is made.

Code:

   Engine           Score
1: Popochin         540/1000
2: Blackburne1.9.0  524/1000
3: Blackburne1.9.0a 436/1000
Name of the tournament: Arena 3.0 tournament
Level: Blitz 0:12/0.12
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Statistical Interpretation

Post by Laskos »

D Sceviour wrote:Here are the combined results of 1000 games at 12"+0.12". Arena only saved the PGN files and debug output for one half of the tournament, and Popochin forfeited approximately 24 games on time. There are still doubts about the quality of the results. Would it be useful to complete a larger test at Blitz 2/0 as a comparison? It may take several days, and that cannot be done every time an engine adjustment is made.

Here it IS statistically significant that 1.9.0 performs better than 1.9.0a: significantly so, by some 40 Elo points.
Don't run 2-minute games if that is not necessary for some particular reason related to scaling. Try to stabilize the testing framework, for example in Cutechess.
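
As a rough cross-check (my own sketch, not Laskos's calculation), here is a two-sample z-test on the two tournament totals above. It treats the two scores as independent binomial proportions, which is not quite true (the versions also played each other), and ignoring draws overstates the variance, so the test is conservative:

Code:

#include <math.h>
#include <stdio.h>

int main(void)
{
    double n  = 1000.0;
    double p1 = 524.0 / n;          /* Blackburne1.9.0  */
    double p2 = 436.0 / n;          /* Blackburne1.9.0a */
    double se = sqrt(p1 * (1.0 - p1) / n + p2 * (1.0 - p2) / n);
    double z  = (p1 - p2) / se;
    printf("z = %.2f (z > 1.96 is significant at the 95%% level)\n", z);
    return 0;
}

This gives a z of about 4, well past the 1.96 threshold, consistent with the gap of roughly 40 Elo quoted above.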