SPRT question

gladius · Post by **gladius** » Thu Nov 13, 2014 11:46 pm

bob wrote:You can clearly make a change that helps against yourself, but which hurts against either stronger or weaker opponents. I don't particularly buy that argument.

Yes, of course. The point is that you want to play better against stronger opponents, not only optimize scores against weaker opponents. How do you do this when there are no stronger engines available (i.e., not free)? You could time-handicap other engines, but that comes at an increasing overhead in testing resources, for a dubious gain.

Error bars don't increase/decrease based on self-play, as the number of games can be increased to choose whatever error bar you deem acceptable. I just went through a round of both self-testing and gauntlet testing while working on singular extensions and threat extensions, and I got way too many false positives with self-test that were promptly exposed with gauntlet testing...
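To illustrate how the error bar depends on the number of games, here is a minimal sketch. It assumes the usual logistic Elo model and a normal approximation to the match score; the function names and the default `z=1.96` (roughly a 95% interval) are illustrative choices, not anything taken from a particular testing framework.

```python
import math

def score_to_elo(score):
    """Logistic Elo difference implied by an expected score in (0, 1)."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_error_margin(wins, draws, losses, z=1.96):
    """Approximate half-width, in Elo, of the confidence interval of a match.

    Assumes independent games and that the score interval stays inside (0, 1).
    """
    n = wins + draws + losses
    mean = (wins + 0.5 * draws) / n
    # Per-game variance of the score (win = 1, draw = 0.5, loss = 0).
    var = (wins * (1.0 - mean) ** 2
           + draws * (0.5 - mean) ** 2
           + losses * mean ** 2) / n
    half = z * math.sqrt(var / n)
    low = score_to_elo(mean - half)
    high = score_to_elo(mean + half)
    return (high - low) / 2.0
```

With the same win/draw/loss proportions, playing four times as many games roughly halves the margin, which is the sense in which the error bar can be chosen by choosing the number of games.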

Feel free to continue using your testing methodology, I was just pointing out the reasoning that the SF testing framework went with self-play vs gauntlets.

lucasart · Post by **lucasart** » Fri Nov 14, 2014 12:24 am

There are very good reasons to choose self-testing in general:
* You need 4 times fewer games to achieve the same precision (2x because of error bar compounding and 2x because the gauntlet needs to be played by both versions).
* Self-testing magnifies Elo differences in our experience. Even if it's by a small factor like 1.2, that's still 1.2^2 = 1.44, so it's equivalent to gaining 44% on resources for free!
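The arithmetic behind the two points above can be sketched as follows. This is only a back-of-the-envelope model: it assumes the two gauntlet runs are independent (so the variance of their difference is the sum of the variances) and that the number of games needed scales with the inverse square of the Elo difference being measured. The function names are hypothetical.

```python
def gauntlet_cost_factor(versions_played=2):
    """Games needed by a gauntlet comparison relative to one self-play match.

    Var(A - B) = Var(A) + Var(B) for independent runs, so each run needs
    twice the games to reach the same error bar on the difference; that
    cost is then paid once per version that actually has to be played.
    """
    variance_factor = 2.0
    return variance_factor * versions_played

def magnification_gain(factor):
    """Resource gain from an Elo difference magnified by `factor`.

    Games needed scale as 1 / (Elo difference)^2, so a magnified
    difference needs factor^2 times fewer games.
    """
    return factor ** 2
```

Here `gauntlet_cost_factor()` reproduces the 4x figure, `gauntlet_cost_factor(versions_played=1)` gives 2x for the case where one version's gauntlet results are already on file, and `magnification_gain(1.2)` reproduces the 1.44 figure.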

With SPRT and multiple opponents, it's possible, but again completely wasteful and uselessly complicated. The hypothesis is that you have i.i.d. variables Xi representing a game result in {-1,0,1}. If you have a gauntlet with 3 opponents, then you have to play a single gauntlet and you observe a variable Xi in {-3,-2,-1,0,1,2,3} instead, which will be the gauntlet score difference. It is easy enough to write the likelihood, and all the rest is the same.
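As a rough illustration of the gauntlet variable, the sketch below builds the probability distribution of Xi, reading Xi as the sum of one game result in {-1,0,1} against each opponent. The per-opponent win/draw/loss probabilities are made-up inputs, games are assumed independent, and `gauntlet_pmf` is a hypothetical helper name. The likelihood of an observed sequence of gauntlets is then the product of these probabilities.

```python
from itertools import product

def gauntlet_pmf(opponents):
    """Distribution of the summed result of one game against each opponent.

    `opponents` is a list of (p_win, p_draw, p_loss) tuples, one per
    opponent. Returns a dict mapping each total in {-k, ..., +k} to its
    probability, where k is the number of opponents.
    """
    per_opponent = [((+1, w), (0, d), (-1, l)) for (w, d, l) in opponents]
    pmf = {}
    for outcome in product(*per_opponent):
        total = sum(result for result, _ in outcome)
        prob = 1.0
        for _, p in outcome:
            prob *= p
        pmf[total] = pmf.get(total, 0.0) + prob
    return pmf
```

For three opponents the keys of the returned dict are exactly -3 through 3, matching the set given above.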

Vinvin · Post by **Vinvin** » Fri Nov 14, 2014 12:43 am

Uri Blass wrote:...
I do not know about which weakness you talk.

I am talking about games like this one: http://www.talkchess.com/forum/viewtopi ... 359#580359

Uri Blass wrote:All the rating lists (that test against other engines) show that Stockfish becomes stronger with every new version, so the strategy of testing only against the previous version works.

It becomes stronger in chosen areas. But not in all areas of the chess game.

Vinvin · Post by **Vinvin** » Fri Nov 14, 2014 12:55 am

bob wrote:I decided to play around with SPRT as an early termination idea, since the StockFish guys were using it and it seemed quite reasonable. But the way they use it is in self-testing. My question concerns using it as an early termination methodology for gauntlets as well.

The code was easy enough to write, and the primary inputs to my function are simply wins, draws and losses, extracted directly from the PGN as the match progresses. The other terms (confidence interval for H0/H1 and the lower/upper Elo bounds for failure/success) are straightforward.

My question is this: Is there anything wrong with the idea of using the above in a gauntlet? IE I simply collect wins/draws/losses (total for all opponents) from Crafty's perspective and then feed that into the SPRT calculation, just as I do when I try crafty vs crafty'?

It seems logical that it should work just as well, but I wonder, since I learned a lot about Elo testing from Remi when I started cluster testing. Any comments, experience, suggestions, warnings, etc.?

As I understand it, SPRT combines an estimate of the strength (Elo) with an estimate of the uncertainty. If you want to be more sure that your test version is stronger than the previous version, you have to reduce the uncertainty by playing more games. As far as I can see, the target confidence is often 95%, which means the conclusion is wrong in about 1 test out of 20.
There's no reason this system would be bad for your tests!
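A minimal sketch of the kind of function described in the quoted question (wins, draws and losses in, a stop/continue decision out) might look like this. It uses a normal-approximation form of the log-likelihood ratio with the logistic Elo model, which is an assumption on my part and not necessarily the formula used by bob or by the Stockfish framework. The function names and the default bounds `elo0=0`, `elo1=5`, `alpha=beta=0.05` are illustrative.

```python
import math

def expected_score(elo):
    """Expected score for a given Elo advantage under the logistic model."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

def sprt_llr(wins, draws, losses, elo0, elo1):
    """Approximate log-likelihood ratio of H1 (elo1) against H0 (elo0)."""
    n = wins + draws + losses
    if n == 0:
        return 0.0
    mean = (wins + 0.5 * draws) / n
    var = (wins * (1.0 - mean) ** 2
           + draws * (0.5 - mean) ** 2
           + losses * mean ** 2) / n
    if var == 0.0:
        return 0.0  # all results identical: no variance estimate yet
    s0 = expected_score(elo0)
    s1 = expected_score(elo1)
    return n * (s1 - s0) * (2.0 * mean - s0 - s1) / (2.0 * var)

def sprt_decision(wins, draws, losses, elo0=0.0, elo1=5.0,
                  alpha=0.05, beta=0.05):
    """Return 'accept H1', 'accept H0' or 'continue' for the current totals."""
    llr = sprt_llr(wins, draws, losses, elo0, elo1)
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    if llr >= upper:
        return "accept H1"
    if llr <= lower:
        return "accept H0"
    return "continue"
```

Calling `sprt_decision` with the running totals after each finished game returns `"continue"` until one of the two bounds is crossed. With `alpha = beta = 0.05`, each type of wrong conclusion is targeted at roughly 1 test in 20.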

bob · Post by **bob** » Fri Nov 14, 2014 3:28 am

gladius wrote:
bob wrote:You can clearly make a change that helps against yourself, but which hurts against either stronger or weaker opponents. I don't particularly buy that argument.
Yes, of course. The point is that you want to play better against stronger opponents, not only optimize scores against weaker opponents. How do you do this when there are no stronger engines available (ie. not free)? You could time handicap other engines, but that comes at an increasing overhead in testing resources, for a dubious gain.

Error bars don't increase/decrease based on self-play, as the number of games can be increased to choose whatever error bar you deem acceptable. I just went through a round of both self-testing and gauntlet testing while working on singular extensions and threat extensions, and I got way too many false positives with self-test that were promptly exposed with gauntlet testing...
Feel free to continue using your testing methodology, I was just pointing out the reasoning that the SF testing framework went with self-play vs gauntlets.

That would imply that once a human is the strongest player around, he can't get any stronger by playing in additional GM tournaments? Do you REALLY believe that? You can play against weaker opponents and widen the gap, which means you are getting better.

bob · Post by **bob** » Fri Nov 14, 2014 3:33 am

lucasart wrote:There are very good reasons to choose self-testing in general:
* you need 4 times fewer games to achieve the same precision (2x because of error bar compounding and 2x because the gauntlet needs to be played by both versions)
* self-testing magnifies elo differences in our experience. even if it's by a small factor like 1.2, that's still 1.2^2=1.44 so it's equivalent to gaining 44% on resources for free!

With SPRT and multiple opponents, it's possible, but again completely wasteful and uselessly complicated. Hypothesis is that you have iid variables Xi representing a game result {-1,0,1}. Well if you have a gauntlet with 3 opponents, then you have to play a single gauntlet and you observe a variable Xi in {-3,-2,-1,0,1,2,3} instead, which will be the gauntlet score difference. Easy enough to write the likelihood, and all the rest is the same.

Only 2x. Once you play version N, you just keep the results. Then run version N+1 and compare. Then keep those results. Then run version N+2, etc. You don't have to replay those games over and over when the old version didn't change.

Yes, self-testing (a) magnifies the difference between two close versions; (b) requires fewer games; but it also gives what I consider to be an unacceptable number of false positives in that "exaggeration".

One idea I did think about was to find exactly how much better or worse Crafty is than the gauntlet, then use that Elo as the center of the test range. I.e., rather than (-1.5, +4.5) or (0, 6), I don't see anything invalid with using a window that is offset more than that...
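As a sketch of that idea: convert the baseline score against the gauntlet into an Elo offset, then shift the usual window by that offset. The logistic Elo conversion is the standard one; the helper names and the default window are illustrative assumptions, and whether an offset window behaves well in practice is exactly the open question.

```python
import math

def score_to_elo(score):
    """Logistic Elo difference implied by a score fraction in (0, 1)."""
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return -400.0 * math.log10(1.0 / score - 1.0)

def offset_window(baseline_score, elo0=-1.5, elo1=4.5):
    """Shift the (elo0, elo1) test window by the baseline's gauntlet Elo.

    `baseline_score` is the current version's score fraction against the
    whole gauntlet, e.g. (wins + 0.5 * draws) / games.
    """
    base = score_to_elo(baseline_score)
    return base + elo0, base + elo1
```

A baseline score of exactly 50% leaves the window unchanged; a baseline that scores below 50% against the gauntlet moves both bounds into negative Elo.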

I might run some tests there just to see how it looks...

I could even take all the PGN and fudge the names so that one opponent is crafty, the other is named "gauntlet" and see what BayesElo concludes, just for curiosity.

bob · Post by **bob** » Fri Nov 14, 2014 3:35 am

Here's an interesting question. Person A always walks everywhere he goes, regardless of distance. He always gets to where he wants to go, on time. This works. By this logic, I suppose there is no reason for him to ever try something that might work _better_, e.g. an automobile? Or, for longer trips, perhaps an airplane?

Just because something works doesn't always mean it is the best that can be done.

gladius · Post by **gladius** » Fri Nov 14, 2014 3:52 am

bob wrote:
gladius wrote:
bob wrote:You can clearly make a change that helps against yourself, but which hurts against either stronger or weaker opponents. I don't particularly buy that argument.
Yes, of course. The point is that you want to play better against stronger opponents, not only optimize scores against weaker opponents. How do you do this when there are no stronger engines available (ie. not free)? You could time handicap other engines, but that comes at an increasing overhead in testing resources, for a dubious gain.

Error bars don't increase/decrease based on self-play, as the number of games can be increased to choose whatever error bar you deem acceptable. I just went through a round of both self-testing and gauntlet testing while working on singular extensions and threat extensions, and I got way too many false positives with self-test that were promptly exposed with gauntlet testing...
Feel free to continue using your testing methodology, I was just pointing out the reasoning that the SF testing framework went with self-play vs gauntlets.
That would imply that once a human is the strongest player around, he can't get any stronger by playing in additional GM tournaments? Do you REALLY believe that? You can play against weaker opponents and widen the gap, which means you are getting better.

Comparing improving a chess engine to improving a human GM is a bit strange. It is a totally different world.

Yes, you can play weaker engines and improve your strength vs them, but why do so if there are stronger engines available? In this case, SF can play vs itself without any issues. Yes, it's not optimal, but IMO, it's far better than playing against weaker engines.

Uri Blass · Post by **Uri Blass** » Fri Nov 14, 2014 5:50 am

Vinvin wrote:
Uri Blass wrote:...
I do not know about which weakness you talk.
I talk about games like this one : http://www.talkchess.com/forum/viewtopi ... 359#580359
Uri Blass wrote:All the rating lists(that test against other engines) show that stockfish become stronger with every new version so the strategy of testing only against previous version works.
It becomes stronger in chosen areas. But not in all areas of the chess game.

I think that the weakness you are talking about is not related to not playing against other engines; other engines have exactly the same weakness.

I copied the game and tested other engines, and they saw a decisive advantage for black in the game that you posted, before black lost.

I think that all engines suggest Nxg3 with a winning score at blitz here:

[D]r2b1rk1/1b1n1pp1/1q2p3/p2pPnP1/2pP1PN1/P1P1B1R1/4B1P1/R3Q1K1 b - - 0 22

Stockfish can play better at longer time controls and avoid Nxg3.

Gull3 and Houdini3 are faster at avoiding Nxg3, but they still need nearly 20 seconds on my hardware, and stockfish5 clearly needs more time to find Re8, so claiming that there is no improvement is simply wrong.

[D]r2b1rk1/1b1n1pp1/1q2p3/p2pPnP1/2pP1PN1/P1P1B1R1/4B1P1/R3Q1K1 b - - 0 22

Stockfish_14111222_x64_modern:
1/1 00:00 98 98k +3.95 22. ... Nxg3 23.Qxg3
2/2 00:00 332 332k +3.95 22. ... Nxg3 23.Qxg3
3/3 00:00 1k 1,050k +4.42 22. ... Nxg3 23.Qxg3 Qb2
4/4 00:00 2k 1,702k +4.19 22. ... Nxg3 23.Qxg3 Qb2 24.Qe1
5/5 00:00 2k 2,277k +4.42 22. ... Nxg3 23.Qxg3 Qb2 24.Qe1 Bb6
6/7 00:00 4k 4,199k +4.05 22. ... Nxg3 23.Qxg3 Qb2 24.Re1 Qc2 25.Bf2
7/7 00:00 5k 5,132k +4.28 22. ... Nxg3 23.Qxg3 Qb2 24.Re1 Qc2 25.Bf2 Bb6
8/8 00:00 6k 6,315k +4.17 22. ... Nxg3 23.Qxg3 Qb2 24.Re1 Qc2 25.Bf2 Bb6 26.Qh3
9/12 00:00 9k 748k +4.25 22. ... Nxg3 23.Qxg3 Qb2 24.Re1 Bb6 25.Nf6+ Nxf6 26.gxf6 g6
10/12 00:00 21k 1,164k +4.04 22. ... Nxg3 23.Qxg3 Qb2 24.Re1 Bb6 25.Nf6+ Nxf6 26.gxf6 g6 27.Bc1 Qc2
11/14 00:00 56k 1,475k +4.69 22. ... Nxg3 23.Qxg3 Qb2 24.Qe1 Bc6 25.Rb1 Qxa3 26.Ra1 Qb2 27.Rb1 Qc2 28.Bf3
12/17 00:00 78k 1,620k +4.68 22. ... Nxg3 23.Qxg3 Qb2 24.Qe1 Bc6 25.Rb1 Qxa3 26.Nf6+ Bxf6 27.gxf6 Rfb8 28.Rc1 g6
13/18 00:00 135k 1,826k +4.50 22. ... Nxg3 23.Qxg3 Qb2 24.Qe1 Rb8 25.Rb1 Qa2 26.Ra1 Qc2 27.Bd2 Be7 28.Ne3 Qb2 29.a4
14/18 00:00 250k 2,383k +4.44 22. ... Nxg3 23.Qxg3 Qb2 24.Qe1 Rb8 25.Rb1 Qa2 26.Ra1 Qc2 27.Bd2 Be7 28.Ne3 Qe4 29.Bf3 Qd3 30.Be2 Qh7
15/19 00:00 457k 2,670k +4.44 22. ... Nxg3 23.Qxg3 Qb2 24.Qe1 Rb8 25.Rb1 Qa2 26.Ra1 Qc2 27.Bd2 Be7 28.Ne3 Qe4 29.Bf3 Qd3 30.Be2
16/22 00:00 517k 2,706k +4.50 22. ... Nxg3 23.Qxg3 Qb2 24.Qe1 Rb8 25.Rb1 Qa2 26.Ra1 Qc2 27.Bd2 Be7 28.Ne3 Qe4 29.Bf3 Qd3 30.g3 Bc6
17/22 00:00 811k 2,844k +4.46 22. ... Nxg3 23.Qxg3 Qb2 24.Qe1 Rb8 25.Rc1 Re8 26.a4 Nf8 27.Bd1 Ng6 28.Bc2 Bc7 29.Bxg6 fxg6 30.Rb1 Qc2 31.Ra1
18/25 00:00 1,874k 3,160k +4.27 22. ... Nxg3 23.Qxg3 Qb2 24.Re1 Qc2 25.Bf3 Qh7 26.Rc1 Bc6 27.Bd1 Be7 28.Bc2 g6 29.a4 Rfb8 30.Nf6+ Nxf6 31.gxf6 Ba3 32.Re1 Rb3 33.Bxb3
19/26 00:00 2,286k 3,211k +4.19 22. ... Nxg3 23.Qxg3 Qb2 24.Re1 Qc2 25.Bf3 Qh7 26.Rc1 Bc6 27.Bd1 Be7 28.Bc2 g6 29.a4 Rfb8 30.Re1 Ba3 31.Nh6+ Kf8 32.Re2 Bb2 33.Bf2
20/29 00:00 3,327k 3,364k +4.08 22. ... Nxg3 23.Qxg3 Qb2 24.Re1 Qc2 25.Bf3 Qh7 26.Rc1 Bc6 27.Bd1 Be7 28.Bc2 g6 29.Nh6+ Kh8 30.Ra1 Rab8 31.a4 Rb3 32.Bxb3 cxb3 33.Rb1 Bxa4 34.Ng4 Ba3 35.Nf6 Nxf6 36.gxf6
21/35 00:01 6,662k 3,436k +3.86 22. ... Nxg3 23.Qxg3 Qb2 24.Rd1 Qc2 25.Rd2 Qh7 26.Bd1 Bc6 27.Rb2 g6 28.Bc2 Be7 29.Nf6+ Bxf6 30.gxf6 Rab8 31.Rxb8 Rxb8 32.f5 exf5 33.e6 fxe6 34.Qd6 Bb5 35.Qxe6+ Kh8
22/37- 00:03 10,785k 3,538k +3.50 22. ... Nxg3 23.Qxg3
22/37- 00:04 14,190k 3,486k +3.29 22. ... Nxg3 23.Qxg3
22/37+ 00:04 15,000k 3,465k +3.45 22. ... Nxg3
22/37 00:04 15,836k 3,434k +3.40 22. ... Nxg3 23.Qxg3 Qb2 24.Rd1 Qc2 25.Rd2 Qb1+ 26.Bd1 Ra6 27.Kh2 Qh7+ 28.Kg1 Rb6 29.Bc2 Rb1+ 30.Kf2 Qh1 31.Bxb1 Qxb1 32.Re2 Qh7 33.Qh3 Qxh3 34.gxh3 Be7 35.a4
23/37- 00:04 17,131k 3,427k +3.34 22. ... Nxg3 23.Qxg3
23/37- 00:05 18,243k 3,450k +3.28 22. ... Nxg3 23.Qxg3
23/37- 00:05 20,142k 3,474k +3.18 22. ... Nxg3 23.Qxg3
23/37- 00:06 22,841k 3,494k +3.05 22. ... Nxg3 23.Qxg3
23/37 00:06 24,426k 3,504k +3.08 22. ... Nxg3 23.Qxg3 Qb2 24.Rd1 Qc2 25.Rd2 Qh7 26.Bd1 Be7 27.Bc2 g6 28.Qe1 Bc6 29.g3 Rfb8 30.Rh2 Qg7 31.Nf6+ Kf8 32.Qd1 Rb2 33.a4 Rab8 34.Bc1 Ra2 35.Nh7+ Ke8 36.Nf6+ Kd8 37.Be3
24/37+ 00:07 26,463k 3,514k +3.14 22. ... Nxg3
24/37- 00:07 27,236k 3,513k +3.08 22. ... Nxg3 23.Qxg3
24/37- 00:08 31,256k 3,527k +2.98 22. ... Nxg3 23.Qxg3
24/37 00:09 34,757k 3,544k +2.97 22. ... Nxg3 23.Qxg3 Qb2 24.Rd1 Qc2 25.Rd2 Qh7 26.Bd1 Be7 27.Bc2 g6 28.Qe1 Bc6 29.g3 Rfb8 30.Rh2 Qg7 31.Nf6+ Kf8 32.a4 Rb2 33.Qc1 Rab8 34.Qa1 R8b3 35.Bc1 Rxc2 36.Rxc2 Nb6 37.Ra2
25/37+ 00:10 35,893k 3,548k +3.03 22. ... Nxg3
25/37 00:10 37,218k 3,554k +3.08 22. ... Nxg3 23.Qxg3 Qb2 24.Rd1 Qc2 25.Rd2 Qb1+ 26.Bd1 Ra6 27.Kh2 Qh7+ 28.Kg1 Be7 29.Bc2 g6 30.Qe1 Rb8 31.g3 Kf8 32.Rh2 Qg8 33.Nh6 Qg7 34.Ng4 Ke8 35.Nf6+ Kd8 36.Ba4
26/37- 00:11 40,099k 3,561k +3.01 22. ... Nxg3 23.Qxg3
26/37- 00:11 42,214k 3,568k +2.95 22. ... Nxg3 23.Qxg3
26/37- 00:12 45,376k 3,575k +2.86 22. ... Nxg3 23.Qxg3
26/37- 00:14 51,001k 3,567k +2.72 22. ... Nxg3 23.Qxg3
26/37- 00:16 59,669k 3,551k +2.51 22. ... Nxg3 23.Qxg3
26/48- 00:23 82,681k 3,511k +2.20 22. ... Nxg3 23.Qxg3
26/48 00:40 139,797k 3,444k +1.81 22. ... Re8 23.Rh3 Nf8 24.Rb1 Qa7 25.Nf6+ gxf6 26.g4 fxg5 27.gxf5 exf5 28.Qg3 Ng6 29.Rh1 gxf4 30.Bxf4 Re6 31.Rh5 Rb8 32.Rf1 Bc6 33.Rxf5 Rb2 34.Qg4 Rxe2 35.Qxe2 Nxf4 36.R1xf4 Rg6+ 37.Rg4
27/48+ 00:41 143,774k 3,441k +1.87 22. ... Re8
27/48+ 00:41 144,156k 3,441k +1.94 22. ... Re8
27/48+ 00:42 144,687k 3,441k +2.03 22. ... Re8
27/48- 00:44 154,523k 3,442k +1.95 22. ... Re8 23.Rh3
27/48+ 00:45 155,093k 3,442k +2.06 22. ... Re8
27/48- 00:47 162,978k 3,441k +1.90 22. ... Re8 23.Rh3
27/48+ 00:47 164,066k 3,441k +2.14 22. ... Re8
27/48 00:49 170,488k 3,439k +1.90 22. ... Re8 23.Rh3 Nf8 24.Rb1 Qa7 25.Nf6+ gxf6 26.g4 fxg5 27.gxf5 exf5 28.Qg3 Ng6 29.Rh1 gxf4 30.Bxf4 Re6 31.Rf1 Bb6 32.Rf2 Bxd4 33.cxd4 Qxd4 34.Qg5 Rb6 35.Be3 Qa1+ 36.Rf1 Rb1 37.e6 Rxf1+ 38.Bxf1 d4 39.Qxf5
28/48+ 00:50 173,271k 3,438k +1.96 22. ... Re8
28/48+ 00:50 174,086k 3,437k +2.02 22. ... Re8
28/48- 00:52 182,084k 3,440k +1.96 22. ... Re8 23.Rh3
28/48- 00:56 195,408k 3,446k +1.82 22. ... Re8 23.Rh3
28/48- 01:10 240,795k 3,432k +1.61 22. ... Re8 23.Rh3
28/48+ 01:19 273,532k 3,423k +1.77 22. ... Re8
28/48- 01:36 330,618k 3,432k +1.53 22. ... Re8 23.Rh3
28/48 01:48 372,842k 3,435k +1.45 22. ... Re8 23.Rh3 Bc8 24.Rb1 Qc6 25.Nf6+ gxf6 26.g4 fxg5 27.fxg5 Ra6 28.gxf5 exf5 29.Qh4 Kf8 30.Rf1 Qg6 31.Bd1 Qg8 32.Rhf3 Nb8 33.Bc2 Rg6 34.Bxf5 Bxf5 35.Rxf5 Re7 36.Qg3 Nc6 37.Qg4 Rb7 38.Bd2 Qg7 39.Qg2 Rd7
29/48- 02:04 429,618k 3,440k +1.39 22. ... Re8 23.Rh3
29/48+ 02:19 479,032k 3,432k +1.45 22. ... Re8
29/48+ 02:21 486,573k 3,432k +1.54 22. ... Re8
29/48- 02:25 497,934k 3,430k +1.46 22. ... Re8 23.Rh3
29/48- 03:20 680,647k 3,393k +1.25 22. ... Re8 23.Rh3
29/48 04:04 823,701k 3,373k +1.27 22. ... Re8 23.Rh3 Rb8 24.Rb1 Qc6 25.Bd1 Kf8 26.Bc2 Ke7 27.Bc1 Bc7 28.Ne3 Nxe3 29.Qxe3 Rg8 30.a4 Kd8 31.Ba3 Nf8 32.f5 exf5 33.Bxf5 Bc8 34.Rxb8 Bxb8 35.Bxc8 Kxc8 36.Qf3 Qd7 37.Bc5
30/48- 04:27 903,015k 3,379k +1.21 22. ... Re8 23.Rh3
30/48- 04:42 957,321k 3,384k +1.15 22. ... Re8 23.Rh3
30/48 05:01 1,020,308k 3,385k +1.14 22. ... Re8 23.Rh3 Kf8 24.Rb1 Qc6 25.Bc1 Ke7 26.Nf2 Bb6 27.a4 Ra6 28.g4 Nxd4 29.Ba3+ Kd8 30.cxd4 Bxd4 31.Rb5 Ra7 32.Qd2 Bb6 33.Bd1 Ba6 34.Rb1 Rc7 35.Bb2 Bb7 36.Kf1 Bc5 37.Qxa5 d4 38.Bf3 Qa6 39.Qxa6
31/48- 05:22 1,091,175k 3,388k +1.08 22. ... Re8 23.Rh3
31/48+ 05:24 1,098,921k 3,387k +1.14 22. ... Re8
31/48- 05:27 1,108,422k 3,385k +1.08 22. ... Re8 23.Rh3
31/48+ 05:31 1,122,148k 3,382k +1.15 22. ... Re8
31/48+ 05:53 1,188,181k 3,362k +1.36 22. ... Re8
31/48- 05:59 1,207,213k 3,359k +1.20 22. ... Re8 23.Rh3
31/48 06:19 1,271,263k 3,349k +1.13 22. ... Re8 23.Rh3 Kf8 24.Rb1 Qc6 25.Bc1 Ke7 26.Bd1 Bc7 27.a4 Ra6 28.Ne3 Nxe3 29.Qxe3 Rg8 30.Qg3 Nf8 31.g6 Nxg6 32.Ba3+ Kd7 33.f5 exf5 34.Bc2 Ke8 35.Bxf5 Rb6 36.Rxb6 Qxb6 37.Bd6 Ne7 38.Bxe7 Kxe7
32/48+ 06:31 1,310,067k 3,345k +1.19 22. ... Re8
32/48+ 06:38 1,330,764k 3,343k +1.25 22. ... Re8
32/48- 06:51 1,376,799k 3,343k +1.19 22. ... Re8 23.Rh3
32/49+ 07:16 1,456,206k 3,338k +1.27 22. ... Re8
32/49- 07:26 1,489,333k 3,339k +1.16 22. ... Re8 23.Rh3
32/53+ 07:58 1,594,672k 3,332k +1.32 22. ... Re8
32/53 08:04 1,613,619k 3,332k +1.29 22. ... Re8 23.Rh3 Kf8 24.Rb1 Qc6 25.Bc1 Ke7 26.Bd1 Rg8 27.Bc2 Bc8 28.a4 Ke8 29.Bxf5 exf5 30.Ne3 Rb8 31.Rb5 Rxb5 32.axb5 Qe6 33.Ba3 Nf8 34.g3 Qb6 35.Qb1 Bd7 36.Kf2 Ng6 37.Rh2 Qxb5 38.Qxb5 Bxb5
33/53- 08:30 1,702,913k 3,333k +1.23 22. ... Re8 23.Rh3
33/53+ 08:50 1,765,378k 3,330k +1.29 22. ... Re8
33/53- 08:55 1,781,523k 3,330k +1.23 22. ... Re8 23.Rh3
33/53- 09:36 1,922,939k 3,334k +1.09 22. ... Re8 23.Rh3
33/53+ 11:09 2,235,534k 3,340k +1.20 22. ... Re8
33/53 12:28 2,484,029k 3,319k +1.27 22. ... Re8 23.Rh3 Kf8 24.Rb1 Qc6 25.Bc1 Ke7 26.Bd1 Rg8 27.Bc2 Bc8 28.a4 Ke8 29.Rb5 Ne7 30.Ba3 Ba6 31.Rb2 Nf8 32.Qf2 Ra7 33.Ne3 Bc8 34.g6 Nfxg6 35.g4 Rb7 36.f5 exf5 37.Rxb7 Bxb7 38.gxf5 Nf8 39.e6 fxe6 40.Bxe7 Bxe7

Analysis by stockfish5 (version from 31.05.2014) for comparison shows that there has been a big improvement since stockfish5: stockfish5 still plays the blunder 22...Nxg3 even at tournament time control. Maybe it can avoid it at depth 33 thanks to the fail low, but it seems to be more than 10 times slower than the latest Stockfish at avoiding Nxg3. (I may test again to see whether this is parallel-search luck, because I did not test Stockfish with 1 core.)

[D]r2b1rk1/1b1n1pp1/1q2p3/p2pPnP1/2pP1PN1/P1P1B1R1/4B1P1/R3Q1K1 b - - 0 22

Stockfish_14053109_x64_modern:
1/1 00:00 96 96k +3.93 22. ... Nxg3 23.Qxg3
2/2 00:00 195 195k +3.93 22. ... Nxg3 23.Qxg3
3/3 00:00 846 846k +4.33 22. ... Nxg3 23.Qxg3 Qb2
4/5 00:00 3k 139k +4.07 22. ... Nxg3 23.Qxg3 Qb2 24.Qe1
5/5 00:00 3k 113k +4.29 22. ... Nxg3 23.Qxg3 Qb2 24.Qe1 Be7
6/7 00:00 5k 174k +3.97 22. ... Nxg3 23.Qxg3 Qb2 24.Re1 Qc2 25.Bf2
7/7 00:00 6k 153k +4.17 22. ... Nxg3 23.Qxg3 Qb2 24.Re1 Qc2 25.Bf2 Bb6
8/8 00:00 8k 186k +4.14 22. ... Nxg3 23.Qxg3 Qb2 24.Re1 Qc2 25.Bf2 Bb6 26.Qh3
9/11 00:00 13k 260k +3.93 22. ... Nxg3 23.Qxg3 Qb2 24.Re1 Qc2 25.Bd1 Qd3 26.Be2 Qh7 27.Bf3
10/15 00:00 25k 417k +4.13 22. ... Nxg3 23.Qxg3 Qb2 24.Re1 Qc2 25.Bd1 Qd3 26.Be2 Qh7 27.Qh2 Be7
11/17 00:00 57k 630k +4.14 22. ... Nxg3 23.Qxg3 Qb2 24.Re1 Qc2 25.Bf2 Bc6 26.Qf3 Be7 27.Bxc4 Bxa3 28.Be2 Rfb8
12/17 00:00 76k 755k +4.21 22. ... Nxg3 23.Qxg3 Qb2 24.Re1 Qc2 25.Bf2 Bc6 26.Qf3 Nb6 27.Qh3 Qh7 28.Qg3 Rb8
13/20 00:00 137k 970k +4.12 22. ... Nxg3 23.Qxg3 Qb2 24.Re1 Qxc3 25.Nh6+ Kh8 26.Nxf7+ Rxf7 27.g6 Rf5 28.Qh3+ Kg8 29.Rb1 Bb6 30.Bg4 Qd3 31.Bxf5 Qxf5
14/26 00:00 267k 1,263k +4.03 22. ... Nxg3 23.Qxg3 Qb2 24.Re1 Qc2 25.Bd1 Qd3 26.Kf2 Qxc3 27.Nh6+ Kh8 28.Ng4 Qb2+ 29.Be2 Kg8 30.Nf6+ Nxf6 31.gxf6
15/26 00:00 366k 1,459k +3.93 22. ... Nxg3 23.Qxg3 Qb2 24.Re1 Qc2 25.Bd1 Qd3 26.Kf2 Qxc3 27.Nh6+ Kh8 28.Ng4 Qb2+ 29.Be2 Kg8 30.Nf6+ Nxf6 31.gxf6 g6 32.Qh3
16/26 00:00 440k 1,564k +3.90 22. ... Nxg3 23.Qxg3 Qb2 24.Re1 Qc2 25.Bd1 Qxc3 26.Nh6+ Kh8 27.Nxf7+ Rxf7 28.g6 Rxf4 29.Bxf4 Qxg3 30.Bxg3 Bb6 31.Bf2 Rf8 32.Bg4 Rxf2 33.Kxf2 Bxd4+ 34.Ke2
17/27 00:00 548k 1,698k +3.96 22. ... Nxg3 23.Qxg3 Qb2 24.Re1 Qc2 25.Bd1 Qxc3 26.Nh6+ Kh8 27.Nxf7+ Rxf7 28.g6 Rxf4 29.Bxf4 Qxd4+ 30.Kh1 Nf8 31.Bc2 Qb2
18/27 00:00 2,025k 2,654k +4.07 22. ... Nxg3 23.Qxg3 Qb2 24.Re1 Qc2 25.Bf3 Rb8 26.Rc1 Qh7 27.Bd1 Bc6 28.Bc2 g6 29.Nh6+ Kh8 30.Qh3 Rb6 31.Ra1 Rb2 32.Bd1 Be7 33.Bc1 Rbb8 34.Bc2 Rb6
19/27 00:01 3,814k 3,069k +3.89 22. ... Nxg3 23.Qxg3 Qb2 24.Re1 Qc2 25.Bd1 Qh7 26.Qf2 Bc6 27.Bc2 Qh5 28.Bd1 Qh8 29.Bc2 Bb6 30.f5 exf5 31.Bxf5 Qh5 32.Rb1 Rab8 33.Rb2
20/31 00:01 6,303k 3,309k +4.07 22. ... Nxg3 23.Qxg3 Qb2 24.Re1 Qc2 25.Bd1 Qd3 26.Be2 Qh7 27.Rc1 Bc6 28.Bd1 Rb8 29.Bc2 g6 30.Nh6+ Kh8 31.Qh3 Rb6 32.Ra1 Rb2 33.Bd1 Be7 34.Bc1 Rbb8 35.Bc2 Rb6
21/31 00:02 8,309k 3,383k +4.03 22. ... Nxg3 23.Qxg3 Qb2 24.Re1 Qc2 25.Bf3 Qh7 26.Rc1 Be7 27.Bd1 Bc6
22/31 00:02 9,843k 3,447k +4.05 22. ... Nxg3 23.Qxg3 Qb2 24.Re1 Qc2 25.Bf3 Qh7 26.Rc1 Kh8 27.Bd1 Bc6 28.Bc2 g6 29.Nh6 Be7 30.Ra1 Rab8 31.Qf2 Rb6 32.a4 Bd8 33.Re1 Be7 34.Ng4
23/40- 00:04 14,707k 3,488k +3.98 22. ... Nxg3 23.Qxg3 Qb2 24.Rd1 Nxe5 25.dxe5 Qxe2
23/40- 00:04 15,673k 3,502k +3.92 22. ... Nxg3 23.Qxg3 Qb2 24.Rd1 Nxe5 25.dxe5
23/40- 00:04 16,392k 3,528k +3.83 22. ... Nxg3 23.Qxg3 Bc7 24.Nh6+ Kh8 25.Qh3
23/40- 00:04 17,552k 3,542k +3.69 22. ... Nxg3 23.Qxg3 Bc7 24.Nh6+ Kh8 25.Qh3 gxh6 26.Qxh6+ Kg8 27.g6 fxg6
23/40 00:05 18,766k 3,550k +3.75 22. ... Nxg3 23.Qxg3 Qb2 24.Rd1 Qc2 25.Rd2 Qh7 26.Bd1 Bc6 27.Bc2 g6 28.Nh6+ Kh8 29.Qh3 Rc8 30.Rd1 Be7 31.a4 Nb6 32.Ra1 Rb8 33.Bc1 Nc8 34.Ba3 Bxa3 35.Rxa3
24/40- 00:05 20,328k 3,568k +3.68 22. ... Nxg3 23.Qxg3 Qb2 24.Rd1 Qc2 25.Rd2 Qh7 26.Bd1 Bc6 27.Bc2 g6 28.Nh6+ Kh8 29.Qh3 Rc8 30.Rd1 Be7 31.a4 Nb6 32.Ra1 Rb8
24/40- 00:06 21,708k 3,595k +3.62 22. ... Nxg3 23.Qxg3 Qb2 24.Rd1 Qc2 25.Rd2 Qh7 26.Bd1 Bc6 27.Bc2 g6 28.Nh6+ Kh8 29.Qh3 Rc8 30.Rd1 Be7 31.a4 Nb6 32.Ra1 Rb8 33.Bc1 Nc8 34.Qg3 Nb6
24/40- 00:06 22,574k 3,619k +3.53 22. ... Nxg3 23.Qxg3 Qb2 24.Rd1 Qc2 25.Rd2 Qh7 26.Bd1 Bc6 27.Bc2 g6 28.Nh6+ Kh8 29.Qh3 Rc8 30.g3 Be7 31.a4
24/40- 00:07 26,489k 3,639k +3.39 22. ... Nxg3 23.Qxg3 Qb2 24.Rd1 Qc2 25.Rd2 Qh7 26.Bd1 Bc6 27.Bc2 g6 28.Nh6+ Kh8 29.Qh3 Rc8 30.g3 Be7 31.a4 Nb6 32.Rh2 Bxa4
24/40- 00:08 31,161k 3,645k +3.18 22. ... Nxg3 23.Qxg3 Qb2 24.Rd1 Qc2 25.Rd2 Qh7 26.Bd1 Bc6 27.Bc2 g6 28.Nh6+ Kh8 29.Qh3 Rc8 30.g3 Be7 31.a4 Nb6
24/41 00:10 37,631k 3,685k +3.16 22. ... Nxg3 23.Qxg3 Qb2 24.Rd1 Qc2 25.Rd2 Qb1+ 26.Bd1 Ra6 27.Kf2 Qh7 28.Bc2 g6 29.Nh6+ Kg7 30.Qf3 Rb6 31.Kg1 Rb3 32.Bxb3 cxb3 33.g4 Be7 34.Rh2 Bxa3 35.f5 a4 36.f6+ Kh8 37.Nxf7+ Rxf7 38.Rxh7+ Rxh7
25/41- 00:11 42,211k 3,689k +3.10 22. ... Nxg3 23.Qxg3 Qb2 24.Rd1 Qc2 25.Rd2 Qb1+ 26.Bd1 Ra6 27.Kf2 Qh7 28.Bc2 g6 29.Nh6+ Kg7 30.Qf3 Rb6 31.Kg1 Rb3 32.Bxb3 cxb3 33.g4
25/41- 00:12 46,014k 3,700k +3.03 22. ... Nxg3 23.Qxg3 Qb2 24.Rd1 Qc2 25.Rd2 Qb1+ 26.Bd1 Ra6 27.Kf2 Qh7 28.Bc2 Qh5 29.Bd1 Qh1 30.Bc2 Bb6 31.Nf6+ Nxf6 32.gxf6 g6 33.Bxg6 Kh8
25/41 00:12 47,673k 3,699k +3.19 22. ... Nxg3 23.Qxg3 Qb2 24.Rd1 Qc2 25.Rd2 Qb1+ 26.Bd1 Ra6 27.Kf2 Qh7 28.Bc2 Qh5 29.Bd1 Qh1 30.Bc2 Rb6 31.Nf6+ Nxf6 32.gxf6 g6 33.Bxg6 Kh8 34.Bc2 Rg8 35.Qh3+ Qxh3 36.gxh3 Bc8 37.Bd1 Rb7 38.Bh5 Kh7 39.Rd1 Rb2+ 40.Kf3
26/41 00:13 49,413k 3,699k +3.19 22. ... Nxg3 23.Qxg3 Qb2 24.Rd1 Qc2 25.Rd2 Qb1+ 26.Bd1 Ra6 27.Kf2 Qh7 28.Bc2 Qh5 29.Bd1 Qh1 30.Bc2 Rb6 31.Nf6+ Nxf6 32.gxf6 g6 33.Bxg6 Kh8 34.Bc2 Rg8 35.Qh3+ Qxh3 36.gxh3 Bc8 37.Bd1 Rb7 38.Bh5 Kh7 39.Rd1 Rb2+ 40.Kf3
27/41- 00:15 57,627k 3,692k +3.13 22. ... Nxg3 23.Qxg3 Qb2 24.Rd1 Qc2 25.Rd2 Qb1+ 26.Bd1 Ra6 27.Kf2 Qh7 28.Bc2 Qh5 29.Bd1 Qh1 30.Bc2 Rb6
27/41- 00:16 62,898k 3,704k +3.07 22. ... Nxg3 23.Qxg3 Qb3 24.Nh6+ Kh8 25.Qh3 gxh6 26.Qxh6+ Kg8 27.g6 fxg6
27/41- 00:18 69,789k 3,705k +2.98 22. ... Nxg3 23.Qxg3 Qb3 24.Nh6+ Kh8 25.Qh3 gxh6 26.Qxh6+ Kg8 27.g6 fxg6 28.Qxg6+ Kh8 29.Qh6+
27/41- 00:21 79,947k 3,688k +2.84 22. ... Nxg3 23.Qxg3 Qb2 24.Rd1 Qc2 25.Rd2 Qb1+ 26.Bd1 Ra6 27.Kf2 Qh7 28.Bc2 g6 29.Kg1 Rb6 30.Qf3 Kg7 31.Nf6 Nxf6 32.gxf6+ Kg8 33.g3 Qh5
27/41- 00:26 98,583k 3,668k +2.63 22. ... Nxg3 23.Qxg3 g6 24.Kf2 Be7 25.Rh1
27/41 00:28 104,085k 3,669k +2.81 22. ... Nxg3 23.Qxg3 Qb2 24.Rd1 Qc2 25.Rd2 Qb1+ 26.Bd1 Ra6 27.Kf2 Qh7 28.Bc2 g6 29.Kg1 Rb6 30.Qf3 Kg7 31.Nf6 Nxf6 32.gxf6+ Kg8 33.g3 Qh5 34.Qxh5 gxh5 35.Rh2 Bc7 36.Rxh5 Ra8 37.Rg5+ Kf8 38.Rh5 Ke8
28/41- 00:31 115,608k 3,657k +2.75 22. ... Nxg3 23.Qxg3 Qb2 24.Rd1 Qc2 25.Rd2 Qb1+ 26.Bd1 Ra6 27.Kf2 Qh7 28.Bc2 g6 29.Kg1 Rb6 30.Qf3 Kg7 31.Nf6 Nxf6 32.gxf6+ Kg8 33.g3 Qh5 34.Qxh5 gxh5 35.Rh2 Bc7 36.Rxh5 Ra8
28/43- 00:34 124,862k 3,660k +2.68 22. ... Nxg3 23.Qxg3 Qb2
28/43 00:36 133,953k 3,669k +2.62 22. ... Nxg3 23.Qxg3 Qb2 24.Rd1 Qc2 25.Rd2 Qb1+ 26.Bd1 Ra6 27.Kf2 Qh7 28.Bc2 g6 29.Kg1 Bc7 30.Qf3 Rd8 31.g3 Kf8 32.Rh2 Qg8 33.Qh1 Ke7 34.Rh7 Re8 35.Nf6 Nxf6 36.exf6+ Kd8 37.Rg7 Qf8 38.Qh7 Qxa3 39.Rxf7 Qa1+
29/43+ 00:39 145,033k 3,655k +2.68 22. ... Nxg3 23.Qxg3 Qb2 24.Rd1
29/43+ 00:39 145,617k 3,655k +2.75 22. ... Nxg3 23.Qxg3 Qb2 24.Rd1 Qc2 25.Rd2 Qb1+ 26.Bd1 Re8 27.f5 exf5 28.Nh6+ Kf8 29.g6 f4 30.Qxf4 Qxg6 31.Nf5 Bc7 32.Bc2 Qh5 33.g4 Qh8 34.Rf2
29/43+ 00:40 146,468k 3,653k +2.84 22. ... Nxg3 23.Qxg3 Qb2 24.Rd1 Qc2 25.Rd2 Qb1+ 26.Bd1 Re8 27.f5 exf5 28.Nh6+ Kf8
29/43+ 00:40 148,447k 3,652k +2.98 22. ... Nxg3 23.Qxg3 Qb2 24.Rd1 Qc2 25.Rd2 Qb1+ 26.Bd1 Re8 27.f5 exf5 28.Nh6+ Kf8
29/43 00:42 156,412k 3,646k +2.75 22. ... Nxg3 23.Qxg3 Qb2 24.Rd1 Qc2 25.Rd2 Qb1+ 26.Bd1 Re8 27.f5 exf5 28.Nh6+ Kf8 29.g6 Re6 30.Nxf7 Be7 31.Kh2 f4 32.Qxf4 Qe4 33.Qxe4 dxe4 34.Rf2 Ke8 35.Bg4 Rb6 36.Ng5 Nf8 37.Bf5 Bxg5 38.Bxg5 Rb3 39.Re2
30/43+ 00:44 163,326k 3,640k +2.82 22. ... Nxg3 23.Qxg3 Qb2 24.Rd1 Qc2 25.Rd2 Qb1+ 26.Bd1 Re8 27.f5 exf5 28.Nh6+ Kf8 29.g6 Re6 30.Nxf7 Be7 31.Kh2 f4 32.Qxf4 Qe4 33.Qxe4 dxe4 34.Rf2 Ke8 35.Bg4 Rb6 36.Ng5
30/49- 00:50 182,723k 3,625k +2.69 22. ... Nxg3 23.Qxg3 Qb2 24.Rd1 Qc2 25.Rd2 Qb1+ 26.Bd1 Re8 27.f5 exf5 28.Nh6+ Kf8 29.g6 Re6 30.Nxf7
30/49 00:52 190,434k 3,625k +2.69 22. ... Nxg3 23.Qxg3 Qb2 24.Rd1 Qc2 25.Rd2 Qb1+ 26.Bd1 Re8 27.f5 exf5 28.Nh6+ Kf8 29.g6 Re6 30.Nxf7 Be7 31.Kh2 f4 32.Bxf4 Qxg6 33.Ng5 Qh6+ 34.Nh3 g5 35.Be3 Kg8 36.Bg4 Nf8 37.Rb2 Rb6 38.Rxb6 Qxb6 39.Bxg5 Bxa3 40.Be3 Ng6 41.Ng5
31/49+ 00:53 193,044k 3,623k +2.75 22. ... Nxg3 23.Qxg3 Qb2 24.Rd1
31/49- 00:57 208,268k 3,623k +2.63 22. ... Nxg3 23.Qxg3 Qb2 24.Rd1 Qc2 25.Rd2 Qb1+ 26.Bd1 Re8 27.f5 exf5 28.Nh6+ Kf8 29.g6 Re6 30.Nxf7
31/49 01:02 227,497k 3,616k +2.76 22. ... Nxg3 23.Qxg3 Qb2 24.Rd1 Qc2 25.Rd2 Qb1+ 26.Bd1 Re8 27.f5 exf5 28.Nh6+ Kf8 29.g6 Re6 30.Nxf7 Be7 31.Kh2 f4 32.Bxf4 Qxg6 33.Ng5 Rea6 34.Qg4 Bc8 35.Qf3 Nb6 36.Rf2 Kg8 37.Bc1 Bd7 38.Bc2 Qh6+ 39.Nh3 Qxc1 40.Qf7+ Kh8 41.Qxe7
32/49- 01:18 280,617k 3,579k +2.70 22. ... Nxg3 23.Qxg3 Qb2 24.Rd1 Qc2 25.Rd2 Qb1+ 26.Bd1 Re8 27.f5 exf5 28.Nh6+ Kf8 29.g6 Re6 30.Nxf7 Be7 31.Kh2 f4 32.Bxf4 Qxg6 33.Ng5 Rea6 34.Qg4 Bc8 35.Qf3 Nb6 36.Rf2 Kg8 37.Bc1 Bd7
32/50+ 01:20 285,846k 3,572k +2.82 22. ... Nxg3 23.Qxg3 Qb2 24.Rd1 Qc2 25.Rd2 Qb1+ 26.Bd1 Re8 27.f5 exf5 28.Nh6+ Kf8 29.g6
32/50+ 01:20 288,458k 3,570k +2.91 22. ... Nxg3 23.Qxg3 Qb2 24.Rd1 Qc2 25.Rd2 Qb1+ 26.Bd1 Re8 27.f5 exf5
32/50 01:23 297,587k 3,560k +2.82 22. ... Nxg3 23.Qxg3 Qb2 24.Rd1 Qc2 25.Rd2 Qb1+ 26.Bd1 Re8 27.f5 exf5 28.Nh6+ Kf8 29.g6 Re6 30.Nxf7 Be7 31.Kh2 f4 32.Qxf4 Qe4 33.Qxe4 dxe4 34.Rf2 Ke8 35.Bg4 Rxg6 36.Bh5 Rb6 37.Nd6+ Kd8 38.Nxc4 Rb3 39.Bd2 Nb6
33/50- 01:36 341,357k 3,537k +2.76 22. ... Nxg3 23.Qxg3 Qb2 24.Rd1 Qc2 25.Rd2 Qb1+ 26.Bd1 Re8 27.f5 exf5 28.Nh6+ Kf8 29.g6 Re6 30.Nxf7 Be7 31.Kh2 f4 32.Qxf4 Qe4 33.Qxe4 dxe4 34.Rf2 Ke8 35.Bg4 Rxg6 36.Bh5 Rb6 37.Nd6+ Kd8 38.Nxc4 Rb3 39.Bd2 Nb6
33/50- 01:49 384,348k 3,524k +2.70 22. ... Nxg3 23.Qxg3 Bc7 24.Nf6+ Nxf6 25.gxf6 g6 26.Kf2 Qb2 27.Rh1
33/50- 02:01 425,754k 3,514k +2.60 22. ... Nxg3 23.Qxg3 Qb2 24.Rd1 Qc2 25.Rd2 Qb1+ 26.Bd1 Re8 27.f5 exf5 28.Nf6+ Bxf6 29.gxf6 g6
33/53- 02:34 537,542k 3,487k +2.46 22. ... Nxg3 23.Qxg3 Qb2 24.Nf6+ Bxf6 25.gxf6 Qxa1+ 26.Kf2 g6 27.Qh3
33/53- 03:10 659,113k 3,455k +2.25 22. ... Nxg3 23.Qxg3 Bc7 24.Nf6+ Nxf6 25.gxf6 g6 26.Kf2 Qb2 27.Rh1
33/53- 04:01 823,808k 3,405k +1.94 22. ... Nxg3 23.Qxg3 Qb2 24.Nf6+ Nxf6 25.gxf6 g6 26.Re1 Qxc3 27.Kf2 Bb6 28.Rd1 Qb2 29.Qh3 c3 30.Rh1
33/58- 07:55 1,616,762k 3,397k +1.47 22. ... Nxg3 23.Qxg3 Ra7 24.Kf2 Qc6 25.Nf6+ Bxf6 26.gxf6 g6 27.Rh1

wgarvin · Post by **wgarvin** » Fri Nov 14, 2014 6:03 am

Just to speculate for a moment:

I wonder if self-play testing is more problematic for weaker engines, because they are less "well-rounded" in general and they might still have major blind spots. A gauntlet of weak-but-similar-strength engines will have a varied assortment of blind spots--probably very different ones from the engine under test.

So when you test a change to a weaker engine, it is more likely to be a change that addresses one of these "major" blind spots. In self-play, the new version might be able to exploit that against the old version, even if gauntlet opponents would not. Also, whatever other blind spots it has will be part of both versions, and they won't know how to exploit them against each other, even if gauntlet opponents would. So that's at least two reasons why self-play test results might vary from results against other opponents.

But by the time an engine gets up to the strength of Stockfish, it doesn't have much in the way of "major" weaknesses left! And maybe the character of the changes being tested, and the kind of effects they have, is different too. Suppose the strong engines are all pretty well-rounded, and the changes being tested on them are mostly "small tweaks" that help a little bit in a broad variety of positions. Unless a change is so bad as to cripple the engine somehow, it seems likely that it would help or hurt about the same against Komodo or Houdini as it does against Stockfish itself.

Obviously Stockfish could still get different results from self-play compared to doing a gauntlet, but maybe it doesn't happen often.
