SPRT when not used for self testing

AndrewGrant
Posts: 472
Joined: Tue Apr 19, 2016 4:08 am
Location: U.S.A
Full name: Andrew Grant

SPRT when not used for self testing

Post by AndrewGrant » Fri Oct 21, 2016 7:27 am

Hello all,

I've been working on coming up with a reasonable way of testing changes made to my engine. I now have the resources to queue up many possible changes for testing.

I've been looking to build my own testing framework, similar to what Stockfish uses but of my own creation, partly because I want to learn what it takes to build one, and partly because I am not interested in self-testing.

Anyway, I've been trying to read the threads around here on SPRT to get an idea of what is going on. I think I am at the point now where I understand the process for implementing SPRT when self-testing. However, gauntlet testing poses two new problems.

1) How would I alter the elo0 and elo1 bounds for individual tests? The Stockfish guys run a load of tests on [0, 5], which makes sense, and then non-algorithm changes on [-3, 1], which also makes sense. But let's say I have Ethereal, rated ~2600, and an old version of DiscoCheck, rated ~2730. A good change to Ethereal would then show a rating difference of, say, only 120, whereas a failed change would show a rating difference of 140. Should the bounds (elo0, elo1) simply be [-130, -125], if I were to mirror the [0, 5] that Stockfish uses for self-testing?

2) Clearly the SPRT calculations would have to be done separately for each engine matchup. Say Ethereal plays against A, B, and C. If all three pass, or all three fail, the result is clear. However, what happens when A and B pass, but C fails? Has anyone here tinkered around with this configuration?

Additionally, I would need to keep track of the improvements Ethereal makes against each engine. Say I pass a test against A, B, and C, commit the changes, and release a new version. Now, in order to continue testing, I would need to determine the difference between Ethereal and A again, and the same for B and C.

So it looks like SPRT, if done correctly for gauntlets, provides a way to terminate failed tests early. However, after a passed test I would need another round of benchmarking before testing can begin again.
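To make question 1 concrete, here is a minimal sketch of how shifted bounds such as [-130, -125] could be turned into an SPRT check against a single opponent. It assumes the plain logistic Elo model and a normal approximation to the log-likelihood ratio; the function names (`expected_score`, `sprt_llr`, `sprt_bounds`) are made up for this illustration and are not taken from any existing framework.

```python
import math

def expected_score(elo_diff):
    """Expected score of a player rated elo_diff above its opponent (logistic model)."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def sprt_llr(wins, draws, losses, elo0, elo1):
    """Approximate log-likelihood ratio of H1 (elo1) against H0 (elo0).

    Uses a normal approximation of the mean game score; this is an
    assumption of the sketch, not necessarily what any framework does.
    """
    games = wins + draws + losses
    if games == 0:
        return 0.0
    score = (wins + 0.5 * draws) / games
    # Per-game variance of the result (win = 1, draw = 0.5, loss = 0).
    variance = (wins + 0.25 * draws) / games - score ** 2
    if variance <= 0.0:
        return 0.0
    s0, s1 = expected_score(elo0), expected_score(elo1)
    return games * (s1 - s0) * (2.0 * score - s0 - s1) / (2.0 * variance)

def sprt_bounds(alpha=0.05, beta=0.05):
    """Lower and upper LLR thresholds for the given error rates."""
    return math.log(beta / (1.0 - alpha)), math.log((1.0 - beta) / alpha)

if __name__ == "__main__":
    # Hypothetical numbers: Ethereal about 130 Elo below the opponent.
    lower, upper = sprt_bounds()
    llr = sprt_llr(wins=300, draws=100, losses=600, elo0=-130, elo1=-125)
    print(lower, llr, upper)
```

The test keeps running while the LLR stays between the two thresholds, fails when it drops below the lower one, and passes when it rises above the upper one. Nothing in this depends on the bounds being centred near zero, so the mechanics for [-130, -125] are the same as for [0, 5].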

Is there a better method for Gauntlet testing?

Thanks,
Andrew Grant

Sven
Posts: 3822
Joined: Thu May 15, 2008 7:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: SPRT when not used for self testing

Post by Sven » Fri Oct 21, 2016 10:29 am

I would consider combining the results against A, B and C into one calculation to simplify things. When I test my engine with a gauntlet (which I actually do, with a previous version of my own engine as one of the opponents, plus a few others), I always calculate ratings from the games against all opponents. Results against any single opponent are not interesting to me. I think this should also be possible with SPRT.
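One possible way to fold all opponents into a single SPRT, sketched under assumptions: treat every game as independent, shift the hypotheses by each opponent's known baseline difference, and sum the per-opponent log-likelihood ratios. The names `pooled_llr` and `sprt_decision`, the `baseline` dictionary, and the normal approximation are all choices made for this illustration, not an established method from this thread.

```python
import math

def expected_score(elo_diff):
    """Expected score of a player rated elo_diff above its opponent (logistic model)."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def llr_one_opponent(wins, draws, losses, elo0, elo1):
    """Normal-approximation LLR for the games against one opponent."""
    games = wins + draws + losses
    if games == 0:
        return 0.0
    score = (wins + 0.5 * draws) / games
    variance = (wins + 0.25 * draws) / games - score ** 2
    if variance <= 0.0:
        return 0.0
    s0, s1 = expected_score(elo0), expected_score(elo1)
    return games * (s1 - s0) * (2.0 * score - s0 - s1) / (2.0 * variance)

def pooled_llr(results, baseline, gain0, gain1):
    """Sum the LLRs over all opponents.

    results:  {opponent: (wins, draws, losses)} for the patched engine
    baseline: {opponent: Elo difference of the unpatched engine vs that opponent}
    gain0, gain1: hypothesised Elo gain of the patch under H0 and H1
    """
    total = 0.0
    for name, (wins, draws, losses) in results.items():
        base = baseline[name]
        total += llr_one_opponent(wins, draws, losses, base + gain0, base + gain1)
    return total

def sprt_decision(llr, alpha=0.05, beta=0.05):
    """Return 'accept H1', 'accept H0' or 'continue' for the given LLR."""
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    if llr >= upper:
        return "accept H1"
    if llr <= lower:
        return "accept H0"
    return "continue"

if __name__ == "__main__":
    # Hypothetical gauntlet data, for illustration only.
    baseline = {"A": -130.0, "B": -20.0, "C": 40.0}
    results = {"A": (300, 100, 600), "B": (420, 150, 430), "C": (500, 120, 380)}
    print(sprt_decision(pooled_llr(results, baseline, gain0=0.0, gain1=5.0)))
```

With this arrangement the bounds are stated as a gain over the current version ([0, 5] in the usage above), and there is only one pass/fail decision, so the case "A and B pass, but C fails" does not arise. The baselines still have to be re-measured after every accepted patch.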

Re: SPRT when not used for self testing

Post by AndrewGrant » Fri Oct 21, 2016 3:25 pm

Say I take the approach of combining them. Let's say the average rating of all engines in the gauntlet is 50 Elo higher than Ethereal's. I run a test with the bounds [-50, -45]. The test passes, which shows that this patch was worth at least 5 Elo.

Now I want to run a new test, so I use the bounds [-45, -40]. However, the test that passed was worth AT LEAST 5 Elo. Maybe it was actually worth 30. In that case, unless I make a horrible patch, the new test should pass, even if I make a small regression.

Stockfish avoids this because they only test against themselves, so the engines are always assumed to be of the same strength. But here, my opponents have fixed Elos and my engine's is variable.

So what is the solution? If a test passes, keep shifting the bound until it fails?

Laskos
Posts: 9412
Joined: Wed Jul 26, 2006 8:21 pm
Full name: Kai Laskos

Re: SPRT when not used for self testing

Post by Laskos » Sat Oct 22, 2016 6:27 am

AndrewGrant wrote: Say I take the approach of combining them. Let's say the average rating of all engines in the gauntlet is 50 Elo higher than Ethereal's. I run a test with the bounds [-50, -45]. The test passes, which shows that this patch was worth at least 5 Elo.

Now I want to run a new test, so I use the bounds [-45, -40]. However, the test that passed was worth AT LEAST 5 Elo. Maybe it was actually worth 30. In that case, unless I make a horrible patch, the new test should pass, even if I make a small regression.

Stockfish avoids this because they only test against themselves, so the engines are always assumed to be of the same strength. But here, my opponents have fixed Elos and my engine's is variable.

So what is the solution? If a test passes, keep shifting the bound until it fails?
This gauntlet against different engines is a bit tricky. Would you run it at a single time control and on the same hardware? The engines might scale differently, and markedly so with ultra-fast games.

Then, for the problem you raised in the quoted post, I would use a 3 standard deviations (3SD) stopping rule, not SPRT. Under conditions that are ideal for SPRT, like the SF testing framework, SPRT is 20-40% faster than 3SD stopping. But here you would combine engines to get a unified score, and you seem to be unsure about the magnitude of the patch. For example, if your SPRT window is [-50, -45] and the patch is really a 30-point improvement, 3SD will stop an order of magnitude faster. Use the 3SD stop for this kind of testing with many unknowns: see the value of the Elo difference, use the new Elo value for your engine, then use 3SD again for the next patch, measured against the new Elo difference. And so on.

Say your initial engine is 50 Elo points weaker. With the patch, you stop when you see that it is now 41 +/- 9 at 3SD (or the usual 41 +/- 6 at 2SD shown by rating programs like BayesElo or Ordo), and assume that your program is now 41 Elo points weaker than the adversaries. Do the same in the next test, with 41 in place of 50.
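A minimal sketch of such a 3SD stopping check, assuming the logistic Elo model and a standard error computed from the pooled win/draw/loss counts. The function `three_sd_check` and its parameters are invented for this illustration; rating tools like BayesElo or Ordo compute their error bars differently.

```python
import math

def expected_score(elo_diff):
    """Expected score of a player rated elo_diff above its opponents (logistic model)."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def elo_from_score(score):
    """Invert the logistic model; score must lie strictly between 0 and 1."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def three_sd_check(wins, draws, losses, baseline_elo, n_sd=3.0):
    """Return (stop, estimated_elo, z) for pooled gauntlet results.

    baseline_elo is the Elo difference of the unpatched engine against the
    gauntlet (for example -50). The test stops once the observed score is
    more than n_sd standard errors away from the baseline expectation.
    """
    games = wins + draws + losses
    if games == 0:
        return False, None, 0.0
    score = (wins + 0.5 * draws) / games
    variance = (wins + 0.25 * draws) / games - score ** 2
    if variance <= 0.0:
        # All results identical so far: no usable error estimate yet.
        return False, None, 0.0
    std_error = math.sqrt(variance / games)
    z = (score - expected_score(baseline_elo)) / std_error
    return abs(z) >= n_sd, elo_from_score(score), z

if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    stop, elo, z = three_sd_check(wins=700, draws=400, losses=900, baseline_elo=-50.0)
    print(stop, elo, z)
```

When `stop` comes back true, the estimated Elo becomes the `baseline_elo` for the next patch, which is the "41 instead of 50" step described above.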

Note: the 3SD stopping rule (a p-value stopping rule) has a Type I error pretty similar to that of SPRT with alpha = 5% in the range of 100 to 100,000 games, the usual range for testing. The Type I error is, however, theoretically unbounded for larger numbers of games; it diverges slowly, logarithmically.
