SPRT when not used for self testing
Posted: Fri Oct 21, 2016 9:27 am
Hello all,
I've been working on coming up with a reasonable way of testing changes made to my engine. I now have the resources to queue up many possible changes for testing.
I'm looking to build my own testing framework, similar to what Stockfish uses. This is partly because I want to learn the things I'll need to learn to accomplish it, and partly because I am not interested in self-testing.
Anyway, I've been trying to read the threads around here on SPRT to get an idea of what is going on. I think I am at the point now where I understand the process for implementing SPRT when self-testing. However, gauntlet testing poses two new problems.
1) How would I alter the elo0 and elo1 bounds for individual tests? The Stockfish guys run most tests on [0, 5], which makes sense, and non-algorithmic changes on [-3, 1], which also makes sense. But let's say I have Ethereal, rated ~2600, and an old version of DiscoCheck rated ~2730. A good change to Ethereal might show a rating difference of only 120, whereas a failed change might show a difference of 140. Should the bounds (elo0, elo1) simply become [-130, -125] if I mirror the [0, 5] that Stockfish uses for self-testing? (A rough sketch of what I mean follows after the second question.)
2) Clearly the SPRT calculation would have to be carried out separately for each engine matchup. Say Ethereal plays against A, B, and C. If all three pass, or all three fail, the result is clear. However, what happens when A and B pass, but C fails? Has anyone here tinkered with this configuration? (One possible way of combining the per-opponent tests is sketched near the end of this post.)
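To make question 1 concrete, here is a rough sketch of what I have in mind, using the usual normal-approximation log-likelihood ratio on the per-game score. The alpha = beta = 0.05, the -130 baseline gap, and the game counts are all placeholder numbers of mine, not anything Stockfish actually uses for gauntlets.
[code]
import math

def elo_to_score(elo):
    # Expected score for a given Elo difference under the logistic model.
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

def llr(wins, draws, losses, elo0, elo1):
    # Approximate log-likelihood ratio of H1 (elo1) vs H0 (elo0),
    # using a normal approximation on the mean per-game score.
    n = wins + draws + losses
    if n == 0:
        return 0.0
    w, d = wins / n, draws / n
    s = w + d / 2.0                  # observed mean score
    var = (w + d / 4.0) - s * s      # per-game score variance
    if var <= 0.0:
        return 0.0
    s0, s1 = elo_to_score(elo0), elo_to_score(elo1)
    return n * (s1 - s0) * (2.0 * s - s0 - s1) / (2.0 * var)

# Placeholder numbers: Ethereal previously measured ~130 Elo below this
# opponent, and I want the same [0, 5] window relative to old Ethereal,
# so the hypotheses become -130 and -125 from Ethereal's point of view.
baseline_gap = -130
elo0, elo1 = baseline_gap + 0, baseline_gap + 5

alpha = beta = 0.05
lower = math.log(beta / (1.0 - alpha))    # about -2.94: accept H0, patch fails
upper = math.log((1.0 - beta) / alpha)    # about +2.94: accept H1, patch passes

# Made-up game counts, just to show the call:
print(llr(wins=250, draws=145, losses=605, elo0=elo0, elo1=elo1))
[/code]
If shifting the window like that is the right idea, then the only extra bookkeeping per opponent is the previously measured gap.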
Additionally, I would need to keep track of the improvement Ethereal makes against each engine. Say I pass a test against A, B, and C, commit the changes, and release a new version. In order to continue testing, I would then need to determine the rating difference between Ethereal and A again, and likewise for B and C.
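For that re-benchmarking step, I assume I would just play a fixed-length match against each opponent and invert the logistic model to get the new gap, something like this (the 36% score is a made-up number):
[code]
import math

def score_to_elo(score):
    # Elo difference implied by a match score, inverting the logistic model.
    return -400.0 * math.log10(1.0 / score - 1.0)

# Hypothetical example: new Ethereal scores 36% against A over a fixed match,
# so the stored gap for A would be updated to roughly -100 Elo.
new_gap_vs_A = score_to_elo(0.36)
[/code]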
So the overall scheme, if done correctly for gauntlets, provides a way to terminate failed tests early. However, terminating a passed test requires another round of benchmarking before testing can resume.
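Here is roughly how I picture the gauntlet decision loop (reusing llr() and the bounds from the first sketch): run an independent SPRT against each opponent, stop the whole test as soon as any single matchup fails, and only call it a pass once every matchup has passed. Whether that is statistically sound, and whether failing the whole patch on one opponent is too strict, is exactly what I am unsure about. The opponent names and gaps below are made up.
[code]
# Reuses llr(), lower, and upper from the first sketch above.
# Hypothetical per-opponent state; "gap" is the previously measured Elo
# difference, and each opponent gets the same shifted [0, 5] window.
opponents = {
    "A": {"gap": -130, "wins": 0, "draws": 0, "losses": 0},
    "B": {"gap":  -60, "wins": 0, "draws": 0, "losses": 0},
    "C": {"gap":   20, "wins": 0, "draws": 0, "losses": 0},
}

def gauntlet_status(opponents, lower, upper):
    # 'fail'     -> some opponent's SPRT hit the lower bound (stop early)
    # 'pass'     -> every opponent's SPRT hit the upper bound
    # 'continue' -> keep scheduling games
    all_passed = True
    for name, st in opponents.items():
        value = llr(st["wins"], st["draws"], st["losses"],
                    st["gap"] + 0, st["gap"] + 5)
        if value <= lower:
            return "fail"
        if value < upper:
            all_passed = False
    return "pass" if all_passed else "continue"
[/code]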
Is there a better method for gauntlet testing?
Thanks,
Andrew Grant