SPRT question

bob · Post by **bob** » Thu Nov 13, 2014 7:16 pm

I decided to play around with SPRT as an early termination idea, since the StockFish guys were using it and it seemed quite reasonable. But the way they use it is in self-testing. My question concerns using it as an early termination methodology for gauntlets as well.

The code was easy enough to write, and the primary inputs to my function are simply wins, draws and losses, extracted directly from the PGN as the match progresses. The other terms (confidence interval for H0/H1 and the lower/upper Elo bounds for failure/success) are straightforward.

My question is this: Is there anything wrong with the idea of using the above in a gauntlet? I.e., I simply collect wins/draws/losses (total for all opponents) from Crafty's perspective and then feed that into the SPRT calculation, just as I do when I try crafty vs crafty'?

It seems logical that it should work just as well, but I wonder, since I learned a lot about Elo testing from Remi when I started cluster testing. Any comments, experience, suggestions, warnings, etc.?
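For readers who want something concrete, here is a minimal sketch of the kind of function described above. It is not the Crafty or Stockfish code: it assumes the logistic Elo model and a normal approximation to the log-likelihood ratio, and the names `sprt`, `elo0`, `elo1`, `alpha` and `beta` are hypothetical. With pooled gauntlet totals, the two Elo bounds are expressed relative to a 50% score against the pool.

```python
import math

def elo_to_score(elo):
    """Expected score under the logistic Elo model (an assumption of this sketch)."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

def sprt(wins, draws, losses, elo0=0.0, elo1=5.0, alpha=0.05, beta=0.05):
    """Return (llr, decision) using a normal approximation to the LLR.

    H0: true strength is elo0, H1: true strength is elo1.
    alpha and beta are the tolerated false-positive / false-negative rates.
    """
    n = wins + draws + losses
    if n == 0:
        return 0.0, "continue"
    mean = (wins + 0.5 * draws) / n
    # Per-game variance of the score (win = 1, draw = 0.5, loss = 0).
    var = (wins + 0.25 * draws) / n - mean * mean
    if var <= 0.0:
        return 0.0, "continue"  # all results identical so far, no information
    s0, s1 = elo_to_score(elo0), elo_to_score(elo1)
    llr = n * (s1 - s0) * (2.0 * mean - s0 - s1) / (2.0 * var)
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    if llr >= upper:
        return llr, "accept H1"
    if llr <= lower:
        return llr, "accept H0"
    return llr, "continue"

# Pooled gauntlet totals from the engine's perspective.
llr, decision = sprt(wins=700, draws=1000, losses=300)
print(round(llr, 2), decision)
```

The function would be called after each finished game (or batch of games) with the running totals, and the match stopped as soon as the decision is no longer "continue".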

Vinvin · Post by **Vinvin** » Thu Nov 13, 2014 8:12 pm

bob wrote:I decided to play around with SPRT as an early termination idea, since the StockFish guys were using it and it seemed quite reasonable. But the way they use it is in self-testing. My question concerns using it as an early termination methodology for gauntlets as well.

The code was easy enough to write, and the primary inputs to my function are simply wins, draws and losses, extracted directly from the PGN as the match progresses. The other terms (confidence interval for H0/H1 and the lower/upper Elo bounds for failure/success) are straightforward.

My question is this: Is there anything wrong with the idea of using the above in a gauntlet? I.e., I simply collect wins/draws/losses (total for all opponents) from Crafty's perspective and then feed that into the SPRT calculation, just as I do when I try crafty vs crafty'?

It seems logical that it should work just as well, but I wonder, since I learned a lot about Elo testing from Remi when I started cluster testing. Any comments, experience, suggestions, warnings, etc.?

IMHO, it's a good idea to play against other engines (family).

SF has already shown weaknesses in king safety, and the current testing methodology avoids exposing such weaknesses. To find out that an engine has a weakness, it is good to have an opponent that is able to play against it.

Evert · Post by **Evert** » Thu Nov 13, 2014 8:14 pm

By no means an expert, but one obvious thing to watch out for might be that the distribution of games over the opponents should be uniform. If you alternate opponents in the inner loop that shouldn't be a problem in practice.

It might be interesting to monitor or look at the results per opponent. This will tell you if the outcome is mainly due to one opponent or not, but I guess you'd do the same thing for regular gauntlet testing.
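A minimal sketch of the two checks suggested here, assuming the results are available as `(opponent, outcome)` pairs; the function names and the `tolerance` parameter are hypothetical, not part of any existing tool.

```python
from collections import defaultdict

def tally_by_opponent(results):
    """results: iterable of (opponent, outcome), outcome in {'W', 'D', 'L'}."""
    table = defaultdict(lambda: {"W": 0, "D": 0, "L": 0})
    for opponent, outcome in results:
        table[opponent][outcome] += 1
    return dict(table)

def games_are_balanced(table, tolerance=1):
    """True if no opponent has played more than `tolerance` games more than another."""
    counts = [sum(c.values()) for c in table.values()]
    return not counts or max(counts) - min(counts) <= tolerance

def score_by_opponent(table):
    """Fractional score against each opponent, to spot a single dominant opponent."""
    return {opp: (c["W"] + 0.5 * c["D"]) / sum(c.values())
            for opp, c in table.items()}

games = [("A", "W"), ("B", "L"), ("A", "D"), ("B", "L")]
table = tally_by_opponent(games)
print(games_are_balanced(table), score_by_opponent(table))
```

If one opponent accounts for nearly all of the score difference, the pooled totals fed to SPRT are effectively measuring that one pairing.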

bob · Post by **bob** » Thu Nov 13, 2014 8:31 pm

Vinvin wrote:
IMHO, it's a good idea to play against other engines (family).

SF has already shown weaknesses in king safety, and the current testing methodology avoids exposing such weaknesses. To find out that an engine has a weakness, it is good to have an opponent that is able to play against it.

I've always tested against gauntlets, only using self-play for debugging/testing performance issues. But the question I have is if the SPRT idea will work equally well using a group of opponents as opposed to just one opponent.

bob · Post by **bob** » Thu Nov 13, 2014 8:33 pm

Evert wrote:By no means an expert, but one obvious thing to watch out for might be that the distribution of games over the opponents should be uniform. If you alternate opponents in the inner loop that shouldn't be a problem in practice.

It might be interesting to monitor or look at the results per opponent. This will tell you if the outcome is mainly due to one opponent or not, but I guess you'd do the same thing for regular gauntlet testing.

The games are certainly done as equally as possible, but there are issues. For example, the first N games that finish are usually quickly decisive, whether they are drawn/won/lost. I see some substantial "wobble" in the Elo until a significant number of games have been played, and on quite a few occasions, the Elo for Crafty at 5000 games is quite a bit different from the final Elo after 30K games.

And yes, I look at results per opponent and results overall...

Evert · Post by **Evert** » Thu Nov 13, 2014 9:48 pm

bob wrote: The games are certainly done as equally as possible, but there are issues. For example, the first N games that finish are usually quickly decisive, whether they are drawn/won/lost. I see some substantial "wobble" in the Elo until a significant number of games have been played, and on quite a few occasions, the Elo for Crafty at 5000 games is quite a bit different from the final Elo after 30K games.

Ah, but the SPRT test as done by Stockfish doesn't give you an Elo measurement. It just tells you whether the statement "Crafty is stronger than the gauntlet" is true or not. So it also cannot tell you whether version A or version B is better, unless one of them is weaker than the gauntlet and the other is not.

However, it seems to me that the hypothesis "version A performs better against the gauntlet than version B" probably can be tested using SPRT (someone who has thought about this more may want to say something about this), but needs to be done differently. Perhaps having both version A and version B play out a given position against a particular opponent and then recording whether A did better, equal or worse in this short match. That measures performance of version A against version B and maps the score to win/draw/loss so the SPRT test can be done in exactly the same way as is done in self-play. The noise would be larger though, so you'd probably want a tighter bound.
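A minimal sketch of the mapping described in the paragraph above, assuming each version's gauntlet results are stored in a dict keyed by `(opponent, position, color)` with the game score as value. The function name and data layout are hypothetical.

```python
def paired_wdl(results_a, results_b):
    """Map paired gauntlet games to win/draw/loss counts for A versus B.

    results_a, results_b: dicts mapping (opponent, position, color) to the
    score of that version in that game (1, 0.5 or 0).
    Only games that both versions have played are compared.
    """
    wins = draws = losses = 0
    for key in results_a.keys() & results_b.keys():
        a, b = results_a[key], results_b[key]
        if a > b:
            wins += 1      # A did better than B in the same game setup
        elif a < b:
            losses += 1    # A did worse
        else:
            draws += 1     # same result for both versions
    return wins, draws, losses

a = {("Opp1", 1, "w"): 1, ("Opp1", 1, "b"): 0.5, ("Opp2", 1, "w"): 0}
b = {("Opp1", 1, "w"): 0.5, ("Opp1", 1, "b"): 0.5, ("Opp2", 1, "w"): 1}
print(paired_wdl(a, b))
```

The resulting counts can be fed to the same SPRT routine used for self-play. Many pairs will come out as "draws" because both versions get the same result, which is one way to see the extra noise mentioned above.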

gladius · Post by **gladius** » Thu Nov 13, 2014 10:25 pm

Vinvin wrote:
IMHO, it's a good idea to play against other engines (family).

SF has already shown weaknesses in king safety, and the current testing methodology avoids exposing such weaknesses. To find out that an engine has a weakness, it is good to have an opponent that is able to play against it.

I'm not sure about this. It depends on the pool of engines that are available with the same strength or stronger than you are.

It doesn't help to include weaker engines in the pool, as you will potentially accept changes that optimize against them, instead of helping vs stronger opponents.

In the SF case, there is really only one current opponent at the TC we are testing at that fits this criterion (Komodo), and it's not free, so it becomes a logistical problem.

Also, that doesn't even factor in the decrease in testing resolution when not using self-play, as well as the increased error bars.

bob · Post by **bob** » Thu Nov 13, 2014 10:38 pm

Evert wrote:
bob wrote: The games are certainly done as equally as possible, but there are issues. For example, the first N games that finish are usually quickly decisive, whether they are drawn/won/lost. I see some substantial "wobble" in the Elo until a significant number of games have been played, and on quite a few occasions, the Elo for Crafty at 5000 games is quite a bit different from the final Elo after 30K games.
Ah, but the SPRT test as done by Stockfish doesn't give you an Elo measurement. It just tells you whether the statement "Crafty is stronger than the gauntlet" is true or not. So it also cannot tell you whether version A or version B is better, unless one of them is weaker than the gauntlet and the other is not.

However, it seems to me that the hypothesis "version A performs better against the gauntlet than version B" probably can be tested using SPRT (someone who has thought about this more may want to say something about this), but needs to be done differently. Perhaps having both version A and version B play out a given position against a particular opponent and then recording whether A did better, equal or worse in this short match. That measures performance of version A against version B and maps the score to win/draw/loss so the SPRT test can be done in exactly the same way as is done in self-play. The noise would be larger though, so you'd probably want a tighter bound.

Aha, you are exactly correct. I had not thought about this.

bob · Post by **bob** » Thu Nov 13, 2014 10:40 pm

gladius wrote:
Vinvin wrote:
IMHO, it's a good idea to play against other engines (family).

SF has already shown weaknesses in king safety, and the current testing methodology avoids exposing such weaknesses. To find out that an engine has a weakness, it is good to have an opponent that is able to play against it.
I'm not sure about this. It depends on the pool of engines that are available with the same strength or stronger than you are.

It doesn't help to include weaker engines in the pool, as you will potentially accept changes that optimize against them, instead of helping vs stronger opponents.

In the SF case, there is really only one current opponent at the TC we are testing at that fits this criterion (Komodo), and it's not free, so it becomes a logistical problem.

Also, that doesn't even factor in the decrease in testing resolution when not using self-play, as well as the increased error bars.

You can clearly make a change that helps against yourself, but which hurts against either stronger or weaker opponents. I don't particularly buy that argument.

Error bars don't increase/decrease based on self-play, as the number of games can be increased to choose whatever error bar you deem acceptable. I just went through a round of both self-testing and gauntlet testing while working on singular extensions and threat extensions, and I got way too many false positives with self-test that were promptly exposed with gauntlet testing...
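To illustrate the point about choosing the error bar through the number of games, here is a minimal sketch that turns W/D/L totals into an Elo estimate with an approximate 95% interval. It assumes the logistic Elo model and a normal approximation for the mean score; the function names are hypothetical.

```python
import math

def score_to_elo(score):
    """Elo difference implied by a fractional score (logistic model)."""
    score = min(max(score, 1e-9), 1.0 - 1e-9)  # avoid log of zero at 0% / 100%
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_interval(wins, draws, losses, z=1.96):
    """Return (elo, low, high); z=1.96 gives roughly a 95% interval."""
    n = wins + draws + losses
    if n == 0:
        raise ValueError("no games played")
    mean = (wins + 0.5 * draws) / n
    var = (wins + 0.25 * draws) / n - mean * mean
    margin = z * math.sqrt(var / n)  # shrinks as 1/sqrt(n)
    return score_to_elo(mean), score_to_elo(mean - margin), score_to_elo(mean + margin)

for scale in (1, 4):
    elo, low, high = elo_interval(500 * scale, 1000 * scale, 500 * scale)
    print(scale * 2000, "games: width", round(high - low, 1), "Elo")
```

Quadrupling the number of games roughly halves the width of the interval, whichever kind of opponent is used.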

Uri Blass · Post by **Uri Blass** » Thu Nov 13, 2014 11:31 pm

Vinvin wrote:
IMHO, it's a good idea to play against other engines (family).

SF has already shown weaknesses in king safety, and the current testing methodology avoids exposing such weaknesses. To find out that an engine has a weakness, it is good to have an opponent that is able to play against it.

I do not know which weakness you are talking about.

All the rating lists (which test against other engines) show that Stockfish becomes stronger with every new version, so the strategy of testing only against the previous version works.
