bob wrote:
OK, I certainly don't like that kind of testing. I've tried it in the past and found lots of unexpected "issues".
In the end, testing is a compromise between speed and accuracy, and it is the available hardware that sets the bar.
If resources are very limited you even end up testing from a set of tactical positions (as you stated in another post you were doing with Cray Blitz).
Regarding our development methodology, we cannot afford to spend more than 2 days to validate a patch, except for very important and difficult ones like the tweaks to LMR or futility pruning parameters, which we take extra care with and normally change only once per release.
And testing against a pool of engines doubles the error for a given number of games. For instance, suppose you play a gauntlet of 1000 games with SF_A against 4 opponents, 250 games per opponent. At the end you have a score of, say, 51% ± 2.
Now you want to test whether SF_B is better than SF_A, so you repeat the test with SF_B and you get 52% ± 2.
What can you conclude from the result? You have to consider the error ranges of _both_ the first _and_ the second match.
In self-play, instead, after 1000 games you get, for instance, that SF_B scores 51% ± 2 against SF_A.
In the second case the probability that SF_B is better than SF_A is higher than in the first case, and you have played half the games!
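To put rough numbers on that claim, here is a back-of-the-envelope sketch in Python. It treats every game as an independent win/loss trial and ignores draws and opponent-specific effects, so the figures only illustrate the trend, not the exact error bars a real tool would report.

Code:

import math

def score_error(score, games):
    # Standard error of a match score estimated from `games` independent games.
    return math.sqrt(score * (1.0 - score) / games)

# Two separate 1000-game gauntlets: SF_A scores 51%, SF_B scores 52%.
err_a = score_error(0.51, 1000)                         # about 1.6%
err_b = score_error(0.52, 1000)                         # about 1.6%
err_gauntlet_diff = math.sqrt(err_a ** 2 + err_b ** 2)  # errors add in quadrature: about 2.2%

# One 1000-game self-play match: SF_B scores 51% against SF_A directly,
# and that single error bar already applies to the difference we care about.
err_selfplay = score_error(0.51, 1000)                  # about 1.6%

print(err_gauntlet_diff, err_selfplay)

Under this crude model the gauntlet comparison has roughly twice the variance of the self-play match even though it used twice as many games in total; modelling draws properly changes the numbers but not the comparison.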
Yes, but you have little idea about how it plays against _other_ engines, which is hopefully the goal of testing. Adding a new feature to A to produce A' gives you two almost identical engines. That one change might hurt more than expected, or it might help more than expected, only because it is the _only_ difference between the two opponents. We tested like this early in our cluster testing, then compared the results to using a common set of opponents, and found _much_ better accuracy with a group of opponents.
Just my $.02 from working out the details of our cluster testing approach.
Also note that you don't have to re-run test A every time you make a new A'. Save the old PGN, and all you do each time is run a new A' test, until you find one that is better. Rename that A' to A, rename the A' PGN to the A PGN, and now make a new A' and test the next change...
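A rough sketch of that loop in Python, with a fake run_gauntlet() standing in for whatever driver actually plays the games; every name and file below is invented for illustration, not part of any real tester.

Code:

import random
import shutil

def run_gauntlet(engine_name, pgn_path):
    # Placeholder: pretend to play the fixed opponent pool and save the PGN.
    # Swap in your real test driver here.
    with open(pgn_path, "w") as f:
        f.write("; fake games for %s\n" % engine_name)
    return random.gauss(0.50, 0.01)              # fake score around 50%

baseline_score = run_gauntlet("A", "A.pgn")      # played once; the PGN is kept and reused

for patch in ["patch1", "patch2", "patch3"]:     # each candidate version A'
    score = run_gauntlet(patch, "Aprime.pgn")    # only the new version plays the gauntlet
    if score > baseline_score:                   # in practice, check the error bars too
        shutil.copy("Aprime.pgn", "A.pgn")       # "rename the A' pgn to A"
        baseline_score = score                   # A' becomes the new baseline A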
Sven Schüle wrote:
You have to combine all games of both gauntlets of SF_A and SF_B into one pool and then look at the overall rating results. This is recommended by Rémi Coulom for using BayesElo, and is also used by Bob AFAIK.
Apart from the fact that it is not fair to compare a testing session of 1000 games against one of 2000, does it change anything in the overall results?
Let me put it more clearly:
Hypothesis: given two versions of an engine, say SF_A and SF_B, and a fixed number of games to play (say 1000), what is the testing method that maximizes the reliability of the judgment of which of the two engines is stronger?
1) Self-play of 1000 games
2) Gauntlet play against other engines for a total of 1000 games
I am quite confident (but have no proof, because that requires some knowledge of statistics) that it is (1).
Note that in (1) you should also account for the (presumably low) probability that SF_A comes out stronger than SF_B in self-play while actually being weaker than SF_B against other engines. But even with this correction I guess (1) is still the best way, i.e. the one that minimizes the probability of making a bad decision about which engine is stronger.
It would be interesting if Rémi could give some insight into this problem (I think he is the only one among us who could contribute something beyond "guessing").
(2) is more accurate. I have actually done this comparison many times when we started cluster testing, as we tried to figure out the best way to measure changes.
If you don't combine all the PGN, you can't compare SF_A to SF_B at all, because the SF_A and SF_B sample sets are different. If you play against the same set of opponents and combine everything into one file, as Rémi suggested when we had this discussion (here) a couple of years ago, you get really reliable comparisons.
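Building that single pool is just a matter of appending the gauntlet PGN files before handing them to the rating tool; a minimal sketch in Python (file names are only examples):

Code:

# Append the SF_A and SF_B gauntlet games into one PGN pool.
with open("pool.pgn", "w") as out:
    for name in ["SF_A_gauntlet.pgn", "SF_B_gauntlet.pgn"]:
        with open(name) as f:
            out.write(f.read())
            out.write("\n\n")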
(2) may be better for measuring changes in Elo points, but I am not sure it is better for deciding whether a change is good or bad.
The target is not to know the exact Elo difference but to improve the engine, and it is not clear to me that testing against different opponents produces a bigger improvement in Elo than testing against the previous version.
Uri
It seems obvious to me that testing against a variety of opponents is better if you have enough time to do it.
Different programs have different strengths and weaknesses, and they will expose flaws that self-play against a single changed version will not.
I also think that in chess programming the goal is to beat other programs that are currently at or above the strength of your own. So increasing the strength of a program against itself may or may not have value, but increasing its strength against the opponents you want to beat has clear and obvious value.
First, it doesn't matter _who_ you play against. The error bar is based solely on the number of games and the results. A vs A' is painful because the difference is not great; you end up with excessive draws because the two programs are very close.
As I said, if you look back at the beginning of my cluster testing experiments, we tried A vs A' and were _not_ happy with the overall results. I then went to the gauntlet approach, and more often than I expected, A vs A' said "better" or "worse" while A vs gauntlet and A' vs gauntlet said the opposite.
I'm not looking for an "exact" Elo difference. But I do need an _accurate_ difference or I can't compare. Whether you use A vs A' or play A and A' against a gauntlet, you need the same number of games for the same "accuracy" in terms of Elo. I just don't think A vs A' is actually very representative of how the change affects you against other programs.
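A quick illustration of the "error bar depends only on the games and the results" point, using the standard logistic Elo formula; real tools such as BayesElo model draws and opponent strength more carefully, so treat these numbers as approximate.

Code:

import math

def elo_from_score(s):
    # Elo difference implied by an expected score s under the logistic model.
    return -400.0 * math.log10(1.0 / s - 1.0)

def elo_interval(score, games):
    se = math.sqrt(score * (1.0 - score) / games)   # standard error of the score
    return (elo_from_score(score - 2.0 * se),       # roughly a 95% interval
            elo_from_score(score),
            elo_from_score(score + 2.0 * se))

# Same 52% result; only the number of games changes the width of the interval.
print(elo_interval(0.52, 1000))
print(elo_interval(0.52, 4000))

The interval is just as wide whether the 52% came from self-play or from a gauntlet; only the score and the number of games matter.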
I'm quite sure testing against a variety of other opponents is superior to self-testing given the same number of games.
However, the key phrase in my statement is "given the same number of games" and I mean the same number of games for the individual programs being tested.
My tester only manages round robins, but I could easily fix that. With round robin testing, imagine that I am testing Komodo against 9 other opponents. That means 90% of the testing time is spent on those 9 other programs and only 10% is spent testing my own program. I have to do significantly more work to get the same statistical confidence.
A modification I could make to my tester is a mode where I can identify programs by family. I would play the usual round robins but with a rule that family never competes with family. This would be a big improvement over round robin, as I could put all the other programs into one family and Komodo into a second family. That way Komodo would be thoroughly tested.
I'm not so sure that playing other programs is really giving you as much variety as you think, however. Stylistically, programs are much closer to other programs in playing style than they are to humans. Nevertheless, I agree that playing a variety of computer opponents is better than nothing.
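A minimal sketch of the family rule described above, in Python: generate every round-robin pairing, then drop the pairs whose members share a family, so the Komodo versions only ever meet the opponent pool. All engine and family names here are invented for illustration.

Code:

from itertools import combinations

# Hypothetical families: the Komodo versions in one family, the opponent pool in another.
family = {
    "Komodo_dev":  "komodo",
    "Komodo_prev": "komodo",
    "OpponentA":   "pool",
    "OpponentB":   "pool",
    "OpponentC":   "pool",
}

def pairings(engines):
    # All round-robin pairings, minus the family-vs-family ones.
    return [(a, b) for a, b in combinations(engines, 2) if family[a] != family[b]]

for a, b in pairings(sorted(family)):
    print(a, "vs", b)    # only Komodo-vs-pool games remain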
My tester only manages round robins, but I could easily fix that. With round robin testing, imagine that I am testing Komodo against 9 other opponents. That means 90% of the testing time is spent on those 9 other programs and only 10% is spent testing my own program. I have to do significantly more work to get the same statistical confidence.
Why would you even consider doing that? I only test my program against the others. If you want a better calibration of the gauntlet players, you can play that RR exactly once and save the PGN. Then each time you play a new version against the gauntlet, you can use two readpgn commands in BayesElo to read the RR data plus your new data and get the same accuracy, and never have to run those other games again. And in fact, the RR data is not critical to determining which of your two versions is better; it only refines the final Elo values, but the spread should be the same.
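Something like the following should reproduce that setup from a script. The bayeselo command names (readpgn, elo, mm, ratings, x) are quoted from memory of Rémi's tool, and the binary name and file names are only examples, so check them against your own copy before relying on this.

Code:

import subprocess

# Feed BayesElo the saved round-robin calibration games plus only the newly played gauntlet.
commands = "\n".join([
    "readpgn rr_calibration.pgn",        # the one-time round robin among the gauntlet engines
    "readpgn new_Aprime_gauntlet.pgn",   # the games the new version just played
    "elo",                               # enter the rating sub-prompt
    "mm",                                # fit the ratings
    "ratings",                           # print the rating list with error bars
    "x",                                 # leave the rating sub-prompt
    "x",                                 # quit the program
]) + "\n"

result = subprocess.run(["bayeselo"], input=commands, text=True, capture_output=True)
print(result.stdout)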