Uri Blass wrote: ↑Sat Mar 11, 2023 9:32 am
1) I disagree about the information I need.
I think I also need some unbiased estimate of the value of a change, in order to have more knowledge.
You'll have to run your own tests to get the information you desire...
Not everything is about gaining Elo as fast as possible.
SF development is about improving the engine as well as possible with the finite resources that are available. The available resources may be significant, but they are finite.
I'm sure resource usage is not optimal, but the use of SPRT is not the problem.
I prefer to get less Elo and better understanding, because better understanding may help us make better decisions about what to test later.
There is no way a human can "understand" why, say, a parameter tweak improves play, other than just accepting the test results which show it.
There are patches that deserve special attention, and those usually get special attention.
2) I am always afraid there may be errors that are not related to statistical noise (for example, suppose that while stockfishA is being tested against stockfishB, in some game stockfishA runs 10 times slower in nodes per second than normal because a different process is running on the same computer at the same time; then the results are not reliable).
But those things are supposed to be averaged out by fishtest. Whether fishtest sufficiently randomizes things to average out such noise I do not know for sure, because I did not look into the fishtest code, but that is where one should look for this. It is much better than redoing all the tests and then still wondering whether there may have been some noise, redoing the tests again, still wondering about noise: an infinite loop.
Or do you think your "fixed number of games" are immune to noise?
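One reason such machine-speed noise tends to add variance rather than bias: a fishtest worker runs both engines of a game on the same machine, so a slowdown hits both sides at once and, to first order, cancels. A toy simulation to illustrate (logistic win model, draws omitted, slowdown figures invented):

```python
import random

def win(elo_edge):
    """Simulated win for side A under the logistic Elo model."""
    return random.random() < 1.0 / (1.0 + 10.0 ** (-elo_edge / 400.0))

def average_score(games, shared_slowdown):
    """Average score of A vs B across machines with random slowdowns.

    If shared_slowdown is True, both engines suffer the same penalty
    (as when a worker plays both sides of a game), so it cancels.
    If False, only A suffers, which biases the result."""
    score = 0
    for _ in range(games):
        penalty = random.uniform(0, 50)  # hypothetical slowdown, in Elo
        a = -penalty
        b = -penalty if shared_slowdown else 0.0
        score += win(a - b)
    return score / games

random.seed(0)
print(average_score(20000, True))   # ~0.50: noise widens variance only
print(average_score(20000, False))  # ~0.46: one-sided noise is a bias
```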
I think it is possible to detect this type of problem automatically, but I do not know whether it is done.
Fishtest does check for machines that produce results that deviate too much from the norm and purges those games.
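I do not know the exact rule fishtest applies, but the idea is some residual check along these lines (a hypothetical sketch, not fishtest's code; the real check works on game-pair statistics):

```python
import math

def purge_outliers(workers, z_cutoff=3.0):
    """Flag machines whose results deviate too far from the pool.

    workers: list of (games, avg_score) per machine, where avg_score
    counts a win as 1, a draw as 0.5 and a loss as 0.
    z_cutoff: invented threshold; the real cutoff will differ."""
    total = sum(g for g, _ in workers)
    mean = sum(g * s for g, s in workers) / total
    # Crude pooled per-game variance; pentanomial game-pair statistics
    # would give a sharper estimate.
    var = sum(g * (s - mean) ** 2 for g, s in workers) / total
    kept, purged = [], []
    for g, s in workers:
        se = math.sqrt(var / g) if var > 0 else 0.0
        z = abs(s - mean) / se if se > 0 else 0.0
        (purged if z > z_cutoff else kept).append((g, s))
    return kept, purged
```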
3) It will always be the case that some of the patches are not good, but if we use a fixed number of games then we can get a better estimate of the percentage of good patches (out of accepted patches) and of the value of every patch.
And it will always be far more efficient to just tighten the SPRT error margin to whatever you are comfortable with.
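To make the error margin concrete: an SPRT accumulates a log-likelihood ratio and stops at bounds that depend only on the false-positive rate alpha and false-negative rate beta you choose. A minimal win/loss sketch (fishtest really uses a pentanomial model over game pairs; this only shows the mechanics):

```python
import math, random

def expected_score(elo):
    """Expected score for an Elo advantage under the logistic model."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

def sprt(elo0, elo1, alpha, beta, true_elo, max_games=10**6):
    """Run one simulated SPRT of H0: elo = elo0 vs H1: elo = elo1."""
    lower = math.log(beta / (1.0 - alpha))    # accept H0 at or below
    upper = math.log((1.0 - beta) / alpha)    # accept H1 at or above
    p0, p1 = expected_score(elo0), expected_score(elo1)
    p = expected_score(true_elo)              # the unknown truth
    llr, games = 0.0, 0
    while lower < llr < upper and games < max_games:
        games += 1
        if random.random() < p:               # new version wins
            llr += math.log(p1 / p0)
        else:                                 # new version loses
            llr += math.log((1.0 - p1) / (1.0 - p0))
    return ("H1" if llr >= upper else "H0"), games
```

Tightening alpha = beta from 0.05 to 0.01 moves the bounds from about ±2.94 to about ±4.60, so the test simply runs longer: you buy exactly as much confidence as you are willing to pay games for.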
What are you going to do if your fixed number of games doesn't clearly confirm the SPRT result? Right, you will rerun things. So the number of games will not be "fixed" at all. And when you're finally done, you will worry about noise and redo everything from scratch, and so on.
A patch applied to SF is never final. Almost everything gets revisited regularly. Patches that in reality lower Elo but somehow made it into SF will eventually be overwritten.
The system today will always find "improvements" even if there is no improvement and every patch reduces the level of the engine by 0.1 Elo, because after enough patches that reduce the rating by 0.1 Elo and fail SPRT, one test may be lucky enough to pass SPRT.
Once SF has reached the theoretical max Elo, then every patch will only lose Elo. But after some of these patches have been applied, there will be room for improvement again. So nothing to worry about. We'll just have to accept that a ceiling will be reached eventually. Maybe this year, maybe in 2187.
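As for the rate at which such lucky passes happen: it is bounded by alpha by construction, and you can estimate it by reusing the sprt() sketch above (hypotheses [0, 5] are chosen only to keep this toy run fast; actual fishtest bounds are much tighter, and the conclusion is the same):

```python
# Estimate how often a patch that actually loses 0.1 Elo slips past an
# SPRT with hypotheses [0, 5] and alpha = beta = 0.05; reuses sprt()
# and the imports from the sketch above.
random.seed(1)
trials = 100
passes = sum(sprt(0, 5, 0.05, 0.05, true_elo=-0.1)[0] == "H1"
             for _ in range(trials))
print(f"{passes}/{trials} false positives")
# The rate stays at or below alpha, so after enough 0.1-Elo regressions
# are submitted, one is eventually expected to pass, which is exactly
# why accepted patches keep being revisited.
```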