zullil wrote:
I understand that without a large number of games one cannot decide if a patch is positive, negative or neutral. (And part of me is thinking: "so what, no lives will be lost if a bad change is made. Life is short. Take risks. Have fun!") Of course, anyone is free to fork Stockfish and do whatever he wants, free from the constraints imposed by the fishtest protocols.
Quite so. But it takes a lot of resources to do this sort of development. Why else are there so many Stockfish clones that keep rebasing themselves to current master?
It's perhaps no surprise that the creative forks, like sting, focus on test position solving, since that doesn't require so much raw CPU time. Just keep running through your EPD collection.
Komodo uses a lot of self-play testing (and I bet testing against other major engines too) - Mark spoke about this during TCEC and about the large financial investment in hardware to be able to keep up with fishtest. IIRC they run longer TC games, so they need even *more* hardware.
zullil wrote:
I just find myself wondering more and more if the current testing constraints almost preclude improving aspects of Stockfish's play that involve positional play and "long-term planning". Maybe Stockfish has now developed to the point where a new testing protocol is needed, in order for the engine to reach its true potential?
It depends on your development model. I don't think the importance of having a quantifiable criterion for patch testing and acceptance in a multi-developer project like SF should be underestimated. Fishcooking can get crabby, for sure, but it would be 100 times worse if the only way to decide which patch went in was via arguments over individual positions, games, etc.
At the same time I don't think you are going to get a paradigm shift via the SF model. At least not by a series of incremental changes. Probably in the future someone who has worked alone will present a mega-patch with solid local test results to back it up (after all, if you did achieve a paradigm shift it would be worth many Elo, so you would demonstrate 99% LOS quite quickly).
And the good thing is, if the patch truly did increase SF's strength, it would pass the framework test and be accepted.
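For anyone who wants to check the "99% LOS" point on their own local results, here is a rough sketch of the usual normal-approximation formula for likelihood of superiority, plus the standard logistic Elo estimate from a match score. The function names are mine, not anything from fishtest, and fishtest itself decides patches with a sequential test rather than a plain LOS threshold, so treat this as a back-of-envelope tool only.

```python
import math


def los(wins: int, losses: int) -> float:
    """Likelihood of superiority from decisive games only.

    Uses the common normal approximation; draws carry no information
    about which side is stronger, so they are ignored here.
    """
    decisive = wins + losses
    if decisive == 0:
        return 0.5  # no evidence either way
    z = (wins - losses) / math.sqrt(2.0 * decisive)
    return 0.5 * (1.0 + math.erf(z))


def elo_diff(wins: int, losses: int, draws: int) -> float:
    """Elo difference implied by the match score under the logistic model."""
    games = wins + losses + draws
    if games == 0:
        raise ValueError("no games played")
    score = (wins + 0.5 * draws) / games
    if score <= 0.0 or score >= 1.0:
        raise ValueError("Elo difference is unbounded for a 0% or 100% score")
    return -400.0 * math.log10(1.0 / score - 1.0)


if __name__ == "__main__":
    # Made-up numbers, purely for illustration.
    print(f"LOS: {los(60, 40):.3f}")
    print(f"Elo: {elo_diff(60, 40, 0):+.1f}")
```

With the invented 60 wins to 40 losses above, LOS already comes out at about 0.977. A small gain needs far more decisive games to reach the same confidence, which is why a genuinely large improvement would show up quickly.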