AlexChess wrote (Sun Aug 22, 2021 6:24 pm):
> Stockfish's criteria for testing progress are often criticized. For example: in the first phase, 1 second to verify a potential improvement is ridiculous. Strategic understanding of some positions is still very weak in all top programs. Playing against the previous version gives only a partial view. A GM could judge an engine much better, playing more games at 40/2h. The contribution of the largest panel of engines is very important, even if they are only a simple fine-tuning of the same source code.
>
> But I really don't like to interact with you... although I still love Ethereal and I really cannot ignore it.
>
> Kind regards, Alex

I don't think Stockfish's way of testing is bad in any way. The very fact that it works, and that other methods do not work as well, is proof enough. But you can also look at it this way:
- Stockfish patches are really tiny: changing a tuning constant and testing that constant alone, a new network that is stronger by 3 Elo (or weaker by 0.5 Elo), a small detail in extensions, and so on. These changes produce no meaningful difference in most positions. A patch might flip the two best moves in some position (assuming the same number of nodes searched), but both moves will still be very good; in some cases the new best move will actually be worse. And since there is no all-knowing oracle in chess, we simply don't know whether the new best move is better or worse. A GM sitting for hours in front of Stockfish trying to spot a 0.9 Elo improvement will just waste those hours, only to find that the engine gives the same best moves, or that all the moves are still good. Add the fact that a different hash size can change the moves, and that threads introduce randomness that changes them as well, and there is simply no way for anyone to spot the difference by eye. The first sketch after this list puts rough numbers on how many games it takes to see such a gain at all.
- For eval patches, any positive change makes the engine play better positionally in general: the eval just assigns a score to the position at the end of the PV, no matter how long the PV is. For search patches you would obviously want the engine to search as deep as possible. However, it takes milliseconds to reach depth 9/10, and at that point far more nodes have been cut than kept, so there is already a large sample of pruning and extension decisions on which to test a search change. In some positions the patch makes no difference; in some positions it may even turn out to be worse at depth 40. But more often than not, testing many, many different positions simply gives better answers than doing a few depth-40 searches. If we had a way to flag "problematic" positions, the few that might become worse at depth 40, we could search just those to depth 40 while leaving the rest to fast play, and the tests would be even more informative. However, I can't even imagine how something like that could be built, and I would argue it is impossible. The second sketch below shows how a verdict is actually reached from many fast games.
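To make the first bullet concrete, here is a back-of-the-envelope sketch (my own illustration, nothing taken from fishtest) of roughly how many games are needed before a small Elo gain rises above statistical noise. The `sigma_per_game` value is a worst-case assumption: real matches have many draws, which shrink it, but not by orders of magnitude.

```python
import math

def expected_score(elo_diff):
    """Expected score per game against an opponent weaker by elo_diff."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def games_needed(elo_diff, sigma_per_game=0.5, z=1.96):
    """Games needed for the score edge to exceed z standard errors.

    sigma_per_game = 0.5 is the no-draw worst case; the heavy draw
    rates of engine matches lower it, but the scale is the same.
    """
    edge = expected_score(elo_diff) - 0.5
    return math.ceil((z * sigma_per_game / edge) ** 2)

print(games_needed(0.9))  # a 0.9 Elo patch: hundreds of thousands of games
print(games_needed(3.0))  # a 3 Elo network: tens of thousands of games
```

This is why a GM watching a handful of 40/2h games has no realistic chance of judging such a patch, while tens of thousands of 1-second games do.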
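And fishtest does not even fix the number of games in advance: it runs a sequential probability ratio test (SPRT) and stops as soon as the result is statistically conclusive either way. Below is a minimal sketch of the usual normal-approximation log-likelihood ratio over a win/draw/loss record; the formula is the standard textbook approximation, not fishtest's exact implementation, and the Elo hypotheses and numbers are illustrative.

```python
import math

def elo_to_score(elo):
    """Expected per-game score for an Elo advantage of `elo`."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

def sprt_llr(wins, draws, losses, elo0=0.0, elo1=2.0):
    """Log-likelihood ratio of H1 (gain >= elo1 Elo) vs H0 (gain <= elo0),
    using a normal approximation over per-game scores {1, 0.5, 0}."""
    n = wins + draws + losses
    if n == 0:
        return 0.0
    mean = (wins + 0.5 * draws) / n
    var = (wins + 0.25 * draws) / n - mean * mean
    if var <= 0:
        return 0.0
    s0, s1 = elo_to_score(elo0), elo_to_score(elo1)
    return (s1 - s0) * (2.0 * mean - s0 - s1) * n / (2.0 * var)

# With alpha = beta = 0.05, accept H1 once llr > log(19) ~ 2.94,
# reject H1 once llr < -2.94; otherwise keep playing games.
print(sprt_llr(wins=18000, draws=25000, losses=17000))
```

A clearly bad patch is rejected after relatively few games, while a borderline one keeps accumulating games, which is exactly the behaviour you want when hunting for 1 Elo improvements.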